Python for Web Scraping – Using BeautifulSoup and Scrapy

Web scraping is one of the most powerful use cases of Python. From extracting product prices to collecting job listings or research data, Python makes scraping simple and efficient.

In this blog, we’ll cover:

  • What is Web Scraping?

  • Using BeautifulSoup (Beginner-Friendly)

  • Using Scrapy (Advanced & Scalable)

  • Comparison

  • Best Practices


What is Web Scraping?

Web scraping is the process of extracting data from websites automatically using code instead of manually copying data. Python is popular for scraping because of powerful libraries like:

  • requests – for fetching web pages

  • BeautifulSoup – for parsing HTML

  • Scrapy – for large-scale scraping

Web Scraping Using BeautifulSoup

BeautifulSoup is provided by the bs4 package and is ideal for small to medium scraping projects.

Installation

pip install requests beautifulsoup4

Basic Example – Extract Titles from a Website

import requests
from bs4 import BeautifulSoup

url = "https://example.com"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract all <h2> tags
titles = soup.find_all("h2")

for title in titles:
    print(title.text)

How It Works

  • requests.get() → Fetches webpage content

  • BeautifulSoup() → Parses HTML

  • find_all() → Finds specific tags

  • .text → Extracts text content

Extract Data Using CSS Selectors

products = soup.select(".product-name")

for product in products:
    print(product.get_text())
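Selectors are easy to try without a live site: BeautifulSoup accepts an HTML string directly. A quick sketch (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# A small, made-up snippet to test selectors against
html_doc = """
<ul>
  <li><a class="product-name" href="/widget">Widget</a></li>
  <li><a class="product-name" href="/gadget">Gadget</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes any CSS selector; attributes are read like dict keys
for link in soup.select("a.product-name"):
    print(link.get_text(), "->", link["href"])
```

This prints each product name alongside its link, which is handy when you need both the text and an attribute from the same element.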


When to Use BeautifulSoup?

✔ Small projects
✔ One-page scraping
✔ Learning purposes
✔ Quick scripts

Web Scraping Using Scrapy

Scrapy is a powerful web scraping framework designed for large-scale projects. It handles:

  • Request scheduling

  • Data pipelines

  • Built-in concurrency

  • Automatic retries

  • Exporting data (JSON, CSV)

Installation

pip install scrapy

Create a Scrapy Project

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

Example Spider

Inside spiders/example.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        titles = response.css("h2::text").getall()

        for title in titles:
            yield {"title": title}

Run the spider:

scrapy crawl example -o output.json

Why Is Scrapy Powerful?

  • Handles multiple pages automatically

  • Faster due to asynchronous requests

  • Built-in data export

  • Scalable architecture

  • Middleware & pipelines


BeautifulSoup vs Scrapy – Comparison

Feature             BeautifulSoup      Scrapy
Learning Curve      Easy               Moderate
Best For            Small projects     Large-scale scraping
Speed               Slower             Faster (async)
Built-in Tools      Minimal            Many
Project Structure   Script-based       Framework-based

Best Practices

Before scraping any website:

✔ Check robots.txt
✔ Read terms & conditions
✔ Avoid overloading servers
✔ Use delays between requests
✔ Never scrape sensitive/private data

Example – add a delay between requests:

import time

time.sleep(2)
Additional tips:

  • Use User-Agent headers

  • Rotate proxies for large-scale scraping

  • Handle pagination

  • Store data in a database

  • Use XPath for complex extraction
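For the XPath tip, the lxml library gives you XPath queries directly. One query per product keeps related fields correctly paired (the product markup here is made up for illustration):

```python
from lxml import html

# Invented snippet: each product groups a name and a price
doc = html.fromstring("""
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">$5</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$9</span></div>
</body></html>
""")

# Iterate product containers, then extract fields relative to each one
for product in doc.xpath("//div[@class='product']"):
    name = product.xpath(".//span[@class='name']/text()")[0]
    price = product.xpath(".//span[@class='price']/text()")[0]
    print(name, price)
```

Relative paths (starting with ".") scope each lookup to the current product, which avoids mismatching names with prices from other items.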

Example with headers:

headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)
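Headers, retries, and connection reuse can also be combined in one requests.Session; here is a sketch using urllib3's Retry helper (the retry counts and status codes are arbitrary choices):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry transient failures with exponential backoff
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

# Every request made through this session sends the same headers
session.headers.update({"User-Agent": "Mozilla/5.0"})

# response = session.get(url)  # reuse the session for all pages
```

A session also reuses the underlying TCP connection, which is faster and gentler on the server than opening a new connection per request.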

Real-World Use Cases

  • Price monitoring

  • Job listing aggregation

  • Market research

  • SEO analysis

  • News collection

  • Competitor analysis

