Python for Web Scraping – Using BeautifulSoup and Scrapy

Web scraping is one of the most powerful use cases of Python. From extracting product prices to collecting job listings or research data, Python makes scraping simple and efficient.

In this blog, we’ll cover:

  • What is Web Scraping?

  • Using BeautifulSoup (Beginner-Friendly)

  • Using Scrapy (Advanced & Scalable)

  • Comparison

  • Best Practices


What is Web Scraping?

Web scraping is the process of extracting data from websites automatically using code instead of manually copying data. Python is popular for scraping because of powerful libraries like:

  • requests – for fetching web pages

  • BeautifulSoup – for parsing HTML

  • Scrapy – for large-scale scraping

Web Scraping Using BeautifulSoup

BeautifulSoup is provided by the bs4 package and is ideal for small to medium scraping projects.

Installation

pip install requests beautifulsoup4

Basic Example – Extract Titles from a Website

import requests
from bs4 import BeautifulSoup

url = "https://example.com"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract all <h2> tags
titles = soup.find_all("h2")

for title in titles:
    print(title.text)

How It Works

  • requests.get() → Fetches webpage content

  • BeautifulSoup() → Parses HTML

  • find_all() → Finds specific tags

  • .text → Extracts text content

Extract Data Using CSS Selectors

products = soup.select(".product-name")

for product in products:
    print(product.get_text())
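Selectors are easy to try without a live site: BeautifulSoup accepts an HTML string directly. A quick sketch (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

# A small, made-up snippet to test selectors against
html_doc = """
<ul>
  <li><a class="product-name" href="/widget">Widget</a></li>
  <li><a class="product-name" href="/gadget">Gadget</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes any CSS selector; attributes are read like dict keys
for link in soup.select("a.product-name"):
    print(link.get_text(), "->", link["href"])
```

This prints each product name alongside its link, which is handy when you need both the text and an attribute from the same element.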


When to Use BeautifulSoup?

✔ Small projects
✔ One-page scraping
✔ Learning purposes
✔ Quick scripts

Web Scraping Using Scrapy

Scrapy is a powerful web scraping framework designed for large-scale projects. It handles:

  • Request scheduling

  • Data pipelines

  • Built-in concurrency

  • Automatic retries

  • Exporting data (JSON, CSV)

Installation

pip install scrapy

Create a Scrapy Project

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

Example Spider

Inside spiders/example.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        titles = response.css("h2::text").getall()

        for title in titles:
            yield {"title": title}

Run the spider:

scrapy crawl example -o output.json

Why Is Scrapy Powerful?

  • Handles multiple pages automatically

  • Faster due to asynchronous requests

  • Built-in data export

  • Scalable architecture

  • Middleware & pipelines


BeautifulSoup vs Scrapy – Comparison

Feature             BeautifulSoup      Scrapy
Learning Curve      Easy               Moderate
Best For            Small projects     Large-scale scraping
Speed               Slower             Faster (async)
Built-in Tools      Minimal            Many
Project Structure   Script-based       Framework-based

Best Practices

Before scraping any website:

✔ Check robots.txt
✔ Read terms & conditions
✔ Avoid overloading servers
✔ Use delays between requests
✔ Never scrape sensitive/private data

Example – add a delay between requests:

import time

time.sleep(2)
Additional tips:

  • Use User-Agent headers

  • Rotate proxies for large-scale scraping

  • Handle pagination

  • Store data in a database

  • Use XPath for complex extraction
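For the XPath tip, the lxml library gives you XPath queries directly. One query per product keeps related fields correctly paired (the product markup here is made up for illustration):

```python
from lxml import html

# Invented snippet: each product groups a name and a price
doc = html.fromstring("""
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">$5</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">$9</span></div>
</body></html>
""")

# Iterate product containers, then extract fields relative to each one
for product in doc.xpath("//div[@class='product']"):
    name = product.xpath(".//span[@class='name']/text()")[0]
    price = product.xpath(".//span[@class='price']/text()")[0]
    print(name, price)
```

Relative paths (starting with ".") scope each lookup to the current product, which avoids mismatching names with prices from other items.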

Example with headers:

headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)
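Headers, retries, and connection reuse can also be combined in one requests.Session; here is a sketch using urllib3's Retry helper (the retry counts and status codes are arbitrary choices):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry transient failures with exponential backoff
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

# Every request made through this session sends the same headers
session.headers.update({"User-Agent": "Mozilla/5.0"})

# response = session.get(url)  # reuse the session for all pages
```

A session also reuses the underlying TCP connection, which is faster and gentler on the server than opening a new connection per request.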

Real-World Use Cases

  • Price monitoring

  • Job listing aggregation

  • Market research

  • SEO analysis

  • News collection

  • Competitor analysis

