Python for Web Scraping – Using BeautifulSoup and Scrapy
Web scraping is one of the most powerful use cases of Python. From extracting product prices to collecting job listings or research data, Python makes scraping simple and efficient.
In this blog, we'll cover:
- What is Web Scraping?
- Using BeautifulSoup (Beginner-Friendly)
- Using Scrapy (Advanced & Scalable)
- Comparison
- Best Practices
What is Web Scraping?
Web scraping is the process of extracting data from websites automatically using code instead of manually copying data. Python is popular for scraping because of powerful libraries like:
- requests – for fetching web pages
- BeautifulSoup – for parsing HTML
- Scrapy – for large-scale scraping
Web Scraping Using BeautifulSoup
BeautifulSoup is part of the bs4 library and is ideal for small to medium scraping projects.
Installation
pip install requests beautifulsoup4
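A quick way to confirm the install worked is to parse a small in-memory HTML snippet, so no network request is needed (the HTML string below is just a made-up example):

```python
from bs4 import BeautifulSoup

# A tiny in-memory page instead of a real website
html = "<html><body><h2>First</h2><h2>Second</h2></body></html>"

soup = BeautifulSoup(html, "html.parser")
headings = [h2.text for h2 in soup.find_all("h2")]
print(headings)  # ['First', 'Second']
```

If this prints the two headings, BeautifulSoup is installed correctly and you can move on to fetching real pages.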
Basic Example – Extract Titles from a Website
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Extract all <h2> tags
titles = soup.find_all("h2")
for title in titles:
    print(title.text)

How It Works
- requests.get() → fetches the webpage content
- BeautifulSoup() → parses the HTML
- find_all() → finds specific tags
- .text → extracts the text content

Extract Data Using CSS Selectors
products = soup.select(".product-name")
for product in products:
    print(product.get_text())

When to Use BeautifulSoup?
✔ Small projects
✔ One-page scraping
✔ Learning purposes
✔ Quick scripts

Web Scraping Using Scrapy
Scrapy is a powerful web scraping framework designed for large-scale projects. It handles:
- Request scheduling
- Data pipelines
- Built-in concurrency
- Automatic retries
- Exporting data (JSON, CSV)
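Most of these features are switched on through the project's settings.py. The sketch below shows how that typically looks; the values are illustrative examples, not recommendations:

```python
# myproject/settings.py (illustrative values)
BOT_NAME = "myproject"

# Built-in concurrency: number of simultaneous requests
CONCURRENT_REQUESTS = 8

# Automatic retries for failed responses
RETRY_ENABLED = True
RETRY_TIMES = 2

# Be polite: pause between requests and respect robots.txt
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True

# Export scraped items to JSON without passing -o on the command line
FEEDS = {"output.json": {"format": "json"}}
```

Scrapy reads this module when the crawler starts, so you rarely need to configure anything in the spider itself.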
Installation

pip install scrapy
Create a Scrapy Project
scrapy startproject myproject
cd myproject
scrapy genspider example example.com

Example Spider
Inside spiders/example.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        titles = response.css("h2::text").getall()
        for title in titles:
            yield {"title": title}

Run the spider: scrapy crawl example -o output.json
Why Is Scrapy Powerful?
- Handles multiple pages automatically
- Faster due to asynchronous requests
- Built-in data export
- Scalable architecture
- Middleware & pipelines
Comparison

Feature             BeautifulSoup     Scrapy
Learning Curve      Easy              Moderate
Best For            Small projects    Large-scale scraping
Speed               Slower            Faster (async)
Built-in Tools      Minimal           Many
Project Structure   Script-based      Framework-based
Best Practices
Before scraping any website:
✔ Check robots.txt
✔ Read terms & conditions
✔ Avoid overloading servers
✔ Use delays between requests
✔ Never scrape sensitive/private data

Example:
import time
time.sleep(2)
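The standard library can check robots.txt rules for you. This sketch parses an example robots.txt from a string so it stays self-contained; a real script would point RobotFileParser at the site's /robots.txt URL with set_url() and read() instead (the rules and URLs below are made up):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; normally fetched with rp.set_url(...) + rp.read()
rules = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Ask before fetching a URL
print(rp.can_fetch("MyBot", "https://example.com/products"))      # True
print(rp.can_fetch("MyBot", "https://example.com/private/data"))  # False
```

Calling can_fetch() before every request is a simple way to stay within a site's stated rules.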
- Use User-Agent headers
- Rotate proxies for large scraping jobs
- Handle pagination
- Store data in a database
- Use XPath for complex extraction
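Handling pagination usually means looping until the parser stops finding a "next" link. This sketch runs that loop over in-memory HTML pages so it stays self-contained; in a real script each page would come from requests.get(next_url).text, and the class names and URLs here are made up:

```python
from bs4 import BeautifulSoup

# Stand-ins for fetched pages; a real scraper would download these
pages = {
    "/items?page=1": '<a class="item">A</a><a class="item">B</a>'
                     '<a class="next" href="/items?page=2">next</a>',
    "/items?page=2": '<a class="item">C</a>',  # no "next" link: last page
}

items, url = [], "/items?page=1"
while url:
    soup = BeautifulSoup(pages[url], "html.parser")
    items += [a.get_text() for a in soup.select("a.item")]
    next_link = soup.select_one("a.next")      # None on the last page
    url = next_link["href"] if next_link else None

print(items)  # ['A', 'B', 'C']
```

The same follow-the-next-link pattern is what Scrapy's response.follow() automates for you.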
Example with headers:
headers = {
"User-Agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)

Real-World Use Cases
- Price monitoring
- Job listing aggregation
- Market research
- SEO analysis
- News collection
- Competitor analysis