Data Handling with Pandas
Data Handling with Pandas – Reading, Analyzing, and Cleaning Data
Data is everywhere, from CSV files to online databases, and handling it efficiently is a core skill for any Python developer or data enthusiast.
Pandas is a powerful Python library that makes data handling fast, intuitive, and effective. In this guide, you will learn how to read, analyze, clean, transform, and visualize data with it.
Let’s dive in!
What is Pandas?
Pandas is a Python library for data manipulation and analysis. It provides two primary data structures:
- Series – 1D labeled array
- DataFrame – 2D labeled table (like Excel)
It makes reading, filtering, grouping, and cleaning data straightforward, which is a large part of why Python is so widely used for data analysis.
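To make the two structures concrete, here is a minimal sketch that builds one of each by hand. The product names and sales figures are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# A Series is a one-dimensional array whose values carry labels (the index).
# Values and labels below are illustrative only.
sales = pd.Series([1000, 500, 750], index=["Laptop", "Phone", "Tablet"], name="Sales")
print(sales["Phone"])       # look up a value by its label

# A DataFrame is a table whose columns are Series that share one index.
df = pd.DataFrame(
    {
        "Product": ["Laptop", "Phone", "Tablet"],
        "Sales": [1000, 500, 750],
    }
)
print(df.shape)             # (number of rows, number of columns)
print(type(df["Sales"]))    # selecting a single column gives back a Series
```

Selecting one column of a DataFrame returns a Series, which is why most Series methods can be used column by column.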
Part 1: Reading Data with Pandas
Pandas can read multiple formats:
- CSV → pd.read_csv()
- Excel → pd.read_excel()
- JSON → pd.read_json()
Example: Reading a CSV File
import pandas as pd
# Read CSV file
df = pd.read_csv("data/sales_data.csv")
# View first 5 rows
print(df.head())
Output Example:
| Date | Product | Sales | Region |
|---|---|---|---|
| 2026-01-01 | Laptop | 1000 | North |
| 2026-01-02 | Phone | 500 | South |
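If you do not have a sales_data.csv file at hand, the same call can be tried against an in-memory string. This is only a sketch: the two rows mirror the sample table above, and using io.StringIO as a stand-in for the file, along with the parse_dates option, is an assumption made for the example.

```python
import io
import pandas as pd

# Stand-in for data/sales_data.csv; the rows are illustrative only.
csv_text = """Date,Product,Sales,Region
2026-01-01,Laptop,1000,North
2026-01-02,Phone,500,South
"""

# read_csv accepts a file path or any file-like object.
# parse_dates converts the Date column to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.head())
print(df.dtypes)
```

Parsing dates at read time avoids a separate conversion step later, though converting afterwards with pd.to_datetime works just as well.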
Part 2: Analyzing Data
Pandas provides many tools for exploring and understanding data.
1️⃣ Basic Info
df.info() # Column info & non-null count
df.describe() # Summary statistics
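As a rough illustration of what these calls report, the sketch below runs them on a small hand-made DataFrame with one missing value. The data is invented for the example, and value_counts() is an extra call not used in the snippet above.

```python
import pandas as pd

# Small illustrative dataset with one missing Sales value.
df = pd.DataFrame(
    {
        "Product": ["Laptop", "Phone", "Tablet", "Phone"],
        "Sales": [1000.0, 500.0, None, 700.0],
        "Region": ["North", "South", "North", "East"],
    }
)

df.info()                            # prints dtypes and non-null counts; returns None

stats = df.describe()                # numeric columns only by default
print(stats)

print(df.describe(include="all"))    # also summarizes the text columns

region_counts = df["Region"].value_counts()   # how often each value occurs
print(region_counts)
```

Note that describe() skips missing values, so its count row is a quick way to spot columns with gaps.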
2️⃣ Selecting Columns and Rows
# Select column
print(df['Product'])
# Select multiple columns
print(df[['Product', 'Sales']])
# Select rows by index
print(df.iloc[0:5])
# Filter rows
print(df[df['Sales'] > 500])
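Filters are often combined in practice. The following sketch, on invented data, shows three common patterns: joining conditions with &, filtering rows and picking columns in one step with .loc, and matching several values with isin().

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {
        "Product": ["Laptop", "Phone", "Tablet", "Phone"],
        "Sales": [1000, 500, 750, 700],
        "Region": ["North", "South", "North", "East"],
    }
)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
north_big = df[(df["Sales"] > 500) & (df["Region"] == "North")]

# .loc selects by label and can filter rows and choose columns at once.
phones = df.loc[df["Product"] == "Phone", ["Product", "Sales"]]

# isin() matches against a list of allowed values.
north_or_east = df[df["Region"].isin(["North", "East"])]

print(north_big)
print(phones)
print(north_or_east)
```

Python's plain `and` / `or` do not work element-wise on Series, which is why the `&` and `|` operators are used here.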
3️⃣ Sorting Data
# Sort by Sales descending
df_sorted = df.sort_values(by='Sales', ascending=False)
print(df_sorted.head())
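Sorting can also use more than one column. Here is a small sketch on made-up data that sorts by region and then by sales within each region; nlargest() is shown as a possible shortcut for "top N" questions.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {
        "Product": ["Laptop", "Phone", "Tablet", "Phone"],
        "Sales": [1000, 500, 750, 700],
        "Region": ["North", "South", "North", "East"],
    }
)

# Region ascending, then Sales descending within each region.
df_sorted = df.sort_values(by=["Region", "Sales"], ascending=[True, False])

# Sorting keeps the old index labels; reset them for a clean 0..n-1 index.
df_sorted = df_sorted.reset_index(drop=True)

# Shortcut for the two rows with the highest Sales.
top2 = df.nlargest(2, "Sales")

print(df_sorted)
print(top2)
```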
4️⃣ Grouping Data
# Total sales by region
region_sales = df.groupby('Region')['Sales'].sum()
print(region_sales)
Part 3: Cleaning Data
Real-world data is messy. Pandas provides tools to clean it.
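As a preview, the cleaning steps covered in this part can be chained into one short pass. This is a sketch under stated assumptions: the raw data is invented, the order of the steps is one reasonable choice among several, and each step returns a new DataFrame instead of modifying the original in place.

```python
import pandas as pd

# Deliberately messy, illustrative data: a duplicated row,
# missing Sales values, and a missing Product.
raw = pd.DataFrame(
    {
        "Date": ["2026-01-01", "2026-01-02", "2026-01-02", "2026-01-03"],
        "Product": ["Laptop", "Phone", "Phone", None],
        "Sales": [1000, None, None, 300],
    }
)

print(raw.isnull().sum())                    # missing values per column

clean = raw.drop_duplicates()                # remove exact duplicate rows
clean = clean.dropna(subset=["Product"])     # a row without a product is unusable
clean = clean.assign(
    Sales=clean["Sales"].fillna(0).astype(float),   # treat missing sales as 0
    Date=pd.to_datetime(clean["Date"]),             # text -> datetime
)

print(clean)
print(clean.dtypes)
```

Whether a missing sales value should become 0, be dropped, or be estimated depends on the dataset; filling with 0 here simply follows the example used in this guide.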
1️⃣ Handling Missing Values
# Check for missing data
print(df.isnull().sum())
# Fill missing values
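# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Sales'] = df['Sales'].fillna(0)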
# Drop rows with missing values
df.dropna(inplace=True)
2️⃣ Removing Duplicates
df.drop_duplicates(inplace=True)
3️⃣ Renaming Columns
# rename() returns a new DataFrame here, so df keeps its 'Sales' column for the snippets below
df_renamed = df.rename(columns={'Sales': 'Revenue'})
4️⃣ Changing Data Types
df['Date'] = pd.to_datetime(df['Date'])
df['Sales'] = df['Sales'].astype(float)
Part 4: Advanced Data Handling
Apply Functions
df['Revenue'] = df['Sales'].apply(lambda x: x * 1.1) # Add 10% tax
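For simple arithmetic like this, the same result can be had without apply() by operating on the whole column at once. The sketch below compares the two on invented numbers; the "Tier" column and its 750 threshold are hypothetical and only there to show a case where apply() with custom logic is a natural fit.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Sales": [1000.0, 500.0, 750.0]})

# Element by element: the lambda is called once per row.
with_apply = df["Sales"].apply(lambda x: x * 1.1)

# Vectorized: one operation on the whole column, usually much faster on large data.
vectorized = df["Sales"] * 1.1

df["Revenue"] = vectorized

# apply() is still handy for custom per-value logic (hypothetical threshold).
df["Tier"] = df["Sales"].apply(lambda x: "high" if x >= 750 else "low")

print(df)
```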
Merging DataFrames
df_merged = pd.merge(df1, df2, on='Product', how='inner')
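The line above assumes df1 and df2 already exist. A minimal, self-contained sketch might look like the following, where both tables and the Category column are invented for illustration. It also shows a left join with indicator=True, which is a convenient way to see which rows found no match.

```python
import pandas as pd

# Two illustrative tables that share a Product column.
df1 = pd.DataFrame(
    {"Product": ["Laptop", "Phone", "Tablet"], "Sales": [1000, 500, 750]}
)
df2 = pd.DataFrame(
    {
        "Product": ["Laptop", "Phone", "Monitor"],
        "Category": ["Computers", "Mobile", "Displays"],
    }
)

# Inner join: keep only products present in both tables.
inner = pd.merge(df1, df2, on="Product", how="inner")

# Left join: keep every row of df1; unmatched rows get NaN in Category.
# indicator=True adds a _merge column saying where each row came from.
left = pd.merge(df1, df2, on="Product", how="left", indicator=True)

print(inner)
print(left)
```

Checking the row count before and after a merge is a cheap way to catch unexpected duplicates or dropped rows.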
Pivot Tables
pivot = df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum')
print(pivot)
Part 5: Visualization with Pandas
Pandas integrates well with Matplotlib:
import matplotlib.pyplot as plt
# Plot sales by product
df.groupby('Product')['Sales'].sum().plot(kind='bar')
plt.title("Total Sales by Product")
plt.ylabel("Sales")
plt.show()
Tips for Efficient Data Handling
- Always inspect data with head(), info(), and describe()
- Handle missing values early
- Use vectorized operations instead of loops
- Save cleaned data: df.to_csv("cleaned_data.csv", index=False)
Real-World Use Cases
- Sales & Marketing Analysis
- Financial Reporting
- Research Data Cleaning
- Machine Learning Preprocessing
- Business Intelligence Dashboards
Pandas is the backbone of Python-based data analysis. With it, you can:
- Read many common data formats, including CSV, Excel, and JSON
- Explore and summarize datasets
- Clean and preprocess data for analysis
- Prepare data for ML and BI