Data Handling with Pandas – Reading, Analyzing, and Cleaning Data


Data is everywhere, from CSV files to online databases, and handling it efficiently is a crucial skill for any Python developer or data enthusiast.

Pandas is a powerful Python library that makes data handling fast, intuitive, and effective. In this guide, you will learn:

✔ How to read data into Pandas
✔ How to analyze datasets efficiently
✔ How to clean and preprocess data for analysis

Let’s dive in!

What is Pandas?

Pandas is a Python library for data manipulation and analysis. It provides two primary data structures:

  • Series – 1D labeled array

  • DataFrame – 2D labeled table (like Excel)

It makes reading, filtering, grouping, and cleaning data straightforward, which makes Python well suited to data analysis.
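
To make the two structures concrete, here is a minimal sketch. The product names and numbers are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label in the index
prices = pd.Series([1000, 500, 300], index=["Laptop", "Phone", "Tablet"])

# A DataFrame: a table whose columns are Series that share one index
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Sales": [1000, 500, 300],
})

print(prices["Phone"])      # look up a value by its label
print(type(df["Sales"]))    # a single column is itself a Series
print(df.shape)             # (number of rows, number of columns)
```

Selecting one column from a DataFrame gives back a Series, which is why most Series methods also work on individual columns.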

Part 1: Reading Data with Pandas

Pandas can read multiple formats:

  • CSV → pd.read_csv()

  • Excel → pd.read_excel()

  • JSON → pd.read_json()
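
The readers accept file paths as well as file-like objects, so they can be tried out without any files on disk. The sketch below feeds small in-memory strings (invented sample data) to pd.read_csv() and pd.read_json(). pd.read_excel() is left out because it needs an extra engine package such as openpyxl to be installed.

```python
import io

import pandas as pd

# Illustrative sample data held in memory instead of in files
csv_text = (
    "Date,Product,Sales,Region\n"
    "2026-01-01,Laptop,1000,North\n"
    "2026-01-02,Phone,500,South\n"
)
json_text = '[{"Product": "Laptop", "Sales": 1000}, {"Product": "Phone", "Sales": 500}]'

# parse_dates converts the Date column to datetimes while reading
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# A JSON array of objects becomes one row per object
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.dtypes)
print(df_json)
```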

Example: Reading a CSV File

import pandas as pd

# Read CSV file
df = pd.read_csv("data/sales_data.csv")

# View first 5 rows
print(df.head())

Output Example:

Date          Product   Sales   Region
2026-01-01    Laptop    1000    North
2026-01-02    Phone     500     South

Part 2: Analyzing Data

Pandas provides many tools for exploring and understanding data.

1️⃣ Basic Info

df.info() # Column info & non-null count
df.describe() # Summary statistics
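
A point that is easy to miss: describe() returns a DataFrame, so individual statistics can be picked out of it. The sketch below uses a small made-up table and also shows two other quick checks, dtypes and value_counts().

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Sales": [1000, 500, 750],
    "Region": ["North", "South", "North"],
})

# describe() summarizes numeric columns by default and returns a DataFrame
summary = df.describe()
print(summary.loc["mean", "Sales"])

# dtypes shows how each column was interpreted
print(df.dtypes)

# value_counts() counts how often each value appears in a column
print(df["Region"].value_counts())
```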

2️⃣ Selecting Columns and Rows

# Select column
print(df['Product'])

# Select multiple columns
print(df[['Product', 'Sales']])

# Select the first five rows by integer position
print(df.iloc[0:5])

# Filter rows
print(df[df['Sales'] > 500])
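
Filters can also be combined. The following sketch, on made-up data, shows conditions joined with &, membership tests with isin(), and loc for picking rows and columns in one step.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Sales": [1000, 500, 750, 300],
    "Region": ["North", "South", "North", "East"],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
north_big = df[(df["Sales"] > 500) & (df["Region"] == "North")]

# isin() keeps rows whose value appears in a list
north_or_south = df[df["Region"].isin(["North", "South"])]

# loc takes a row filter and a list of column labels together
products = df.loc[df["Sales"] > 500, ["Product", "Sales"]]

print(north_big)
print(north_or_south)
print(products)
```

Python's plain and/or keywords do not work on whole columns, which is why the element-wise & and | operators are used here.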

3️⃣ Sorting Data

# Sort by Sales descending
df_sorted = df.sort_values(by='Sales', ascending=False)
print(df_sorted.head())
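
Sorting also works on several columns at once, and nlargest() is a shortcut when only the top rows are needed. A small sketch on made-up data:

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Sales": [1000, 500, 750, 300],
    "Region": ["North", "South", "North", "East"],
})

# Sort by Region A-Z, then by Sales from high to low within each region
by_region_then_sales = df.sort_values(
    by=["Region", "Sales"], ascending=[True, False]
)

# The two rows with the highest Sales
top2 = df.nlargest(2, "Sales")

print(by_region_then_sales)
print(top2)
```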

4️⃣ Grouping Data

# Total sales by region
region_sales = df.groupby('Region')['Sales'].sum()
print(region_sales)
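
When more than one statistic per group is needed, agg() can compute them together. The sketch below uses made-up data, and the output column names (total_sales, average_sale, orders) are chosen only for illustration.

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Monitor"],
    "Sales": [1000, 500, 750, 300],
    "Region": ["North", "South", "North", "East"],
})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    average_sale=("Sales", "mean"),
    orders=("Sales", "count"),
)

print(summary)
```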

Part 3: Cleaning Data

Real-world data is messy. Pandas provides tools to clean it.

1️⃣ Handling Missing Values

# Check for missing data
print(df.isnull().sum())

# Fill missing values (assign the result back to the column)
df['Sales'] = df['Sales'].fillna(0)

# Drop rows with missing values
df.dropna(inplace=True)
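
Filling with 0 and dropping every incomplete row are the bluntest options. As a sketch of a gentler approach, the example below fills a numeric gap with the column median and drops a row only when a key column is missing. The data is made up, and whether the median is an appropriate fill value depends on the dataset.

```python
import pandas as pd

# Made-up sample data with one missing Sales value and one missing Product
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", None, "Monitor"],
    "Sales": [1000.0, None, 750.0, 300.0],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Fill missing Sales with the median of the known values
filled = df.assign(Sales=df["Sales"].fillna(df["Sales"].median()))

# Drop a row only if Product is missing; other gaps are tolerated
cleaned = filled.dropna(subset=["Product"])

print(cleaned)
```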

2️⃣ Removing Duplicates

    df.drop_duplicates(inplace=True)
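
By default, drop_duplicates() removes only rows that match in every column. The sketch below, on made-up data, first counts exact duplicates and then treats rows as duplicates when they share a Date and Product, keeping the last one. Treating Date and Product as the identifying columns is an assumption made for this example.

```python
import pandas as pd

# Made-up sample data: row 1 repeats row 0, row 3 revises row 2
df = pd.DataFrame({
    "Date": ["2026-01-01", "2026-01-01", "2026-01-02", "2026-01-02"],
    "Product": ["Laptop", "Laptop", "Phone", "Phone"],
    "Sales": [1000, 1000, 500, 550],
})

# How many rows are exact copies of an earlier row?
n_dupes = df.duplicated().sum()
print(n_dupes)

# Remove exact copies only
exact = df.drop_duplicates()

# Treat rows with the same Date and Product as duplicates, keep the last
latest = df.drop_duplicates(subset=["Date", "Product"], keep="last")

print(exact)
print(latest)
```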

3️⃣ Renaming Columns

    # rename() returns a renamed copy; df keeps its 'Sales' column, which the later examples use
    df_renamed = df.rename(columns={'Sales': 'Revenue'})

4️⃣ Changing Data Types

    df['Date'] = pd.to_datetime(df['Date'])
    df['Sales'] = df['Sales'].astype(float)
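
These conversions raise an error if a value cannot be parsed. As a sketch of a more forgiving approach, errors="coerce" turns unparseable entries into missing values (NaT or NaN) so they can be handled afterwards. The messy values below are made up, and the date format is assumed to be year-month-day.

```python
import pandas as pd

# Made-up raw data where every column arrived as text
raw = pd.DataFrame({
    "Date": ["2026-01-01", "not a date", "2026-01-03"],
    "Sales": ["1000", "n/a", "500.5"],
})

# Unparseable dates become NaT instead of raising an error
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d", errors="coerce")

# Unparseable numbers become NaN instead of raising an error
raw["Sales"] = pd.to_numeric(raw["Sales"], errors="coerce")

print(raw)
print(raw.dtypes)
```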

Part 4: Advanced Data Handling

  • Apply Functions

        df['Revenue'] = df['Sales'].apply(lambda x: x * 1.1) # Add 10% tax

  • Merging DataFrames

        # df1 and df2 stand for any two DataFrames that share a 'Product' column
        df_merged = pd.merge(df1, df2, on='Product', how='inner')

  • Pivot Tables

        pivot = df.pivot_table(
            index='Region', columns='Product',
            values='Sales', aggfunc='sum'
        )
        print(pivot)
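
The merge call above is easier to follow with two concrete tables. In this sketch, sales and categories are hypothetical DataFrames invented for illustration. An inner merge keeps only products found in both tables, while a left merge keeps every row of the first table. The indicator=True option adds a _merge column showing where each row came from.

```python
import pandas as pd

# Hypothetical tables invented for this example
sales = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Sales": [1000, 500, 750],
})
categories = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Monitor"],
    "Category": ["Computers", "Mobile", "Displays"],
})

# Inner: only products present in both tables
inner = pd.merge(sales, categories, on="Product", how="inner")

# Left: every sales row; Category is missing where there is no match
left = pd.merge(sales, categories, on="Product", how="left", indicator=True)

print(inner)
print(left)
```

Checking the _merge column after a left merge is a quick way to spot rows that silently failed to match.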


Part 5: Visualization with Pandas

Pandas plotting methods are built on Matplotlib, so a grouped result can be plotted directly:

        import matplotlib.pyplot as plt
        # Plot sales by product
        df.groupby('Product')['Sales'].sum().plot(kind='bar')
        plt.title("Total Sales by Product")
        plt.ylabel("Sales")
        plt.show()

Tips for Efficient Data Handling

  • Always inspect data with head(), info(), and describe()

  • Handle missing values early

  • Use vectorized operations instead of loops

  • Save cleaned data: df.to_csv("cleaned_data.csv", index=False)
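
Two of these tips can be sketched briefly: a vectorized calculation next to its apply() equivalent, and a save-and-reload round trip. The data is made up, and an in-memory buffer stands in for the cleaned_data.csv file so the example leaves nothing on disk.

```python
import io

import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "Product": ["Laptop", "Phone"],
    "Sales": [1000.0, 500.0],
})

# Vectorized: one operation on the whole column
df["WithTax"] = df["Sales"] * 1.1

# Same result via apply(), which calls the function once per value
with_apply = df["Sales"].apply(lambda x: x * 1.1)
print((df["WithTax"] - with_apply).abs().max())

# Save without the index column, then read the data back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing index=False keeps the row index out of the saved file, so reloading it does not produce an extra unnamed column.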

Real-World Use Cases

  • Sales & Marketing Analysis

  • Financial Reporting

  • Research Data Cleaning

  • Machine Learning Preprocessing

  • Business Intelligence Dashboards

Pandas is the backbone of Python-based data analysis. With it, you can:

  • Read common data formats such as CSV, Excel, and JSON

  • Explore & summarize datasets

  • Clean and preprocess for analysis

  • Prepare data for ML & BI
