Pandas is the Python data analysis library providing the DataFrame abstraction for working with labeled, structured tabular data — the de facto standard for data exploration, cleaning, transformation, and feature engineering throughout the entire ML pipeline from raw data ingestion to model-ready feature matrices.
What Is Pandas?
- Definition: A Python library built on NumPy that provides two primary data structures: DataFrame (2D labeled table, like a SQL table or Excel spreadsheet) and Series (1D labeled array, like a column) — with hundreds of operations for data manipulation, aggregation, merging, and transformation.
- The Key Value: Pandas combines data storage with rich metadata (column names, index labels, dtypes) — making it possible to write self-documenting data transformation code that operates by column name rather than array index.
- Under the Hood: Pandas DataFrames store columns as NumPy arrays (or, since pandas 2.0, optionally Arrow-backed arrays) — vectorized operations drop to C speed while the Python API provides high-level expressiveness.
- Ecosystem Role: The standard output format of data loading tools (CSV, Parquet, SQL, HDF5, Feather) and the standard input format for Scikit-Learn, XGBoost, LightGBM, and feature engineering pipelines.
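A minimal sketch of both structures and the labeled metadata they carry (column names here are made up for illustration):
import pandas as pd

# DataFrame: a 2D labeled table; each column is stored as a typed array under the hood
df = pd.DataFrame({
    "prompt": ["What is 2+2?", "Name a prime number."],
    "score": [0.9, 0.7],
})

# Series: a single labeled column pulled out of the DataFrame
scores = df["score"]

print(df.dtypes)               # per-column dtypes (object, float64)
print(scores.mean())           # vectorized reduction over the column
print(df["score"].to_numpy())  # the underlying NumPy array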
Why Pandas Matters for AI
- EDA (Exploratory Data Analysis): Profile datasets — check distributions, identify nulls, detect outliers, understand class imbalances before model training.
- Data Cleaning: Handle missing values (fillna, dropna), fix data types (astype), remove duplicates, standardize inconsistent values — the grunt work that determines model quality.
- Feature Engineering: Create new features from raw data — time differences, rolling averages, categorical encodings, text length statistics — all expressible as vectorized Pandas operations.
- Train/Val/Test Splits: Stratified splits by category, time-based splits for temporal data — Pandas makes these easy with boolean indexing and groupby operations (see the sketch after this list).
- Results Analysis: After model prediction, merge predictions back with metadata, analyze errors by segment, compute per-category metrics.
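A rough sketch of the last two points, assuming a hypothetical df with timestamp, category, y_true, and y_pred columns:
# Time-based split: train on older rows, validate on the most recent 20%
df = df.sort_values("timestamp")
split = int(len(df) * 0.8)
train, val = df.iloc[:split], df.iloc[split:]

# Per-category error analysis after prediction
df["error"] = (df["y_true"] != df["y_pred"]).astype(int)
error_by_category = df.groupby("category")["error"].mean().sort_values(ascending=False)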
Core Operations
Loading Data:
import pandas as pd
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet") # Faster for large files
df = pd.read_sql("SELECT * FROM qa_responses", conn) # conn: an open DB connection or SQLAlchemy engine
Inspection:
df.shape # (rows, columns)
df.dtypes # column data types
df.describe() # statistical summary
df.isnull().sum() # count nulls per column
df["col"].value_counts() # frequency of each unique value in a column
Selection:
df["column"] # Series (column)
df[["col1", "col2"]] # DataFrame (multiple columns)
df.loc[row_label, col_label] # Label-based indexing
df.iloc[row_idx, col_idx] # Integer-based indexing
df[df["length"] > 500] # Boolean filtering
Transformation:
df["len"] = df["response"].str.len() # Derived column
df["clean"] = df["text"].str.lower().str.strip() # String operations
df["category"] = df["label"].map(label_map) # Apply dictionary mapping
df = df.dropna(subset=["response"]) # Remove rows with null response
df = df.fillna({"score": 0.0}) # Fill nulls with value
Aggregation:
df.groupby("category")["score"].mean() # Mean score per category
df.groupby("model").agg({"tokens": "sum", "cost": "mean"}) # Multiple aggregations
df.pivot_table(index="model", columns="task", values="accuracy") # Pivot table
Performance Anti-Patterns and Fixes
Slow — Row iteration:
for idx, row in df.iterrows():
    df.loc[idx, "new_col"] = process(row["text"]) # orders of magnitude slower than vectorized
Fast — Vectorized:
df["new_col"] = df["text"].apply(process) # apply() still Python but no overhead
df["new_col"] = df["text"].str.len() # True vectorized C operation
Slow — Repeated indexing in loop:
result = []
for i in range(len(df)):
    result.append(df["col"].iloc[i]) # repeated per-element Series indexing
Fast — Direct NumPy:
result = df["col"].values.tolist() # Convert to NumPy array once, then list
Pandas for LLM Dataset Preparation
df = pd.read_json("training_data.jsonl", lines=True)
# Filter short responses
df = df[df["response"].str.len() >= 500]
# Remove duplicates
df = df.drop_duplicates(subset=["prompt"])
# Add token count (tokenizer: a pre-loaded tokenizer object, e.g. from tiktoken or Hugging Face)
df["n_tokens"] = df["prompt"].apply(lambda x: len(tokenizer.encode(x)))
# Filter context length
df = df[df["n_tokens"] <= 4096]
# Sample a balanced dataset (at most 1,000 rows per category)
df_balanced = df.groupby("category", group_keys=False).apply(lambda g: g.sample(min(len(g), 1000)))
# Save for training
df_balanced.to_parquet("training_ready.parquet", index=False)
When to Move Beyond Pandas
| Scenario | Better Tool |
|----------|------------|
| Dataset larger than available RAM (e.g. > 10 GB) | Polars, Dask, Spark |
| Need true multi-threading | Polars (Rust, parallel) |
| Streaming data | Polars lazy, Spark Streaming |
| SQL-native workflow | DuckDB (fast, in-process) |
| NumPy operations only | Skip Pandas, use NumPy directly |
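A rough sketch of the hand-off for two of these tools, reusing the model/accuracy columns from the aggregation examples above (exact method names vary by library version, e.g. group_by vs. groupby in older Polars):
import duckdb
import polars as pl

# DuckDB can run SQL directly against an in-memory Pandas DataFrame
best = duckdb.sql(
    "SELECT model, AVG(accuracy) AS avg_acc FROM df GROUP BY model ORDER BY avg_acc DESC"
).df()

# Polars: convert once, then use its multi-threaded engine
pl_df = pl.from_pandas(df)
summary = pl_df.group_by("model").agg(pl.col("accuracy").mean())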
Pandas is the universal workhorse of Python data science. Its DataFrame abstraction strikes a practical balance between expressiveness and performance for datasets up to a few gigabytes, which makes it the first tool to reach for in the exploration, cleaning, and preparation work that precedes every model training run.