Pandas

Pandas is the Python data analysis library that provides the DataFrame abstraction for labeled, structured tabular data. It is the de facto standard for data exploration, cleaning, transformation, and feature engineering across the ML pipeline, from raw data ingestion to model-ready feature matrices.

What Is Pandas?

Why Pandas Matters for AI

Core Operations

Loading Data:

    import pandas as pd

    df = pd.read_csv("data.csv")
    df = pd.read_parquet("data.parquet")  # Faster for large files
    df = pd.read_sql("SELECT * FROM qa_responses", conn)  # conn: an already-open database connection
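The loaders above all return a DataFrame, so they can be tried without any files on disk. The following is a minimal sketch that reads a small in-memory CSV; the column names and values are invented for illustration and are not part of any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "data.csv".
csv_text = "prompt,response,score\nhi,hello there,0.9\nbye,,0.4\n"

# Read everything with default type inference; the empty field becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Read only the columns that are needed and pin a dtype to save memory.
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["prompt", "score"],
    dtype={"score": "float32"},
)

print(df.shape)
print(small.dtypes)
```

Restricting `usecols` and `dtype` at load time is usually cheaper than loading everything and trimming afterwards.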

Inspection:

    df.shape                       # (rows, columns)
    df.dtypes                      # column data types
    df.describe()                  # statistical summary of numeric columns
    df.isnull().sum()              # count nulls per column
    df["column"].value_counts()    # frequency of each unique value in one column
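As a small illustration of these inspection calls, the sketch below builds a toy DataFrame (the column names and values are made up) and contrasts `value_counts()` on a single column with `value_counts()` on the whole DataFrame, which counts unique rows rather than unique values.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "model": ["a", "b", "a", "c"],
    "score": [0.9, None, 0.7, 0.4],
})

n_rows, n_cols = df.shape
nulls = df.isnull().sum()             # nulls per column
counts = df["model"].value_counts()   # frequency of each value in one column
row_counts = df.value_counts()        # frequency of each unique row (all columns combined)

print(n_rows, n_cols)
print(nulls)
print(counts)
```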

Selection:

    df["column"]                   # Series (column)
    df[["col1", "col2"]]           # DataFrame (multiple columns)
    df.loc[row_label, col_label]   # Label-based indexing
    df.iloc[row_idx, col_idx]      # Integer-based indexing
    df[df["length"] > 500]         # Boolean filtering
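The difference between label-based and position-based indexing is easiest to see when the index is not the default 0..n-1. The sketch below uses a made-up DataFrame with index labels 10, 20, 30 to show that `loc` and `iloc` address the same cell in different ways.

```python
import pandas as pd

# Toy data with a non-default index, invented for illustration.
df = pd.DataFrame(
    {"length": [120, 800, 650], "model": ["a", "b", "c"]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "length"]    # row with index label 20
by_position = df.iloc[1, 0]        # second row, first column

# Boolean filtering; combine conditions with & and parentheses.
long_rows = df[df["length"] > 500]
long_c = df[(df["length"] > 500) & (df["model"] == "c")]

print(by_label, by_position)
print(long_rows)
```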

Transformation:

    df["len"] = df["response"].str.len()               # Derived column
    df["clean"] = df["text"].str.lower().str.strip()   # String operations
    df["category"] = df["label"].map(label_map)        # Apply dictionary mapping
    df = df.dropna(subset=["response"])                # Remove rows with null response
    df = df.fillna({"score": 0.0})                     # Fill nulls with value
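To make the effect of each transformation concrete, here is a minimal sketch on invented data. The `label_map` dictionary is a hypothetical example; note that `map` leaves values that are missing from the dictionary as NaN.

```python
import pandas as pd

# Hypothetical mapping from integer labels to category names.
label_map = {0: "negative", 1: "positive"}

# Toy data, invented for illustration.
df = pd.DataFrame({
    "text": ["  Hello ", "WORLD", "ok"],
    "response": ["fine", None, "good answer"],
    "label": [1, 0, 2],
    "score": [0.5, 0.9, None],
})

df["clean"] = df["text"].str.lower().str.strip()   # normalize case and whitespace
df["category"] = df["label"].map(label_map)        # label 2 is unmapped, so it becomes NaN
df = df.dropna(subset=["response"])                # drops the row with a null response
df = df.fillna({"score": 0.0})                     # fills nulls in "score" only
df["len"] = df["response"].str.len()               # character length of each response

print(df)
```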

Aggregation:

    df.groupby("category")["score"].mean()                             # Mean score per category
    df.groupby("model").agg({"tokens": "sum", "cost": "mean"})         # Multiple aggregations
    df.pivot_table(index="model", columns="task", values="accuracy")   # Pivot table (mean by default)
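The aggregation calls can be sketched on a small evaluation table. The data below is invented, and the example uses named aggregation, an alternative spelling of the dictionary form that gives the output columns explicit names.

```python
import pandas as pd

# Toy evaluation results, invented for illustration.
df = pd.DataFrame({
    "model": ["a", "a", "b", "b"],
    "task": ["qa", "sum", "qa", "sum"],
    "accuracy": [0.8, 0.6, 0.9, 0.7],
    "tokens": [100, 200, 150, 50],
    "cost": [0.10, 0.30, 0.20, 0.40],
})

# Named aggregation: output_column=(input_column, function).
summary = df.groupby("model").agg(
    total_tokens=("tokens", "sum"),
    mean_cost=("cost", "mean"),
)

# One row per model, one column per task; cells hold the mean accuracy.
pivot = df.pivot_table(index="model", columns="task", values="accuracy")

print(summary)
print(pivot)
```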

Performance Anti-Patterns and Fixes

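Slow (row iteration):

    for idx, row in df.iterrows():
        df.loc[idx, "new_col"] = process(row["text"])  # builds a Series per row and writes one cell at a time; often orders of magnitude slower than a column-wise operation

Faster (column-wise):

    df["new_col"] = df["text"].apply(process)  # still calls process() once per element in Python, but avoids the per-row Series and .loc overhead
    df["new_col"] = df["text"].str.len()       # built-in string method; no explicit Python loop in user code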

Slow (repeated indexing in a loop):

    result = []
    for i in range(len(df)):
        result.append(df["col"][i])  # repeated Series lookup by label; also breaks if the index is not 0..n-1

Fast (convert once):

    result = df["col"].to_numpy().tolist()  # Convert to a NumPy array once, then to a list
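The gap between these patterns can be measured directly. The sketch below times three ways of computing a string length column on synthetic data; the row count and the `timed` helper are arbitrary choices for illustration, and absolute timings will vary by machine and pandas version.

```python
import time

import pandas as pd

# Synthetic data; the size is an arbitrary choice to keep the slow path short.
df = pd.DataFrame({"text": ["some response text"] * 5_000})


def timed(fn):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start


def with_iterrows():
    out = df.copy()
    out["n"] = 0
    for idx, row in df.iterrows():          # one Series object per row
        out.loc[idx, "n"] = len(row["text"])
    return out["n"]


def with_apply():
    return df["text"].apply(len)            # Python call per element, no per-row Series


def with_str():
    return df["text"].str.len()             # built-in string accessor


slow, t_slow = timed(with_iterrows)
mid, t_mid = timed(with_apply)
fast, t_fast = timed(with_str)

print(f"iterrows: {t_slow:.4f}s  apply: {t_mid:.4f}s  str.len: {t_fast:.4f}s")
```

All three produce the same values, so the choice between them is purely about cost.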

Pandas for LLM Dataset Preparation

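    df = pd.read_json("training_data.jsonl", lines=True)

    # Filter short responses (keep responses of at least 500 characters)
    df = df[df["response"].str.len() >= 500]

    # Remove duplicates
    df = df.drop_duplicates(subset=["prompt"])

    # Add token count (tokenizer is assumed to be loaded elsewhere and to expose encode())
    df["n_tokens"] = df["prompt"].apply(lambda x: len(tokenizer.encode(x)))

    # Filter context length
    df = df[df["n_tokens"] <= 4096]

    # Sample balanced dataset: shuffle, then keep at most 1000 rows per category.
    # This avoids groupby().apply(), whose handling of the grouping column differs across pandas versions.
    df_balanced = df.sample(frac=1).groupby("category").head(1000)

    # Save for training
    df_balanced.to_parquet("training_ready.parquet", index=False)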
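The same pipeline can be exercised end to end on a few in-memory records. This is a sketch under stated assumptions: `WhitespaceTokenizer` is a hypothetical stand-in for a real tokenizer, the records are invented, and the thresholds are shrunk so the toy data passes through each filter. The Parquet write is omitted because it needs an optional engine such as pyarrow.

```python
import io

import pandas as pd


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer; splits on whitespace."""

    def encode(self, text):
        return text.split()


tokenizer = WhitespaceTokenizer()

# Invented JSONL records for illustration.
jsonl = "\n".join([
    '{"prompt": "What is pandas?", "response": "A Python library for tabular data.", "category": "qa"}',
    '{"prompt": "What is pandas?", "response": "Duplicate prompt with another answer.", "category": "qa"}',
    '{"prompt": "Define DataFrame", "response": "ok", "category": "qa"}',
    '{"prompt": "Summarize the report", "response": "The report covers quarterly results.", "category": "summarization"}',
])
df = pd.read_json(io.StringIO(jsonl), lines=True)

# Thresholds scaled down for the toy data (the text above uses 500, 4096, 1000).
MIN_CHARS, MAX_TOKENS, PER_CATEGORY = 10, 8, 1

df = df[df["response"].str.len() >= MIN_CHARS]     # drop very short responses
df = df.drop_duplicates(subset=["prompt"])         # keep the first row per prompt
df = df.assign(n_tokens=df["prompt"].apply(lambda x: len(tokenizer.encode(x))))
df = df[df["n_tokens"] <= MAX_TOKENS]              # enforce the context limit

# Shuffle with a fixed seed (arbitrary), then cap the rows kept per category.
df_balanced = (
    df.sample(frac=1, random_state=0)
    .groupby("category")
    .head(PER_CATEGORY)
    .reset_index(drop=True)
)

print(df_balanced[["prompt", "category", "n_tokens"]])
```

Using `assign` instead of writing a column into a filtered frame avoids the chained-assignment warning that some pandas versions emit.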

When to Move Beyond Pandas

| Scenario | Better Tool |
|---|---|
| Dataset > 10 GB in RAM | Polars, Dask, Spark |
| Need true multi-threading | Polars (Rust, parallel) |
| Streaming data | Polars lazy, Spark Streaming |
| SQL-native workflow | DuckDB (fast, in-process) |
| NumPy operations only | Skip Pandas, use NumPy directly |

Pandas is the universal workhorse of Python data science. Its DataFrame abstraction balances expressiveness and performance well for datasets up to a few gigabytes, which makes it the first tool most practitioners reach for in the data exploration, cleaning, and preparation that precede a model training run.
