Pandas for Finance: Cleaning Market Data: Nulls, Forward-Fill, Adj Close Trap, Outliers
Video: Cleaning Market Data: Nulls, Forward-Fill, Adj Close Trap, Outliers | Pandas for Finance Ep12 by CelesteAI
df.isna().sum() — count nulls. df["Adj Close"] accounts for splits/dividends; Close doesn't. .fillna()/.dropna() for missing. df["Volume"] > 0 filters bad rows. The "your backtest lies if you skip this" lesson.
Real data is messy. Splits, dividends, suspended trading, exchange glitches, ticker changes. The data fixes you don’t see in a tutorial DataFrame are the data fixes that make or break a backtest.
Inspect: where are the holes?
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
print("Total rows:", len(df))
print("Tickers:", df["Ticker"].nunique())
print("\nNull counts:")
print(df.isna().sum())
Total rows: 24375
Tickers: 14
Null counts:
Date 0
Ticker 0
Open 12
High 12
Low 12
Close 12
Adj Close 12
Volume 12
12 rows where the price columns are null. Either a holiday closed one market while others traded, or the ticker simply didn't trade that day.
# Where exactly?
nulls = df[df["Close"].isna()]
print(nulls[["Date", "Ticker"]].head())
For our universe these are usually the very first day (data not available yet). Drop or fill.
fillna vs dropna
# Drop rows with any null
clean = df.dropna()
# Drop only where Close is null (other cols can be null)
clean = df.dropna(subset=["Close"])
# Fill with last good value (forward fill) — common for prices
df = df.sort_values(["Ticker", "Date"])
df[["Open","High","Low","Close","Adj Close"]] = (
    df.groupby("Ticker")[["Open","High","Low","Close","Adj Close"]].ffill()
)
# Fill volume with 0 (no trades)
df["Volume"] = df["Volume"].fillna(0)
For prices, forward-fill within each ticker — carry yesterday’s close. For volume, fill 0 (literally no trades).
Don’t forward-fill across the ticker boundary (groupby("Ticker") ensures it stops there).
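A toy two-ticker frame (tickers AAA and BBB are made up) shows the leak concretely:

```python
import pandas as pd

# Hypothetical frame: AAA ends with a price, BBB starts with a null.
df = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB"],
    "Close":  [10.0, 11.0, None, 20.0],
})

# Plain ffill leaks AAA's last close into BBB's first row.
leaked = df["Close"].ffill()
print(leaked.iloc[2])   # 11.0, AAA's price wrongly attributed to BBB

# Grouped ffill stops at the ticker boundary; BBB's first row stays NaN.
safe = df.groupby("Ticker")["Close"].ffill()
print(safe.iloc[2])     # NaN
```

The grouped version is what you want in every multi-ticker pipeline.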
Splits and dividends — the trap
Yahoo gives both:
df[df["Ticker"] == "AAPL"][["Date", "Close", "Adj Close"]].iloc[[0, 100, 500, 1000, -1]]
Date Close Adj Close
0 2018-01-02 43.06 40.55
100 2018-05-29 46.93 44.51
500 2019-12-26 72.49 70.06
1000 2021-12-21 172.99 169.79
-1 2024-12-31 250.42 250.42
Close is the actual price; Adj Close is back-adjusted for all splits and dividends.
For returns and backtests, always use Adj Close. Otherwise a 4:1 split looks like a 75% drop.
# WRONG — makes AAPL's 2020 4:1 split look like a crash
df["Return"] = df.groupby("Ticker")["Close"].pct_change()
# RIGHT
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
For “what did it actually trade at on day X” — use Close.
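A toy illustration of the trap, assuming raw (unadjusted) closes around a hypothetical 4:1 split on a made-up ticker XYZ:

```python
import pandas as pd

# Raw Close drops 500 -> 125 overnight at the split;
# the back-adjusted series is continuous.
df = pd.DataFrame({
    "Ticker":    ["XYZ"] * 3,
    "Close":     [500.0, 125.0, 126.0],   # raw prices, 4:1 split on day 2
    "Adj Close": [125.0, 125.0, 126.0],   # back-adjusted for the split
})

wrong = df.groupby("Ticker")["Close"].pct_change()
right = df.groupby("Ticker")["Adj Close"].pct_change()

print(wrong.iloc[1])  # -0.75, a phantom 75% "crash"
print(right.iloc[1])  # 0.0, nothing actually happened
```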
Detecting outliers
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
# Anything > 5σ from mean
mu = df.groupby("Ticker")["Return"].transform("mean")
sigma = df.groupby("Ticker")["Return"].transform("std")
outliers = df[(df["Return"] - mu).abs() > 5 * sigma]
print(outliers[["Date", "Ticker", "Return"]].head())
Most are real (earnings, market crash days). A few will be data errors — share splits not properly adjusted, decimals in the wrong place.
For automated cleaning, don’t drop outliers blindly. Inspect first. The 2020 COVID crash days look like outliers but are real.
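One quick triage heuristic (a sketch, with fabricated returns): a big move shared by many tickers on the same date is probably a real market event, while a lone ticker moving by itself deserves a data-error check.

```python
import pandas as pd

# Toy returns: on 2020-03-16 everything crashes; on 2021-06-01 one ticker "jumps".
df = pd.DataFrame({
    "Date":   ["2020-03-16"] * 3 + ["2021-06-01"] * 3,
    "Ticker": ["A", "B", "C"] * 2,
    "Return": [-0.12, -0.10, -0.11, 0.001, 0.002, 0.90],
})

# Count how many distinct tickers moved more than 8% on each date.
big = df[df["Return"].abs() > 0.08]
breadth = big.groupby("Date")["Ticker"].nunique()
print(breadth)
# 2020-03-16: 3 tickers -> market-wide, probably real
# 2021-06-01: 1 ticker  -> isolated, inspect for a data error
```

The 8% cut-off here is arbitrary; tune it to your universe.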
Suspended trading
# Days with zero volume = suspended
suspended = df[df["Volume"] == 0]
print(suspended[["Date", "Ticker"]].drop_duplicates("Date").head())
Drop these — they’re not real prices:
df = df[df["Volume"] > 0].copy()
(For ETFs, zero volume can happen on quiet days legitimately. Adjust the threshold based on your universe.)
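One way to pick that threshold is to look at how routine zero-volume days are per ticker. A sketch with fabricated volumes:

```python
import pandas as pd

# Hypothetical universe: a liquid stock vs. a thinly traded ETF.
df = pd.DataFrame({
    "Ticker": ["STOCK"] * 4 + ["QUIET_ETF"] * 4,
    "Volume": [100, 0, 200, 300, 0, 0, 50, 0],
})

# Fraction of zero-volume days per ticker.
zero_rate = df.groupby("Ticker")["Volume"].apply(lambda v: (v == 0).mean())
print(zero_rate)
# QUIET_ETF: 0.75 -> zero volume is routine here; don't blanket-drop
# STOCK:     0.25 -> rare; likely genuine suspensions
```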
Calendar consistency
Different tickers may trade on different days (some markets close for local holidays). To align:
# Get the full calendar (union of all trading days)
all_dates = df["Date"].drop_duplicates().sort_values()
# For one ticker, reindex to the full calendar
aapl = (
    df[df["Ticker"] == "AAPL"]
    .set_index("Date")
    .reindex(all_dates)
    .ffill()
)
reindex aligns to a target index, putting NaN where missing. ffill() then carries the last value.
For multi-ticker work, pivot then forward-fill:
wide = df.pivot(index="Date", columns="Ticker", values="Adj Close").ffill()
All tickers on the same calendar, missing days filled with the last good value.
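A tiny worked example of the pivot-then-ffill alignment, with made-up tickers and dates:

```python
import pandas as pd

# Long frame where BBB has no row for 2024-01-03.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-02"]),
    "Ticker": ["AAA", "AAA", "BBB"],
    "Adj Close": [10.0, 11.0, 20.0],
})

wide = df.pivot(index="Date", columns="Ticker", values="Adj Close").ffill()
# Pivot creates a NaN for BBB on 2024-01-03; ffill carries 20.0 forward.
print(wide)
```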
String cleanup
Sometimes ticker columns come dirty:
df["Ticker"] = df["Ticker"].str.strip().str.upper()
.str.strip() removes leading/trailing whitespace. .str.upper() standardizes case. Both are vectorized.
For replacing values:
df["Ticker"] = df["Ticker"].replace({"GOOG": "GOOGL", "FB": "META"})
Map old → new in a dict.
Type coercion
# Force numeric (errors → NaN)
df["Close"] = pd.to_numeric(df["Close"], errors="coerce")
# Force date
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
# Categorical (saves memory for repeated strings)
df["Ticker"] = df["Ticker"].astype("category")
category dtype is huge for repeated strings. Ticker with 14 unique values across 24k rows: from ~150 KB string storage to ~25 KB.
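You can measure the saving directly with memory_usage(deep=True); exact byte counts vary by pandas version, so treat the numbers as indicative:

```python
import pandas as pd

# 24k rows, 3 unique tickers (made-up data, similar shape to our universe).
tickers = pd.Series(["AAPL", "MSFT", "GOOGL"] * 8000)

as_str = tickers.memory_usage(deep=True)
as_cat = tickers.astype("category").memory_usage(deep=True)
print(f"object: {as_str:,} bytes, category: {as_cat:,} bytes")
# category stores one small integer code per row plus each string once
```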
Validation checklist
def validate(df):
    checks = {
        "no nulls in key cols": df[["Date", "Ticker", "Adj Close"]].isna().sum().sum() == 0,
        "Date is datetime": pd.api.types.is_datetime64_any_dtype(df["Date"]),
        "Close is numeric": pd.api.types.is_numeric_dtype(df["Close"]),
        "Volume non-negative": (df["Volume"] >= 0).all(),
        "Adj Close <= Close": (df["Adj Close"] <= df["Close"] + 1e-6).all(),
        "Min date present": df["Date"].min().year >= 2018,
    }
    for name, ok in checks.items():
        print(f" {'✓' if ok else '✗'} {name}")
validate(df)
A health check function. Run it after any pipeline that produces a clean DataFrame.
Try it
clean.py:
import pandas as pd
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")
df = pd.read_parquet("data/prices.parquet")
print("=== Before cleaning ===")
print(f"Rows: {len(df)}")
print(f"Null Close rows: {df['Close'].isna().sum()}")
print(f"Zero-volume rows: {(df['Volume'] == 0).sum()}")
# Clean
df = df.sort_values(["Ticker", "Date"])
df[["Open", "High", "Low", "Close", "Adj Close"]] = (
    df.groupby("Ticker")[["Open", "High", "Low", "Close", "Adj Close"]].ffill()
)
df["Volume"] = df["Volume"].fillna(0)
df = df[df["Volume"] > 0].copy()
df["Ticker"] = df["Ticker"].astype("category")
print("\n=== After cleaning ===")
print(f"Rows: {len(df)}")
print(f"Null Close: {df['Close'].isna().sum()}")
print(f"Adj Close vs Close: max diff {(df['Close'] - df['Adj Close']).abs().max():.2f}")
print("\n=== Outliers (>5σ daily moves) ===")
df["Return"] = df.groupby("Ticker", observed=True)["Adj Close"].pct_change()
mu = df.groupby("Ticker", observed=True)["Return"].transform("mean")
sigma = df.groupby("Ticker", observed=True)["Return"].transform("std")
outliers = df[(df["Return"] - mu).abs() > 5 * sigma]
print(outliers[["Date", "Ticker", "Return"]].head())
print(f"Total outlier rows: {len(outliers)}")
Common stumbles
Forward-fill across tickers. df.ffill() carries values across the boundary between Ticker A and Ticker B. Always df.groupby("Ticker").ffill().
Using Close for backtests. Splits and dividends ruin returns. Always Adj Close for performance.
Dropping outliers without checking. A real 30% gap on earnings is not noise. Inspect each before discarding.
Reindex without ffill. Just reindexing introduces NaN; forward-fill within ticker after.
category after groupby. Categorical Series in groupby may surprise — use groupby(..., observed=True) to skip empty categories.
Inconsistent date types. Mixing strings and Timestamps fails on comparison. Convert at load: df["Date"] = pd.to_datetime(df["Date"]).
replace() not finding it. .replace() works on whole values; for partial matches use .str.replace(pattern, repl, regex=True).
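The whole-value vs. substring distinction in that last stumble is easy to demonstrate (hypothetical suffixed tickers):

```python
import pandas as pd

s = pd.Series(["AAPL.US", "MSFT.US"])

# .replace matches whole cell values only, so MSFT.US is untouched here.
whole = s.replace({"AAPL.US": "AAPL"})

# .str.replace edits substrings; strip the ".US" suffix everywhere.
partial = s.str.replace(r"\.US$", "", regex=True)
print(partial.tolist())  # ['AAPL', 'MSFT']
```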
What’s next
Ep 13: writing out — Excel multi-sheet, Parquet partitioned, DuckDB persistent.
Recap
df.isna().sum() for null audit. Forward-fill within ticker (groupby("Ticker").ffill()); fillna(0) for volume. Always Adj Close for returns and backtests. Drop zero-volume rows. Detect outliers (>5σ) but inspect — don’t blindly drop. astype("category") for repeated strings (huge memory win). Validate clean DataFrames with a checklist function before trusting them downstream.
Next episode: writing out.