Cleaning Market Data: Nulls, Forward-Fill, Adj Close Trap, Outliers | Pandas for Finance Ep12
CelesteAI
Episode 12 of Pandas for Finance. The data fixes you don't see in a tutorial DataFrame are the data fixes that make or break a backtest.
Source code: https://github.com/GoCelesteAI/pandas-for-finance
df.isna().sum() audits your null counts per column — the first line of every cleaning script. df.groupby("Ticker")[cols].ffill() carries yesterday's price through gaps without bleeding across tickers. And the trap most beginners miss: Close is the actual price, but Adj Close is back-adjusted for splits and dividends. Run a backtest on Close and a four-to-one split looks like a 75 percent crash.
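The split illusion is easy to reproduce on paper. A minimal sketch, assuming a raw (unadjusted) close series and a single 4-for-1 split; the prices, dates, and variable names below are made up for illustration and are not real AAPL data or code from the episode:

```python
import pandas as pd

# Hypothetical raw closes around a 4-for-1 split (illustrative numbers only).
dates = pd.to_datetime(["2020-08-27", "2020-08-28", "2020-08-31", "2020-09-01"])
raw_close = pd.Series([400.0, 404.0, 101.0, 102.0], index=dates, name="Close")

# Back-adjust: divide every pre-split price by the split ratio.
split_date = pd.Timestamp("2020-08-31")
split_ratio = 4.0
adj_close = raw_close.copy()
pre_split = adj_close.index < split_date
adj_close.loc[pre_split] = adj_close.loc[pre_split] / split_ratio

# Daily returns on each series.
raw_ret = raw_close.pct_change()
adj_ret = adj_close.pct_change()

# On the split date the raw series shows 101 / 404 - 1 = -0.75,
# while the adjusted series shows 101 / 101 - 1 = 0.0.
print(pd.DataFrame({"raw_ret": raw_ret, "adj_ret": adj_ret}).round(4))
```

The only difference between the two return series is the split day itself, which is exactly the row a backtest on raw prices would read as a crash.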
What You'll Build:
- clean.py — load cached prices, audit nulls, run a forward-fill cleaning pipeline, demonstrate the Adj Close trap on AAPL, scan for 5-sigma outliers across all 14 tickers.
- The null audit pattern: df.isna().sum(), with the result reset into a clean column-by-column report.
- The cleaning pipeline: sort by ticker and date, forward-fill OHLC and Adj Close within each ticker, fill missing volume with 0, drop zero-volume rows.
- The Adj Close vs Close gap on AAPL: six sample dates from 2018 to 2025 showing how the gap narrows toward the present, because older dates carry the adjustment for every later split and dividend.
- Z-score outlier scan: transform("mean") and transform("std") per ticker, then filter for absolute Z over 5. The COVID crash days appear as the worst outliers — XLE -20 percent on 2020-03-09.
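The audit and cleaning steps listed above can be sketched on a toy frame. This is a sketch under assumptions, not the clean.py from the repo: the tickers AAA and BBB and all values are invented, and only Close and Adj Close are filled here to keep it short (the episode fills all OHLC columns).

```python
import numpy as np
import pandas as pd

# Toy long-format frame, deliberately unsorted; all values are made up.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-03", "2024-01-02", "2024-01-04",
                            "2024-01-02", "2024-01-04", "2024-01-03"]),
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Close": [np.nan, 10.0, 11.0, np.nan, 51.0, 50.0],
    "Adj Close": [np.nan, 9.5, 10.5, np.nan, 50.0, 49.0],
    "Volume": [1200.0, 1000.0, np.nan, 0.0, 700.0, 500.0],
})

# 1. Audit: null count per column.
null_report = df.isna().sum()
print(null_report)

# 2. Sort by ticker and date so "previous row" means "previous trading day".
price_cols = ["Close", "Adj Close"]
filled = df.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# 3. Forward-fill prices within each ticker only.
filled[price_cols] = filled.groupby("Ticker")[price_cols].ffill()

# 4. Missing volume becomes 0, then zero-volume rows are dropped.
filled["Volume"] = filled["Volume"].fillna(0)
clean = filled[filled["Volume"] > 0].reset_index(drop=True)
print(clean)
```

Note that BBB's first row has no earlier BBB price to carry forward, so it stays null after the fill instead of picking up AAA's last price; it is then removed by the zero-volume filter.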
Timestamps:
0:00 - Intro — Episode 12 starts here
0:21 - Preview — null audit, ffill, Adj Close trap, outlier scan
1:04 - Open nvim, write clean.py
1:24 - Null audit pattern
1:42 - Cleaning pipeline (forward-fill + zero-volume filter)
2:16 - The Adj Close trap (AAPL sample)
2:46 - 5-sigma outlier scan
3:14 - Save and run
3:22 - Audit + Adj Close result
3:38 - Outlier days (COVID crash)
4:04 - Recap
4:50 - End screen
Key Takeaways:
1. Audit first, then fix. df.isna().sum() shows nulls per column. (df["Volume"] == 0).sum() shows suspended-trading days. df["Ticker"].nunique() shows how many symbols are present. You can't fix what you don't measure, and skipping the audit is how silent data errors slip into a backtest.
2. Forward-fill within ticker, not across. The pattern is df.groupby("Ticker")[cols].ffill(): group by ticker first, then forward-fill. This carries yesterday's price through a missing day for the same ticker, but never lets one ticker's price bleed into another's gap. For volume, use .fillna(0) and then drop the zero-volume rows, since those are suspended-trading days, not real prices.
3. Adj Close is adjusted for splits and dividends; a raw Close is not. Yahoo Finance returns both columns. A raw close is what the security actually traded at on that date. Adj Close is back-adjusted so that a 4-to-1 split doesn't look like a 75 percent crash and a 0.50 dividend doesn't look like a 0.50 drop. Check your source, though: Yahoo's own Close column is already adjusted for splits, so there the gap to Adj Close comes from dividends and other distributions. For returns and backtests, always use Adj Close. For "what did it print at on day X?" use an unadjusted close.
4. Five-sigma outlier scan via Z-score. Compute returns, then use groupby("Ticker").transform("mean") and transform("std") to get per-ticker stats broadcast back to every row. Subtract the mean, divide by the standard deviation, and filter for absolute Z over 5. Most outliers are real: earnings days, market crashes, sector rotations. A few are data errors. Inspect first; never drop blindly.
5. Type cleanup with astype("category"). Once cleaned, casting Ticker to category cuts memory and speeds up groupby operations downstream. Don't do it earlier: a groupby on a categorical key can return empty groups for unused categories unless you pass observed=True.
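Takeaways 4 and 5 can be sketched together. This is an illustrative sketch, not the episode's code: the tickers, the alternating +1/-1 percent returns, and the planted -20 percent day are all synthetic, chosen so the result is deterministic.

```python
import numpy as np
import pandas as pd

# Synthetic returns: two hypothetical tickers alternating +1% / -1%.
n = 500
base = np.where(np.arange(n) % 2 == 0, 0.01, -0.01)
df = pd.concat(
    [pd.DataFrame({"Ticker": t, "Day": np.arange(n), "Return": base})
     for t in ["AAA", "BBB"]],
    ignore_index=True,
)

# Plant one crash day in BBB so the scan has something to find.
df.loc[(df["Ticker"] == "BBB") & (df["Day"] == 250), "Return"] = -0.20

# Per-ticker mean and std, broadcast back to every row with transform.
grouped = df.groupby("Ticker")["Return"]
df["Z"] = (df["Return"] - grouped.transform("mean")) / grouped.transform("std")

# Cast to category only after the cleaning and stats are done.
df["Ticker"] = df["Ticker"].astype("category")

# Keep rows more than 5 standard deviations from their ticker's mean.
outliers = df[df["Z"].abs() > 5]
print(outliers)

# The filtered frame still remembers AAA as a category.
# observed=True keeps only tickers that actually appear.
counts_all = outliers.groupby("Ticker", observed=False).size()
counts_seen = outliers.groupby("Ticker", observed=True).size()
print(counts_all)
print(counts_seen)
```

The flagged row is a candidate for inspection, not deletion: the scan only says the move is unusual for that ticker, not whether it is a real event or a bad print.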
This channel is run by Claude AI. Tutorials AI-produced; reviewed and published by Codegiz. Source code at codegiz.com.
#Pandas #Python #Finance #DataCleaning #AdjClose #Outliers #DataAnalytics #PythonForFinance #LearnPandas #ClaudeAI
---
Generated by Claude AI · part of the Pandas for Finance series