Cleaning Market Data: Nulls, Forward-Fill, Adj Close Trap, Outliers | Pandas for Finance Ep12

CelesteAI
Episode 12 of *Pandas for Finance*. The data fixes you don't see in a tutorial DataFrame are the ones that make or break a backtest. `df.isna().sum()` audits null counts per column and is the first line of every cleaning script. `df.groupby("Ticker")[cols].ffill()` carries yesterday's price through gaps without bleeding across tickers. And the trap most beginners miss: `Close` is the quoted price, while `Adj Close` is back-adjusted for splits and dividends. Run a backtest on unadjusted closes and a four-to-one split looks like a 75 percent crash.

What You'll Build:
- `clean.py`: load cached prices, audit nulls, run a forward-fill cleaning pipeline, demonstrate the Adj Close trap on AAPL, and scan for 5-sigma outliers across all 14 tickers.
- The null audit pattern: `df.isna().sum()` reset to a clean column-by-column report.
- The cleaning pipeline: sort by ticker and date, forward-fill OHLC and Adj Close within each ticker, fill missing volume with 0, drop zero-volume rows.
- The Adj Close vs Close gap on AAPL: six sample dates from 2018 to 2025 showing how the gap narrows toward the present, because older dates carry more accumulated adjustments.
- Z-score outlier scan: `transform("mean")` and `transform("std")` per ticker, then filter for absolute Z over 5. The COVID crash days appear as the worst outliers, such as XLE at -20 percent on 2020-03-09.

Timestamps:
0:00 - Intro: Episode 12 starts here
0:21 - Preview: null audit, ffill, Adj Close trap, outlier scan
1:04 - Open nvim, write clean.py
1:24 - Null audit pattern
1:42 - Cleaning pipeline (forward-fill + zero-volume filter)
2:16 - The Adj Close trap (AAPL sample)
2:46 - 5-sigma outlier scan
3:14 - Save and run
3:22 - Audit + Adj Close result
3:38 - Outlier days (COVID crash)
4:04 - Recap
4:50 - End screen

Key Takeaways:

1. **Audit first, then fix.** `df.isna().sum()` shows nulls per column. `(df["Volume"] == 0).sum()` shows suspended-trading days. `df["Ticker"].nunique()` shows how many symbols are present. You can't fix what you don't measure, and skipping the audit is how silent data errors slip into a backtest.

2. **Forward-fill within ticker, not across.** The pattern is `df.groupby("Ticker")[cols].ffill()`: group by ticker first, then forward-fill. This carries yesterday's price through a missing day for the same ticker, but never lets one ticker's price bleed into another's gap. For volume, use `.fillna(0)` and then drop zero-volume rows; those are suspended-trading days, not real prices.

3. **Adj Close is split-and-dividend adjusted; Close is not.** Yahoo Finance returns both columns. `Close` is the quoted closing price for that date; note that Yahoo already adjusts its `Close` for splits, but not for dividends. `Adj Close` is back-adjusted so that a 4-to-1 split doesn't look like a 75 percent crash and a 0.50 dividend doesn't look like a 0.50 drop. **For returns and backtests, always use `Adj Close`.** For "what did it print at on day X?" use `Close`.

4. **Five-sigma outlier scan via Z-score.** Compute returns, then use `groupby("Ticker").transform("mean")` and `transform("std")` to get per-ticker stats broadcast back to every row. Subtract the mean, divide by the standard deviation, and filter for absolute Z over 5. Most outliers are real: earnings days, market crashes, sector rotations. A few are data errors. **Inspect first; never drop blindly.**

5. **Type cleanup with `astype("category")`.** Once the data is cleaned, casting `Ticker` to `category` cuts memory and speeds up groupby operations downstream. Don't do it earlier: in many pandas versions, grouping on a categorical key also emits unobserved categories unless you pass `observed=True`.

This channel is run by Claude AI. Tutorials are AI-produced; reviewed and published by Codegiz. Source code at codegiz.com.

#Pandas #Python #Finance #DataCleaning #AdjClose #Outliers #DataAnalytics #PythonForFinance #LearnPandas #ClaudeAI
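
The audit and forward-fill steps from takeaways 1, 2, and 5 can be sketched roughly as follows. This is a minimal illustration, not the episode's actual `clean.py`: the tickers `AAA` and `BBB`, the toy prices, and the helper names `audit` and `clean` are all made up for the example, and only `Close` and `Adj Close` stand in for the full OHLC set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the cached prices (two made-up tickers).
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"])
raw = pd.DataFrame({
    "Date": list(dates) * 2,
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Close": [10.0, np.nan, 10.4, 10.5, np.nan, 50.0, 51.0, 52.0],
    "Adj Close": [9.8, np.nan, 10.2, 10.5, np.nan, 49.0, 50.5, 52.0],
    "Volume": [1000.0, 1100.0, 1200.0, 900.0, np.nan, 0.0, 500.0, 700.0],
})

def audit(df):
    """Collect the three audit numbers: nulls per column, zero-volume rows, ticker count."""
    return {
        "nulls": df.isna().sum().rename_axis("column").reset_index(name="nulls"),
        "zero_volume_rows": int((df["Volume"] == 0).sum()),
        "tickers": int(df["Ticker"].nunique()),
    }

def clean(df, price_cols=("Close", "Adj Close")):
    """Forward-fill prices within each ticker, then drop zero-volume rows."""
    cols = list(price_cols)
    out = df.sort_values(["Ticker", "Date"]).reset_index(drop=True)
    # Grouping first means BBB's leading gap is NOT filled with AAA's last price.
    out[cols] = out.groupby("Ticker")[cols].ffill()
    out["Volume"] = out["Volume"].fillna(0)
    out = out[out["Volume"] > 0].reset_index(drop=True)
    # Cast to category only after cleaning is done.
    out["Ticker"] = out["Ticker"].astype("category")
    return out

report = audit(raw)
cleaned = clean(raw)
print(report["nulls"])
print(cleaned)
```

In this toy frame, the missing AAA price on the second day is carried forward from the first day, while BBB's first two rows (missing or zero volume) are dropped rather than borrowing a price from AAA.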
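
The arithmetic behind the split trap in takeaway 3 can be checked with a few made-up numbers. The prices and dates below are hypothetical, not real quotes, and the manual back-adjustment is a simplified stand-in for what a data vendor does (it ignores dividends entirely).

```python
import pandas as pd

# Hypothetical UNADJUSTED closes around a 4-for-1 split that takes effect on the third day.
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"])
raw_close = pd.Series([400.0, 404.0, 101.0, 102.0], index=dates, name="raw_close")

# Simplified back-adjustment: divide every pre-split price by the split ratio.
split_ratio = 4.0
split_date = pd.Timestamp("2024-01-04")
adjusted = raw_close.copy()
adjusted.loc[adjusted.index < split_date] /= split_ratio

raw_ret = raw_close.pct_change()
adj_ret = adjusted.pct_change()

print(pd.DataFrame({"raw_ret": raw_ret, "adj_ret": adj_ret}))
```

On the split day the unadjusted series shows a -75 percent return (101 / 404 - 1), while the back-adjusted series shows a flat day, which is why return calculations belong on the adjusted column.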
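
The Z-score scan from takeaway 4 might look something like the sketch below. The tickers `CALM` and `SHOCK`, the synthetic alternating returns, and the function name `outlier_scan` are assumptions made for illustration; the source only specifies the `transform("mean")` / `transform("std")` pattern and the threshold of 5.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic returns: alternating +1% / -1% days for two made-up tickers.
n = 250
base = np.tile([0.01, -0.01], n // 2)
shock = base.copy()
shock[100] = -0.20  # one injected crash day

dates = pd.bdate_range("2020-01-01", periods=n)
frames = []
for ticker, rets in {"CALM": base, "SHOCK": shock}.items():
    frames.append(pd.DataFrame({
        "Date": dates,
        "Ticker": ticker,
        "Adj Close": 100 * np.cumprod(1 + rets),
    }))
prices = pd.concat(frames, ignore_index=True)

def outlier_scan(df, threshold=5.0):
    """Flag rows whose per-ticker return Z-score exceeds the threshold in absolute value."""
    out = df.sort_values(["Ticker", "Date"]).copy()
    out["ret"] = out.groupby("Ticker")["Adj Close"].pct_change()
    grouped = out.groupby("Ticker")["ret"]
    # transform() broadcasts each ticker's mean and std back onto every row.
    out["z"] = (out["ret"] - grouped.transform("mean")) / grouped.transform("std")
    return out[out["z"].abs() > threshold]

outliers = outlier_scan(prices)
print(outliers[["Date", "Ticker", "ret", "z"]])
```

The scan only flags rows for inspection. Whether a flagged day is a real market move or a data error still has to be checked by hand before anything is dropped.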