Loading & Caching Real Data: Parquet, load_or_download Pattern | Pandas for Finance Ep3

CelesteAI
Episode 3 of *Pandas for Finance*. Stop redownloading the same five years of price data every time you tweak your script. If you finished Episode 2, you know the DataFrame shape. Now we cache it. `df.to_parquet("prices.parquet")` writes; `pd.read_parquet("prices.parquet")` reads. Parquet is ten times smaller than CSV, a hundred times faster to load, and preserves your dtypes — including the DatetimeIndex — across the round trip. Combined with the `load_or_download` pattern, you hit Yahoo Finance exactly once and read from disk for the rest of the series.

What You'll Build:
- A working `load.py` that downloads three tickers (Apple, Microsoft, Google) over five years, stacks the result to long format (one row per Date and Ticker), and caches it to `data/prices.parquet`.
- The `load_or_download` function: cache-first, falling back to `yf.download` only when the parquet is missing.
- A live demonstration of the speed win — the first run shows "Downloading from Yahoo..." and waits a few seconds; the second shows "Loading from data/prices.parquet" and prints in milliseconds.
- An `ls -lh data/` check showing the parquet weighs under 200 KB for 3,774 rows of bars.

Timestamps:
0:00 - Intro — Episode 3 starts here
0:17 - Preview — parquet, the load_or_download pattern
0:54 - Open nvim, write load.py
1:14 - Cache path and tickers
1:24 - load_or_download function (cache-first)
1:46 - Save and run, first hit to Yahoo
2:00 - Downloading message + network
2:10 - Long format output, 3,774 rows
2:26 - ls -lh: the parquet is tiny
2:36 - Second run, loads from cache
2:48 - The pattern in one sentence
3:08 - Recap — to_parquet, the pattern, long format
3:43 - End screen

Key Takeaways:
1. **Parquet is the analytics format**, full stop. Columnar storage, compression, and dtype preservation in one file. A 5-year, 14-ticker, daily-bar dataset that takes 3.2 MB as CSV fits in 280 KB as parquet — and reloads about a hundred times faster. Two functions to remember: `df.to_parquet(path)` and `pd.read_parquet(path)`. Everything else (CSV, Excel, JSON) is for sharing or interop, not analysis.
2. **The `load_or_download` pattern.** Every script that loads market data should follow it: check whether the cache file exists, return early if it does, otherwise download and write the cache before returning. The first run is slow; every subsequent run is instant. Force a refresh by deleting the parquet file. This one function template stays in your toolbox for every analyst pipeline.
3. **Long format beats wide for analysis.** Multi-ticker `yf.download(...)` returns a MultiIndex column shape — `(Open, AAPL)`, `(Open, MSFT)`, and so on. Stack it: `df.stack(level=1, future_stack=True).reset_index()`. Now you have one row per (Date, Ticker), and operations like `df[df["Ticker"] == "AAPL"]` or `df.groupby("Ticker")` just work. The series uses long format from here onward.
4. **`pyarrow` is the parquet engine.** Pandas defers parquet I/O to a separate library, and `pyarrow` is the standard choice. If you see "no engine for parquet", run `pip install pyarrow` and you're done. The `fastparquet` alternative exists, but pyarrow is more widely supported and is what the major analytics tools (DuckDB, BigQuery, Athena) speak natively.
5. **Cache invalidation is a one-line problem.** When you change the universe (add a ticker) or extend the date range, the parquet on disk is stale. The fix: delete `data/prices.parquet` before running. For scripts that re-run often, add a `--refresh` flag that calls `CACHE.unlink(missing_ok=True)` before the load. The pattern is robust to forgotten refreshes — at worst you get yesterday's data and notice quickly.

#Pandas #Python #Finance #Parquet #PyArrow #DataFrame #DataAnalytics #PythonForFinance #Yfinance #LongFormat