Loading & Caching Real Data: Parquet, load_or_download Pattern | Pandas for Finance Ep3
CelesteAI
Episode 3 of *Pandas for Finance*. Stop redownloading the same five years of price data every time you tweak your script.
If you finished Episode 2, you know the DataFrame shape. Now we cache it. `df.to_parquet("prices.parquet")` writes; `pd.read_parquet("prices.parquet")` reads. Parquet is ten times smaller than CSV, a hundred times faster to load, and preserves your dtypes — including the DatetimeIndex — across the round trip. Combined with the `load_or_download` pattern, you hit Yahoo Finance exactly once and read from disk for the rest of the series.
What You'll Build:
- A working `load.py` that downloads three tickers (Apple, Microsoft, Google) over five years, stacks the result to long format (one row per Date and Ticker), and caches to `data/prices.parquet`.
- The `load_or_download` function: cache-first, fall back to `yf.download` only when the parquet is missing.
- Live demonstration of the speed win — first run shows "Downloading from Yahoo..." and waits a few seconds; second run shows "Loading from data/prices.parquet" and prints in milliseconds.
- A `ls -lh data/` check showing the parquet weighs under 200 KB for 3,774 rows of bars.
Timestamps:
0:00 - Intro — Episode 3 starts here
0:17 - Preview — parquet, the load_or_download pattern
0:54 - Open nvim, write load.py
1:14 - Cache path and tickers
1:24 - load_or_download function (cache-first)
1:46 - Save and run, first hit to Yahoo
2:00 - Downloading message + network
2:10 - Long format output, 3,774 rows
2:26 - ls -lh: the parquet is tiny
2:36 - Second run, loads from cache
2:48 - The pattern in one sentence
3:08 - Recap — to_parquet, the pattern, long format
3:43 - End screen
Key Takeaways:
1. **Parquet is the analytics format**, full stop. Columnar storage, compression, and dtype preservation in one file. A 5-year, 14-ticker, daily-bar dataset that takes 3.2 MB as CSV fits in 280 KB as parquet — and reloads about a hundred times faster. Two functions to remember: `df.to_parquet(path)` and `pd.read_parquet(path)`. Everything else (CSV, Excel, JSON) is for sharing or interop, not analysis.
2. **The `load_or_download` pattern.** Every script that loads market data should follow it: check whether the cache file exists, return early if it does, otherwise download and write the cache before returning. First run is slow; every subsequent run is instant. Force a refresh by deleting the parquet file. This single function template stays in your toolbox for every analyst pipeline.
3. **Long format beats wide for analysis.** Multi-ticker `yf.download(...)` returns a MultiIndex column shape — `(Open, AAPL)`, `(Open, MSFT)`, etc. Stack it: `df.stack(level=1, future_stack=True).reset_index()`. Now you have one row per (Date, Ticker) and operations like `df[df["Ticker"] == "AAPL"]` or `df.groupby("Ticker")` just work. The series uses long format from here onward.
4. **`pyarrow` is the parquet engine.** Pandas defers parquet I/O to a separate library; `pyarrow` is the standard. If you see "no engine for parquet," `pip install pyarrow` and you're done. The `fastparquet` alternative exists, but pyarrow is more widely supported, and Parquet itself is the format every cloud data tool (DuckDB, BigQuery, Athena) speaks natively.
5. **Cache invalidation is a one-line problem.** When you change the universe (add a ticker) or extend the date range, the parquet on disk is stale. The fix: delete `data/prices.parquet` before running. For scripts that re-run often, add a `--refresh` flag that calls `CACHE.unlink(missing_ok=True)` before calling the loader. The pattern is robust to forgotten refreshes — at worst you get yesterday's data and notice quickly.
#Pandas #Python #Finance #Parquet #PyArrow #DataFrame #DataAnalytics #PythonForFinance #Yfinance #LongFormat