Loading & Caching Real Data: Parquet, load_or_download Pattern | Pandas for Finance Ep3

0views
C
CelesteAI
Description
Episode 3 of Pandas for Finance. Stop redownloading the same five years of price data every time you tweak your script. Source code: https://github.com/GoCelesteAI/pandas-for-finance If you finished Episode 2, you know the DataFrame shape. Now we cache it. df.to_parquet("prices.parquet") writes; pd.read_parquet("prices.parquet") reads. Parquet is ten times smaller than CSV, a hundred times faster to load, and preserves your dtypes — including the DatetimeIndex — across the round trip. Combined with the load_or_download pattern, you hit Yahoo Finance exactly once and read from disk for the rest of the series. What You'll Build: - A working load.py that downloads three tickers (Apple, Microsoft, Google) over five years, stacks the result to long format (one row per Date and Ticker), and caches to data/prices.parquet. - The load_or_download function: cache-first, fall back to yf.download only when the parquet is missing. - Live demonstration of the speed win — first run shows "Downloading from Yahoo..." and waits a few seconds; second run shows "Loading from data/prices.parquet" and prints in milliseconds. - A ls -lh data/ check showing the parquet weighs under 200 KB for 3,774 rows of bars. Timestamps: 0:00 - Intro — Episode 3 starts here 0:17 - Preview — parquet, the load_or_download pattern 0:54 - Open nvim, write load.py 1:14 - Cache path and tickers 1:24 - load_or_download function (cache-first) 1:46 - Save and run, first hit to Yahoo 2:00 - Downloading message + network 2:10 - Long format output, 3,774 rows 2:26 - ls -lh: the parquet is tiny 2:36 - Second run, loads from cache 2:48 - The pattern in one sentence 3:08 - Recap — to_parquet, the pattern, long format 3:43 - End screen Key Takeaways: 1. Parquet is the analytics format, full stop. Columnar storage, compression, and dtype preservation in one file. A 5-year, 14-ticker, daily-bar dataset that takes 3.2 MB as CSV fits in 280 KB as parquet — and reloads about a hundred times faster. Two functions to remember: df.to_parquet(path) and pd.read_parquet(path). Everything else (CSV, Excel, JSON) is for sharing or interop, not analysis. 2. The load_or_download pattern. Every script that loads market data should follow it: check whether the cache file exists, return early if it does, otherwise download and write the cache before returning. First run is slow; every subsequent run is instant. Force a refresh by deleting the parquet file. This single function template stays in your toolbox for every analyst pipeline. 3. Long format beats wide for analysis. Multi-ticker yf.download(...) returns a MultiIndex column shape — (Open, AAPL), (Open, MSFT), etc. Stack it: df.stack(level=1, future_stack=True).reset_index(). Now you have one row per (Date, Ticker) and operations like df[df["Ticker"] == "AAPL"] or df.groupby("Ticker") just work. The series uses long format from here onward. 4. pyarrow is the parquet engine. Pandas defers parquet I/O to a separate library; pyarrow is the standard. If you see "no engine for parquet," pip install pyarrow and you're done. The fastparquet alternative exists but pyarrow is more widely supported and what every cloud data tool (DuckDB, BigQuery, Athena) speaks natively. 5. Cache invalidation is a one-line problem. When you change the universe (add a ticker) or extend the date range, the parquet on disk is stale. The fix: delete data/prices.parquet before running. For scripts that re-run often, add a --refresh flag that calls CACHE.unlink(missing_ok=True) before the function. The pattern is robust to forgotten refreshes — at worst you get yesterday's data and notice quickly. #Pandas #Python #Finance #Parquet #PyArrow #DataFrame #DataAnalytics #PythonForFinance #Yfinance #LongFormat --- Generated by GoCelesteAI · part of the Pandas for Finance series