Pandas for Finance: Loading & Caching Real Data: Parquet, load_or_download Pattern
Video: Loading & Caching Real Data: Parquet, load_or_download Pattern | Pandas for Finance Ep3 by CelesteAI
df.to_parquet("prices.parquet") writes; pd.read_parquet("prices.parquet") reads. Roughly 10x smaller than CSV, 10x+ faster to load, and dtypes are preserved. The load_or_download pattern: hit yfinance once, cache to parquet, reload from cache forever.
Re-downloading prices every time you tweak a script gets old fast. And Yahoo throttles you when you hammer it. Cache once, reload from disk.
CSV vs Parquet
A 5-year, 14-ticker, daily-bar dataset:
| Format | Size | Load time |
|---|---|---|
| CSV | ~3.2 MB | ~80 ms |
| Parquet | ~280 KB | ~6 ms |
Parquet is columnar + compressed + dtype-aware. CSV is a text dump. For analytics, parquet wins on every metric except “human-readable in a text editor.”
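Numbers like these vary with machine and data, so here's a quick sketch for benchmarking your own frame (assumes a DataFrame df is already in memory; file names are placeholders):

import time
from pathlib import Path
import pandas as pd

def benchmark(df: pd.DataFrame) -> None:
    # Write both formats once, then time a reload of each.
    df.to_csv("bench.csv")
    df.to_parquet("bench.parquet")
    for path, reader in [("bench.csv", pd.read_csv), ("bench.parquet", pd.read_parquet)]:
        start = time.perf_counter()
        reader(path)
        elapsed_ms = (time.perf_counter() - start) * 1000
        size_kb = Path(path).stat().st_size / 1024
        print(f"{path}: {size_kb:,.0f} KB, {elapsed_ms:.1f} ms")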
Save and load
import pandas as pd
import yfinance as yf
df = yf.download(["AAPL", "MSFT", "GOOGL"], start="2020-01-01", end="2025-01-01")
df.to_parquet("prices.parquet")
reloaded = pd.read_parquet("prices.parquet")
print(reloaded.shape)
Two functions to remember. Parquet preserves the index, columns (including MultiIndex), and dtypes — no re-parsing needed.
For CSVs (when you must):
df.to_csv("prices.csv")
df = pd.read_csv("prices.csv", index_col=0, parse_dates=True)
CSV needs parse_dates=True to get back a DatetimeIndex and index_col=0 to use the first column as the index. Parquet just works.
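To see what "preserves dtypes" buys you, a minimal round-trip check on a toy frame (names and values illustrative):

import pandas as pd

toy = pd.DataFrame(
    {"Close": [75.09, 74.36]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)
toy.index.name = "Date"
toy.to_parquet("toy.parquet")
toy.to_csv("toy.csv")

print(pd.read_parquet("toy.parquet").index.dtype)       # datetime64[ns], preserved
print(pd.read_csv("toy.csv", index_col=0).index.dtype)  # object: dates came back as strings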
The load_or_download pattern
from pathlib import Path
import pandas as pd
import yfinance as yf
CACHE = Path("data/prices.parquet")
TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA"]
def load_or_download() -> pd.DataFrame:
    if CACHE.exists():
        return pd.read_parquet(CACHE)
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    df = yf.download(TICKERS, start="2020-01-01", end="2025-01-01", progress=False)
    df.to_parquet(CACHE)
    return df
df = load_or_download()
print(df.shape)
First run: downloads (slow, network). Subsequent runs: reads from disk (instant). The pattern any data analysis script should use.
To force a refresh, delete data/prices.parquet.
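Or wire in a flag so you don't have to remember the path; a minimal sketch extending the function above (the --refresh name is my choice, not a convention):

import sys

def load_or_download(refresh: bool = False) -> pd.DataFrame:
    if refresh and CACHE.exists():
        CACHE.unlink()  # drop the stale cache so the download branch runs
    if CACHE.exists():
        return pd.read_parquet(CACHE)
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    df = yf.download(TICKERS, start="2020-01-01", end="2025-01-01", progress=False)
    df.to_parquet(CACHE)
    return df

df = load_or_download(refresh="--refresh" in sys.argv)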
Long format
Multi-ticker downloads come as MultiIndex columns:
              Open                    High
              AAPL    MSFT   GOOGL    AAPL    MSFT   GOOGL
Date
2024-01-02  187.15  370.87  138.17  188.44  375.90  139.12
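Pulling anything out of this shape takes tuple keys, which works but gets clumsy fast (a quick illustration, assuming the download above):

aapl_close = df[("Close", "AAPL")]  # one ticker's close series
closes = df["Close"]                # sub-frame: one close column per ticker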
Wide. Convert to long format — one row per (Date, Ticker):
long = (
    df.stack(level=1, future_stack=True)
      .reset_index()
      .rename(columns={"level_1": "Ticker"})
      .sort_values(["Ticker", "Date"])
      .reset_index(drop=True)
)
print(long.head())
        Date Ticker   Open   High    Low  Close  Adj Close     Volume
0 2020-01-02   AAPL  74.06  75.15  73.80  75.09      72.88  135480400
1 2020-01-03   AAPL  74.29  75.14  74.13  74.36      72.17  146322800
2 2020-01-06   AAPL  73.45  74.99  73.19  74.95      72.74  118387200
...
Now every operation is straightforward. Filter by ticker: long[long["Ticker"] == "AAPL"]. Group by ticker: long.groupby("Ticker")....
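And per-group math stays honest; for example, daily returns computed within each ticker (a sketch against the long frame above):

# pct_change within each ticker so returns don't bleed across group boundaries
long["Return"] = long.groupby("Ticker")["Close"].pct_change()
print(long.groupby("Ticker")["Return"].std())  # per-ticker daily volatility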
We use long format for the rest of the series.
The series cache
The series ships with a regenerator:
python scripts/regenerate-cache.py
Outputs data/prices.parquet (long format, 14 tickers, ~7 years) and data/sector_map.csv (used in Ep 8). Every episode loads from the parquet — no network during recording or playback.
Partitioned parquet
For really big datasets, partition the parquet by a column:
long.to_parquet("data/prices/", partition_cols=["Ticker"])
This creates one directory per value: data/prices/Ticker=AAPL/, Ticker=MSFT/, and so on, each holding that ticker's rows. Tools like DuckDB and pandas can read just the partition they need:
aapl = pd.read_parquet("data/prices/", filters=[("Ticker", "==", "AAPL")])
For our universe (~25k rows of daily bars), unpartitioned is fine. Partition when you have millions.
DuckDB sidekick
DuckDB reads parquet natively with SQL:
import duckdb
con = duckdb.connect()
result = con.execute("""
SELECT Ticker, MAX(Close) as max_close, MIN(Close) as min_close
FROM 'data/prices.parquet'
GROUP BY Ticker
""").df()
print(result)
.df() returns a pandas DataFrame. SQL on parquet, pandas on the result. Best of both.
We’ll mix DuckDB and pandas throughout. SQL is great for filtering and aggregating; pandas is great for time-series operations and final formatting.
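A taste of that division of labor (a sketch; assumes the long-format cache above exists):

import duckdb

# SQL narrows the data; pandas handles the time-series step.
aapl = duckdb.sql("""
    SELECT Date, Close
    FROM 'data/prices.parquet'
    WHERE Ticker = 'AAPL'
    ORDER BY Date
""").df()
aapl["Return"] = aapl["Close"].pct_change()
print(aapl.tail())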
Other formats
# Excel — for sharing with non-Python folks
df.to_excel("prices.xlsx", sheet_name="Prices")
# JSON — for APIs
df.to_json("prices.json", orient="records", date_format="iso")
# Pickle — Python-only, fast, exact preservation
df.to_pickle("prices.pkl")
For finance work, the rule of thumb:
- Parquet for analysis (fast, compact, typed).
- Excel for sharing (your finance audience reads it).
- CSV for handoff to non-pandas tools (rare these days).
- Pickle never as a public format: Python-version-locked and unsafe to load from untrusted sources.
Sanity-check the cache
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
print("Tickers:", df["Ticker"].unique())
print("Rows:", len(df))
print("Date range:", df["Date"].min(), "→", df["Date"].max())
print("Missing values:")
print(df.isna().sum())
Three checks every cache should pass: ticker list, row count, date range. The fourth catches data quality issues (missing values).
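To make the checks fail loudly instead of relying on eyeballs, a quick sketch (the expected values are illustrative; set your own):

EXPECTED_TICKERS = {"AAPL", "MSFT", "GOOGL"}  # your universe here

assert set(df["Ticker"].unique()) == EXPECTED_TICKERS, "ticker universe changed"
assert len(df) > 3_000, "suspiciously few rows"  # threshold is a guess
assert df["Date"].max().year >= 2024, "cache looks stale"
assert df[["Close", "Volume"]].notna().all().all(), "missing values in core columns"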
Try it
load.py:
from pathlib import Path
import pandas as pd
import yfinance as yf
CACHE = Path("data/prices.parquet")
TICKERS = ["AAPL", "MSFT", "GOOGL"]
def load_or_download() -> pd.DataFrame:
    if CACHE.exists():
        print(f"Loading from {CACHE}")
        return pd.read_parquet(CACHE)
    print("Downloading from Yahoo...")
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    df = yf.download(TICKERS, start="2020-01-01", end="2025-01-01", progress=False)
    long = (
        df.stack(level=1, future_stack=True)
          .reset_index()
          .rename(columns={"level_1": "Ticker"})
          .sort_values(["Ticker", "Date"])
          .reset_index(drop=True)
    )
    long.to_parquet(CACHE)
    print(f"Cached to {CACHE}")
    return long
df = load_or_download()
print(df.head())
print(df.shape)
First run downloads; second run reads from disk.
Common stumbles
pyarrow missing. Parquet support comes from pyarrow. pip install pyarrow if pandas complains it can't find a usable parquet engine.
Different parquet engines. pandas defaults to pyarrow if available, falls back to fastparquet. Stick with pyarrow — it’s the standard.
MultiIndex columns in parquet. Round-trips fine — pyarrow encodes the levels. No special handling needed.
stack deprecation warning. Pass future_stack=True (pandas 2.1+) to opt into the new behavior; the older stack(level=1) still works but warns.
Cache stale. Forgetting to delete the cache after extending the date range. Add a --refresh flag (as sketched earlier) if you change the universe often.
to_excel ridiculously slow. Writing 100k+ rows to Excel takes ages. For sharing, save parquet/CSV and convert to Excel only for the final report.
What’s next
Ep 4: filtering. Boolean indexing, .query(), finding “the days AAPL closed up >3%.”
Recap
df.to_parquet(path) / pd.read_parquet(path) — fast, compact, typed. Use the load_or_download pattern: cache to disk, refresh by deleting the cache. Convert multi-ticker downloads to long format (.stack().reset_index()) for easier filtering/grouping later. DuckDB reads parquet with SQL natively — pair with pandas for the best of both. Avoid CSV/Excel for analysis files; reserve for sharing.
Next episode: filtering rows.