Pandas for Finance: Loading & Caching Real Data: Parquet, load_or_download Pattern

Celest Kim

Video: Loading & Caching Real Data: Parquet, load_or_download Pattern | Pandas for Finance Ep3 by CelesteAI


df.to_parquet("prices.parquet") writes; pd.read_parquet("prices.parquet") reads. Roughly 10x smaller than CSV, 10x+ faster to load, and it preserves dtypes. The load_or_download pattern: hit yfinance once, cache to parquet, reload from cache forever.

Re-downloading prices every time you tweak a script gets old fast. And Yahoo throttles you when you hammer it. Cache once, reload from disk.

CSV vs Parquet

A 5-year, 14-ticker, daily-bar dataset:

Format    Size      Load time
CSV       ~3.2 MB   ~80 ms
Parquet   ~280 KB   ~6 ms

Parquet is columnar + compressed + dtype-aware. CSV is a text dump. For analytics, parquet wins on every metric except “human-readable in a text editor.”

Save and load

import pandas as pd
import yfinance as yf

df = yf.download(["AAPL", "MSFT", "GOOGL"], start="2020-01-01", end="2025-01-01")
df.to_parquet("prices.parquet")

reloaded = pd.read_parquet("prices.parquet")
print(reloaded.shape)

Two functions to remember. Parquet preserves the index, columns (including MultiIndex), and dtypes — no re-parsing needed.

For CSVs (when you must):

df.to_csv("prices.csv")
df = pd.read_csv("prices.csv", index_col=0, parse_dates=True)

CSV needs parse_dates=True to get back a DatetimeIndex and index_col=0 to use the first column as the index. Parquet just works.
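A quick sketch of what goes wrong without those arguments, on a tiny synthetic frame with temporary paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two days of closes with a proper DatetimeIndex.
df = pd.DataFrame(
    {"Close": [75.09, 74.36]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03"]),
)
df.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "prices.csv")
    df.to_csv(path)
    naive = pd.read_csv(path)                                 # dates come back as plain strings
    fixed = pd.read_csv(path, index_col=0, parse_dates=True)  # DatetimeIndex restored

print(type(naive["Date"].iloc[0]))  # str
print(type(fixed.index))            # DatetimeIndex
```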

The load_or_download pattern

from pathlib import Path

import pandas as pd
import yfinance as yf

CACHE = Path("data/prices.parquet")
TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA"]


def load_or_download() -> pd.DataFrame:
  if CACHE.exists():
    return pd.read_parquet(CACHE)

  CACHE.parent.mkdir(parents=True, exist_ok=True)
  df = yf.download(TICKERS, start="2020-01-01", end="2025-01-01", progress=False)
  df.to_parquet(CACHE)
  return df


df = load_or_download()
print(df.shape)

First run: downloads (slow, network). Subsequent runs: reads from disk (instant). The pattern any data analysis script should use.

To force a refresh, delete data/prices.parquet.

Long format

Multi-ticker downloads come as MultiIndex columns:

                 Open                            High
                 AAPL    MSFT   GOOGL            AAPL    MSFT   GOOGL
Date
2024-01-02     187.15  370.87  138.17          188.44  375.90  139.12

Wide. Convert to long format — one row per (Date, Ticker):

long = (
  df.stack(level=1, future_stack=True)
    .reset_index()
    .rename(columns={"level_1": "Ticker"})
    .sort_values(["Ticker", "Date"])
    .reset_index(drop=True)
)
print(long.head())
        Date Ticker     Open     High      Low    Close   Adj Close      Volume
0 2020-01-02  AAPL    74.06    75.15    73.80    75.09       72.88  135480400
1 2020-01-03  AAPL    74.29    75.14    74.13    74.36       72.17  146322800
2 2020-01-06  AAPL    73.45    74.99    73.19    74.95       72.74  118387200
...

Now every operation is straightforward. Filter by ticker: long[long["Ticker"] == "AAPL"]. Group by ticker: long.groupby("Ticker")....
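For instance, per-ticker returns become a one-line groupby. A small sketch on a synthetic long frame:

```python
import pandas as pd

# Synthetic long-format frame: one row per (Date, Ticker).
long = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-03"] * 2),
        "Ticker": ["AAPL", "AAPL", "MSFT", "MSFT"],
        "Close": [187.0, 184.2, 370.9, 367.8],
    }
)

# Filter one ticker.
aapl = long[long["Ticker"] == "AAPL"]

# Daily returns computed within each ticker, never across tickers.
long["Return"] = long.groupby("Ticker")["Close"].pct_change()
```

The groupby guarantees the first row of each ticker gets NaN instead of a bogus return computed from the previous ticker's last close.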

We use long format for the rest of the series.

The series cache

The series ships with a regenerator:

python scripts/regenerate-cache.py

Outputs data/prices.parquet (long format, 14 tickers, ~7 years) and data/sector_map.csv (used in Ep 8). Every episode loads from the parquet — no network during recording or playback.

Partitioned parquet

For really big datasets, partition the parquet by a column:

long.to_parquet("data/prices/", partition_cols=["Ticker"])

This creates one directory per value: data/prices/Ticker=AAPL/, Ticker=MSFT/, etc., each holding that ticker's rows as its own parquet file. Tools like DuckDB and pandas can read just the partitions they need:

aapl = pd.read_parquet("data/prices/", filters=[("Ticker", "==", "AAPL")])

For our universe (~1.3M rows), unpartitioned is fine. Partition when you have millions.

DuckDB sidekick

DuckDB reads parquet natively with SQL:

import duckdb

con = duckdb.connect()
result = con.execute("""
  SELECT Ticker, MAX(Close) as max_close, MIN(Close) as min_close
  FROM 'data/prices.parquet'
  GROUP BY Ticker
""").df()
print(result)

.df() returns a pandas DataFrame. SQL on parquet, pandas on the result. Best of both.

We’ll mix DuckDB and pandas throughout. SQL is great for filtering and aggregating; pandas is great for time-series operations and final formatting.

Other formats

# Excel — for sharing with non-Python folks
df.to_excel("prices.xlsx", sheet_name="Prices")

# JSON — for APIs
df.to_json("prices.json", orient="records", date_format="iso")

# Pickle — Python-only, fast, exact preservation
df.to_pickle("prices.pkl")

For finance work, the rule of thumb:

  • Parquet for analysis (fast, compact, typed).
  • Excel for sharing (your finance audience reads it).
  • CSV for handoff to non-pandas tools (rare these days).
  • Pickle never as a public format — Python-version-locked.

Sanity-check the cache

import pandas as pd

df = pd.read_parquet("data/prices.parquet")
print("Tickers:", df["Ticker"].unique())
print("Rows:", len(df))
print("Date range:", df["Date"].min(), "→", df["Date"].max())
print("Missing values:")
print(df.isna().sum())

Four checks every cache should pass: ticker list, row count, date range, and missing values. The first three catch a wrong or truncated download; the fourth catches data quality issues.
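Those checks can be wrapped in an assert-style helper so a script fails fast on a bad cache. Hypothetical names, just a sketch:

```python
import pandas as pd


def check_cache(df: pd.DataFrame, expected_tickers: set) -> None:
    # Hypothetical helper: raise early if the cached frame looks wrong.
    assert len(df) > 0, "cache is empty"
    assert expected_tickers <= set(df["Ticker"]), "tickers missing from cache"
    assert df["Close"].notna().all(), "missing Close values"


# Synthetic frame standing in for data/prices.parquet.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-02"]),
        "Ticker": ["AAPL", "MSFT"],
        "Close": [187.0, 370.9],
    }
)
check_cache(df, {"AAPL", "MSFT"})
```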

Try it

load.py:

from pathlib import Path

import pandas as pd
import yfinance as yf

CACHE = Path("data/prices.parquet")
TICKERS = ["AAPL", "MSFT", "GOOGL"]


def load_or_download() -> pd.DataFrame:
  if CACHE.exists():
    print(f"Loading from {CACHE}")
    return pd.read_parquet(CACHE)

  print("Downloading from Yahoo...")
  CACHE.parent.mkdir(parents=True, exist_ok=True)
  df = yf.download(TICKERS, start="2020-01-01", end="2025-01-01", progress=False)
  long = (
    df.stack(level=1, future_stack=True)
      .reset_index()
      .rename(columns={"level_1": "Ticker"})
      .sort_values(["Ticker", "Date"])
      .reset_index(drop=True)
  )
  long.to_parquet(CACHE)
  print(f"Cached to {CACHE}")
  return long


df = load_or_download()
print(df.head())
print(df.shape)

First run downloads; second run reads from disk.

Common stumbles

pyarrow missing. Parquet support comes from pyarrow. pip install pyarrow if you see “no engine for parquet.”

Different parquet engines. pandas defaults to pyarrow when it's available and falls back to fastparquet. Stick with pyarrow; it's the standard.

MultiIndex columns in parquet. Round-trips fine — pyarrow encodes the levels. No special handling needed.

stack deprecation warning. Pass future_stack=True (Pandas 2.1+) to use the new behavior; older syntax stack(level=1) still works but warns.

Stale cache. It's easy to forget to delete the cache after extending the date range or changing the ticker list. Add a --refresh flag to your script if you change the universe often.

to_excel ridiculously slow. to_excel takes ages on 100k+ rows. For sharing, save parquet/CSV and convert to Excel only for the final report.

What’s next

Ep 4: filtering. Boolean indexing, .query(), finding “the days AAPL closed up >3%.”

Recap

df.to_parquet(path) / pd.read_parquet(path) — fast, compact, typed. Use the load_or_download pattern: cache to disk, refresh by deleting the cache. Convert multi-ticker downloads to long format (.stack().reset_index()) for easier filtering/grouping later. DuckDB reads parquet with SQL natively — pair with pandas for the best of both. Avoid CSV/Excel for analysis files; reserve for sharing.

Next episode: filtering rows.
