Pandas for Finance: Pandas + DuckDB at Scale: SQL on Parquet, 100× Faster
Video: Pandas + DuckDB at Scale: SQL on Parquet, 100× Faster | Pandas for Finance Ep15 by CelesteAI
When pandas slows on huge data, DuckDB shreds.
duckdb.execute("SELECT ... FROM read_parquet('big.parquet')") runs SQL directly on parquet without loading it into memory. Return small results to pandas with .df() for plotting and final touches. The series finale.
Pandas is great. It’s also memory-bound. Once your data hits “doesn’t fit in RAM” (or even “fits but is annoyingly slow”), DuckDB takes over.
When pandas slows
import pandas as pd
# Our demo file is small (14 tickers × ~7 years); imagine S&P 500 × 10 years ≈ 1.3M rows
df = pd.read_parquet("data/prices.parquet")
print(df.shape, df.memory_usage(deep=True).sum() / 1024**2, "MB")
# (24375, 8) ~6 MB
Our parquet is small. But scale up:
| Universe | Rows | Pandas memory | DuckDB |
|---|---|---|---|
| 14 stocks × 7y | 24k | 6 MB | irrelevant |
| 500 stocks × 10y | 1.3M | 300 MB | reads in seconds |
| 5000 stocks × 30y | 38M | 8 GB | streams from disk |
Once you cross 1M rows, DuckDB’s SQL-on-parquet starts winning on speed and memory.
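A quick way to feel the crossover is to time the same aggregation both ways. A minimal sketch (absolute numbers depend on your hardware and file size):
import time
import duckdb
import pandas as pd
# Pandas: load the whole file into RAM, then aggregate
t0 = time.perf_counter()
pdf = pd.read_parquet("data/prices.parquet")
pandas_totals = pdf.groupby("Ticker")["Volume"].sum()
t1 = time.perf_counter()
# DuckDB: aggregate directly on the file; only the small result lands in memory
duck_totals = duckdb.sql(
    "SELECT Ticker, SUM(Volume) AS total FROM 'data/prices.parquet' GROUP BY Ticker"
).df()
t2 = time.perf_counter()
print(f"pandas: {t1 - t0:.3f}s  duckdb: {t2 - t1:.3f}s")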
Install + connect
import duckdb
con = duckdb.connect()
Empty connection — in-memory database. For persistent storage:
con = duckdb.connect("market.duckdb")
A single file. Survives restarts. Like SQLite for analytics.
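A tiny sketch of the persistence (the table name here is made up):
import duckdb
con = duckdb.connect("market.duckdb")
con.execute("CREATE OR REPLACE TABLE tickers AS SELECT 'AAPL' AS symbol")
con.close()
# Reopen later: the table is still there
con = duckdb.connect("market.duckdb")
print(con.execute("SELECT * FROM tickers").df())
con.close()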
SQL on parquet
result = con.execute("""
SELECT Ticker, COUNT(*) as days, AVG("Adj Close") as avg_close
FROM 'data/prices.parquet'
GROUP BY Ticker
ORDER BY avg_close DESC
""").df()
print(result)
DuckDB reads the parquet directly — no pd.read_parquet first. Filters and aggregations push down to the parquet engine; only the result materializes.
.df() returns a pandas DataFrame. Now you can plot, format, or pipe further.
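For instance, the aggregate above goes straight into a chart. A sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt
# `result` is the small DataFrame returned by .df() above
result.set_index("Ticker")["avg_close"].plot.barh(title="Average adjusted close")
plt.tight_layout()
plt.show()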
Pandas → DuckDB → Pandas
# Have a pandas DataFrame? Register it.
df = pd.read_parquet("data/prices.parquet")
con.register("prices", df)
result = con.execute("""
SELECT Ticker, MAX(High) as ath
FROM prices
GROUP BY Ticker
""").df()
con.unregister("prices")
con.register("name", df) makes a pandas DataFrame queryable as name in SQL. No copy — DuckDB reads pandas’s memory directly.
For ad-hoc analysis where you want SQL-style brevity on pandas data, register is gold.
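For truly throwaway queries, DuckDB's replacement scans go one step further: a DataFrame can be referenced by its Python variable name, no register() needed. A sketch:
import duckdb
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
# DuckDB resolves `df` from the surrounding Python scope
print(duckdb.sql("SELECT Ticker, MAX(High) AS ath FROM df GROUP BY Ticker").df())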
DuckDB’s killer feature: out-of-core
# Even if the parquet is 50GB, this works:
con.execute("""
CREATE OR REPLACE TABLE summary AS
SELECT Ticker, AVG("Adj Close") as avg_close, MAX(Volume) as max_vol
FROM 'data/big_prices.parquet'
GROUP BY Ticker
""")
summary = con.execute("SELECT * FROM summary").df()
DuckDB streams from the parquet, computes incrementally, writes the result. Only the summary is in memory.
For huge datasets, this is the only sane path. Pandas can’t do it without manual chunking.
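For contrast, here is roughly what manual chunking looks like, streaming record batches with pyarrow. A sketch, assuming the same Ticker/Volume columns; note that only aggregates like SUM combine this easily across chunks:
import pyarrow.dataset as ds
running = {}
for batch in ds.dataset("data/big_prices.parquet").to_batches(columns=["Ticker", "Volume"]):
    sums = batch.to_pandas().groupby("Ticker")["Volume"].sum()
    for ticker, vol in sums.items():
        running[ticker] = running.get(ticker, 0) + vol
print(running)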
Aggregating then plotting
import duckdb
con = duckdb.connect()
monthly_returns = con.execute("""
WITH monthly AS (
    SELECT
        Ticker,
        DATE_TRUNC('month', Date) AS month,
        "Adj Close" AS month_close
    FROM 'data/prices.parquet'
    -- keep only the last trading day of each (Ticker, month);
    -- QUALIFY alone does the job, no window-function LAST needed
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY Ticker, DATE_TRUNC('month', Date)
        ORDER BY Date DESC
    ) = 1
)
SELECT
Ticker, month,
month_close / LAG(month_close) OVER (PARTITION BY Ticker ORDER BY month) - 1 AS monthly_return
FROM monthly
ORDER BY Ticker, month
""").df()
print(monthly_returns.head())
Window functions (LAG, LEAD, OVER) are SQL’s superpower. DuckDB has them all. For “compute monthly return from daily data,” SQL is arguably cleaner than the pandas equivalent.
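For comparison, the pandas version of the same computation. A sketch, assuming Date is a datetime column:
import pandas as pd
df = pd.read_parquet("data/prices.parquet").sort_values("Date")
df["month"] = df["Date"].dt.to_period("M")
# Last close of each (Ticker, month), then month-over-month change
month_close = df.groupby(["Ticker", "month"])["Adj Close"].last()
monthly_return = month_close.groupby(level="Ticker").pct_change()
print(monthly_return.dropna().head())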
When SQL beats pandas
- Joining huge tables. SQL engines are optimized for this (sketch below).
- Window functions. LAG, LEAD, ranking — sometimes more readable than groupby + shift.
- Out-of-core data. Pandas can’t, DuckDB can.
- Sharing the query. SQL is a lingua franca; pandas is Python-only.
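The join case, sketched. The second file, data/fundamentals.parquet with one row per ticker and a sector column, is hypothetical:
import duckdb
joined = duckdb.sql("""
    SELECT p.Ticker, p.Date, p."Adj Close", f.sector
    FROM 'data/prices.parquet' p
    JOIN 'data/fundamentals.parquet' f USING (Ticker)
""").df()
print(joined.head())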
When pandas beats SQL
- Time-series operations. .rolling(), .expanding(), .resample() are cleaner than SQL window functions for these (see the sketch after this list).
- Plotting. matplotlib/plotly take DataFrames natively.
- Custom Python logic. Anything beyond standard aggregations.
- Iterative exploration. Tab-complete on . is faster than typing SQL.
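The time-series point in code: a per-ticker 20-day rolling volatility is one pandas chain. A sketch (the window length is an arbitrary choice):
import pandas as pd
df = pd.read_parquet("data/prices.parquet").sort_values("Date")
df["vol20"] = (
    df.groupby("Ticker")["Adj Close"]
      .transform(lambda s: s.pct_change().rolling(20).std())
)
print(df[["Ticker", "Date", "vol20"]].dropna().head())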
The hybrid: SQL for the heavy lift, pandas for the last mile.
A hybrid pipeline
import duckdb
import pandas as pd
con = duckdb.connect()
# 1. Heavy aggregation in SQL
monthly_volumes = con.execute("""
SELECT
Ticker,
DATE_TRUNC('month', Date) AS month,
SUM(Volume) AS monthly_volume
FROM 'data/prices.parquet'
GROUP BY Ticker, DATE_TRUNC('month', Date)
ORDER BY Ticker, month
""").df()
# 2. Pandas for time-series transformations
monthly_volumes["MonthlyVolume_zScore"] = (
monthly_volumes
.groupby("Ticker")["monthly_volume"]
.transform(lambda x: (x - x.mean()) / x.std())
)
# 3. Filter outliers and write
high_volume_months = monthly_volumes.query("MonthlyVolume_zScore > 2")
high_volume_months.to_parquet("/tmp/high_volume_months.parquet")
print(high_volume_months.head())
DuckDB groups; pandas computes a per-ticker z-score; pandas filters and writes. Each tool plays to its strength.
Series wrap-up
Across 15 episodes:
- Eps 1–3 — setup, the DataFrame, caching with parquet.
- Eps 4–5 — filter rows, pick columns.
- Eps 6 — daily/log/cumulative returns.
- Eps 7–8 — groupby, joins.
- Eps 9–11 — dates, resampling, rolling windows.
- Eps 12–13 — cleaning, exporting.
- Ep 14 — capstone backtest.
- Ep 15 — pandas + DuckDB at scale.
Enough to write the Monday-morning analyst pipeline: download, clean, aggregate, report. And the bridge to bigger data when pandas alone runs out.
Where to go next
- Polars — pandas-like API, Rust-fast, lazy evaluation. Good middle ground between pandas and DuckDB.
- PyArrow Tables — even faster than pandas for huge data; fewer ergonomics.
- Plotly / Altair — better-looking charts than matplotlib.
- Streamlit / Gradio — turn your script into an interactive dashboard in 50 lines.
- Quantitative finance — quantstats, pyfolio, vectorbt for backtest infrastructure.
- LLM-driven analytics — combine pandas with Claude or GPT for natural-language data exploration.
A complete demo
bigquery.py:
import duckdb
import pandas as pd
con = duckdb.connect()
# SQL: monthly volume per ticker
result = con.execute("""
SELECT
Ticker,
DATE_TRUNC('month', Date) AS Month,
AVG("Adj Close") AS AvgClose,
SUM(Volume) AS TotalVolume
FROM 'data/prices.parquet'
GROUP BY Ticker, Month
ORDER BY Ticker, Month
""").df()
print(f"Aggregated: {len(result):,} rows")
print(result.head())
# Pandas: rank tickers by total volume
totals = (
result.groupby("Ticker")["TotalVolume"].sum()
.sort_values(ascending=False)
)
print("\n=== Total volume per ticker ===")
print(totals.map(lambda v: f"{v/1e9:,.1f}B"))
# Out-of-core friendliness: explain
print("\n=== EXPLAIN ===")
con.execute("EXPLAIN SELECT Ticker, COUNT(*) FROM 'data/prices.parquet' GROUP BY Ticker").df()
Common stumbles
Forgetting .df(). con.execute(...) returns a result object, not a DataFrame. .df() (alias .fetchdf()) materializes one.
SQL with pandas Date types. DuckDB recognizes pandas datetimes when registered. For dates as strings in CSV, cast: CAST(Date AS DATE) in SQL.
Window functions verbosity. LAG(col) OVER (PARTITION BY ... ORDER BY ...) is the SQL way. Pandas groupby().shift() is shorter. Pick the right tool.
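Side by side, for a previous-close column (sketch):
import pandas as pd
df = pd.read_parquet("data/prices.parquet").sort_values("Date")
# SQL: LAG("Adj Close") OVER (PARTITION BY Ticker ORDER BY Date)
df["prev_close"] = df.groupby("Ticker")["Adj Close"].shift(1)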
con.unregister. Forget to clean up registered names → memory holds the DataFrame longer than expected. Use a context manager or try/finally for cleanup.
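A minimal cleanup pattern (sketch):
import duckdb
import pandas as pd
con = duckdb.connect()
df = pd.read_parquet("data/prices.parquet")
con.register("prices", df)
try:
    ath = con.execute("SELECT Ticker, MAX(High) AS ath FROM prices GROUP BY Ticker").df()
finally:
    con.unregister("prices")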
Identifier quoting. Column names with spaces or capitals: "Adj Close" (double quotes in SQL).
DuckDB OR REPLACE. Without it, a second run fails because the table already exists. CREATE OR REPLACE TABLE ... makes the script idempotent.
Polars vs DuckDB. Both fast, both columnar. Polars is pandas-like; DuckDB is SQL. For most “I just want to be faster than pandas,” try Polars first.
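A taste of Polars on the same data, sketched (requires polars; group_by is the current spelling, older versions used groupby):
import polars as pl
lazy = (
    pl.scan_parquet("data/prices.parquet")  # lazy: nothing is read yet
      .group_by("Ticker")
      .agg(pl.col("Volume").sum().alias("total_volume"))
)
print(lazy.collect())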
Recap
DuckDB reads parquet directly with SQL — no full load required. con.execute("SQL").df() returns pandas. con.register("name", df) makes pandas queryable in SQL. Hybrid: SQL for heavy aggregation and joining, pandas for time-series ops and final shaping. For >1M rows or bigger-than-RAM datasets, DuckDB is the answer. Below that, pandas alone is fine.
Series complete. Build your Monday-morning pipeline — the only way to internalize it.