Part of Pandas for Finance

Pandas for Finance: Pandas + DuckDB at Scale: SQL on Parquet, 100× Faster

Celest Kim

Video: Pandas + DuckDB at Scale: SQL on Parquet, 100× Faster | Pandas for Finance Ep15 by CelesteAI


When pandas slows on huge data, DuckDB shreds. duckdb.execute("SELECT ... FROM read_parquet('big.parquet')") runs SQL directly on parquet without loading into memory. Return small results to pandas with .df() for plotting and final touches. The series finale.

Pandas is great. It’s also memory-bound. Once your data hits “doesn’t fit in RAM” (or even “fits but is annoyingly slow”), DuckDB takes over.

When pandas slows

import pandas as pd

# Our demo set: 14 tickers × ~7 years of daily bars
df = pd.read_parquet("data/prices.parquet")
print(df.shape, df.memory_usage(deep=True).sum() / 1024**2, "MB")
# (24375, 8) ~6 MB

Our parquet is small. But scale up:

Universe             Rows    Pandas memory   DuckDB
14 stocks × 7y       24k     6 MB            irrelevant
500 stocks × 10y     1.3M    300 MB          reads in seconds
5000 stocks × 30y    38M     8 GB            streams from disk

Once you cross 1M rows, DuckDB’s SQL-on-parquet starts winning on speed and memory.
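
To make that concrete, here is a rough timing sketch. The file name is hypothetical (our demo parquet is too small to show a gap) and the numbers vary by machine; the point is that DuckDB scans only the columns the query names, while pandas loads everything first.

import time
import duckdb
import pandas as pd

path = "data/big_prices.parquet"  # hypothetical larger file

t0 = time.perf_counter()
pdf = pd.read_parquet(path)                  # loads every row and column into RAM
pandas_totals = pdf.groupby("Ticker")["Volume"].sum()
t1 = time.perf_counter()

duck_totals = duckdb.execute(
    f"SELECT Ticker, SUM(Volume) AS total FROM read_parquet('{path}') GROUP BY Ticker"
).df()                                       # scans only Ticker and Volume
t2 = time.perf_counter()

print(f"pandas: {t1 - t0:.2f}s  duckdb: {t2 - t1:.2f}s")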

Install + connect

# pip install duckdb
import duckdb

con = duckdb.connect()

Empty connection — in-memory database. For persistent storage:

con = duckdb.connect("market.duckdb")

A single file. Survives restarts. Like SQLite for analytics.
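
A quick persistence sketch (the table name here is ours, not from the lesson): the table survives a close and reconnect.

import duckdb

con = duckdb.connect("market.duckdb")
con.execute("""
  CREATE OR REPLACE TABLE tickers AS
  SELECT DISTINCT Ticker FROM 'data/prices.parquet'
""")
con.close()

# New process, same file: the table is still there.
con = duckdb.connect("market.duckdb")
print(con.execute("SELECT COUNT(*) FROM tickers").fetchone())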

SQL on parquet

result = con.execute("""
  SELECT Ticker, COUNT(*) as days, AVG("Adj Close") as avg_close
  FROM 'data/prices.parquet'
  GROUP BY Ticker
  ORDER BY avg_close DESC
""").df()

print(result)

DuckDB reads the parquet directly — no pd.read_parquet first. Filters and aggregations push down to the parquet engine; only the result materializes.

.df() returns a pandas DataFrame. Now you can plot, format, or pipe further.
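
For example, a filtered query touches only the rows and columns it names. A sketch using our file's columns:

# Reads just two columns; row groups whose stats rule out AAPL can be skipped.
aapl = con.execute("""
  SELECT Date, "Adj Close"
  FROM 'data/prices.parquet'
  WHERE Ticker = 'AAPL'
  ORDER BY Date
""").df()
print(aapl.head())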

Pandas → DuckDB → Pandas

# Have a pandas DataFrame? Register it.
df = pd.read_parquet("data/prices.parquet")
con.register("prices", df)

result = con.execute("""
  SELECT Ticker, MAX(High) as ath
  FROM prices
  GROUP BY Ticker
""").df()

con.unregister("prices")

con.register("name", df) makes a pandas DataFrame queryable as name in SQL. No copy — DuckDB reads pandas’s memory directly.

For ad-hoc analysis where you want SQL-style brevity on pandas data, register is gold.
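
DuckDB can also resolve a local DataFrame variable by name (its "replacement scans" feature), so for quick one-offs you can skip register entirely:

df = pd.read_parquet("data/prices.parquet")
# DuckDB finds `df` in the surrounding Python scope by name.
print(duckdb.sql("SELECT Ticker, COUNT(*) AS days FROM df GROUP BY Ticker").df())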

DuckDB’s killer feature: out-of-core

# Even if the parquet is 50GB, this works:
con.execute("""
  CREATE OR REPLACE TABLE summary AS
  SELECT Ticker, AVG("Adj Close") as avg_close, MAX(Volume) as max_vol
  FROM 'data/big_prices.parquet'
  GROUP BY Ticker
""")

summary = con.execute("SELECT * FROM summary").df()

DuckDB streams from the parquet, computes incrementally, writes the result. Only the summary is in memory.

For huge datasets, this is the only sane path. Pandas can’t do it without manual chunking.
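
You can also stream results straight back to disk with COPY, never materializing a DataFrame at all. A sketch against the same hypothetical big file:

# Filter a huge parquet into a smaller one; only buffers live in memory.
con.execute("""
  COPY (
    SELECT * FROM 'data/big_prices.parquet' WHERE Volume > 1000000
  ) TO '/tmp/liquid_days.parquet' (FORMAT PARQUET)
""")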

Aggregating then plotting

import duckdb
con = duckdb.connect()

monthly_returns = con.execute("""
  WITH monthly AS (
    -- keep only the last trading day of each ticker-month;
    -- that row's close is the month close
    SELECT
      Ticker,
      DATE_TRUNC('month', Date) AS month,
      "Adj Close" AS month_close
    FROM 'data/prices.parquet'
    QUALIFY ROW_NUMBER() OVER (
      PARTITION BY Ticker, DATE_TRUNC('month', Date)
      ORDER BY Date DESC
    ) = 1
  )
  SELECT
    Ticker, month,
    month_close / LAG(month_close) OVER (PARTITION BY Ticker ORDER BY month) - 1 AS monthly_return
  FROM monthly
  ORDER BY Ticker, month
""").df()

print(monthly_returns.head())

Window functions (LAG, LEAD, OVER) are SQL’s superpower. DuckDB has them all. For “compute monthly return from daily data,” SQL is arguably cleaner than the pandas equivalent.
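
For comparison, a pandas sketch of the same monthly-return computation (assuming Date is a datetime column, as in our file):

df = pd.read_parquet("data/prices.parquet")
month = df["Date"].dt.to_period("M")
month_close = (
    df.sort_values("Date")
      .groupby(["Ticker", month])["Adj Close"]
      .last()                                  # last trading day's close per ticker-month
)
monthly_return = month_close.groupby(level="Ticker").pct_change()
print(monthly_return.head())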

When SQL beats pandas

  • Joining huge tables. SQL engines are optimized for this (see the sketch after this list).
  • Window functions. LAG, LEAD, ranking — sometimes more readable than groupby + shift.
  • Out-of-core data. Pandas can’t, DuckDB can.
  • Sharing the query. SQL is a lingua franca; pandas is Python-only.
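
A join sketch: the sector table here is hypothetical, registered from pandas, and joined against the parquet in one query.

import duckdb
import pandas as pd

con = duckdb.connect()
# Hypothetical mapping; in practice this might come from a reference file.
sectors = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Sector": ["Tech", "Tech"]})
con.register("sectors", sectors)

per_sector = con.execute("""
  SELECT s.Sector, AVG(p."Adj Close") AS avg_close
  FROM 'data/prices.parquet' p
  JOIN sectors s USING (Ticker)
  GROUP BY s.Sector
""").df()
print(per_sector)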

When pandas beats SQL

  • Time-series operations. .rolling(), .expanding(), .resample() are cleaner than SQL window functions for these.
  • Plotting. matplotlib/plotly take DataFrames natively.
  • Custom Python logic. Anything beyond standard aggregations.
  • Iterative exploration. Tab-complete on . is faster than typing SQL.

The hybrid: SQL for the heavy lift, pandas for the last mile.

A hybrid pipeline

import duckdb
import pandas as pd

con = duckdb.connect()

# 1. Heavy aggregation in SQL
monthly_volumes = con.execute("""
  SELECT
    Ticker,
    DATE_TRUNC('month', Date) AS month,
    SUM(Volume) AS monthly_volume
  FROM 'data/prices.parquet'
  GROUP BY Ticker, DATE_TRUNC('month', Date)
  ORDER BY Ticker, month
""").df()

# 2. Pandas for time-series transformations
monthly_volumes["MonthlyVolume_zScore"] = (
  monthly_volumes
    .groupby("Ticker")["monthly_volume"]
    .transform(lambda x: (x - x.mean()) / x.std())
)

# 3. Filter outliers and write
high_volume_months = monthly_volumes.query("MonthlyVolume_zScore > 2")
high_volume_months.to_parquet("/tmp/high_volume_months.parquet")
print(high_volume_months.head())

DuckDB groups; pandas computes a per-ticker z-score; pandas filters and writes. Each tool plays to its strength.

Series wrap-up

Across 15 episodes:

  • Eps 1–3 — setup, the DataFrame, caching with parquet.
  • Eps 4–5 — filter rows, pick columns.
  • Eps 6 — daily/log/cumulative returns.
  • Eps 7–8 — groupby, joins.
  • Eps 9–11 — dates, resampling, rolling windows.
  • Eps 12–13 — cleaning, exporting.
  • Ep 14 — capstone backtest.
  • Ep 15 — pandas + DuckDB at scale.

Enough to write the Monday-morning analyst pipeline: download, clean, aggregate, report. And the bridge to bigger data when pandas alone runs out.

Where to go next

  • Polars — pandas-like API, Rust-fast, lazy evaluation. Good middle ground between pandas and DuckDB.
  • PyArrow Tables — even faster than pandas for huge data; fewer ergonomics.
  • Plotly / Altair — better-looking charts than matplotlib.
  • Streamlit / Gradio — turn your script into an interactive dashboard in 50 lines.
  • Quantitative finance — quantstats, pyfolio, vectorbt for backtest infrastructure.
  • LLM-driven analytics — combine pandas with Claude or GPT for natural-language data exploration.

A complete demo

bigquery.py:

import duckdb
import pandas as pd

con = duckdb.connect()

# SQL: monthly volume per ticker
result = con.execute("""
  SELECT
    Ticker,
    DATE_TRUNC('month', Date) AS Month,
    AVG("Adj Close") AS AvgClose,
    SUM(Volume)      AS TotalVolume
  FROM 'data/prices.parquet'
  GROUP BY Ticker, Month
  ORDER BY Ticker, Month
""").df()
print(f"Aggregated: {len(result):,} rows")
print(result.head())

# Pandas: rank tickers by total volume
totals = (
  result.groupby("Ticker")["TotalVolume"].sum()
        .sort_values(ascending=False)
)
print("\n=== Total volume per ticker ===")
print(totals.map(lambda v: f"{v/1e9:,.1f}B"))

# Out-of-core friendliness: explain
print("\n=== EXPLAIN ===")
con.execute("EXPLAIN SELECT Ticker, COUNT(*) FROM 'data/prices.parquet' GROUP BY Ticker").df()

Common stumbles

Forgetting .df(). con.execute(...) hands back a result you still have to fetch, not a DataFrame. .df() materializes it as pandas.

SQL with pandas Date types. DuckDB recognizes pandas datetimes when registered. For dates as strings in CSV, cast: CAST(Date AS DATE) in SQL.

Window functions verbosity. LAG(col) OVER (PARTITION BY ... ORDER BY ...) is the SQL way. Pandas groupby().shift() is shorter. Pick the right tool.

con.unregister. Forget to clean up registered names → memory holds the DataFrame longer than expected. Use a context manager or try/finally for cleanup.
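
A minimal cleanup sketch with try/finally:

import duckdb
import pandas as pd

con = duckdb.connect()
df = pd.read_parquet("data/prices.parquet")

con.register("prices", df)
try:
    out = con.execute("SELECT COUNT(*) FROM prices").df()
finally:
    con.unregister("prices")   # released even if the query raises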

Identifier quoting. Column names with spaces need double quotes in SQL: "Adj Close". (Plain names like Ticker work unquoted; DuckDB matches them case-insensitively.)

DuckDB OR REPLACE. Plain CREATE TABLE errors on the second run if the table already exists. CREATE OR REPLACE TABLE ... makes the script idempotent.

Polars vs DuckDB. Both fast, both columnar. Polars is pandas-like; DuckDB is SQL. For most “I just want to be faster than pandas,” try Polars first.
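
For a taste, the average-close aggregation from earlier in Polars (assuming polars is installed; API names as of recent versions):

import polars as pl

avg_close = (
    pl.scan_parquet("data/prices.parquet")   # lazy: nothing is read yet
      .group_by("Ticker")
      .agg(pl.col("Adj Close").mean().alias("avg_close"))
      .sort("avg_close", descending=True)
      .collect()                             # executes the whole plan at once
)
print(avg_close)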

Recap

DuckDB reads parquet directly with SQL, no full load required. con.execute("SQL").df() returns pandas. con.register("name", df) makes pandas queryable in SQL. Hybrid: SQL for heavy aggregation/joining, pandas for time-series ops and final shaping. For >1M rows or >RAM datasets, DuckDB is the answer. For anything smaller, pandas alone is fine.

Series complete. Build your Monday-morning pipeline — the only way to internalize it.
