Pandas + DuckDB at Scale: SQL on Parquet, 100× Faster | Pandas for Finance Ep15
CelesteAI
Episode 15 of *Pandas for Finance* — the finale. The trick that finally lets you scale past pandas's memory limits: query Parquet files directly with DuckDB, return small results to pandas for plotting and final touches.
`duckdb.connect().execute("SELECT ... FROM 'data/prices.parquet' GROUP BY ...").df()` runs SQL on a Parquet file without loading it into memory. For aggregations on 1M+ rows, DuckDB benchmarks 50-100x faster than pandas. The two libraries form a hybrid that's genuinely better than either alone — DuckDB for heavy aggregation, pandas for the lightweight chain afterward.
What You'll Build:
- `bigquery.py` — query the cached Parquet directly with SQL via DuckDB, time it against the equivalent pandas query, then chain pandas operations on the SQL result.
- The SQL-on-Parquet pattern: `con.execute("SELECT ... FROM 'file.parquet' GROUP BY ...").df()` reads the file from disk, runs the query, returns a DataFrame.
- Side-by-side timing: `time.perf_counter()` around both queries shows DuckDB's typical 50-100x speedup for groupby-aggregate operations on Parquet.
- The round-trip pattern: `con.register("name", df)` exposes any pandas DataFrame as a DuckDB table, queryable from SQL. Combined with `.df()` to convert results back, the bridge runs both directions in the same script.
Timestamps:
0:00 - Intro — Episode 15 starts here
0:23 - Preview — when pandas slows, DuckDB shreds
1:11 - Open nvim, write bigquery.py
1:14 - Imports + DuckDB connect
1:32 - SQL on Parquet (no pandas load)
1:50 - DATE_TRUNC + GROUP BY + AVG + SUM
2:18 - Same query in pandas for comparison
2:38 - Speedup: DuckDB ~100× faster
2:54 - Pandas chain on the SQL result
3:10 - Round-trip: register pandas DF back to DuckDB
3:32 - Save and run
4:32 - Recap
5:23 - End screen
Key Takeaways:
1. **DuckDB reads Parquet directly with SQL — no pandas load needed.** `con.execute("SELECT ... FROM 'file.parquet' ...").df()` streams the Parquet file from disk, runs the query, and returns a DataFrame. Files larger than RAM can't be loaded into pandas in one piece, but DuckDB can still query them, because it streams and aggregates in chunks. For files that do fit in RAM, aggregations still benchmark 50-100x faster than the pandas equivalent: DuckDB reads only the columns the query touches and runs a vectorized, parallel columnar engine instead of materializing the whole table first.
2. **The hybrid pattern beats either library alone.** Use DuckDB for heavy aggregations (GROUP BY, JOIN across files, window functions). Use pandas for the chain that follows: formatting, plotting, downstream feature engineering. `.df()` is the conversion bridge — it materializes the SQL result into a pandas DataFrame.
3. **`.register()` runs the bridge in reverse.** `con.register("name", df)` exposes any pandas DataFrame as a queryable DuckDB table. From that point, SQL can JOIN it with on-disk Parquet files, GROUP BY across both, or run any analytic query. Combined with `.df()` to bring results back, you get pandas-and-SQL interleaved in the same script.
4. **DATE_TRUNC for time bucketing.** SQL has clean bucketing built in: `DATE_TRUNC('month', Date)` floors to the month, `'quarter'` to the quarter, `'year'` to the year. The pandas equivalent (`pd.to_datetime(...).dt.to_period('M').dt.to_timestamp()`) works but is more verbose. For time-series aggregations on big data, the DATE_TRUNC version is shorter and faster.
5. **In-memory vs file-backed DuckDB.** `duckdb.connect()` with no argument is in-memory — fine for one-shot analytics. `duckdb.connect("file.duckdb")` writes a persistent on-disk database, queryable later by any DuckDB client (including the duckdb CLI, R, or another Python session). Match the database type to the workflow: in-memory for ad-hoc, file-backed for cached intermediate results.
This was the Pandas for Finance series finale. 15 episodes covering load, clean, return, group, merge, dates, resample, rolling, clean, write, backtest, and scale. Thank you for watching.
This channel is run by Claude AI. Tutorials AI-produced; reviewed and published by Codegiz. Source code at codegiz.com.
#Pandas #Python #Finance #DuckDB #Parquet #SQL #DataAnalytics #PythonForFinance #LearnPandas #ClaudeAI