Part of Polars for Finance

Polars for Finance: GroupBy and Aggregate in Polars — Python Tutorial

Celest KimCelest Kim

Video: GroupBy and Aggregate in Polars — Python Tutorial by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

df.group_by("Ticker").agg([...]) — one call, fourteen tickers, four statistics per ticker. The list inside .agg is a small declarative program the Polars engine compiles and runs in parallel. SQL window/aggregate semantics, Python syntax, Rust execution.

.over("Ticker") from Episode 3 added a column per row. .group_by("Ticker").agg(...) collapses the frame to one row per ticker. Same partition concept, different output shape. Between them you have the two cardinality moves that power every analyst pipeline: window functions keep the frame’s row count, aggregations reduce it.

What group_by + agg actually does

For an N-row frame with a Ticker column:

  1. Partition into groups by the Ticker value (14 groups in our universe).
  2. For each group, evaluate every expression in the .agg([...]) list.
  3. Produce a result frame with one row per group + one column per expression + the group key itself.

The result is 14 rows × 5 columns — 14 tickers, four statistics each, plus the ticker name. Bloomberg’s per-symbol-summary view in two lines.

Pandas equivalent:

df.groupby("Ticker").agg(
  avg_close=("Close", "mean"),
  max_volume=("Volume", "max"),
  ret_std=("daily_ret", "std"),
)

The pandas version uses named tuples; Polars uses .alias on the expression. Both work. The Polars version composes more naturally with the rest of the expression API — every .agg argument is the same pl.col(...) expression you’d use anywhere else.

Setup

source .venv/bin/activate
nvim groupby_agg.py

Same dataset universe — fourteen tickers, daily OHLCV, ~28k rows. If you skipped Episode 1, run python scripts/regenerate-cache.py once.

Your aggregation script

import polars as pl

df = pl.read_parquet("data/prices.parquet")

result = df.with_columns(
  daily_ret=pl.col("Close").pct_change().over("Ticker"),
).group_by("Ticker").agg([
  pl.col("Close").mean().alias("avg_close"),
  pl.col("Volume").max().alias("max_volume"),
  pl.col("daily_ret").std().alias("ret_std"),
  pl.col("daily_ret").count().alias("days"),
]).sort("avg_close", descending=True)

print(result)

Save (:wq), run:

python groupby_agg.py

Output:

shape: (14, 5)
┌────────┬────────────┬────────────┬──────────┬──────┐
│ Ticker ┆ avg_close  ┆ max_volume ┆ ret_std  ┆ days │
│ ---    ┆ ---        ┆ ---        ┆ ---      ┆ ---  │
│ str    ┆ f64        ┆ i64        ┆ f64      ┆ u32  │
╞════════╪════════════╪════════════╪══════════╪══════╡
│ SPY    ┆ 413.49     ┆ 392220700  ┆ 0.012300 ┆ 2009 │
│ MSFT   ┆ 270.70     ┆ 111242100  ┆ 0.017871 ┆ 2009 │
│ TSLA   ┆ 182.68     ┆ 914082000  ┆ 0.040118 ┆ 2009 │
│ JPM    ┆ 155.30     ┆ 54418800   ┆ 0.018332 ┆ 2009 │
│ JNJ    ┆ 154.67     ┆ 151319500  ┆ 0.012361 ┆ 2009 │
…
│ NVDA   ┆ 44.40      ┆ 2511528000 ┆ 0.032318 ┆ 2009 │
│ XLE    ┆ 35.21      ┆ 198713400  ┆ 0.020103 ┆ 2009 │
│ XLF    ┆ 35.07      ┆ 256525000  ┆ 0.014905 ┆ 2009 │
└────────┴────────────┴────────────┴──────────┴──────┘

Fourteen rows. One per ticker. Reading by column:

  • avg_close — SPY is the priciest in the window at $413, NVDA the cheapest at $44 (pre-split adjusted view). The sort is on this column.
  • max_volume — NVDA peaked at 2.5 billion shares in a single day, TSLA at 914 million. Index ETFs are an order of magnitude lower.
  • ret_std — daily return volatility. TSLA at 4.0% per day, NVDA at 3.2% — the high-vol pair. SPY at 1.2% is the market floor.
  • days — sanity check; every ticker has the same 2009 trading days across the window.

Each statistic is one expression in the .agg list. The engine fuses them — Polars walks the underlying Arrow buffers once and produces all four columns in a single pass.

What the .alias does

pl.col("Close").mean() is an expression that, when run in a .agg context, produces a single value per group. Without .alias, the result column name would be Close (the source column). That collides if you also compute pl.col("Close").max() in the same call.

.alias("avg_close") renames the output. Read it as the SQL AS:

SELECT AVG(close) AS avg_close FROM prices GROUP BY ticker

For every .agg expression in a multi-column aggregation, use .alias to give it a stable name. Future-you and pipeline consumers will thank you.

Per-group aggregation patterns

The .agg list takes any number of expressions. Common shapes:

.agg([
  pl.col("Close").mean().alias("avg_close"),                # central tendency
  pl.col("Close").std().alias("std_close"),                 # dispersion
  pl.col("Close").min().alias("min_close"),                 # tail
  pl.col("Close").max().alias("max_close"),                 # tail
  pl.col("Close").quantile(0.95).alias("p95_close"),        # percentile
  pl.len().alias("days"),                                   # row count per group
  pl.col("daily_ret").sum().alias("total_ret"),             # additive
])

Mean, std, min, max, quantile, count, sum — the seven aggregations every analyst needs. Each is one method on the expression.

Conditional aggregation — filter inside agg

The Polars idiom that’s hard to do in pandas: filter rows within an aggregation expression.

.agg([
  pl.col("Volume").filter(pl.col("Close") > 100).mean().alias("high_close_avg_vol"),
])

That mean-of-volume only includes rows where Close > $100. Pandas would need a groupby+apply with a lambda. In Polars it’s a chain on the expression.

group_by multiple columns

For two-level grouping — e.g. ticker AND year:

result = df.with_columns(
  year=pl.col("Date").dt.year(),
).group_by(["Ticker", "year"]).agg([
  pl.col("Close").mean().alias("avg_close"),
])

Pass a list to group_by instead of a single string. The result frame has one row per (Ticker, year) combination — 14 × 6 = 84 rows for our six-year window.

Maintain ordering with maintain_order=True

Polars group_by defaults to unordered output for parallelism — rows come back in whatever order the workers finished. For a stable, reproducible ordering matching first-appearance:

df.group_by("Ticker", maintain_order=True).agg(...)

Trade a small amount of speed for deterministic output. Always do this when the result is feeding a downstream report or test.

group_by vs over — when to use which

Same partition concept, different output shape:

Goal Use Result shape
Add a per-group derived column to every row .over("group") same as input
Reduce to one row per group .group_by("group").agg([...]) one row per group

If you want both — a per-row column AND a summary — compute the .over first with with_columns, then .group_by the result. Polars optimizes both into a single pass over the data.

Pandas → Polars cheatsheet for groupby

Operation Pandas Polars
Group + mean df.groupby("T")["X"].mean() df.group_by("T").agg(pl.col("X").mean())
Multi-agg df.groupby("T").agg({"X": "mean", "Y": "sum"}) df.group_by("T").agg([pl.col("X").mean(), pl.col("Y").sum()])
Named result df.groupby("T").agg(avg=("X","mean")) pl.col("X").mean().alias("avg")
Count rows per group df.groupby("T").size() df.group_by("T").agg(pl.len())
Filter inside agg groupby+apply with lambda pl.col("X").filter(pl.col("Y") > 0).mean()
Multi-key group df.groupby(["A","B"]).agg(...) df.group_by(["A","B"]).agg(...)
Sort by group key .sort_index() after .sort("T") after

Every pandas idiom maps onto a Polars expression. The substitution is direct.

Common stumbles

shape: (1, N) instead of one row per group. You forgot the group_by entirely. df.agg([...]) aggregates the whole frame as one group. Use df.group_by(...).agg([...]).

Result column named Close instead of avg_close. Missing .alias. Polars defaults to the input column name; multi-aggs on the same column collide. Always alias.

Rows in random order between runs. Default group_by is unordered for speed. Add maintain_order=True or chain .sort("Ticker") after.

PanicException: column not found. You aggregated a column you forgot to add. If you reference daily_ret inside .agg, the source frame must have that column. Add it with .with_columns(daily_ret=...) before the .group_by call.

Want median but median() errors. Polars exposes median as pl.col("X").median(). For arbitrary percentiles: pl.col("X").quantile(0.5).

Recap

df.group_by("Ticker").agg([...]) collapses the frame to one row per group, one column per expression. Use .alias on every aggregate so column names are stable. The .agg list takes the same expressions you use everywhere else in Polars — pl.col(...).mean(), .std(), .quantile(p), .filter(...).mean(), and pl.len() for counts. For deterministic output add maintain_order=True. Multi-key grouping is one list argument away. Between .over (windowed, same row count) and .group_by(...).agg(...) (aggregation, fewer rows), you cover most of the analytics surface area without ever writing a Python loop over the data.

Subscribe to the playlist for the next episode in the series.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.