Pandas for Finance: GroupBy on Time: Buckets, Pivots, Named Aggregations
Video: GroupBy on Time: Buckets, Pivots, Named Aggregations | Pandas for Finance Ep7 by CelesteAI
df.groupby(df["Date"].dt.to_period("M"))["Close"].mean() — monthly average close. df.groupby([df["Date"].dt.year, "Ticker"])["Return"].sum() — annual return per ticker. The pivot table you don’t have to drag.
Most analyst questions are “for each [time bucket], what is [metric]?” Pandas’s groupby answers them in one line.
groupby anatomy
df.groupby(key).agg(func) — three pieces:
- key — what to group by. A column, a list of columns, or a derived expression.
- agg — what to compute per group. mean, sum, count, or a custom function.
- result — a Series or DataFrame with the group key as index.
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
df["Date"] = pd.to_datetime(df["Date"])
# Average close per ticker
avg_close = df.groupby("Ticker")["Close"].mean()
print(avg_close)
Ticker
AAPL 140.86
AMZN 145.40
GOOGL 138.12
JNJ 158.23
JPM 157.40
...
The result is a Series indexed by Ticker.
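If you don’t have the parquet file handy, the same pattern works on a toy frame (tickers and prices below are made up):

```python
import pandas as pd

# Hypothetical mini price table — two tickers, three days each
toy = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL", "AAPL", "JNJ", "JNJ", "JNJ"],
    "Close":  [100.0, 102.0, 104.0, 150.0, 150.0, 156.0],
})

# Average close per ticker — a Series indexed by Ticker
avg = toy.groupby("Ticker")["Close"].mean()
print(avg)  # AAPL 102.0, JNJ 152.0
```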
Time buckets
The killer use of groupby is bucketing by time period.
# Monthly average close, all tickers combined
monthly = df.groupby(df["Date"].dt.to_period("M"))["Close"].mean()
print(monthly.head())
Date
2018-01 85.22
2018-02 86.71
2018-03 87.04
2018-04 87.60
2018-05 90.12
.dt.to_period("M") converts each Date to “the month it belongs to.” Group by that.
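The conversion is easy to see on a few hand-picked dates — two in January collapse to the same month key:

```python
import pandas as pd

dates = pd.to_datetime(["2018-01-05", "2018-01-31", "2018-02-01"])
months = dates.to_period("M")
print(months)
# PeriodIndex(['2018-01', '2018-01', '2018-02'], dtype='period[M]')
```

Both January dates map to the identical Period, which is exactly why groupby can bucket them together.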
Other periods:
| Code | Period |
|---|---|
| "D" | day |
| "W" | week (ends Sunday by default) |
| "M" | month |
| "Q" | quarter |
| "Y" | year |
| "Y-NOV" | fiscal year ending November |
Multi-key groupby
# Monthly average per ticker
monthly_per_ticker = (
df.groupby([df["Date"].dt.to_period("M"), "Ticker"])["Close"]
.mean()
.unstack("Ticker")
)
print(monthly_per_ticker.head())
unstack("Ticker") pivots the inner level to columns:
Ticker AAPL AMZN GOOGL JNJ JPM ...
Date
2018-01 42.96 71.39 55.69 131.81 111.42
2018-02 42.43 76.08 55.48 127.95 113.43
A pivot table — dates as rows, tickers as columns — with no dragging and dropping.
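The unstack move in miniature, on a made-up long-format frame:

```python
import pandas as pd

# Toy long format: one row per (month, ticker)
toy = pd.DataFrame({
    "Month":  ["2018-01", "2018-01", "2018-02", "2018-02"],
    "Ticker": ["AAPL", "JNJ", "AAPL", "JNJ"],
    "Close":  [43.0, 131.8, 42.4, 128.0],
})

# Group on both keys, then pivot the Ticker level out to columns
pivot = toy.groupby(["Month", "Ticker"])["Close"].mean().unstack("Ticker")
print(pivot)
# Ticker  AAPL   JNJ
# Month
# 2018-01 43.0 131.8
# 2018-02 42.4 128.0
```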
Multiple aggregations
stats = (
df.groupby("Ticker")["Close"]
.agg(["mean", "min", "max", "std"])
.round(2)
)
print(stats)
mean min max std
Ticker
AAPL 140.86 35.55 258.39 60.20
AMZN 145.40 73.30 226.95 34.45
GOOGL 138.12 53.22 198.66 31.04
...
Pass a list to .agg(). Each becomes a column in the result.
For different aggs on different columns:
report = df.groupby("Ticker").agg(
AvgClose=("Close", "mean"),
MaxClose=("Close", "max"),
AvgVolume=("Volume", "mean"),
TotalVolume=("Volume", "sum"),
)
Name=("col", "func") — named aggregations. The output column name is Name, the function applies to col.
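A self-contained sketch of named aggregation (values below are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL", "JNJ", "JNJ"],
    "Close":  [100.0, 104.0, 150.0, 158.0],
    "Volume": [1000, 3000, 500, 700],
})

# Each keyword becomes an output column: Name=("source_col", "func")
report = toy.groupby("Ticker").agg(
    AvgClose=("Close", "mean"),
    TotalVolume=("Volume", "sum"),
)
print(report)
```

Unlike .agg(["mean", "sum"]), this lets you mix columns and functions freely and control the output names in one shot.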
Custom aggregations
def daily_to_annual(returns):
return (1 + returns.mean()) ** 252 - 1
annualized = (
df.groupby("Ticker")["Return"]
.agg(daily_to_annual)
.sort_values(ascending=False)
)
Any function that takes a Series and returns a scalar works.
Or inline, with a lambda:
df.groupby("Ticker")["Return"].agg(lambda r: (1+r.mean())**252 - 1)
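To sanity-check the annualization on numbers you can verify by hand, here are two hypothetical tickers with flat daily returns (0.1%/day and 0.2%/day):

```python
import pandas as pd

def daily_to_annual(returns):
    # Compound the mean daily return over ~252 trading days
    return (1 + returns.mean()) ** 252 - 1

# Made-up flat daily returns: 0.1%/day for A, 0.2%/day for B
toy = pd.DataFrame({
    "Ticker": ["A", "A", "B", "B"],
    "Return": [0.001, 0.001, 0.002, 0.002],
})

ann = toy.groupby("Ticker")["Return"].agg(daily_to_annual)
print(ann.round(4))  # A ≈ 0.2864, B ≈ 0.6545
```

Note the nonlinearity of compounding: doubling the daily return more than doubles the annualized figure.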
Year-over-year, quarter-over-quarter
# Returns aggregated to monthly
monthly_ret = (
df.set_index("Date")
.groupby("Ticker")["Adj Close"]
.resample("ME").last()
.groupby(level=0)
.pct_change()
)
(We’ll cover .resample() properly in Ep 10. For now, the takeaway: bucket → aggregate → compare buckets.)
For YoY at the table level:
df["Year"] = df["Date"].dt.year
yearly_close = df.groupby(["Year", "Ticker"])["Adj Close"].last().unstack()
yoy = yearly_close.pct_change()
print(yoy.tail())
Ticker AAPL AMZN GOOGL JNJ JPM
Year
2021 0.348 0.022 0.652 0.108 0.250
2022 -0.265 -0.495 -0.391 0.032 -0.140
2023 0.480 0.808 0.583 0.018 0.270
2024 0.305 0.444 0.357 0.054 0.330
A return table you’d recognize from Bloomberg.
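The YoY mechanics reduce to pct_change down a yearly table. A minimal sketch with made-up year-end closes:

```python
import pandas as pd

# Hypothetical year-end closes for one ticker
yearly = pd.DataFrame(
    {"AAPL": [100.0, 150.0, 120.0]},
    index=pd.Index([2021, 2022, 2023], name="Year"),
)

# Each row divided by the previous row, minus 1
yoy = yearly.pct_change()
print(yoy)  # 2022: 0.50, 2023: -0.20 (first year is NaN)
```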
Year × Month heatmap input
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["DailyReturn"] = df.groupby("Ticker")["Adj Close"].pct_change()
aapl = df[df["Ticker"] == "AAPL"]
# Compound monthly return
monthly = aapl.groupby(["Year", "Month"])["DailyReturn"].apply(
lambda r: (1 + r).prod() - 1
).unstack("Month")
print(monthly.round(3))
Month 1 2 3 4 5 6 ...
Year
2020 -0.078 -0.111 -0.067 0.157 0.083 0.149 ...
2021 0.000 -0.080 -0.082 0.076 0.045 -0.020 ...
2022 -0.014 -0.057 0.058 -0.105 -0.054 -0.082 ...
Year as rows, month as columns. Hand this to seaborn or plotly for a heatmap.
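The compounding inside that lambda, isolated on three made-up daily returns:

```python
import pandas as pd

# Three hypothetical daily returns within one month
r = pd.Series([0.01, -0.02, 0.03])

# Link the returns: grow by each day's factor, then subtract the base
monthly = (1 + r).prod() - 1
print(round(monthly, 6))  # 0.019494
```

This is what .apply runs once per (Year, Month) bucket — note it is not the same as summing the returns (which would give 0.02).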
Counting
# How many trading days per ticker per year
counts = (
df.groupby([df["Date"].dt.year, "Ticker"])
.size()
.unstack("Ticker")
)
.size() returns the count regardless of nulls; .count() only non-null values per column.
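The difference shows up as soon as a group contains a NaN (toy frame below):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Ticker": ["A", "A", "B"],
    "Return": [0.01, np.nan, 0.02],
})

sizes = toy.groupby("Ticker").size()              # rows per group, NaN included
counts = toy.groupby("Ticker")["Return"].count()  # non-null values only
print(sizes["A"], counts["A"])  # 2 1
```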
Weekday effect
df["DailyReturn"] = df.groupby("Ticker")["Adj Close"].pct_change()
df["Weekday"] = df["Date"].dt.day_name()
weekday_ret = df.groupby("Weekday")["DailyReturn"].mean()
print(weekday_ret.reindex(["Monday","Tuesday","Wednesday","Thursday","Friday"]))
The classic “what day of the week is best?” analysis. Spoiler: not very different.
Try it
groups.py:
import pandas as pd
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")
df = pd.read_parquet("data/prices.parquet")
df["Date"] = pd.to_datetime(df["Date"])
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
print("=== Average close per ticker ===")
print(df.groupby("Ticker")["Close"].mean().sort_values(ascending=False))
print("\n=== Monthly close (last value) per ticker ===")
monthly = (
df.groupby([df["Date"].dt.to_period("M"), "Ticker"])["Close"]
.last()
.unstack("Ticker")
.tail()
)
print(monthly)
print("\n=== Multi-stat report ===")
report = df.groupby("Ticker").agg(
AvgClose=("Close", "mean"),
MaxClose=("Close", "max"),
AvgVolume=("Volume", "mean"),
)
print(report.round(0))
print("\n=== Year-over-Year Return ===")
df["Year"] = df["Date"].dt.year
yoy = (
df.groupby(["Year", "Ticker"])["Adj Close"]
.last()
.unstack("Ticker")
.pct_change()
)
print(yoy.tail())
Common stumbles
Forgetting to convert to datetime. df["Date"].dt... errors if Date is still a string. pd.to_datetime(df["Date"]) first.
size() vs count(). size() is the total number of rows in the group, NaN rows included. count() is the number of non-null values per column. For “how big is this group?” use size().
groupby().mean() on string columns. Raises a TypeError in pandas 2.0+ unless you pass numeric_only=True — or select the numeric columns first.
unstack with too many groups. A million unique tickers → million-column DataFrame, OOM. Stack/unstack carefully.
Period vs Timestamp. .dt.to_period("M") produces Period objects (displayed as “2024-01”). .dt.month_year doesn’t exist; use df["Date"].dt.year and df["Date"].dt.month for separate columns.
Time-aware groupby on long format. Always df.groupby(["Date_period", "Ticker"])..., not just Date_period — grouping by the period alone pools every ticker into the same bucket.
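The numeric_only stumble, reproduced on a toy frame with an invented string column:

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["A", "A"],
    "Sector": ["Tech", "Tech"],  # string column — can't be averaged
    "Close":  [10.0, 12.0],
})

# toy.groupby("Ticker").mean()  # TypeError in pandas >= 2.0
ok = toy.groupby("Ticker").mean(numeric_only=True)  # Sector is dropped
print(ok)
```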
What’s next
Ep 8: joins. Prices × sector lookup; the VLOOKUP-for-adults episode.
Recap
df.groupby(key).agg(func). Time bucket: df["Date"].dt.to_period("M"). Multi-key for “ticker × month.” .agg([...]) for multiple stats; agg(Name=("col", "func")) for named outputs. .unstack(level) pivots one level to columns. .dt.year, .dt.month, .dt.day_name() for date pieces. The pivot table replaced.
Next episode: joins.