Pandas for Finance: Joins: Prices x Sector Lookup, merge() and the SQL JOIN
Video: Joins: Prices x Sector Lookup, merge() and the SQL JOIN | Pandas for Finance Ep8 by CelesteAI
`prices.merge(sectors, on="Ticker")` glues the two tables together. `how="left"` keeps every price row; `"inner"` drops unmatched rows. Then `groupby("Sector")["Return"].mean()` answers “which sectors performed best?” The VLOOKUP-for-adults episode.
A merge is SQL’s JOIN: combine rows from two DataFrames using a shared key.
Two tables, one join
import pandas as pd
prices = pd.read_parquet("data/prices.parquet")
sectors = pd.read_csv("data/sector_map.csv")
print(sectors.head())
Ticker Sector
0 AAPL Technology
1 MSFT Technology
2 GOOGL Communication Services
3 AMZN Consumer Discretionary
4 NVDA Technology
prices has many rows per Ticker (one per Date). sectors has one row per Ticker. Join them on Ticker:
merged = prices.merge(sectors, on="Ticker", how="left")
print(merged.head())
Date Ticker Open ... Sector
0 2018-01-02 AAPL 42.54 ... Technology
1 2018-01-03 AAPL 42.95 ... Technology
2 2018-01-04 AAPL 43.13 ... Technology
...
The Sector column is repeated for every price row. That's SQL JOIN behavior: rows are matched on the key, and unmatched rows are either dropped or kept with NaN, depending on `how`.
Join types
prices.merge(sectors, on="Ticker", how="inner") # only matching keys
prices.merge(sectors, on="Ticker", how="left") # keep all prices
prices.merge(sectors, on="Ticker", how="right") # keep all sectors
prices.merge(sectors, on="Ticker", how="outer") # keep all from both
| How | Keeps |
|---|---|
| `inner` | Only rows with key in both |
| `left` | All from left; right's matches |
| `right` | All from right; left's matches |
| `outer` | All from both; NaN where unmatched |
For “I have prices, attach metadata,” how="left" — never lose price rows.
For “I have a list of tickers I care about, give me their prices,” how="inner" (or filter prices first).
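To make the table concrete, here is a toy sketch. The data is invented: ZZZZ and XOM exist only to force an unmatched key on each side.

```python
import pandas as pd

# Toy frames, invented for illustration: ZZZZ has no sector, XOM has no prices
prices = pd.DataFrame({
    "Ticker": ["AAPL", "AAPL", "MSFT", "ZZZZ"],
    "Close":  [185.0, 186.2, 410.5, 9.99],
})
sectors = pd.DataFrame({
    "Ticker": ["AAPL", "MSFT", "XOM"],
    "Sector": ["Technology", "Technology", "Energy"],
})

inner = prices.merge(sectors, on="Ticker", how="inner")  # drops the ZZZZ row
left  = prices.merge(sectors, on="Ticker", how="left")   # keeps ZZZZ, Sector=NaN
outer = prices.merge(sectors, on="Ticker", how="outer")  # also adds an XOM row

print(len(inner), len(left), len(outer))  # 3 4 5
```

When in doubt about where rows went, `indicator=True` adds a `_merge` column (`both`, `left_only`, `right_only`) that labels each output row's origin.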
Different key column names
ticker_meta = pd.read_csv("data/ticker_metadata.csv") # has "Symbol" column
merged = prices.merge(ticker_meta, left_on="Ticker", right_on="Symbol")
left_on and right_on for column names that differ between the two DataFrames.
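Note that both key columns survive a `left_on`/`right_on` merge, so it is common to drop the redundant one. A sketch with an invented two-row metadata table:

```python
import pandas as pd

# Invented metadata table whose key column is "Symbol", not "Ticker"
prices = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Close": [185.0, 410.5]})
meta = pd.DataFrame({"Symbol": ["AAPL", "MSFT"], "Name": ["Apple", "Microsoft"]})

merged = prices.merge(meta, left_on="Ticker", right_on="Symbol")
# Both key columns appear in the result; drop the duplicate
merged = merged.drop(columns="Symbol")
print(list(merged.columns))  # ['Ticker', 'Close', 'Name']
```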
Many-to-one vs many-to-many
The prices × sectors join is many-to-one — many price rows per ticker, one sector per ticker. Result: one row per input price row. Standard.
Many-to-many (both sides have multiple matches) creates a cartesian product within each key group. Usually a bug. Verify with:
merged = prices.merge(sectors, on="Ticker", validate="many_to_one")
validate= raises if the relationship doesn’t match. Saves debugging time on a join you didn’t expect.
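A sketch of what `validate=` catches, using an invented sector map with a duplicated key:

```python
import pandas as pd

prices = pd.DataFrame({"Ticker": ["AAPL", "AAPL"], "Close": [185.0, 186.2]})
# A buggy sector map: the key appears twice
bad_sectors = pd.DataFrame({"Ticker": ["AAPL", "AAPL"],
                            "Sector": ["Technology", "Tech"]})

try:
    prices.merge(bad_sectors, on="Ticker", validate="many_to_one")
except pd.errors.MergeError as err:
    print("caught:", err)

# Without validate=, the join silently explodes: 2 x 2 = 4 rows
print(len(prices.merge(bad_sectors, on="Ticker")))  # 4
```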
Index-based join
If both DataFrames have the same index, df1.join(df2) is shorter:
prices_indexed = prices.set_index("Ticker")
sectors_indexed = sectors.set_index("Ticker")
joined = prices_indexed.join(sectors_indexed, how="left")
merge with on= is more flexible. join is convenient for index-based work. Same result either way.
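A quick check of "same result either way" on invented data: merging on the column and joining on the index produce the same frame once the index is reset.

```python
import pandas as pd

# Invented frames just to compare the two spellings
prices = pd.DataFrame({"Ticker": ["AAPL", "AAPL", "MSFT"],
                       "Close": [185.0, 186.2, 410.5]})
sectors = pd.DataFrame({"Ticker": ["AAPL", "MSFT"],
                        "Sector": ["Technology", "Technology"]})

via_merge = prices.merge(sectors, on="Ticker", how="left")
via_join = (prices.set_index("Ticker")
                  .join(sectors.set_index("Ticker"), how="left")
                  .reset_index())

print(via_merge.equals(via_join))  # True
```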
Aggregating after join
The point of the join is the analysis after:
df = prices.merge(sectors, on="Ticker", how="left")
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
# Average daily return by sector
sector_returns = df.groupby("Sector")["Return"].mean().sort_values(ascending=False)
print(sector_returns)
Sector
Technology 0.001234
Consumer Discretionary 0.000895
Communication Services 0.000821
Financials 0.000543
Health Care 0.000312
Sector ETF 0.000654
Index ETF 0.000523
Annualize:
ann = (1 + sector_returns) ** 252 - 1
print(ann.map(lambda x: f"{x:.2%}"))
Cumulative wealth by sector
df["Wealth"] = df.groupby("Ticker")["Return"].apply(lambda r: (1 + r.fillna(0)).cumprod()).reset_index(level=0, drop=True)
# End-of-period wealth by sector (avg across tickers in sector)
end_wealth = df.groupby("Sector").apply(
    lambda g: g.sort_values("Date").groupby("Ticker")["Wealth"].last().mean()
)
print(end_wealth.sort_values(ascending=False))
A bit dense. A cleaner version:
last_wealth = df.sort_values("Date").groupby(["Sector", "Ticker"])["Wealth"].last()
sector_wealth = last_wealth.groupby("Sector").mean()
print(sector_wealth.sort_values(ascending=False))
For tutorial pace, flatten complex aggregations into named steps.
Joining time series across tickers
To compare AAPL to SPY (benchmark), pivot to wide and merge:
wide = (
prices.pivot(index="Date", columns="Ticker", values="Adj Close")[["AAPL", "SPY"]]
)
wide["AAPL_Return"] = wide["AAPL"].pct_change()
wide["SPY_Return"] = wide["SPY"].pct_change()
wide["Excess"] = wide["AAPL_Return"] - wide["SPY_Return"]
print(wide.tail())
pivot is “implicit join on Date” — rows are dates, columns are tickers, cells are prices. Comparing time-aligned series is then trivial.
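A minimal sketch of that pivot on invented long-format prices (two tickers, two dates):

```python
import pandas as pd

# Invented long-format prices: one row per (Date, Ticker)
long_prices = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-02",
                            "2024-01-03", "2024-01-03"]),
    "Ticker": ["AAPL", "SPY", "AAPL", "SPY"],
    "Adj Close": [185.0, 470.0, 186.2, 471.5],
})

# One row per date, one column per ticker: the "implicit join on Date"
wide = long_prices.pivot(index="Date", columns="Ticker", values="Adj Close")
excess = wide["AAPL"].pct_change() - wide["SPY"].pct_change()
print(wide)
print(excess)
```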
Concat vs merge
# Stacking same-shape DataFrames vertically
combined = pd.concat([df_2023, df_2024], ignore_index=True)
# Stacking horizontally (must share index)
side = pd.concat([prices_subset, sectors_indexed], axis=1)
concat doesn’t match keys — it just glues. Use it when you have same-schema chunks (e.g., yearly files). Use merge when you’re matching on a key.
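A sketch of the difference on invented yearly chunks:

```python
import pandas as pd

# Same-schema yearly chunks (invented values): concat just appends rows
df_2023 = pd.DataFrame({"Ticker": ["AAPL"], "Close": [185.0]})
df_2024 = pd.DataFrame({"Ticker": ["AAPL"], "Close": [210.0]})

stacked = pd.concat([df_2023, df_2024], ignore_index=True)
print(len(stacked))  # 2 rows: no key matching happened

# merge on the same frames key-matches instead: one combined AAPL row
paired = df_2023.merge(df_2024, on="Ticker", suffixes=("_23", "_24"))
print(len(paired))  # 1
```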
Try it
join.py:
import pandas as pd
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")
prices = pd.read_parquet("data/prices.parquet")
sectors = pd.read_csv("data/sector_map.csv")
print("=== Sector lookup ===")
print(sectors)
print("\n=== Joined DataFrame ===")
df = prices.merge(sectors, on="Ticker", how="left")
print(df.head())
print(f"Rows: {len(df)} (should equal {len(prices)})")
print("\n=== Daily return per sector ===")
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
sector_ret = df.groupby("Sector")["Return"].mean().sort_values(ascending=False)
print(sector_ret)
print("\n=== Annualized return per sector ===")
ann = ((1 + sector_ret) ** 252 - 1)
print(ann.map(lambda x: f"{x:.2%}"))
print("\n=== Validate many-to-one ===")
merged = prices.merge(sectors, on="Ticker", validate="many_to_one")
print(f"Validated: {len(merged)} rows")
Common stumbles
Forgetting how. Default is inner — silently drops rows without matches. For “attach metadata to my main table,” how="left".
Cartesian explosion. Many-to-many joins blow up row counts. Use validate="many_to_one" to catch.
Duplicate suffix _x/_y. When both DataFrames have a column with the same name and you don’t say it’s the key, pandas suffixes _x and _y. Specify on= or rename first.
Ticker case mismatch. aapl vs AAPL won’t match. df["Ticker"] = df["Ticker"].str.upper() before joining.
NaN keys. Missing values in the join key are trouble: unlike SQL's NULL, pandas can match NaN keys to each other, silently joining rows you never intended. Drop or fillna rows with missing keys before joining.
Slow merges on huge DataFrames. A single hash merge is roughly O(n+m), but repeated joins on the same key redo that work every time. set_index(key) once and use .join(): sometimes a bit faster, and cleaner.
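The _x/_y stumble in miniature, on invented frames where both sides carry a non-key Close column:

```python
import pandas as pd

# Invented frames: both have a "Close" column that is NOT the join key
a = pd.DataFrame({"Ticker": ["AAPL"], "Close": [185.0]})
b = pd.DataFrame({"Ticker": ["AAPL"], "Close": [184.9], "Sector": ["Technology"]})

m = a.merge(b, on="Ticker")
print(list(m.columns))  # ['Ticker', 'Close_x', 'Close_y', 'Sector']

# Clearer: rename before merging, or pass explicit suffixes=
m2 = a.merge(b, on="Ticker", suffixes=("_px", "_ref"))
print(list(m2.columns))  # ['Ticker', 'Close_px', 'Close_ref', 'Sector']
```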
What’s next
Ep 9: dates and times — the .dt accessor, business days, calendars.
Recap
df1.merge(df2, on="key", how="left") for SQL-like joins. how: inner (default), left, right, outer. left_on/right_on for differently named keys. validate="many_to_one" to catch unexpected duplicates. After joining, groupby("Sector") answers "what's the sector-level metric?" For time-aligned comparisons of multiple tickers, pivot(index="Date", columns="Ticker") is an implicit join on Date.
Next episode: dates and calendars.