Part of Pandas for Finance

Pandas for Finance: Joins: Prices x Sector Lookup, merge() and the SQL JOIN

Celest Kim

Video: Joins: Prices x Sector Lookup, merge() and the SQL JOIN | Pandas for Finance Ep8 by CelesteAI


prices.merge(sectors, on="Ticker") glues the two tables together. how="left" keeps every price row; "inner" drops unmatched. Then groupby("Sector")["Return"].mean() answers “which sectors performed best?” The VLOOKUP-for-adults episode.

A merge is SQL’s JOIN: combine rows from two DataFrames using a shared key.

Two tables, one join

import pandas as pd

prices = pd.read_parquet("data/prices.parquet")
sectors = pd.read_csv("data/sector_map.csv")

print(sectors.head())
  Ticker                  Sector
0   AAPL              Technology
1   MSFT              Technology
2  GOOGL  Communication Services
3   AMZN  Consumer Discretionary
4   NVDA              Technology

prices has many rows per Ticker (one per Date). sectors has one row per Ticker. Join them on Ticker:

merged = prices.merge(sectors, on="Ticker", how="left")
print(merged.head())
        Date Ticker     Open   ...      Sector
0 2018-01-02  AAPL    42.54   ...  Technology
1 2018-01-03  AAPL    42.95   ...  Technology
2 2018-01-04  AAPL    43.13   ...  Technology
...

The Sector column is repeated for every price row. This is SQL JOIN behavior: rows with matching keys are combined, and what happens to unmatched rows depends on how. With how="left" (used here), unmatched price rows are kept with NaN in Sector; with the default how="inner" they would be dropped.

Join types

prices.merge(sectors, on="Ticker", how="inner")  # only matching keys
prices.merge(sectors, on="Ticker", how="left")   # keep all prices
prices.merge(sectors, on="Ticker", how="right")  # keep all sectors
prices.merge(sectors, on="Ticker", how="outer")  # keep all from both
How     Keeps
inner   Only rows with key in both
left    All from left; matching rows from right
right   All from right; matching rows from left
outer   All from both; NaN where unmatched

For “I have prices, attach metadata,” how="left" — never lose price rows.

For “I have a list of tickers I care about, give me their prices,” how="inner" (or filter prices first).
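A minimal sketch of the four join types on made-up tickers (PLTR and XOM are hypothetical here, chosen so each side has one unmatched key):

```python
import pandas as pd

# Toy tables: PLTR has prices but no sector entry; XOM has a sector but no prices.
prices = pd.DataFrame({"Ticker": ["AAPL", "AAPL", "PLTR"], "Close": [185.0, 186.2, 24.5]})
sectors = pd.DataFrame({"Ticker": ["AAPL", "XOM"], "Sector": ["Technology", "Energy"]})

inner = prices.merge(sectors, on="Ticker", how="inner")  # drops the PLTR row
left = prices.merge(sectors, on="Ticker", how="left")    # keeps PLTR, Sector is NaN
outer = prices.merge(sectors, on="Ticker", how="outer")  # keeps PLTR and XOM

print(len(inner), len(left), len(outer))  # 2 3 4
```

Row counts tell the story: inner loses the unmatched price row, left keeps it, outer keeps the unmatched rows from both sides.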

Different key column names

ticker_meta = pd.read_csv("data/ticker_metadata.csv")  # has "Symbol" column
merged = prices.merge(ticker_meta, left_on="Ticker", right_on="Symbol")

left_on and right_on for column names that differ between the two DataFrames.
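One gotcha worth knowing: with left_on/right_on, both key columns survive in the result. A sketch with a made-up metadata table:

```python
import pandas as pd

prices = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Close": [185.0, 410.3]})
meta = pd.DataFrame({"Symbol": ["AAPL", "MSFT"], "Name": ["Apple", "Microsoft"]})

# With left_on/right_on, BOTH key columns appear in the result...
merged = prices.merge(meta, left_on="Ticker", right_on="Symbol")
print(list(merged.columns))  # ['Ticker', 'Close', 'Symbol', 'Name']

# ...so drop the redundant one (or rename "Symbol" to "Ticker" before merging).
merged = merged.drop(columns="Symbol")
```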

Many-to-one vs many-to-many

The prices × sectors join is many-to-one — many price rows per ticker, one sector per ticker. Result: one row per input price row. Standard.

Many-to-many (both sides have multiple matches) creates a cartesian product within each key group. Usually a bug. Verify with:

merged = prices.merge(sectors, on="Ticker", validate="many_to_one")

validate= raises if the relationship doesn’t match. Saves debugging time on a join you didn’t expect.
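A small sketch of what validate= catches (the duplicated sector row is deliberately bogus):

```python
import pandas as pd
from pandas.errors import MergeError

prices = pd.DataFrame({"Ticker": ["AAPL", "AAPL"], "Close": [185.0, 186.2]})
# Two rows for AAPL on the right breaks the "one sector per ticker" assumption.
dup_sectors = pd.DataFrame({"Ticker": ["AAPL", "AAPL"], "Sector": ["Technology", "Tech"]})

try:
    prices.merge(dup_sectors, on="Ticker", validate="many_to_one")
except MergeError as err:
    print("caught:", err)

# Without validate=, the same merge silently becomes 2 x 2 = 4 rows:
print(len(prices.merge(dup_sectors, on="Ticker")))  # 4
```

The loud failure is the point: you find out at the merge, not three steps later when row counts look wrong.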

Index-based join

If both DataFrames have the same index, df1.join(df2) is shorter:

prices_indexed = prices.set_index("Ticker")
sectors_indexed = sectors.set_index("Ticker")

joined = prices_indexed.join(sectors_indexed, how="left")

merge with on= is more flexible; join is convenient for index-based work. Same matched rows either way; the main difference is that join leaves Ticker as the index, while merge keeps it as a column.
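To see the equivalence concretely, a toy comparison (made-up prices); reset_index brings Ticker back to a column so the two results line up exactly:

```python
import pandas as pd

prices = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Close": [185.0, 410.3]})
sectors = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Sector": ["Technology", "Technology"]})

via_merge = prices.merge(sectors, on="Ticker", how="left")

# join() matches on the index, so set it first; reset_index restores the column.
via_join = (
    prices.set_index("Ticker")
          .join(sectors.set_index("Ticker"), how="left")
          .reset_index()
)

print(via_merge.equals(via_join))  # True
```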

Aggregating after join

The point of the join is the analysis after:

df = prices.merge(sectors, on="Ticker", how="left")
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()

# Average daily return by sector
sector_returns = df.groupby("Sector")["Return"].mean().sort_values(ascending=False)
print(sector_returns)
Sector
Technology                  0.001234
Consumer Discretionary      0.000895
Communication Services      0.000821
Sector ETF                  0.000654
Financials                  0.000543
Index ETF                   0.000523
Health Care                 0.000312
Name: Return, dtype: float64

Annualize:

ann = (1 + sector_returns) ** 252 - 1
print(ann.map(lambda x: f"{x:.2%}"))
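A quick sanity check of the compounding formula on a single made-up number:

```python
# Compounding a constant daily return over 252 trading days:
daily = 0.001                      # 0.1% per day, a toy figure
annual = (1 + daily) ** 252 - 1    # about 28.6%
print(f"{annual:.2%}")

# Not the same as the naive 0.001 * 252 = 25.2%; compounding adds the rest.
```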

Cumulative wealth by sector

df["Wealth"] = (
    df.groupby("Ticker")["Return"]
      .apply(lambda r: (1 + r.fillna(0)).cumprod())
      .reset_index(level=0, drop=True)
)

# End-of-period wealth by sector (avg across tickers in sector)
end_wealth = df.groupby("Sector").apply(
  lambda g: g.sort_values("Date").groupby("Ticker")["Wealth"].last().mean()
)
print(end_wealth.sort_values(ascending=False))

A bit dense. A cleaner version:

last_wealth = df.sort_values("Date").groupby(["Sector", "Ticker"])["Wealth"].last()
sector_wealth = last_wealth.groupby("Sector").mean()
print(sector_wealth.sort_values(ascending=False))

For tutorial pace, flatten complex aggregations into named steps.

Joining time series across tickers

To compare AAPL to SPY (benchmark), pivot to wide so both series share a Date index:

wide = (
  prices.pivot(index="Date", columns="Ticker", values="Adj Close")[["AAPL", "SPY"]]
)
wide["AAPL_Return"] = wide["AAPL"].pct_change()
wide["SPY_Return"] = wide["SPY"].pct_change()
wide["Excess"] = wide["AAPL_Return"] - wide["SPY_Return"]
print(wide.tail())

pivot is “implicit join on Date” — rows are dates, columns are tickers, cells are prices. Comparing time-aligned series is then trivial.
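A minimal sketch of that implicit join, with made-up closes for two dates:

```python
import pandas as pd

long = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"],
    "Ticker": ["AAPL", "SPY", "AAPL", "SPY"],
    "Adj Close": [185.0, 472.6, 184.2, 470.0],
})

# Each Date row now carries both tickers side by side, aligned by date.
wide = long.pivot(index="Date", columns="Ticker", values="Adj Close")
print(wide)
```

Four long rows become two wide rows; any date missing a ticker would show up as NaN, which is exactly the outer-join behavior you'd want for spotting gaps.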

Concat vs merge

# Stacking same-shape DataFrames vertically
combined = pd.concat([df_2023, df_2024], ignore_index=True)

# Stacking horizontally (must share index)
side = pd.concat([prices_subset, sectors_indexed], axis=1)

concat doesn’t match keys — it just glues. Use it when you have same-schema chunks (e.g., yearly files). Use merge when you’re matching on a key.
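A toy contrast between the two (single made-up rows per year):

```python
import pandas as pd

df_2023 = pd.DataFrame({"Date": ["2023-12-29"], "Ticker": ["AAPL"], "Close": [192.5]})
df_2024 = pd.DataFrame({"Date": ["2024-01-02"], "Ticker": ["AAPL"], "Close": [185.6]})

# concat just stacks: no key matching, row count is the sum of the parts.
stacked = pd.concat([df_2023, df_2024], ignore_index=True)
print(len(stacked))  # 2

# merge matches keys: one AAPL row on each side, so one combined row.
matched = df_2023.merge(df_2024, on="Ticker", suffixes=("_2023", "_2024"))
print(len(matched))  # 1
```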

Try it

join.py:

import pandas as pd

pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

prices = pd.read_parquet("data/prices.parquet")
sectors = pd.read_csv("data/sector_map.csv")

print("=== Sector lookup ===")
print(sectors)

print("\n=== Joined DataFrame ===")
df = prices.merge(sectors, on="Ticker", how="left")
print(df.head())
print(f"Rows: {len(df)} (should equal {len(prices)})")

print("\n=== Daily return per sector ===")
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
sector_ret = df.groupby("Sector")["Return"].mean().sort_values(ascending=False)
print(sector_ret)

print("\n=== Annualized return per sector ===")
ann = ((1 + sector_ret) ** 252 - 1)
print(ann.map(lambda x: f"{x:.2%}"))

print("\n=== Validate one-to-many ===")
merged = prices.merge(sectors, on="Ticker", validate="many_to_one")
print(f"Validated: {len(merged)} rows")

Common stumbles

Forgetting how. Default is inner — silently drops rows without matches. For “attach metadata to my main table,” how="left".

Cartesian explosion. Many-to-many joins blow up row counts. Use validate="many_to_one" to catch.

Duplicate suffix _x/_y. When both DataFrames have a column with the same name and you don’t say it’s the key, pandas suffixes _x and _y. Specify on= or rename first.
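A sketch of the suffix behavior, plus the suffixes= escape hatch (the vendor names are made up):

```python
import pandas as pd

a = pd.DataFrame({"Ticker": ["AAPL"], "Close": [185.0]})
b = pd.DataFrame({"Ticker": ["AAPL"], "Close": [185.1]})  # same column name, not a key

# Default: pandas disambiguates with _x and _y.
merged = a.merge(b, on="Ticker")
print(list(merged.columns))  # ['Ticker', 'Close_x', 'Close_y']

# Readable suffixes beat _x/_y:
merged = a.merge(b, on="Ticker", suffixes=("_vendor1", "_vendor2"))
print(list(merged.columns))  # ['Ticker', 'Close_vendor1', 'Close_vendor2']
```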

Ticker case mismatch. aapl vs AAPL won’t match. df["Ticker"] = df["Ticker"].str.upper() before joining.

NaN keys. Missing join keys are trouble. SQL treats NULL ≠ NULL, and pandas has its own quirks here (merge can match NaN keys with each other, pairing unrelated rows). Either way: dropna(subset=[...]) or fillna the key column before joining.
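A defensive sketch: drop rows with a missing key before merging (toy data):

```python
import pandas as pd

prices = pd.DataFrame({"Ticker": ["AAPL", None, "MSFT"], "Close": [185.0, 99.9, 410.3]})

# Rows without a join key can't be attributed to any ticker; drop them up front.
clean = prices.dropna(subset=["Ticker"])
print(len(clean))  # 2
```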

Slow merges on huge DataFrames. A hash join is roughly O(n+m), but merge rebuilds its key lookup on every call. For repeated joins on the same key, set_index(key) once and use .join() — sometimes a bit faster, more importantly cleaner.

What’s next

Ep 9: dates and times — the .dt accessor, business days, calendars.

Recap

df1.merge(df2, on="key", how="left") for SQL-like joins. how: inner (default), left, right, outer. left_on/right_on for differently-named keys. validate="many_to_one" to catch unexpected duplicates. After joining, groupby("Sector") answers “what’s the sector-level metric.” For time-aligned comparisons of multiple tickers, pivot(index="Date", columns="Ticker") is implicit-join.

Next episode: dates and calendars.
