Inner, Left, Outer — Merging DataFrames in pandas
Video: Inner, Left, Outer — Merging DataFrames in pandas by CelesteAI
Real data lives in pieces. A trade blotter on one side, a ticker reference on the other. A users table here, an addresses table there. A sales table, a product catalog. The work of any pandas analysis is joining them.
The good news: pandas has exactly one function for that — pd.merge. Same mental model as SQL. Same four arguments: left side, right side, a key column, a how-strategy. One call covers every join type.
The shape of the problem
A trade blotter is a fact table: things that happened. A ticker reference is a dimension table: facts about the universe those things happened in. Analysis needs both:
- “What’s the total notional traded by sector?” — you need trades (notional = quantity × price) AND tickers (sector lookup).
- “Which tickers in our universe haven’t traded?” — you need the full ticker list AND the trade history, joined.
These two questions need two different join strategies. pd.merge does both with one keyword.
Method 1 — inner join (default)
The canonical pattern:
import pandas as pd
trades = pd.read_csv("trades.csv") # 8 rows
tickers = pd.read_csv("tickers.csv") # 14 rows
inner = pd.merge(trades, tickers, on="Ticker")
print(inner.shape) # (8, 6)
Two things to note:
on="Ticker"says “join where theTickercolumn matches on both sides.” Both DataFrames must have a column with that exact name.- The default
howis"inner"— only rows where the key exists on both sides survive. No NaN. Clean matched data.
This is the right default for most analysis. If you want to compute revenue per trade and need the sector lookup to do it, inner gives you exactly the rows you can compute on.
Method 2 — left join
What if you can’t afford to drop trades? An inner join silently loses any trade whose Ticker is missing from the reference table. For a fact table you’re going to sum, that’s a quiet bug — your totals come out under-counted and the test for it would have to know to look.
how="left" is the safety net:
left = pd.merge(trades, tickers, on="Ticker", how="left")
print(left.shape) # (8, 6) — every trade survives
Every row from the left DataFrame survives. Where the right side has no match, the reference columns come back as NaN. You can detect the orphans with one line:
orphans = left[left["Sector"].isna()]
In this dataset trades.Ticker is a subset of tickers.Ticker, so left and inner happen to be the same 8 rows. Real data is messier. Reach for how="left" whenever the left side is your fact table and silent row loss would corrupt downstream metrics.
Method 3 — outer join
What about the reverse question — “which tickers are in the universe but have no trades?” That’s the outer join:
outer = pd.merge(trades, tickers, on="Ticker", how="outer")
print(outer.shape) # (14, 6) — every key from both sides
You get every key from both sides. Where either side has no match, the missing columns come back as NaN. Here that’s 8 matched rows plus 6 reference-only rows (the ETFs XLE, XLF, XLK, XLV, XLY, plus SPY — they exist in the ticker reference but no trade hit them).
The classic use case: reconciliation. “Show me everything in either source.” Then filter to find what’s missing on each side:
only_in_tickers = outer[outer["Quantity"].isna()]["Ticker"].tolist()
# ['SPY', 'XLE', 'XLF', 'XLK', 'XLV', 'XLY']
When to use which
| Question | Pattern |
|---|---|
| Joined data for analysis, drop the unmatched | pd.merge(left, right, on="key") |
| Keep every fact row, NaN missing reference data | pd.merge(left, right, on="key", how="left") |
| Full reconciliation, see what’s missing on each side | pd.merge(left, right, on="key", how="outer") |
| Different key column names | pd.merge(..., left_on="Symbol", right_on="Ticker") |
| Overlapping non-key columns on both sides | pd.merge(..., suffixes=("_buy", "_sell")) |
What you’re not going to do
A few patterns that show up on Stack Overflow but you should avoid:
forloop over the left DataFrame doing.loc[]lookups in the right. This is the manual version of a merge and it’s O(n²).pd.mergeis hash-joined under the hood — orders of magnitude faster on any non-tiny dataset.pd.concatto “merge”. Concat stacks rows (same schema, more rows). Merge joins on a key (different schemas, same key). Mixing them up is a top beginner mistake..set_indexthen arithmetic. Setting both DataFrames’ index to the join key and then doing math works for some narrow cases, but it’s brittle and doesn’t generalize. Usepd.merge.- Reaching for
pd.merge_asoffor exact-match joins.merge_asofis for approximate time-based joins (find the nearest preceding row by timestamp). For equality matches, plainpd.mergeis faster and clearer.
The full demo
Here’s the complete script the video walks through:
import pandas as pd
trades = pd.read_csv("trades.csv")
tickers = pd.read_csv("tickers.csv")
print(f"trades: {trades.shape}")
print(f"tickers: {tickers.shape}")
# Method 1: inner join (default) — only rows that exist on BOTH sides
inner = pd.merge(trades, tickers, on="Ticker")
print(f"inner shape: {inner.shape}")
# Method 2: left join — keep ALL trades, NaN where ticker has no reference row
left = pd.merge(trades, tickers, on="Ticker", how="left")
print(f"left shape: {left.shape}")
# Method 3: outer join — UNION of keys; NaN on either side where no match
outer = pd.merge(trades, tickers, on="Ticker", how="outer")
print(f"outer shape: {outer.shape}")
no_trade = outer[outer["Quantity"].isna()]["Ticker"].tolist()
print(f"tickers with no trade: {no_trade}")
Running it:
trades: (8, 4)
tickers: (14, 3)
inner shape: (8, 6)
left shape: (8, 6)
outer shape: (14, 6)
tickers with no trade: ['SPY', 'XLE', 'XLF', 'XLK', 'XLV', 'XLY']
Three joins, three shapes, one function call each.
Takeaways
pd.mergeis the join function. Same mental model as SQL. left, right, on, how.- The default is inner. Only rows where the key exists on both sides. Clean matched data.
how="left"keeps every left row. NaN where the right side has no match. The fact-table safety net.how="outer"is the union. Every key from both sides. NaN where either is missing. The reconciliation default.- Concat stacks; merge joins. Different problems, different functions.
Watch the video for the full demo, including the outer join surfacing the 6 unmatched ETF tickers.
Want the deeper dive?
This single answers “how do I merge two DataFrames?” in a few minutes. If you want the longer version — validate="many_to_one" to catch cartesian explosions, the right join, left_on / right_on for differently-named keys, and a full sector-leaders aggregation built on top of the join — that’s covered in Pandas for Finance, Episode 8 — Joins: Prices × Sector Lookup. Same merge, same data shape, more depth.