Back to Blog

Inner, Left, Outer — Merging DataFrames in pandas

Celest KimCelest Kim

Video: Inner, Left, Outer — Merging DataFrames in pandas by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

Real data lives in pieces. A trade blotter on one side, a ticker reference on the other. A users table here, an addresses table there. A sales table, a product catalog. The work of any pandas analysis is joining them.

The good news: pandas has exactly one function for that — pd.merge. Same mental model as SQL. Same four arguments: left side, right side, a key column, a how-strategy. One call covers every join type.

The shape of the problem

A trade blotter is a fact table: things that happened. A ticker reference is a dimension table: facts about the universe those things happened in. Analysis needs both:

  • “What’s the total notional traded by sector?” — you need trades (notional = quantity × price) AND tickers (sector lookup).
  • “Which tickers in our universe haven’t traded?” — you need the full ticker list AND the trade history, joined.

These two questions need two different join strategies. pd.merge does both with one keyword.

Method 1 — inner join (default)

The canonical pattern:

import pandas as pd

trades = pd.read_csv("trades.csv")     # 8 rows
tickers = pd.read_csv("tickers.csv")   # 14 rows

inner = pd.merge(trades, tickers, on="Ticker")
print(inner.shape)   # (8, 6)

Two things to note:

  • on="Ticker" says “join where the Ticker column matches on both sides.” Both DataFrames must have a column with that exact name.
  • The default how is "inner" — only rows where the key exists on both sides survive. No NaN. Clean matched data.

This is the right default for most analysis. If you want to compute revenue per trade and need the sector lookup to do it, inner gives you exactly the rows you can compute on.

Method 2 — left join

What if you can’t afford to drop trades? An inner join silently loses any trade whose Ticker is missing from the reference table. For a fact table you’re going to sum, that’s a quiet bug — your totals come out under-counted and the test for it would have to know to look.

how="left" is the safety net:

left = pd.merge(trades, tickers, on="Ticker", how="left")
print(left.shape)   # (8, 6) — every trade survives

Every row from the left DataFrame survives. Where the right side has no match, the reference columns come back as NaN. You can detect the orphans with one line:

orphans = left[left["Sector"].isna()]

In this dataset trades.Ticker is a subset of tickers.Ticker, so left and inner happen to be the same 8 rows. Real data is messier. Reach for how="left" whenever the left side is your fact table and silent row loss would corrupt downstream metrics.

Method 3 — outer join

What about the reverse question — “which tickers are in the universe but have no trades?” That’s the outer join:

outer = pd.merge(trades, tickers, on="Ticker", how="outer")
print(outer.shape)   # (14, 6) — every key from both sides

You get every key from both sides. Where either side has no match, the missing columns come back as NaN. Here that’s 8 matched rows plus 6 reference-only rows (the ETFs XLE, XLF, XLK, XLV, XLY, plus SPY — they exist in the ticker reference but no trade hit them).

The classic use case: reconciliation. “Show me everything in either source.” Then filter to find what’s missing on each side:

only_in_tickers = outer[outer["Quantity"].isna()]["Ticker"].tolist()
# ['SPY', 'XLE', 'XLF', 'XLK', 'XLV', 'XLY']

When to use which

Question Pattern
Joined data for analysis, drop the unmatched pd.merge(left, right, on="key")
Keep every fact row, NaN missing reference data pd.merge(left, right, on="key", how="left")
Full reconciliation, see what’s missing on each side pd.merge(left, right, on="key", how="outer")
Different key column names pd.merge(..., left_on="Symbol", right_on="Ticker")
Overlapping non-key columns on both sides pd.merge(..., suffixes=("_buy", "_sell"))

What you’re not going to do

A few patterns that show up on Stack Overflow but you should avoid:

  • for loop over the left DataFrame doing .loc[] lookups in the right. This is the manual version of a merge and it’s O(n²). pd.merge is hash-joined under the hood — orders of magnitude faster on any non-tiny dataset.
  • pd.concat to “merge”. Concat stacks rows (same schema, more rows). Merge joins on a key (different schemas, same key). Mixing them up is a top beginner mistake.
  • .set_index then arithmetic. Setting both DataFrames’ index to the join key and then doing math works for some narrow cases, but it’s brittle and doesn’t generalize. Use pd.merge.
  • Reaching for pd.merge_asof for exact-match joins. merge_asof is for approximate time-based joins (find the nearest preceding row by timestamp). For equality matches, plain pd.merge is faster and clearer.

The full demo

Here’s the complete script the video walks through:

import pandas as pd

trades = pd.read_csv("trades.csv")
tickers = pd.read_csv("tickers.csv")
print(f"trades:               {trades.shape}")
print(f"tickers:              {tickers.shape}")

# Method 1: inner join (default) — only rows that exist on BOTH sides
inner = pd.merge(trades, tickers, on="Ticker")
print(f"inner shape:          {inner.shape}")

# Method 2: left join — keep ALL trades, NaN where ticker has no reference row
left = pd.merge(trades, tickers, on="Ticker", how="left")
print(f"left shape:           {left.shape}")

# Method 3: outer join — UNION of keys; NaN on either side where no match
outer = pd.merge(trades, tickers, on="Ticker", how="outer")
print(f"outer shape:          {outer.shape}")
no_trade = outer[outer["Quantity"].isna()]["Ticker"].tolist()
print(f"tickers with no trade: {no_trade}")

Running it:

trades:               (8, 4)
tickers:              (14, 3)
inner shape:          (8, 6)
left shape:           (8, 6)
outer shape:          (14, 6)
tickers with no trade: ['SPY', 'XLE', 'XLF', 'XLK', 'XLV', 'XLY']

Three joins, three shapes, one function call each.

Takeaways

  1. pd.merge is the join function. Same mental model as SQL. left, right, on, how.
  2. The default is inner. Only rows where the key exists on both sides. Clean matched data.
  3. how="left" keeps every left row. NaN where the right side has no match. The fact-table safety net.
  4. how="outer" is the union. Every key from both sides. NaN where either is missing. The reconciliation default.
  5. Concat stacks; merge joins. Different problems, different functions.

Watch the video for the full demo, including the outer join surfacing the 6 unmatched ETF tickers.

Want the deeper dive?

This single answers “how do I merge two DataFrames?” in a few minutes. If you want the longer version — validate="many_to_one" to catch cartesian explosions, the right join, left_on / right_on for differently-named keys, and a full sector-leaders aggregation built on top of the join — that’s covered in Pandas for Finance, Episode 8 — Joins: Prices × Sector Lookup. Same merge, same data shape, more depth.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.