Joins: Prices x Sector Lookup, merge() and the SQL JOIN | Pandas for Finance Ep8
0views
C
CelesteAI
Description
Episode 8 of *Pandas for Finance*. The VLOOKUP for adults — pandas's `merge()` is SQL's JOIN, and it's how analysts attach metadata to a main table.
The episode walks through `prices.merge(sectors, on="Ticker", how="left")` to glue a fourteen-ticker sector lookup onto the long-format price table, then groups by Sector to answer the analyst question — which sectors led the period? Consumer Discretionary 50%, Technology 46%, Communication Services 31%, and so on. The four `how` values, the `validate` parameter that catches unexpected cartesian explosions, and the named-aggregation pattern from Episode 7 all carry forward.
What You'll Build:
- `join.py` — load the cached `prices.parquet` and the `sector_map.csv` lookup, merge on Ticker with `how="left"`, attach a Sector column to every price row, then group by Sector for the leaderboard.
- The merge with explicit `how` and the validate-many-to-one safety check that catches accidental cartesian explosions.
- A sector-level annualized-return ranking computed in one chained pipeline: groupby → mean → sort → annualize → format as percent.
- The `cat data/sector_map.csv` warm-up so viewers see the small reference table that powers the join.
Timestamps:
0:00 - Intro — Episode 8 starts here
0:19 - Preview — merge anatomy
0:58 - cat the sector lookup
1:11 - Open nvim, write join.py
1:23 - merge(prices, sectors, on=Ticker, how=left)
1:35 - groupby Sector, annualize
1:49 - Save and run
1:55 - Joined head (Sector column attached)
2:09 - Annualized return by sector
2:31 - Recap
3:10 - End screen
Key Takeaways:
1. **`df1.merge(df2, on="key", how=...)` is SQL's JOIN.** The four `how` values mirror SQL exactly: `inner` keeps only rows with the key in both (default and easy to silently drop rows on); `left` keeps everything on the left and matches what it can on the right; `right` is the symmetric case; `outer` keeps everything from both, NaN where unmatched. For "attach metadata to my main table," `how="left"` is the right answer ninety percent of the time.
2. **Use `validate="many_to_one"` to catch cartesian explosions.** When you expect each key on the right to be unique (one sector per ticker, one industry per company), pandas can verify the relationship and raise a clear error if it isn't. Without `validate`, a duplicated row on the supposed-unique side silently multiplies your row count and the bug surfaces hours later in a downstream aggregation. One keyword argument; saves debugging time.
3. **Different key column names → `left_on` and `right_on`.** When the join key is named `Ticker` on one side and `Symbol` on the other, `left_on="Ticker", right_on="Symbol"` does the join without renaming first. The output keeps both columns; drop one if you don't need it.
4. **The point of the join is the groupby on the new column.** Joining is rarely the goal; aggregating the result is. After `prices.merge(sectors)`, `groupby("Sector")["Return"].mean()` is the actual analysis — the join just made the column available. For finance work, the pattern is always: load main table, attach lookups, compute returns, group by the lookup column.
5. **Concat is not merge.** `pd.concat([df1, df2])` stacks DataFrames vertically without matching keys — use it for same-schema chunks like yearly files or month-by-month dumps. `merge` is for matching on a key. Mixing them up is one of the top mistakes new pandas users make; remember that concat is "glue them together" and merge is "match them up."
#Pandas #Python #Finance #Merge #Join #SQL #DataAnalytics #PythonForFinance #LearnPandas #DataFrame