Joins: Prices x Sector Lookup, merge() and the SQL JOIN | Pandas for Finance Ep8
0views
C
CelesteAI
Description
Episode 8 of Pandas for Finance. The VLOOKUP for adults — pandas's merge() is SQL's JOIN, and it's how analysts attach metadata to a main table.
Source code: https://github.com/GoCelesteAI/pandas-for-finance
The episode walks through prices.merge(sectors, on="Ticker", how="left") to glue a fourteen-ticker sector lookup onto the long-format price table, then groups by Sector to answer the analyst question — which sectors led the period? Consumer Discretionary 50%, Technology 46%, Communication Services 31%, and so on. The four how values, the validate parameter that catches unexpected cartesian explosions, and the named-aggregation pattern from Episode 7 all carry forward.
What You'll Build:
- join.py — load the cached prices.parquet and the sector_map.csv lookup, merge on Ticker with how="left", attach a Sector column to every price row, then group by Sector for the leaderboard.
- The merge with explicit how and the validate-many-to-one safety check that catches accidental cartesian explosions.
- A sector-level annualized-return ranking computed in one chained pipeline: groupby → mean → sort → annualize → format as percent.
- The cat data/sector_map.csv warm-up so viewers see the small reference table that powers the join.
Timestamps:
0:00 - Intro — Episode 8 starts here
0:19 - Preview — merge anatomy
0:58 - cat the sector lookup
1:11 - Open nvim, write join.py
1:23 - merge(prices, sectors, on=Ticker, how=left)
1:35 - groupby Sector, annualize
1:49 - Save and run
1:55 - Joined head (Sector column attached)
2:09 - Annualized return by sector
2:31 - Recap
3:10 - End screen
Key Takeaways:
1. df1.merge(df2, on="key", how=...) is SQL's JOIN. The four how values mirror SQL exactly: inner keeps only rows with the key in both (default and easy to silently drop rows on); left keeps everything on the left and matches what it can on the right; right is the symmetric case; outer keeps everything from both, NaN where unmatched. For "attach metadata to my main table," how="left" is the right answer ninety percent of the time.
2. Use validate="many_to_one" to catch cartesian explosions. When you expect each key on the right to be unique (one sector per ticker, one industry per company), pandas can verify the relationship and raise a clear error if it isn't. Without validate, a duplicated row on the supposed-unique side silently multiplies your row count and the bug surfaces hours later in a downstream aggregation. One keyword argument; saves debugging time.
3. Different key column names → left_on and right_on. When the join key is named Ticker on one side and Symbol on the other, left_on="Ticker", right_on="Symbol" does the join without renaming first. The output keeps both columns; drop one if you don't need it.
4. The point of the join is the groupby on the new column. Joining is rarely the goal; aggregating the result is. After prices.merge(sectors), groupby("Sector")["Return"].mean() is the actual analysis — the join just made the column available. For finance work, the pattern is always: load main table, attach lookups, compute returns, group by the lookup column.
5. Concat is not merge. pd.concat([df1, df2]) stacks DataFrames vertically without matching keys — use it for same-schema chunks like yearly files or month-by-month dumps. merge is for matching on a key. Mixing them up is one of the top mistakes new pandas users make; remember that concat is "glue them together" and merge is "match them up."
#Pandas #Python #Finance #Merge #Join #SQL #DataAnalytics #PythonForFinance #LearnPandas #DataFrame
---
Generated by GoCelesteAI · part of the Pandas for Finance series