Filter Rows: Boolean Masks, .query(), .nlargest, Date Slices | Pandas for Finance Ep4
0views
C
CelesteAI
Description
Episode 4 of *Pandas for Finance*. The WHERE clause for DataFrames.
If you finished Episode 3, you have the cached parquet on disk. Now we filter. In SQL, you write `WHERE close_return greater than 0.03`. In pandas, you pass a boolean Series as the index. Same idea — `df[df["Close"].pct_change().gt(0.03)]` keeps the days Apple closed up more than three percent. Combine with `&` and `|`, wrap each condition in parens, and you have everything you need to find the moments that matter in seven years of price data.
What You'll Build:
- `big_days.py` — load the cached prices, filter to AAPL, compute daily returns, and pull the days the stock moved more than three percent.
- Two-condition filtering with `.query("Return greater than 0.03 and Volume greater than 1e8")` — readable, SQL-shaped, no operator-precedence traps.
- `.nlargest(5, "Return")` to pull the five biggest single-day moves across the whole range — including April 2025 up fifteen percent.
- Date slicing with `.set_index("Date").loc["2024-01":"2024-03"]` to get just Q1 2024 and its average close.
Timestamps:
0:00 - Intro — Episode 4 starts here
0:17 - Preview — boolean masks, .query, top-N
0:54 - Open nvim, write big_days.py
1:08 - Filter to AAPL, compute daily Return
1:20 - Boolean filter for big up days
1:32 - Two-condition .query
1:44 - .nlargest for top 5
1:54 - Save and run
2:01 - Big up days output
2:11 - Big up + high volume; top 5 returns
2:31 - Recap — boolean masks, .query, .nlargest
3:08 - End screen
Key Takeaways:
1. **Boolean indexing is the core idea.** `df[boolean_series]` keeps the rows where the Series is `True`. `aapl["Return"].gt(0.03)` produces a Series of `True`/`False` aligned to the index, and `aapl[that]` returns the matching rows. This single pattern handles ninety percent of the row-filtering you'll ever do — anything more complex is just a different way to construct the boolean Series.
2. **Use `&`, `|`, `~` — never `and`, `or`, `not`.** Vectorized operations need bitwise operators on element-wise Series. Python's `and`/`or` keywords expect a single scalar truth value and will raise "ambiguous truth value of a Series." And always wrap each condition in parens: `(a.gt(1)) & (b.lt(5))`. Without the parens, operator precedence parses the bitwise ampersand before the comparisons — silently wrong.
3. **`.query()` is the readable filter.** Takes a string expression, parses it, and applies it. You can use `and`/`or` inside the string because the parser converts them to bitwise. Reference outside variables with `@variable`. Reference column names with spaces using backticks. Anything past two conditions is more readable as a `.query()` string than as a chain of bitwise expressions.
4. **`.isin([...])`, `.between(a, b)`, `.str.contains(...)` cover the SQL `IN`, `BETWEEN`, and `LIKE` cases.** `df[df["Ticker"].isin(["AAPL", "MSFT", "GOOGL"])]` for whitelist. `aapl[aapl["Volume"].between(5e7, 1e8)]` for ranges. `df[df["Ticker"].str.startswith("X")]` for string patterns. Every SQL filter has a one-line pandas equivalent.
5. **Date slicing requires a `DatetimeIndex`.** `df.set_index("Date").loc["2024-01":"2024-03"]` works because `loc` understands partial date strings on a `DatetimeIndex`. Without `set_index`, you get a `KeyError`. For series operations on time-indexed data — rolling windows, resample, asfreq — the DatetimeIndex pays back the conversion cost across every later episode.
#Pandas #Python #Finance #Filter #BooleanIndexing #PandasQuery #DataAnalytics #PythonForFinance #LearnPandas #DataFrame