Filter Rows: Boolean Masks, .query(), .nlargest, Date Slices | Pandas for Finance Ep4
0views
C
CelesteAI
Description
Episode 4 of Pandas for Finance. The WHERE clause for DataFrames.
Source code: https://github.com/GoCelesteAI/pandas-for-finance
If you finished Episode 3, you have the cached parquet on disk. Now we filter. In SQL, you write WHERE close_return greater than 0.03. In pandas, you pass a boolean Series as the index. Same idea — df[df["Close"].pct_change().gt(0.03)] keeps the days Apple closed up more than three percent. Combine with & and |, wrap each condition in parens, and you have everything you need to find the moments that matter in seven years of price data.
What You'll Build:
- big_days.py — load the cached prices, filter to AAPL, compute daily returns, and pull the days the stock moved more than three percent.
- Two-condition filtering with .query("Return greater than 0.03 and Volume greater than 1e8") — readable, SQL-shaped, no operator-precedence traps.
- .nlargest(5, "Return") to pull the five biggest single-day moves across the whole range — including April 2025 up fifteen percent.
- Date slicing with .set_index("Date").loc["2024-01":"2024-03"] to get just Q1 2024 and its average close.
Timestamps:
0:00 - Intro — Episode 4 starts here
0:17 - Preview — boolean masks, .query, top-N
0:54 - Open nvim, write big_days.py
1:08 - Filter to AAPL, compute daily Return
1:20 - Boolean filter for big up days
1:32 - Two-condition .query
1:44 - .nlargest for top 5
1:54 - Save and run
2:01 - Big up days output
2:11 - Big up + high volume; top 5 returns
2:31 - Recap — boolean masks, .query, .nlargest
3:08 - End screen
Key Takeaways:
1. Boolean indexing is the core idea. df[boolean_series] keeps the rows where the Series is True. aapl["Return"].gt(0.03) produces a Series of True/False aligned to the index, and aapl[that] returns the matching rows. This single pattern handles ninety percent of the row-filtering you'll ever do — anything more complex is just a different way to construct the boolean Series.
2. Use &, |, ~ — never and, or, not. Vectorized operations need bitwise operators on element-wise Series. Python's and/or keywords expect a single scalar truth value and will raise "ambiguous truth value of a Series." And always wrap each condition in parens: (a.gt(1)) & (b.lt(5)). Without the parens, operator precedence parses the bitwise ampersand before the comparisons — silently wrong.
3. .query() is the readable filter. Takes a string expression, parses it, and applies it. You can use and/or inside the string because the parser converts them to bitwise. Reference outside variables with @variable. Reference column names with spaces using backticks. Anything past two conditions is more readable as a .query() string than as a chain of bitwise expressions.
4. .isin([...]), .between(a, b), .str.contains(...) cover the SQL IN, BETWEEN, and LIKE cases. df[df["Ticker"].isin(["AAPL", "MSFT", "GOOGL"])] for whitelist. aapl[aapl["Volume"].between(5e7, 1e8)] for ranges. df[df["Ticker"].str.startswith("X")] for string patterns. Every SQL filter has a one-line pandas equivalent.
5. Date slicing requires a DatetimeIndex. df.set_index("Date").loc["2024-01":"2024-03"] works because loc understands partial date strings on a DatetimeIndex. Without set_index, you get a KeyError. For series operations on time-indexed data — rolling windows, resample, asfreq — the DatetimeIndex pays back the conversion cost across every later episode.
#Pandas #Python #Finance #Filter #BooleanIndexing #PandasQuery #DataAnalytics #PythonForFinance #LearnPandas #DataFrame
---
Generated by GoCelesteAI · part of the Pandas for Finance series