Part of Polars for Finance

Polars for Finance: Filter Rows and Select Columns in Polars — Python Tutorial

Celest KimCelest Kim

Video: Filter Rows and Select Columns in Polars — Python Tutorial by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

The Polars expression API in one image: df.filter(pl.col("Close") > 200).select(["Date", "Ticker", "Close"]). Read it left to right: take the DataFrame, keep rows where Close exceeds 200, then keep only three columns. SQL flow, Python syntax, Rust speed.

You loaded the data in Episode 1. Now we narrow it. Filter cuts rows; select cuts columns. Together they are 80% of what an analyst does to a DataFrame before any aggregation runs.

This is also the episode where the expression API appears — the thing that makes Polars different from pandas. Once it clicks, every other Polars operation becomes a variation on the same pattern.

What is an expression

pl.col("Close") > 200 is not a value. It is a query plan node — a small object that says “the column named Close, compared element-wise to 200.” Pass it to filter and Polars compiles it to multi-threaded vectorized code. Compose two of them with & and Polars optimizes the combined predicate before running it.

In pandas you write df[df["Close"] > 200] and the boolean mask materializes immediately. Polars keeps the expression abstract until filter consumes it, which is what lets the query optimizer fold filters into reads (preview of Episode 8).

The rule is: build expressions with pl.col, hand them to dataframe methods. Once that grip is in place every Polars feature snaps into place.

Setup

Same venv as Episode 1.

source .venv/bin/activate

Same dataset — data/prices.parquet (14 tickers, ~28k rows). If you skipped Ep 1, run python scripts/regenerate-cache.py once.

Your filter script

nvim filter_select.py
import polars as pl

df = pl.read_parquet("data/prices.parquet")

# 1. Filter — keep only AAPL rows where Close exceeded $200.
aapl_high = df.filter(
  (pl.col("Ticker") == "AAPL") & (pl.col("Close") > 200)
)
print("=== AAPL Close > $200 ===")
print(aapl_high.head())
print(aapl_high.shape)

Save (:wq), run:

python filter_select.py

Output:

=== AAPL Close > $200 ===
shape: (5, 8)
┌─────────────────────┬────────┬────────┬────────┬────────┬────────┬───────────┬───────────┐
│ Date                ┆ Ticker ┆ Open   ┆ High   ┆ Low    ┆ Close  ┆ Adj Close ┆ Volume    │
│ ---                 ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---       ┆ ---       │
│ datetime[ms]        ┆ str    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64       ┆ i64       │
╞═════════════════════╪════════╪════════╪════════╪════════╪════════╪═══════════╪═══════════╡
│ 2024-06-12 00:00:00 ┆ AAPL   ┆ 207.37 ┆ 220.20 ┆ 206.59 ┆ 215.04 ┆ 213.78    ┆ 198134300 │
…
└─────────────────────┴────────┴────────┴────────┴────────┴────────┴───────────┴───────────┘
(367, 8)

367 trading days where AAPL closed above $200. The expression API at work:

  • pl.col("Ticker") == "AAPL" — a boolean expression
  • pl.col("Close") > 200 — another boolean expression
  • & combines them element-wise (Polars overloads & for expression AND; same for | and ~)

The parentheses around each operand matter — & has higher precedence than ==, so without them you’d get a confusing parse error. Same gotcha as numpy and pandas; the muscle memory carries over.

Select — picking columns

Now the column-narrowing half:

# 2. Select — three columns out of eight.
slim = df.select(["Date", "Ticker", "Close"])
print("=== Date / Ticker / Close ===")
print(slim.head())
print(slim.shape)

Output:

=== Date / Ticker / Close ===
shape: (5, 3)
┌─────────────────────┬────────┬───────┐
│ Date                ┆ Ticker ┆ Close │
│ ---                 ┆ ---    ┆ ---   │
│ datetime[ms]        ┆ str    ┆ f64   │
╞═════════════════════╪════════╪═══════╡
│ 2018-01-02 00:00:00 ┆ AAPL   ┆ 43.06 │
│ 2018-01-03 00:00:00 ┆ AAPL   ┆ 43.06 │
…
└─────────────────────┴────────┴───────┘
(28140, 3)

Same 28,140 rows, but only three columns. select takes a list of column names or expressions; the result is a fresh DataFrame.

If parquet is your storage, select propagates back into the read when you go lazy:

df = pl.read_parquet("data/prices.parquet", columns=["Date", "Ticker", "Close"])

Same result, but only three columns are read from disk. This is the column-prune mentioned in Episode 1. For a 10 GB tick file with 40 columns where you want 3, this is the difference between minutes and seconds.

Chain them

The real Polars idiom is chaining filter and select:

# 3. Filter then select — high-volume days, only the four interesting columns.
big = (
  df.filter(pl.col("Volume") > 100_000_000)
    .select(["Date", "Ticker", "Close", "Volume"])
)
print("=== Volume > 100M ===")
print(big.head())
print(big.shape)

Output:

=== Volume > 100M ===
shape: (5, 4)
┌─────────────────────┬────────┬────────┬───────────┐
│ Date                ┆ Ticker ┆ Close  ┆ Volume    │
│ ---                 ┆ ---    ┆ ---    ┆ ---       │
│ datetime[ms]        ┆ str    ┆ f64    ┆ i64       │
╞═════════════════════╪════════╪════════╪═══════════╡
│ 2018-01-04 00:00:00 ┆ AAPL   ┆ 43.26  ┆ 109708400 │
…
└─────────────────────┴────────┴────────┴───────────┘
(4705, 4)

4,705 high-volume bars across 14 tickers, four columns each.

Reading left to right: take the prices, keep rows where Volume crossed 100 million, then keep four columns. SQL semantics, Python syntax, Rust execution. Each method returns a fresh DataFrame; nothing mutates the original.

Order matters for clarity but not correctness — Polars’s optimizer commutes filter and select when it’s free to. Write whichever order reads cleaner.

Pandas equivalence table

For migration from pandas:

Operation Pandas Polars
Filter by single condition df[df["Close"] > 200] df.filter(pl.col("Close") > 200)
Filter by two conditions (AND) df[(df["A"] > 1) & (df["B"] < 5)] df.filter((pl.col("A") > 1) & (pl.col("B") < 5))
Filter equality df[df["Ticker"] == "AAPL"] df.filter(pl.col("Ticker") == "AAPL")
Select columns df[["Date", "Close"]] df.select(["Date", "Close"])
Drop columns df.drop(columns=["Volume"]) df.drop("Volume")
Filter via query df.query("Close > 200") df.filter(pl.col("Close") > 200)

Episode 9 covers the full migration — including the cases where Polars expressions force you to think differently (no loc, no iloc, no chained assignment).

Filter with is_in

For “value matches any of a set” — a finance analyst’s everyday move — use is_in:

tech = df.filter(pl.col("Ticker").is_in(["AAPL", "MSFT", "NVDA", "GOOGL"]))
print(tech.head())
print(tech.shape)

Output:

…
(8040, 8)

Pandas equivalent is df[df["Ticker"].isin([...])]. Polars adds the dot-method to the column expression; same idea.

Filter with between (and date ranges)

For numeric or date ranges, use is_between:

mid_2022 = df.filter(
  pl.col("Date").is_between("2022-06-01", "2022-12-31")
)

Strings parse to dates automatically when the target column is datetime. Pandas needs pd.to_datetime wrapped around the bounds; Polars infers.

Common stumbles

AttributeError: 'Expr' object has no attribute 'iloc'. Polars columns aren’t positional — df["Close"] returns a Series you can .head(), but there’s no .iloc[5]. For positional row access, use df[5] to get a one-row frame, or df.row(5) for a tuple.

InvalidOperationError: cannot compare Utf8 with Float64. You filtered with pl.col("Date") > 200. Wrong column. Polars is strict about types — strings and floats don’t compare. Pandas would silently fall through; Polars raises.

Parentheses around each operand of & / |. Without them the operators bind in the wrong order. Same gotcha as pandas; the diagnostic is TypeError: must be Expr, not bool.

SchemaError after select with a typo. Polars validates column names up front — typos fail loud. Run df.columns once at the top of every session to confirm what you have.

Mixing filter and pandas-style boolean indexing. df[df["Close"] > 200] does NOT work in Polars — the [] accessor takes column names or row indices, not boolean masks. Always use .filter().

What’s next

Episode 3 introduces the daily returns expression — pl.col("Close").pct_change().over("Ticker") — the window function pattern that replaces pandas’s groupby + pct_change. Once you can read that line, you can read every line in this series.

Recap

pl.col("X") is an expression — a query plan node, not a value. df.filter(expr) keeps matching rows; df.select([cols]) keeps named columns. Combine boolean expressions with &, |, ~ — parenthesize each operand. Chain filter and select freely; Polars optimizes the order. is_in for set membership, is_between for ranges. SQL flow, Python syntax, Rust speed. The expression API is the spine of every Polars operation; once it clicks, the rest of the library reads like English.

Next episode: daily returns across 14 tickers in one expression.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.