
Polars for Finance: Read CSV and Parquet in Polars — Python Tutorial

Celest Kim



Polars is a dataframe library written in Rust, backed by Apache Arrow. Three lines get you started: import polars as pl, then df = pl.read_parquet("prices.parquet"), then df.head(). Same shape as pandas, roughly 10× faster on the workloads compared below, and a query API that reads more like SQL than method chaining.

You already know pandas. Or Excel and SQL. This series shows how the same finance work — load prices, filter, group, join, roll, resample — looks in Polars. Same dataset, same questions, different library.

Why bother

| Need | Pandas | Polars |
| --- | --- | --- |
| Load a 1M-row CSV | ~3 s | ~0.3 s |
| Memory for a 1M-row frame | ~120 MB | ~50 MB |
| Compute returns per ticker | groupby + pct_change | one expression with .over("Ticker") |
| Query 10M rows from disk | read everything, then filter | scan_parquet + lazy plan |
| Multi-threaded by default | no | yes |
| Query optimizer | no | yes (predicate + projection pushdown) |

The speed is Rust + Arrow. The ergonomics are the expression API — Polars treats a query like a small program the engine optimizes, not a chain of one-shot operations. By Ep 8 we’ll lean on that hard. Today we just open files.

Setup

Python 3.10+ and three packages.

python3 -m venv .venv
source .venv/bin/activate
pip install polars pyarrow yfinance

pyarrow is optional for Polars itself, which reads and writes parquet with its own engine. Installing it anyway gives you the pyarrow-backed fallback (use_pyarrow=True) and the DuckDB interop used later in the series.

The dataset

Same universe as the Pandas for Finance series — 14 tickers, daily OHLCV, ~2018 onward, snapshotted to data/prices.parquet. If you’re following along from scratch, regenerate it:

python scripts/regenerate-cache.py

That writes data/prices.parquet (~1.2 MB) and data/sector_map.csv (Ep 5). Every episode loads from these — no network during recordings.

Your first script

nvim read_prices.py

Type:

import polars as pl

df = pl.read_parquet("data/prices.parquet")
print(df.head())
print(df.shape)

Save (:wq), run:

python read_prices.py

Output (truncated):

shape: (5, 8)
┌─────────────────────┬────────┬───────┬───────┬───────┬───────┬───────────┬───────────┐
│ Date                ┆ Ticker ┆ Open  ┆ High  ┆ Low   ┆ Close ┆ Adj Close ┆ Volume    │
│ ---                 ┆ ---    ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---       ┆ ---       │
│ datetime[ms]        ┆ str    ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64       ┆ i64       │
╞═════════════════════╪════════╪═══════╪═══════╪═══════╪═══════╪═══════════╪═══════════╡
│ 2018-01-02 00:00:00 ┆ AAPL   ┆ 42.54 ┆ 43.07 ┆ 42.31 ┆ 43.06 ┆ 40.84     ┆ 102223600 │
│ 2018-01-03 00:00:00 ┆ AAPL   ┆ 43.13 ┆ 43.64 ┆ 42.99 ┆ 43.05 ┆ 40.83     ┆ 118071600 │
…
└─────────────────────┴────────┴───────┴───────┴───────┴───────┴───────────┴───────────┘
(28140, 8)

That’s the whole universe. 28,140 rows, 8 columns. The header shows dtypes inline — datetime[ms], str, f64, i64 — which is the first thing Polars does differently from pandas. The schema is visible on every print.

What just happened

import polars as pl

Imports Polars under the alias pl. Convention — every Polars script does this.

df = pl.read_parquet("data/prices.parquet")

Reads the parquet file into a DataFrame. Parquet is a columnar binary format. Polars reads it natively (via Arrow), in parallel, with predicate pushdown if you go lazy later.

print(df.head())

.head() returns the first 5 rows. There’s also .tail(), and df.sample(n=5) for random rows.

print(df.shape)

(rows, cols) tuple. Same as pandas — a quick sanity check.

Reading CSV

The other half of this episode is CSV. Finance data still ships as CSV constantly: broker exports, FRED, regulatory filings. Polars’s CSV reader is one of the fastest you’ll find in Python:

sectors = pl.read_csv("data/sector_map.csv")
print(sectors)

Output:

shape: (14, 2)
┌────────┬────────────────────────┐
│ Ticker ┆ Sector                 │
│ ---    ┆ ---                    │
│ str    ┆ str                    │
╞════════╪════════════════════════╡
│ AAPL   ┆ Technology             │
│ MSFT   ┆ Technology             │
│ GOOGL  ┆ Communication Services │
│ AMZN   ┆ Consumer Discretionary │
…
└────────┴────────────────────────┘

That’s data/sector_map.csv. We’ll join it onto the prices in Ep 5.

pl.read_csv infers schema automatically, but for production-grade scripts you should pin types:

sectors = pl.read_csv(
  "data/sector_map.csv",
  schema={"Ticker": pl.Utf8, "Sector": pl.Utf8},
)

Pinning the schema catches type drift early: if an export starts putting a stray decimal in an integer column, the read fails loudly instead of quietly inferring a different dtype.

Inspecting the frame

Four inspection moves you’ll use every day:

print(df.head(3))         # first 3 rows
print(df.tail(3))         # last 3 rows
print(df.schema)          # column → dtype mapping
print(df.describe())      # summary stats per column

df.schema is the one to internalize. In pandas you read df.dtypes; in Polars the schema is a first-class object you can pass around:

{
  "Date": Datetime(time_unit='ms', time_zone=None),
  "Ticker": String,
  "Open": Float64,
  "High": Float64,
  "Low": Float64,
  "Close": Float64,
  "Adj Close": Float64,
  "Volume": Int64,
}

df.describe() is the same idea as pandas — count, mean, std, min/max, percentiles — but it returns a Polars frame, so you can keep chaining.

Selecting a column

A column in Polars is a Series, same word as pandas:

closes = df["Close"]
print(closes.head())
print(closes.max(), closes.min(), closes.mean())

Output:

shape: (5,)
Series: 'Close' [f64]
[
   43.06
   43.06
   43.26
   43.75
   43.59
]
1146.83 8.78 195.41

Same trio as pandas — max, min, mean. The values mash mega-caps and ETFs together, so Ep 4 is where this gets interesting (per-ticker aggregations).

Why parquet matters

You’ll see most pandas tutorials read CSVs. For real finance work, save and reload as parquet:

|  | CSV | Parquet |
| --- | --- | --- |
| Size on disk (this dataset) | ~1.6 MB | ~1.2 MB |
| Read time | ~80 ms | ~5 ms |
| Schema preserved | no, re-inferred on each read | yes, stored in the file |
| Column-prune on read | no, reads everything | yes, reads only the columns you ask for |
| Compression | manual (gzip wrapper) | built-in (zstd by default when Polars writes the file) |

That column-prune is the big one. With parquet, this:

df = pl.read_parquet("data/prices.parquet", columns=["Date", "Ticker", "Close"])

reads only those three columns from disk. It doesn’t load the rest into memory at all. For a 10 GB tick-data file with 40 columns where you want 3, that’s the difference between 7 GB of memory and 500 MB.

The lazy version (Ep 8) takes it further — adds filter pushdown so even rows are pre-filtered before reading.

Writing parquet

The inverse, when you want to snapshot a derived frame:

df.write_parquet("derived.parquet")
df.write_csv("derived.csv")

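Both are one-liners. write_parquet has compression options (zstd, snappy, gzip, lz4, and a few others). zstd is the Polars default and a sensible choice for almost everything in finance; switch to snappy when the file has to be opened by older parquet readers.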

Why scripts, not notebooks

Same answer as the pandas series: scripts are easier to version, easier to schedule, easier to share. You can drop any of these scripts into a notebook cell and they’ll work. The series stays in python file.py so the runtime is one tool.

What’s coming

Each episode is a small standalone script that solves one finance problem in Polars:

  • Ep 2 — filter rows and select columns using the expression API.
  • Ep 3 — daily and log returns per ticker in one expression.
  • Ep 4 — groupby aggregates: per-ticker stats.
  • Ep 5 — joining the price frame with the sector map.
  • Ep 6 — rolling windows: SMA, Bollinger, volatility.
  • Ep 7 — resampling daily → weekly OHLC → monthly returns.
  • Ep 8 — lazy mode: scanning a 10M-row CSV with filter pushdown.
  • Ep 9 — pandas → Polars migration: the 8 idioms that change.
  • Ep 10 — Polars + DuckDB interop: zero-copy round trips for SQL queries.

Common stumbles

No module named 'polars'. Wrong terminal — venv not activated. source .venv/bin/activate before running.

FileNotFoundError: data/prices.parquet. You’re not in the series root. Run from the folder that contains data/ and scripts/.

Print is too wide. Polars’s table print auto-fits the terminal. If it’s chopping columns, widen the terminal or call pl.Config.set_tbl_cols(20) once at the top of your script.

Pandas instinct: typing df.head with no parens. head is a method, not a property, so df.head on its own hands back the bound method instead of rows. Always call it: df.head().

AttributeError: 'DataFrame' object has no attribute 'iloc'. Polars doesn’t have loc / iloc. Row indexing is df[0:5], column selection is df.select([...]) or df["col"]. Ep 9 covers the full migration.

Recap

Install polars + pyarrow + yfinance. import polars as pl, then pl.read_parquet("file.parquet") or pl.read_csv("file.csv"). The DataFrame prints its schema inline, so dtypes are always in view. df.head(), df.tail(), df.shape, df.schema, df.describe() are the five inspection moves. Parquet beats CSV on size, speed, and column pruning, so save derived frames as parquet by default. Scripts, not notebooks.

Next episode: filter rows and select columns. Polars’s expression API — the thing that makes everything else click.
