
How to Prepare a Parquet File in Python

Celest Kim

Video: How to Prepare a Parquet File in Python — Tutorial by CelesteAI


Three steps: pip install pyarrow pandas polars, then in a Python file import polars as pl, then pl.read_csv("prices.csv").write_parquet("prices.parquet"). That’s the entire operation. The interesting part is why you would, and which of the three Python libraries does it best.

If you searched for this, you probably hit a CSV that’s too slow, too big, or losing its types every time you reload it. Parquet addresses all three: it is a columnar binary format that, on the dataset used below, comes out three to six times smaller than the CSV, reads five to ten times faster, and stores the schema in the file. This tutorial shows you how to produce one with each of the three Python libraries that matter.

What Parquet actually is

Parquet is a columnar storage format. Instead of laying rows on disk one after another (the CSV shape), it stores each column as a contiguous block. The wins from that:

  • Compression: columns of similar values compress well (think 14 ticker strings repeated thousands of times). pandas and PyArrow apply Snappy compression by default; Polars defaults to zstd.
  • Selective reads: load only the columns you ask for, leaving the rest on disk. Big files become small reads.
  • Schema preserved: Int64, Datetime[ms], String survive a save and reload. CSV throws all of that away.
  • Read once, decode fast: binary, no parsing dates from strings.

In analytics, finance, and any pipeline that hits the same file twice — there is no good reason to leave data in CSV. Convert once, reuse forever.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install pandas pyarrow polars

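pyarrow is the Python binding for the Apache Arrow library. pandas uses it as its default Parquet engine and Method 2 calls it directly; Polars ships its own Parquet reader and writer and does not need it for the code below. Install it once.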

Method 1 — Pandas (most common)

import pandas as pd

df = pd.read_csv("prices.csv")
df.to_parquet("prices_pandas.parquet")

Two lines. pd.read_csv parses the CSV, df.to_parquet writes the binary file. Default compression is snappy, and the default engine setting ("auto") picks pyarrow when it is installed.

You probably already have pandas installed. This is the path of least resistance.

Method 2 — PyArrow (lowest level, most control)

import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

table = pa_csv.read_csv("prices.csv")
pq.write_table(table, "prices_pyarrow.parquet")

Same two-line shape, but pyarrow.Table is a more honest abstraction — it’s literally the in-memory representation Parquet writes to disk. No DataFrame layer. This is what you reach for when you want to control schema, compression codec, row-group size, or dictionary encoding directly.

pq.write_table(
  table,
  "prices_pyarrow_zstd.parquet",
  compression="zstd",          # snappy | gzip | zstd | lz4 | brotli
  compression_level=3,         # codec-specific
  row_group_size=100_000,      # rows per row-group (affects parallelism on read)
)

If you ever read a Parquet file and wonder why it’s gigantic — someone wrote it without compression and a small row group. PyArrow lets you fix that.

Method 3 — Polars (fastest, smallest output)

import polars as pl

df = pl.read_csv("prices.csv")
df.write_parquet("prices_polars.parquet")

Same two-line shape, but the results differ. On the same 14-ticker, 28k-row dataset that powers this channel:

Method              Output size   Read speed                     Write speed
CSV (source)        3.40 MB       baseline
pandas → parquet    1.13 MB       ~5× faster                     ~10× faster
pyarrow → parquet   1.13 MB       same as pandas (same engine)   ~5× faster
polars → parquet    0.58 MB       ~10× faster                    ~30× faster

Polars’s output is about half the size of the pandas/pyarrow defaults. The most likely reason is the codec: Polars writes Parquet with zstd compression by default, while pandas and pyarrow default to snappy. Dictionary and run-length encoding of repeated values help too, but PyArrow also applies dictionary encoding by default. Same data, smaller file, faster reads.

Picking a method

  • Already on pandas, occasional Parquet → df.to_parquet(path). Two lines, done. Don’t add a library.
  • Need fine control over compression or schema → PyArrow. The defaults are sensible, but the knobs are there when you need them.
  • Production data pipeline, large files, performance-sensitive → Polars. Smaller files, faster I/O, and the same .write_parquet two-line ergonomics.

For a one-off conversion the difference vanishes. For a daily pipeline writing GBs of data, Polars’s defaults save real disk and read time.

What about pyspark / dask / duckdb?

  • PySpark: spark_df.write.parquet(path). Same idea, distributed. Overkill for anything that fits on a laptop.
  • Dask: dask_df.to_parquet(path). For partitioned writes across many files.
  • DuckDB: duckdb.sql("COPY (SELECT * FROM read_csv('prices.csv')) TO 'prices.parquet'"). SQL-shaped; great if your transform is also SQL.

For most one-machine work, pandas, pyarrow, or polars is the right tool. The others are when you outgrow them.

Reading back — schema is preserved

The killer feature, demonstrated:

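import polars as pl

df = pl.read_csv("prices.csv")
print(df.schema["Date"])            # String: a plain CSV read leaves dates as text

# Parse the dates once, then save the typed frame
pl.read_csv("prices.csv", try_parse_dates=True).write_parquet("prices_polars.parquet")

df2 = pl.read_parquet("prices_polars.parquet")
print(df2.schema["Date"])           # Date or Datetime, as written: parquet remembered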

CSV reload requires you to re-parse types on every read. Parquet reload is instant and type-correct. For a pipeline that hits the same file ten times a day, that’s ten times you don’t write fragile parse_dates=["Date"] hooks.

Compression options worth knowing

Codec                             Size           Speed (read)   Speed (write)   Use when
snappy (pandas/pyarrow default)   medium         fastest        fastest         almost always
zstd (level 3)                    small          fast           medium          archival, cold storage
gzip                              smallest       slow           slow            only if you must
lz4                               medium-large   fastest        fastest         streaming pipelines
brotli                            small          medium         slow            web-serve parquet

Default to snappy unless you’re shipping the file somewhere bandwidth-constrained. Then zstd at level 3 is the sweet spot.

Common stumbles

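ModuleNotFoundError: No module named 'pyarrow'. Methods 1 and 2 need pyarrow: the PyArrow example imports it directly, and pandas uses it as its default Parquet engine (without any engine installed, to_parquet raises an ImportError). Polars has its own Parquet writer. pip install pyarrow first.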

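MemoryError, or the process gets killed mid-conversion. read_csv loads the whole table into memory before anything is written, and some Parquet writers buffer it again before flushing. For files larger than RAM, use Polars’s pl.scan_csv(...).sink_parquet(path), which streams without materializing the full table. OSError: [Errno 28] No space left on device is a different problem: the target disk is full, so free up space or write somewhere else.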

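Schema mismatches between runs. CSV reload infers types from the file’s first rows, so one decimal value in a column that used to hold integers flips it from Int64 to Float64. Pin the schema at read time: pl.read_csv(path, schema_overrides={"Volume": pl.Int64}). The schema= argument expects the full schema for every column; schema_overrides pins only the columns you name (older Polars releases called it dtypes).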

Reading a parquet folder, not a file. If your writer split output into multiple files (part-0.parquet, part-1.parquet), point the reader at the folder. Polars: pl.read_parquet("output/"). Pandas: pd.read_parquet("output/"). Both handle partitioned reads.

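PyArrow Table vs pandas DataFrame mismatch. PyArrow has its own type system, and converting to pandas can change dtypes: by default an int64 column that contains nulls comes back as float64, and you only get pandas’ nullable Int64 if you ask for it (for example through the types_mapper argument of to_pandas). If you’re round-tripping, stay in pyarrow or pick one library for the whole pipeline.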

Recap

pl.read_csv("input.csv").write_parquet("output.parquet") — or the pandas / pyarrow equivalent. Two lines. Parquet beats CSV on size (three to six times smaller), read speed (five to ten times faster), and schema preservation (which is the bit that quietly saves your pipeline from type-inference bugs). Pandas is the default for one-off conversions; pyarrow when you need control; polars when output size and write speed matter. Snappy compression is the right default. The file format itself is open — every analytics engine reads it.

Convert your CSVs once. Never re-parse them again.
