How to Prepare a Parquet File in Python
Video: How to Prepare a Parquet File in Python — Tutorial by CelesteAI
Three commands.
`pip install pyarrow pandas polars`, then in a Python file: `pl.read_csv("prices.csv").write_parquet("prices.parquet")`. That’s the entire operation. The interesting part is why you would, and which of the three Python libraries does it best.
If you searched for this, you probably hit a CSV that’s too slow, too big, or losing its types every time you reload it. Parquet addresses all three: it is a columnar binary format that typically compresses several times smaller than CSV (three to six times on the dataset used below), reads several times faster, and stores the schema in the file. This tutorial shows you how to produce one with each of the three Python libraries that matter.
What Parquet actually is
Parquet is a columnar storage format. Instead of laying rows on disk one after another (the CSV shape), it stores each column as a contiguous block. The wins from that:
- Compression: columns of similar values compress well (think 14 ticker strings repeated thousands of times). pandas and pyarrow apply snappy compression by default; Polars defaults to zstd.
- Selective reads: load only the columns you ask for, leaving the rest on disk. Big files become small reads.
- Schema preserved: `Int64`, `Datetime[ms]`, and `String` survive a save and reload. CSV throws all of that away.
- Read once, decode fast: the format is binary, so there is no parsing of dates from strings.
In analytics, finance, or any pipeline that reads the same file more than once, there is rarely a good reason to leave data in CSV. Convert once, reuse forever.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install pandas pyarrow polars
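pyarrow is the Python binding for Apache Arrow. pandas uses it as its default Parquet engine and Method 2 calls it directly; Polars ships its own native Parquet reader and writer, so it does not need pyarrow for this conversion. Install it once anyway.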
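If you do not have a `prices.csv` to follow along with, a small stand-in can be generated with the standard library. This is only a sketch: the tickers, the column names (`Date`, `Ticker`, `Close`, `Volume`) and the random-walk prices are all invented for illustration and are not the dataset used in the benchmarks.

```python
import csv
import random
from datetime import date, timedelta

random.seed(0)  # reproducible fake prices

tickers = ["AAA", "BBB", "CCC"]  # hypothetical symbols
start = date(2024, 1, 1)

rows = []
for ticker in tickers:
    price = 100.0
    for offset in range(30):
        # Small random walk so the numbers look vaguely like prices
        price = round(price * (1 + random.uniform(-0.02, 0.02)), 2)
        rows.append({
            "Date": (start + timedelta(days=offset)).isoformat(),
            "Ticker": ticker,
            "Close": price,
            "Volume": random.randint(1_000, 50_000),
        })

# newline="" lets the csv module control line endings on every platform
with open("prices.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Date", "Ticker", "Close", "Volume"])
    writer.writeheader()
    writer.writerows(rows)

print(f"wrote {len(rows)} rows to prices.csv")  # wrote 90 rows to prices.csv
```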
Method 1 — Pandas (most common)
import pandas as pd
df = pd.read_csv("prices.csv")
df.to_parquet("prices_pandas.parquet")
Two lines. `pd.read_csv` parses the CSV, `df.to_parquet` writes the binary file. Default compression is snappy, and the default engine setting (`"auto"`) uses pyarrow when it is installed, falling back to fastparquet.
You probably already have pandas installed. This is the path of least resistance.
Method 2 — PyArrow (lowest level, most control)
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq
table = pa_csv.read_csv("prices.csv")
pq.write_table(table, "prices_pyarrow.parquet")
Same two-line shape, but `pyarrow.Table` sits closer to the file format: it is a columnar in-memory structure that maps almost directly onto what Parquet stores on disk, with no DataFrame layer in between. This is what you reach for when you want to control schema, compression codec, row-group size, or dictionary encoding directly.
pq.write_table(
    table,
    "prices_pyarrow_zstd.parquet",
    compression="zstd",        # snappy | gzip | zstd | lz4 | brotli
    compression_level=3,       # codec-specific
    row_group_size=100_000,    # rows per row-group (affects parallelism on read)
)
If you ever read a Parquet file and wonder why it’s gigantic, a likely cause is that someone wrote it without compression or with very small row groups. PyArrow lets you fix that.
Method 3 — Polars (fastest, smallest output)
import polars as pl
df = pl.read_csv("prices.csv")
df.write_parquet("prices_polars.parquet")
Same two-line shape. On the 14-ticker, 28k-row dataset that powers this channel, the methods compare as follows:
| Method | Output size | Read speed | Write speed |
|---|---|---|---|
| CSV (source) | 3.40 MB | baseline | — |
| pandas → parquet | 1.13 MB | ~5× faster | ~10× faster |
| pyarrow → parquet | 1.13 MB | same as pandas (same engine) | ~5× faster |
| polars → parquet | 0.58 MB | ~10× faster | ~30× faster |
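The smaller Polars output most likely comes from its default codec: `write_parquet` compresses with zstd unless told otherwise, while pandas and pyarrow default to snappy. Both writers already apply dictionary and run-length encoding to repeated values, so to compare the writers rather than the codecs, set the same compression in all three. Same data, smaller file, faster reads.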
Picking a method
- Already on pandas, occasional Parquet → `df.to_parquet(path)`. Two lines, done. Don’t add a library.
- Need fine control over compression or schema → PyArrow. The defaults are sensible, but the knobs are there when you need them.
- Production data pipeline, large files, performance-sensitive → Polars. Smaller files, faster I/O, and the same `.write_parquet` two-line ergonomics.
For a one-off conversion the difference vanishes. For a daily pipeline writing GBs of data, Polars’s defaults save real disk and read time.
What about pyspark / dask / duckdb?
- PySpark: `spark_df.write.parquet(path)`. Same idea, distributed. Overkill for anything that fits on a laptop.
- Dask: `dask_df.to_parquet(path)`. For partitioned writes across many files.
- DuckDB: `duckdb.sql("COPY (SELECT * FROM read_csv('prices.csv')) TO 'prices.parquet'")`. SQL-shaped; great if your transform is also SQL.
For most one-machine work, pandas, pyarrow, or polars is the right tool. The others are for when you outgrow them.
Reading back — schema is preserved
The killer feature, demonstrated:
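import polars as pl
df = pl.read_csv("prices.csv")
print(df.schema["Date"])  # String: a plain CSV read leaves the dates as text
# Parse the dates once, then write; the file records the parsed type
pl.read_csv("prices.csv", try_parse_dates=True).write_parquet("prices_polars.parquet")
df2 = pl.read_parquet("prices_polars.parquet")
print(df2.schema["Date"])  # Date or Datetime, depending on the values: Parquet remembered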
CSV reload requires you to re-parse types on every read. Parquet reload is fast and returns the types the file was written with, so parse the dates once before writing. For a pipeline that hits the same file ten times a day, that’s ten times you don’t write fragile `parse_dates=["Date"]` hooks.
Compression options worth knowing
| Codec | Size | Speed (read) | Speed (write) | Use when |
|---|---|---|---|---|
| snappy (pandas/pyarrow default) | medium | fastest | fastest | almost always |
| zstd (level 3) | small | fast | medium | archival, cold storage |
| gzip | smallest | slow | slow | only if you must |
| lz4 | medium-large | fastest | fastest | streaming pipelines |
| brotli | small | medium | slow | web-serve parquet |
Default to snappy unless you’re shipping the file somewhere bandwidth-constrained. Then zstd at level 3 is the sweet spot.
Common stumbles
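ModuleNotFoundError: No module named 'pyarrow'. Method 2 imports pyarrow directly, and pandas needs a Parquet engine (pyarrow or fastparquet) behind `to_parquet` and `read_parquet`. Polars has its own Parquet reader and writer and needs pyarrow only for optional interop. Run `pip install pyarrow` first.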
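OSError: [Errno 28] No space left on device. This one means the disk is full, so free some space or write to another volume. A related failure is running out of memory: the eager `read_csv` path holds the whole table in RAM before anything is written. For files larger than RAM, use Polars’s `pl.scan_csv(...).sink_parquet(path)`, which streams the data instead of materializing it.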
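Schema mismatches between runs. CSV readers infer types from the data (Polars samples the first rows by default), so a new row with a single decimal value can flip an Int64 column to Float64. Pin the type at read time: `pl.read_csv(path, schema_overrides={"Volume": pl.Int64})`. The `schema` argument expects the complete schema for every column, while `schema_overrides` (called `dtypes` in older Polars releases) pins only the columns you name.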
Reading a parquet folder, not a file. If your writer split output into multiple files (part-0.parquet, part-1.parquet), point the reader at the folder. Polars: `pl.read_parquet("output/")`, or the glob `pl.read_parquet("output/*.parquet")` if your version does not accept a directory. Pandas: `pd.read_parquet("output/")`. Both handle partitioned reads.
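PyArrow Table vs pandas DataFrame mismatch. PyArrow has its own type system, so converting back to pandas can change dtypes: by default an int64 column that contains nulls comes back as float64, and it becomes nullable Int64 only if you ask for that mapping. If you’re round-tripping, stay in pyarrow or pick one library for the whole pipeline.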
Recap
`pl.read_csv("input.csv").write_parquet("output.parquet")`, or the pandas / pyarrow equivalent. Two lines. On the dataset above, Parquet beats CSV on size (three to six times smaller), read speed (five to ten times faster), and schema preservation (which is the bit that quietly saves your pipeline from type-inference bugs). Pandas is the default for one-off conversions; pyarrow when you need control; polars when output size and write speed matter. Snappy compression is the right default. The file format itself is open, and most analytics engines read it.
Convert your CSVs once. Never re-parse them again.