How to Merge CSV Files in Python
Video: How to Merge CSV Files in Python — Tutorial by CelesteAI
Three CSVs in a folder, one DataFrame out.
pd.concat([pd.read_csv(f) for f in glob.glob("prices_*.csv")])is two lines.pl.scan_csv("prices_*.csv").collect()is one. Either approach takes 28 thousand rows split across three files and stitches them into one frame, in the order they appeared on disk.
If you process broker exports, FRED downloads, or any monthly data drop, you have a folder full of CSVs that need to become one table before you can do anything useful. This tutorial covers the four ways every Python data person solves this — three in libraries you already have installed, one with a built-in module — and which one wins on what.
The shape of the problem
You have files. Same schema across all of them — same columns, same types, just different row sets. The output you want is the row-wise concatenation: stack them top to bottom, keep the column order, ignore the per-file row counts.
Sometimes the files are partitioned by date (prices_2024.csv, prices_2025.csv). Sometimes by entity (prices_techs.csv, prices_etfs.csv). The merge is the same operation either way — concat along the row axis.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install pandas polars pyarrow
For the demo we’ll work with three CSVs that together cover fourteen tickers, daily OHLCV, ~28 thousand rows — split into 10,050 + 8,040 + 10,050 chunks.
Method 1 — pandas concat (most common)
import glob
import pandas as pd
files = sorted(glob.glob("prices_*.csv"))
merged = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(merged.shape) # (28140, 8)
Two lines if you count the import. The list comprehension reads each file into its own DataFrame; pd.concat stacks them. ignore_index=True rebuilds the row index from zero — without it, you’d get the original per-file row indexes preserved, leading to duplicate index values like 0, 1, 2, …, 0, 1, 2, ….
For most everyday work, this is the right answer. You probably already have pandas installed; the result is a DataFrame ready for the rest of your pipeline.
Method 2 — Polars concat (faster, smaller memory)
import glob
import polars as pl
files = sorted(glob.glob("prices_*.csv"))
merged = pl.concat([pl.read_csv(f) for f in files])
print(merged.shape)
Same two-line shape, Polars instead of pandas. Polars’s CSV reader is the fastest in Python; pl.concat defaults to vertical stacking. On the same fourteen-ticker dataset, this is roughly five times faster than pandas and uses about a third of the memory.
The output is a pl.DataFrame. If your pipeline is pandas downstream, call .to_pandas() to convert — it’s a near-zero-copy via Arrow.
Method 3 — Polars scan_csv with a glob (the killer feature)
import polars as pl
merged = pl.scan_csv("prices_*.csv").collect()
print(merged.shape)
One line. scan_csv takes a glob directly, builds a lazy plan that knows how to read all matching files in parallel, and .collect() materializes it. No Python-level loop, no concat call, no glob module.
This is the path to reach for at scale. On a folder with hundreds of files, Polars’s parallel reader plus the lazy plan’s predicate-pushdown is dramatically faster than any one-at-a-time approach. And because the plan is lazy, you can chain filters and selects before .collect():
recent_aapl = (
pl.scan_csv("prices_*.csv")
.filter(pl.col("Ticker") == "AAPL")
.filter(pl.col("Date") > "2024-01-01")
.select(["Date", "Close", "Volume"])
.collect()
)
Polars reads only the columns and rows you asked for. The merge becomes a non-event — never materialized in full.
Method 4 — Built-in csv module (no dependencies)
If you can’t install anything:
import csv
import glob
files = sorted(glob.glob("prices_*.csv"))
rows = []
header = None
for path in files:
with open(path) as f:
reader = csv.reader(f)
file_header = next(reader)
if header is None:
header = file_header
rows.extend(reader)
with open("merged.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
writer.writerows(rows)
Verbose, slow, no DataFrame at the end — just a merged CSV file. Use this only when adding pandas to a project isn’t an option (locked-down corporate Python, edge runtime, etc.). For anything else, pandas or Polars wins on every axis.
Which approach when
| Scenario | Use |
|---|---|
| Already have pandas, occasional merge | pd.concat([pd.read_csv(f) for f in files]) |
| Production pipeline, 10+ files | pl.scan_csv("*.csv").collect() |
| Need to filter while merging | pl.scan_csv(...).filter(...).collect() |
| Output stays as Polars frame | pl.concat([pl.read_csv(f) for f in files]) |
| Output goes back to pandas | pandas merge, or polars merge + .to_pandas() |
| No third-party libs allowed | csv module, glob, manual write |
Sanity check after merging
Always count rows. The number you expect is the sum of per-file row counts.
import os
print("Expected:", sum(sum(1 for _ in open(f)) - 1 for f in files))
print("Actual: ", merged.shape[0])
The -1 skips the header row in each file. If actual matches expected, the concat ran clean. If not — usually the cause is a column-mismatch on one of the files, which makes pandas emit a warning that’s easy to miss.
Schema drift across files
The common silent bug: file 1 has a Volume column with integer values, file 2 has a stray decimal so its Volume is inferred as float. pd.concat quietly upcasts the merged frame to float; pl.concat defaults to strict mode and raises an error. Catch it with strict mode in pandas too:
import pandas.io.common as pdc
# Force columns to a fixed dtype:
merged = pd.concat(
[pd.read_csv(f, dtype={"Volume": "int64"}) for f in files],
ignore_index=True,
)
Better: write the merged frame as parquet (see the prepare-a-parquet-file tutorial) so the schema is stored on disk and the next reload gets the right types.
Pre-merge filtering — read only what you need
If the files are large and you only need a slice:
pandas with usecols + chunked filter:
files = glob.glob("prices_*.csv")
frames = []
for f in files:
df = pd.read_csv(f, usecols=["Date", "Ticker", "Close"])
frames.append(df[df["Ticker"] == "AAPL"])
merged = pd.concat(frames, ignore_index=True)
polars with scan_csv (does the same thing automatically via pushdown):
merged = (
pl.scan_csv("prices_*.csv")
.filter(pl.col("Ticker") == "AAPL")
.select(["Date", "Ticker", "Close"])
.collect()
)
For the same outcome, the Polars version reads less data from disk — the filter pushes into the reader. Useful when “all your CSVs together” is 100 GB and what you actually want is 200 MB.
Different schemas per file
pl.concat and pd.concat both default to “all columns must match”. For frames with different column sets you have two flavors:
how="diagonal"(polars) — union of all columns; missing values become null per file.how="vertical_relaxed"(polars) — same as diagonal but also coerces type mismatches.pd.concat(pandas) — defaults to outer-join behavior; missing columns become NaN. Passjoin="inner"to keep only shared columns.
For the broker-export scenario where each statement might have a slightly different column set, pl.concat(frames, how="diagonal") is the version you want.
Common stumbles
Header repeated as a data row. You concatenated CSVs by piping cat *.csv > merged.csv and lost track of headers. Always use a library that knows about CSV headers; never cat your way to a merge.
Index column appears as Unnamed: 0. A previous tool wrote out a CSV with df.to_csv(path) instead of df.to_csv(path, index=False). Strip on read: pd.read_csv(f, index_col=0) or pl.read_csv(f).drop("").
Memory error on a folder of large CSVs. Switch to pl.scan_csv(...).sink_parquet("out.parquet") — streams without ever materializing the full frame.
Row order matters and glob.glob returns them in OS order. Wrap with sorted(). On macOS glob is usually sorted; on Linux ext4 it’s not.
Encoding error mid-merge. Some files are UTF-8, some are latin-1. Specify per file: pd.read_csv(f, encoding="latin-1"). Or convert everything to UTF-8 once with a separate script before merging.
Recap
Three CSVs of the same schema merge in one or two lines with any modern library. pd.concat for the pandas-default path. pl.concat for ~5× speed and lower memory. pl.scan_csv("*.csv").collect() for the one-line glob that’s lazy by default and lets you filter before materializing. The plain csv module is the last-resort no-dependency path. Always sanity-check the row count after merging, and watch for schema drift — write the merged frame as parquet so future reloads come back type-correct.
Subscribe to the playlist for more Python answers.