Back to Blog

How to Merge CSV Files in Python

Celest KimCelest Kim

Video: How to Merge CSV Files in Python — Tutorial by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

Three CSVs in a folder, one DataFrame out. pd.concat([pd.read_csv(f) for f in glob.glob("prices_*.csv")]) is two lines. pl.scan_csv("prices_*.csv").collect() is one. Either approach takes 28 thousand rows split across three files and stitches them into one frame, in the order they appeared on disk.

If you process broker exports, FRED downloads, or any monthly data drop, you have a folder full of CSVs that need to become one table before you can do anything useful. This tutorial covers the four ways every Python data person solves this — three in libraries you already have installed, one with a built-in module — and which one wins on what.

The shape of the problem

You have files. Same schema across all of them — same columns, same types, just different row sets. The output you want is the row-wise concatenation: stack them top to bottom, keep the column order, ignore the per-file row counts.

Sometimes the files are partitioned by date (prices_2024.csv, prices_2025.csv). Sometimes by entity (prices_techs.csv, prices_etfs.csv). The merge is the same operation either way — concat along the row axis.

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install pandas polars pyarrow

For the demo we’ll work with three CSVs that together cover fourteen tickers, daily OHLCV, ~28 thousand rows — split into 10,050 + 8,040 + 10,050 chunks.

Method 1 — pandas concat (most common)

import glob
import pandas as pd

files = sorted(glob.glob("prices_*.csv"))
merged = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(merged.shape)   # (28140, 8)

Two lines if you count the import. The list comprehension reads each file into its own DataFrame; pd.concat stacks them. ignore_index=True rebuilds the row index from zero — without it, you’d get the original per-file row indexes preserved, leading to duplicate index values like 0, 1, 2, …, 0, 1, 2, ….

For most everyday work, this is the right answer. You probably already have pandas installed; the result is a DataFrame ready for the rest of your pipeline.

Method 2 — Polars concat (faster, smaller memory)

import glob
import polars as pl

files = sorted(glob.glob("prices_*.csv"))
merged = pl.concat([pl.read_csv(f) for f in files])

print(merged.shape)

Same two-line shape, Polars instead of pandas. Polars’s CSV reader is the fastest in Python; pl.concat defaults to vertical stacking. On the same fourteen-ticker dataset, this is roughly five times faster than pandas and uses about a third of the memory.

The output is a pl.DataFrame. If your pipeline is pandas downstream, call .to_pandas() to convert — it’s a near-zero-copy via Arrow.

Method 3 — Polars scan_csv with a glob (the killer feature)

import polars as pl

merged = pl.scan_csv("prices_*.csv").collect()

print(merged.shape)

One line. scan_csv takes a glob directly, builds a lazy plan that knows how to read all matching files in parallel, and .collect() materializes it. No Python-level loop, no concat call, no glob module.

This is the path to reach for at scale. On a folder with hundreds of files, Polars’s parallel reader plus the lazy plan’s predicate-pushdown is dramatically faster than any one-at-a-time approach. And because the plan is lazy, you can chain filters and selects before .collect():

recent_aapl = (
  pl.scan_csv("prices_*.csv")
    .filter(pl.col("Ticker") == "AAPL")
    .filter(pl.col("Date") > "2024-01-01")
    .select(["Date", "Close", "Volume"])
    .collect()
)

Polars reads only the columns and rows you asked for. The merge becomes a non-event — never materialized in full.

Method 4 — Built-in csv module (no dependencies)

If you can’t install anything:

import csv
import glob

files = sorted(glob.glob("prices_*.csv"))
rows = []
header = None

for path in files:
  with open(path) as f:
    reader = csv.reader(f)
    file_header = next(reader)
    if header is None:
      header = file_header
    rows.extend(reader)

with open("merged.csv", "w", newline="") as f:
  writer = csv.writer(f)
  writer.writerow(header)
  writer.writerows(rows)

Verbose, slow, no DataFrame at the end — just a merged CSV file. Use this only when adding pandas to a project isn’t an option (locked-down corporate Python, edge runtime, etc.). For anything else, pandas or Polars wins on every axis.

Which approach when

Scenario Use
Already have pandas, occasional merge pd.concat([pd.read_csv(f) for f in files])
Production pipeline, 10+ files pl.scan_csv("*.csv").collect()
Need to filter while merging pl.scan_csv(...).filter(...).collect()
Output stays as Polars frame pl.concat([pl.read_csv(f) for f in files])
Output goes back to pandas pandas merge, or polars merge + .to_pandas()
No third-party libs allowed csv module, glob, manual write

Sanity check after merging

Always count rows. The number you expect is the sum of per-file row counts.

import os
print("Expected:", sum(sum(1 for _ in open(f)) - 1 for f in files))
print("Actual:  ", merged.shape[0])

The -1 skips the header row in each file. If actual matches expected, the concat ran clean. If not — usually the cause is a column-mismatch on one of the files, which makes pandas emit a warning that’s easy to miss.

Schema drift across files

The common silent bug: file 1 has a Volume column with integer values, file 2 has a stray decimal so its Volume is inferred as float. pd.concat quietly upcasts the merged frame to float; pl.concat defaults to strict mode and raises an error. Catch it with strict mode in pandas too:

import pandas.io.common as pdc
# Force columns to a fixed dtype:
merged = pd.concat(
  [pd.read_csv(f, dtype={"Volume": "int64"}) for f in files],
  ignore_index=True,
)

Better: write the merged frame as parquet (see the prepare-a-parquet-file tutorial) so the schema is stored on disk and the next reload gets the right types.

Pre-merge filtering — read only what you need

If the files are large and you only need a slice:

pandas with usecols + chunked filter:

files = glob.glob("prices_*.csv")
frames = []
for f in files:
  df = pd.read_csv(f, usecols=["Date", "Ticker", "Close"])
  frames.append(df[df["Ticker"] == "AAPL"])
merged = pd.concat(frames, ignore_index=True)

polars with scan_csv (does the same thing automatically via pushdown):

merged = (
  pl.scan_csv("prices_*.csv")
    .filter(pl.col("Ticker") == "AAPL")
    .select(["Date", "Ticker", "Close"])
    .collect()
)

For the same outcome, the Polars version reads less data from disk — the filter pushes into the reader. Useful when “all your CSVs together” is 100 GB and what you actually want is 200 MB.

Different schemas per file

pl.concat and pd.concat both default to “all columns must match”. For frames with different column sets you have two flavors:

  • how="diagonal" (polars) — union of all columns; missing values become null per file.
  • how="vertical_relaxed" (polars) — same as diagonal but also coerces type mismatches.
  • pd.concat (pandas) — defaults to outer-join behavior; missing columns become NaN. Pass join="inner" to keep only shared columns.

For the broker-export scenario where each statement might have a slightly different column set, pl.concat(frames, how="diagonal") is the version you want.

Common stumbles

Header repeated as a data row. You concatenated CSVs by piping cat *.csv > merged.csv and lost track of headers. Always use a library that knows about CSV headers; never cat your way to a merge.

Index column appears as Unnamed: 0. A previous tool wrote out a CSV with df.to_csv(path) instead of df.to_csv(path, index=False). Strip on read: pd.read_csv(f, index_col=0) or pl.read_csv(f).drop("").

Memory error on a folder of large CSVs. Switch to pl.scan_csv(...).sink_parquet("out.parquet") — streams without ever materializing the full frame.

Row order matters and glob.glob returns them in OS order. Wrap with sorted(). On macOS glob is usually sorted; on Linux ext4 it’s not.

Encoding error mid-merge. Some files are UTF-8, some are latin-1. Specify per file: pd.read_csv(f, encoding="latin-1"). Or convert everything to UTF-8 once with a separate script before merging.

Recap

Three CSVs of the same schema merge in one or two lines with any modern library. pd.concat for the pandas-default path. pl.concat for ~5× speed and lower memory. pl.scan_csv("*.csv").collect() for the one-line glob that’s lazy by default and lets you filter before materializing. The plain csv module is the last-resort no-dependency path. Always sanity-check the row count after merging, and watch for schema drift — write the merged frame as parquet so future reloads come back type-correct.

Subscribe to the playlist for more Python answers.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.