How to Merge CSV Files in Python — Tutorial

CelesteAI
Description
You have a folder of CSV files: broker exports, monthly data drops, partitioned dumps. They all share one schema and differ only in their rows. The output you want is the row-wise concatenation: stack them top to bottom, keep the column order, and end up with one DataFrame.

This tutorial shows four ways to do that on real data: three CSV files covering fourteen tickers and about twenty-eight thousand rows of OHLCV, merged into one frame.

Method one is pd.concat: two lines, the default for anyone already using pandas. Method two is pl.concat, with the same shape but about five times faster on the same input. Method three is the killer feature: pl.scan_csv with a glob expression. It is one line, lazy by default, and supports filter pushdown, so you can grab just the AAPL rows from the entire folder without ever materializing the full frame. Method four is the built-in csv module for the locked-down, no-dependencies scenario: verbose, slow, included for completeness.

What You'll Build:
- merge_csv.py: three CSV files in a folder, four ways to merge them (pandas concat, Polars concat, Polars scan_csv with a glob, and a plain csv module fallback). Each method prints its output shape so you can see all four arrive at the same 28140 x 8 frame.
- The pd.concat idiom: a list comprehension over the glob results, then concat with ignore_index. Two lines if you count the import.
- The pl.concat alternative: same shape, faster execution, lower memory footprint. Add to_pandas if your downstream is pandas.
- pl.scan_csv with a glob expression: the one-line merge. Lazy by default. Chain filter and select before collect to read only what you need from disk.
- The schema-drift gotcha and how each library handles it. Polars defaults to strict mode and raises on a type mismatch; pandas silently upcasts. The fix is the same in both: write the merged frame as parquet so the schema survives reload.
- The how="diagonal" option in pl.concat for files with different column sets, such as broker statements where each month might have a slightly different schema.

Timestamps:
0:00 - Intro: folder of CSVs, one DataFrame out
0:18 - Preview: four methods, three favorites
0:54 - Open merge_csv.py in nvim
1:14 - glob the files
1:30 - Method 1: pd.concat
1:54 - Method 2: pl.concat
2:20 - Method 3: pl.scan_csv with a glob
2:50 - Save and run
3:08 - All three arrive at 28140 x 8
3:36 - End screen: recap and what's next

Key Takeaways:
1. The three modern methods all arrive at the same result on a clean schema. pd.concat is the pandas-default path most analysts will reach for. pl.concat gives the same shape with about five times the speed and a third of the memory. pl.scan_csv with a glob is the one-line lazy version that supports filter pushdown. Pick based on what is already in your pipeline; do not add a library just for this.
2. Always sanity-check the row count after merging. Sum the per-file line counts, subtract one per file for the header, and compare the total against the merged frame's row count. If the numbers do not match, one of the input CSVs has a schema mismatch that the merger silently coerced or dropped.
3. Schema drift is the common silent bug. File one has an integer column, file two has a stray decimal so the same column is inferred as float, and pandas quietly upcasts the merged frame. Polars defaults to strict mode and raises on the mismatch, which is the safer behavior. The durable fix in either library is to write the merged frame as parquet so the schema is stored in the file and the next reload comes back type-correct.
4. pl.scan_csv with a glob is the one to learn first if you do this often. It builds a lazy plan that reads every matching file in parallel, supports filter and column pushdown so the reader only touches the bytes you actually need, and produces the same DataFrame as the other methods when you call collect. On hundreds of files it is dramatically faster than any one-at-a-time approach.
5. For files with different schemas, use how="diagonal" in pl.concat or join="outer" in pd.concat (outer is already the pandas default). Both produce the union of columns, with nulls where a column was missing in a given file. The broker-export scenario, where each monthly statement has a slightly different column set, is exactly what this option is for.

This channel is run by Claude AI. Tutorials AI-produced; reviewed and published by Codegiz. Source code at codegiz.com.

#Python #CSV #DataEngineering #Pandas #Polars #PyArrow #DataAnalytics #PythonTutorial #LearnPython #ConcatCSV

---
Generated by Claude AI · part of the Common Questions in Python series
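
The csv module fallback and the row-count sanity check described above can be sketched with the standard library alone. This is a minimal illustration and not the merge_csv.py from the video: the function names, the file names, and the three tiny demo files are hypothetical, and it assumes every file has an identical header row and no newlines embedded inside quoted fields.

```python
import csv
import glob
import os
import tempfile


def merge_csv_files(paths):
    """Stack same-schema CSV files row-wise using only the standard library."""
    header = None
    rows = []
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                # Assumption: identical columns in identical order across files.
                raise ValueError(f"{path}: header differs from the first file")
            rows.extend(reader)
    return header, rows


def expected_row_count(paths):
    """Sum the per-file line counts, minus one header line per file."""
    total = 0
    for path in paths:
        with open(path, newline="") as f:
            total += sum(1 for _ in f) - 1
    return total


# Hypothetical demo data: three tiny files that share one header.
with tempfile.TemporaryDirectory() as folder:
    for i, ticker in enumerate(["AAPL", "MSFT", "NVDA"]):
        with open(os.path.join(folder, f"part_{i}.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["date", "ticker", "close"])
            writer.writerow(["2024-01-02", ticker, "100.0"])
            writer.writerow(["2024-01-03", ticker, "101.0"])

    # Sort the glob results so the merge order is deterministic.
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    header, rows = merge_csv_files(paths)
    expected = expected_row_count(paths)

# Sanity check: the merged row count must equal the summed per-file counts.
assert len(rows) == expected
print(len(rows), "x", len(header))
```

Every value stays a string here because the csv module does no type inference, which is part of why this path is the verbose one. With pandas available, the same merge is the idiom the description names: pd.concat over a list comprehension of pd.read_csv calls, with ignore_index=True.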

Added to Codegiz

May 19, 2026
