Why Is My pandas Loop So Slow? iterrows vs Vectorize
CelesteAI
Why is your pandas loop slow? Because it's a loop. Pandas was built to operate on whole columns at once, not one row at a time. New users see a DataFrame, see something that looks like a spreadsheet, and reach for a loop like for idx, row in df.iterrows(). That is fine on a hundred rows, takes about a quarter of a second on twenty-eight thousand, and at the same per-row cost takes around ten seconds on a million, longer once the loop body does real work. The fix is one line.
This tutorial measures the same calculation three ways on the same 28,000-row OHLC table. First, the textbook anti-pattern, iterrows: one Series per row, the slowest of the three. Second, the right escape hatch, itertuples: named tuples, about ten times faster. Third, vectorize: drop the loop entirely, operate on whole columns, and push the work down to NumPy and from there to C. Same answer, over a thousand times faster.
What You'll Build:
- pandas_loop.py — compute daily return on a 28k-row prices table three ways. Measure each with time.perf_counter. Print the speed ratio at the end.
- The iterrows pattern: for idx, row in df.iterrows(): row["Close"]. Works. Slow. Builds a new Series object for every row and pays that construction cost on every iteration. 278 ms on 28k rows.
- The itertuples upgrade — for row in df.itertuples(index=False): row.Close. Same loop shape, no Series wrapping, ~10x faster. Use it when you genuinely need a row-by-row walk that can't be expressed column-wise.
- The vectorize pattern — df["ret"] = (df["Close"] - df["Open"]) / df["Open"]. No loop. NumPy under the hood, C under that. 0.26 ms on the same 28k rows. Over a thousand times faster than iterrows.
- The mental shift — pandas is for whole-column operations. Boolean masks, arithmetic on Series, pandas built-ins like pct_change and rolling all vectorize automatically. If you find yourself writing a for loop over a DataFrame, there's usually a one-line column expression that replaces it.
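The three approaches above can be sketched as follows. This is a minimal illustration, not the video's actual pandas_loop.py: the table is synthetic random data (the real data source is not shown here), the helper names are made up, and the timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the 28k-row OHLC table (assumption: not the real data).
rng = np.random.default_rng(0)
n = 28_000
open_ = rng.uniform(50.0, 150.0, n)
df = pd.DataFrame({"Open": open_, "Close": open_ * rng.uniform(0.97, 1.03, n)})


def with_iterrows(frame):
    # One Series is constructed per row; slowest of the three.
    return [(row["Close"] - row["Open"]) / row["Open"]
            for _, row in frame.iterrows()]


def with_itertuples(frame):
    # Named tuples instead of Series; same loop shape.
    return [(row.Close - row.Open) / row.Open
            for row in frame.itertuples(index=False)]


def vectorized(frame):
    # Whole-column arithmetic; no Python-level loop.
    return (frame["Close"] - frame["Open"]) / frame["Open"]


def timed(fn, frame):
    # Returns the result and the elapsed wall time in milliseconds.
    start = time.perf_counter()
    result = fn(frame)
    return result, (time.perf_counter() - start) * 1000.0


r1, t1 = timed(with_iterrows, df)
r2, t2 = timed(with_itertuples, df)
r3, t3 = timed(vectorized, df)

print(f"iterrows   {t1:9.2f} ms")
print(f"itertuples {t2:9.2f} ms")
print(f"vectorized {t3:9.2f} ms")
print(f"iterrows / vectorized: {t1 / t3:.0f}x")
```

A single perf_counter measurement is noisy, especially for the sub-millisecond vectorized call; repeating each call a few times and taking the best run gives steadier ratios.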
Timestamps:
0:00 - Intro — Why is your pandas loop slow?
0:22 - Preview — three strategies, same answer
1:07 - Open pandas_loop.py in nvim
1:23 - Method 1 — iterrows, the textbook anti-pattern
2:01 - Method 2 — itertuples, the right escape hatch
2:31 - Method 3 — vectorize, the answer
3:18 - Save and run
3:26 - 278 ms → 25 ms → 0.26 ms (1000× speedup)
3:51 - End screen — recap
Key Takeaways:
1. iterrows is almost always the wrong tool. It builds a new Series for every row, and that per-row construction cost dominates the runtime. On a 28k-row OHLC table, iterrows takes about 278 milliseconds. The same calculation vectorized takes 0.26 milliseconds. There is almost always a column-expression alternative.
2. itertuples is the right escape hatch when you genuinely need to walk row-by-row — a stateful sequence, a path-dependent calculation, something that truly can't be expressed column-wise. It returns named tuples, skips the Series allocation, and runs roughly 10 times faster than iterrows. About 25 ms for the same 28k-row job.
3. Vectorize first. Column arithmetic — addition, subtraction, multiplication, division — runs in NumPy, which runs in C. Boolean masks, .pct_change, .rolling, .shift all do the same. The one-line vectorized version of "daily return" runs in 0.26 ms on 28k rows. Over a thousand times faster than the loop you would have written.
4. Treat the row loop as a red flag. If a teammate's pandas script has a for loop walking rows, that's the first place to look for a 100x to 1000x speedup. The fix is usually swapping the loop body for a column expression. The line count drops too.
5. The same applies to .apply with axis=1: it's a loop in disguise, calling your function once per row. It is usually somewhat faster than iterrows and much slower than a truly vectorized expression. When you can rewrite df.apply(f, axis=1) as a column expression, you should.
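A short sketch of takeaways 2, 3 and 5, using a tiny made-up prices table. The buy/sell rule at the end is a hypothetical example of a path-dependent calculation, chosen only to show where a row walk with itertuples is reasonable; it does not come from the video.

```python
import numpy as np
import pandas as pd

# Tiny made-up table, for illustration only.
df = pd.DataFrame({
    "Open":  [100.0, 102.0, 101.0, 105.0],
    "Close": [102.0, 101.0, 105.0, 104.0],
})

# A loop in disguise: the lambda is called once per row.
slow = df.apply(lambda row: (row["Close"] - row["Open"]) / row["Open"], axis=1)

# Column expression: same values, no per-row Python call.
fast = (df["Close"] - df["Open"]) / df["Open"]

# Built-ins that already operate on whole columns.
up_days = df[df["Close"] > df["Open"]]     # boolean mask
close_to_close = df["Close"].pct_change()  # change vs. the previous row
prev_close = df["Close"].shift(1)          # previous row's Close
avg2 = df["Close"].rolling(2).mean()       # 2-row moving average

# Path-dependent case (hypothetical rule): buy on an up day when flat,
# sell when Close drops below the entry price. Each decision depends on
# state carried over from earlier rows, so a row walk is a fair choice.
entry = None
signals = []
for row in df.itertuples(index=False):
    if entry is None and row.Close > row.Open:
        entry = row.Close
        signals.append("buy")
    elif entry is not None and row.Close < entry:
        entry = None
        signals.append("sell")
    else:
        signals.append("hold")

print(signals)
```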
This channel is run by Claude AI. Tutorials AI-produced; reviewed and published by Codegiz. Source code at codegiz.com.
#Python #Pandas #DataAnalytics #Performance #Vectorize #DataEngineering #PythonTutorial #LearnPython #DataFrame #pandasspeed
---
Generated by Claude AI · part of the Common Questions in Python series