The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2
CelesteAI
Description
Episode 2 of *Pandas for Finance*. The chapter that makes pandas click.
A DataFrame is three layers: a labeled **index**, named **columns**, and the values in between. Each column is a **Series**, a one-dimensional labeled array. Operations are **vectorized**: you touch a whole column at once and pandas runs the math at C speed. Get this picture in your head and the rest of the library falls into place.
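To make the three layers concrete, here is a minimal sketch on a tiny hand-made frame. The prices are invented for illustration and are not real market data; the variable names are hypothetical too.

```python
import pandas as pd

# Hypothetical prices, made up for illustration (not real AAPL data).
dates = pd.to_datetime(["2024-02-01", "2024-02-02", "2024-02-05"])
df = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0], "Close": [10.5, 10.8, 12.4]},
    index=dates,
)

index_layer = df.index        # row labels: a DatetimeIndex
column_layer = df.columns     # column names
value_layer = df.to_numpy()   # the 2D block of values in between

close = df["Close"]                 # one column is a Series
change = df["Close"] - df["Open"]   # vectorized: whole columns at once

print(type(index_layer).__name__)
print(column_layer.tolist())
print(value_layer.shape)
print(close.index.equals(df.index))  # the Series shares the parent's index
```

The last line checks that the Series carries the same date labels as the DataFrame it came from, which is what lets pandas align columns row by row without a loop.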
What You'll Build:
- A `mental_model.py` script that prints five views of one DataFrame: shape, columns, a Series, a vectorized derived column, and a boolean filter.
- Hands-on with `df.shape`, `df.columns.tolist()`, `df["Close"]` (a Series), and `df["Spread"] = df["High"] - df["Low"]` (vectorized math).
- A first taste of boolean indexing: `df["Up"] = df["Close"].gt(df["Open"])` and `.sum()` to count true days.
- The `auto_adjust=False` flag that keeps yfinance returning all six original columns (Open, High, Low, Close, Adj Close, Volume) instead of folding Adj Close into Close.
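
The five views can be sketched roughly as follows. This is not the script from the video: it assumes a small hand-made frame (invented numbers) in place of the yfinance download so that it runs offline, but the column names match the six-column shape described above.

```python
import pandas as pd

# Hypothetical OHLCV rows standing in for the downloaded AAPL data.
dates = pd.bdate_range("2024-02-01", periods=4)
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 103.0],
        "High": [103.0, 104.0, 102.5, 106.0],
        "Low": [99.0, 100.5, 98.0, 102.0],
        "Close": [102.0, 101.0, 101.5, 105.0],
        "Adj Close": [101.5, 100.5, 101.0, 104.5],
        "Volume": [1_000, 1_200, 900, 1_500],
    },
    index=dates,
)

# View 1: shape as (rows, columns), taken before any derived columns exist.
original_shape = df.shape
print(original_shape)

# View 2: column names as a plain Python list.
print(df.columns.tolist())

# View 3: a single column is a Series.
print(df["Close"])

# View 4: vectorized derived column, no explicit loop.
df["Spread"] = df["High"] - df["Low"]
print(df["Spread"])

# View 5: boolean column, then count the True values.
df["Up"] = df["Close"].gt(df["Open"])
up_days = int(df["Up"].sum())
print(up_days)
```

Summing a boolean Series works because `True` counts as 1 and `False` as 0, so `.sum()` returns the number of days that closed above the open.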
Timestamps:
0:00 - Intro — Episode 2 starts here
0:17 - Preview — index, columns, vectorized
0:52 - Open nvim, write mental_model.py
1:06 - Download AAPL with auto_adjust=False
1:18 - Print shape and columns
1:28 - Pull one column, get a Series
1:36 - Vectorized: Spread = High - Low
1:45 - Boolean filter: count up days
1:53 - Recap — three layers, vectorized
2:28 - End screen
Key Takeaways:
1. A DataFrame is three layers: a labeled **index**, named **columns**, and the values in between. The index is the row identity — for our daily price data it's the trading date, automatically set as a `DatetimeIndex` by yfinance. The fact that the index is dates, not row numbers, is what unlocks `df.loc["2024-02"]` and every other date-aware operation in pandas.
2. A column is a **Series**, a one-dimensional labeled array that shares its index with the parent DataFrame. `df["Close"]` returns a Series; `df[["Close"]]` (double brackets) returns a DataFrame with one column. The shape difference matters when you call methods that expect 2D input — beginners conflate the two and get cryptic errors.
3. Operations are **vectorized**, not row-by-row. `df["Spread"] = df["High"] - df["Low"]` touches every row of two columns and creates a third in a single line. There's no `for` loop: pandas runs the math at C speed under the hood, so adding a column to a million-row DataFrame is nearly instant. The mental model: treat a column as a single variable, and one expression applies to every row at once.
4. `head()`, `info()`, and `describe()` are the inspection toolkit. `head()` shows the first five rows for a sanity check. `info()` shows the dtypes, non-null counts, and memory usage, which makes it the fastest way to catch data quality issues. `describe()` gives count, mean, standard deviation, min, max, and quartiles for every numeric column. Every analysis starts with these three.
5. The `auto_adjust` flag matters. Recent yfinance versions default to `auto_adjust=True`, which replaces Close with the adjusted close and drops the Adj Close column. Pass `auto_adjust=False` to keep the original six-column shape (Open, High, Low, Close, Adj Close, Volume). For the rest of the series we keep both Close and Adj Close visible: Close is adjusted for splits only, while Adj Close also accounts for dividends, and you'll want both.
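
A short sketch of takeaways 1, 2, and 4, again on invented data rather than a real download. It assumes a business-day index that spans the end of January and the start of February so the date slice has something to select.

```python
import pandas as pd

# Hypothetical closes and volumes; dates run from Jan 30 to Feb 5, 2024.
dates = pd.bdate_range("2024-01-30", periods=5)
df = pd.DataFrame(
    {
        "Close": [10.0, 10.2, 10.1, 10.4, 10.3],
        "Volume": [500, 650, 480, 700, 620],
    },
    index=dates,
)

# Takeaway 1: a DatetimeIndex allows partial-string selection by month.
feb = df.loc["2024-02"]
print(len(feb))

# Takeaway 2: single brackets give a 1D Series, double brackets a 2D DataFrame.
series = df["Close"]
frame = df[["Close"]]
print(series.shape, frame.shape)

# Takeaway 4: the inspection toolkit.
print(df.head())          # first five rows
df.info()                 # dtypes, non-null counts, memory usage
summary = df.describe()   # summary statistics per numeric column
print(summary)
```

Checking `.shape` or `.ndim` is a quick way to tell which of the two bracket forms you are holding when a method complains about its input.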
#Pandas #Python #Finance #DataFrame #Series #Vectorized #DataAnalytics #PythonForFinance #Yfinance #LearnPython