The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2
CelesteAI
Episode 2 of Pandas for Finance. The chapter that makes pandas click.
Source code: https://github.com/GoCelesteAI/pandas-for-finance
A DataFrame is three layers: a labeled index, named columns, and the values in between. Each column is a Series, a one-dimensional labeled array. Operations are vectorized: you touch a whole column at once and pandas runs the math at C speed. Get this picture in your head and the rest of the library falls into place.
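To make the three layers concrete, here is a minimal sketch that builds a tiny DataFrame by hand and pulls each layer out separately. The prices are invented for illustration and the variable names are hypothetical; this is not the episode's script.

```python
import pandas as pd

# Hypothetical prices, invented for illustration (not real market data).
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
df = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0], "Close": [10.5, 10.8, 12.4]},
    index=dates,
)

index_layer = df.index              # layer 1: row labels (a DatetimeIndex)
column_layer = df.columns.tolist()  # layer 2: column names
value_layer = df.to_numpy()         # layer 3: the 2D block of values

close = df["Close"]                 # one column pulled out as a Series

print(type(index_layer).__name__)
print(column_layer)
print(value_layer.shape)
print(close.index.equals(df.index))  # the Series shares the parent's index
```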
What You'll Build:
- A mental_model.py script that prints five views of one DataFrame: shape, columns, a Series, a vectorized derived column, and a boolean filter.
- Hands-on with df.shape, df.columns.tolist(), df["Close"] (a Series), and df["Spread"] = df["High"] - df["Low"] (vectorized math).
- A first taste of boolean masks: df["Up"] = df["Close"].gt(df["Open"]) builds a True/False column, and .sum() counts the up days.
- The auto_adjust=False flag that keeps yfinance returning all six original columns (Open, High, Low, Close, Adj Close, Volume) instead of folding Adj Close into Close.
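The five views listed above can be sketched without a network call. This version assumes a hand-built DataFrame with the same six columns in place of the yfinance download, so the numbers are made up and the script in the repository may differ.

```python
import pandas as pd

# Hypothetical OHLCV rows standing in for the yfinance download.
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 103.0],
        "High": [103.0, 104.0, 102.5, 106.0],
        "Low": [99.0, 100.5, 99.5, 102.0],
        "Close": [102.0, 101.0, 102.0, 105.0],
        "Adj Close": [101.5, 100.5, 101.5, 104.5],
        "Volume": [1000, 1200, 900, 1500],
    },
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
    ),
)

# View 1: shape as (rows, columns), captured before adding columns.
shape_before = df.shape
print(shape_before)

# View 2: column names as a plain Python list.
print(df.columns.tolist())

# View 3: single brackets return a Series.
close = df["Close"]
print(type(close).__name__)

# View 4: vectorized math, one expression covers every row.
df["Spread"] = df["High"] - df["Low"]
print(df["Spread"])

# View 5: boolean column; True counts as 1, so sum() counts up days.
df["Up"] = df["Close"].gt(df["Open"])
up_days = int(df["Up"].sum())
print(up_days)
```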
Timestamps:
0:00 - Intro — Episode 2 starts here
0:17 - Preview — index, columns, vectorized
0:52 - Open nvim, write mental_model.py
1:06 - Download AAPL with auto_adjust=False
1:18 - Print shape and columns
1:28 - Pull one column, get a Series
1:36 - Vectorized: Spread = High - Low
1:45 - Boolean filter: count up days
1:53 - Recap — three layers, vectorized
2:28 - End screen
Key Takeaways:
1. A DataFrame is three layers: a labeled index, named columns, and the values in between. The index is the row identity — for our daily price data it's the trading date, automatically set as a DatetimeIndex by yfinance. The fact that the index is dates, not row numbers, is what unlocks df.loc["2024-02"] and every other date-aware operation in pandas.
2. A column is a Series, a one-dimensional labeled array that shares its index with the parent DataFrame. df["Close"] returns a Series; df[["Close"]] (double brackets) returns a DataFrame with one column. The shape difference, (n,) versus (n, 1), matters when you call methods that expect 2D input. Conflating the two is a common source of cryptic errors.
3. Operations are vectorized, not row-by-row. df["Spread"] = df["High"] - df["Low"] touches every row of two columns and creates a third in a single line. There's no for loop: pandas runs the math at C speed under the hood. Adding a column to a million-row DataFrame is nearly instant. The mental model: a column is one variable, and an operation on it applies to every row at once.
4. head(), info(), and describe() are the inspection toolkit. head() shows the first five rows for a sanity check. info() shows the dtypes, non-null counts, and memory usage, and it is the one that catches data quality issues fastest. describe() gives count, mean, standard deviation, min, max, and quartiles for every numeric column. Every analysis starts with these three.
5. The auto_adjust flag matters. Recent yfinance versions default to auto_adjust=True, which folds Adj Close into Close and drops a column. Pass auto_adjust=False to keep the original six-column shape (Open, High, Low, Close, Adj Close, Volume). For the rest of the series we keep both Close and Adj Close visible — splits and dividends affect them differently and you'll want both.
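A few of the takeaways above can be checked on a small example. The sketch below assumes a synthetic weekday calendar built with pd.bdate_range (weekdays only, no exchange holidays) and invented closing prices; it illustrates date slicing, single versus double brackets, and the inspection calls, and is not taken from the episode.

```python
import pandas as pd

# Hypothetical weekday calendar and invented closes (not real market data).
dates = pd.bdate_range("2024-01-29", "2024-03-04")
df = pd.DataFrame(
    {"Close": [100.0 + i for i in range(len(dates))]},
    index=dates,
)

# Takeaway 1: a DatetimeIndex allows partial-string selection by month.
feb = df.loc["2024-02"]
print(feb.index.min(), feb.index.max())

# Takeaway 2: single brackets give a 1D Series, double brackets a 2D DataFrame.
series = df["Close"]
frame = df[["Close"]]
print(series.shape, frame.shape)

# Takeaway 4: the three inspection calls.
first_rows = df.head()   # first five rows
df.info()                # dtypes, non-null counts, memory usage (prints)
summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary.index.tolist())
```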
#Pandas #Python #Finance #DataFrame #Series #Vectorized #DataAnalytics #PythonForFinance #Yfinance #LearnPython
---
Generated by GoCelesteAI · part of the Pandas for Finance series