The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2

CelesteAI
Description
Episode 2 of *Pandas for Finance*. The chapter that makes pandas click. A DataFrame is three layers: a labeled **index**, named **columns**, and the values in between. Each column is a **Series**, a one-dimensional labeled array. Operations are **vectorized**: you touch a whole column at once and pandas runs the math at C speed. Get this picture in your head and the rest of the library falls into place.

What You'll Build:
- A `mental_model.py` script that prints five views of one DataFrame: shape, columns, a Series, a vectorized derived column, and a boolean filter.
- Hands-on with `df.shape`, `df.columns.tolist()`, `df["Close"]` (a Series), and `df["Spread"] = df["High"] - df["Low"]` (vectorized math).
- A first taste of boolean indexing: `df["Up"] = df["Close"].gt(df["Open"])` and `.sum()` to count the days that closed above their open.
- The `auto_adjust=False` flag that keeps yfinance returning all six original columns (Open, High, Low, Close, Adj Close, Volume) instead of folding Adj Close into Close.

Timestamps:
0:00 - Intro: Episode 2 starts here
0:17 - Preview: index, columns, vectorized
0:52 - Open nvim, write mental_model.py
1:06 - Download AAPL with auto_adjust=False
1:18 - Print shape and columns
1:28 - Pull one column, get a Series
1:36 - Vectorized: Spread = High - Low
1:45 - Boolean filter: count up days
1:53 - Recap: three layers, vectorized
2:28 - End screen

Key Takeaways:
1. A DataFrame is three layers: a labeled **index**, named **columns**, and the values in between. The index is the row identity. For our daily price data it's the trading date, automatically set as a `DatetimeIndex` by yfinance. Because the index holds dates rather than row numbers, `df.loc["2024-02"]` and every other date-aware operation in pandas works directly on it.
2. A column is a **Series**, a one-dimensional labeled array that shares its index with the parent DataFrame. `df["Close"]` returns a Series; `df[["Close"]]` (double brackets) returns a DataFrame with one column. The shape difference matters when you call methods that expect 2D input. Beginners conflate the two and get cryptic errors.
3. Operations are **vectorized**, not row-by-row. `df["Spread"] = df["High"] - df["Low"]` touches every row of two columns and creates a third in a single line. There's no `for` loop; pandas runs the math at C speed under the hood. Adding a column to a million-row DataFrame is nearly instant. The mental model: a column is one variable, applied to the entire table at once.
4. `head()`, `info()`, and `describe()` are the inspection toolkit. `head()` shows the first five rows for a sanity check. `info()` shows the dtypes, non-null counts, and memory usage, which makes it the fastest way to catch data quality issues. `describe()` gives min, max, mean, and quartiles for every numeric column. Every analysis starts with these three.
5. The `auto_adjust` flag matters. Recent yfinance versions default to `auto_adjust=True`, which folds Adj Close into Close and drops a column. Pass `auto_adjust=False` to keep the original six-column shape (Open, High, Low, Close, Adj Close, Volume). For the rest of the series we keep both Close and Adj Close visible, because splits and dividends affect them differently and you'll want both.

#Pandas #Python #Finance #DataFrame #Series #Vectorized #DataAnalytics #PythonForFinance #Yfinance #LearnPython
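
The five views described above can be sketched without a network call. This is a minimal illustration, not the episode's actual `mental_model.py`: it assumes a small hand-made DataFrame with invented prices (not real AAPL data) in place of the yfinance download, and the variable names are hypothetical.

```python
import pandas as pd

# Invented prices for illustration only; stands in for the yfinance download.
dates = pd.to_datetime(["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"])
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 103.0],
        "High": [103.0, 104.0, 102.0, 106.0],
        "Low": [99.0, 100.5, 98.0, 102.0],
        "Close": [102.0, 101.0, 99.0, 105.0],
    },
    index=dates,  # a DatetimeIndex: the row identity is the trading date
)

# View 1 and 2: shape and column names.
shape = df.shape
columns = df.columns.tolist()
print(shape, columns)

# View 3: single brackets give a Series, double brackets a one-column DataFrame.
close = df["Close"]
close_frame = df[["Close"]]
print(type(close).__name__, type(close_frame).__name__)

# View 4: vectorized math, one line and no for loop.
df["Spread"] = df["High"] - df["Low"]

# View 5: boolean column, then sum() counts the True values.
df["Up"] = df["Close"].gt(df["Open"])
up_days = int(df["Up"].sum())
print(up_days)

# Date-aware row selection, possible because the index holds dates.
february = df.loc["2024-02"]
print(len(february))
```

Swapping the hand-made frame for a real download should leave the rest of the sketch unchanged, provided the downloaded columns are a flat index of the same names.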