The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2

CelesteAI
Episode 2 of Pandas for Finance. The chapter that makes pandas click.

Source code: https://github.com/GoCelesteAI/pandas-for-finance

A DataFrame is three layers: a labeled index, named columns, and the values in between. Each column is a Series, a one-dimensional labeled array. Operations are vectorized: you touch a whole column at once and pandas runs the math at C speed. Get this picture in your head and the rest of the library falls into place.

What You'll Build:
- A mental_model.py script that prints five views of one DataFrame: shape, columns, a Series, a vectorized derived column, and a boolean filter.
- Hands-on with df.shape, df.columns.tolist(), df["Close"] (a Series), and df["Spread"] = df["High"] - df["Low"] (vectorized math).
- A first taste of boolean indexing: df["Up"] = df["Close"].gt(df["Open"]) and .sum() to count the True days.
- The auto_adjust=False flag that keeps yfinance returning all six original columns (Open, High, Low, Close, Adj Close, Volume) instead of folding Adj Close into Close.

Timestamps:
0:00 - Intro: Episode 2 starts here
0:17 - Preview: index, columns, vectorized
0:52 - Open nvim, write mental_model.py
1:06 - Download AAPL with auto_adjust=False
1:18 - Print shape and columns
1:28 - Pull one column, get a Series
1:36 - Vectorized: Spread = High - Low
1:45 - Boolean filter: count up days
1:53 - Recap: three layers, vectorized
2:28 - End screen

Key Takeaways:
1. A DataFrame is three layers: a labeled index, named columns, and the values in between. The index is the row identity. For our daily price data it's the trading date, automatically set as a DatetimeIndex by yfinance. Because the index holds dates rather than row numbers, df.loc["2024-02"] and every other date-aware operation in pandas works.
2. A column is a Series, a one-dimensional labeled array that shares its index with the parent DataFrame. df["Close"] returns a Series; df[["Close"]] (double brackets) returns a DataFrame with one column. The shape difference matters when you call methods that expect 2D input. Beginners conflate the two and get cryptic errors.
3. Operations are vectorized, not row-by-row. df["Spread"] = df["High"] - df["Low"] touches every row of two columns and creates a third in a single line. There's no for loop: pandas runs the math at C speed under the hood. Adding a column to a million-row DataFrame is nearly instant. The mental model: a column is one variable, and an operation on it applies to the entire table at once.
4. head(), info(), and describe() are the inspection toolkit. head() shows the first five rows for a sanity check. info() shows the dtypes, non-null counts, and memory usage, and is the one that catches data quality issues fastest. describe() gives min, max, mean, and quartiles for every numeric column. Every analysis starts with these three.
5. The auto_adjust flag matters. Recent yfinance versions default to auto_adjust=True, which folds Adj Close into Close and drops a column. Pass auto_adjust=False to keep the original six-column shape (Open, High, Low, Close, Adj Close, Volume). For the rest of the series we keep both Close and Adj Close visible, because splits and dividends affect them differently and you'll want both.

#Pandas #Python #Finance #DataFrame #Series #Vectorized #DataAnalytics #PythonForFinance #Yfinance #LearnPython

---
Generated by GoCelesteAI · part of the Pandas for Finance series
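
The five views described in this episode can be tried without a network call. What follows is a minimal sketch, not the episode's mental_model.py: the prices are made up, the small hand-built DataFrame is a hypothetical stand-in for the yfinance download (so there is no Adj Close column here), and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: five business days of
# made-up prices, indexed by date the way daily price data would be.
dates = pd.bdate_range("2024-02-26", periods=5)  # Mon 26 Feb through Fri 1 Mar
df = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0, 103.0, 104.0],
        "High": [103.0, 104.0, 102.5, 106.0, 105.0],
        "Low": [99.0, 100.5, 99.5, 102.0, 101.0],
        "Close": [102.0, 101.0, 100.0, 105.0, 102.0],
        "Volume": [1000, 1200, 900, 1500, 1100],
    },
    index=dates,
)

# View 1 and 2: shape and column names.
shape = df.shape
columns = df.columns.tolist()
print(shape, columns)

# View 3: single brackets give a Series, double brackets a one-column DataFrame.
close = df["Close"]
close_frame = df[["Close"]]
print(type(close).__name__, type(close_frame).__name__)

# View 4: vectorized math, one expression over every row, no for loop.
df["Spread"] = df["High"] - df["Low"]

# View 5: a boolean column; True counts as 1, so .sum() counts the up days.
df["Up"] = df["Close"].gt(df["Open"])
up_days = df["Up"].sum()
print(up_days)

# Date-aware selection works because the index is a DatetimeIndex.
feb = df.loc["2024-02"]
print(len(feb))
```

With the real download in place of the hand-built frame, the same lines should apply, though the exact column layout depends on the yfinance version and flags used.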