Pandas for Finance: The DataFrame Mental Model: Index, Columns, Series, Vectorized
Video: The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2 by CelesteAI
A DataFrame is a 2D table with a labeled index (rows) and named columns. Each column is a 1D Series. Operations are vectorized — apply to whole columns, not row-by-row. Get this model and the rest of pandas falls into place.
In Excel, the row “label” is just 1, 2, 3, .... In pandas, the row label is whatever you set — a date, a ticker, a customer ID. That choice unlocks half of pandas’s power.
The three layers
import pandas as pd
import yfinance as yf
df = yf.download("AAPL", start="2024-01-01", end="2024-04-01")
print(df)
Open High Low Close Adj Close Volume
Date
2024-01-02 187.150 188.440 183.885 185.640 184.79 82488700
2024-01-03 184.220 185.880 183.430 184.250 183.42 58414500
2024-01-04 182.150 183.085 180.880 181.910 181.10 71983600
...
Three layers:
- Index — the row labels (
Datehere). Not a column, not data — it’s the row identifier. - Columns — the column names (
Open,High,Low, …). - Values — the actual numbers in the cells.
print(df.index)
print(df.columns)
print(df.values.shape)
DatetimeIndex(['2024-01-02', '2024-01-03', '2024-01-04', ...], dtype='datetime64[ns]', name='Date')
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
(62, 6)
The index is a DatetimeIndex — pandas understands these are dates, not strings. That’s why df.loc['2024-02'] will give you all of February, no parsing needed.
A column is a Series
close = df["Close"]
print(type(close)) # <class 'pandas.core.series.Series'>
print(close.head())
Date
2024-01-02 185.640
2024-01-03 184.250
2024-01-04 181.910
2024-01-05 181.180
2024-01-08 185.560
Name: Close, dtype: float64
A Series is a 1D labeled array. It has the same index as its parent DataFrame. Operations on a Series return another Series:
returns = close.pct_change()
print(returns.head())
Date
2024-01-02 NaN
2024-01-03 -0.0075
2024-01-04 -0.0127
2024-01-05 -0.0040
2024-01-08 0.0241
Name: Close, dtype: float64
.pct_change() computes period-over-period percent change for the whole column at once. Vectorized — no for loop. The first row is NaN because there’s no prior day.
Vectorized operations
This is the big idea. Don’t loop over rows; operate on whole columns.
# Spread = High - Low
df["Spread"] = df["High"] - df["Low"]
# Body of a candle (Close - Open)
df["Body"] = df["Close"] - df["Open"]
# Whether the day was up
df["Up"] = df["Close"] > df["Open"]
print(df[["High", "Low", "Spread", "Up"]].head())
Each line touches every row. Pandas runs this in C-speed under the hood — adding columns to a 1M-row DataFrame is nearly instant.
The mental model: a column is one variable, vectorized. Same as Excel’s =A1-B1 filled down — pandas just doesn’t need the “fill down” step.
Multiple tickers → MultiIndex columns
When you download multiple tickers, the columns become two-level:
df = yf.download(["AAPL", "MSFT", "GOOGL"], start="2024-01-01", end="2024-04-01")
print(df.head())
Open High ...
AAPL MSFT GOOGL AAPL MSFT GOOGL
Date
2024-01-02 187.150 370.870 138.170 188.440 375.900 139.120
...
The top level is the field (Open/High/Low/Close), the second is the ticker. Access:
df["Close"] # all 3 tickers' Close prices
df["Close"]["AAPL"] # just AAPL Close
df.xs("AAPL", level=1, axis=1) # all fields for AAPL
For most of this series we’ll convert to long format in Ep 3 — one row per (date, ticker), columns are the fields. Easier to work with.
Setting and changing the index
The index is whatever you set it to be. For our daily price data, Date is the natural choice (yfinance does it for you).
To switch:
# Move Close into the index, drop the rest
just_close = df.reset_index().set_index("Close")
# Or move the date back to a column
flat = df.reset_index()
print(flat.head())
Date Open High Low Close ...
0 2024-01-02 187.15 188.44 183.88 185.64
1 2024-01-03 184.22 185.88 183.43 184.25
reset_index() moves the index into a column and gives a fresh 0, 1, 2, ... index. set_index(col) does the inverse.
For long-format data with multiple tickers:
prices = (
df.stack(level=1) # (Date, Ticker) becomes the index
.reset_index()
.rename(columns={"level_1": "Ticker"})
)
print(prices.head())
We’ll use this shape from Ep 3 onward.
Inspecting any DataFrame
When you load unfamiliar data, run these in order:
df.shape # (rows, cols)
df.head() # first 5 rows
df.tail() # last 5
df.sample(5) # 5 random rows
df.info() # dtypes, non-null counts, memory
df.describe() # min/max/mean/quartiles for numeric cols
info() is the one that catches data-quality issues fast — wrong dtype, nulls where you didn’t expect them.
DataFrame vs dict vs list
Quick mental map:
# Build from a dict (one column per key)
df = pd.DataFrame({
"ticker": ["AAPL", "MSFT", "GOOGL"],
"price": [185.64, 370.87, 138.17],
"sector": ["Tech", "Tech", "Comm"],
})
# Build from a list of dicts (each dict = one row)
rows = [
{"ticker": "AAPL", "price": 185.64, "sector": "Tech"},
{"ticker": "MSFT", "price": 370.87, "sector": "Tech"},
]
df = pd.DataFrame(rows)
Both produce the same DataFrame. Use whichever the source data already looks like.
Try it
mental_model.py:
import pandas as pd
import yfinance as yf
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")
df = yf.download("AAPL", start="2024-01-01", end="2024-04-01")
print("=== Shape ===")
print(df.shape)
print("\n=== Index ===")
print(df.index[:3])
print("\n=== Columns ===")
print(df.columns.tolist())
print("\n=== One column (Series) ===")
print(df["Close"].head())
print("\n=== Vectorized: add a Spread column ===")
df["Spread"] = df["High"] - df["Low"]
print(df[["High", "Low", "Spread"]].head())
print("\n=== Up days only (filter preview) ===")
df["Up"] = df["Close"] > df["Open"]
print(f"Up days: {df['Up'].sum()} / {len(df)}")
Run:
python mental_model.py
You’ll see the shape, index head, columns list, the Close Series, the Spread you computed, and the count of up days. That’s the DataFrame in five views.
Common stumbles
“Series has no rows” confusion. df["Close"] is a Series — 1D, no separate rows/columns. df[["Close"]] (double brackets) is a DataFrame with one column. The shape difference matters when you call methods that need 2D input.
Modifying a slice. df[df["Close"] > 100]["Volume"] = 0 — pandas warns “SettingWithCopyWarning” because the slice may or may not be a view. Use df.loc[df["Close"] > 100, "Volume"] = 0 instead. Always .loc for assignment.
Chained indexing. df["Close"]["2024-01-02"] works but df.loc["2024-01-02", "Close"] is canonical and faster.
Index vs reset_index. df.reset_index() returns a new DataFrame; df.reset_index(inplace=True) modifies in place but is being deprecated. Just rebind: df = df.reset_index().
MultiIndex columns confusion. Multi-ticker downloads give (field, ticker) columns. Either df["Close"]["AAPL"] or convert to long format. Long is friendlier.
What’s next
Ep 3: caching. Stop re-downloading every run. to_parquet + read_parquet + a load-or-download function template.
Recap
DataFrame = labeled index + named columns + values. A column is a Series. Operations are vectorized — touch a whole column at once. df.shape, df.head(), df.info(), df.describe() are the inspection toolkit. df["col"] selects a Series; df[["a","b"]] selects a DataFrame. set_index / reset_index to control the row identity. For multi-ticker downloads, convert to long format (Ep 3).
Next episode: caching to parquet.