Part of Pandas for Finance

Pandas for Finance: The DataFrame Mental Model: Index, Columns, Series, Vectorized

Celest KimCelest Kim

Video: The DataFrame Mental Model: Index, Columns, Series, Vectorized | Pandas for Finance Ep2 by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

A DataFrame is a 2D table with a labeled index (rows) and named columns. Each column is a 1D Series. Operations are vectorized — apply to whole columns, not row-by-row. Get this model and the rest of pandas falls into place.

In Excel, the row “label” is just 1, 2, 3, .... In pandas, the row label is whatever you set — a date, a ticker, a customer ID. That choice unlocks half of pandas’s power.

The three layers

import pandas as pd
import yfinance as yf

df = yf.download("AAPL", start="2024-01-01", end="2024-04-01")
print(df)
                  Open      High      Low     Close  Adj Close      Volume
Date
2024-01-02  187.150  188.440  183.885  185.640    184.79   82488700
2024-01-03  184.220  185.880  183.430  184.250    183.42   58414500
2024-01-04  182.150  183.085  180.880  181.910    181.10   71983600
...

Three layers:

  1. Index — the row labels (Date here). Not a column, not data — it’s the row identifier.
  2. Columns — the column names (Open, High, Low, …).
  3. Values — the actual numbers in the cells.
print(df.index)
print(df.columns)
print(df.values.shape)
DatetimeIndex(['2024-01-02', '2024-01-03', '2024-01-04', ...], dtype='datetime64[ns]', name='Date')
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
(62, 6)

The index is a DatetimeIndex — pandas understands these are dates, not strings. That’s why df.loc['2024-02'] will give you all of February, no parsing needed.

A column is a Series

close = df["Close"]
print(type(close))   # <class 'pandas.core.series.Series'>
print(close.head())
Date
2024-01-02   185.640
2024-01-03   184.250
2024-01-04   181.910
2024-01-05   181.180
2024-01-08   185.560
Name: Close, dtype: float64

A Series is a 1D labeled array. It has the same index as its parent DataFrame. Operations on a Series return another Series:

returns = close.pct_change()
print(returns.head())
Date
2024-01-02       NaN
2024-01-03   -0.0075
2024-01-04   -0.0127
2024-01-05   -0.0040
2024-01-08    0.0241
Name: Close, dtype: float64

.pct_change() computes period-over-period percent change for the whole column at once. Vectorized — no for loop. The first row is NaN because there’s no prior day.

Vectorized operations

This is the big idea. Don’t loop over rows; operate on whole columns.

# Spread = High - Low
df["Spread"] = df["High"] - df["Low"]

# Body of a candle (Close - Open)
df["Body"] = df["Close"] - df["Open"]

# Whether the day was up
df["Up"] = df["Close"] > df["Open"]

print(df[["High", "Low", "Spread", "Up"]].head())

Each line touches every row. Pandas runs this in C-speed under the hood — adding columns to a 1M-row DataFrame is nearly instant.

The mental model: a column is one variable, vectorized. Same as Excel’s =A1-B1 filled down — pandas just doesn’t need the “fill down” step.

Multiple tickers → MultiIndex columns

When you download multiple tickers, the columns become two-level:

df = yf.download(["AAPL", "MSFT", "GOOGL"], start="2024-01-01", end="2024-04-01")
print(df.head())
                  Open                            High                          ...
                  AAPL     MSFT     GOOGL          AAPL     MSFT     GOOGL
Date
2024-01-02   187.150  370.870  138.170      188.440  375.900  139.120
...

The top level is the field (Open/High/Low/Close), the second is the ticker. Access:

df["Close"]              # all 3 tickers' Close prices
df["Close"]["AAPL"]      # just AAPL Close
df.xs("AAPL", level=1, axis=1)   # all fields for AAPL

For most of this series we’ll convert to long format in Ep 3 — one row per (date, ticker), columns are the fields. Easier to work with.

Setting and changing the index

The index is whatever you set it to be. For our daily price data, Date is the natural choice (yfinance does it for you).

To switch:

# Move Close into the index, drop the rest
just_close = df.reset_index().set_index("Close")

# Or move the date back to a column
flat = df.reset_index()
print(flat.head())
        Date     Open     High      Low    Close ...
0 2024-01-02  187.15   188.44   183.88   185.64
1 2024-01-03  184.22   185.88   183.43   184.25

reset_index() moves the index into a column and gives a fresh 0, 1, 2, ... index. set_index(col) does the inverse.

For long-format data with multiple tickers:

prices = (
  df.stack(level=1)            # (Date, Ticker) becomes the index
    .reset_index()
    .rename(columns={"level_1": "Ticker"})
)
print(prices.head())

We’ll use this shape from Ep 3 onward.

Inspecting any DataFrame

When you load unfamiliar data, run these in order:

df.shape       # (rows, cols)
df.head()      # first 5 rows
df.tail()      # last 5
df.sample(5)   # 5 random rows
df.info()      # dtypes, non-null counts, memory
df.describe()  # min/max/mean/quartiles for numeric cols

info() is the one that catches data-quality issues fast — wrong dtype, nulls where you didn’t expect them.

DataFrame vs dict vs list

Quick mental map:

# Build from a dict (one column per key)
df = pd.DataFrame({
  "ticker": ["AAPL", "MSFT", "GOOGL"],
  "price":  [185.64, 370.87, 138.17],
  "sector": ["Tech", "Tech", "Comm"],
})

# Build from a list of dicts (each dict = one row)
rows = [
  {"ticker": "AAPL", "price": 185.64, "sector": "Tech"},
  {"ticker": "MSFT", "price": 370.87, "sector": "Tech"},
]
df = pd.DataFrame(rows)

Both produce the same DataFrame. Use whichever the source data already looks like.

Try it

mental_model.py:

import pandas as pd
import yfinance as yf

pd.set_option("display.float_format", lambda x: f"{x:,.2f}")

df = yf.download("AAPL", start="2024-01-01", end="2024-04-01")

print("=== Shape ===")
print(df.shape)

print("\n=== Index ===")
print(df.index[:3])

print("\n=== Columns ===")
print(df.columns.tolist())

print("\n=== One column (Series) ===")
print(df["Close"].head())

print("\n=== Vectorized: add a Spread column ===")
df["Spread"] = df["High"] - df["Low"]
print(df[["High", "Low", "Spread"]].head())

print("\n=== Up days only (filter preview) ===")
df["Up"] = df["Close"] > df["Open"]
print(f"Up days: {df['Up'].sum()} / {len(df)}")

Run:

python mental_model.py

You’ll see the shape, index head, columns list, the Close Series, the Spread you computed, and the count of up days. That’s the DataFrame in five views.

Common stumbles

“Series has no rows” confusion. df["Close"] is a Series — 1D, no separate rows/columns. df[["Close"]] (double brackets) is a DataFrame with one column. The shape difference matters when you call methods that need 2D input.

Modifying a slice. df[df["Close"] > 100]["Volume"] = 0 — pandas warns “SettingWithCopyWarning” because the slice may or may not be a view. Use df.loc[df["Close"] > 100, "Volume"] = 0 instead. Always .loc for assignment.

Chained indexing. df["Close"]["2024-01-02"] works but df.loc["2024-01-02", "Close"] is canonical and faster.

Index vs reset_index. df.reset_index() returns a new DataFrame; df.reset_index(inplace=True) modifies in place but is being deprecated. Just rebind: df = df.reset_index().

MultiIndex columns confusion. Multi-ticker downloads give (field, ticker) columns. Either df["Close"]["AAPL"] or convert to long format. Long is friendlier.

What’s next

Ep 3: caching. Stop re-downloading every run. to_parquet + read_parquet + a load-or-download function template.

Recap

DataFrame = labeled index + named columns + values. A column is a Series. Operations are vectorized — touch a whole column at once. df.shape, df.head(), df.info(), df.describe() are the inspection toolkit. df["col"] selects a Series; df[["a","b"]] selects a DataFrame. set_index / reset_index to control the row identity. For multi-ticker downloads, convert to long format (Ep 3).

Next episode: caching to parquet.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.