How to Read SQL into a DataFrame in Python
Video: How to Read SQL into a DataFrame in Python — Tutorial by CelesteAI
Three lines.
import sqlite3→conn = sqlite3.connect("prices.db")→pd.read_sql("SELECT * FROM prices WHERE Close > 200", conn). You now have a pandas DataFrame containing every row your SQL query returned. The same shape works with Polars viapl.read_database. The same shape works against Postgres, MySQL, BigQuery, and Snowflake — just swap the connection.
The single most common analyst pipeline starts with a SQL query and ends with a DataFrame. The database has the data; the SQL filters it; the DataFrame is where the real work happens. Python’s standard library and every major dataframe library make this a two-line operation.
What’s actually happening
A database client opens a connection. The connection talks SQL — you send a query string, the server runs it, the result rows come back as cursor records. Both pandas and Polars wrap that cursor and materialize the result as a DataFrame, handling type conversion and column naming automatically.
The dataframe library doesn’t speak SQL itself — it leans on the underlying driver (sqlite3, psycopg2, pymysql, pyarrow.flight). Your job is to pass it a connection and a query string. The library handles the rest.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install pandas polars connectorx
sqlite3 is in the standard library. connectorx is the fast Rust-based connector polars uses for URI-style reads — optional but worth installing.
For the demo we’ll use a SQLite database with a prices table — fourteen tickers, daily OHLCV, ~28 thousand rows. The setup is one line: df.to_pandas().to_sql("prices", sqlite3.connect("prices.db")).
Method 1 — pandas read_sql (most common)
import sqlite3
import pandas as pd
conn = sqlite3.connect("prices.db")
df = pd.read_sql("SELECT * FROM prices WHERE Ticker='AAPL' AND Close > 200", conn)
print(df.shape) # (367, 8)
Three lines if you count the import. pd.read_sql accepts any DB-API 2.0 connection — SQLite, Postgres, MySQL, Oracle, anything that exposes a cursor.execute(sql) method.
For postgres:
import psycopg2
conn = psycopg2.connect("postgresql://user:pass@host:5432/db")
df = pd.read_sql("SELECT * FROM prices", conn)
For SQLAlchemy (preferred for production):
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:pass@host/db")
df = pd.read_sql("SELECT * FROM prices", engine)
The query string is whatever your database speaks. Parameterized queries belong in params=:
df = pd.read_sql(
"SELECT * FROM prices WHERE Ticker = ? AND Close > ?",
conn,
params=("AAPL", 200),
)
Never f-string user input into a SQL query string. Parameters get escaped automatically; concatenation gets you SQL injection.
Method 2 — polars read_database (connection-based)
import sqlite3
import polars as pl
conn = sqlite3.connect("prices.db")
df = pl.read_database("SELECT * FROM prices WHERE Ticker='AAPL' AND Close > 200", conn)
print(df.shape) # (367, 8)
Same shape. Polars’s read_database accepts the same DB-API connection objects pandas does. The result is a pl.DataFrame — call .to_pandas() if downstream code wants it back as pandas.
For execution-engine reasons, polars’s reader is faster on large result sets — it streams rows through Arrow buffers rather than building a pandas frame and converting.
Method 3 — polars read_database_uri (one-liner)
import polars as pl
df = pl.read_database_uri(
"SELECT * FROM prices WHERE Ticker='AAPL' AND Close > 200",
"sqlite:///prices.db",
)
print(df.shape) # (367, 8)
No connection management. Polars (via the connectorx backend) opens, reads, and closes the connection in one call. The URI string tells it everything it needs — driver, host, credentials, database name.
URI examples:
sqlite:///prices.db
postgresql://user:pass@localhost:5432/finance
mysql://user:pass@db.example.com:3306/finance
mssql://user:pass@host:1433/finance
For ad-hoc scripts and notebooks, this is the cleanest path. For production pipelines, the connection-pool methods (read_database with SQLAlchemy/psycopg2) give you finer control over reconnects and transactions.
Filter at the SQL layer, not in Python
The single most common mistake: pulling the whole table into memory and then filtering with pandas.
Wrong:
df = pd.read_sql("SELECT * FROM prices", conn)
big = df[df["Close"] > 200] # filtered in Python after reading 28k rows
Right:
big = pd.read_sql("SELECT * FROM prices WHERE Close > 200", conn)
The database is the right place to filter. For a 28k-row table the difference is small. For a 100-million-row table it’s the difference between a 5-second query and a 5-minute Python crash. Push the filter into SQL whenever you can.
Same principle for column selection — use SELECT col1, col2, … instead of SELECT * then dropping columns in Python.
What the libraries do well
| pandas read_sql | polars read_database | polars read_database_uri | |
|---|---|---|---|
| Needs a connection object | yes | yes | no |
| Output type | pd.DataFrame | pl.DataFrame | pl.DataFrame |
| Speed (large results) | baseline | ~2-5× faster | ~2-5× faster |
| Production pipelines | yes (SQLAlchemy) | yes | best for ad-hoc |
| Parameterized queries | params= |
execute_options |
not via URI directly |
| Chunked reads | chunksize= |
yes | no |
For most analyst work, pandas read_sql is the right default if you’re already on pandas. Switch to polars read_database when result sets are large enough that the speed-up matters (rule of thumb: millions of rows).
Reading the schema first
When you don’t know the columns yet:
df = pd.read_sql("SELECT * FROM prices LIMIT 0", conn)
print(df.dtypes)
LIMIT 0 returns zero rows but the schema comes back. Cheap way to inspect a table you didn’t create.
For polars:
df = pl.read_database("SELECT * FROM prices LIMIT 0", conn)
print(df.schema)
Iterating in chunks for tables too big to fit in memory
pd.read_sql supports streamed reads:
for chunk in pd.read_sql("SELECT * FROM huge_table", conn, chunksize=100_000):
process(chunk)
Each chunk is a DataFrame of up to chunksize rows. The connection stays open across iterations. The full table is never materialized in memory.
Polars handles this differently — it would rather you push the filter into SQL so chunking isn’t necessary. For genuinely massive reads, scan into Parquet first:
import duckdb
duckdb.sql("COPY (SELECT * FROM postgres_scan('postgres://...', 'prices') ) TO 'prices.parquet'")
df = pl.scan_parquet("prices.parquet").filter(...).collect()
A whole other tutorial. For most reads, parameterized SQL + a single-call read is enough.
Common stumbles
OperationalError: no such table. Your connection is pointed at the wrong database. SELECT name FROM sqlite_master WHERE type='table' lists every table SQLite can see.
Date columns come back as strings. SQLite doesn’t have a native date type — it stores dates as text. Cast on read:
df["Date"] = pd.to_datetime(df["Date"])
Polars has the same issue but resolves with .str.to_datetime() after the read.
UserWarning: pandas only supports SQLAlchemy connectable. Modern pandas wants a SQLAlchemy engine for non-SQLite databases. Either swap to SQLAlchemy or wrap your raw connection: pd.read_sql(sql, sqlalchemy_engine).
Connection times out on long queries. Set the timeout on the connection: sqlite3.connect("prices.db", timeout=60). For postgres, use the engine’s connect_args={"connect_timeout": 60}.
sqlite3.OperationalError: database is locked. Two processes wrote at the same time. SQLite serializes writes; use connection.execute("PRAGMA busy_timeout = 5000") to wait up to 5 seconds for the lock.
Recap
pd.read_sql(query, connection) is the canonical two-line SQL-to-DataFrame in Python. Same shape with pl.read_database if you want a Polars frame. pl.read_database_uri skips connection management entirely, taking a URI string instead. Use parameterized queries (params=) to escape user input; never f-string SQL. Filter and select at the SQL layer, not in Python — let the database do its job. chunksize= for streamed reads when the result set won’t fit in memory. SQLAlchemy engines for production; raw DB-API connections fine for one-off scripts.
Subscribe to the playlist for more Python answers.