Pandas for Finance: Writing Out: Excel, Parquet, DuckDB — One Pipeline, Four Formats
Video: Writing Out: Excel, Parquet, DuckDB — One Pipeline, Four Formats | Pandas for Finance Ep13 by CelesteAI
df.to_excel("out.xlsx", sheet_name="...")— single sheet.pd.ExcelWriterfor multi-sheet.df.to_parquet(path)— fast columnar storage.duckdb.execute("CREATE TABLE prices AS FROM df")— persistent SQL DB. The output stage of every analyst pipeline.
You’ve cleaned, joined, computed. Now hand the result to the next person — or future-you. Match the format to the audience.
Single-sheet Excel
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
summary = (
    df.groupby("Ticker")["Adj Close"]
    .agg(["first", "last", "min", "max", "mean"])
    .round(2)
)
summary.to_excel("summary.xlsx", sheet_name="Stats")
.to_excel(path, sheet_name="...") writes a single tab. The index becomes the leftmost column unless you pass index=False.
For sharing with finance teams, Excel is the universal hand-off format. Pandas → openpyxl → .xlsx.
Multi-sheet Excel
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
for ticker in df["Ticker"].unique():
sub = df[df["Ticker"] == ticker].sort_values("Date")
sub.to_excel(writer, sheet_name=ticker, index=False)
pd.ExcelWriter is a context manager. Inside, repeated .to_excel(writer, sheet_name=...) calls add tabs.
For our 14-ticker universe: one tab per ticker. The “send the analyst the data, they’ll click the tabs” workflow.
Excel formatting
For real reports, use the writer’s formatting:
with pd.ExcelWriter("formatted.xlsx", engine="xlsxwriter") as writer:
summary.to_excel(writer, sheet_name="Summary")
workbook = writer.book
worksheet = writer.sheets["Summary"]
# Currency format on column B
money = workbook.add_format({"num_format": "$#,##0.00"})
worksheet.set_column("B:F", 14, money)
# Header bold
header = workbook.add_format({"bold": True, "bg_color": "#D9E1F2"})
worksheet.set_row(0, None, header)
xlsxwriter is faster than openpyxl for new files; use it when starting fresh. openpyxl is needed for editing existing files.
For frequent reports, write a wrapper function once, reuse.
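A minimal sketch of such a wrapper, assuming a one-sheet report with currency columns; the function name, parameters, and default column range are illustrative, not fixed by the episode:
import pandas as pd

def write_formatted_report(frame: pd.DataFrame, path: str, sheet: str = "Summary",
                           money_cols: str = "B:F") -> None:
    """Hypothetical helper: one sheet, currency columns, bold header row."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        frame.to_excel(writer, sheet_name=sheet)
        workbook = writer.book
        worksheet = writer.sheets[sheet]
        money = workbook.add_format({"num_format": "$#,##0.00"})
        worksheet.set_column(money_cols, 14, money)
        header = workbook.add_format({"bold": True, "bg_color": "#D9E1F2"})
        worksheet.set_row(0, None, header)

# write_formatted_report(summary, "weekly.xlsx")  # reuse for every report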
Parquet
df.to_parquet("output.parquet")
df.to_parquet("output.parquet.gz", compression="gzip")
Default compression is snappy — fast, decent ratio. gzip for max compression at slower read.
Index handling:
df.to_parquet("out.parquet") # writes index as a column
df.to_parquet("out.parquet", index=False) # drops the index
For DataFrames with a meaningful index (DatetimeIndex), keep it.
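A quick self-contained round-trip, on an illustrative two-row frame, showing the index survives intact:
import pandas as pd

# Sketch: a DatetimeIndex is written to parquet and read back as-is.
daily = pd.DataFrame(
    {"Close": [189.5, 190.2]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)
daily.to_parquet("/tmp/indexed.parquet")
back = pd.read_parquet("/tmp/indexed.parquet")
print(back.index)  # DatetimeIndex(['2024-01-02', '2024-01-03'], ...)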
Partitioned Parquet
df.to_parquet("data/prices/", partition_cols=["Ticker"])
Creates data/prices/Ticker=AAPL/, data/prices/Ticker=MSFT/, etc. Each partition is its own directory of parquet files. Tools that read parquet (DuckDB, Polars, Spark) can read just the partitions they need.
aapl = pd.read_parquet("data/prices/", filters=[("Ticker", "==", "AAPL")])
For our universe, single-file is fine. Partition when you have millions of rows or many tickers.
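You can also point read_parquet at a single partition directory, a sketch assuming the layout above. Note the partition column lives in the directory name, not in the files, so it is absent when you read a partition by path:
import pandas as pd

# Reading one partition by path skips the filter entirely.
aapl = pd.read_parquet("data/prices/Ticker=AAPL/")
aapl["Ticker"] = "AAPL"  # re-add the partition column if downstream code needs it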
DuckDB persistent storage
import duckdb
con = duckdb.connect("market.duckdb")
con.execute("CREATE OR REPLACE TABLE prices AS FROM read_parquet('data/prices.parquet')")
con.execute("CREATE INDEX idx_ticker ON prices(Ticker)")
con.close()
DuckDB stores everything in market.duckdb — a single file, like SQLite for analytics. Reopen later:
import duckdb
con = duckdb.connect("market.duckdb")
df = con.execute("SELECT * FROM prices WHERE Ticker = 'AAPL'").df()
.df() materializes results as a pandas DataFrame. For transactional work or repeated queries, a persistent DuckDB table beats parquet-only by skipping the cost of re-reading the file on every query.
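DuckDB can also ingest an in-memory DataFrame directly, which is what the one-liner in the summary does: the Python client resolves the local variable df inside SQL via a replacement scan. A sketch with illustrative data:
import duckdb
import pandas as pd

# Sketch: persist an in-memory DataFrame as a DuckDB table.
df = pd.DataFrame({"Ticker": ["AAPL", "MSFT"], "Close": [189.5, 402.1]})
con = duckdb.connect("market.duckdb")
con.execute("CREATE OR REPLACE TABLE prices AS FROM df")  # reads local `df`
print(con.execute("SELECT COUNT(*) FROM prices").fetchone())  # (2,)
con.close()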
CSV (when you must)
df.to_csv("out.csv", index=False)
df.to_csv("out.csv.gz", index=False, compression="gzip")
Avoid for analysis files. Use for handoff to non-pandas tools (R, Stata, old systems).
df = pd.read_csv("out.csv", parse_dates=["Date"])
parse_dates= to recover datetime columns. Without it, dates come back as strings.
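To see the dtype loss concretely, a tiny round-trip with illustrative data:
import pandas as pd

tiny = pd.DataFrame({"Date": pd.to_datetime(["2024-01-02"]), "Close": [189.5]})
tiny.to_csv("/tmp/tiny.csv", index=False)
print(pd.read_csv("/tmp/tiny.csv")["Date"].dtype)                        # object
print(pd.read_csv("/tmp/tiny.csv", parse_dates=["Date"])["Date"].dtype)  # datetime64[ns]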
JSON
# Records — list of dicts, each row a dict
df.to_json("out.json", orient="records", date_format="iso")
# Read back
df2 = pd.read_json("out.json", orient="records")
orient="records" is the most useful — easy to consume from JS, web APIs, etc. date_format="iso" for machine-readable timestamps.
A complete report writer
import pandas as pd
df = pd.read_parquet("data/prices.parquet")
sectors = pd.read_csv("data/sector_map.csv")
# Compute report sections
df["Date"] = pd.to_datetime(df["Date"])
df["Return"] = df.groupby("Ticker")["Adj Close"].pct_change()
summary = df.groupby("Ticker").agg(
FirstClose=("Adj Close", "first"),
LastClose=("Adj Close", "last"),
AvgVolume=("Volume", "mean"),
).merge(sectors, on="Ticker")
summary["TotalReturn"] = summary["LastClose"] / summary["FirstClose"] - 1
summary = summary.sort_values("TotalReturn", ascending=False).round(4)
monthly = (
    df.set_index("Date")
    .groupby("Ticker")["Adj Close"]
    .resample("ME")
    .last()
    .unstack("Ticker")  # dates down the rows, one column per ticker
)
sector_perf = (
    df.merge(sectors, on="Ticker")
    .groupby("Sector")["Return"]
    .agg(lambda r: (1 + r.mean()) ** 252 - 1)  # annualize the mean daily return
    .sort_values(ascending=False)
    .round(4)
)
# Write multi-sheet report
with pd.ExcelWriter("market_report.xlsx", engine="openpyxl") as writer:
summary.to_excel(writer, sheet_name="Summary", index=False)
monthly.to_excel(writer, sheet_name="Monthly Close")
sector_perf.to_excel(writer, sheet_name="Sector Performance")
print("Wrote market_report.xlsx")
# Also write parquet for downstream Python consumers
df.to_parquet("data/clean_prices.parquet")
print("Wrote data/clean_prices.parquet")
Two outputs: Excel for humans, Parquet for the next script. Matches how real teams operate.
Try it
writeout.py:
import pandas as pd
pd.set_option("display.float_format", lambda x: f"{x:,.2f}")
df = pd.read_parquet("data/prices.parquet")
df["Date"] = pd.to_datetime(df["Date"])
summary = (
    df.groupby("Ticker")
    .agg(
        First=("Adj Close", "first"),
        Last=("Adj Close", "last"),
        AvgVol=("Volume", "mean"),
    )
    .round(2)
)
summary["TotalReturn"] = summary["Last"] / summary["First"] - 1
print("=== Summary ===")
print(summary)
print("\n=== Writing multi-sheet Excel ===")
with pd.ExcelWriter("/tmp/p4f_report.xlsx", engine="openpyxl") as writer:
summary.to_excel(writer, sheet_name="Summary")
for ticker in df["Ticker"].unique()[:3]: # first 3 tickers
sub = df[df["Ticker"] == ticker].sort_values("Date")
sub.to_excel(writer, sheet_name=ticker, index=False)
print(" → /tmp/p4f_report.xlsx")
print("\n=== Writing parquet ===")
df.to_parquet("/tmp/p4f_clean.parquet")
print(" → /tmp/p4f_clean.parquet")
print("\n=== Sanity: reload parquet ===")
back = pd.read_parquet("/tmp/p4f_clean.parquet")
print(f"Reloaded {len(back):,} rows")
Common stumbles
openpyxl missing. pip install openpyxl for Excel. pip install xlsxwriter for the alternative engine.
to_excel is slow. For 100k+ rows, expect seconds. Save parquet for analysis; Excel only for final reports.
Index in CSV. Default to_csv writes the index. Use index=False for clean Excel-friendly files.
Date round-trip in CSV. Dates become strings in CSV. pd.read_csv(..., parse_dates=["Date"]) to recover.
Parquet engine differences. pyarrow (default) and fastparquet produce slightly different files. Stick with pyarrow — the modern standard.
Multi-sheet write order. Sheets appear in the order you write them. Plan accordingly.
xlsxwriter can’t edit existing. It creates new files only. For editing, use openpyxl with mode="a" (append).
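A sketch of that append path, assuming market_report.xlsx already exists; if_sheet_exists controls what happens on a name clash:
import pandas as pd

extra = pd.DataFrame({"Note": ["generated by the pipeline"]})
with pd.ExcelWriter(
    "market_report.xlsx", engine="openpyxl", mode="a", if_sheet_exists="replace"
) as writer:
    extra.to_excel(writer, sheet_name="About", index=False)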
What’s next
Ep 14: build a portfolio backtest — equal-weight, monthly rebalanced, vs SPY benchmark.
Recap
Excel for humans (pd.ExcelWriter for multi-sheet). Parquet for analysis (fast, columnar, typed). DuckDB for SQL workloads (persistent file like SQLite). CSV only for hand-off to non-pandas tools. Always index=False for CSV unless the index is meaningful. For dates in CSV, round-trip with parse_dates=. For huge parquet, partition by frequently-filtered column (partition_cols=["Ticker"]).
Next episode: portfolio backtest.