Part of SQL for Finance

SQL for Finance: JOINs in Depth: Putting the Tables Back Together

Celest KimCelest Kim

Video: JOINs in Depth: Putting the Tables Back Together | SQL for Finance Ep4 by CelesteAI

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

Download the AtlasParts dataset and atlasparts.db — same as previous episodes: github.com/GoCelesteAI/excel-powerbi-for-finance.

By Episode 3, every query in this series has lived inside one table. Real finance work doesn’t. The order amount is in sales_orders, the customer name is in customers, the line-item product is in sales_order_lines, the GL account label is in chart_of_accounts. Every report you build in your day job stitches at least two of those together.

In Excel that stitching is VLOOKUP (or XLOOKUP, if your team has caught up). In SQL it’s JOIN. This episode walks through the three joins you’ll use 99% of the time — INNER, LEFT, and a brief look at RIGHT — and then spends the second half on the single most common bug an analyst writes against a real database: the join that silently quadruples your revenue number.

The AtlasParts schema, as a picture

Before we join anything, look at how the nine tables connect:

              countries (US, CA, UK, ...)
                   ▲
                   │ country
                   │
              customers ◀───┐
                   │         │
                   │ cust_id │
                   ▼         │
              sales_orders   │
                   │         │
                   │ ord_id  │
                   ▼         │
        sales_order_lines    │
                   │         │
                   │ prod_id │
                   ▼         │
              products       │
                             │
              chart_of_accounts (asset, expense, ...)
                   ▲
                   │ account_id
                   │
              gl_journal ── line_id ── sales_order_lines
                   │
              vendors ◀── vendor_id ── ap_invoices

Two patterns dominate: 1. A fact table (sales_orders, gl_journal, ap_invoices) holds the numbers. 2. Dimension tables (customers, products, vendors, chart_of_accounts, countries) hold the labels.

Joins exist to bring labels next to numbers.

INNER JOIN: keep only matched rows

The simplest join asks: which rows in table A have a match in table B? Every other row gets dropped.

SELECT o.order_id,
       o.order_date,
       c.customer_name,
       c.region,
       o.order_total
FROM   sales_orders o
JOIN   customers    c ON c.customer_id = o.customer_id
ORDER  BY o.order_date DESC
LIMIT  5;

Run that against AtlasParts and the customer name and region show up next to each order. The bare JOIN keyword means INNER JOIN. Both forms are identical, and the shorter one is the convention in modern SQL.

The ON clause is the equivalent of telling Excel which two columns are the lookup key. Once you’ve written it, the engine matches every row in sales_orders against every row in customers, keeps the pairs where customer_id agrees, and discards the rest.

Aliases. o and c are table aliases. They’re optional but they save typing the moment you have more than one table on the page. Pick a one- or two-letter alias and stick to it.

For AtlasParts the inner join produces 2,242 rows — the same as sales_orders — because every order in our dataset has a customer record. That’s a happy data set. In production you’ll see joins that should be one-to-one come back with fewer rows than expected, and the next section is how you find out why.

LEFT JOIN: keep all rows from the left table

LEFT JOIN keeps every row from the left-hand table whether or not it matches on the right. Unmatched right-hand columns come back as NULL. That’s the join you reach for any time the question is which of my X don’t have a Y?

AtlasParts has a countries dimension that maps ISO codes (US, CA, UK) to country names and regions. Customers, on the other hand, were imported from a sales spreadsheet where someone typed USA, us, United States of America, Great Britain, and Canada straight into the form. Run the obvious inner join:

SELECT c.customer_name, ctry.country_name, ctry.region
FROM   customers   c
JOIN   countries   ctry ON ctry.country = c.country;

You get 31 rows back. But customers has 50. Where did the other 19 go?

Switch to a LEFT JOIN and you find them:

SELECT c.customer_name, c.country, ctry.country_name
FROM   customers   c
LEFT JOIN countries ctry ON ctry.country = c.country
WHERE  ctry.country IS NULL;

Nineteen customers come back with country_name as NULL. Their country column holds USA, us, United States of America, Canada, Germany, France, Great Britain — values our countries dimension stores under different ISO codes. The lookup silently dropped them in the inner join.

This is the analyst’s first job whenever a report’s numbers feel low: switch the join to LEFT, filter WHERE right_table.key IS NULL, and look at what you lost.

The fix. Long term you fix this in the data — clean the customer file, or build a country-aliases table. Short term you write the join with a CASE or a normalized lookup. The bug only matters once you can see it.

RIGHT JOIN: rare, but worth knowing

RIGHT JOIN is LEFT JOIN with the table order flipped. Anything you can write as A LEFT JOIN B you can write as B RIGHT JOIN A. In practice almost no one writes RIGHT JOIN, because reading code is easier when the “anchor” table is on the left and the lookup is on the right.

SQLite added RIGHT JOIN support in version 3.39 (2022). Before that you had to flip the table order manually. Postgres, MySQL, SQL Server, and Snowflake have always had it. Know it exists; default to LEFT.

The wrong-cardinality trap (the centerpiece)

This is the join bug that ships finance dashboards with the wrong totals. Consider a query an analyst wrote on their first week:

“Show me total revenue by product category.”

Reasonable. Revenue lives on sales_orders.order_total, category lives on products.category, and the link between them is sales_order_lines. So you write:

SELECT p.category,
       ROUND(SUM(CAST(o.order_total AS REAL)),0) AS revenue
FROM   sales_orders      o
JOIN   sales_order_lines l USING (order_id)
JOIN   products          p USING (product_id)
GROUP  BY p.category
ORDER  BY revenue DESC;

The query runs. The dashboard renders. The number is wrong. Worse: the total across categories is wrong too — by a factor of more than 4×.

Here’s why. sales_orders has 2,242 rows; sales_order_lines has 7,855. The join from sales_orders to sales_order_lines produces one row per line, not one per order. Each order is duplicated by its line count — average 3.5 lines per order in this dataset. When you SUM(o.order_total) after that join, every order’s total is counted 3.5 times on average.

Run the proof:

-- Truth: sum on sales_orders alone
SELECT ROUND(SUM(CAST(order_total AS REAL)),0) FROM sales_orders;
-- $11,505,631

-- The lie: sum after joining to lines
SELECT ROUND(SUM(CAST(o.order_total AS REAL)),0)
FROM   sales_orders      o
JOIN   sales_order_lines l USING (order_id);
-- $50,017,707

Eleven and a half million becomes fifty million. The query didn’t error. The category labels are right. The number is wrong by 4.3×.

The fix is to aggregate at the right grain. If you want revenue by product category, sum line_total (which lives at the line grain), not order_total (which lives at the order grain):

SELECT p.category,
       ROUND(SUM(CAST(l.line_total AS REAL)),0) AS revenue
FROM   sales_order_lines l
JOIN   products          p USING (product_id)
GROUP  BY p.category
ORDER  BY revenue DESC;

That’s $11.5M total spread across categories. Correct.

The general rule: before you join, ask what one row of the result represents. If it’s “one line per order line”, don’t sum order-level columns. If it’s “one row per customer per month”, don’t sum monthly snapshots. Joining to a one-to-many table changes the grain. Aggregating at the wrong grain changes the answer.

Debugging the join: count first, sum later

When a join’s totals look wrong, this is the diagnostic that catches it in 30 seconds:

SELECT COUNT(*) FROM sales_orders;            -- 2,242
SELECT COUNT(*) FROM sales_orders o
       JOIN sales_order_lines l USING (order_id);  -- 7,855

If the join’s row count exceeds the left table’s row count, your join is one-to-many. Now you have to decide: do I want one row per order (aggregate the right side first, then join) or one row per line (sum at line grain)?

Production discipline: every time you write a multi-table query, run COUNT(*) against it before you sum anything. The cost is one query. The benefit is never shipping a 4× wrong number.

Recap

  • JOIN (a.k.a. INNER JOIN) keeps only matched rows.
  • LEFT JOIN keeps every left-hand row; unmatched right-hand columns come back NULL. Use it to find what’s missing.
  • RIGHT JOIN exists but is rarely written; flip the table order and use LEFT.
  • Joining a fact to a one-to-many child changes the grain. SUM at the wrong grain inflates totals silently.
  • Always COUNT(*) your join before you SUM anything.

Episode 5 covers subqueries and CTEs — the WITH clause that lets you build a query in pieces, name each piece, and read it like a recipe.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.