SQL · Jan 5, 2026 · 10 minute read

De-Duplicating Data in SQL: Beyond DISTINCT

Conor Hollern

De-Duplication Is More Than SELECT DISTINCT

SELECT DISTINCT feels like the right answer—until it isn’t. When duplicate rows carry different values, different timestamps, or different business meaning, deduplication becomes a data modeling problem, not a syntax trick. In this post, we’ll go beyond DISTINCT and walk through window functions, survivorship rules, and deterministic tie-breakers to build deduplication logic you can actually trust.

1. Why DISTINCT Is Not Enough

Let's assume we are working with an orders table, which tracks orders by customer and has the following columns:

  • order_id: the unique ID for the order.
  • customer_id: the unique ID for the customer that placed the order.
  • email: the customer's email address.
  • updated_at: the timestamp at which this order was last updated.
  • created_by: the user that created the record.

DISTINCT removes duplicate rows, not duplicate entities. That distinction matters.

SELECT DISTINCT
    customer_id,
    email,
    updated_at
FROM orders;

This only works if every column matches exactly. But in real data, you will likely see some of the following scenarios:

  • One record has a newer timestamp
  • One record has a NULL where another has a value
  • One record was system-generated, another user-entered

In these cases, DISTINCT either:

  • Keeps both rows, or
  • Forces you to drop columns, losing information

What we really want is one surviving record per business key, chosen intentionally. Heavy emphasis on intentionally.
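To see the failure mode concretely, here's a minimal sketch using SQLite through Python's sqlite3 module. The table and sample rows are hypothetical; the point is that DISTINCT compares entire rows, so two records for the same customer survive as long as any column differs.

```python
import sqlite3

# In-memory SQLite database with two rows for the same customer,
# differing only in updated_at (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "2026-01-01"),
        (2, 100, "a@example.com", "2026-01-03"),  # same customer, newer timestamp
    ],
)

# DISTINCT deduplicates whole rows, so the differing timestamps
# keep both rows in the result -- one customer, two "duplicates".
rows = conn.execute(
    "SELECT DISTINCT customer_id, email, updated_at FROM orders"
).fetchall()
print(len(rows))  # 2
```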

2. Using Window Functions to Identify Duplicates

Window functions let us rank rows within a logical group without collapsing the dataset prematurely.

Basic Pattern: ROW_NUMBER()

SELECT
    customer_id,
    email,
    updated_at,
    row_number() over (
        partition by customer_id
        order by updated_at desc
    ) as row_num
FROM orders;

This answers the question:

"If I had to pick one row per customer_id, which one wins?"

To actually deduplicate this dataset, simply filter where the row_num column equals 1:

WITH customer_data_staging AS (
    SELECT
        customer_id,
        email,
        updated_at,
        row_number() over (
            partition by customer_id
            order by updated_at desc
        ) as row_num
    FROM orders
)
SELECT * FROM customer_data_staging WHERE row_num = 1;

This gives you:

  • One row per customer_id
  • The most recently updated record
  • A repeatable, explainable rule
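The whole pattern can be exercised end to end with SQLite from Python. The data below is hypothetical: two orders for customer 100 and one for customer 200. Filtering on row_num = 1 keeps exactly the newest record per customer.

```python
import sqlite3

# Hypothetical sample data: two orders for customer 100, one for 200.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "2026-01-01"),
        (2, 100, "a@new.com",     "2026-01-03"),  # newest for customer 100
        (3, 200, "b@example.com", "2026-01-02"),
    ],
)

# One row per customer_id: the most recently updated record wins.
rows = conn.execute("""
    WITH customer_data_staging AS (
        SELECT
            customer_id,
            email,
            updated_at,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY updated_at DESC
            ) AS row_num
        FROM orders
    )
    SELECT customer_id, email FROM customer_data_staging
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(100, 'a@new.com'), (200, 'b@example.com')]
```

Note that window functions require SQLite 3.25+, which ships with any recent Python.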

3. Survivorship Rules: Choosing the "Best" Record

Real deduplication often requires multiple criteria. This is where survivorship rules come in.

Example survivorship logic

  • Prefer records with non-NULL email
  • Prefer user-created over system-created
  • Prefer the most recent update

To achieve this, extend the window function to order by additional criteria:

row_number() over (
    partition by customer_id
    order by
        (email is not null) desc,
        (created_by != 'System User') desc,
        updated_at desc
)

This ensures the "best" record survives, even if it isn't the newest.
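Here's a runnable sketch of the survivorship logic, again with hypothetical SQLite data. The newest record has a NULL email, so the ordered criteria deliberately pick the older, more complete record instead.

```python
import sqlite3

# Hypothetical data: the newest record has a NULL email, so the
# survivorship rules should pick the older, more complete record.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        created_by  TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "jane",        "2026-01-01"),
        (2, 100, None,            "System User", "2026-01-05"),  # newest, but NULL email
    ],
)

# Non-NULL email outranks recency, so order_id 1 survives.
row = conn.execute("""
    WITH ranked AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY
                    (email IS NOT NULL) DESC,
                    (created_by != 'System User') DESC,
                    updated_at DESC
            ) AS row_num
        FROM orders
    )
    SELECT order_id, email FROM ranked WHERE row_num = 1
""").fetchone()
print(row)  # (1, 'a@example.com')
```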

4. Deterministic Tie-Breakers (So Results Don't Randomly Change)

If your ORDER BY isn't fully deterministic, the database is free to return different winners on different runs. In a well-oiled data model, the same input always produces the same output. To guarantee that, the window function's ORDER BY must:

  • Reflect business priority first, THEN
  • Fall back to a stable, unique column

Bad (non-deterministic)

ORDER BY updated_at DESC

This ORDER BY considers only the business priority. That's the right starting point, but if two records share the same updated_at, the winner is arbitrary and can change from run to run.

Good (deterministic)

ORDER BY
    updated_at DESC,
    order_id ASC

This ORDER BY applies the business priority first, ensuring we get the most recently updated record. It then falls back to the orders table's unique order_id column to break ties. Because a record's order_id never changes, the result set is guaranteed to stay consistent.

Rule of thumb: If two rows could tie, your SQL must explain how to break the tie.
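The tie-breaker can be verified directly. In this hypothetical SQLite example, two records share an identical updated_at; the order_id fallback makes the same row win on every run, regardless of physical row order.

```python
import sqlite3

# Two records tie on updated_at; order_id breaks the tie deterministically.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (7, 100, "2026-01-03"),  # identical timestamps
        (4, 100, "2026-01-03"),
    ],
)

winner = conn.execute("""
    WITH ranked AS (
        SELECT order_id,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY updated_at DESC, order_id ASC
            ) AS row_num
        FROM orders
    )
    SELECT order_id FROM ranked WHERE row_num = 1
""").fetchone()
print(winner[0])  # 4 -- the lower order_id wins every run
```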

5. Keeping Duplicate Rows

You may not always want to remove duplicate rows. In that case, you can leverage the same logic to flag primary vs. duplicate rows.

To achieve this, modify the window function with a CASE statement:

CASE
    WHEN row_number() over (
        partition by customer_id
        order by
            (email is not null) desc,
            (created_by != 'System User') desc,
            updated_at desc
    ) = 1
    THEN 'PRIMARY'
    ELSE 'DUPLICATE'
END AS record_status

The record_status column can then be used for auditing duplicates, filtering primary vs. duplicate records, and data pipeline observability.
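Putting the flag into a full query looks like this. The three sample rows are hypothetical: customer 100 has a user-created and a system-created record, so the user-created one is flagged PRIMARY even though it's older.

```python
import sqlite3

# Hypothetical data: customer 100 has two records, customer 200 has one.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        created_by  TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "jane",        "2026-01-01"),
        (2, 100, "a@example.com", "System User", "2026-01-05"),
        (3, 200, "b@example.com", "joe",         "2026-01-02"),
    ],
)

# Every row is kept; the CASE over the window function labels each one.
rows = conn.execute("""
    SELECT
        order_id,
        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY
                    (email IS NOT NULL) DESC,
                    (created_by != 'System User') DESC,
                    updated_at DESC
            ) = 1
            THEN 'PRIMARY'
            ELSE 'DUPLICATE'
        END AS record_status
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 'PRIMARY'), (2, 'DUPLICATE'), (3, 'PRIMARY')]
```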

Final Thoughts

Deduplication isn't about removing rows; it's about encoding business truth into SQL.

By combining:

  • Window functions
  • Explicit survivorship rules
  • Deterministic tie-breakers

You move from accidental correctness to intentional data modeling. And once you adopt this pattern, SELECT DISTINCT quickly feels like a blunt instrument from a simpler time.