SQL · Jan 5, 2026 · 10 minute read

De-Duplicating Data in SQL: Beyond DISTINCT

Conor Hollern

De-Duplication Is More Than SELECT DISTINCT

SELECT DISTINCT feels like the right answer—until it isn’t. When duplicate rows carry different values, different timestamps, or different business meaning, deduplication becomes a data modeling problem, not a syntax trick. In this post, we’ll go beyond DISTINCT and walk through window functions, survivorship rules, and deterministic tie-breakers to build deduplication logic you can actually trust.

1. Why DISTINCT Is Not Enough

Let's assume we are working with an orders table, which tracks orders by customer and has the following columns:

  • order_id: the unique ID for the order.
  • customer_id: the unique ID for the customer that placed the order.
  • email: the customer's email address.
  • updated_at: the timestamp at which this order was last updated.
  • created_by: the user that created the record.

DISTINCT removes duplicate rows, not duplicate entities. That distinction matters.

SELECT DISTINCT
    customer_id,
    email,
    updated_at
FROM orders;

This only works if every column matches exactly. But in real data, you will likely see some of the following scenarios:

  • One record has a newer timestamp
  • One record has a NULL where another has a value
  • One record was system-generated, another user-entered

In these cases, DISTINCT either:

  • Keeps both rows, or
  • Forces you to drop columns, losing information

What we really want is one surviving record per business key, chosen intentionally. Heavy emphasis on intentionally.
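To see the failure mode concretely, here's a minimal sketch using SQLite through Python's sqlite3 module. The table and sample rows are hypothetical; the point is that DISTINCT compares entire rows, so two records for the same customer survive as long as any column differs.

```python
import sqlite3

# In-memory SQLite database with two rows for the same customer,
# differing only in updated_at (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "2026-01-01"),
        (2, 100, "a@example.com", "2026-01-03"),  # same customer, newer timestamp
    ],
)

# DISTINCT deduplicates whole rows, so the differing timestamps
# keep both rows in the result -- one customer, two "duplicates".
rows = conn.execute(
    "SELECT DISTINCT customer_id, email, updated_at FROM orders"
).fetchall()
print(len(rows))  # 2
```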

2. Using Window Functions to Identify Duplicates

Window functions let us rank rows within a logical group without collapsing the dataset prematurely.

Basic Pattern: ROW_NUMBER()

SELECT
    customer_id,
    email,
    updated_at,
    row_number() over (
        partition by customer_id
        order by updated_at desc
    ) as row_num
FROM orders;

This answers the question:

"If I had to pick one row per customer_id, which one wins?"

To actually deduplicate this dataset, simply filter where the row_num column equals 1:

WITH customer_data_staging AS (
    SELECT
        customer_id,
        email,
        updated_at,
        row_number() over (
            partition by customer_id
            order by updated_at desc
        ) as row_num
    FROM orders
)
SELECT * FROM customer_data_staging WHERE row_num = 1;

This gives you:

  • One row per customer_id
  • The most recently updated record
  • A repeatable, explainable rule
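The whole pattern can be exercised end to end with SQLite from Python. The data below is hypothetical: two orders for customer 100 and one for customer 200. Filtering on row_num = 1 keeps exactly the newest record per customer.

```python
import sqlite3

# Hypothetical sample data: two orders for customer 100, one for 200.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "2026-01-01"),
        (2, 100, "a@new.com",     "2026-01-03"),  # newest for customer 100
        (3, 200, "b@example.com", "2026-01-02"),
    ],
)

# One row per customer_id: the most recently updated record wins.
rows = conn.execute("""
    WITH customer_data_staging AS (
        SELECT
            customer_id,
            email,
            updated_at,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY updated_at DESC
            ) AS row_num
        FROM orders
    )
    SELECT customer_id, email FROM customer_data_staging
    WHERE row_num = 1
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(100, 'a@new.com'), (200, 'b@example.com')]
```

Note that window functions require SQLite 3.25+, which ships with any recent Python.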

3. Survivorship Rules: Choosing the "Best" Record

Real deduplication often requires multiple criteria. This is where survivorship rules come in.

Example survivorship logic

  • Prefer records with non-NULL email
  • Prefer user-created over system-created
  • Prefer the most recent update

To achieve this, extend the window function to order by additional criteria:

row_number() over (
    partition by customer_id
    order by
        (email is not null) desc,
        (created_by != 'System User') desc,
        updated_at desc
)

This ensures the "best" record survives, even if it isn't the newest.
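Here's a runnable sketch of the survivorship logic, again with hypothetical SQLite data. The newest record has a NULL email, so the ordered criteria deliberately pick the older, more complete record instead.

```python
import sqlite3

# Hypothetical data: the newest record has a NULL email, so the
# survivorship rules should pick the older, more complete record.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        created_by  TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "jane",        "2026-01-01"),
        (2, 100, None,            "System User", "2026-01-05"),  # newest, but NULL email
    ],
)

# Non-NULL email outranks recency, so order_id 1 survives.
row = conn.execute("""
    WITH ranked AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY
                    (email IS NOT NULL) DESC,
                    (created_by != 'System User') DESC,
                    updated_at DESC
            ) AS row_num
        FROM orders
    )
    SELECT order_id, email FROM ranked WHERE row_num = 1
""").fetchone()
print(row)  # (1, 'a@example.com')
```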

4. Deterministic Tie-Breakers (So Results Don't Randomly Change)

If your ORDER BY isn't fully deterministic, the database is free to return different winners on different runs. In a well-oiled data model, the same input always produces the same output. To guarantee that, the window function's ORDER BY must:

  • Reflect business priority first, THEN
  • Fall back to a stable, unique column

Bad (non-deterministic)

ORDER BY updated_at DESC

This ORDER BY considers only the business priority. That's the right starting point, but if two records share the same updated_at, the winner is arbitrary and can change from run to run.

Good (deterministic)

ORDER BY
    updated_at DESC,
    order_id ASC

This ORDER BY applies the business priority first, ensuring we get the most recently updated record. It then falls back to the orders table's unique order_id column to break ties. Because a record's order_id never changes, the result set is guaranteed to stay consistent.

Rule of thumb: If two rows could tie, your SQL must explain how to break the tie.
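The tie-breaker can be verified directly. In this hypothetical SQLite example, two records share an identical updated_at; the order_id fallback makes the same row win on every run, regardless of physical row order.

```python
import sqlite3

# Two records tie on updated_at; order_id breaks the tie deterministically.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (7, 100, "2026-01-03"),  # identical timestamps
        (4, 100, "2026-01-03"),
    ],
)

winner = conn.execute("""
    WITH ranked AS (
        SELECT order_id,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY updated_at DESC, order_id ASC
            ) AS row_num
        FROM orders
    )
    SELECT order_id FROM ranked WHERE row_num = 1
""").fetchone()
print(winner[0])  # 4 -- the lower order_id wins every run
```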

5. Keeping Duplicate Rows

You may not always want to remove duplicate rows. In that case, you can leverage the same logic to flag primary vs. duplicate rows.

To achieve this, modify the window function with a CASE statement:

CASE
    WHEN row_number() over (
        partition by customer_id
        order by
            (email is not null) desc,
            (created_by != 'System User') desc,
            updated_at desc
    ) = 1
    THEN 'PRIMARY'
    ELSE 'DUPLICATE'
END AS record_status

The record_status column can then be used for auditing duplicates, filtering primary vs. duplicate records, and data pipeline observability.
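Putting the flag into a full query looks like this. The three sample rows are hypothetical: customer 100 has a user-created and a system-created record, so the user-created one is flagged PRIMARY even though it's older.

```python
import sqlite3

# Hypothetical data: customer 100 has two records, customer 200 has one.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        email       TEXT,
        created_by  TEXT,
        updated_at  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, "a@example.com", "jane",        "2026-01-01"),
        (2, 100, "a@example.com", "System User", "2026-01-05"),
        (3, 200, "b@example.com", "joe",         "2026-01-02"),
    ],
)

# Every row is kept; the CASE over the window function labels each one.
rows = conn.execute("""
    SELECT
        order_id,
        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY
                    (email IS NOT NULL) DESC,
                    (created_by != 'System User') DESC,
                    updated_at DESC
            ) = 1
            THEN 'PRIMARY'
            ELSE 'DUPLICATE'
        END AS record_status
    FROM orders
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 'PRIMARY'), (2, 'DUPLICATE'), (3, 'PRIMARY')]
```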

Final Thoughts

Deduplication isn't about removing rows; it's about encoding business truth into SQL.

By combining:

  • Window functions
  • Explicit survivorship rules
  • Deterministic tie-breakers

You move from accidental correctness to intentional data modeling. And once you adopt this pattern, SELECT DISTINCT quickly feels like a blunt instrument from a simpler time.