The Farber consulting group.inc

Remove duplicate records in sql

In MS SQL 2005 and later editions, a much better way has been provided to remove duplicate records smartly, while having full control of which records to remove. It can be accomplished by using the ROW_NUMBER() which will return a number of row within a partition of a result set. With this code:

ROW_NUMBER() OVER (PARTITION BY Clause ORDER BY Clause)

You can have a partition on something like Email or SSN or a combination of several fields like: First, Last, Street Number, Street Name and City.

 

Call us for Free Consultation for Remote DBA services at our SQL Consulting Firm at: 732-536-4765

 

The idea is to perform data cleansing while removing records where you can control any aspect of it. What's even better is the ability to combine different pieces of data from different records into one record while the rest of the records for that partition (group) are removed with no data being lost. Duplicate data is a problem that exists in many databases and the purpose of removing duplicate records is to improve productivity and increase the performance of your database. The below article will discuss the technique of how to remove duplicate records - rows in MSSQL database.

 

Check our Custom Software Development Services.


Let's create some sample data:

USE TempDB

-- Check if the table exists and if not then create it. 
IF NOT EXISTS(SELECT * 
    FROM Sys.Tables
    WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[#TableWithDups]'))
    
    -- Create temp table
    BEGIN
    CREATE TABLE #TableWithDups(
        [PK]    TINYINT NOT NULL PRIMARY KEY,
        [EMail] [varchar](40) NULL,
        [First] [varchar](15) NULL,
        [Last] [varchar](20) NULL,
        [Gender] [varchar] (1) NULL, 
        [DOB] [varchar](8) NULL,
        [Phone] [varchar](10) NULL,
        [StreetNumber] [varchar](30) NULL,
        [StreetName] [varchar](30) NULL
    ) 
    END
-----------------------------------------------------------------
-- Check if some temp tables exist in the TempDB database -------
-----------------------------------------------------------------
ELSE
BEGIN
    TRUNCATE TABLE #TableWithDups
END

IF EXISTS(SELECT * 
    FROM Sys.Tables
    WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[#DeDup_Testing]'))
BEGIN
    DROP TABLE #DeDup_Testing   
END

--------------------------------------------------------------------
--- Insert duplicates records into TableWithDups
--------------------------------------------------------------------
INSERT INTO #TableWithDups ([PK],[EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (1,'MSSQL1@RemoveDupRecords.Com','Susan','Smith','F', '19600912','','1100','Main Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (2,'MSSQL1@RemoveDupRecords.Com','Susan','Smith','',   '','1113058336','1100','Main Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (3,'MSSQL1@RemoveDupRecords.Com','Susan','Smith','F','','1113058336','1100','Main Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (4,'MSSQL2@RemoveDupRecords.Com','Joe','Smith','','19640825','2223058336','1100','Main Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName])  
VALUES (5,'MSSQL2@RemoveDupRecords.Com','Joe','Smith','M','19640825','2223058336','1100','Main Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName])  
VALUES (6,'MSSQL3@RemoveDupRecords.Com','Alex','Walker','M','','5553058888','55','KingFisher Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (7,'MSSQL3@RemoveDupRecords.Com','Alex','Walker','F','19580912','','55','KingFisher Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName])  
VALUES (8,'MSSQL3@RemoveDupRecords.Com','Alex','Walker','','1952','','55','KingFisher Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (9,'MSSQL4@RemoveDupRecords.Com','Uri','Pearl','M','1961','3333054422','110','Robertsville Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName])  
VALUES (10,'MSSQL4@RemoveDupRecords.Com','Uri','Pearl','','19600912','3333054422','110','Robertsville Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (11,'MSSQL4@RemoveDupRecords.Com','Uri','Pearl','','','','110','Robertsville Road')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (12,'MSSQL5@RemoveDupRecords.Com','Terry', 'Thurber','M','','4443058377','500','Larkin Street')

INSERT INTO #TableWithDups ([PK], [EMail],[First],[Last],[Gender],[DOB],[Phone],[StreetNumber],[StreetName]) 
VALUES (13,'MSSQL5@RemoveDupRecords.Com','Terry', 'Thurber','','19600809','','500','Larkin Street')

Below is some sample code to explain this idea of how to remove duplicates using the MS SQL ROW_NUMBER and partition idea:

SELECT 
    [First],
    [Last], 
    [Email],
    [DOB],
    ROW_NUMBER() OVER (PARTITION BY Email ORDER BY Email,DOB DESC,Gender) AS Rn, 
    StreetNumber,
    StreetName
FROM 
    #TableWithDups 

In the above, the Email works as a partition and has a unique value to allow us to create a group of records on that field and also control the order of how the records will show up. In the data below Susan Smith with the record that has the value of Rn = 2 and Rn=3 are considered as duplicate records and should be deleted.

 

In actuality the only difference compared to the original data that was created in the temp table named: TableWithDups is the additional column named: Rn. Every record in the Rn that is greater than 1 is considered as a duplicate one. Adding the ORDER BY on different fields can change the way the records will show up as Rn=1. See Sample 1 below to get the idea:

 

Sample 1:

Data cleansing remove duplicate records In SQL Database

 

When running the below code as Common Table Expression (CTE) flexibility is increased as shown below and a WHERE clause is added as WHERE Rn = 1. As a side note using the CTE is considered as a temporary result. The CTE can be self-referenced multiple times in the same query. It is similar to a derived table but much more flexible.

;
WITH DeDupe
AS
(
SELECT 
    [PK],
    [First],
    [Last], 
    [Email],
    [Gender],
    [DOB],
    ROW_NUMBER() OVER (PARTITION BY Email ORDER BY Email,DOB DESC) AS Rn, 
    StreetNumber,
    StreetName
FROM 
    #TableWithDups 
)   
SELECT [PK],[First],[Last],[Email], [Gender],[DOB],Rn,StreetNumber,StreetName
FROM DeDupe WHERE Rn = 1
ORDER BY [First],[Last]     

Here is the data as shown below in Sample 2 after running the above code:

Sample 2:
Find and remove duplicate records In MSSQL Data

 

As you see in the above in Sample 2, all duplicate records that have Rn > 1 do not show up anymore. But if you look closely we lost some data that could have been combined in the record with Rn=1. We can increase the partition by adding more fields to make sure there are no duplicate addresses, but that would not change the fact that some data will be lost.

 

The real power of the above technique is that we can combine several values from different duplicate records in the record that gets the value of Rn=1. Let's examine closely what we have above. If you look at Sample 2 then Joe Smith and Terry Thurber lost the data in the Gender column. Looking at Sample 1 under PK= 5, the Gender is M for Joe and also PK = 13 which is Terry Thurber. The same problem exists also for the Phone number as well. The following code will take care of this problem and incorporate all required pieces of data in one record.

;
WITH DeDupe
AS
(
SELECT
    PK,
    [First],
    [Last],
    [Email],
    [Gender],
    [DOB],
    MAX(Phone) OVER (PARTITION BY Email) AS Phone_Updated ,
    MAX(CASE WHEN CAST(DOB AS INT) > 0 THEN CAST(DOB AS INT) ELSE DOB END) 
    OVER (PARTITION BY Email) AS DOB_Updated,
    MAX(Gender) OVER (PARTITION BY Email) AS Gender_Updated,  
    ROW_NUMBER() OVER (PARTITION BY Email,First,Last,LTRIM(LEFT(UPPER([StreetNumber]),4)) + LEFT(UPPER([StreetName]),4)  
    ORDER BY DOB DESC,Phone DESC,Gender DESC,Email,First,Last) AS Rn, 
    [Phone],
    [StreetNumber],
    [StreetName]
FROM #TableWithDups 
) 
SELECT 
    PK,
    [First],
    [Last],
    [Email],
    [Gender],
    [Gender_Updated], 
    [DOB],
    CASE WHEN CAST(DOB_Updated AS VARCHAR(10))='0' THEN '' ELSE CAST(DOB_Updated   
    AS VARCHAR(10)) END DOB_Updated,
    [Phone],
    [Phone_Updated],
    [Rn],
    [StreetNumber],
    [StreetName]
INTO #DeDupe_Testing FROM DeDupe WHERE Rn > 0

-- Let's show some data
SELECT * FROM #DeDupe_Testing WHERE Rn=1

The below data is the result of running the above code:

 

Sample 3:

Remove duplicate records in sql server and merge records to one

The below is the continuation of the above data.

 

Data cleansing- find and remove duplicate records in ms sql server

 

Using the MAX(Phone) with the partition technique on each field that we don't want to lose data, will make sure to get the data in the row that Rn=1. Having this code on the Phone MAX(Phone) OVER (PARTITION BY Email) AS Phone_Updated and create a new column to show the new data near the original column helps to analyze the data to make sure everything works fine.

 

In the above code we make sure that no data is lost while we remove all duplicate records. When all partitions are handled then we create ROW_NUMBER() OVER (PARTITION…) on all combinations of fields that create a unique partition. We also ORDER each field that was part of the MAX() as DESC. Order the field as DESC is crucial and is part of the technique of merging all data from different rows into one row of the same partition. In addition, I added code around the DOB (Date of Birth) since in some cases it has only 0 value and clearing the DOB to nothing is a better data for me.

 

In sample 3 all fields such as: Gender_Updated, DOB_Updated and Phone_Updated are filled with data while all other duplicate records were removed. In Sample 3 Joe Smith and Terry Thurber did not lose any data compared to Sample 2. In addition, the idea is to copy all records to a different table while leaving the original data as UN touched for a backup.

 

The DOB is a VARCHAR (10), since when import data it is always a character based field that comes in many different formats. For additional info on that matter please see this article at: data cleansing and manipulating date format.

 

Check our Custom Software Development Services.


Conclusions:
==========

Once you have found your duplicate records, you face the problem of merging multiple records into one without losing any data. Some of the problems you might face when merging records could be that it has a phone or another record could have an email address and you cannot afford to remove these duplicate records when you are taking the risk of losing some important data. The duplicate record remover that was discussed above covers all of these points and provides a solution that safely removes duplicate records from your database. In addition, to use the code above the Sample 3 data is when you know for sure that you may have fields with empty data; otherwise simply use the code above the Sample 2 data.

 

Go back to our main blog.

6 comments for “Remove duplicate records in sql”

  1. Posted Wednesday, December 28, 2011 at 10:25:28 AM

    That's an interesting idea how you merged data from different fields and remove duplicate records.

  2. Posted Wednesday, December 28, 2011 at 12:18:10 PM

    That’s a great idea to remove duplicate records and I always look for a good source for my CRM data cleansing.

  3. Gravatar of tony Basirtony Basir
    Posted Monday, January 02, 2012 at 7:41:30 PM

    This is a problem that I have struggled with over time and created more than one solution for in various languages. You solution to use the RowNumber and Partition clauses of SQL Server is a straight forward, easy to understand and implement.
    Thanks for the copous examples and detailed notes. I will be starting a cleansing project soon and this will save me a lot of time and agony

  4. Posted Tuesday, January 10, 2012 at 3:01:29 PM

    Hi,

    That’s a great idea to remove duplicate records while merging data from different records. I will definitely keep you in mind for any new data cleansing that I may need.

    Thanks,

    Jeff

  5. Posted Thursday, March 06, 2014 at 6:45:39 AM

    Besides using Row_Number() with Partition By ( http://www.kodyaz.com/articles/delete-duplicate-rows-using-row-number-partition-by-order-by.aspx ) on SQL Server 2005 and later, for previous versions of SQL Server developers can use other methods noted at http://www.kodyaz.com/articles/delete-duplicate-records-rows-in-a-table.aspx like using SET ROWCOUNT, TOP or using an IDENTITY column, etc.

  6. Gravatar of VladVlad
    Posted Saturday, May 10, 2014 at 11:10:27 AM

    Great article!!!
    It goes beyond just removing duplicated PK from a table.
    The technique is very valuable and extended my knowledge.

Post a comment