
Processing hundreds of millions of records got much easier

SQL Server 2012 - Processing hundreds of millions of records can be done in less than an hour.

Processing hundreds of millions of records requires a different strategy, and the implementation should differ from the one used for smaller tables with only a few million records.


Say you have 800 million records in a table and you need to delete 200 million. Obviously you can use this code:

DELETE FROM Users WHERE DOB = ''

The problem is that you are running one big transaction, so the log file will grow tremendously and the process can take a long time. How long also depends on the speed of your server.
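By contrast, the same delete can be broken into small batches. The following is a minimal sketch, assuming the same Users table and empty-string DOB filter as above (DELETE TOP requires SQL Server 2005 or later):

```sql
-- Delete in small batches so each DELETE commits as its own transaction.
-- In SIMPLE recovery (or FULL recovery with regular log backups) the log
-- space freed by one batch can be reused by the next, so the log stays small.
DECLARE @BatchSize INT = 1000000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM Users
    WHERE DOB = '';

    -- Stop once a batch deletes fewer rows than the batch size.
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END
```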


The solution is to use small batches and process one to several million records at a time. This way the log file stays small: whenever a new batch starts, it reuses the same log file space, so the file does not grow. Before I used the batch process technique, I once ended up with a 250 GB log file when I tried to split the Address field.

If I need to move 250 million records from one database to another, the batch processing technique is a winner. On my server it took 30 minutes and 41 seconds, and I could also track the time per batch, which makes it possible to predict when the entire process will finish. Another advantage of small batches is that if you need to cancel the process for whatever reason, it rolls back immediately or within a few seconds, depending on your server speed.


I basically use this technique even for a small change to a table. Say you need to add an identity field to a table with 250 million records. This simple code:

ALTER TABLE Users ADD Users_PK INT IDENTITY(1,1)

will take at least 7 hours to process.

Copying the whole table into an empty one is much faster, as demonstrated below. Another time saver: adding a PERSISTED computed column to a table of 250 million records took us more than a day without the batch technique. The time it takes also depends on the complexity of the computed column. With the batch approach, the persisted computed column was already part of the empty table the data was inserted into, and it did not change the speed of the process below.
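The copy-to-an-empty-table approach can be sketched as follows (the Users_New name and the abbreviated column list are assumptions for illustration; the real table has more columns):

```sql
-- Create an empty copy of the table that already includes the identity
-- column (and, if needed, any persisted computed column), then load it
-- with a batch loop and swap the tables at the end.
CREATE TABLE dbo.Users_New (
    Users_PK INT IDENTITY(1,1) NOT NULL,
    [First]  VARCHAR(50) NULL,
    [Last]   VARCHAR(50) NULL
    -- ... remaining columns from the original table
);

-- After the batch load completes:
-- EXEC sp_rename 'dbo.Users', 'Users_Old';
-- EXEC sp_rename 'dbo.Users_New', 'Users';
```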

Another advantage of this MS SQL batch processing code shows up when you hit an error: it is very helpful for debugging. In my case it could be a truncation error when moving data from one field to another. By looking at the BatchProcess table you can see the last processed batch range, go straight to that range, and inspect the data. You know it is the last batch, since the code stops working after the error occurs. When processing hundreds of millions of records, bad data can sometimes cause a truncation issue.

--- Inspect the data based on the last batch.
SELECT
   COUNT(StreetName)
FROM
    Users
WHERE Users_PK BETWEEN 6000007 AND 7000007 AND LEN(StreetName) > 30

The technique below requires a clustered index on the PK; with it, 1 million records take 8 to 30 seconds to process, compared to 10 minutes without one. But even without the clustered index, working in batches reduces processing time by far: splitting an Address into Street Number and Street Name without a clustered index took about 8 hours, where it previously took days (that process used an UPDATE). For the process below, even though I used ORDER BY on First and Last, the clustered index on Users_PK was sufficient for the entire process and no other indexes were needed. In other cases you may need additional indexes.
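If the table does not already have one, a clustered primary key on Users_PK can be created like this (a sketch; the constraint name PK_Users is an assumption):

```sql
-- A clustered index on the PK lets each BETWEEN range in the batch loop
-- seek directly to its rows instead of scanning the whole table.
ALTER TABLE Users.dbo.Users
ADD CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (Users_PK);
```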

BEGIN TRY
--- Copy the PK from the source table into the target table-------
SET IDENTITY_INSERT Users..Users ON
------------------------------------------------------------------
-- Increment every 1000000 records. You can choose any number that works for you.
DECLARE @Increment INT = 1000000,
        @LastRecord_PK INT = 0,
        @iRowCount INT = 0,
        @BatchRowCount INT = 0,
        @iCurrentRowCount INT = 0,
        @FirstRec_PK INT = 0
        
-- Get the min of the PK minus 1. This way we can start the range from the first record.
-- Make sure to choose the right source table for the MIN(PK)
DECLARE @FirstRecord_PK AS INT=(SELECT
    MIN(Users_PK)
FROM
    Users.dbo.Users) - 1

-- Show the starting point (for verification)
SELECT @FirstRecord_PK

-- Select the database for the table that records each INSERT
USE Utilities
IF NOT EXISTS(SELECT *
            FROM Sys.Tables
            WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[BatchProcess]'))

BEGIN
CREATE TABLE [Utilities].[dbo].[BatchProcess](
    [Batch_PK] [INT] IDENTITY(1,1) NOT NULL,
    [RowCountTemp]  [INT] NULL,
    [BatchRowCount] [INT] NULL,
    [DateTimeEntered] [DATETIME] NULL,
    [FirstRecord_PK] INT NOT NULL,
    [LastRecord_PK] INT NOT NULL
) ON [PRIMARY]       
END
ELSE
BEGIN
-- Remove all records from the previous process.
TRUNCATE TABLE Utilities.dbo.BatchProcess
END

-- Create an Error handler table if not found.
IF NOT EXISTS(SELECT *
            FROM Sys.Tables
            WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[Errors]'))
BEGIN
CREATE TABLE [Utilities].[dbo].[Errors](
    [Errors_PK] [INT] IDENTITY(1,1) NOT NULL,
    [Error_No] [INT] NULL,
    [ErrorMessage] [VARCHAR] (4000) NULL,
    [ErrorLine] [VARCHAR] (50) NULL,
    [Date_Time] [DATETIME] NULL
 ) ON [PRIMARY]
END

--- Starting the time and that will be the first record in the BatchProcess table
INSERT INTO Utilities.dbo.BatchProcess
    (RowCountTemp,BatchRowCount,DateTimeEntered,FirstRecord_PK,LastRecord_PK)
VALUES
    (0,0,GETDATE(),@FirstRecord_PK,@LastRecord_PK)

USE Users
WHILE EXISTS (SELECT TOP 1 Users_PK FROM Users.dbo.Users WHERE Users_PK > @LastRecord_PK)
BEGIN

    -- Set the values for the counters, the first and last record.
    SELECT @FirstRecord_PK = @FirstRecord_PK + 1,
            @LastRecord_PK = @FirstRecord_PK + @Increment

    ------------------------------------------------------------------------------
    --- In here you place your code to process your records ----------------------
    --- Make sure to add the range for the PKs as shown below --------------------
    ------------------------------------------------------------------------------           
        --- Insert records into the target table
    INSERT INTO Users..Users(
           [Users_PK],
           [First],
           [Last],
           [StreetNumber],
           [StreetName],
           [City],
           [State],
           [Zip],
           [Phone],
           [Date_Uploaded]
           )
    SELECT
           [Users_PK],  
           [First],
           [Last],
           [StreetNumber],
           [StreetName],
           [City],
           [State],
           [Zip],
           [Phone],
           SUBSTRING([Date_Uploaded],1,8) AS Date_Uploaded
    FROM
        Users.dbo.Users Ue
    WHERE Ue.Users_PK BETWEEN @FirstRecord_PK AND @LastRecord_PK
    ORDER BY [Ue].[Last],[Ue].[First]
    ------------------------------------------------------------------------------

    -- Gets the number of rows that were processed per batch        
    SET @iCurrentRowCount = @@ROWCOUNT
    -- Gets the number of rows for all batches that were processed so far                                           
    SET @iRowCount = @iRowCount + @iCurrentRowCount
    -- Gets the number of rows per batch
    SET @BatchRowCount = @iCurrentRowCount
    -- Gets the first PK that was processed per batch
    SET @FirstRec_PK = @FirstRecord_PK
    -- Gets the next first PK for the next batch
    SET @FirstRecord_PK = @LastRecord_PK
    PRINT @iRowCount
    
    --- Keep track of each process
    INSERT INTO Utilities.dbo.BatchProcess
        (RowCountTemp,BatchRowCount,DateTimeEntered,FirstRecord_PK,LastRecord_PK)
    VALUES
        (@iRowCount,@BatchRowCount,GETDATE(),@FirstRec_PK,@LastRecord_PK)

END
END TRY

-- Let's check for any error and if so let's log the error.
BEGIN CATCH

INSERT INTO [Utilities].[dbo].Errors
    (
        Error_No,ErrorMessage,ErrorLine,Date_Time
    )
VALUES
    (
    ERROR_NUMBER(),ERROR_MESSAGE(),ERROR_LINE(),GETDATE()
    )

SELECT
    ERROR_NUMBER()      AS ErrorNumber ,
    ERROR_SEVERITY()    AS ErrorSeverity ,
    ERROR_STATE()       AS ErrorState ,
    ERROR_PROCEDURE()   AS ErrorProcedure ,
    ERROR_LINE()        AS ErrorLine ,
    ERROR_MESSAGE()     AS ErrorMessage;
END CATCH;

While the above code is running, you can run the code below to see the status of the process:

SELECT * FROM Utilities.dbo.BatchProcess ORDER BY DateTimeEntered DESC
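Since the article targets SQL Server 2012, the per-batch timing can also be computed directly from the BatchProcess table with the LAG window function (a sketch using only the columns defined above):

```sql
-- Seconds elapsed between consecutive batches, to estimate total run time.
SELECT
    Batch_PK,
    BatchRowCount,
    FirstRecord_PK,
    LastRecord_PK,
    DATEDIFF(SECOND,
             LAG(DateTimeEntered) OVER (ORDER BY Batch_PK),
             DateTimeEntered) AS SecondsPerBatch
FROM Utilities.dbo.BatchProcess
ORDER BY Batch_PK;
```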


In the image below, the time difference between rows 7 and 8 was 8 seconds, and between rows 1 and 2 it was 7 seconds; at that point 6,957,786 records had been processed, and that batch contained 994,804 records. You can also see the range of PKs that was processed.

[Image: the BatchProcess table near the beginning of the run]

Looking further, towards the end of the process the difference between rows 242 and 243 is 8 seconds as well. As you can see, this gives a very good estimate of the time for the entire process.

[Image: the BatchProcess table near the end of the run]


Written by: Doron Farber - The Farber Consulting Group, Inc

There are 2 comments for "Processing hundreds of millions of records got much easier"

By Uri Dimant | March 28, 2011

I found that article very useful. The key point here is to have a clustered index on the PK column so that SQL Server can process the data efficiently.

By | April 02,2012

Hello Ann,

That problem occurs when the query retrieves a large number of rows. See this article written by Uri Dimant about the problem and how it was resolved at: http://dimantdatabasesolutions.blogspot.com/2009/04/how-to-eat-asyncnetworkio.html

Regards,

Doron http://www.dfarber.com/microsoft-solutions/mssql-server.aspx
