MS SQL 2008 - Processing hundreds of millions records can
be done in less than an hour.
Processing hundreds of millions of records requires a different
strategy and the implementation should be different compared to
smaller tables with only several millions of records.
Say you have 800 millions of records in a table and you need to
delete 200 million. Obviously you can use this code:
DELETE FROM Users WHERE DOB = ''
The problem is that you are running one big transaction and the
log file will grow tremendously. The process can take a long time.
It also depends on the speed of your server as well.
The solution is to use small batches and process 1 to several
millions of records at the time. This way the log file stays small
and whenever a new process starts, the new batch will reuse the
same log file space and it will not grow. Before I used the batch
process technique, in one time I had 250 GB log file when I tried
to split the Address field.
If I need to move 250 millions of records from one database to
another the batch processing technique is a winner. Under my server
it would take 30 minutes and 41 seconds, and also can track down
the time per each batch. This way I will be able to predict when
the entire process is finished. Another advantage when using small
batch is if you need to Cancel the process from whatever reason
then it takes immediately or several seconds to recover. (Depends
on your server speed).
I basically use this technique even for a small change that I
may do in a table. Say you need to add an Identity field and you
have a table with 250 millions of records. That simple code:
ALTER TABLE Users ADD Users_PK INT IDENTITY(1,1)
will take at least 7 hours to process.
Copy the whole table to an empty one will be much faster as
demonstrated below. Another example for saving time is if you need
to add a computed field as a Persisted one, it took us more than a
day without using the batch technique for a table of 250 millions
of records. The time it takes also depends of the complexity of the
computed field. A persisted computed field was part of the empty
table where data was inserted and that did not change the speed of
the below process.
Another advantage for using ms sql batch processing code is when
you have an error. It is very helpful to debug ms sql table. In my
case it could be a truncate error when trying to fix data from one
field to another. By looking at the BatchProcess table you can see
the last processed batch range and you can go right into that range
and inspect the data. You know it is the last batch since the code
will stop working after the error occurred. When process hundreds
of millions records sometimes bad data could cause a truncate
issue.
--- Inspect the data based on the last batch.
SELECT
COUNT(StreetName)
FROM
Users
WHERE Users_PK BETWEEN 6000007 AND 7000007 AND LEN(StreetName) > 30
The technique below requires that you have a clustered index on
the PK, and this way 1 million records takes to process from 8 to
30 seconds compare to 10 minutes without a clustered index. But
even without the clustered index working with batches reduces the
processing time by far. To split an Address to Street Number and
Street Name without a clustered index took about 8 hours and before
it took days to process. For that process an UPDATE was used. For
the below process even though I used the ORDER BY First and Last,
the clustered index on Users_PK was sufficient for the entire
process and no other indexes were needed. Again in other cases you
may need to have additional indexes.
BEGIN TRY
--- Copy the PK from the source table into the target table-------
SET IDENTITY_INSERT Users..Users ON
------------------------------------------------------------------
-- Increment every 1000000 records. You can choose any number that works for you.
DECLARE @Increment INT = 1000000,
@LastRecord_PK INT = 0,
@iRowCount INT = 0,
@BatchRowCount INT = 0,
@iCurrentRowCount INT = 0,
@FirstRec_PK INT = 0
-- Get the min of the PK minus 1. This way we can start the range from the first record.
-- Make sure to choose the right source table for the MIN(PK)
DECLARE @FirstRecord_PK AS INT=(SELECT
MIN(Users_PK)
FROM
Users.dbo.Users) - 1
SELECT @FirstRecord_PK
-- Select the database for the table that records each INSERT
USE Utilities
IF NOT EXISTS(SELECT *
FROM Sys.Tables
WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[BatchProcess]'))
BEGIN
CREATE TABLE [Utilities].[dbo].[BatchProcess](
[Batch_PK] [INT] IDENTITY(1,1) NOT NULL,
[RowCountTemp] [INT] NULL,
[BatchRowCount] [INT] NULL,
[DateTimeEntered] [DATETIME] NULL,
[FirstRecord_PK] INT NOT NULL,
[LastRecord_PK] INT NOT NULL
) ON [PRIMARY]
END
ELSE
BEGIN
-- Remove all records from the previous process.
TRUNCATE TABLE Utilities.dbo.BatchProcess
END
-- Create an Error handler table if not found.
IF NOT EXISTS(SELECT *
FROM Sys.Tables
WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[Errors]'))
BEGIN
CREATE TABLE [Utilities].[dbo].[Errors](
[Errors_PK] [INT] IDENTITY(1,1) NOT NULL,
[Error_No] [INT] NULL,
[ErrorMessage] [VARCHAR] (4000) NULL,
[ErrorLine] [VARCHAR] (50) NULL,
[Date_Time] [DATETIME] NULL,
) ON [PRIMARY]
END
--- Starting the time and that will be the first record in the BatchProcess table
INSERT INTO Utilities.dbo.BatchProcess
(RowCountTemp,BatchRowCount,DateTimeEntered,FirstRecord_PK,LastRecord_PK)
VALUES
(0,0,GETDATE(),@FirstRecord_PK,@LastRecord_PK)
USE Users
WHILE EXISTS (SELECT TOP 1 Users_PK FROM Users.dbo.Users WHERE Users_PK > @LastRecord_PK)
BEGIN
-- Set the values for the counters, the first and last record.
SELECT @FirstRecord_PK = @FirstRecord_PK + 1,
@LastRecord_PK = @FirstRecord_PK + @Increment
------------------------------------------------------------------------------
--- In here you place your code to process your records ----------------------
--- Make sure to add the range for the PKs as shown below --------------------
------------------------------------------------------------------------------
--- Insert records into the target table
INSERT INTO Users..Users(
[Users_PK],
[First],
[Last],
[StreetNumber],
[StreetName],
[City],
[State],
[Zip],
[Phone],
[Date_Uploaded]
)
SELECT
[Users_PK],
[First],
[Last],
[StreetNumber],
[StreetName],
[City],
[State],
[Zip],
[Phone],
SUBSTRING([Date_Uploaded],1,8) AS Date_Uploaded
FROM
Users.dbo.Users Ue
WHERE Ue.Users_PK BETWEEN @FirstRecord_PK AND @LastRecord_PK
ORDER BY [Ue].[Last],[Ue].[First]
------------------------------------------------------------------------------
-- Gets the number of rows that were processed per batch
SET @iCurrentRowCount = @@ROWCOUNT
-- Gets the number of rows for all batches that were processed so far
SET @iRowCount = @iRowcount + @iCurrentRowCount
-- Gets the number of rows per batch
SET @BatchRowCount = @iCurrentRowCount
-- Gets the first PK that was processed per batch
SET @FirstRec_PK = @FirstRecord_PK
-- Gets the next first PK for the next batch
SET @FirstRecord_PK = @LastRecord_PK
PRINT @iRowCount
--- Keep track of each process
INSERT INTO Utilities.dbo.BatchProcess
(RowCountTemp,BatchRowCount,DateTimeEntered,FirstRecord_PK,LastRecord_PK)
VALUES
(@iRowCount,@BatchRowCount,GETDATE(),@FirstRec_PK,@LastRecord_PK)
END
END TRY
-- Let's check for any error and if so let's log the error.
BEGIN CATCH
INSERT INTO [Utilities].[dbo].Errors
(
Error_No,ErrorMessage,ErrorLine,Date_Time
)
VALUES
(
ERROR_NUMBER(),ERROR_MESSAGE(),ERROR_LINE(),GETDATE()
)
SELECT
ERROR_NUMBER() AS ErrorNumber ,
ERROR_SEVERITY() AS ErrorSeverity ,
ERROR_STATE() AS ErrorState ,
ERROR_PROCEDURE() AS ErrorProcedure ,
ERROR_LINE() AS ErrorLine ,
ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
Whenever the above code is running you can run the below code
and see the status of the process:
SELECT * FROM Utilities.dbo.BatchProcess ORDER BY DateTimeEntered DESC
In the below image the time difference between rows 7 and 8 was 8
seconds, and in rows 1 to 2 it was 7 seconds, and so far 6,957,786
records were processed, and that batch was 994804 records. You can
see the range of PK that was processed as well.

Looking further towards the end of this process, then the
difference between rows 242 to 243 is 8 seconds as well. As you see
you can have a very good estimate of the time for the entire
process.
