The idea is to perform data cleansing on the Date and Time field
from different formats into a unify format such as: YYYYMMDD. Here
is a sample data to be used for the data cleansing that I will
perform below.
CREATE TABLE #T
(
D1 [VARCHAR] (20) NULL
)
INSERT INTO #T VALUES('5/15/2006 18:11')
INSERT INTO #T VALUES('5/13/2006 17:24')
INSERT INTO #T VALUES('2007-02-07 11:38:07')
INSERT INTO #T VALUES('9/4/2008 6:55 19a')
INSERT INTO #T VALUES('5/11/2006 0:47:52')
INSERT INTO #T VALUES('2007-01-04 12:43:00')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES(' 6/11/2006 0:47:52')
INSERT INTO #T VALUES('9/22/2002 07:20:07')
INSERT INTO #T VALUES('09/22/2002 07:20:07')
INSERT INTO #T VALUES('4/13/2007')
INSERT INTO #T VALUES('2/1/2007 20:23')
INSERT INTO #T VALUES('04-21-2011 18:47:42')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2/1/2007 11:44')
INSERT INTO #T VALUES('5/9/2007 9:22')
INSERT INTO #T VALUES('04/13/2007')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES('2004-04-21 16:00:00')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('2010-04-21 16:00:00')
INSERT INTO #T VALUES('0000-04-21 16:00:00')
INSERT INTO #T VALUES('2011')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('1700')
INSERT INTO #T VALUES('1800')
INSERT INTO #T VALUES('Nov 2 2010')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('10/5/1941 8/2/2005')
INSERT INTO #T VALUES('30.09.2002')
INSERT INTO #T VALUES('05:05:06' )
ALTER TABLE #T ADD [D2] VARCHAR(20)
ALTER TABLE #T ADD [Update_Type] TinyInt
SELECT * FROM #T
The reason that I am using string manipulation instead of the
CONVERT() function is: Since the table could have millions of dates
that could not be converted using the CONVERT() function and
therefore string manipulation is required. Even though most records
could use the CONVERT() function, I will still would rather use the
string manipulation strategy since in one transaction I convert all
dates whether they can use the CONVERT() or not. See examples for
these dates:
SELECT CONVERT(VARCHAR(15),CAST('30.09.2002' AS DATE),112)
The ISDATE() function will return 0 but this date could be
converted as 20090930 and I could have 5 millions of these
dates.
If you ran the above code you will get the below error:
Msg 241, Level 16, State 1, Line 1 Conversion failed when
converting date and/or time from character string.
Here is another example of date that should be using string
manipulation:
SELECT CONVERT(VARCHAR(15),CAST('1/27/2007 6:55 17a' AS DATE),112)
The actual date could be 20070127 but ISDATE() returns 0 as seems
below.
SELECT ISDATE('1/27/2007 6:55 17a')
The following are 4 patterns that I will use for this article:
Each different pattern will be converted to this format as:
YYYYMMDD. So the date that appears as this one 9/19/2009 will be
shown as this one: 20090919. The above is a realistic data that I
need to deal when import data and then perform data cleaning using
MS SQL scripts.
1. 9/19/2009
2. 1/1/2009
3. 09/25/2010
4. 2009-01-09
Extracting this format of date: 9/19/2009
SELECT
D1
FROM
#T
WHERE D1 <> '' AND D1 LIKE '%[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%'
Adding the left wild card will bring un desirable results of
dates as shown below with different formats. You need to use only
the right wild card so the requested pattern will show up as
needed. This way you can perform the string manipulation without
any surprises. As a rule of thumb I also make sure all data is
trimmed before starting working on it. (For the import I use
SSIS).
--5/15/2006 18:11
--5/13/2006 17:24
--5/11/2006 0:47:52
-- 6/11/2006 0:47:52
--9/22/2002 07:20:07
--09/22/2002 07:20:07
--4/13/2007
--04/13/2007
1. When running the below code for this format: 9/19/2009 you
get the right data.
SELECT
D1
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%'
2. Extracting this format of this kind of date: 1/1/2009
SELECT
D1
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%'
Running the above code with the right wild card guarantees the
right format as shown below.
--9/4/2008 00:38:45
--2/1/2007 20:23
--2/1/2007 11:44
--5/9/2007 9:22
3. Extracting this format of date: 09/25/2010
SELECT
D1
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%'
The below data shows up
09/22/2002 07:20:07
04/13/2007
4. Handle this format: 2009-01-09
SELECT
D1
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%'
Running the above query will bring the below records:
--2007-02-07 11:38:07
--2007-01-04 12:43:00
--2004-04-21 16:00:00
Every pattern I will test as shown below to make sure I am
getting the right data. In real table I will have a Clustered index
on the PK and will add WHERE clause like this one: User_PK BETWEEN
1 AND 10000 and this way the data comes much quicker.
SELECT
D1,
REPLACE(LEFT(D1,10),'-','') AS ResultDate
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%'
Running the above query will bring the below records:
--2007-02-07 11:38:07 20070207
--2007-01-04 12:43:00 20070104
--2004-04-21 16:00:00 20040421
For the table that you perform the data cleansing, the idea is
to create another column that all updates will be implemented
there. This way I can compare and see if the conversion was done
correct. The Tiny column that also was added, I like to use it as a
flag for the different type of conversions. In reality there could
be at least 20 types of conversions that could be needed for the
full conversion for the date field. See below code that will
convert all different types of dates into YYYYMMDD in one
script.
UPDATE #T
SET D2 = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN
SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/','') -- 9/19/2009
WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN
SUBSTRING(D1,5,4) + '0' + LEFT(D1,1) + '0' + SUBSTRING(D1,3,1) -- 9/9/2009
WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN
SUBSTRING(D1,7,4) + REPLACE(LEFT(D1,5),'/','') -- 09/19/2009
WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN
REPLACE(LEFT(D1,10),'-','') -- 2009-01-09
ELSE D2 END,
Update_Type = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 1 -- 9/19/2009
WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN 2 -- 9/9/2009
WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 3 -- 09/19/2009
WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN 4 -- 2009-01-09
ELSE D2 END
FROM
#T
WHERE D1 <> ''
SELECT * FROM #T
Running the above SELECT statement shows the updated columns as
shown in the below:
--D1 D2 Update_Type
--5/15/2006 18:11 20060515 1
--5/13/2006 17:24 20060513 1
--2007-02-07 11:38:07 20070207 4
--9/4/2008 00:38:45 20080904 2
--5/11/2006 0:47:52 20060511 1
--2007-01-04 12:43:00 20070104 4
-- NULL NULL
-- 6/11/2006 0:47:52 NULL NULL
--9/22/2002 07:20:07 20020922 1
--09/22/2002 07:20:07 20020922 3
--4/13/2007 20070413 1
--2/1/2007 20:23 20070201 2
--04-21-2011 18:47:42 NULL NULL
--04-21-2008 17:11:00 NULL NULL
--2/1/2007 11:44 20070201 2
--5/9/2007 9:22 20070509 2
--04/13/2007 20070413 3
-- NULL NULL
--2004-04-21 16:00:00 20040421 4
--04-21-2008 17:11:00 NULL NULL
The entire update technique should be used within a batch and
please see this article I wrote about it and how to implement
it:
/computer-consulting-blog/2011/1/14/processing-hundreds-of-millions-records-got-much-easier.aspx
With this technique it keeps the log file very small
If you prefer to run each process one at the time then you can
use the code as shown below for this format: 9/19/2009
UPDATE #T
SET D2 = SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/',''),
Update_Type = 1
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%'
I also tried to test the PATINDEX() function instead of the LIKE
operator as shown below. In this case I actually compared the
PATINEX() function to the LIKE operator. Running in a 300 million
records table, the LIKE was always faster in most cases at least in
1 second. The LIKE is getting to be by far very slow when there are
only small amount of records such as: 14030 records found for a
specific pattern in big tables as mentioned above.
Running the same pattern using the PATINDEX() function took only
10 seconds while using the LIKE took 1 min and 9 seconds. Usually I
ran the COUNT() to get the idea how many records to be processed
for each pattern. In most cases I will use the LIKE operator for
this kind of data cleansing. Make sure to create an index on the
field that you process data on it. Remember to create the same type
of field for UPDATE purpose. This one took only 10 seconds to count
14030 records out of 300 million records.
SELECT -- 10 seconds
COUNT(D1)
FROM
#T
WHERE PATINDEX('[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%',D1)=1 AND D1 <>''
This one took 1 min and 9 seconds for the same table.
SELECT
COUNT(D1)
FROM
#T
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%'
----------------------------------------------------------
------------------------ Conclusion: ------------------
----------------------------------------------------------
I covered some steps I take for different types of the data
conversion and in the process I realized that the PATINDEX() could
be much faster versus the LIKE operator for some conditions. There
could be different types of data that cannot be converted and must
be ignored. That's why a second field is nice to have, since that
will be the final field for the production database.
There are many dates that have different patterns and could be
legit ones for conversion but will fail the CONVERT() function.
Only string manipulation provides the ultimate control and minimize
any errors of date conversion.