Monday, February 15, 2010

SSIS Slowly Changing Dimension SCD Wizard is Too Slow

Pre SQL Server 2005, we had to write our own scripts for managing slowly changing dimensions (aka Type 2 dimensions). Since SQL Server 2005 we now have an easy to use SSIS wizard to maintain SCD. Unfortunately, the SSIS wizard generates an OLEDB update statement that is executed once for every record. Naturally the iterative statement is very slow with large volumes.


Hence if you have large volumes of data and need to have high performing SCD updates, you will need to use a batch update statement. This is not all that difficult. I would even suggest that if you can't write the SQL, you might not the be the best person to design a DW.

Essentially, you need to "close off" the records that are getting replaced, and "insert" all new records. The "closing" of records can be done in a single UPDATE statement as shown in the query below. Note, the exact script you need will depend on your SCD and non-SCD attributes and the format of your source data. But essentially, you are updating all records in the DW where there exists a newer record. If you have some Type 1 attributes, will need another update statement for them.

The second task is simply to insert the new (and updated) records. You should do this in an SSIS data flow task. The select statement in the image below, is an example of what the query will look like to feed that load. Note, your query will be a little different depending on how you identify changes, but the query below will give you a starting template.

By the way, if you really don't want to write the above queries (or prefer the elegance of a component) there are 3rd party SSIS components that manage SCD with set based operations.







Of course, I am sure a future version of SQL Server will have an SCD wizard that works with batch update commands. Just not sure which one.

3 comments:

Drew said...

Richard,

I'm curious if you have tried the Kimball SCD component: http://kimballscd.codeplex.com/

I have found that it performs significantly better than the SCD component that comes with SSIS 2008. It's still not as fast as straight SQL can be, but deninitely much better.

Regards,
Drew

Richard Lees said...

Kimballs SCD component is quite well known. I haven't tested it. Of course it is faster than the SCD wizard, but it couldn't be faster writing your own queries. However, if you had a lot of SCDs, it might make your packages simpler.

prav said...

Hi Richard,

It is a great post and very useful for me. We are using MERGE statement for SCD1 and SCD2 dimensions.

Thanks
prav