Imagine a scenario where we have 2,000,000 rows to insert into a 50 billion row table. We want to do it fast, but some of the rows may already have been inserted (so we want to ignore them) and other rows might have data out of range and are truly rejects. If we simply implement a Data Flow with rejects going to an error table, good records will end up among the rejects, because every record in the failing batch is redirected to the error destination.
So what we can do is take the error output and perform a lookup. If the PK already exists, we can ignore the record. If the PK doesn't exist, we try to insert it into the destination table in a second attempt. This time we set the commit size to 1, so that any record that errors will not take good records with it. The rows in the error output from this second attempt are truly bad records: they don't already exist and they have caused an error on their own. During most executions there won't be duplicate records or errors, so extremely fast loading is achieved. When there are errors, they are handled efficiently.
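To see why a single bad row takes good rows with it, here is a minimal T-SQL sketch (the table and values are made up for illustration and are not part of the package): a multi-row insert containing one out-of-range value fails as a whole, just as a fast-load batch containing one bad record is redirected to the error output in its entirety.

    CREATE TABLE dbo.Demo
    (
        Id    INT     NOT NULL PRIMARY KEY,
        Score TINYINT NOT NULL            -- valid range is 0 to 255
    );

    INSERT dbo.Demo (Id, Score)
    VALUES (1, 10),
           (2, 300),   -- out of range for TINYINT: the whole statement fails...
           (3, 20);    -- ...and the good rows 1 and 3 are rolled back with it

    SELECT COUNT(*) AS RowsLoaded FROM dbo.Demo;   -- returns 0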
Here is an example Data Flow from SQL Server 2012. The input data has 2,000,001 records, one of which already exists in the Data Warehouse and can be ignored. Notice that 2,000 rows are rejected from the Data Warehouse; this is the batch size. There is one duplicate record, and the other 1,999 rows in that batch are valid. The Lookup task finds the match and sends that row to a Row Count (for logging purposes), while the remaining 1,999 rows are loaded into the Data Warehouse successfully. The second insert attempt is done with a batch size of 1. If there were any errors in this second attempt, they would be redirected to the rejects table. Generally the rejects table has much looser data constraints, so those inserts don't error.
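As a hedged illustration of what such a loosely-constrained rejects table might look like (the names and types here are hypothetical, not taken from the package), every column is nullable and typed generously, and the standard SSIS error output columns are kept alongside the data:

    CREATE TABLE dbo.Rejects
    (
        SalesKey     VARCHAR(50)    NULL,   -- kept as text so malformed keys still fit
        Amount       VARCHAR(50)    NULL,   -- out-of-range values can still be stored
        ErrorCode    INT            NULL,   -- SSIS error output column
        ErrorColumn  INT            NULL,   -- SSIS error output column
        ErrorMessage NVARCHAR(4000) NULL,
        LoadDate     DATETIME2      NULL DEFAULT SYSDATETIME()
    );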
These are the connection properties on the Data Warehouse destination (first attempt). Notice that the Maximum insert commit size is 2000.
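For readers who think in T-SQL rather than SSIS properties, a rough analogue of this setting (the table and file names below are made up) is the BATCHSIZE option of BULK INSERT: each 2,000-row batch commits on its own, and a failing batch rolls back only its own rows. This is only an analogy, not what the OLE DB Destination actually executes.

    BULK INSERT dbo.FactSales
    FROM 'C:\staging\sales.dat'
    WITH (
        TABLOCK,           -- comparable to the Table lock option on the destination
        BATCHSIZE = 2000   -- comparable to Maximum insert commit size = 2000
    );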
This is the Lookup component that checks whether the PK already exists. The lookup runs without a cache, because the destination table contains billions of rows and the package will generally be looking up few, if any, rows. A full cache would be extremely expensive on such a large destination table; don't do that. A partial cache is probably of no use either, since the lookups will tend to be for different PKs every time.
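Conceptually, a Lookup with no cache issues a parameterized query against the destination for each incoming row, something like the sketch below (hypothetical table and column names):

    SELECT SalesKey
    FROM   dbo.FactSales
    WHERE  SalesKey = ?;   -- parameter mapped from the PK column of the error-output row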
These are the connection properties on the Data Warehouse destination (second attempt). Notice that the Maximum insert commit size is 1, so any error will only affect one record.
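In T-SQL terms, the second attempt behaves roughly like the row-by-row sketch below (the #Retry, dbo.FactSales and dbo.Rejects names are hypothetical): each row gets its own insert statement, so a failing row is diverted to the rejects table on its own without taking anything else with it.

    DECLARE @SalesKey INT, @Amount DECIMAL(18,2);

    DECLARE retry_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT SalesKey, Amount
        FROM   #Retry;                 -- rows that failed the first attempt and
                                       -- whose PK was not found by the Lookup
    OPEN retry_cur;
    FETCH NEXT FROM retry_cur INTO @SalesKey, @Amount;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            -- one row per statement: an error affects only this record
            INSERT dbo.FactSales (SalesKey, Amount)
            VALUES (@SalesKey, @Amount);
        END TRY
        BEGIN CATCH
            -- a truly bad record: capture it together with the error details
            INSERT dbo.Rejects (SalesKey, Amount, ErrorCode, ErrorMessage)
            VALUES (@SalesKey, @Amount, ERROR_NUMBER(), ERROR_MESSAGE());
        END CATCH;

        FETCH NEXT FROM retry_cur INTO @SalesKey, @Amount;
    END;

    CLOSE retry_cur;
    DEALLOCATE retry_cur;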
As I mentioned earlier, there are many ways to perform a fast load. The main takeaway from this post is that an error record will spoil the whole batch, so you have to deal with that. The mechanism above essentially retries loading records from the spoilt batches with a commit size of 1, so the residual errors are truly errors. I have seen developers do something similar by performing checks before the load. You should determine what is appropriate for your application, keeping in mind the behaviour of error batches.