Saturday, October 25, 2008

Skinny Fact Tables are good

Many of my customers have billions of rows in their fact tables, which means their cube processing time is, at least partly, dependent on how long it takes the relational database manager to scan the data from disk. With most rdbms' (eg SQL Server, Oracle, DB2 etc.) this comes down to how fast the disk subsystem can transfer the data. So one way of improving performance without changing hardware is to trim down the fact table. Naturally, incremental processing is useful for these very large fact tables, but everything will go faster with skinnier fact tables.

Here are a few ways that you can trim the fat out of your fact tables:

  1. Dimensionalise data. For example, if a column contains a character string such as 'Successful' or 'Unsuccessful', then put those strings into a dimension table and simply store the numeric foreign key, such as 1 or 2. If this sort of opportunity exists, it will have the biggest performance impact, and it can actually make the ETL faster, if that is where you do the dimensionalisation. The ETL can be faster because the fact table being loaded by the rdbms is thinner and requires less IO.

  2. Use appropriate and concise numeric and date-time data types. If you are storing a number, put it in a numeric column and make it as small as possible without risking overflow. For example, a number between 1 and 10 only needs a tinyint (or the equivalent in other rdbms'). If you are storing an integer or decimal, put it in an integer or numeric column, not a floating point data type. The same applies to dates and times.

  3. Use VARCHAR() where appropriate. Don't use CHAR(50) where VARCHAR(50) will store the same information in less space. Also, when you are loading VARCHAR columns ensure that you are not padding the column with spaces. This can happen with some ETL tools.

  4. Avoid VARCHAR(1). A VARCHAR(1) column spends 2 bytes storing the length, which you already know is 1, so CHAR(1) holds the same data in a third of the space.

  5. Avoid redundant data. That is, any data that can be derived (calculated) from other columns in the same record. For example, if there is a column TransactionDateTime, there is no need for further columns such as TransactionDate, TransactionTime, TransactionYear, TransactionDayOfWeek, TransactionDateKey etc. All of those columns can be derived from TransactionDateTime using functions such as DatePart(hh,TransactionDateTime)*100+DatePart(mi,TransactionDateTime), which yields the time of day as a 24-hour hhmm value. I make great use of these functions to derive foreign keys into date-time dimension tables, which might have time granularity at the 5-minute, 15-minute or 1-hour level. Of course, I design the DateTime dimension table with a primary key that is derived from the date-time value, not a synthetic key; I am not a great fan of synthetic primary keys when a natural primary key exists. It is then just a matter of creating a view over the fact table, which adds in the derived columns. A view with these functions tends to be much faster than a SELECT from a fat table with a materialised copy of the redundant columns, simply because it reduces IO and puts less pressure on the rdbms data cache.

  6. Use compression. Most rdbms' have compression available for columns and data pages, and SQL Server added a very good implementation of this in SQL Server 2008. Since very large fact tables tend to be written once and read many times, it generally pays to have compression at the column and page level. With numeric data, I am seeing compression ratios of about 20%-30%, and the cpu cost is almost undetectable (even lower than the cpu cost of read locks).
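As a sketch of items 1 and 2 above (all table and column names here are hypothetical), the status strings move into a small dimension keyed by a tinyint, and the fact table carries only concise numeric types:

```sql
-- Hypothetical dimension for the 'Successful'/'Unsuccessful' strings (item 1).
CREATE TABLE DimOutcome (
    OutcomeKey   tinyint     NOT NULL PRIMARY KEY,  -- 1 byte instead of a 10+ byte string
    OutcomeName  varchar(20) NOT NULL
);
INSERT INTO DimOutcome (OutcomeKey, OutcomeName)
VALUES (1, 'Successful'), (2, 'Unsuccessful');

-- Skinny fact table: the smallest types that cannot overflow (item 2).
CREATE TABLE FactTransaction (
    TransactionDateTime  datetime     NOT NULL,
    OutcomeKey           tinyint      NOT NULL REFERENCES DimOutcome (OutcomeKey),
    Quantity             smallint     NOT NULL,  -- known to stay well inside +/-32,767
    Amount               numeric(9,2) NOT NULL   -- exact decimal, not float
);
```

Doing the string-to-key lookup in the ETL, as the post suggests, means the rows written to FactTransaction are already thin, so the load itself does less IO.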
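Item 5's derived columns can live in a view rather than on disk. This is a sketch, assuming the hypothetical FactTransaction table has a single TransactionDateTime column and the date and time dimensions use natural keys (yyyymmdd integers and hhmm values):

```sql
-- View that adds the derived date/time columns at query time,
-- so the fact table on disk stays skinny.
CREATE VIEW vFactTransaction AS
SELECT  TransactionDateTime,
        -- Natural yyyymmdd integer key into the date dimension.
        DatePart(yyyy, TransactionDateTime) * 10000
          + DatePart(mm, TransactionDateTime) * 100
          + DatePart(dd, TransactionDateTime)             AS TransactionDateKey,
        -- 24-hour hhmm value, e.g. 1305 for 1:05pm.
        DatePart(hh, TransactionDateTime) * 100
          + DatePart(mi, TransactionDateTime)             AS TransactionTimeHHMM,
        -- Key into a 15-minute-grain time dimension (minutes 0, 15, 30, 45).
        DatePart(hh, TransactionDateTime) * 100
          + (DatePart(mi, TransactionDateTime) / 15) * 15 AS TransactionTime15MinKey,
        DatePart(dw, TransactionDateTime)                 AS TransactionDayOfWeek
FROM    FactTransaction;
```

The cube's data source view can then bind to vFactTransaction, while the table underneath holds only the one datetime column.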
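For item 6, SQL Server 2008's compression is enabled per table or index; the schema and table names here are illustrative:

```sql
-- Estimate the saving first (available from SQL Server 2008).
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'FactTransaction',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';

-- Rebuild the fact table with page compression (which includes row compression).
ALTER TABLE dbo.FactTransaction REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Page compression suits write-once fact tables because the compression cost is paid mostly at load time, while every subsequent scan reads fewer pages.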

Keeping the fact tables as skinny as possible makes it much faster to process cubes with very large numbers of fact records. There are also ancillary benefits: a smaller database to back up, better overall rdbms performance from reduced data cache pressure, and less network traffic between your cube and data warehouse.

Without very skinny fact tables, I would not have been able to maintain the live weblog and Perfmon demonstrations. The Perfmon rdbms is adding about 30,000 records/minute (500 inserts/second), which is only possible on my sata disk drives because the fact tables are skinny. All these inserts, and the continual cube processing, are happening on two desktop PCs with single sata drives.
