Richard Lees on BI: Fact Table OLAP RDBMS Performance

Data types can have huge space and performance implications on large SQL tables, and there are similar implications for large OLAP cubes. We all know that double word numbers take up more space than single word numbers, but what exactly is the difference?

I have performed a simple benchmark to compare the space used of the various data types. My tests involved a

44 Million row fact table
3 measures containing values between -99 and 1039.22
5 dimensions
SQL Server 2008
Zero aggregations

I used 6 datatypes in SQL Server tables. The SQL data types I employed were

Money
Float
Decimal
Bigint
Integer
Real

For the Money data type, I created two cubes, since Analysis Services defaulted the Money data type to Double. I created one cube with the default Double data type, and another with the cube data type as Currency.

The results show that the best performing (in terms of size) were

Bigint and Integer (41% smaller than Double)
Currency (25% smaller than Double)
Single (3% smaller than Double)
Double

Note, Sometimes the data type you choose is determined (or constrained) by the domain of the numbers you are aggregating. For example, if your numbers contain significant digits after the decimal point, then Bigint and Integer are inappropriate, unless you are happy to round them. Likewise, for imprecise numbers, Double gives you the most precision, which you might need. But just because the underlying SQL table uses a single or double byte precision data type, if the number is always an integer, you might as well tell Analysis Services to use a Bigint and save on space.

It shouldn't surprise you that the binary numbers (Bigint and Integer) are the smallest on disk, nor that they took the same space as each other. Analysis Services compresses all the cells before writing to disk, and the binary numbers will tend to have lots of leading zeroes, which compress very well. Similarly, but to a lessor extent, Currency (really a packed decimal) compresses next best.

Just in case you are wondering why bother fiddling with data types when the cube is only a gigabyte or two in size, and disk space is plentiful. The reason is that if you can save 40% in your cube size, processing and querying will be noticeably faster. With a smaller cube, your cache will warm faster, and if the entire cube doesn't fit in RAM, you will have a greator proportion of your cube in RAM with a smaller cube. Ie. your queries will run much faster.

Richard

for real-time online OLAP demonstrations go to http://RichardLees.com.au/Sites/Demonstrations

Many of my customers have billions of rows in their fact tables, which means there cube processing time is, at least partly, dependent on how long it takes the relational database manager to scan the data from disk. With most rdbms' (eg SQL Server, Oracle, DB2 etc.) this ends up as how fast the disk subsystem can transfer the data. So one way of improving performance without changing hardware is to trim up the fact table. Naturally, incremental processing is useful for these very large fact tables, but everything will go faster with skinier fact tables.

Here are a few ways that you can trim the fat out of your fact tables

Dimensionalise data. For example, if a column contains a character string such as 'Successful' or 'Unsuccessful', then put those strings into a dimension table and simply store the numeric foreign key, such as 1 or 2. If this sort of opportunity exists, it will have the biggest performance impact, and it can actually make the ETL faster, if that is where you do the dimensionalisation. The ETL can be faster because the fact table being loaded by the rdbms is thinner and requires less IO.

Use appropriate and concise numeric and date-time data types. For example, if you are storing a number, put it in a numeric column and make it as small as possible without risking overflow. For example, if you are storing a number between 1 and 10, just use a tinyint (or equivalent in other rdbms). If you are storing an integer or decimal, then put the number in an integer or numeric column, not a floating point data type. Similarly for date and times.

Use VARCHAR() where appropriate. Don't use CHAR(50) where VARCHAR(50) will store the same information in less space. Also, when you are loading VARCHAR columns ensure that you are not padding the column with spaces. This can happen with some ETL tools.

Avoid VARCHAR(1). For a 1 byte column you are requiring 2 bytes for the length, which you already know is 1.

Avoid redundant data. That is any data that can be derived (calculated) from other columns in the same record. For example, if there is a column TransactionDateTime, there is no need for other columns for TransactionDate, TransactionTime, TransactionYear, TransactionDayOfWeek, TransactionDateKey etc. All of those columns can be derived from TransactionDateTime using functions such as DatePart(hh,TransactionDateTime)*100+DatePart(mi,TransactionDateTime) to get the 24 hour minute etc. I make a great use of these functions to get foreign keys into datetime dimension tables that might have time granularity down to 5 minute, 15 minute, 1 hour levels of granularity etc. Of course, I have designed the DateTime dimension table to have a primary key that is derived from the DateTime and not synthetic. So it is just a matter of creating a view over the table, which adds in the derived columns. I am not a great fan of synthetic primary keys when there is a natural primary key. A view with these functions tends to be much faster than a SELECT from a fat table with a materialised copy of the redundant columns. Simply due to a reduction in IO and putting less pressure on the rdbms data cache.

Most rdbms' have compression available for columns and data pages. SQL Server added a very good implementation this in SQL Server 2008. Since very large fact tables tend to be written once, it generally pays to have compression at the column and page level. With numeric data, I am experiencing about 20% - 30% compression ratios. The cpu cost is almost undetectable (even lower than cpu from read locks).

Keeping the fact tables as skinny as possible makes it much faster to process cubes with very large numbers of fact records. There are also ancilliary benefits such as having a smaller database to backup, overall rdbms performance is better with less data cache pressure, and less network traffic between your cube and data warehouse.

Without very skinny fact tables, I would not have been able to maintain the live weblogs and Perfmon demonstrations on http://RichardLees.com.au/Sites/Demonstrations The Perfmon rdbms is adding about 30,000 records/minute (500 inserts/second) and is only possible on my sata disk drives because the fact tables are skinny. All these inserts and continual cube processing is happening on two desktop PCs with single sata drives.

Richard Lees on BI

Wednesday, February 18, 2009

Space Considerations for Various OLAP Data Types

Saturday, October 25, 2008

Skinny Fact Tables are good

About Me

Followers