Wednesday, February 18, 2009

Space Considerations for Various OLAP Data Types

Data types can have huge space and performance implications on large SQL tables, and there are similar implications for large OLAP cubes. We all know that double word numbers take up more space than single word numbers, but what exactly is the difference?

I have performed a simple benchmark to compare the space used of the various data types. My tests involved a
  • 44 Million row fact table
  • 3 measures containing values between -99 and 1039.22
  • 5 dimensions
  • SQL Server 2008
  • Zero aggregations

I used 6 datatypes in SQL Server tables. The SQL data types I employed were

  • Money
  • Float
  • Decimal
  • Bigint
  • Integer
  • Real

For the Money data type, I created two cubes, since Analysis Services defaulted the Money data type to Double. I created one cube with the default Double data type, and another with the cube data type as Currency.

The results show that the best performing (in terms of size) were

  • Bigint and Integer (41% smaller than Double)
  • Currency (25% smaller than Double)
  • Single (3% smaller than Double)
  • Double

Note, Sometimes the data type you choose is determined (or constrained) by the domain of the numbers you are aggregating. For example, if your numbers contain significant digits after the decimal point, then Bigint and Integer are inappropriate, unless you are happy to round them. Likewise, for imprecise numbers, Double gives you the most precision, which you might need. But just because the underlying SQL table uses a single or double byte precision data type, if the number is always an integer, you might as well tell Analysis Services to use a Bigint and save on space.

It shouldn't surprise you that the binary numbers (Bigint and Integer) are the smallest on disk, nor that they took the same space as each other. Analysis Services compresses all the cells before writing to disk, and the binary numbers will tend to have lots of leading zeroes, which compress very well. Similarly, but to a lessor extent, Currency (really a packed decimal) compresses next best.

Just in case you are wondering why bother fiddling with data types when the cube is only a gigabyte or two in size, and disk space is plentiful. The reason is that if you can save 40% in your cube size, processing and querying will be noticeably faster. With a smaller cube, your cache will warm faster, and if the entire cube doesn't fit in RAM, you will have a greator proportion of your cube in RAM with a smaller cube. Ie. your queries will run much faster.

Richard

for real-time online OLAP demonstrations go to http://RichardLees.com.au/Sites/Demonstrations

5 comments:

chris said...

This is a very interesting post, Richard, thanks! Can I make a request for a follow-up where you run some tests to show what impact different data types actually have on query performance too? You could use ascmd or something to run a load test on the cube and show some averate query response times...

Chris

Richard Lees said...

Yes, that's a good idea and I will do that. The reason I tested cube size is that for very large cubes, the sheer size was an important determiner of query time. Queries on large cubes require a lot of IO until the cache is warm. The smaller the cube the less IO and faster the early queries.

chris said...

You'd also have to test performance with and without aggregations; having the right aggregation design would reduce the amount of IO necessary, even on a cold cache, and so reduce the impact of different data types

Unknown said...

I'm glad you chose to compare these, while also noting sometimes you are constrained to certain types by requirement. Some people would dismiss any "What is smaller?" as an apples to oranges comparison, but sometimes you are on the fence of choosing a more future proof datatype versus one that performs better. It is rarely black and white and it's good to have some benchmarks to make an informed decision.

Unknown said...

I'm glad you chose to compare these, while also noting sometimes you are constrained to certain types by requirement. Some people would dismiss any "What is smaller?" as an apples to oranges comparison, but sometimes you are on the fence of choosing a more future proof datatype versus one that performs better. It is rarely black and white and it's good to have some benchmarks to make an informed decision.