
Wednesday, May 4, 2011

How many cubes

I am often surprised by how many cubes some BI developers create. It would appear that each time they have a requirement for a piece of information, they create a new cube.  Sometimes this is amplified because they are using denormalised summary tables, creating new ones whenever they need a bit more dimensionality or a new measure, leading to a plethora of overlapping data.

Since SQL Server 2005, it has been possible to put dozens of fact tables (cubes in the old days) into one cube, with a separate measure group for each fact table. This is a great feature. For example, Sales and Budget have slightly different dimensionality: Sales is by cashier and hour of day, whereas Budget does not have cashier, and its time granularity only goes down to week.  Having these in the same cube is great, since it makes querying sales against budget (by all the common dimensions) very easy. Also, if a user is only interested in one measure group, say Sales, they can point their browser at that measure group and they will only see the associated measures and dimensions, much as if they were browsing a separate cube.
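To make that concrete, once Sales and Budget are measure groups in the same cube, a single MDX query can put them side by side across their common dimensions. The sketch below is illustrative only; the cube, measure and dimension names are hypothetical, not from a real model.

    -- Hypothetical names, for illustration only:
    -- compare Sales against Budget by a shared Store dimension,
    -- sliced by a shared month.
    SELECT
        { [Measures].[Sales Amount], [Measures].[Budget Amount] } ON COLUMNS,
        [Store].[Store].MEMBERS ON ROWS
    FROM [Retail]
    WHERE ( [Date].[Calendar].[Month].&[201105] )

With separate Sales and Budget cubes, the same comparison would need two queries (or LookupCube) and some stitching together on the client side.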

You might ask, are there any performance implications of having one super cube? 
  • In terms of cube size and processing overhead, the design alternatives are comparable. 
  • MDX queries can be broken down into FE (formula engine) and SE (storage engine) components.   The SE component of a query isn't materially affected by unrelated dimensions, since the storage is comparable.  However, the FE component of a query is affected by the unrelated dimensions, as it works within the entire cube space (which is larger with more dimensions). 
There is a LookupCube function in MDX that allows you to retrieve a value from another cube.  This might be useful in a design that exploits multiple cubes.  But don't see it as a panacea for a multiple-cube design; it has its own performance overhead and is not as functional as having the measures in one cube.
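If you do go down the multiple-cube path, a calculated member along the following lines is the typical way LookupCube gets used. This is only a sketch: the cube and measure names are hypothetical, and the second argument is the expression (passed as a string) to be evaluated in the other cube.

    -- Hypothetical example: pull a Budget figure into a query on the Sales cube
    WITH MEMBER [Measures].[Budget (Lookup)] AS
        LookupCube("Budget", "[Measures].[Budget Amount]")
    SELECT
        { [Measures].[Sales Amount], [Measures].[Budget (Lookup)] } ON COLUMNS
    FROM [Sales]

Note that the lookup does not carry the current query context into the other cube unless you build the coordinates into the expression string yourself, which is part of why it is less functional than having both measure groups in one cube.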

There might also be operational considerations.  The design alternative I have discussed is having multiple cubes in the same database, but if the cubes really are unrelated, you might consider having separate SSAS databases for them.  This would give you more operational options, including parallel processing and even hosting the databases on separate servers.

So, as a cube designer, you must weigh up the performance, functional and maintenance costs.  As an oversimplified summary, on the side of mini cubes is the minimisation of FE performance costs, while on the side of super cubes is increased user functionality.

For another discussion on this topic, see Chris Webb's blog.

Wednesday, February 18, 2009

Space Considerations for Various OLAP Data Types

Data types can have huge space and performance implications for large SQL tables, and there are similar implications for large OLAP cubes. We all know that double-word numbers take up more space than single-word numbers, but what exactly is the difference?

I have performed a simple benchmark to compare the space used by the various data types. My tests involved:
  • A 44 million row fact table
  • 3 measures containing values between -99 and 1039.22
  • 5 dimensions
  • SQL Server 2008
  • Zero aggregations

I used six data types in the SQL Server tables. The SQL data types I employed were:

  • Money
  • Float
  • Decimal
  • Bigint
  • Integer
  • Real

For the Money data type, I created two cubes, since Analysis Services defaulted the Money data type to Double. I created one cube with the default Double data type, and another with the cube data type set to Currency.

The results show that the best performing data types (in terms of size) were:

  • Bigint and Integer (41% smaller than Double)
  • Currency (25% smaller than Double)
  • Single (3% smaller than Double)
  • Double

Note: sometimes the data type you choose is determined (or constrained) by the domain of the numbers you are aggregating. For example, if your numbers contain significant digits after the decimal point, then Bigint and Integer are inappropriate, unless you are happy to round them. Likewise, for imprecise numbers, Double gives you the most precision, which you might need. But even if the underlying SQL table uses a single or double precision data type, if the number is always an integer you might as well tell Analysis Services to use a Bigint and save on space.

It shouldn't surprise you that the binary numbers (Bigint and Integer) are the smallest on disk, nor that they took the same space as each other. Analysis Services compresses all the cells before writing to disk, and the binary numbers will tend to have lots of leading zeroes, which compress very well. Similarly, but to a lesser extent, Currency (really a packed decimal) compresses next best.

In case you are wondering why you would bother fiddling with data types when the cube is only a gigabyte or two in size and disk space is plentiful: if you can save 40% of your cube size, processing and querying will be noticeably faster. With a smaller cube, your cache will warm faster, and if the entire cube doesn't fit in RAM, a greater proportion of it will be held in RAM. That is, your queries will run much faster.

Richard

For real-time online OLAP demonstrations go to http://RichardLees.com.au/Sites/Demonstrations