Sunday, September 11, 2011

Why are there less data in the cube than the fact table?

Sereral customer have asked me why they have less data in their cube than they have in the fact table.  Essentially, when they view the cube Measures.NetSales does not equal Sum(NetSales) from the fact table.  So here are the some possible causes, that I would investigate in such circumstances.
  1. The measure datatype cannot hold the aggregated number.  For example, the fact table might have an integer datatype for the SalesQuantity column, which might be fine for the SQL table.  However, by default, SSAS will give the SalesQuantity measure the same datatype, but now it must be able to contain the sum(SalesQuantity).  If sum(SalesQuantity) is greater than 2^31 SSAS will not display any errors, and the Measures.SalesQuantity will show numbers that appear to wrap around.  So the numbers are much smaller than expected (almost random) at an aggregated level.  The solution is simple, just change the datatype in the cube to a much larger number, for example bigint, which can be up to 2^63.  One of the good things about using bigint over int, is that the cube is almost exactly the same size, as SSAS will compress the leading zeroes in binary numbers very effectively.  If you need numbers larger than 2^63, you might need to use a double datatype, which, although potentially much larger is an imprecise number.  Also, a double datatype measure, will typically make the cube significantly larger than a binary measure containing the same information.
  2. There are missing members on one or more dimensions and missing members are not enabled.  The missing member facts will not be included in the cube.   I personally believe the cube should contain all the facts in the fact able.  Even if there is a missing member, the data should still be available.  For example, if there are sales for a ProductId that does not exist, typically, you will still want to know about the sales even though you cannot say which product it was for.  I prefer to enable missing members, and just hide them if you must.  It is also a good practice to regularly check for missing members, either in the DW or in the cube (through missing members).
  3. The third situation where the cube numbers are light is an unusual one.  It is where the fact table (actually a view) returns an arithmetic error such as divide by zero.  Most times SSAS will sympathetically return the SQL error, and you will know there is a problem during cube processing.  However, sometimes SSAS does not sympathetically return the SQL error, rather, it returns a successful cube process with the rows up until the divide by zero.  There are two ways of locating this error.  Firstly, you could simply execute the same SQL query in SQLEM window and wait for the full resultset or the arithmetic error to return.  This is not practical in many circumstances due to the number of rows in the resultset.  I guess you could get SSIS to help you, but an easier way is to use SQL Profiler and trace Error events.
  4. You have a referenced dimension and the INNER JOIN clause that SSAS is generating for SQL Server is removing fact records that don't join with the intermediate dimension.  To identify this issue, you need to execute the SQL generated by SSAS, and compare its counts with the raw fact table.
  5. Is your cube (measure group) processed incrementally?   If so, there might be a logic issue with the way you are incrementally processing.  To verify, you could reprocess the entire measure group and see if the numbers now equal the fact table.
  6. Is the data in your fact table changing?  Of course, if the fact table is changing, the MOLAP cube won't know about it.  This can be related to incremental processing, where the incremental processing logic is good, but historic data records are changing.  Strictly speaking you cannot incrementally process updates, only new records.  If the data has changed, the cube (partition) will require processing.
  7. Have you a default member that is not the [All] mmeber, or do you have a dimension without an [All] member.  In these cases, you would need to take this into account when comparing Measures.NetSales with sum(NetSales).
  8. The Calculate statement in the cube has been commented out.
I hope this help you identify reconciliation issues between the cube and fact table.  A couple of suggestions when reconciling such issues
  1. Use physical measures in the cube.  Even if they are hidden, you can still query the physical measures.
  2. Ideally, start off by reconciling record counts.  If you can, put a record count in the cube.
  3. When you are running SQL queries against the fact table, use the SQL statement that is generated by SSAS, since it might be referencing a different table/view than you expected.  Also, as in point 4 above, it might be performing an INNER JOIN with an intermediate dimension that has missing members.


Barry McConnell said...

The more common reason I've seen is simply that the context of both queries is not the same. Be sure to verify the default values used for every dimension not specified in the mdx query. Be especially watchful of non-aggregated dimensions.

Richard Lees said...

Good point Barry. I have added a point on non [All] default members or non [All] dimension hierarchies.

Greg Galloway said...

Regarding your point #1, using BIDS Helper Measure Group Health Check is a quick way to identify data type overflow issues and correct it:

Marco Jansen said...

What's the best way to keep the fact data in case of point 4(a referenced dimension...)?

Richard Lees said...

Marco, it's just a matter of having referential integrity between the fact table and dimension. Ensure that all foreign keys have values (not null) and that they exist in the dimension. You don't have to have declarative RI, you can just run some sql scripts to insert an "unknown" dimension member and update invalid foreign keys to that member's key.

prav said...

Richard, I have seen this scenario today where my cube records are less than fact and everypoint you mentioned is very helpful and I found RI is the main issue in my case and solved. A big thank you - Prav

Timothy Tan said...

Check the NULL Processing type is set to 'Unknown Member' rather than 'Automatic' under 'Advance' in 'Define Relationship' of a relationship in the 'Dimension Usage' tab.

You can use BIDS Helper 'Printer Friendly Dimension Usage' in the 'Dimension Usage' tab to check which relationship is automatic (A) or Unknown Member (U).

Weaning baby Food said...

In my case, albeit rarer, I had a second Date Dimension causing issues when related to an Exchange Rate Measure Group. With both Dimensions related directly to both Measure Groups, I was only returned rows where both dates in the fact were the same. Fix was to set up a M2M relationship between the second dimension and Exchange rate measure group via the main measure group.

My point being, double check your dimension relationships across multiple Measure groups.