Friday, October 3, 2008

An Alternative to a physical Distinct Count Measure

Many SQL Server OLAP developers learn the hard way that a DistinctCount measure can be expensive on performance - in query time, processing time and cube size. A Distinct Count measure might be appropriate for your cube, but you should be aware that every cell, and every aggregation will need to contain all the distinct values at that intersection. That's why a physical DistinctCount measure can be so expensive. It's also why there is a lot of literature on how to optimise the performance of a Distinct Count measure.

That is not to say that you shouldn't use DistinctCount, it can work well and be extremely useful. However, you should be aware that there is an alternative, which has its own pros and cons. The alternative is to create a dimension on the key if you haven't got one already, then simply create a Calculated Member that references the dimension, I.E

count(nonempty([DimName].[HierarchyName].[LevelName].members,[Measures].[MyMeasure]))

The advantage of this alternative is that the cost of the distinct count is only experienced at the time of querying. Cube processing and other queries will not be impacted by the calculated member. This is unlike a distinct count, which is why we often put a Distinct Count measure in its own measure group. You can see a demonstration of the calculated member version of distinct count on http://RichardLees.com.au/sites/Demonstrations/Pages/TS_IISLog.aspx The Distinct Resources and Distinct ClientIPs are both calculated members and perform very well.

When I was helping a developer with a performance problem and suggested this some time ago, the developer informed me that my alternative would never work on his cube because there were over 10 Million distinct values and the cost of building a dimension that big would be prohibitive. That showed me his lack of understanding of Distinct Count, since his cube with Distinct Count would have the 10 Million values (and subsets of the 10 Million) at every cell and aggregation, so the 10 Million values would be repeated throughout the cube.

By the way, I am always telling Analysis Services cube designers that they should hone their MDX skills. The better your MDX skills the better cubes you will design. This is very similar to a relational database. You wouldn't expect someone to design a high performing relational database if they weren't masters of SQL.

43 comments:

STJOHNS_1994 said...

can you explain more about the calculated member witha n example, like Unique users for a website so so

Thanks

Richard Lees said...

Sure, here is an example, which is counting the distinct number of client IP addresses. Note, this isn't distinct users, but with anonymous web sites, it's a little difficult to know distinct users, but that's another issue.

with member Measures.DistinctIPs as
count(nonempty([Client Host].[Client IP Address].[Full IP Address]*[Measures].[Hits]))
select measures.distinctIPs on 0,
[Date].[Year Month Day].[Month] on 1
from EasternMining

Richard
http://RichardLees.com.au

STJOHNS_1994 said...

Hi,

Thank you for the example.

So what will be the value of [Measure].[Hits] be in the facttable.

lets say for an IP adress(192.168.1.1) there are 5 rows in the fact table means 5 hits .
how will the expression count it as only one distinct when we use count function.


Thanks

Richard Lees said...

Measures.Hits, is just a physical measure holding the number of web requests. The value in it does not matter, all we want to do is count the number if IP addresses that have a non null value in Measures.Hits. So in your example, where there are 5 "Hits" for a particular IPAddress, that will count 1. That is the idea of a distinct count. We want to know how many distinct IPAddresses there are at a particular coordinate in the cube. So, if someone was asking for Distinct IP addresses for 2008,December 12, they will count the number of IP addresses that had some hits on December 12.

Richard
http://RichardLees.com.au

STJOHNS_1994 said...

Hi Richard,

Thanks for the quick reply.

Bellow is is my fact table its fact less fact every hit gets logged as a row
C_Ip_Adress ,Dateid , time_id, hit
198.168.1.0 20081212 10 1
198.168.1.0 20081212 11 1
198.168.1.0 20081212 12 1
198.168.1.0 20081212 13 1

My Dimension table contains all the IpAdress
Ipaddress
198.168.1.0)

When I use Distinct Count calculated measure for the day(20081212 it will give Count as 1

Ilan VBI said...

Looks Interesting! I have a very large cube (16 millions hits per day) What would you say will be the largest user dimension that can be used? for example Will it work on 2 millions distinct members?

Richard Lees said...

Yes, it could be used on a 2 Million row dimension, although you will have slower query times. The first query might take several seconds, say 10-30 depending on the hardware. Subseqeunt queries will tend to be much faster since AS 2005 and AS 2008 cache intermediate resultsets.

dhills said...

Thanks for the code. It's helpful. I'm trying to alter it so that null values are not included in the distinct count. Any suggestions? Thanks.

Richard Lees said...

the NonEmpty() function should remove nulls. Remember the second parameter to NonEmpty() is the measure that you want to use to determine if it's empty (Null).

lalit said...

Considered a monster dimesion with 10 million member, what would be the approx difference in query time between two approaches(DistinctCount vs alternate by you).

Richard Lees said...

Hi Lalit,
It is really really difficult to predict the performance of the AS Distinct Count measure in a cube with 10 Million distinct values. This is because the size of the cube will vary widely depending on the "cube complexity". And performance will be hugely affected by how much of it is in cache. It is best that you understand that AS must store the actual distinct values at each cell and aggregation.

Regarding the calculated Distinct Count measure, the response time is more predictable, since for each query it will need to scan through the 10 Million (or filtered list) members. Having said that, I have noticed that AS appears to use the results cache to make subsequent requests much faster. If you take the caching effect out of it, the time required for the calculated measure will be approximately be a linear function of the number of members to scan. You can see an example of the calculated measure (real-time) on http://richardlees.com.au:8080/iislog/tsIISLog.aspx?Column=[Date].[Year%20Month%20Day]&Row=[Client%20Host].[Client%20Geography]&Measure=Measures.[Distinct%20ClientIPs]&ColLevel=1&RowLevel=1
Since this cube has 31,000 distinct members, I would expect the same query on your cube to be about 10,000,000/30,000 times slower!

Hope that helps.

mat said...

Hi Richard,

Thanks for this interesting post.

But could you more explicit on the cases where a MDX will be preferable to a native Distinct Count ?
Is it only a matter of processing time ?

I made a quick test on one of my cubes, having a distinct count implemented. The cube is 50M rows, the distinct count dimension is around 450K rows.

With a medium-heavy query, the native distinct count returns result in about 2 sec (cold cache). Using your MDX, the query is about 1mn30 (cold cache too).
Without clearing the cache, results are similar.

What has to be done to achieve similar query performance ?

Thanks

mat said...

Following my previous comment, I add that using that form of MDX :

count(nonempty([DimName].[HierarchyName].[LevelName].members , [Measures].[MyMeasure]))

is 1/ much more recommended , and 2/ brings down to 10 sec (cold cache) the query time above !

Richard Lees said...

Good point Mat.
Regarding whether a physical distinct count measure or a calculated measure is better, really depends on your cube and your query activity. You are right to try both with a performance test. However, I wouldn't just to the conclusion that physical is always better than calculated. It is good to know how the two work. DC must keep a list of all the distinct values at every cell and aggregation. This can be enormous for a very complex cube with high distinct counts. Also, performing the test on a cold cache does not necessarily indicate the query cost in a warm environment. Perhaps a better test would be to have a cold cache and run a series of queries that are indicative of production.

One thing I really like about the calcuated distinct count is that it has no overhead on the cube processing (or querying) until you actually run the query. So it can be good to make available, even though it might be used sparingly.

In summary, like many features, there is no one "best practice", it depends on what your needs are.

Dan said...

Thank you for this post. This is much faster than the CrossJoin syntax that I found on MSDN.

Robert said...

I have read this post and have worked with others on testing the distinct count calculation. The issue that I continue to run into is when I try to distinct over a set of dimensions. If I take a Distinct Count of Age Range both the Physical Distinct Count and Measure Distinct Count are about the same. However, when I take a Distinct Count of Age Range with ZipCode the Measure Distinct Count performs 40-50% worse.

Any insight on this would be very helpful.

Richard Lees said...

That's a good point. A physical distinct count measure has advantages as does the calculated measure. You are correct in trying it in the way your users would use it and using the one that would provide the best overall performance to your users.

Sam Kane said...

Here are this and some other articles on SSAS Distinct Count:

http://ssas-wiki.com/w/Articles#Distinct_Count

Ben said...

Hi Richard, I'm using this method, i've added it as a measure to my cube.

CREATE MEMBER CURRENTCUBE.[Measures].[StoreCount]
AS count(nonempty([Retail Store].[Store].members,[Measures].[Value])),
FORMAT_STRING = "#,#",
VISIBLE = 1 ;



When i browse by [Retail Store].[Store] it's all good, but if i look at [Retail Store].[City] or any other hierarchy or attribute this measure shows the top level "All" number repeated for each member of the dimension.
Is there something wrong with my dimension? Or should this method only be used in fixed reports?


The other method i have seen is like this:



CREATE MEMBER CURRENTCUBE.[MEASURES].[StoreCount]
AS NULL,
FORMAT_STRING = "#,#",
SOLVE_ORDER = 1,
VISIBLE = 1;


SCOPE([MEASURES].[StoreCount]);
THIS = IIF(ISEMPTY([Measures].[Value]), NULL,
COUNT(
FILTER(
CROSSJOIN({DESCENDANTS([Retail Store].[Store].CurrentMember,,LEAVES)} ,

[Retail Store].[City].CurrentMember,
[Retail Store].[State].CurrentMember,
[Retail Store].[Area Manager].CurrentMember, ... , ....
),
[Measures].[Value] > 0 )
)
);
END SCOPE;

This does work however has a dependancy that you add any new attributes to the MDX statement or if you rename an attribute it will fail, which is not acceptable to me.

Richard Lees said...

Hi Ben, Your calculation looks OK. All I can think is that Measures.[Value] is not connected to these dimensions. Is that possible?

Ben said...

It's definitely connected, it's the same dimension, just a different attribute or hierarchy. I will have to try this in the Adventure Works and report back

Trouble is my users want a distinct count of every bloody thing!

Richard Lees said...

Ben, that is almost certainly your problem. You have it connected to an attribute that isn't the dimension key. In this case, the measures will only be sensitive to attributes of the attribute you have connected to (and the connecting attribute)

Ben said...

Hi Richard, Thanks for your help on this. I created this measure in the Adventure Works DW 2008R2 and it works for every dimension connected to the measure group except the Hierarchies and Attributes in the Customer Dimension itself?

What could i be doing wrong?

CREATE MEMBER CURRENTCUBE.[Measures].[CustCount]
AS count(nonempty([Customer].[Customer Geography].[Customer].members,[Measures].[Internet Extended Amount])),
FORMAT_STRING = "#,##0;-#,##0",
VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Internet Sales' ;

Ben said...

can someone try this on AdventureWorks to make sure i'm not doing anything wrong. I've got about 6 distinct counts which are taking ages to process.

Thanks

Richard Lees said...

Hi Ben,
For the calc to work with flters on the Customer dim you could add a multiplication (essentially a filter) for each attribute/hierarchy. I don't have the AdventureWorks database, but, might get away with just adding some bits like this
count(nonempty([Customer].[Customer Geography].[Customer].members
*Customer.City
*Customer.State
*Customer.Gender

,[Measures].[Internet Extended Amount]))

By the way, adding this dynamic calc to your cube should add no time to cube processing. That's the advantage of the calculation. Using a DistinctCount measure will take longer to process, but should (it depends actually) provide faster querying.

umko said...

Hi, Richard,
Thanks for the post.

Is there any option to use calculated distinct count with an aggregate function?

I.e. I need to calculate number of customers who shopped with some brands.

MDX (with distinct count) worked fine, but with calculated customer count AGGREGATE function doesn't work, and SUM function returns incorrect results (result includes duplicates).


with
set [Brands] AS
{[Brand].[Brand Name].[Brand Name].&[Brand1],
[Brand].[Brand Name].[Brand Name].&[Brand2]}

member [measures].[Customer_Brand] as
aggregate([Brands], [measures].[Customer Count] )

select [measures].[Customer_Brand] on 0
FROM cube

Thank you.

Richard Lees said...

Hi Umko,
You could just put it in the WHERE clause like this

with
member [measures].[Customer Count] as
Count(nonempty(Customer.CustomerHierarchy.Customer, [measures].[Sales]))

select [measures].[Customer_Brand] on 0
FROM cube
WHERE ({[Brand].[Brand Name].[Brand Name].&[Brand1],
[Brand].[Brand Name].[Brand Name].&[Brand2]})

umko said...

Thanks Richard,

The problem is that this member is only part of my MDX, so I need to keep it separate to calculate the results. I calculate a number of customers who shopped with some brands, a number of customers who doesn't shopped with some brands, several other analytics, and then calculate some proportions.

I hope all this can be kept in one script.

thomas said...

I rather like the same expression with EXISTS + measure group rather than NONEMPTY + measure...

EXISTS([Client Host].[Client IP Address].[Full IP Address],,"measure group name").count

Richard, do you think there is a difference?

thomas said...

I rather like the same expression with EXISTS + measure group rather than NONEMPTY + measure...

EXISTS([Client Host].[Client IP Address].[Full IP Address],,"measure group name").count

Richard, what do you think of this alternative?

Richard Lees said...

I like it.

gcjf said...

hello as I can improve this measure to a very large dimension.
The menbers of the dimension are more than 100,000 and has only a single attribute.

I hope you can help

greetings Jose

Richard Lees said...

Hi Jose,
One thing I didn't mention in the blog is that you can improve performance by filtering attributes from the distinct count dimension.
For example, if you have a dimension on Customers with a calculated Distinct Count on the dimension, the performance of all queries will be a linear function of the number of customers. However, if you have customer attributes, such as city, state, country, performance will be greatly improved for those queries that filter by these attributes. For example, if you have 10 Million customers, with only 500,000 customers in Australia, when you ask for a distinct count where country=Australia, the query will run about 20 times faster.

Jay said...

I am having problems with this when I introduce my parallel period calculations. It appears to do a cell by cell evaluation. It goes from taking under a second to over 5 minutes. Any ideas on how to write this to avoid the cell by cell evaluation?

Richard Lees said...

Jay, ParallelPeriod will make the query a bit harder, although, I don't think it will completely prevent block computation. Could you share your mdx?

Jay said...

Sure, I have a date calculation dimension that does prior year, year over year % etc. Then I have scope statements to calculate these values. Here is an example of the fiscal month level for my prior year.

SCOPE([Date Dimension].[Fiscal Year].[FY Month Name]);
This = IIF(Tail(Descendants([Date Dimension].[Fiscal Year],[Date Dimension].[Fiscal Year].[Day Name]),1).Item(0).Member_Key
> [FiscalLastDay].Item(0).Member_Key
,Aggregate(
PeriodsToDate([Date Dimension].[Fiscal Year].[FY Month Name]
,ParallelPeriod([Date Dimension].[Fiscal Year].[FY Year Name],1,[FiscalLastDay].Item(0)))
,[Date Comparison FY].[Calculation].&[0])
,Aggregate(
Generate(Existing [Date Dimension].[Fiscal Year].[FY Month Name]
,{ParallelPeriod([Date Dimension].[Fiscal Year].[FY Year Name],1,[Date Dimension].[Fiscal Year])})
,[Date Comparison FY].[Calculation].&[0])
);
END SCOPE;

Jay said...

Just some more information, I don't think it is the parallel period itself causing the problem. It seems to be having the calculations in a dimension. When I run the following query it returns quickly.

with member lastyear as
(Parallelperiod([Date Dimension].[Fiscal Year].[FY Year Name], 1, [Date Dimension].[Fiscal Year].[FY Week Name].&[2011]&[16]),distinctalternative)
SELECT (
{[Measures].[distinctalternative],lastyear})
ON COLUMNS
FROM [MyCube]
WHERE ([Date Dimension].[Fiscal Year].[FY Week Name].&[2011]&[16])

Pulp Ark said...
This comment has been removed by the author.
Richard Lees said...

Good question. Yes you can ignore the unknowns by excluding them from the count. Something like this

count(nonempty({[DimName].[HierarchyName].[LevelName].members-[DimName].[HierarchyName][All].UNKNOWNMEMBER},[Measures].[MyMeasure]))

Mayank Jain said...
This comment has been removed by the author.
Mayank Jain said...

Hi Richard,

I am new to MDX, and i have a doubt in the concept you mentioned:
count(nonempty([DimName].[HierarchyName].[LevelName].members,[Measures].[MyMeasure]))

in the above statement which measure do we need to include?

dataset that i have is:
Year EmpId Product
2012 1 A
2012 2 A

In this case when i bring Year & EmpId in row Labels and Distinct count of Product in Values:
i get total at Year level as 1
and total at Year/EmpId as 1 for each empId.
How can i make the total at year level to match the total at EmpId level?

Richard Lees said...

Hi Mayank,
I am not sure that I understand your requirement completely, but I'm thinking that you want to get a distinct count by two dimensions - multiplied together. So, you could do something like count(nonempty([DimName].[HierarchyName].[LevelName].members*[DimName2].[HierarchyName2].[LevelName2].members,[Measures].[MyMeasure])) So, if Course was Dim1 and Student was Dim2, the distinct count would be the number of courses that students enrolled. Hope that makes sense and helps you.

Mayank Jain said...

Hi Richard,
For this to work, do i need to have a Measure in my Fact table? What if i have a factless fact?

And also how can i achieve correct result for all the combinations if i have more than 2 dimensions?