Wednesday, February 15, 2012
Cube processing goes slow when there are dimension key errors
Unfortunately, SSAS (2008 R2) still consumes quite a lot of cpu dealing with these UnknownMembers. This is fine if you only have a few million facts to process, but if you have billions (or even just hundreds of millions) processing noticeably slows down. This manifests as a single cpu busy in msmdsrv. The difference in cube processing time can be by a factor of 20, or so. My rule of thumb is that a partition should be able to process tens of millions of facts per minute. However, if there are key errors, this might drop to less than a million per minute.
If you have configured the ErrorConfiguration to "IgnoreError" you won't even see any error messages when you process in the foreground. But the cpu overhead is still there. The same goes for KeyErrorLimit, cpu will continue to be consumed even though it has stopped logging.
My suggestion is to try, as much as possible, to avoid dimension key errors. A simple coalesce(DimKey,-1) with a "unknown" key value (-1) as the last parameter value, combined with a "unknown" (-1) dimension member. Of course, you don't have to remove every dimension key error, just ensure that there aren't millions (or billions) of them.
Look after yourself
One strategy I use is to go to work using some fun self propelled exercise. In my case this is either paddling a kayak across the harbour, or cycle around the harbour. Here is a video of my unicycle into work last week.
This ensures that I get to work in good humour and helps me keep fit and healthy.
Sunday, January 29, 2012
Analysis Services Configuration Options

Most of these options, you really shouldn't alter, but there are a few that you should know about.
Here are a few SSAS properties that I think you should be aware of, and might want to change.
DataDir
This is where SSAS will, by default, put the SSAS database files. By default, this will be a sub directory of the "program files", which is not an ideal location. The SSAS databases should be on high performance storage. IO is very often a critical factor in large SSAS cubes.
ExternalCommandTimeout
As I understand it, this property is the number of seconds that SSAS will wait for the first row to return from an SQL query during processing. The default is 3600 seconds (1 hour) which is insufficient for many environments. I often change this to 36000 (10 hours) in larger environments, or environments with Oracle as an rdbms (just kidding).
ForceCommitTimeout
This is the amount of time (milliseconds) that SSAS will wait on queries while performing process updates. A process update cannot commit while a query is active. The default is 30 seconds, which I think is quite reasonable, but you might have a need to change this value.
QueryLogSampling
The default value is 10, so that 1 in 10 queries are recorded. In practice, I either need all queries or none. So what I do is change this value to 0 to minimise overhead. When I am running aperformance tuning exercise, I will temporarily change this to 1.
LowMemoryLimit
If this number is between 1 and 100, it is the percentage of memory that SSAS can use, without concern. By default it is 65, which tends to be an appropriate value. You might wonder why your 100GB server never sees 65GB memory for SSAS, even though your cubes are much larger and the IO demands are huge. This is due to SSAS caching logic. Just because you have the memory and the cube is large doesn't mean that SSAS will hold the cube in memory. see my earlier blog http://richardlees.blogspot.com/2011/12/why-doesnt-ssas-cache-entire-cube.html
TotalMemoryLimit
You might see the LowMemoryLimit as a very soft limit, which can be exceeded when processing demands would benefit. The TotalMemoryLimit is a harder limit, but it can also be exceeded when necessary. One way that I look at these two options is the LowMemoryLimit is the memory limit during non processing times, and the TotalMemoryLimit is the limit when SSAS processing jobs are active. However, as I mentioned above, the more common concern is that SSAS isn't using the available memory, rather than it is using too much.
DrillThroughMaxRows
I haven't ever changed this property. Since 2005, SSAS hasn't offered a real drill-through. This is really a drill down option. If you want to support drill through see my blog http://richardlees.blogspot.com/2009/01/using-drill-through-in-analysis.html
Port
If you want to change the port that SSAS listens on (or you are running multiple instances) this is where you specify the port number. By default this value is 0, which denotes 2383.
ServerTimeout
This is the limit for SSAS queries in seconds. After this time (default 3600 or 1 hour) mdx queries will timeout. You may have reason to increase or decrease this value.
Wednesday, December 28, 2011
Remember What's Important
Perhaps, like me, you enjoy your work in the Microsoft software space. That's great, you are very lucky if you enjoy your work. But remember what's really important in this life. Your family and friends.
So, don't work too hard. Eat well and keep exercising. It helps if you find a sport that you really enjoy. At the moment, that is unicycling for me.
See you all in the New Year.
Sunday, December 11, 2011
My SSAS databases are corrupted
As I have mentioned earlier, SQL Server Analysis Services exploits the Windows File Cache to help minimise IO, which might lead you to suspect the Windows cache. However SSAS uses the FlushFileBuffers API to minimise the possibility of having dirty pages in the Windows File Cache. However, this does not protect SSAS from corruption on an unplanned shutdown.
So, if you really want to avoid corrupted SSAS databases on power failure, you should ensure your SSAS servers are supported by UPS.
Most production SSAS servers have UPS. But, if you don't have UPS, like my demonstration site (http://RichardLees.com.au/Sites/Demonstrations) and, like my site, the cubes are continually updating, when the power fails, SSAS databases have a good chance of being corrupted. The only solution to this, that I know of, is to restore the affected (typically all) SSAS databases. I find the quickest way to do this is to stop SSAS, empty the SSAS data directory, start SSAS and recover all the databases. One thing you will also need to do is add back any userids that had SSAS administrator privileges as they were held in a file on the SSAS data directory.
Tuesday, December 6, 2011
Windows Server Performance Tuning
The document is in Word format http://msdn.microsoft.com/en-us/windows/hardware/gg463394
Monday, December 5, 2011
Why doesn't SSAS cache the entire cube?
I have seen this effect on many production servers but never quite believed it. There is much more RAM than cube size, yet SSAS is continually losing data from cache and asking for physical IO. Actually, SSAS relies on the Windows File Cache to reduce much of its IO. However, the Windows File Cache does not cache the entire cube either.
To demonstrate this effect, I have set up a 36GB server, dedicated to SSAS, and set up an 11GB MOLAP cube. Then, over a period of 19 hours, I have client tasks querying the cube in a semi random fashion. The MOLAP cube was not updated over the period. Over the 19 hours, if SSAS or Windows File Cache were effective, there wouldn't be much more than 11GB of read IO on the SSAS data drive over the 19 hours. However, the system reached an equilibrium, where it was requesting over 20MB of read IO/second. If you add this up, the system read 1.3 terabytes of data from the 11GB cube (20MB*60*60*19). Obviously, the system was not effectively caching the cube. This is very different from SQL Server, which would quite happily keep an 11GB database in memory, effectively eliminating further read IO.
My server was running Windows 2008, SQL Server 2008 R2 with default configuration settings. The SSAS database was located on the d: drive. Nothing else is located on the d: drive
Here is a Perfmon chart with some of the key counters during my test.
SSAS (msmdsrv.exe) did not use more than 12GB of memory during the test, which is well below the LowMemoryLimit 65% (which is about 23GB).During the 19 hours, SSAS requested, on average, 102MB/second of read IO. The Windows File Cache was able to satisfy about 80% of these read requests, so the physical disk bytes/second was only 20MB/second.
The Perfmon chart above shows some of the key counters. Notice
- The Process (msmdsrv) Bytes/second y axis scale is in MB, often going off scale at 100MB/sec. These are not necessarily physical IO, as the Windows cache will satisfy many.
- The PhysicalDisk Bytes/second is averaging 20.1MB/second for the 19 hours
- The Windows Cache Bytes is sitting between 0.6GB and 1.1 GB.
- The red line (Cache Copy Reads/sec) is the IO requested of the cache. These are IO that the Windows cache has managed to satisfy by finding the page in its cache.
- See how the Cache Copy Reads/sec is satisfying IO requested of the msmdsrv process, reducing the PhysicalDisk Bytes/sec. Not so easy to see, as the reads are reads/sec, while the PhysicalDisk and Process are in bytes/sec, but you can see it.
However, you can clearly see that SSAS (or the system) is performing much more IO than would be necessary if the entire 11GB cube was cached in memory.
There are a couple of takeaways from this exercise
- Be thankful that we have a Windows Cache, and consider configuring it for SSAS.
- Try and put SSAS databases on very fast (ideally solid state disks) storage devices. If you can't avoid the IO, at least make it fast.
- Partition large cubes. This helps reduce logical IO requests and to the extent it bunches the popular data together it will probably help the Windows Cache hold the hot data more effectively.
- Hopefully, in a future version, either SSAS or the Windows File Cache will cache more of the cube.
Sunday, November 27, 2011
Causing a Deadlock in SQL Server
If you want to cause a deadlock, it is quite easy. Most DBAs will be able to do it, and it can be useful in determining what profiler and logging information is available in the event of an unplanned deadlock.
Here are some instructions to cause a deadlock between 3 transactions. Most deadlocks are between 2 transactions, but a deadlock chain can have any number of transactions. By the way, there are many ways to set up a deadlock, the only common characteristic is that there are a chain of processes that are waiting on each other for locks. So in a 2 process chain, Process A is waiting on a resource held by Process B and Process B is waiting on a resource held by Process A. As you can imagine, the only solution is for one of the transactions to be rolled back. They can't both go forward.
In my experience, a common class of deadlocks is the one caused by update (update, insert or delete) transactions that update so many rows that the lock must be escalated. For example, escalate several page locks to a table lock. When there are two transactions like this starting about the same time, they both start updating, and taking locks, until one of them escalates to a table lock. At this moment that transaction waits on the other. But if the other is also destined to escalate to a table lock on the same table, there will be a deadlock.
Now, to cause a deadlock between 3 tasks follow these instructions.
You might like to start SQL Profiler and enable a trace for Lock:Deadlock graph, Lock:Deadlock Chain and Lock:Deadlock, which will give you considerable information on the deadlock chain participants and resource locks.
Create 3 tables
- create table t1 (c1 int, c2 varchar(50))
- create table t2 (c1 int, c2 varchar(50))
- create table t3 (c1 int, c2 varchar(50))
In the first query window execute
begin tran
insert into t1 select 1, 'abc'
In the second query window execute
begin tran
insert into t2 select 2, 'xyz'
Select * from t1
The second query window will be waiting on the first query.
In the third query window execute
begin tran
insert into t3 select 3, 'mno'
Select * from t2
The third query window will be waiting on the second query. At this point in time there is no deadlock. We just have a locking chain, with query 1 at the head. You can see the locking chain if you execute.
sp_who2
In the first query window execute
--begin tran
--insert into t1 select 1, 'abc'
Select * from t3
Now, query 1 will be waiting on query 3, which is waiting on query 2, which is waiting on query 1. We have a deadlock. You should notice that within a couple of seconds, one of the queries is cancelled and the transaction rolled back.
Msg 1205, Level 13, State 45, Line 4
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Now, have a look at the Profiler output, and you should see your deadlock chart with the three processes and the locked resources. The transaction that was selected as the victim, and rolled back, has a cross through it. If you hover over the transactions, you will see the last batch to be submitted. Note, the last batch is not necessarily the entire transaction. There may have been an explicit BEGIN TRAN and many statements within the transaction that aren't included in the chart. If you manage to capture a blocking transaction in action, you can see the locks it has (and ones its waiting on) with the sp_lock command.
Remember to COMMIT or ROLLBACK the two queries that weren't rolled back. There is nothing worse that a query holding locks while the user has gone off to do something else.
Tuesday, November 22, 2011
Fix Query Performance by Creating Statistics
SQL Server's optimiser will automatically create statistics on individual columns that are used in predicates. See picture of database properties to the right. However, it will not create statistics on column combinations unless they are composite columns in an index. This is where you might be able to help the optimiser by creating statistics on column combinations. This technique is most appropriate where the columns in your predicates have a relationship. For example, my InternetLogtable (1.2 Million rows) contains a row for each internet request. It has columns ClientAgentId and ClientHostId, where ClientAgent is the name of the browser, and ClientHost is the IP address of the client machine. As you can imagine, most requests from one ClientHost will tend to have the same ClientAgent. However, SQL Server's optimiser does not know this. It has statistics on ClientAgentId and ClientHostID so if when it determines the cost of the following queries it can quite accurately determine how many rows will be returned. For the first query it estimates 332 rows will be returned. For the second query it estimates 110 rows will be returned. In both cases, the estimate is very accurate. (There do happen to be indexes on each of the individual predicate columns, although even if there weren't indexes, it would automatically create statistics and get similar estimates.) select * from internetlogtable with (nolock)
The optimiser assumes that ClientAgentId and ClientHostId are independent and thus there will be very few, if any, records satisfying both predicates. Essentially it is guessing that the number of rows returned will be 332/1200000 * 110/1200000. It rounds the number up to 1. Here is the query plan without statistics. Notice how the Estimated Number of Rows is 1. Actually, when it says 1, it typically is estimating <1 records, but it always rounds up to 1.Now, what would happen if we create statistics on the two predicate columns?
create statistics StatsAgentIdHostId
on internetlogtable (ClientHostID,ClientAgentID)
SQL Server's optimiser now has data on the combinations of ClientHostId and ClientAgentId and finds that this particular combination will return about 110 rows. Here is the query plan. Notice that the Estimated Number of Rows is now 110.96! This is much more accurate.
In the query example I have used here, the query plans are different, although not dramatically different. However, if this table was joined with a few others, some of which very large, the query plan can change dramatically and make the difference between a sub second query and a several minute query.
Understanding these subtleties of the SQL Server query optimiser will help you tune queries in an efficient manner. If you were not aware of this technique to create bespoke statistics, you might solve this issue with a new index on the two columns, which would solve this query performance but at a great cost to overall performance.
By the way, here is an illustration of what statistics SQL Server keeps. There is a lot of information. For example, there are even as many key ranges as can fit in an 8KB page with the number of distinct keys, equal keys and rows in the range. Notice the All density column in the second set of statistics. It is telling us that a single value for ClientHostId, will on average return 0.0001507841 of the total rows. It also tells us that ClientHostId and ClientAgentId will, on average return 0.0001226091 of the rows. Not much less! Ie, if you have the ClientHostId, the ClientAgentId does not filter the rows much more.I encourage you to get to know the SQL optimiser better. The more you know about the optimiser, the more you will be able to tune your queries. And the way to get to know the optimiser is through the query show plan.
Saturday, September 17, 2011
Parallelised Queries are Selfish Queries
I am prompted into writing this post after reading the msdn article Understanding and Controlling Parallel Query Processing in SQL Server. It is a very informative article. I cannot disagree with anything in the article, but there is an elephant in the room. The article dissregards IO performance bottlenecks. It actually states that it will disregard IO performance. "We assume that your I/O system is configured properly ..... so that the I/O system is not the primary performance limiter." I am going to discuss IO performance with MAXDOP, because I believe it is often the main performance limiter.As a bit of background, SQL Server has a parallel query processing feature that is enabled by default. This permits SQL Optimiser to choose a plan that involves breaking one query into sub tasks that can run in parallel. This can involve executing, essentially, the same plan, but in parallel subtasks, or it can involve a completely different plan and again breaking the query into subtasks that can run in parallel. Really quite an amazing piece of technology. When it is working well for a query, it is interesting to see figures like cpu=10 seconds, elapsed=1 second!
In many of the very large database environments where I work, IO is the most critical resource. Perhaps this is my misfortune, but I suggest that IO performance is often a performance limiter in large database systems, and probably will be so until SSD (solid state disks) are more commonly employed. The problem with parallel query plans is that one query can simultaneously request multiple table scans, selfishly requesting more resources in parallel in the hope of getting a faster response. Just one table scan from one query will tend to have parallel IO requests, as multiple SQL Server read engines are given multiple page read ahead requests. So, even with a single non parallelised SQL query, there will already be some parallel IO requests. Run several non parallelised queries at the same time and there are many more parallel IO requests than there are queries. That is all before Parallel Query Processing kicks in!
Let's take a very simple example, where the SQL query has a predicate on a non indexed column of a very large table (the table is larger than SQL cache), therefore requiring a sequential scan of the table. Without Parallel Query Processing, SQL Server will sequentially scan the very large table, using several read engines working in parallel to get multiple page ranges. Note, there is already parallel IO requests with a single non parallelised SQL query. Now if we enable Parallel Query, SQL can break the scan into several sub tasks, each of which will ask for range scans of the table. Each of these scans can busy up multiple read engines, which can lead to IO throughput saturation. That is the parallelised query will result in scanning the same table, but it will flood the IO system with many more parallel requests that can end up in a long queue. A good way to think about it is to imagine an old fashioned bank branch with 30 tellers and everyone queues up for the next available teller. That would be non parallelised. Now imagine that someone comes in with bags of money that requires lots of counting and when he/she gets to the front of the queue, all the tellers attend to counting his/her money together (in parallel). This might be good for the selfish individual, they get served faster, but it is not good for the performance of the branch, as now all customers will be waiting for the selfish customer to finish. For this reason, I like to think of parallelised queries as selfish, or unsocial. They want to take all the resources, even though this is not an efficient use of the overall resources and others are negatively impacted. This is fine in a system with very few (perhaps one) customer. Note, there is always an overhead in parallelising a query, even if it is using essentially the same plan as a non parallelised query. This is because there is overhead in distributing and regathering the sub tasks.
There is an argument that data warehouses have very few users so that Parallel Query is an ideal performance option. However, I see modern data warehouses with large numbers of users simultaneously submitting queries (dashboards, web pages, reports etc), and you typically don't want one user to selfishly hog critical resources such as IO.
I might sound like I am against Parallel Query. That isn't true. It is a fantastic feature. However, I would encourage DBAs to consider setting Max Degree of Parallelism to 1. Note; that is not disabling Parallel Query, it is just the default. So, by default, queries will not be selfish. If you have a particular ETL task or report query that you want to parallelise, you can simple add OPTION (MAXDOP n) to the query, where n is the number of potential sub tasks running in parallel. So you are not turning it off, you are just ensuring that queries aren't selfish by default. Also, just to be clear, with Max Degree of Parallelism set to 1, your system can still have many SQL queries running in parallel (with each other) it is only that individual queries will not be broken into parallel running subtasks.The reason I configured Max Degree of Parallelism to 1 the first few times was because I was working in environments with a wayward parallelised queries. SQL Optimiser chose a very very poor plan thinking that parallelisation would make this plan fast. The same queries with Max Degree of Parallelism =1 (or MAXDOP 1) the optimiser chose a very different, and efficient plan. When I reduced the "Max Degree of Parallelism" to 1, these queries, and the whole system ran more efficiently.
An annoying limitation of MAXDOP is that you are unable to set MAXDOP for queries that SQL Server Analysis Services uses when processing fact tables. You can only provide SSAS with a table or view name, and it constructs the query. There is no way to inject a MAXDOP option. This is another reason I like to set Max Degree of Parallelism = 1, so that the default setting is non parallelised.
One word of warning. Changing Max Degree of Parallelism to 1, can (in fact by design it should) make some queries run slower. It can also change dramatically the query plan (as I discussed above) and while this will usually result in more efficient queries, it can also involve a poorer performing plan. In my experience the reverse is more likely (when there are considerably different plans between parallelised and non parallelised, the non parallelised plan is more often the more efficient. Consequently, it is less risky to configure Max Degree of Parallelism =1 while developing a solution (and using MAXDOP n for individual queries where appropriate) than to change a production environment.
The effect of Max Degree of Parallelism is not just on the individual query. If you want to know the impact of Max Degree of Parallelism, you should monitor vital statistics (cpu, memory, IO), get average query response times (for common queries) and compare them with before and after configuration changes. Another, albeit small, cost of Max Degree of Parallelism<>1 is that each parallelised query will now have two plans (it always keeps a non parallel version for when cpu is high) so that more pressure is put on the plan cache (it is bigger, or it holds fewer plans). Note, SQL Server is great in that you can change the Max Degree of Parallelism option dynamically. No SQL Server restart required. You might even consider setting Max Degree of Parallelism =1 for the online day and resetting Max Degree of Parallelism =0 for the overnight ETL.
A closely related issue is Resource Governor, which also has a MAXDOP option that looks like it can override the MAXDOP option in the query. However, as I understand it, the Resource Governor does not work like MAXDOP in a query or sp_configure. If you set MAXDOP to 1 in the Resource Governor, the underlying queries can still chose parallel plans (no change there). What the Resource Governor does is to restrict the query to only one thread! This is almost the worst of both worlds, as the optimiser can choose the selfish (potentially faster) parallel plan, but the Resource Governor will ensure that the parallel tasks are serialised with just one thread.
Sunday, September 11, 2011
Why are there less data in the cube than the fact table?
- The measure datatype cannot hold the aggregated number. For example, the fact table might have an integer datatype for the SalesQuantity column, which might be fine for the SQL table. However, by default, SSAS will give the SalesQuantity measure the same datatype, but now it must be able to contain the sum(SalesQuantity). If sum(SalesQuantity) is greater than 2^31 SSAS will not display any errors, and the Measures.SalesQuantity will show numbers that appear to wrap around. So the numbers are much smaller than expected (almost random) at an aggregated level. The solution is simple, just change the datatype in the cube to a much larger number, for example bigint, which can be up to 2^63. One of the good things about using bigint over int, is that the cube is almost exactly the same size, as SSAS will compress the leading zeroes in binary numbers very effectively. If you need numbers larger than 2^63, you might need to use a double datatype, which, although potentially much larger is an imprecise number. Also, a double datatype measure, will typically make the cube significantly larger than a binary measure containing the same information.
- There are missing members on one or more dimensions and missing members are not enabled. The missing member facts will not be included in the cube. I personally believe the cube should contain all the facts in the fact able. Even if there is a missing member, the data should still be available. For example, if there are sales for a ProductId that does not exist, typically, you will still want to know about the sales even though you cannot say which product it was for. I prefer to enable missing members, and just hide them if you must. It is also a good practice to regularly check for missing members, either in the DW or in the cube (through missing members).
- The third situation where the cube numbers are light is an unusual one. It is where the fact table (actually a view) returns an arithmetic error such as divide by zero. Most times SSAS will sympathetically return the SQL error, and you will know there is a problem during cube processing. However, sometimes SSAS does not sympathetically return the SQL error, rather, it returns a successful cube process with the rows up until the divide by zero. There are two ways of locating this error. Firstly, you could simply execute the same SQL query in SQLEM window and wait for the full resultset or the arithmetic error to return. This is not practical in many circumstances due to the number of rows in the resultset. I guess you could get SSIS to help you, but an easier way is to use SQL Profiler and trace Error events.
- You have a referenced dimension and the INNER JOIN clause that SSAS is generating for SQL Server is removing fact records that don't join with the intermediate dimension. To identify this issue, you need to execute the SQL generated by SSAS, and compare its counts with the raw fact table.
- Is your cube (measure group) processed incrementally? If so, there might be a logic issue with the way you are incrementally processing. To verify, you could reprocess the entire measure group and see if the numbers now equal the fact table.
- Is the data in your fact table changing? Of course, if the fact table is changing, the MOLAP cube won't know about it. This can be related to incremental processing, where the incremental processing logic is good, but historic data records are changing. Strictly speaking you cannot incrementally process updates, only new records. If the data has changed, the cube (partition) will require processing.
- Have you a default member that is not the [All] mmeber, or do you have a dimension without an [All] member. In these cases, you would need to take this into account when comparing Measures.NetSales with sum(NetSales).
- Use physical measures in the cube. Even if they are hidden, you can still query the physical measures.
- Ideally, start off by reconciling record counts. If you can, put a record count in the cube.
- When you are running SQL queries against the fact table, use the SQL statement that is generated by SSAS, since it might be referencing a different table/view than you expected. Also, as in point 4 above, it might be performing an INNER JOIN with an intermediate dimension that has missing members.
Wednesday, August 10, 2011
Improving MDX join performance
Server: The operation has been cancelled.
The solution is to write the MDX query in such a way that the intermediate sets are smaller. Here is an example of a problematic query that is asking for the top 10 hour/City/Date combinations. The output is simply 10 rows, but AS must create a set with hundreds of Millions of members to get the top 10.
select
[Measures].[Bytes Total] on Columns,
topcount(
[Client Host].[Client Geography].[City]
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day]
,10,[Measures].[Bytes Total])
on Rows
from EasternMining
Here is a way to improve performance using the generate() function, another one of my favourite MDX functions.
select
[Measures].[Bytes Total] on Columns,
topcount(
generate([Client Host].[Client Geography].[City]
,{[Client Host].[Client Geography].currentmember}
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day])
,10,[Measures].[Bytes Total])
on Rows
from EasternMining
The generate statement is breaking the query into a couple of steps, so that the largest set will probably be the Days*Hours, from which it is only keeping the top 10 for each city. The outside set, for every city, will only have 10 rows per city. The query is logically equivalent to the first query, meaning it will return the same results.
We could take this logic further with a second generate statement so the inside query only has 1o rows per city day combination.
select
[Measures].[Bytes Total] on Columns,
topcount(
generate([Client Host].[Client Geography].[City]
,topcount(
generate([HoursOfDay].[HoursOfDay].[Hour Of Day]
,topcount({[HoursOfDay].[HoursOfDay].currentmember}
*{[Client Host].[Client Geography].
currentmember}
*[Date].[Year Month Day].[Day]
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total])
on Rows
from EasternMining
We can improve this query further by requesting nonempty() sets
select
[Measures].[Bytes Total] on Columns,
topcount(
generate(nonempty([Client Host].[Client Geography].[City],[Measures].[Bytes Total])
,topcount(
generate(nonempty([HoursOfDay].[HoursOfDay].[Hour Of Day],[Measures].[Bytes Total])
,topcount({[HoursOfDay].[HoursOfDay].currentmember}
*{[Client Host].[Client Geography].currentmember}
*nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total])
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total])
on Rows
from EasternMining
The NonEmpty() function is a very simple way to reduce the set size, and of course it can be done without using the generate() function. So going back to the original query, we might have found that the following change was sufficient to avoid resource errors.
select
[Measures].[Bytes Total] on Columns,
topcount(
nonempty(
[Client Host].[Client Geography].[City]
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day]
,[Measures].[Bytes Total])
,10,[Measures].[Bytes Total])
on Rows
from EasternMining
However, I wanted to show the generate() function first, as it can improve the query in a way that the nonempty() cannot.
I should add that generate() and nonempty() functions are not panaceas for MDX query performance. The more you know MDX and the way SSAS executes the query, the more you will have in your toolkit to improve query performance.
Saturday, July 16, 2011
MDX Root() function doesn't support multiselect
One of my other favourite MDX functions is Root(). It's a really easy way to get all the filters off a dimension. I often use it in calculations defined in the cube.However the Root() function does not support multi-select and you might get this error message.
#Error Query (6, 30) The MDX function ROOT failed because the coordinate for the 'Region' attribute contains a set.
Fortunately, there is a workaround. Unfortunately, it means not using the Root() function and going back to the .[All] member of all the attributes (or all the ones being used in the query) of the dimension.
For example, the following query will hit the error.
WITH
MEMBER [Client Host].Region.[AfricaAsia] AS
AGGREGATE({[Client Host].Region.&[Africa]
,[Client Host].Region.&[Asia]})
MEMBER Measures.[HitsPercent] as
[Measures].[Request Count]/(ROOT([Client Host]), [Measures].[Request Count])
SELECT
{Measures.[HitsPercent]}
ON COLUMNS
,[Date].[Month].[Month]
ON ROWS
FROM [EasternMining]
WHERE
([Client Host].Region.[AfricaAsia])
The fix is simply to change Root([Client Host]) to [Client Host].Region.[All]. ie.
WITH
MEMBER [Client Host].Region.[AfricaAsia] AS
AGGREGATE({[Client Host].Region.&[Africa]
,[Client Host].Region.&[Asia]})
MEMBER Measures.[HitsPercent] as
// [Measures].[Request Count]/(ROOT([Client Host]), [Measures].[Request Count])
[Measures].[Request Count]/([Client Host].Region.[All], [Measures].[Request Count])
,format_string="#,#0.00%"
,non_empty_behavior="measures.[Request Count]"
SELECT
{Measures.[HitsPercent]}
ON COLUMNS
,[Date].[Month].[Month]
ON ROWS
FROM [EasternMining]
WHERE
([Client Host].Region.[AfricaAsia])
Note, I have only put the .[All] member in for the Region attribute, whereas you really need to add in a .[All] member for all the dimension attributes to be equivalent to the Root() function.
By the way, I don't know why the Root() function doesn't support multi-select. One would imagine that it should.
Because of this limitation, I might be using the Root() function less liberally in calculations stored in the cube.
Saturday, June 25, 2011
Single Click SQL Management Studio
On the shortcut simple add -S
For eample, on my PC, the shortcut is
"C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\Ssms.exe" -S Manly -d master -E
Friday, June 10, 2011
linkmember, my favourite MDX function
You might have a report that displays the order date on rows, with sales on columns. You would also like add a column to show sales, but sliced by delivery date. So what you want to do, is put the same, sales, measure on columns, but have the "delivery date" dimension on rows.
This is exactly what the linkmember does. Simply create a calc member that uses linkmember to switch the OrderDate on rows to DeliveryDate. Here is a simple example.
with member
Measures.AmountDelivered as
(Measures.Amount,
linkmember(OrderDate.CalendarYWD,DeliveryDate.CalendarYWD),
root(OrderDate))
select
{Measures.Amount,
Measures.AmountDelivered}
on Columns,
non empty
OrderDate.CalendarYWD.Day on Rows
from MyCube
One thing that you need to remember is that the linkmember will effectively get the member with the same key in another dimension. However, it will not change the original member. So you probably want to also override the original member with a root() or .[All].
Another way of achieving the same functionality as LinkMember is to extract the key and use StrToMember() to construct the appropriate member of the other dimension. However, LinkMember is much more efficient, and is the preferred tool.
Sunday, June 5, 2011
Tech Ed 2011 Sessions available
Wednesday, May 4, 2011
How many cubes
Since SQL Server 2005, it is possible to put dozens of fact tables (cubes in the old days) into one cube with separate measure groups for each fact table. This is a great feature. For example, Sales and Budget have slightly different dimensionality as Sales is by cashier and hour of day, whereas, budget does not have cashier and time granularity is only available down to week. Having these in the same cube is great, since it makes querying of sales against budget (by all the common dimensions) very easy. Also, if a user is only interested in one measure group, say sales, they can ask the browser that measure group, and they will on see the associated measures and dimensions. In the same way they might browse a separate cube.
You might ask, are there any performance implications of having one super cube?
- In terms of cube size and processing overhead, the design alternatives are comparable.
- MDX queries can be broken down to FE (formula engine) and SE (storage engine) components. The SE component of a query isn't materially affected by unrelated dimensions, since the storage is comparable. However, the FE component of a query is affected by the unrelated dimensions as it is works within the entire cube space (larger with more dimensions).
There might also be operational considerations. The design alternative I have discussed is having multiple cubes in the same database, but if the cubes really are unrelated, you might consider having separate SSAS databases for them. This would lead to more operational options including, parallel processing, and even hosting databases on separate servers.
So, as a cube designer, you must weigh up the performance, functional and maintenance costs. As an over simplified summary, on the side of mini cubes is the minimisation of FE performance costs, while on side of super cubes are increased user functionality.
For another discussion on this topic see Chris Webb's blog
MDX for Top n by group
Here is a little MDX that will get the top n list for each of another set of members.For example, for Each region, I might want to get a list of the top images in that region.
The query is very simple, just a matter of employing the generate function.
select [Measures].[Hits] on 0,
non empty
generate([Client Host].[Client Geography].[Region],
([Client Host].[Client Geography],
topcount(order([Target].[Resource Hierarchy].[Resource Type].&[.jpg].children,[Measures].[Hits],bdesc),10)))
on 1
from EasternMining
You can plug in the grouping, ordering that you want. Even nested top n lists.
Sunday, March 13, 2011
(re) Clustering a Large Online Table
- Create the new table with the clustered (and non clustered) indexes you want.
- Create a control table to maintain high-low watermarks. Set the high water mark to the maximum ascending key value in the live table.
- Perform a bulk load (SSIS Data Flow task) for all the live table rows up to and including the high water mark. This task might take several hours.
- Set the low water mark to the high water mark, then re set the high water mark to the new maximum ascending key value in the live table.
- Perform the bulk load again. This time it will just be the rows that have been inserted during our earlier bulk load. After this task, the new table will be very close, but not quite, up to date to the live table.
- Within the scope of a transaction lock the live table, insert records above the high water mark, rename the tables (and PKs) and commit the transaction. The new table is now live.
A few notes on the process
- You will need to have sufficient space in your database to accommodate two copies of the table.
- You will need to have sufficient space in your database logs to hold log records from the start of the Main Bulk Load to the end of the Main Bulk Load. So, if the bulk load takes 4 hours, their will be a transaction open for 4 hours and a log truncation won't truncate anything within the 4 hours.
- You will need to have sufficient space in tempdb to perform the sorts for the clustered index and nonclustered indexes.
- If you have sufficient space in the database files and logs, you should set the maximum commit size on the destination to 0 (zero). Having other values will tend to slow down the load process (although less disk space is required) since the indexes will be maintained at each commit.
- It is quite likely that your large tables are partitioned and compressed. This doesn't need to make any difference to the reclustering process above. However, if the table is already partitioned, you could improve performance by looping through each partition. If your table isn't already partitioned, this process could be the opportunity to partition your table.
- Don't be tempted to use NOLOCK when reading the live table.
- Test your package before you run it in production.
There is no special code in the package I have created. The most significant task is the very last one, which is performed within the scope of a transaction. To ensure there is no opportunity of a deadlock, the task first asks for an X lock on the live table. Here is some sample SQL from each of the tasks.
Create Control table
create table ReclusteringControl( ReclusteringControlID int default 1, LowWater bigint not null default 0, HighWater bigint not null default 0, CONSTRAINT [PK_ReclusteringControl] PRIMARY KEY NONCLUSTERED (ReclusteringControlID ASC))
Insert into ReclusteringControl values(1,0,0)
update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)
Input for Main Bulk Load
select * from MyLargeTablewhere rid<=(select HighWater from ReclusteringControl)
Set New HighWatermark
update ReclusteringControl set LowWater=HighWater
update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)
Second Bulk Load (input statement)
select * from MyLargeTablewhere rid<=(select HighWater from ReclusteringControl) and rid>(select LowWater from ReclusteringControl)
Final INSERT and Rename
--set identity_insert MyLargeTable_new on
--This is required if there is an identity column on the table
GO
begin transaction
update ReclusteringControl
set LowWater=HighWater
select top 0 * from MyLargeTable with (tablockx)
update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)
insert into MyLargeTable_new --columns need to be explicitly declared if identity cols are inserted
select * from MyLargeTable where rid<=(select HighWater from ReclusteringControl) and rid>(select LowWater from ReclusteringControl);
exec sp_rename MyLargeTable,XXXX_MyLargeTable;--should delete this table
exec sp_rename MyLargeTable_new,MyLargeTable;
exec sp_rename PK_MyLargeTable, PK_MyLargeTable_XXXX
exec sp_rename PK_MyLargeTable_new, PK_MyLargeTable
commit transaction
--set identity_insert MyLargeTable off
Monday, February 14, 2011
Anything is possible with MDX
tail(nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total]),7)
with
set Last7Days as tail(nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total]),7)
member Measures.TopHourName as order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0).member_caption
member Measures.TopHourSessions as (order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0),[Measures].[Sessions])
member Measures.TopHourHits as (order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0),[Measures].[Hits])
select
{Measures.TopHourName,Measures.TopHourSessions,Measures.TopHourHits} on columns,
[Date].[Month].[Month]*Last7Days on rows
from EasternMining
where ([Client Host].[Client Geography].[Region].&[Americas].[UNITED STATES])
Sunday, January 16, 2011
PowerPivot versus Analysis Services
If you are really interested in hearing Chris speak on this topic, there is a recording of him talking in Belgium on this topic here http://sqlserverday.be/video/view/?id=2
I found the quiz very interesting. Not the quiz final recommendation (PP vs AS), but the quiz questions and how the answers were interpreted, which exposes Chris' analysis. Here is a copy of Chris' quiz with the questions shaded in green and red interpreting Chris' analysis of the question answer here https://spreadsheets0.google.com/ccc?key=tcyHIY2MtHCkj-wTJUAb3Tg&hl=en_GB#
You can use this copy of the quiz to quickly get an understanding of Chris' analysis of the strengths and weaknesses of PowerPivot.
Sunday, December 12, 2010
Warming the OLAP cache
CREATE CACHE FOR [EasternMining] AS
[Port].[Ports].&[80]
*{[Measures].Members}
*{[Date].[Year Month Day].[Year].&[2010].[January].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3]}
go
CREATE CACHE FOR [EasternMining] AS
[Port].[Ports].&[80]
*{[Measures].Members }
*[HoursOfDay].[Hour Of Day]
*{[Date].[Year Month Day].[Year].&[2010].[November].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3]}
go
member measures.MyCalc as
avg(tail(nonempty([Date].[Year Month Day].[Year].&[2010].[November].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3].lag(1)
,[Measures].[Bytes Total]),6)
,[Measures].[Bytes Total])
SELECT {measures.MyCalc} ON COLUMNS ,
{[HoursOfDay].[All HoursOfDay].[00:00-01:59].[00:00-00:59],[HoursOfDay].[All HoursOfDay].[00:00-01:59].[01:00-01:59],
[HoursOfDay].[All HoursOfDay].[02:00-03:59].[02:00-02:59],[HoursOfDay].[All HoursOfDay].[02:00-03:59].[03:00-03:59],
[HoursOfDay].[All HoursOfDay].[04:00-05:59].[04:00-04:59],[HoursOfDay].[All HoursOfDay].[04:00-05:59].[05:00-05:59],
[HoursOfDay].[All HoursOfDay].[06:00-07:59].[06:00-06:59] ,[HoursOfDay].[All HoursOfDay].[06:00-07:59].[07:00-07:59],
[HoursOfDay].[All HoursOfDay].[08:00-09:59].[08:00-08:59],[HoursOfDay].[All HoursOfDay].[08:00-09:59].[09:00-09:59] }
ON ROWS
FROM EasternMining
WHERE ([Port].[Ports].&[80] )
Saturday, October 16, 2010
What's happening on my site?
It's great with Microsoft PerformancePoint dashboards that you can dynamically drill down the table as an ad hoc cube browsing tool. I have drilled down to see the pages from the referrer, the http command
Sunday, October 10, 2010
SSAS Dynamic Security
Wednesday, August 18, 2010
Write Back cubes are easy with Excel 2010
Monday, July 26, 2010
Getting the last non empty value
iif(isempty(Measures.Hits),
([Date].[Year Month Day].prevmember,
Measures.LastHits
),Measures.Hits)
With Member Measures.LastHits as
iif(isempty(Measures.Hits),
([Date].[Year Month Day].prevmember,
Measures.LastHits
),Measures.Hits)
Member Measures.LastHitsMTD as
sum(PeriodsToDate([Date].[Year Month Day].[Month],[Date].[Year Month Day]),Measures.LastHits)
select
{Measures.Hits,Measures.LastHits,Measures.LastHitsMTD}
on Columns,
tail([Date].[Year Month Day].[Day],31)
on rows
from
EasternMining






