Wednesday, February 15, 2012

Cube processing goes slow when there are dimension key errors

I try and keep my cubes as clean as possible, so that dimension key errors are minimised. However, as a safeguard I tend to enable UnknownMembers. This way the totals are correct even if the customer, item, cashier etc isn't found in the dimension. 
Unfortunately, SSAS (2008 R2) still consumes quite a lot of cpu dealing with these UnknownMembers. This is fine if you only have a few million facts to process, but if you have billions (or even just hundreds of millions) processing noticeably slows down. This manifests as a single cpu busy in msmdsrv. The difference in cube processing time can be by a factor of 20, or so. My rule of thumb is that a partition should be able to process tens of millions of facts per minute. However, if there are key errors, this might drop to less than a million per minute.

If you have configured the ErrorConfiguration to "IgnoreError" you won't even see any error messages when you process in the foreground. But the cpu overhead is still there. The same goes for KeyErrorLimit, cpu will continue to be consumed even though it has stopped logging.

My suggestion is to try, as much as possible, to avoid dimension key errors. A simple coalesce(DimKey,-1) with a "unknown" key value (-1) as the last parameter value, combined with a "unknown" (-1) dimension member. Of course, you don't have to remove every dimension key error, just ensure that there aren't millions (or billions) of them.

Look after yourself

I primarily write about business intelligence topics.  However, I believe there is much more to life than BI, and I encourage everyone working in BI to ensure they maintain a balanced life, whatever that is. 

One strategy I use is to go to work using some fun self propelled exercise. In my case this is either paddling a kayak across the harbour, or cycle around the harbour. Here is a video of my unicycle into work last week.
 
This ensures that I get to work in good humour and helps me keep fit and healthy.

Sunday, January 29, 2012

Analysis Services Configuration Options

To see the SSAS configuration options (properties), right click on the service from SQL Server Management Studio and select properties.  You will only see the Basic options, by default.  To see all options, check the "Show Advanced (All) Properties".

Most of these options, you really shouldn't alter, but there are a few that you should know about. 
Here are a few SSAS properties that I think you should be aware of, and might want to change.

DataDir
This is where SSAS will, by default, put the SSAS database files.  By default, this will be a sub directory of the "program files", which is not an ideal location.  The SSAS databases should be on high performance storage.  IO is very often a critical factor in large SSAS cubes.

ExternalCommandTimeout
As I understand it, this property is the number of seconds that SSAS will wait for the first row to return from an SQL query during processing.  The default is 3600 seconds (1 hour) which is insufficient for many environments.   I often change this to 36000 (10 hours) in larger environments, or environments with Oracle as an rdbms (just kidding).

ForceCommitTimeout
This is the amount of time (milliseconds) that SSAS will wait on queries while performing process updates. A process update cannot commit while a query is active. The default is 30 seconds, which I think is quite reasonable, but you might have a need to change this value.

QueryLogSampling
The default value is 10, so that 1 in 10 queries are recorded.  In practice, I either need all queries or none.  So what I do is change this value to 0 to minimise overhead.   When I am running aperformance tuning exercise, I will temporarily change this to 1.

LowMemoryLimit
If this number is between 1 and 100, it is the percentage of memory that SSAS can use, without concern.  By default it is 65, which tends to be an appropriate value.  You might wonder why your 100GB server never sees 65GB memory for SSAS, even though your cubes are much larger and the IO demands are huge.  This is due to SSAS caching logic.  Just because you have the memory and the cube is large doesn't mean that SSAS will hold the cube in memory.  see my earlier blog http://richardlees.blogspot.com/2011/12/why-doesnt-ssas-cache-entire-cube.html

TotalMemoryLimit
You might see the LowMemoryLimit as a very soft limit, which can be exceeded when processing demands would benefit.  The TotalMemoryLimit is a harder limit, but it can also be exceeded when necessary.   One way that I look at these two options is the LowMemoryLimit is the memory limit during non processing times, and the TotalMemoryLimit is the limit when SSAS processing jobs are active.  However, as I mentioned above, the more common concern is that SSAS isn't using the available memory, rather than it is using too much.

DrillThroughMaxRows
I haven't ever changed this property.  Since 2005, SSAS hasn't offered a real drill-through.  This is really a drill down option.  If you want to support drill through see my blog http://richardlees.blogspot.com/2009/01/using-drill-through-in-analysis.html

Port
If you want to change the port that SSAS listens on (or you are running multiple instances) this is where you specify the port number.  By default this value is 0, which denotes 2383.

ServerTimeout
This is the limit for SSAS queries in seconds.  After this time (default 3600 or 1 hour) mdx queries will timeout.  You may have reason to increase or decrease this value.

Wednesday, December 28, 2011

Remember What's Important

Happy New Year to everybody.

Perhaps, like me, you enjoy your work in the Microsoft software space.  That's great, you are very lucky if you enjoy your work.  But remember what's really important in this life.  Your family and friends.

So, don't work too hard.  Eat well and keep exercising.  It helps if you find a sport that you really enjoy.  At the moment, that is unicycling for me.

See you all in the New Year.

Sunday, December 11, 2011

My SSAS databases are corrupted

Did you happen to have a unplanned shutdown?  This is quite likely to be the cause of the failure.  This is unlike SQL Server, which has a transaction log and will always recover databases on restart.
As I have mentioned earlier, SQL Server Analysis Services exploits the Windows File Cache to help minimise IO, which might lead you to suspect the Windows cache.  However SSAS uses the FlushFileBuffers API to minimise the possibility of having dirty pages in the Windows File Cache.  However, this does not protect SSAS from corruption on an unplanned shutdown.
So, if you really want to avoid corrupted SSAS databases on power failure, you should ensure your SSAS servers are supported by UPS.
Most production SSAS servers have UPS.  But, if you don't have UPS, like my demonstration site (http://RichardLees.com.au/Sites/Demonstrations) and, like my site, the cubes are continually updating, when the power fails, SSAS databases have a good chance of being corrupted.  The only solution to this, that I know of, is to restore the affected (typically all) SSAS databases.  I find the quickest way to do this is to stop SSAS, empty the SSAS data directory, start SSAS and recover all the databases.  One thing you will also need to do is add back any userids that had SSAS administrator privileges as they were held in a file on the SSAS data directory.

Tuesday, December 6, 2011

Windows Server Performance Tuning

I just came across this very good whitepaper on Windows 2008 Server performance tuning.  There is will be a lot of information that you already know, but you will probably learn some new things from this document.  There is quite a lot of information on storage, networking, processor utilisation and virtualisation.  The virtualisation section was most informative for me.
The document is in Word format http://msdn.microsoft.com/en-us/windows/hardware/gg463394

Monday, December 5, 2011

Why doesn't SSAS cache the entire cube?

Would you expect a 36GB server dedicated to SSAS to eventually cache an entire 11GB cube in memory?  If so, you would be wrong.  SSAS does not necessarily keep aggregations in memory, even though memory used is less than Memory\LowMemoryLimit.

I have seen this effect on many production servers but never quite believed it.  There is much more RAM than cube size, yet SSAS is continually losing data from cache and asking for physical IO.  Actually, SSAS relies on the Windows File Cache to reduce much of its IO.  However, the Windows File Cache does not cache the entire cube either.

To demonstrate this effect, I have set up a 36GB server, dedicated to SSAS, and set up an 11GB MOLAP cube.  Then, over a period of 19 hours, I have client tasks querying the cube in a semi random fashion.  The MOLAP cube was not updated over the period.  Over the 19 hours, if SSAS or Windows File Cache were effective, there wouldn't be much more than 11GB of read IO on the SSAS data drive over the 19 hours.  However, the system reached an equilibrium, where it was requesting over 20MB of read IO/second.  If you add this up, the system read 1.3 terabytes of data from the 11GB cube (20MB*60*60*19).  Obviously, the system was not effectively caching the cube.  This is very different from SQL Server, which would quite happily keep an 11GB database in memory, effectively eliminating further read IO.

My server was running Windows 2008, SQL Server 2008 R2 with default configuration settings.  The SSAS database was located on the d: drive.  Nothing else is located on the d: drive
Here is a Perfmon chart with some of the key counters during my test.
SSAS (msmdsrv.exe) did not use more than 12GB of memory during the test, which is well below the LowMemoryLimit 65% (which is about 23GB).

During the 19 hours, SSAS requested, on average, 102MB/second of read IO.  The Windows File Cache was able to satisfy about 80% of these read requests, so the physical disk bytes/second was only 20MB/second. 

The Perfmon chart above shows some of the key counters.  Notice
  • The Process (msmdsrv) Bytes/second y axis scale is in MB, often going off scale at 100MB/sec. These are not necessarily physical IO, as the Windows cache will satisfy many.
  • The PhysicalDisk Bytes/second is averaging 20.1MB/second for the 19 hours
  • The Windows Cache Bytes is sitting between 0.6GB and 1.1 GB.
  • The red line (Cache Copy Reads/sec) is the IO requested of the cache.  These are IO that the Windows cache has managed to satisfy by finding the page in its cache.
  • See how the Cache Copy Reads/sec is satisfying IO requested of the msmdsrv process, reducing the PhysicalDisk Bytes/sec.  Not so easy to see, as the reads are reads/sec, while the PhysicalDisk and Process are in bytes/sec, but you can see it.
Now that I understand this effect, I am much more appreciative of the Windows Cache.  This is very different from a SQL Server system, where the Windows Cache has little to do as SQL disables the Windows File Cache for its database files.

However, you can clearly see that SSAS (or the system) is performing much more IO than would be necessary if the entire 11GB cube was cached in memory.
There are a couple of takeaways from this exercise
  1. Be thankful that we have a Windows Cache, and consider configuring it for SSAS.
  2. Try and put SSAS databases on very fast (ideally solid state disks) storage devices.  If you can't avoid the IO, at least make it fast.
  3. Partition large cubes.  This helps reduce logical IO requests and to the extent it bunches the popular data together it will probably help the Windows Cache hold the hot data more effectively.
  4. Hopefully, in a future version, either SSAS or the Windows File Cache will cache more of the cube.
By the way, this effect is not always noticeable.  If your cube is less than 2-3GB, then it is likely to be well cached between SSAS and Windows File Cache.  Also, if you cube is much larger than the memory available to SSAS, then you would expect to see continual IO, and it is likely to be quite well optimised.  However, when you have a 64 bit server with a cube that is larger than 3GB but is comfortably less than the server memory, you might be surprised to see the volume of continual IO.

Sunday, November 27, 2011

Causing a Deadlock in SQL Server

SQL Server has a background process that is continually looking for deadlock chains.  If it finds a deadlock chain, it will rollback the transaction that has done the least amount of work, which should be the fastest to rollback.  That's unless one of the transactions has volunteered as a deadlock victim  by setting their deadlock_priority to low.
If you want to cause a deadlock, it is quite easy.  Most DBAs will be able to do it, and it can be useful in determining what profiler and logging information is available in the event of an unplanned deadlock.

Here are some instructions to cause a deadlock between 3 transactions.  Most deadlocks are between 2 transactions, but a deadlock chain can have any number of transactions.  By the way, there are many ways to set up a deadlock, the only common characteristic is that there are a chain of processes that are waiting on each other for locks.  So in a 2 process chain, Process A is waiting on a resource held by Process B and Process B is waiting on a resource held by Process A.  As you can imagine, the only solution is for one of the transactions to be rolled back.  They can't both go forward.

In my experience, a common class of deadlocks is the one caused by update (update, insert or delete) transactions that update so many rows that the lock must be escalated.  For example, escalate several page locks to a table lock.  When there are two transactions like this starting about the same time, they both start updating, and taking locks, until one of them escalates to a table lock.  At this moment that transaction waits on the other.  But if the other is also destined to escalate to a table lock on the same table, there will be a deadlock.

Now, to cause a deadlock between 3 tasks follow these instructions.
You might like to start SQL Profiler and enable a trace for Lock:Deadlock graph, Lock:Deadlock Chain and Lock:Deadlock, which will give you considerable information on the deadlock chain participants and resource locks.
Create 3 tables
    • create table t1 (c1 int, c2 varchar(50))
    • create table t2 (c1 int, c2 varchar(50))
    • create table t3 (c1 int, c2 varchar(50))
Create 3 query windows in SQL Server Enterprise Manager.
In the first query window execute
begin tran
insert into t1 select 1, 'abc'

In the second query window execute
begin tran
insert into t2 select 2, 'xyz'
Select * from t1


The second query window will be waiting on the first query.
In the third query window execute
begin tran
insert into t3 select 3, 'mno'
Select * from t2


The third query window will be waiting on the second query.  At this point in time there is no deadlock.  We just have a locking chain, with query 1 at the head.  You can see the locking chain if you execute.
sp_who2

In the first query window execute
--begin tran
--insert into t1 select 1, 'abc'
Select * from t3

Now, query 1 will be waiting on query 3, which is waiting on query 2, which is waiting on query 1.  We have a deadlock.  You should notice that within a couple of seconds, one of the queries is cancelled and the transaction rolled back.

Msg 1205, Level 13, State 45, Line 4
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.


Now, have a look at the Profiler output, and you should see your deadlock chart with the three processes and the locked resources.  The transaction that was selected as the victim, and rolled back, has a cross through it.  If you hover over the transactions, you will see the last batch to be submitted.  Note, the last batch is not necessarily the entire transaction.  There may have been an explicit BEGIN TRAN and many statements within the transaction that aren't included in the chart.  If you manage to capture a blocking transaction in action, you can see the locks it has (and ones its waiting on) with the sp_lock command. 

Remember to COMMIT or ROLLBACK the two queries that weren't rolled back.  There is nothing worse that a query holding locks while the user has gone off to do something else.


Tuesday, November 22, 2011

Fix Query Performance by Creating Statistics

SQL Server's optimiser will automatically create statistics on individual columns that are used in predicates.  See picture of database properties to the right.  However, it will not create statistics on column combinations unless they are composite columns in an index.  This is where you might be able to help the optimiser by creating statistics on column combinations.  This technique is most appropriate where the columns in your predicates have a relationship.  For example, my InternetLogtable (1.2 Million rows) contains a row for each internet request.  It has columns ClientAgentId and ClientHostId, where ClientAgent is the name of the browser, and ClientHost is the IP address of the client machine.  As you can imagine, most requests from one ClientHost will tend to have the same ClientAgent.  However, SQL Server's optimiser does not know this.  It has statistics on ClientAgentId and ClientHostID so if when it determines the cost of the following queries it can quite accurately determine how many rows will be returned.  For the first query it estimates 332 rows will be returned.  For the second query it estimates 110 rows will be returned.  In both cases, the estimate is very accurate.  (There do happen to be indexes on each of the individual predicate columns, although even if there weren't indexes, it would automatically create statistics and get similar estimates.) 
select * from internetlogtable with (nolock)
       where ClientAgentId=72383
select * from internetlogtable with (nolock)
      where ClientHostID=417408
However if we combine both predicates, as in the following query

select * from internetlogtable with (nolock)
where ClientAgentId=72383
    and ClientHostID=417408
The optimiser assumes that ClientAgentId and ClientHostId are independent and thus there will be very few, if any, records satisfying both predicates.  Essentially it is guessing that the number of rows returned will be 332/1200000 * 110/1200000.  It rounds the number up to 1.   Here is the query plan without statistics.  Notice how the Estimated Number of Rows is 1.  Actually, when it says 1, it typically is estimating <1 records, but it always rounds up to 1.

Now, what would happen if we create statistics on the two predicate columns?
create statistics StatsAgentIdHostId
  on internetlogtable (ClientHostID,ClientAgentID)

SQL Server's optimiser now has data on the combinations of ClientHostId and ClientAgentId and finds that this particular combination will return about 110 rows.  Here is the query plan.  Notice that the Estimated Number of Rows is now 110.96!  This is much more accurate.

In the query example I have used here, the query plans are different, although not  dramatically different.  However, if this table was joined with a few others, some of which very large, the query plan can change dramatically and make the difference between a sub second query and a several minute query.

Understanding these subtleties of the SQL Server query optimiser will help you tune queries in an efficient manner.  If you were not aware of this technique to create bespoke statistics, you might solve this issue with a new index on the two columns, which would solve this query performance but at a great cost to overall performance.

By the way, here is an illustration of what statistics SQL Server keeps.  There is a lot of information.  For example, there are even as many key ranges as can fit in an 8KB page with the number of distinct keys, equal keys and rows in the range.  Notice the All density column in the second set of statistics.  It is telling us that a single value for ClientHostId, will on average return 0.0001507841 of the total rows.  It also tells us that ClientHostId and ClientAgentId will, on average return 0.0001226091 of the rows.  Not much less!  Ie, if you have the ClientHostId, the ClientAgentId does not filter the rows much more.

I encourage you to get to know the SQL optimiser better.  The more you know about the optimiser, the more you will be able to tune your queries.  And the way to get to know the optimiser is through the query show plan.



Saturday, September 17, 2011

Parallelised Queries are Selfish Queries

SQL Server Parallel Query Processing and MAXDOP

I am prompted into writing this post after reading the msdn article Understanding and Controlling Parallel Query Processing in SQL ServerIt is a very informative article.  I cannot disagree with anything in the article, but there is an elephant in the room.  The article dissregards IO performance bottlenecks.  It actually states that it will disregard IO performance.  "We assume that your I/O system is configured properly ..... so that the I/O system is not the primary performance limiter."  I am going to discuss IO performance with MAXDOP, because I believe it is often the main performance limiter.

As a bit of background, SQL Server has a parallel query processing feature that is enabled by default.  This permits SQL Optimiser to choose a plan that involves breaking one query into sub tasks that can run in parallel.  This can involve executing, essentially, the same plan, but in parallel subtasks, or it can involve a completely different plan and again breaking the query into subtasks that can run in parallel.  Really quite an amazing piece of technology. When it is working well for a query, it is interesting to see figures like cpu=10 seconds, elapsed=1 second!

In many of the very large database environments where I work, IO is the most critical resource.  Perhaps this is my misfortune, but I suggest that IO performance is often a performance limiter in large database systems, and probably will be so until SSD (solid state disks) are more commonly employed.  The problem with parallel query plans is that one query can simultaneously request multiple table scans, selfishly requesting more resources in parallel in the hope of getting a faster response.  Just one table scan from one query will tend to have parallel IO requests, as multiple SQL Server read engines are given multiple page read ahead requests.  So, even with a single non parallelised SQL query, there will already be some parallel IO requests.  Run several non parallelised queries at the same time and there are many more parallel IO requests than there are queries.  That is all before Parallel Query Processing kicks in!

Let's take a very simple example, where the SQL query has a predicate on a non indexed column of a very large table (the table is larger than SQL cache), therefore requiring a sequential scan of the table.  Without Parallel Query Processing, SQL Server will sequentially scan the very large table, using several read engines working in parallel to get multiple page ranges.  Note, there is already parallel IO requests with a single non parallelised SQL query.  Now if we enable Parallel Query, SQL can break the scan into several sub tasks, each of which will ask for range scans of the table.  Each of these scans can busy up multiple read engines, which can lead to IO throughput saturation.  That is the parallelised query will result in scanning the same table, but it will flood the IO system with many more parallel requests that can end up in a long queue.  A good way to think about it is to imagine an old fashioned bank branch with 30 tellers and everyone queues up for the next available teller.  That would be non parallelised.  Now imagine that someone comes in with bags of money that requires lots of counting and when he/she gets to the front of the queue, all the tellers attend to counting his/her money together (in parallel).  This might be good for the selfish individual, they get served faster, but it is not good for the performance of the branch, as now all customers will be waiting for the selfish customer to finish.  For this reason, I like to think of parallelised queries as selfish, or unsocial.  They want to take all the resources, even though this is not an efficient use of the overall resources and others are negatively impacted.  This is fine in a system with very few (perhaps one) customer.  Note, there is always an overhead in parallelising a query, even if it is using essentially the same plan as a non parallelised query.  This is because there is overhead in distributing and regathering the sub tasks.

There is an argument that data warehouses have very few users so that Parallel Query is an ideal performance option.  However, I see modern data warehouses with large numbers of users simultaneously submitting queries (dashboards, web pages, reports etc), and you typically don't want one user to selfishly hog critical resources such as IO.
I might sound like I am against Parallel Query.  That isn't true.  It is a fantastic feature.  However, I would encourage DBAs to consider setting Max Degree of Parallelism to 1.  Note; that is not disabling Parallel Query, it is just the default.  So, by default, queries will not be selfish.  If you have a particular ETL task or report query that you want to parallelise, you can simple add OPTION (MAXDOP n) to the query, where n is the number of potential sub tasks running in parallel.  So you are not turning it off, you are just ensuring that queries aren't selfish by default.  Also, just to be clear, with Max Degree of Parallelism set to 1, your system can still have many SQL queries running in parallel (with each other) it is only that individual queries will not be broken into parallel running subtasks.

The reason I configured Max Degree of Parallelism to 1 the first few times was because I was working in environments with a wayward parallelised queries.  SQL Optimiser chose a very very poor plan thinking that parallelisation would make this plan fast.  The same queries with Max Degree of Parallelism =1 (or MAXDOP 1) the optimiser chose a very different, and efficient plan.  When I reduced the "Max Degree of Parallelism" to 1, these queries, and the whole system ran more efficiently.

An annoying limitation of MAXDOP is that you are unable to set MAXDOP for queries that SQL Server Analysis Services uses when processing fact tables.  You can only provide SSAS with a table or view name, and it constructs the query.  There is no way to inject a MAXDOP option.  This is another reason I like to set Max Degree of Parallelism = 1, so that the default setting is non parallelised.

One word of warning.  Changing Max Degree of Parallelism to 1, can (in fact by design it should) make some queries run slower.  It can also change dramatically the query plan (as I discussed above) and while this will usually result in more efficient queries, it can also involve a poorer performing plan.  In my experience the reverse is more likely (when there are considerably different plans between parallelised and non parallelised, the non parallelised plan is more often the more efficient.  Consequently, it is less risky to configure Max Degree of Parallelism =1 while developing a solution (and using MAXDOP n for individual queries where appropriate) than to change a production environment.

The effect of Max Degree of Parallelism is not just on the individual query.  If you want to know the impact of Max Degree of Parallelism, you should monitor vital statistics (cpu, memory, IO), get average query response times (for common queries) and compare them with before and after  configuration changes.  Another, albeit small, cost of Max Degree of Parallelism<>1 is that each parallelised query will now have two plans (it always keeps a non parallel version for when cpu is high) so that more pressure is put on the plan cache (it is bigger, or it holds fewer plans).  Note, SQL Server is great in that you can change the Max Degree of Parallelism option dynamically.  No SQL Server restart required.  You might even consider setting Max Degree of Parallelism =1 for the online day and resetting Max Degree of Parallelism =0 for the overnight ETL.

A closely related issue is Resource Governor, which also has a MAXDOP option that looks like it can override the MAXDOP option in the query.  However, as I understand it, the Resource Governor does not work like MAXDOP in a query or sp_configure.  If you set MAXDOP to 1 in the Resource Governor, the underlying queries can still chose parallel plans (no change there).  What the Resource Governor does is to restrict the query to only one thread!   This is almost the worst of both worlds, as the optimiser can choose the selfish (potentially faster) parallel plan, but the Resource Governor will ensure that the parallel tasks are serialised with just one thread.

Sunday, September 11, 2011

Why are there less data in the cube than the fact table?

Sereral customer have asked me why they have less data in their cube than they have in the fact table.  Essentially, when they view the cube Measures.NetSales does not equal Sum(NetSales) from the fact table.  So here are the some possible causes, that I would investigate in such circumstances.
  1. The measure datatype cannot hold the aggregated number.  For example, the fact table might have an integer datatype for the SalesQuantity column, which might be fine for the SQL table.  However, by default, SSAS will give the SalesQuantity measure the same datatype, but now it must be able to contain the sum(SalesQuantity).  If sum(SalesQuantity) is greater than 2^31 SSAS will not display any errors, and the Measures.SalesQuantity will show numbers that appear to wrap around.  So the numbers are much smaller than expected (almost random) at an aggregated level.  The solution is simple, just change the datatype in the cube to a much larger number, for example bigint, which can be up to 2^63.  One of the good things about using bigint over int, is that the cube is almost exactly the same size, as SSAS will compress the leading zeroes in binary numbers very effectively.  If you need numbers larger than 2^63, you might need to use a double datatype, which, although potentially much larger is an imprecise number.  Also, a double datatype measure, will typically make the cube significantly larger than a binary measure containing the same information.
  2. There are missing members on one or more dimensions and missing members are not enabled.  The missing member facts will not be included in the cube.   I personally believe the cube should contain all the facts in the fact able.  Even if there is a missing member, the data should still be available.  For example, if there are sales for a ProductId that does not exist, typically, you will still want to know about the sales even though you cannot say which product it was for.  I prefer to enable missing members, and just hide them if you must.  It is also a good practice to regularly check for missing members, either in the DW or in the cube (through missing members).
  3. The third situation where the cube numbers are light is an unusual one.  It is where the fact table (actually a view) returns an arithmetic error such as divide by zero.  Most times SSAS will sympathetically return the SQL error, and you will know there is a problem during cube processing.  However, sometimes SSAS does not sympathetically return the SQL error, rather, it returns a successful cube process with the rows up until the divide by zero.  There are two ways of locating this error.  Firstly, you could simply execute the same SQL query in SQLEM window and wait for the full resultset or the arithmetic error to return.  This is not practical in many circumstances due to the number of rows in the resultset.  I guess you could get SSIS to help you, but an easier way is to use SQL Profiler and trace Error events.
  4. You have a referenced dimension and the INNER JOIN clause that SSAS is generating for SQL Server is removing fact records that don't join with the intermediate dimension.  To identify this issue, you need to execute the SQL generated by SSAS, and compare its counts with the raw fact table.
  5. Is your cube (measure group) processed incrementally?   If so, there might be a logic issue with the way you are incrementally processing.  To verify, you could reprocess the entire measure group and see if the numbers now equal the fact table.
  6. Is the data in your fact table changing?  Of course, if the fact table is changing, the MOLAP cube won't know about it.  This can be related to incremental processing, where the incremental processing logic is good, but historic data records are changing.  Strictly speaking you cannot incrementally process updates, only new records.  If the data has changed, the cube (partition) will require processing.
  7. Have you a default member that is not the [All] mmeber, or do you have a dimension without an [All] member.  In these cases, you would need to take this into account when comparing Measures.NetSales with sum(NetSales).
I hope this help you identify reconciliation issues between the cube and fact table.  A couple of suggestions when reconciling such issues
  1. Use physical measures in the cube.  Even if they are hidden, you can still query the physical measures.
  2. Ideally, start off by reconciling record counts.  If you can, put a record count in the cube.
  3. When you are running SQL queries against the fact table, use the SQL statement that is generated by SSAS, since it might be referencing a different table/view than you expected.  Also, as in point 4 above, it might be performing an INNER JOIN with an intermediate dimension that has missing members.

Wednesday, August 10, 2011

Improving MDX join performance

Analysis Services will tend to instantiate sets created from joins. Some sets can be extremely large and will actually fail with insufficient resources.
Server: The operation has been cancelled.


The solution is to write the MDX query in such a way that the intermediate sets are smaller. Here is an example of a problematic query that is asking for the top 10 hour/City/Date combinations. The output is simply 10 rows, but AS must create a set with hundreds of Millions of members to get the top 10.

select
[Measures].[Bytes Total] on Columns,
topcount(
[Client Host].[Client Geography].[City]
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day]
,10,[Measures].[Bytes Total])
on Rows
from EasternMining


Here is a way to improve performance using the generate() function, another one of my favourite MDX functions.

select
[Measures].[Bytes Total] on Columns,
topcount(
generate([Client Host].[Client Geography].[City]
,{[Client Host].[Client Geography].currentmember}
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day])
,10,[Measures].[Bytes Total])
on Rows
from EasternMining


The generate statement is breaking the query into a couple of steps, so that the largest set will probably be the Days*Hours, from which it is only keeping the top 10 for each city. The outside set, for every city, will only have 10 rows per city. The query is logically equivalent to the first query, meaning it will return the same results.

We could take this logic further with a second generate statement so the inside query only has 1o rows per city day combination.

select
[Measures].[Bytes Total] on Columns,
topcount(
generate([Client Host].[Client Geography].[City]
,topcount(
generate([HoursOfDay].[HoursOfDay].[Hour Of Day]
,topcount({[HoursOfDay].[HoursOfDay].currentmember}
*{[Client Host].[Client Geography].
currentmember}
*[Date].[Year Month Day].[Day]
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total])
on Rows
from EasternMining

We can improve this query further by requesting nonempty() sets

select
[Measures].[Bytes Total] on Columns,
topcount(
generate(nonempty([Client Host].[Client Geography].[City],[Measures].[Bytes Total])
,topcount(
generate(nonempty([HoursOfDay].[HoursOfDay].[Hour Of Day],[Measures].[Bytes Total])
,topcount({[HoursOfDay].[HoursOfDay].currentmember}
*{[Client Host].[Client Geography].currentmember}
*nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total])
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total]))
,10,[Measures].[Bytes Total])
on Rows
from EasternMining

The NonEmpty() function is a very simple way to reduce the set size, and of course it can be done without using the generate() function. So going back to the original query, we might have found that the following change was sufficient to avoid resource errors.


select
[Measures].[Bytes Total] on Columns,
topcount(
nonempty(
[Client Host].[Client Geography].[City]
*[HoursOfDay].[HoursOfDay].[Hour Of Day]
*[Date].[Year Month Day].[Day]
,[Measures].[Bytes Total])
,10,[Measures].[Bytes Total])
on Rows
from EasternMining


However, I wanted to show the generate() function first, as it can improve the query in a way that the nonempty() cannot.

I should add that generate() and nonempty() functions are not panaceas for MDX query performance. The more you know MDX and the way SSAS executes the query, the more you will have in your toolkit to improve query performance.

Saturday, July 16, 2011

MDX Root() function doesn't support multiselect

One of my other favourite MDX functions is Root(). It's a really easy way to get all the filters off a dimension. I often use it in calculations defined in the cube.

However the Root() function does not support multi-select and you might get this error message.

#Error Query (6, 30) The MDX function ROOT failed because the coordinate for the 'Region' attribute contains a set.
Fortunately, there is a workaround. Unfortunately, it means not using the Root() function and going back to the .[All] member of all the attributes (or all the ones being used in the query) of the dimension.
For example, the following query will hit the error.
WITH
MEMBER [Client Host].Region.[AfricaAsia] AS
AGGREGATE({[Client Host].Region.&[Africa]
,[Client Host].Region.&[Asia]})
MEMBER Measures.[HitsPercent] as
[Measures].[Request Count]/(ROOT([Client Host]), [Measures].[Request Count])
SELECT
{Measures.[HitsPercent]}
ON COLUMNS
,[Date].[Month].[Month]
ON ROWS
FROM [EasternMining]
WHERE
([Client Host].Region.[AfricaAsia])


The fix is simply to change Root([Client Host]) to [Client Host].Region.[All]. ie.
WITH
MEMBER [Client Host].Region.[AfricaAsia] AS
AGGREGATE({[Client Host].Region.&[Africa]
,[Client Host].Region.&[Asia]})
MEMBER Measures.[HitsPercent] as
// [Measures].[Request Count]/(ROOT([Client Host]), [Measures].[Request Count])
[Measures].[Request Count]/([Client Host].Region.[All], [Measures].[Request Count])
,format_string="#,#0.00%"
,non_empty_behavior="measures.[Request Count]"
SELECT
{Measures.[HitsPercent]}
ON COLUMNS
,[Date].[Month].[Month]
ON ROWS
FROM [EasternMining]
WHERE
([Client Host].Region.[AfricaAsia])


Note, I have only put the .[All] member in for the Region attribute, whereas you really need to add in a .[All] member for all the dimension attributes to be equivalent to the Root() function.

By the way, I don't know why the Root() function doesn't support multi-select. One would imagine that it should.

Because of this limitation, I might be using the Root() function less liberally in calculations stored in the cube.

Saturday, June 25, 2011

Single Click SQL Management Studio


Here is a simple way to avoid clicking on the inevitable SQL prompt when starting SQL Server Management studio.
On the shortcut simple add -S -d -E

For eample, on my PC, the shortcut is
"C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\Ssms.exe" -S Manly -d master -E


Now when you click on SQL Server Management studio, it will connect automatically to the server/dbname. No more prompts to the server/dbname that you connect to every day.

Friday, June 10, 2011

linkmember, my favourite MDX function

Linkmember is one of my favourite MDX functions. It tends to be useful whenever you have role playing dimensions. For example, when you have multiple (role playing date dimensions or role playing geography dimensions) such as OrderDate and DespatchDate, or CustomerStore and PurchaseStore.

You might have a report that displays the order date on rows, with sales on columns. You would also like add a column to show sales, but sliced by delivery date. So what you want to do, is put the same, sales, measure on columns, but have the "delivery date" dimension on rows.

This is exactly what the linkmember does. Simply create a calc member that uses linkmember to switch the OrderDate on rows to DeliveryDate. Here is a simple example.

with member
Measures.AmountDelivered as
(Measures.Amount,
linkmember(OrderDate.CalendarYWD,DeliveryDate.CalendarYWD),

root(OrderDate))
select
{Measures.Amount,
Measures.AmountDelivered}
on Columns,
non empty
OrderDate.CalendarYWD.Day on Rows
from MyCube


One thing that you need to remember is that the linkmember will effectively get the member with the same key in another dimension. However, it will not change the original member. So you probably want to also override the original member with a root() or .[All].

Another way of achieving the same functionality as LinkMember is to extract the key and use StrToMember() to construct the appropriate member of the other dimension. However, LinkMember is much more efficient, and is the preferred tool.

Sunday, June 5, 2011

Tech Ed 2011 Sessions available


If you didn't afford the time or money to go to Tech Ed, don't worry, you can still view the sessions online, and in your own time.



Wednesday, May 4, 2011

How many cubes

I am often surprised how many cubes some BI developers create. It would appear that each time they have a requirement for a piece of information, they will create a new cube.  Sometimes this is amplified because they are using denormalised summary tables, and creating new ones when they need a bit more dimensionality or a new measure, leading to a plethora of overlapping data.

Since SQL Server 2005, it is possible to put dozens of fact tables (cubes in the old days) into one cube with separate measure groups for each fact table. This is a great feature. For example, Sales and Budget have slightly different dimensionality as Sales is by cashier and hour of day, whereas, budget does not have cashier and time granularity is only available down to week.  Having these in the same cube is great, since it makes querying of sales against budget (by all the common dimensions) very easy. Also, if a user is only interested in one measure group, say sales, they can ask the browser that measure group, and they will on see the associated measures and dimensions.  In the same way they might browse a separate cube.

You might ask, are there any performance implications of having one super cube? 
  • In terms of cube size and processing overhead, the design alternatives are comparable. 
  • MDX queries can be broken down to FE (formula engine) and SE (storage engine) components.   The SE component of a query isn't materially affected by unrelated dimensions, since the storage is comparable.  However, the FE component of a query is affected by the unrelated dimensions as it is works within the entire cube space (larger with more dimensions). 
There is a LookupCube function in MDX that allows you to get the value from another cube.  This might be useful in a design that exploits multiple cubes.  But, don't see this as a panacea for multiple cubes.  It has its own performance overhead and is not as functional as having the measures in one cube.

There might also be operational considerations.  The design alternative I have discussed is having multiple cubes in the same database, but if the cubes really are unrelated, you might consider having separate SSAS databases for them.  This would lead to more operational options including, parallel processing, and even hosting databases on separate servers.

So, as a cube designer, you must weigh up the performance, functional and maintenance costs.  As an over simplified summary, on the side of mini cubes is the minimisation of FE performance costs, while on side of super cubes are increased user functionality.

For another discussion on this topic see Chris Webb's blog

MDX for Top n by group

Here is a little MDX that will get the top n list for each of another set of members.

For example, for Each region, I might want to get a list of the top images in that region.

The query is very simple, just a matter of employing the generate function.

select [Measures].[Hits] on 0,
non empty
generate([Client Host].[Client Geography].[Region],
([Client Host].[Client Geography],
topcount(order([Target].[Resource Hierarchy].[Resource Type].&[.jpg].children,[Measures].[Hits],bdesc),10)))
on 1
from EasternMining


You can plug in the grouping, ordering that you want. Even nested top n lists.






Sunday, March 13, 2011

(re) Clustering a Large Online Table

One of the most important performance controls you have on a relational database is the clustering sequence of your large tables. Unfortunately, it is often not considered until there is a performance issue. Clustered indexes are important on their own, but they are especially significant when you are joining multiple large tables. There are often orders of magnitude performance gains by clustering the large tables in the same sequence. However if you have a large table that needs to be online 24 hours a day, seven days a week, it can be difficult to make such a big change.

I have achieved this a few times now, on tables that are only inserted into (ie transaction tables). The only requirements are that the table is inserted into only (no updates) and that the table has a column that is inserted into in ascending sequence only (eg transactionid, datetime etc). Essentially, this is done by creating a new table, which has all the live table's records inserted into it using bulk load (SSIS package task). This makes it very efficient, but it will take a long time on a large table, so the new table won't be up to date. The trick is to keep low and high water marks, then you can open a transaction to copy in the last few records and rename the tables within the same transaction. In this way, at the point of committing, both tables will have the same content, and any active transactions will, at worst, wait a couple of seconds for the last insert and renaming to occur.

Here are the steps involved in the SSIS package
  1. Create the new table with the clustered (and non clustered) indexes you want.
  2. Create a control table to maintain high-low watermarks. Set the high water mark to the maximum ascending key value in the live table.
  3. Perform a bulk load (SSIS Data Flow task) for all the live table rows up to and including the high water mark. This task might take several hours.
  4. Set the low water mark to the high water mark, then re set the high water mark to the new maximum ascending key value in the live table.
  5. Perform the bulk load again. This time it will just be the rows that have been inserted during our earlier bulk load. After this task, the new table will be very close, but not quite, up to date to the live table.
  6. Within the scope of a transaction lock the live table, insert records above the high water mark, rename the tables (and PKs) and commit the transaction. The new table is now live.

A few notes on the process

  • You will need to have sufficient space in your database to accommodate two copies of the table.
  • You will need to have sufficient space in your database logs to hold log records from the start of the Main Bulk Load to the end of the Main Bulk Load. So, if the bulk load takes 4 hours, their will be a transaction open for 4 hours and a log truncation won't truncate anything within the 4 hours.
  • You will need to have sufficient space in tempdb to perform the sorts for the clustered index and nonclustered indexes.
  • If you have sufficient space in the database files and logs, you should set the maximum commit size on the destination to 0 (zero). Having other values will tend to slow down the load process (although less disk space is required) since the indexes will be maintained at each commit.
  • It is quite likely that your large tables are partitioned and compressed. This doesn't need to make any difference to the reclustering process above. However, if the table is already partitioned, you could improve performance by looping through each partition. If your table isn't already partitioned, this process could be the opportunity to partition your table.
  • Don't be tempted to use NOLOCK when reading the live table.
  • Test your package before you run it in production.

There is no special code in the package I have created. The most significant task is the very last one, which is performed within the scope of a transaction. To ensure there is no opportunity of a deadlock, the task first asks for an X lock on the live table. Here is some sample SQL from each of the tasks.

Create Control table

create table ReclusteringControl( ReclusteringControlID int default 1, LowWater bigint not null default 0, HighWater bigint not null default 0, CONSTRAINT [PK_ReclusteringControl] PRIMARY KEY NONCLUSTERED (ReclusteringControlID ASC))

Insert into ReclusteringControl values(1,0,0)

update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)

Input for Main Bulk Load

select * from MyLargeTablewhere rid<=(select HighWater from ReclusteringControl)

Set New HighWatermark

update ReclusteringControl set LowWater=HighWater
update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)

Second Bulk Load (input statement)

select * from MyLargeTablewhere rid<=(select HighWater from ReclusteringControl) and rid>(select LowWater from ReclusteringControl)

Final INSERT and Rename

--set identity_insert MyLargeTable_new on

--This is required if there is an identity column on the table

GO

begin transaction

update ReclusteringControl

set LowWater=HighWater

select top 0 * from MyLargeTable with (tablockx)

update ReclusteringControl set HighWater=(select max(rid) from MyLargeTable)

insert into MyLargeTable_new --columns need to be explicitly declared if identity cols are inserted

select * from MyLargeTable where rid<=(select HighWater from ReclusteringControl) and rid>(select LowWater from ReclusteringControl);

exec sp_rename MyLargeTable,XXXX_MyLargeTable;--should delete this table

exec sp_rename MyLargeTable_new,MyLargeTable;

exec sp_rename PK_MyLargeTable, PK_MyLargeTable_XXXX

exec sp_rename PK_MyLargeTable_new, PK_MyLargeTable

commit transaction

--set identity_insert MyLargeTable off

Monday, February 14, 2011

Anything is possible with MDX

I was with one of my favourite customers last week and had an impromptu challenge put before me. We had a cube containing the company's sales data and the challenge was to reproduce a report they had created in their legacy system. (I found out later that this report had taken a couple of months to produce.) The requirement was to create a report that would display the last 7 days of a particular outlet and for each day show the sales and transactions for the top hour of that day.
I have often been quoted as saying that if the data is in the cube, one MDX query can answer any question that you might have. So the pressure was on to create the report. Fortunately, they were using ReportPortal software, which is a highly functional thin client tool, and it has good charting capability. The web browser tool allows you to create your own sets and calculated members. So all I really needed to do was create a set for the last 7 days. I will reproduce a similar query using my weblog database, which has comparable dimensions.

Here is a set declaration for the last 7 business days.

tail(nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total]),7)

The calculated members for the top hour etc are simply a matter of taking the top hour and getting the attribute/tuple require. Here is the full query on my weblog database. Note, the good thing about a tool like ReportPortal is that you can declare sets and calculated members and it will still honour filters and hierarchies that you have dragged and dropped on rows/columns. So the only MDX you need to write is for the set and calculated member, the rest of the report is created by dragging and dropping from the palette.

with
set Last7Days as tail(nonempty([Date].[Year Month Day].[Day],[Measures].[Bytes Total]),7)
member Measures.TopHourName as order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0).member_caption
member Measures.TopHourSessions as (order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0),[Measures].[Sessions])
member Measures.TopHourHits as (order([HoursOfDay].[Hour Of Day],[Measures].[Hits],BDESC).item(0),[Measures].[Hits])
select
{Measures.TopHourName,Measures.TopHourSessions,Measures.TopHourHits} on columns,
[Date].[Month].[Month]*Last7Days on rows
from EasternMining
where ([Client Host].[Client Geography].[Region].&[Americas].[UNITED STATES])

Like all good cube browers, ReportPortal will graph your results. In this case, the most appropriate chart was a hybrid column/line chart, with sales as columns on the left hand y axis and transactions as a line on the right hand y axis.

The takeaway from this is that, so long as the cube has the necessary data, a single MDX query can get the answer to any question.



Sunday, January 16, 2011

PowerPivot versus Analysis Services

Chris Webb has done more to demystify Microsoft's PowerPivot story than any other person. Recently, he has put up a "quiz" to help people decide between PowerPivot and SSAS. The quiz is presented in a light hearted fashion, but it contains a lot of Chris' analysis on the strengths and weaknesses of each technology. You can see Chris' quiz on http://cwebbbi.wordpress.com/2010/11/19/powerpivot-vs-ssas-quiz/

If you are really interested in hearing Chris speak on this topic, there is a recording of him talking in Belgium on this topic here http://sqlserverday.be/video/view/?id=2

I found the quiz very interesting. Not the quiz final recommendation (PP vs AS), but the quiz questions and how the answers were interpreted, which exposes Chris' analysis. Here is a copy of Chris' quiz with the questions shaded in green and red interpreting Chris' analysis of the question answer here https://spreadsheets0.google.com/ccc?key=tcyHIY2MtHCkj-wTJUAb3Tg&hl=en_GB#

You can use this copy of the quiz to quickly get an understanding of Chris' analysis of the strengths and weaknesses of PowerPivot.

Sunday, December 12, 2010

Warming the OLAP cache

Analysis Services has a data (and aggregation cache) not dissimilar to a relational database cache. For really good cube performance it is desirable to have most of your cube resident in the cache. Querying the cube will help bring data into the cache, while cube processing will tend to clear the cache. More accurately, partition processing (incremental or full) will clear that entire partition from the cache.
Analysis Services, by default, will freely use 65% of a computer's memory, so data should not be pushed out of the cache unless AS is using 65% of memory. With commodity 64 bit servers, the memory available to Analysis Services should be, at least, in the tens of GBytes.

Here is an example of a query that has been run on a cold cache. Notice how the first Query Subcube has an EventClass of "non-cache data". This means that the storage engine is going to disk since the data/aggregations are not in cache.
You can help warm the cache by executing MDX queries, or you can execute the CREATE CACHE statement. The CREATE CACHE statement looks similar to a regular MDX query, the main difference being a resultset is not returned. For example

CREATE CACHE FOR [EasternMining] AS
[Port].[Ports].&[80]
*{[Measures].Members}
*{[Date].[Year Month Day].[Year].&[2010].[January].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3]}
go
CREATE CACHE FOR [EasternMining] AS
[Port].[Ports].&[80]
*{[Measures].Members }
*[HoursOfDay].[Hour Of Day]
*{[Date].[Year Month Day].[Year].&[2010].[November].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3]}
go
Now let's run the query again and check the Profiler trace. Here is the output
Notice now, all the Query subcubes have an EventClass of "Cache data". This means that AS has found the data for this query in the cache. Also note, the duration for the first subcube is 0ms. This is a great improvement over the original cold cache subcube of 47ms.
Generally, it is a good practice to warm the cache after cube processing. This will help the performance of the first few queries that hit the cube after processing.
Also, as a cube designer, you should be aware of the cache warming effect and design partitions so that you isolate the volatile data as much as possible, so that you minimise cache clearing when processing.
If you want to experiment with cache warming, you will probably want the ability to clear the cache quickly. Here is an example of a cache clearing statement for a cube


IISLog
EasternMining

Here is the MDX query I used in the demonstration above
WITH
member measures.MyCalc as
avg(tail(nonempty([Date].[Year Month Day].[Year].&[2010].[November].[1]
:[Date].[Year Month Day].[Year].&[2010].[December].[3].lag(1)
,[Measures].[Bytes Total]),6)
,[Measures].[Bytes Total])
SELECT {measures.MyCalc} ON COLUMNS ,
{[HoursOfDay].[All HoursOfDay].[00:00-01:59].[00:00-00:59],[HoursOfDay].[All HoursOfDay].[00:00-01:59].[01:00-01:59],
[HoursOfDay].[All HoursOfDay].[02:00-03:59].[02:00-02:59],[HoursOfDay].[All HoursOfDay].[02:00-03:59].[03:00-03:59],
[HoursOfDay].[All HoursOfDay].[04:00-05:59].[04:00-04:59],[HoursOfDay].[All HoursOfDay].[04:00-05:59].[05:00-05:59],
[HoursOfDay].[All HoursOfDay].[06:00-07:59].[06:00-06:59] ,[HoursOfDay].[All HoursOfDay].[06:00-07:59].[07:00-07:59],
[HoursOfDay].[All HoursOfDay].[08:00-09:59].[08:00-08:59],[HoursOfDay].[All HoursOfDay].[08:00-09:59].[09:00-09:59] }
ON ROWS
FROM EasternMining
WHERE ([Port].[Ports].&[80] )

Saturday, October 16, 2010

What's happening on my site?

My internet site has a series of close to real-time dashboards showing, among other things, internet activity on my site. It is primarily meant as a demonstration tool, but it has prooved useful to me in quickly seeing what's happing on my site (and my servers, since there is also a Perfmon dashboard).

Interesting, over the last couple of weeks, I have seen some suspicious looking referrers from Russia. You can probably still see them on the default dashboard on http://RichardLees.com.au/Sites/Demonstrations. See the Top Referrers table on the bottom. At the moment the top referrer includes "buy ultram.narod.ru" at the top, and there are a few more suspicious ones lower down. I am used to seeing Google, Bing or my blogspot compete for the top spot.

It's great with Microsoft PerformancePoint dashboards that you can dynamically drill down the table as an ad hoc cube browsing tool. I have drilled down to see the pages from the referrer, the http command (HEAD), the ip address, http return code etc. See picture. You can do this too, simply right click on the chart and take the direction you want. A tip; click the triangle at the top right of the chart and open in a new window to see the chart/grid on a new page.

Although this activity is suspicious, and might represent a potentially malicious attack, it doesn't appear to have had much success. Or perhaps all it wanted to do was populate my logs.

In any case, I feel that the dashboard has shown the value of processing logs and providing an easy and intuitive dashboard to this information. Everyone who has an internet site should have access to information on the site's activity. Also, many of the off-the-shelf web analytic tools only show you a subset of what is really happening. For example, I have Google Analytics, which are quite useful, although they only show me a subset of the activity, which doesn't include these referrers.


Wouldn't it be nice if Microsoft Windows included an install option that went something like


Would you like incoming http activity processed and kept in an OLAP cube for ad hoc browsing?


For more examples of online, close to real-time OLAP cubes and data mining models see http://RichardLees.com.au/sites/demonstrations

Sunday, October 10, 2010

SSAS Dynamic Security

Imagine the scenario where there are hundreds or thousands of cube users, most of which are only permitted to see "their" data. And there is a dimension with an attribute containing their Windows userid. It would be very tedious to set up a security role for every individual. Not to mention the maintenance in adding and deleting roles. Although this could be somewhat automated with AMO scripting.

A technique I like to employ is to set up just one SSAS role, which essentially contains everyone that might have some access to the cube (integrated security only). So, in the Dimension Data security tab, I restrict the "Allowed member set" to just those members where the member name is equal to the username(). Username() is an mdx function that returns the Windows userid.

The Allowed member set might look something like

{StrToMember("Branch.BranchManager.&[" + UserName() + "]")}

The default member could be

StrToMember("Branch.BranchManager.&[" + UserName() + "]")

You can add other roles, for example, one with read access to the entire cube. So the users that can only see their Branch, would be in the first role (they would only be able to see their own branch) and head office users in the second role would be able to see the entire cube.

This is a very elegant solution. As you can see, it requires no maintenance, as new/old users will automatically be able to see their own data. If the branch manager changes, the old user will have no read access and the new user will have read access to that branch.

There are many ways to extend this security. I have combined it with application tables that list users' access. This can be a many-to-one or many-to many relationship. No problem, just create a dimension (possibly many-to-many) , and grant read access to the dimension in the same way. Also, it might be that you only want part of the username, or that you want/don't want the domain. Again, that is no problem, you can use VBA functions in the MDX expression to get the substring that you want. You could also use this technique to manage a black list (Denied member set) instead of a white list (Allowed member set).
A classic use of this technique would be to support sales persons to see information relating to their customers. This technique might be combined with other restrictions. For example the dynamic security role might let browsers see a subset of their data, such as quantities, but not dollar values. And only for their customers.

Note, there is an overhead in Dimension Data security, but in my experience it is almost undetectable. Of more concern, if you needed to use it, would be the many-to-many dimension. Be wary of performance if your m2m dimension and/or m2m fact table is large.

Also note, you probably want to check the "Enable Visual Totals" check box, so that users can only see the total for members they are allowed to see.

The use of this dynamic security technique can be used in conjunction with conventional roles in the same cube.

Wednesday, August 18, 2010

Write Back cubes are easy with Excel 2010

SQL Server has supported write enabled cubes for over 10 years, but things have just got a whole lot better with Excel 2010 (and SQL Server 2008). You can now browse and update a write enabled cube in Excel without any add-ins and without writing any code.

It's quite easy. Just right click on the partition (in the Partitions tab of the cube editor) and select Writeback settings. You can just take the default settings. Interestingly, the storage mode does not need to be ROLAP, so we get good performance without operational complexity.

Now, simply open Excel 2010, connect to the cube, and enable "What-If Analysis". This is a button in the Tools section (next to PivotChart and OLAP Tools). Now you can write over any of the cells in the cube. The numbers you have changed will have a little purple triangle in the bottom right corner. When you want to write these back simply click the "Publish Changes" button under the "What-If Analysis" button. Excel will "distribute" your updates according to the settings you have asked for.

Excel has a Settings button under "What-If Analysis", which allows you to control how values are spread, when you enter at a higher level than the cube granularity. For example, you could enter a number for the whole of 2010 and it will spread the amount across all periods based on whether you want an even spread and whether you want the updates to be incremented based on the old numbers.
Excel does not natively support the insertion of new dimension members, which would have been a nice feature for some applications.
Operationally, I don't think it is so important to migrate the writeback data to the main partition, since the writeback partition is typically MOLAP. However, it would not be difficult to move the data over in an SSIS package on a scheduled basis.
Of course, you can restrict read and write access rights based on roles, so, for example, the Australian users can only update the Australian numbers.

It now really is easy to set up an updateable cube and make available to users with Excel 2010.

Monday, July 26, 2010

Getting the last non empty value

LastNonEmpty is an aggregation function available in the Enterprise version of SQL Server. However, you can create your own with a little bit of recursive MDX. Essentially, you simply create a calculated member that returns the non empty value, or if empty, it looks in the previous member. It really is very simple, and is an elegant way of writing a last non empty query. To create a LastNonEmpty calculate measure, simple use MDX such as the following

With Member Measures.LastHits as
iif(isempty(Measures.Hits),
([Date].[Year Month Day].prevmember,
Measures.LastHits
),Measures.Hits)

Notice how the calculated member is called LastHits, and the calculation actually refers to LastHits. This is the recursion. It keeps referring back to itself until it gets a day with a non empty value. If there are no non empty values, the recursion will automatically stop at the first day in your dimension and return null.

Naturally, you can add other calculations on top of this recursive calculation. For example, you might want to sum up the month to date figures using this non empty calculation. No problem, here is a MTD calculation using the last non empty calculation above.

With Member Measures.LastHits as
iif(isempty(Measures.Hits),
([Date].[Year Month Day].prevmember,
Measures.LastHits
),Measures.Hits)
Member Measures.LastHitsMTD as
sum(PeriodsToDate([Date].[Year Month Day].[Month],[Date].[Year Month Day]),Measures.LastHits)
select
{Measures.Hits,Measures.LastHits,Measures.LastHitsMTD}
on Columns,
tail([Date].[Year Month Day].[Day],31)
on rows
from
EasternMining


For examples of real-time OLAP queries on continually updating databases see http://RichardLees.com.au/Sites/Demonstrations