Saturday, November 7, 2009

Optimizing Basket Analysis

Basket Analysis is the analysis of sales by basket (or transaction). It can be useful in understanding questions such as
  • What is the total value of sales for transactions that include products x (and y, and z...)?
  • What is the average transaction value for transactions that include products x (and y, and z...)?
  • What proportion of transactions include products x (or y or z)? And what proportion of transactions that include products a (and b, and c...) also include products x (and y, and z...)?
  • How did the attach rate of product x change during the sales promotion series? And was there a lasting effect.
  • etc. etc.
Traditionally basket analysis has been achieved by running SQL queries on very large, highly scalable, databases that dynamically summarize by transaction id. You can imagine that these queries involve huge numbers of transaction and item records and really enjoy massively parallel computers. However, it is now possible to run quite effective basket analysis systems using relatively cheap commodity hardware and Microsoft OLAP cubes. Many people do this by creating a cube with a dimension on the transaction id. If you are considering creating such a solution, I have two main suggestions. This has helped me create basket analysis solutions that consume hundreds of millions of transactions in cubes that support relatively fast ad hoc queries (sub 30 seconds).
  1. Rather than creating a dimension on transaction id, create a dimension on basket id. A basket is defined by the unique combinations of items. So all the transactions that include items m, n, o and p, would be considered the same basket. Essentially, the basket id can be used as a dimension like the transaction id, but the basket it dimension is a fraction of the size. This makes the solution hugely more scalable. A basket id dimension table can be created quite simply by creating a bit string with one bit representing each item. For example an 8000 byte string will support 64000 items. There is a little bit of math to create the bit string. but essentially each ItemID can create an exponential of 2 to get its position in the string. ie
    POWER(convert(bigint,2),ItemID). Of course, a bigint can only support 2 to the power of 64, so you will need to do this by a series of 8 byte strings. This would be very very tedious SQL to write, but I found that it is quite easy to write a little SQL to generate the full SQL. Also note, there is no need to keep the full 8000 bytes, it can be a varbinary to the right most byte with an item. See below.

  2. The second major suggestion I have, is to create a dimension using an identity column from the basket id table (there is no need to have the varbinary(8000) column in the cube, and write calculated measures for the basket analysis. ie. I would resist using the distinct count physical measures in the cube. In my experience, calculated measures (that scan the basket dimension) are actually faster than the physical distinct count. However, if you are not sure, try both, and see which one is faster.

The result is a relatively small cube, with a relatively large basket id dimension. The cube will have two measure groups, one for transactions with transaction count and sale value, while the item measure group will have item sales value and item count measures. Here is the short SQL query that will generate the full query for the basket ids. It is only a few lines of SQL, but it will generate over 1100 lines of SQL that will form the query for calculating the unique basket id. You will want to run this query incrementally, so just put a where clause at the bottom to ensure it is over the high water mark. I would also encourage you to put the query in an SSIS package with an OLEDB call to a procedure to look up the identity column on the basket table, so that can be stored with the transaction. The rest, including building the cube, should be easy.

A complementary application to this basket analysis cube is a data mining model targeted at the relationships between sales items and useful for predicting sales items.

If you are having performance issues with your basket analysis solutions or want help with your application, don't hesitate to ask me to help. This is the type of BI activity I really enjoy.

For real-time demonstrations of OLAP, data mining and related BI technologies see http://EasternMining.com.au/Sites/Demonstrations

Saturday, October 17, 2009

Excel 2007 OLAP PivotTable Tips

I am a great fan of Excel as an ad hoc cube browser, even though it lacks a few features I would like to see. Many people don't appreciate some of the functionality available within Excel. Here are a few of my favourite features.

  1. Right click somewhere in the pivottable and select PivotTable Options. Get familiar with all the options. My favorites are Refresh data when opening the file and disabling row/column totals.

  2. Use conditional formatting within the cube. Excel, will honor these cell formatting options by default.

  3. Use Excel's conditional formatting on top of number formatting. Excel will hold the conditional formatting until you change what's on rows.

  4. When you have a cube with multiple measure groups (don't they all now?) use the Show fields related to: to filter measures and dimensions pertinent to your query.

  5. Use the Design tab to remove subtotals by row or columns. Removing sub totals here, will be preserved when you continue pivoting, whereas removing sub totals from rows/columns will not persist.

  6. Also use the Design tab to insert banded rows (or columns)

  7. Right click on the Row or Column header to request sort by measure columns (ascending or descending) or even some other column value.

  8. Right click on the Row or Column header to request Top 10 list. Note, this is really a Top n list, as you can determine how many rows/columns to display. Particularly useful when you have too many columns (rows) and you just want to show the top n.

  9. Right click on the Row or Column header and use the Show/Hide Fields option to remove hierarchy levels from the display. For example, you may want to show cities, without the higher levels of state, country and region.

  10. Right click on the Row or Column header to display member properties. Particularly useful for properties such as phone number

  11. If there are additional actions set up in the cube (a real drill through, for example) right click on the cell (or member) to trigger an Additional action. I find this particularly useful when you want to drill through to a report that shows all the underlying facts, with all their associated details. It can also be useful to take you to an independent application that takes some information from where you are in the cube. For example, Google search (or Bing.com) for a member name.

  12. Right click on a bunch of rows (or columns) to Group them together, as one member.

  13. Get your cube designer to create useful sets within appropriate dimensions. You can drag a set onto rows/columns. For example, Last 10 weeks, or New Customers etc.

  14. If you have a large cube (or poorly designed one) that is slow to query, then check the Defer Layout Update box on the bottom of the field list. In this way, you can design your report without waiting for the data to display. Just click on the Update button when you are ready.

  15. If your query is taking too long, just click on your Esc button, and the query will be cancelled immediately.

  16. In the Options tab, select Offline OLAP under OLAP Tools, to save a copy (or subset) of your cube locally. Note, this will only be available if you have been granted local copy permission. Can be really useful, if you want to take your laptop away and analyse the data.

  17. For the very advanced user, in the Options tab, select Convert to formulas from the OLAP tools button. This will convert your entire pivottable to individual cell formulas of CUBEMEMBER and CUBEVALUE. From this point you will be able to create a report with full control on what is in every cell of the report. It really is quite powerful, although you lose the flexibility of a pivottable. A hybrid option I like is to keep the filters from the pivottable and have a report using the CUBEMEMBER and CUBEVALUE formulas, which gives you a bit of the best from each.

Of course, there is a lot more functionality within Excel OLAP pivottable than I have highlighted above, but you can see that there is quite a lot of power hidden within the tool.

For online real-time OLAP, data mining, Excel Services, PerformancePoint demonstrations goto http://RichardLees.com.au/Sites/Demonstrations

Sunday, October 11, 2009

What's preventing Excel from being the ubiquitous cube browser

I have said earlier that Microsoft’s Excel 2007 is not the ubiquitous cube browser that it should be. What’s wrong with it? Well nothing is really wrong with it, but there are several features, that I would expect from a product that is 10 years old. With OLAP services first release 1998, there was an update for Excel PivotTable. As I understand it that first version was largely the result of one great developer (TC). However, since that release, the upgrades have been unimpressive. Excel 2007 is a great improvement on Excel 2003, but it still has a long way to catch up to the sophistication that is available in Analysis Services. Here are a few of the things I would like to see in Excel

  • Full support for calculated members. They are partially supported, calculated measures are OK, but other calculated members are treated as second class citizens.

  • Ability to create calculated members that are stored within Excel, and available like any other calculated member for that user.

  • Support for 3 dimensional queries. Currently, a user can only ask for a two dimensional report. Excel, is inherently a 3 dimensional space when you include worksheets. So I would like to be able to create a PivotTable with a dimension placed on the worksheet axis.

  • Support for naive users who open up a dimension member list with a million members. For example, ask them if they really want the one million rows (columns) and perhaps defaulting to a top 100 list.

  • More optimised MDX. Sometimes I despair with the performance of pivottable queries, and when I look at the MDX I see that it could run quickly with more efficient MDX.

  • Ability to see properties of ancestors

  • Better integration into Excel spreadsheets. For example, ability to format measures and PivotTable to remember these, even after drill down, nesting etc.

  • Ability to define sets.

  • Eposure to some set functionality in a gui fashion. For example, it might allow the user to select a crossjoin on rows with an exception function to exclude some members based on a filter criteria.

  • Ability to right click on members to “select only” or “deselect”

  • More control over charts. For example mixing the dimensionality of measures so that one measure is cumulative on series and the other is unrelated to the series.
Don’t get me wrong, even with this wish-list, I am a great fan of Excel PivotTable. As a rich client cube browser, it is still my favourite. It’s just that it isn’t keeping up to Analysis Services, which is truly a market leader in the OLAP field. In a later blog, I will write about my favourite features of Excel pivottable, many of which are not immediately obvious.

For real-time OLAP, Data Mining, Excel Services and related demonstrations goto http://RichardLees.com.au/Sites/Demonstrations

Tuesday, September 29, 2009

My Calculated Member is slow


Have you written a calculated member that is performing very slowly, while the underlying physical measures are fast? This could be a common cube query performance issue that has a very quick remedy.

Essentially, AS want to know when to treat this as an empty cell. If you don't specify the non_empty_behaviour, AS will need to physically calculate all potential cells to determine which ones are actually emtpy. With non_empty_behaviour specified, AS can avoid many calculations, since many blocks of cells are known to be empty. The non_empty_behaviour attribute is appropriate for both SQL Server 2005 and SQL Server 2008.

To remedy, all you need to do is specify the non_emtpy_behaviour for the calculated member.

Note, there are logic implications with non_empty_behaviour. With the example in the picture (hits per web session) it is logical that I would want to return null when the sessions measure is null. This would not be the case if I were calculating a YTD measure, since the measure might be null on the current day, but the calculation needs to add many other days that might not be null.

For live OLAP and data mining demonstrations with real data see http://RichardLees.com.au/Sites/Demonstrations

Friday, September 25, 2009

Which cube browser for Microsoft OLAP


I am often asked which cube browser I recommend for Analysis Services cubes. The answer is typically a few minutes long, since there isn't one ubiquitous cube browser that satisfies everyone. It's a bit like asking someone which vehicle you recommend. It really depends on what the purpose is and what you value. A high performance sports car might be appropriate for some people, but would be inappropriate as a family car or to tow the boat to the coast. For me, the best vehicle is a sports bike, but this would be inappropriate for most road users. The same applies to cube browsers, some users need something very simple and easy, while others need sophistication and flexibility. Having said that if Microsoft Excel had a few more features, it could quite easily be the ubiquitous browser.

Two cube vehicles that complement each other well, unsurprisingly, are Microsoft Excel and SQL Server Reporting Services (RS). Excel is very flexible, while RS can deliver tailored reports, with flexible parameters and easy links to drill down/across to other reports.
In this way, RS is perfect for naive users that need easy to consume information with click-through for more detailed and associated information. Note; RS does not support ad hoc cube browsing. Microsoft also have PerformancePoint, which is a great tool for delivering visual data to a thin client. It has functionality somewhere in-between Excel and RS and many features not available in either. PerformancePoint supports (very rudimentary) ad hoc cube browsing, but it is very powerful at delivering visual and interactive data. So PerformancePoint is ideal for presenting visual data to naive users, and for data savvy users there can be a great deal of “data flexibility”. PerformancePoint can deliver this in a semi structured way, so that users don't drive a cube browser, rather, they drive a dashboard of their data. There is a good example of rich PerformancePoint dashboard on SQLS Live Dashboard. Notice how you can click on scorecard items, and they act as filters on the charts.

Another important cube browser (although not yet truly ad hoc) is Excel Services. Excel Services is essentially an Excel spreadsheet accessed through a thin client, Internet Explorer. Any spreadsheet can be published, even one that is attached to an OLAP cube, so it can be a useful vehicle to deliver live cube data. There are a couple of good Excel Services examples on NZ Census. Notice how you can interact with the spreadsheet by drilling down hierarchies and applying filters. Note too; how the data is live (you might want to switch to the weblogs spreadsheet). However, in the current release of Excel Services you cannot change what is on rows, columns and filters. I have many customers that love Excel Services, since it permits their Finance dept to design and publish reports entirely within Excel, and their report users always see live data.

There are hundreds, if not thousands, of third party cube browsers. They really aren't that hard to develop.
I wrote the first thin client cube browser, ThinSlicer, which is still used by many organisations. When I wrote the first version of ThinSlicer, in 1998, I thought that its shelf life would be short; surely Microsoft will offer a highly functional, low cost, thin client cube browser? How wrong could I be? It is now 10 years on, and the only ad hoc thin client cube browser we have from Microsoft is PerformancePoint, which while it has an ad hoc cube browser, it is not the reason anyone would buy it. An example of a more functional thin client cube browser would be Report Portal. This product supports quite strong ad hoc cube browsing and charting. I must add that there are many other commercial cube browsers, and I really don’t want to list them, since the list and relative strengths are continually changing.

If you are embarking on a data warehouse or business intelligence project, I would encourage you to leave the cube browser choice until quite late in the process. (Just like you shouldn't purchase your production hardware until you are almost ready to deploy a solution.) This is particularly true if you are not familiar with cubes and are not sure what you and your users will need. Microsoft’s SQL Server is probably the best foundation for a data warehouse/business intelligence solution. Particularly its OLAP cube and related technologies such as data mining. I would encourage you to use Excel 2007 and Reporting Services (which is free with any version of SQL Server) as the initial cube delivery vehicles, and trial other products when, and if, you find the need. If you are looking for better thin client visualisation, do try PerformancePoint and Excel Services before you engage with third party vendors. These products are very useful, and they are likely to be greatly improved in future versions. I should add that PerformancePoint dashboards will be incorporated into SharePoint MOSS in future, which makes a lot of sense to me. One other consideration, if you are delivering to users outside of your Intranet/internal network is licensing. SQL Server RS and Excel are quite easy to licence for use over the Internet. Unfortunately, it’s not so easy with PerformancePoint and Excel Services. If that is what you would like to do, I suggest you talk to your Microsoft reseller early on.

Tuesday, September 8, 2009

Big Queries


Have you ever wanted to know which queries are consuming the most resource on your server? There is a quick and easy way to do this without starting up SQL Profiler. Just execute the following SQL query.
select top 50
execution_count,
total_physical_reads,
total_logical_reads,
total_elapsed_time/1000 total_elapsed_time,
total_elapsed_time/(1000*execution_count) Avg_Execution_Time,
text
--,*
From Sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle)
order by total_elapsed_time desc

This query will list out the top 50 queries from you plan cache and show how much resource they have consumed since they have been in the cache. Note, this is not an accurate way of measuring total resource consumption, since it depends on what and how long plans have been in the cache. However, for a first cut tuning exercise, it is very helpful. If you are looking at overall SQL Server (or machine) activity, they it doesn't matter if a small query is running inefficiently, if it executes infrequently. The query above, provides total_elapsed_time, which is a good indication of which queries (in total) are consuming the most resource.

What I would do in the first instance is copy the text from the query and run an explain, possibly with an execution, and look for tuning opportunities. One of my mottos is that anything can have its performance improved, it's only a matter of how much effort is required.

Tuesday, August 25, 2009

SQL Server Analysis Services Partitions

Like relational partitions, SSAS partitions can be very useful in easing operations and, sometimes, improving performance. Like any performance tuning, you really need to know the ramifications of the feature to get the best out of it.

Firstly, partitioning is an Enterprise version, so you will need to have SQL Server EE to make use of it. Secondly, performance doesn't just improve by virtue of having partitions. It depends on what is in the partitions, how many there are and how they are processed/queried. SSAS knows the low and high dimension key attributes for each partition. So if you have partitioned on date, and your query is looking for a particular date value (or range) SSAS will only query the partition (or partitions) relevant for your query. Will your query run faster with partitioning in this instance? Maybe, but not necessarily. I would generally not anticipate an improvement in this scenario, unless there is some caching effect. For example, if the current month (partition) is the "hottest" it will tend to stay in the data cache at the expense of the less queried months. Note, this tends to happen in a large cube without partitions.

On the other hand, if your query does not include a filter that aligns to the partitioning strategy, SSAS will break your query into multiple sub queries, with at least one sub-query in each partition. This can allow the query to be executed in a parallel fashion, which might, but, again, not necessarily, improve performance. If the parallel queries are essentially breaking the workload (as opposed to all performing the same workload for different slices) then there is likely to be a significant performance gain.

Just like a relational database, if there is only one query active at a time, parallelisation is more likely to achieve a performance gain than if there are many concurrent queries. Many concurrent parallised queries can easily saturate a server and degrade overall performance.

One of my favourite strategies for partitioning is not the classic one (which is to partition by date, say one partition per month/year) and it comes with a very high probability of performing well. It is to partition into static and active partitions, with a large static partition(s) and a very small, but volatile, active partition. If you want to keep your cube very close to real-time, evey time you incrementally update a partition the partition is washed out of the data cache. It has to be washed out, since it is now out of date. Unfortunately, if your cube is very large, the next few queries (including cache warmer) will be very slow, while they read the cube (and aggregations) off disk and into memory. So a good potential strategy to to have all of the historical data in one partition and a second, very small, partition, with current data that is getting incrementally updated every minute or so. That way, each minute, the large static partition will remain in memory. Only the small dynamic partition will get washed out. This is good, we are only washing out the small amount of data that we are changing. This strategy requires an addition operational process, where the dynamic data is merged into the static partition. There are a number of ways to do this, including merging partitions, or reprocessing the partitions during the quiet part of the day. Also, this strategy does not negate the possibility of having a classic partitioning strategy also. I.e your large partition may actually be a number of medium sized partitions, but the volatile data is isolated into one small partition.

My second favourite reason for partitioning is, like relational partitions, it makes some operational procedures easier. For example, with a classic partitioning strategy, to drop a month or year's worth of data, simply involves dropping one or more partitions. No reprocessing require.

My suggestion to you, if you are considering partitioning (and you should for large cubes) is to create a few test cubes and benchmark the performance with various partitioning strategies. Monitor your IO, memory and cpu utilisation. Ensure that your tests involve the caching (or washing out of cache) effect, and ensure that you are using data volumes (and hardware) that reflect what will be used in production.

Monday, August 24, 2009

Dynamic OLAP Reports

The specification from a customer's BI solution, that I just completed, included 3 reports. (The project was really about designing a high performing cube with a hundreds of millions of facts, over 20 real dimensions and updates every 10 minutes.) I love those sort of challenges. Anyway, the 3 reports were essentially the same Profit and Loss Statement at different levels of the business hierarchy. The customer expected 3 reports, and the early BI developer created 3 reports in the PoC, however I created one report that could navigate up and down the business hierarchy with hyperlinks. Needless to say, the customer was delighted to have one report and the flexibility to click up and down the business. This is similar to a blog I made some time ago. I encourage BI developers to use SQL Server Reporting Services Reports (SSRS) on OLAP cubes to create fast and linked reports so that the reports appear like an application to the end user.

Adding hyperlinks to another report in SSRS is really simple, you just right click on the text box, asking for text box properties and clicking on Action Go to Report. The parameters need to be configured. In my case, the parameter for the business unit is created using the current member's unique name and its parent's unique name. So a tiny bit of MDX is required. That brings me onto another of my soapbox themes. BI developers should know MDX. If your BI developer doesn't know MDX, he/she is unlikely to be able to design high performing cubes and will not know how to exploit all the available functionality. it's a bit like asking someone to design a relational data warehouse for you, who doesn't know SQL very well. Good MDX skills are a predicate for good cube designs.

I can't provide the example of my customer's report, but you can see examples of dynamic and linked reports on http://RichardLees.com.au/Sites/Demonstrations

Thursday, July 23, 2009

Who's Using Resources on My Server?

One of my favourite analytical pages is Perfmon by Process. This page shows me which Windows processes are using my critical resources, such as Processor, IO, page file and virtual bytes. When you click on this link, it will show you the critical resources and for each of the last three hours, it shows you which process has been the biggest consumer. The two charts on the right show you the top 10 consumers for the critical resource you have selected on the left (scorecard). By default, the selection is %Processor Time, but you can click on others. My favourite is IO Data Bytes/sec, since this tends to be the most critical resource in a large database system. Try clicking on IO Data Bytes/sec and other counters. See how the grraphs on the right change dynamically, and will show you real time analysis for the last 60 minutes and last 60 hours.

This is just a demonstration. If you have a large production system, do you have such analytical tools available to you? If you don't, I would suggest that you are operating blind, and that you should get some. The toolset I have created uses nothing but SQL Server 2005/2008 and PerformancePoint (for visualisation). In fact, it's so simple, I would like Microsoft to have something like it available as an optional install from SQL Server.

There are lots more real-time demonstrations on http://RichardLees.com.au

Wednesday, July 22, 2009

Instant File Initialization for SQL Server 2008 (and 2005)

I will do anything to reduce the IO load on large databases, which is why I am a great fan of compression in SQL Server 2008. However, another way of reducing IO is to ask SQL Server to format new data extents without writing zeroes out to all pages. If your database takes a new extent (by default it is 10% of the file size, which typically isn't good) SQL Server will want to write binary zeroes to all the pages to ensure that some old data isn't hidden in your database. SQL Server does not need the binary zeroes, it is done to ensure data security. Someone else may have deleted a file and perhaps they don’t not want you seeing the data.

Very often, the data security is not an issue and you would like to avoid the writing of binary zeroes. This can be achieved (on NTFS drives) by ensuring that the account running SQL Server has SE_MANAGE_VOLUME_NAME privilege. You can grant this privilege in User Rights Assignment on Windows 2003, XP or above. To do this, grant the SQL Service account Perform volume maintenance tasks local security rights. Alternatively, if the SQL service account has administrator privileges, it will automatically have this privilege. That's all you need to do to avoid having SQL Server write all those binary zeros. By the way, SQL Server always needs to write binary zeroes to the log files, we are only avoiding binary zeroes on the data files.

If you don't want to grant this privilege to the SQL service account, and you want to avoid the IO load during peak times, I suggest that you extend (manually or automatically) your datasets during an off peak times (i.e. before SQL does it for you).

One of the most noticeable tasks affected by binary zeros is database restores. If you watch the restore progress, it will typically not move off 0% until it has written out the entire data files with binary zeroes. After writing every data page, it will then write the backup files over the data pages, during which time, you will see the progress percent increase. So writing binary zeroes can approximately double full database restore times.

Hope that helps you reduce IO load and increase SQL Server performance.

For real-time SQL demonstrations, including Perfmon, which is writing over 300 records every second, see http://RichardLees.com.au/Sites/Demonstrations

Spotted on SproutDaily.com


Last week I was spotted by SproutDaily.com Some of the best photographers in Sydney take pics around Manly Beach and post them out to everyone registered at SproutDaily.com.

Just can't do anything in Manly without everyone knowing.

Wednesday, July 1, 2009

Restartable or Rerunnable SSIS Packages

BI solutions need to be continually importing/updating data from external source systems. I believe that a good BI solution will have automatically correcting load and process jobs. What I mean by that is if a package fails (such as a power or system failure) the application will automatically correct itself next time it runs. In the old days, we used to create "restartable" jobs, where, after a failure, the operator would manually intervene to restart the job with specific parameters. I think there is little excuse nowadays to have applications that require manual intervention after a failure. They should be designed in such a way that they “know” where they are up to and continue, or reprocess.

For example, my Permon application (Perfmon Dashboard) has a job that runs every 180 seconds to transform and load the new records, and incrementally process the OLAP cube. Every now and again the machine is restarted for some reason and the Perfmon update process could be at any point when it is stopped. However, the job will automatically run from the right place, next time it runs. It will know whether there is 3 minutes or 10 minutes of data to process, by using high water marks in the database. In this way, I have no operational overhead in running this application. It just looks after itself. It is the same for the weblogs BI solution on Weblogs Dashboard. If your BI solution requires manual intervention after a failure or an error, get someone else in, like myself, to fix it up to run automatically without manual intervention.

Now having said that, it is often useful to be able to pass parameters to an SSIS package. (I just needed to mention automatically correcting jobs above, as some developers make packages restartable,to help with manual correction, when it should be automatic.) For example, you might want to have the ability to ask the SSIS package to conditionally run one task. This can be achieved quite easily by assigning the Disable property on the task (container etc) using Expressions on the task to set it to a parameter. To pass the parameter it is simply a matter of adding /SET "\Package.Variables[DisableBackup]";False to the cmd for the job in SQL Agent.

Friday, June 26, 2009

Reporting Services Tips for OLAP Cubes

Naturally I have an inclination to use Reporting Services on OLAP cubes. My feeling is that reports should by and large be run from OLAP cube where they can be dynamic and fast. SQL based reports are more ideally suited for "atomic data" reports (eg transaction lists) and some special reports requiring real time data.

There are a few funnies when you get into OLAP reports with SSRS. The SSRS wizard does a great job of starting the report and building the parameters you need. However, before long you will need to edit the MDX that the wizard creates, to do things like taylor the parameter queries for cascading filters. Here are a couple of tricks that I commonly use.

In SSRS 2008, when you edit any MDX query, the designer will automatically rebuild (in other words wash out any MDX changes you have made) input parameter queries. The simple work around for this frustrating behaviour is to remove the Dimension and Hierarchy values from the query's parameter list.

You might want to use the SSRS calendar control for easy date picking, which is used to filter an OLAP date dimension. This isn't handled automatically, but can be achieved with the following. Change the parameter property to be Date/Time datatype. That will ensure that SSRS uses the date picker. For most reports you will want to dynamically specify the default. This can be simply achieved by a single row OLAP query (for example the last day with data) or from today's date using the now() function. If you use the now() function you can strip off the time with =cdate(format(now(),"yyyy-MM-dd")). The trick now is to use the datetime parameter in your MDX query. What you need to do is construct a full member name using the known dimension-hierarchy names and contruct the member key using the date. The construction of the key will depend on what you have defined as the key of the date hierarchy. If the key is a date, then the fully qualified member name will look something like this. [Date].[Calendar].[TheDate].&[2009-06-26T00:00:00]. So to include the date parameter in your query you can include something like the following in your query strtomember("[Date].[Calendar].[TheDate].&["+format(now(),"yyyy-MM-dd")+"T00:00:00]") Just replace [Date] [Calendar] and [TheDate] with the dimension, hierarchy and level name from your dimension. The MDX query tool in SQL Enterprise Manager will help by providing a sample member. The date parameter might simply be used in the WHERE clause to filter the query or it might be part of a range specification etc. It's a member specification and as such could be used anywhere in your query.

Not so much of a tip, but more of an apology. The MDX query editor in SSRS is a very poor text editor. When I am editing an MDX query, I invariably cut and paste the query into and out of SQL Enterprise Manager Query tool.

Here is an easy way to get a report line number. =RunningValue(1, SUM, "Invoice_Data") The line number can also be used for alternate line shading.

Check the performance of all the MDX queries. If any of them are not running fast enough, see if there is anything you can do to speed them up. One classic way to improve performance and improve report usability is to filter out empty members.

Another classic way to improve performance is to change the use of any single select parameters from StrToSet() to StrToMember() and placed on the WHERE clause.

Just like SQL reports. Or perhaps more so, since OLAP reports are more likely to be run dynamically, link your reports together so that users can easily navigate down and across reports. Essentially, making it easier for users to drill through to focus the salient data. With SSRS reports on OLAP this can be achieved in such a way that users do not need to know anything about databases or even that a cube is behind the report. They simply click on headings or numbers to drill into and across the data that interests them. For example, click on the "Top Memory Consumers" picture above for an OLAP report. The report is dynamic and close to real time. Notice how you can click on any of the processes and get an analysis of that process. Likewise, you can click on any of the counters in that report for a breakdown of the processes using that resource. A good set of dynamic SSRS reports will appear more like an application to end users than a traditional list of static reports.

One last tip is to verify that all the queries in your report definition are actually used. Sometimes SSRS will create new queries for you when you edit parameters. Or you might modify a report and remove a parameter. Do remember to take out the unused queries. All queries appear to fire even if they aren't used.

For examples of real time OLAP reports see http://RichardLees.com.au/Sites/Demonstrations

Monday, June 22, 2009

Bing.com - Microsoft's new search engine


Have you checked out Microsoft's beta search engine, Bing.com? It has some nice features and now look much more like Google.com.

I have noticed that it is rising on my Top Referrers List. Currently it is in the top 20 and rising. For an up to date list see http://RichardLees.com.au/Sites/Demonstrations

By the way, I count 1 referrer as one person clicking on a link from Bing.com to somewhere on http://RichardLees.com.au So far this month (June) Bing.com has made 45 referrals compared to Google, several hundred, or blogspot.com with 1679.

For real time web analytics see RichardLees.com.au

Thursday, June 18, 2009

Microsoft Making PerformancePoint Planning Source Code Available

I have just heard that Microsoft will be making PerformancePoint Planning software available to partners (http://www.microsoft.com/bi/partners/default.aspx) who were actively involved in evaluating/using the product, at no cost.

This is great news. PerformancePoint Planning was a great V1 product and I think it would have had a great future had Microsoft continued to invest in it. It's a loss to Microsoft, MS partners and customers that the product was discontinued. Let's face it every organisation does planning, and if the organisation is large, you do need software to manage the process. I did blog at the time of the discontinuance announcement http://richardlees.blogspot.com/2009/02/microsoft-is-discontinuing.html

I don't know if there are any restrictions on what the partners can do with the code, but I imagine/hope some company will turn it into a V2 commercial product.

I really think the corporate planning market has room for maturation with none of the incumbent vendors (Hyperion, Peoplesoft, Cognos etc.) providing a ubiquitous product.

Monday, June 15, 2009

Learn SQL - SELECT Product(column)

I have on occasion been quoted as suggesting that SQL code with temp tables can indicate the author lacks strong SQL skills. Consequently, I have been challenged to write single SQL statements to replace multiple statements stuck together with temp tables. One of these occasions was with Ron Soukup over a bottle of whisky, but that story deserves its own blog.

On this occasion (sometime in 1999) I was challenged to write a query to get the product of a set of numbers. On another bottle of whiskey, as it happens. My friend wanted something like the SUM() function, but he wanted the product of a set of numbers rather than the sum. The business application was quite simple. His client, a financial organisation, needed to calculate compound interes for money on overnight deposit. Each day will have a different market interest rate.

This problem had an interesting solution using a couple of maths functions - Sum(), Exp() and Log(). Here is something like the SQL that I wrote for them.

select exp(sum(LOG(OvernightInterestRate+1))) as CompoundRate
from OvernightDeposits
where TheDate Between '1999-07-23' and '2009-06-18'


Now that's really quite easy, and is much more efficient than putting the data into a temp table and reprocessing it. Having said that, it would be quite useful (and I thought quite easy) to include a Product() function in SQL Server.

For real-time SQL, OLAP and Data Mining demonstrations, goto http://RichardLees.com.au/Sites/Demonstrations

Sunday, June 14, 2009

Know SQL - Predicates on the WHERE or OUTER JOIN Clauses

Writing good SQL is fundamental in a high performing application free of bugs. SQL has been around as the standard database query language since it left QBE by the roadside in the 1980's, and the core syntax hasn't changed much. Database developers (of any rdbms) should fully understand all ANSI aspects of the language.

I was with some developers recently who hadn't appreciated the difference between a filter predicate on a WHERE and a LEFT OUTER JOIN clause. Essentially the difference is whether the predicate is working on the join or on the resultset. If a predicate is on the right side table of a LEFT OUTER JOIN clause, then it will work on the right hand records only. If the same predicate is on the WHERE clause, the entire record (left and right side) will be filtered out. This can affect the number of records returned.

Here is an example of the difference. The first query has a filter on the WHERE clause, so that any objects joined with columns that have a scale<>0 will be entirely filtered out. That is, not all objects with a name like '%sys%' will be returned.

Select o.* from sys.objects o
left outer join sys.columns c on o.object_id=c.object_id
where o.name like '%sys%'
and c.scale <> 0

This second query has had the columns predicate moved to the LEFT OUTER JOIN clause. This is logically very different since the predicate scale<>0 is working on the join. In this query all objects with a name like '%sys%' will be returned. It’s just a matter of whether they are joined with columns.
Select o.* from sys.objects o
left outer join sys.columns c on o.object_id=c.object_id
and c.scale <> 0
where o.name like '%sys%'


I have used sys.objects and sys.columns, since they are tables (well views actually, but that doesn't matter) in the master database (of either SQL Server 2005 or SQL Server 2008) so you can cut and paste the queries into your query editor and visualise the difference.

This is just one subtlety of the SQL language. If it is news to you and you are a database developer, I suggest that you spend some time learning the SQL language more thoroughly. The good news is that any time you invest in the SQL language will be useful to you for a long time, since it does not appear to threatened by a new language anytime soon. (As much as I love MDX, I can't see it replacing SQL in the rdbms.)

Saturday, June 13, 2009

Leaving Footprints All Over the Internet

Someone recently asked me if my web analytics dashboard (http://RichardLees.com.au/Sites/Demonstrations) records information on where they had been on my site. The answer is an emphatic Yes. Not that my site does anything special in the logging department. I am just using Windows IIS and by default it will log every http request. The only extra thing I do is that I put this data in a cube so that it is easy to query and visualise. You should remember that everytime you browse to a page on the internet, there is probably a log on that server recording your IP address, what you requested, date, time etc. And there will often be additional information such as your browser configuration, even the operating system and version number. If you are accessing the internet via a proxy server, then your IP address is shared amongst other proxy server users so you are somewhat camouflaged. But if you are using a proxy server, then you should consider the proxy server is logging every request you make over the internet, and possibly all the intranet browsing too.

So, as I was saying, my site puts all the web logs into a cube so that you can visualise all the activity on my site, even what I do. I built it quite simply as a demonstration of OLAP technology.

You can ad hoc query the activity on my site using ThinSlicer or PerformancePoint. Both of these tools are thin client, only requiring IE6 (or above) and there are no client controls. So if you have been to my site (anything on http://richardlees.com.au/) then you will be able to browse the cube and see where you have been and what resources you used on my server at the time. Simply go to the Interactive Chart Grid on http://RichardLees.com.au/Sites/Demonstrations and navigate to your IP address and display the requests you have made. This is using Microsoft's PerformancePoint dashboard technology, which takes a little getting used to, but it is a fully functional cube browser. For example, the picture above is a query I made, drilling into IP address 84.151.222.71, which happens to be in Munich, Germany. I then drilled into .jpg files to see what pictures had been downloaded. If you don't know your IP address, you can drill down to your geographic location and select the IP address active at the time you were on the site. You could even browse to my site requesting a resource that does not exist, such as HelenClark.jpg, since you will be the only person requesting this resource, you will be able to drill into this resource and list out your own IP address etc. You could then pivot on the IP address and see what else you requested. The cube is available for continuous querying, even though it is being updated with new logs every 15 minutes or so.

Please note, I do not use these web logs for anything other than providing a demonstration of OLAP and data mining technologies. There is no reason to feel paranoid about what I store on my server, but it might make you think about what information other web sites have about your activity.

Monday, June 8, 2009

Altavista is Back in Action

Anyone who was using the internet in the 1990s, like me, will fondly remember Altavista.com, the best search engine of that era.

I have just noticed that it is back in action and referring people to my site! It seems to have replaced referrals that have come from Yahoo. I guess that shouldn't be so surprising since Yahoo acquired Altavista in a roundabout way via Overture Services.

What is interesting to me is that altavista.com has come back and that almost all of the Altavista referrals (on my site) have come from Chicargo, Illinois. Anyone know why might be?

I have attached a couple of reports, one is simply the Top Referrers Report from http://RichardLees.com.au/Sites/Demonstrations which shows altavista suddenly appearing in Maya 2009. The other I have created using ThinSlicer showing the geographic location of the altavista users. You can do the same remotely on my site. Or, if you prefer, the default dashboard on http://RichardLees.com.au/Sites/Demonstrations has a top referrers grid at the bottom right of the dashboard, which is interactive, so you can drill down on Altavista or drill across to any other dimension.

I find it interesting participating and watching the competition between the big search engines. From the stats on my site, Google.com is by far the largest, with live.com enjoying a steady workload. By the way, the home page on the dashboard above shows the top referrers, but only for the last 5 months. If you want to see a longer range, simply drill up on the time dimension to All, then drill back down to month and you will see all months.

Saturday, June 6, 2009

Resizing .jpg files for SharePoint Server 2007

My internet site (http://richardlees.com.au/) is hosted on Sharepoint Server 2007. There are many photos on my site, particularly, but not exclusively, family snaps. In SharePoint I have resized the photos for easier display. I knew that SharePoint did not physically resize the images, but didn't appreciate the full impact it had on people browsing my site over a low speed link. Since SharePoint did not physically resize the photo (.jpg file) a 2MB file might be downloaded to show a little photo whose image only needs to be a few kilobytes. I am a bit disappointed that Sharepoint doesn't resize images automatically. I kind of feel that since it knows how big the display images are, it could resize them accordingly, and even though it didn't do it now, I thought it must be fixed in a service pack or subsequent release.

Anyway a couple of weeks ago a partner colleague mentioned to me his frustration with these images, so I spent a few minutes and resized all the .jpg images in SharePoint. I was only using about 30 images, so it only took about 30 minutes. It is interesting how much difference this has made to the performance of my web site. The home page on my web dashboard shows average response times for the major file types (.aspx, .axd, .css, and .jpg) over the last 60 days. If you look at this graph (top right on http://RichardLees.com.au/Sites/Demonstrations) you will see .jpg has dropped in response time around May 21. Althought there is quite a lot of daily variations since it is an average of many different size images. I haven't had such a large performance change since I talked to my son about his download activity (http://richardlees.blogspot.com/2009/01/is-my-web-site-getting-slower.html).

You are able to drill into the response time graph, and even take it to a new window if you know how to drive PerformancePoint (see http://richardlees.blogspot.com/2009/04/try-out-microsofts-zero-footprint-cube.html for some quick instructions). I have done this, and drilled down to the photo of Anna and myself on unicycles (one of the main MB culprits). The graph is attached. You can achieve the same, although you will see a more up to date chart since the cube is updating every few minutes.

Two takeaways from this initiative.
  1. If you are using SharePoint 2007, it pays to resize your images to the display size.
  2. It always pays to monitor and log performance of any production application so that you can see user performance and the impact of your tuning exercises
For more close to real-time performance information of my internet site and the two servers behind the site goto http://RichardLees.com.au/Sites/Demonstrations

Analysis Services Database Name and DatabaseID

Typically an Analysis Services database has a DatabaseID the same as the database name. However, if you rename a database, the DatabaseID will stay the same. This can be a source of confusion and it can lead to useful operational techniques.

Whenever anyone connects to an Analysis Services database, they do so by selecting a particular database name. So when you rename a database, users will see and use the new name. However, SSIS packages that perform Analysis Services processing tasks select the database by the DatabaseID. Also, Analysis Services keeps the DatabaseID as a unique identifier. So when you rename an Analysis Services database, existing SSIS packages will be looking for the old DatabaseID.
So if you have an Analysis Services database, say EasternMining, and you rename it to Eastern Mining, then users will only connect to it by asking for Eastern Mining, which is fine. However if you try to restore a database called EasternMining (which doesn't exist anymore) it will fail, even though you have enabled Allow Database Overwrite. The message will be something like Backup and restore errors: An object with the same 'databaseid' ID but different 'databasename' Name already exists on the instance. (The AllowOverwrite flag only applies to the Object ID, not to the object name.) I find the message a bit misleading, but what it is saying is that a database with that DatabaseID exists already.

Unfortunately, you cannot change the databaseID, without doing something like backing up and restoring the database. Interestingly, an xmla create database statement will overwrite an existing database and DatabaseID. If you want to know the DatabaseID for an Analysis Services database, just right click on the database and select properties.
I have previously blogged on how you might use this to maintain close to continuous availability when you need to deploy a new Analysis Services cube structure. http://richardlees.blogspot.com/2009/05/new-analysis-services-database-version.html

Sunday, May 31, 2009

Data Mining as an Enhancement to Basket Analysis

Big retailers all want to perform some form of basket analysis, where they analyse the mix of products that customers purchase in one basket. This tends to be an extremely hardware intensive exercise. Using OLAP technologies can help in this exercise, but even with OLAP hardware can be stressed since you really need a dimension on TransactionID. (OK, there are some optimisations you can make to minimise this dimension, but it is still very large and expensive.) However, if you want to analyse which products are bought together (along with other variables such as time of day, customer gender etc) then data mining (DM) can be a great tool and offer its analysis at a very low cost (ie you only need a small commodity server to perform the analysis). DM algorithms have been available as part of the standard edition of SQL Server since 2000.

Essentially, you ask the data mining to create a model to predict products, based on other products (and other variables such as time of day, store, customer gender etc). The data mining algorithms can scan billions of transactions and produce a data mining model in a relatively short time. The data mining algorithms search through many combinations of products/attributes and discover which products/attributes have strong associations. Essentially, the DM model will only stores relevant associations and their predictive strength. The data mining model tends to be very small since it only contains the associations, even though the work it has done to find them can be resource consuming. The data mining model does not need to store the billions of baskets, or the associations between irrelevant products. You are able to very easily browse the model to visualise products (and potentially other attributes such as gender and time of day) and their associations. Also, you can use the data mining model to predict which products might be bought in a particular basket. This data mining prediction query is extremely fast, typically a few milliseconds, so it is quite plausible that online transactions are probing the data mining model for suggested products at real time. For example, your billions of transactions might be terabytes of data, while the resultant DM model could be a couple of hundred megabytes. of data.
I have an online demonstration of a model that does something like this on http://EasternMining.com.au/sites/Demonstrations/Pages/LibrariesSuggestions.aspx I don't have a retailer's transaction history. Most companies privacy policies prevent them from me displaying them on the internet, however, if yours doesn't let me know and I will host the db on my site. What I do have is the borrowing history of a real municipal library (with names and personal identifiers fudged), which is very similar in structure to retailer transactions. The book is like a product and a borrower's loaned books are a transaction or basket. In my demonstration, you don't need to enter the books and gender, just select any borrower, and the demonstration will select the books that they have borrowed and their sex. This is a relational query and is actually embedded in the data mining (DMX) query. Here is the "product suggestion" DMX query. It looks a bit wordy, but it is not complicated, and the SQL Server wizard creates most of the query for you.

SELECT flattened
t.[BorrowerNo],t.sex,t.[decadeofbirth],
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
SHAPE {OPENQUERY([Libraries],
'SELECT
BorrowerNo,
Sex,
DecadeOfBirth
FROM Borrowers_Interesting')}
APPEND
({OPENQUERY([Libraries],
'SELECT
Title,
BorrowerNo
FROM LoansInterestingFull
WHERE Borrower = ''Damen Roxburgh'' ')}
RELATE BorrowerNo TO BorrowerNo)
AS Loans_interesting AS t
ON BookSuggestionFullDT.Sex = t.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = t.DecadeOfBirth AND
BookSuggestionFullDT.[Loans n Titles].Title = t.Loans_interesting.Title

Here is a similar query with the values coded into the query. As you can see, it is much easier to read. I have found the DMX language easier to learn than MDX, since the tasks you are requesting tend to be simpler. I.e. you are generally only asking for a prediction based on some input variables.

SELECT flattened
MyCustomer.Customer,MyCustomer.Sex,MyCustomer.DecadeOfBirth,
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
(SELECT "Richard Lees" as Customer,
"Male" as [Sex] ,
"1960s" as DecadeOfBirth,
(select 'Marie Claire' as [Title] union
select 'Cleo' as Title) as books) as MyCustomer
ON BookSuggestionFullDT.Sex = MyCustomer.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = MyCustomer.[DecadeOfBirth] AND
BookSuggestionFullDT.[Loans n Titles].Title = MyCustomer.Books.Title

So what I am suggesting here is that if you are performing basket analysis, or have a desire to, then you should also consider creating a data mining model. The DM model will help find the associations and optionally predict potential products. You could even have online shopping sites use DM models to make predictions based on the basket content so far. The query tends to be a few milliseconds, certainly sub-second.

If you require any assistance in creating or exploiting your data mining models, please don't hesitate to ask me for assistance.

For more online real-time business intelligence demonstrations see http://RichardLees.com.au/Sites/Demonstrations


Richard

Friday, May 29, 2009

Creating your first data mining model

Data Mining is a entirely new exercise using new technology for most developers. It can be awkward creating your very first model. To help people create their very first data mining model using real data, I have written a whitepaper on msdn. So if you want to create your first model, visualise it, query it, and create a Reporting Services report from it go to http://technet.microsoft.com/en-us/library/dd883232.aspx


Of course, if you require any assistance with more advanced features of data mining, or applying it to your business strategy, just ask me.

Thursday, May 28, 2009

Don't be afraid of Data Mining models and DMX

Since SQL Server 2000, Microsoft have included data mining algorithms in the standard edition. It continues to astound me that people are still not taking advantage of this technology. OK, you will have to learn something new, but it is a lot of fun and can have profound impact on an organisation. Sometimes all you want, or need, to do is build the model and visualise it.

However, it is often desirable to use the model to make predictions. To do this, you really need to write DMX (Data Mining version of MDX) although the wizard does a good job of helping you. I have some real time data mining demonstrations on my internet site. For example, http://EasternMining.com.au/sites/Demonstrations/Pages/PredictingLatenciesusingDataMining.aspx will query a data mining model with the last 50 hits to the site and predict what the response time will be. Of course there is no benefit in such a model, as we already know the outcome, but it is an interesting demonstration of data mining. The predictions could be for delivery time, probability of fraud, customer profitability, new hire longevity, medical prognosis etc. Here is the DMX query behind the report.

SELECT
Last50.[rid],
Last50.[LogTime],
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
Last50.ProcessingTime,
Last50.BytesIn,
Last50.BytesOut,
Last50.Country,
Last50.Resource,
Last50.ClientAgnt as [ClientAgent],
Last50.http_status,
Last50.Operation,
Last50.ReferringServer,
Last50.os,
Last50.City
From
ProcessingTimeDT
PREDICTION JOIN
OPENQUERY([Weblogs],
'SELECT
[rid],
[LogTime],
ProcessingTime,
BytesIn,
BytesOut,
Country,
Resource,
ClientAgnt,
http_status,
Operation,
ReferringServer,
ResourceType,
OK,
OS,
City
FROM
(SELECT TOP (50) rid, LogTime, ProcessingTime, BytesIn, BytesOut, Country, State, City, resource, resourcetype, clientagnt, OK, http_status, operation,
referringServer, calyear, Calmonth,OS
FROM InternetLog_Wide
ORDER BY RID DESC) as Last50Hits') AS Last50
ON ProcessingTimeDT.[Bytes In] = Last50.BytesIn
AND ProcessingTimeDT.[Bytes Out] = Last50.BytesOut
AND ProcessingTimeDT.Country = Last50.Country
AND ProcessingTimeDT.Resource = Last50.Resource
AND ProcessingTimeDT.ResourceType = Last50.ResourceType
AND ProcessingTimeDT.ClientAgnt = Last50.ClientAgnt
AND ProcessingTimeDT.OK = Last50.OK
AND ProcessingTimeDT.[Http Status] = Last50.http_status
AND ProcessingTimeDT.Operation = Last50.Operation
AND ProcessingTimeDT.[Referring Server] = Last50.ReferringServer
AND ProcessingTimeDT.OS = Last50.OS
AND ProcessingTimeDT.City = Last50.City


The DMX query is actually a hybrid query, with the inner query (for the last 50 hits) coming from the SQL relational database. If we take out the hybrid query and just enter some data manually, the query looks a lot cleaner. For example here is a DMX query where I type in a couple of the attributes (country, city, resource) and get a prediction based on this input.

SELECT
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
TypedData.*
From
ProcessingTimeDT
PREDICTION JOIN
(SELECT
'UNITED STATES' as Country,
'RichardsWelcomePage.aspx' as Resource,
'MSIE 8.0' as ClientAgnt,
'New York' as City )
AS TypedData
ON ProcessingTimeDT.Country = TypedData.Country
AND ProcessingTimeDT.Resource = TypedData.Resource
AND ProcessingTimeDT.ClientAgnt = TypedData.ClientAgnt
AND ProcessingTimeDT.City = TypedData.City

Notice how the data mining model does not need to have all the attributes. The more you give the query, the more accurate the query will be.

If you need any assistance designing or querying your data mining models, don't hesitate to ask for my assistance. Data mining is one of my favourite activities.

Richard

For more real-time data mining and OLAP demonstrations goto http://EasternMining.com.au/Sites/Demonstrations

For more about Richard Lees goto http://richardlees.com.au/

Wednesday, May 27, 2009

MDX for Business - Text as a calculated measure

It surprises many developers that you can have a text value as a calculated measure, since a measure usually contains an aggregated number. However, a text value is very simple and very useful in a calculated measure. It could be as simple as getting a property of a member such as a customer's mobile number -

Member Measures.MobileNumber as Customers.Customer.properties("Mobile Number")
I have a real example, in a report that shows hours on rows and the last three days on columns, and for each cell it shows the top City (customer, product, whatever) by virtue of the number of web requests in that hour. This is shown on the default page of my web dashboard on http://EasternMining.com.au/Sites/Demonstrations See down near the bottom is a table called Top Cities by Hour - Last 3 Days. It shows the city with the most activity on my site for that hour. It might look ahead of time to you, if you are in America or Europe. It is actually very close to real-time in Sydney Australia.

Here is the mdx

With Member Measures.TopCity as
Topcount([Client Host].[Client Geography].[City]
,1,measures.[request count]).item(0).member_caption
Select
tail([Date].[Year Month Day].[Day],3) on Columns,
[HoursOfDay].[HoursOfDay].[Hour Of Day] on Rows
From EasternMining
Where Measures.TopCity


See how simple the mdx is. Again, imagine having to write the same query in SQL. Also, if I want to change something in the report, such as last 5 days, last 3 months etc, it is very simple. In fact, If I chage [City] to [Region] I get the following report. This report tends to show the time zones of Asia, Europe and Americas. See how Americas tend to be early morning (Sydney time) while Europe is late at night and Asia is visiting during the online day.


It's so very easy with MDX. If the answer is in the data, then one MDX expression should be able to ask the question. Let me know if you need any assistance in writing or learning mdx.

Friday, May 22, 2009

MDX for Business - Ranking Customers/Cities

I was just asked to help with an MDX query to list out the top 50 Customers and provide ranks for the last 3 periods. This is a relatively simple MDX query, which someone with moderate MDX skills should be able to write. It might be interesting to you, so here it is.

with
Set Last3Months as tail([Date].[Year Month Day].[Month],3)
Member Measures.Last3MonthsHits as Sum(Last3Months,Measures.Hits)
Set Top50Cities as topcount(filter([Client Host].[Client Geography].[City],[Client Host].[Client Geography].member_caption<>"-"),50,Measures.Last3MonthsHits)
Member Measures.RankThisMonth as
rank([Client Host].[Client Geography].currentmember,order(Top50Cities,Measures.Hits,BDESC))
Member Measures.Movement as Measures.RankThisMonth-(Measures.RankThisMonth,[Date].[Year Month Day].prevmember)

Select Last3Months
*{Measures.Hits,Measures.RankThisMonth,Measures.Movement} on 0,
Top50Cities on 1
From EasternMining

It is a real query and I have executed it on my Weblogs cube (all hits and sessions from my web server). The results are formatted in Excel, which I have found interesting. Sydney (my home town) naturally is top of the list, but then comes New York and London. I don't do any business in these cities, which is why I find it interesting. Perhaps I should have offices there? Redmond is quite high. I guess my old Microsoft colleagues are still finding the real time demonstrations interesting.

Anyway, you can read the MDX and see what it is doing. Notice the elegance of the MDX query. It is very succinct, and easy to read. For those of you who know SQL, can you imagine writing this query in SQL? It wouldn't look anywhere near as elegant, nor would it perform as well.

If you would like assistance with your OLAP cube design or MDX query writing, please don't hesitate to ask.

Perfmon Has Everything

I think I have blogged on the Perfmon cube before, but I am blogging again since many people still do not appreciate how valuable and comprehensive the information from Perfmon can be. When you have a production performance issue, it is so much easier to diagnose if you can do so without adding further stress on the server and by looking at trend information. For example, if your server is busy and going slow, wouldn't it be good to see what cpu/memory/io metrics are now, and compare them to how they were before the performance issue? If you don't store historical information, trend analysis is not possible. And without historical information you will be less likely to preemptively act to avoid a performance problem.

My dashboard http://EasternMining.com.au/sites/Demonstrations/Shared%20Documents/Windows%20Performance%20Monitor/Perfmon.aspx shows many perfmon metrics, but the dashboard conceals how rich the cube is. Notice the headline information in the dashboard above. Now let's assume that there might be an IO bottleneck (very common in database servers nowadays). Click on the Disk tab at the top of the dashboard. This will show you the Physical Disk counters from perfmon. You can filter it by a particular machine. Now click on Disk Bytes/sec in the main display table. Notice how the two graphs on the right dynamically chart this counter for the last 60 minutes and last 60 hours. We can see that there is a large amount of IO on the B: drive during the last couple of days. Now we want to know which program (windows process) is consuming this IO. Perfmon has all the information. Click on Process tab at the top of the dashboard. Then click on IO Data Bytes/sec in the main table. Notice how the graphs on the right chart the top processes of this counter. Also try clicking on IO Write Bytes/sec and see that most of the IO activity is write IO. The chart tells us which process is consuming the IO. In my case it is msmdsrv.exe but it might be different by the time you look at it. All this analysis can be performed without increasing load on the target servers.

If you are interested in Windows performance, try clicking around the Perfmon dashboard tabs on my demonstration site above. The Perfmon metrics are very close to real time (<=180 second lag). You can also use ThinSlicer for ad hoc queries of the Perfmon cube from http://richardlees.com.au:8080/iislog/tsPerfmon.aspx?Column=[DateTime].[Year%20Month%20Day%20Hour]&Row=[Counters].[Counter%20Hierarchy]&Measure=Measures.[ValueAverage]

Also if you would like a Perfmon cube in your organisation, I would be happy to help you. The only technologies used to create the real time dashboard above is Microsoft SQL Server and Microsoft PerformancePoint (PerformancePoint is optional for an application like this, since it is just the vehicle for displaying the information from the cube).


Sunday, May 17, 2009

New Analysis Services database version with minimal downtime

It can be operationally challenging to deploy a new AS database version without cube downtime. Typically, if you deploy a new version of an AS database, the database will be offline until the database has completed a full process. This might be a long time in an environment where the large measure groups are very large and are only incrementally processed (on a daily, hourly, minute etc. schedule), since the full process may take many hours. This can be a problem in SQL Server OLAP as we have been operationally used to a cube being available 24 by 7, since SQL Server 7.0.

One workaround is to deploy a new version of the database with a slightly different name, for example the existing database might be Perfmon and the new version might be Perfmon_V2. In this way, the new database can be deployed and processing while the existing database is online and available. The trick comes when the new database has completed processing. The old database could be deleted and the new database renamed to the original name. This has a slight snag, since when you rename a database, the databaseID remains the same, and the DatbaseID is used by SSIS in AS processing tasks. So you would need to change all the SSIS packages that refer to the database. This would be confusing to manage since the database names would be out of sync with the databaseIDs. The extra trick is to back up the new AS database after processing and restore it over the original database. When you restore an AS database, the database name and databaseID are in sync.

Hence, for an AS database that takes a long time to fully process, to deploy a new database version follow these steps.
  1. Deploy the new AS database with a suffix of _V2
  2. Fully process the new AS database
  3. Backup the new AS database
  4. Restore the AS database over the original database (Allow Database Overwrite)

The downtime is limited to the duration of the restore operation.

Note, the original AS database could continue with incremental processing during the new database full process, although some consideration needs to be given to the way you manage high water marks in the measure groups. This workaround is suitable for SQL Server 2005 and SQL Server 2008.

For realtime online OLAP and data mining demonstrations see http://RichardLees.com.au/Sites/Demonstrations

Monday, May 4, 2009

SQL Server Fragmentation

There are two types of fragmentation

SQL Server Fragmentation
You can analyse SQL fragmentation with DBCC SHOWCONTIG, and use SQL Server utilities to reorganise, if necessary. Most DBAs know about this type of fragmentation and SQL Server utilities can help you reoganise your databases online.

OS Level Fragmentation
SQL Server knows nothing about this fragmentation since it is managed by the OS. You need to use the system defragmentation tool (or file copy) to defrag, which is not easy while SQL Server has the file open. A much better strategy is to avoid OS fragmentation. To do this simply ask SQL Server to grow files at a reasonable size. 1 MB is not a good increment for a terabye database. Also the default of 10% is not very good, since an active transaction that requires additional space will have to wait while the entire 10% is formatted before it can continue. Not a good thing to happen in a high OLTP environment. Notice how the Defrag utility is showing the number of fragments per file. You should be a little bit concerned about database files with over about 100 fragments, and anything over 1000 fragments may well be causing you performance issues.

The best strategy is to preallocate the space in large chunks, during quiet times, so that no transaction has to wait and you avoid OS fragmentation. However, just in case the db file is out of space give it a smallish allocation size of something like 200MB. This is a compromise between not having the file too fragmented and not having online transactions wait for several minutes for formatting.

Tuesday, April 21, 2009

Photo Switcheroo

Everyone knows that my web logs go into a cube and provide a demonstration of how powerful OLAP cubes are (http://RichardLees.com.au/Sites/Demonstrations) Well I have noticed an increasing number of hits from personal sites that have links to photos on my site. One in particular - the surfing photo on at top left. I am not sure if they do it for copyright or web performance reasons. Anyway, I have switched the top left photo for one of myself - top right. Is that an awful thing to do? If my conscience gets the better of me, I will switch them back.

Here is a list of a few of the sites that are using the photo and now have my photo. No doubt they will notice the old guy on the surf ski, at some stage, and update the photo.


There are more sites using the surfing photo. If you are interested, you can query my weblogs cube and look for Referrers to Surfing.jpg It is one 0f the top referring targets. You can also see which sites are still using it. Just look for hits after April 20.

Richard
For real time OLAP and data mining demonstrations go to http://RichardLees.com.au/Sites/Demonstrations