Wednesday, July 1, 2009

Restartable or Rerunnable SSIS Packages

BI solutions need to be continually importing/updating data from external source systems. I believe that a good BI solution will have automatically correcting load and process jobs. What I mean by that is if a package fails (such as a power or system failure) the application will automatically correct itself next time it runs. In the old days, we used to create "restartable" jobs, where, after a failure, the operator would manually intervene to restart the job with specific parameters. I think there is little excuse nowadays to have applications that require manual intervention after a failure. They should be designed in such a way that they “know” where they are up to and continue, or reprocess.

For example, my Permon application (Perfmon Dashboard) has a job that runs every 180 seconds to transform and load the new records, and incrementally process the OLAP cube. Every now and again the machine is restarted for some reason and the Perfmon update process could be at any point when it is stopped. However, the job will automatically run from the right place, next time it runs. It will know whether there is 3 minutes or 10 minutes of data to process, by using high water marks in the database. In this way, I have no operational overhead in running this application. It just looks after itself. It is the same for the weblogs BI solution on Weblogs Dashboard. If your BI solution requires manual intervention after a failure or an error, get someone else in, like myself, to fix it up to run automatically without manual intervention.

Now having said that, it is often useful to be able to pass parameters to an SSIS package. (I just needed to mention automatically correcting jobs above, as some developers make packages restartable,to help with manual correction, when it should be automatic.) For example, you might want to have the ability to ask the SSIS package to conditionally run one task. This can be achieved quite easily by assigning the Disable property on the task (container etc) using Expressions on the task to set it to a parameter. To pass the parameter it is simply a matter of adding /SET "\Package.Variables[DisableBackup]";False to the cmd for the job in SQL Agent.

Friday, June 26, 2009

Reporting Services Tips for OLAP Cubes

Naturally I have an inclination to use Reporting Services on OLAP cubes. My feeling is that reports should by and large be run from OLAP cube where they can be dynamic and fast. SQL based reports are more ideally suited for "atomic data" reports (eg transaction lists) and some special reports requiring real time data.

There are a few funnies when you get into OLAP reports with SSRS. The SSRS wizard does a great job of starting the report and building the parameters you need. However, before long you will need to edit the MDX that the wizard creates, to do things like taylor the parameter queries for cascading filters. Here are a couple of tricks that I commonly use.

In SSRS 2008, when you edit any MDX query, the designer will automatically rebuild (in other words wash out any MDX changes you have made) input parameter queries. The simple work around for this frustrating behaviour is to remove the Dimension and Hierarchy values from the query's parameter list.

You might want to use the SSRS calendar control for easy date picking, which is used to filter an OLAP date dimension. This isn't handled automatically, but can be achieved with the following. Change the parameter property to be Date/Time datatype. That will ensure that SSRS uses the date picker. For most reports you will want to dynamically specify the default. This can be simply achieved by a single row OLAP query (for example the last day with data) or from today's date using the now() function. If you use the now() function you can strip off the time with =cdate(format(now(),"yyyy-MM-dd")). The trick now is to use the datetime parameter in your MDX query. What you need to do is construct a full member name using the known dimension-hierarchy names and contruct the member key using the date. The construction of the key will depend on what you have defined as the key of the date hierarchy. If the key is a date, then the fully qualified member name will look something like this. [Date].[Calendar].[TheDate].&[2009-06-26T00:00:00]. So to include the date parameter in your query you can include something like the following in your query strtomember("[Date].[Calendar].[TheDate].&["+format(now(),"yyyy-MM-dd")+"T00:00:00]") Just replace [Date] [Calendar] and [TheDate] with the dimension, hierarchy and level name from your dimension. The MDX query tool in SQL Enterprise Manager will help by providing a sample member. The date parameter might simply be used in the WHERE clause to filter the query or it might be part of a range specification etc. It's a member specification and as such could be used anywhere in your query.

Not so much of a tip, but more of an apology. The MDX query editor in SSRS is a very poor text editor. When I am editing an MDX query, I invariably cut and paste the query into and out of SQL Enterprise Manager Query tool.

Here is an easy way to get a report line number. =RunningValue(1, SUM, "Invoice_Data") The line number can also be used for alternate line shading.

Check the performance of all the MDX queries. If any of them are not running fast enough, see if there is anything you can do to speed them up. One classic way to improve performance and improve report usability is to filter out empty members.

Another classic way to improve performance is to change the use of any single select parameters from StrToSet() to StrToMember() and placed on the WHERE clause.

Just like SQL reports. Or perhaps more so, since OLAP reports are more likely to be run dynamically, link your reports together so that users can easily navigate down and across reports. Essentially, making it easier for users to drill through to focus the salient data. With SSRS reports on OLAP this can be achieved in such a way that users do not need to know anything about databases or even that a cube is behind the report. They simply click on headings or numbers to drill into and across the data that interests them. For example, click on the "Top Memory Consumers" picture above for an OLAP report. The report is dynamic and close to real time. Notice how you can click on any of the processes and get an analysis of that process. Likewise, you can click on any of the counters in that report for a breakdown of the processes using that resource. A good set of dynamic SSRS reports will appear more like an application to end users than a traditional list of static reports.

One last tip is to verify that all the queries in your report definition are actually used. Sometimes SSRS will create new queries for you when you edit parameters. Or you might modify a report and remove a parameter. Do remember to take out the unused queries. All queries appear to fire even if they aren't used.

For examples of real time OLAP reports see http://RichardLees.com.au/Sites/Demonstrations

Monday, June 22, 2009

Bing.com - Microsoft's new search engine


Have you checked out Microsoft's beta search engine, Bing.com? It has some nice features and now look much more like Google.com.

I have noticed that it is rising on my Top Referrers List. Currently it is in the top 20 and rising. For an up to date list see http://RichardLees.com.au/Sites/Demonstrations

By the way, I count 1 referrer as one person clicking on a link from Bing.com to somewhere on http://RichardLees.com.au So far this month (June) Bing.com has made 45 referrals compared to Google, several hundred, or blogspot.com with 1679.

For real time web analytics see RichardLees.com.au

Thursday, June 18, 2009

Microsoft Making PerformancePoint Planning Source Code Available

I have just heard that Microsoft will be making PerformancePoint Planning software available to partners (http://www.microsoft.com/bi/partners/default.aspx) who were actively involved in evaluating/using the product, at no cost.

This is great news. PerformancePoint Planning was a great V1 product and I think it would have had a great future had Microsoft continued to invest in it. It's a loss to Microsoft, MS partners and customers that the product was discontinued. Let's face it every organisation does planning, and if the organisation is large, you do need software to manage the process. I did blog at the time of the discontinuance announcement http://richardlees.blogspot.com/2009/02/microsoft-is-discontinuing.html

I don't know if there are any restrictions on what the partners can do with the code, but I imagine/hope some company will turn it into a V2 commercial product.

I really think the corporate planning market has room for maturation with none of the incumbent vendors (Hyperion, Peoplesoft, Cognos etc.) providing a ubiquitous product.

Monday, June 15, 2009

Learn SQL - SELECT Product(column)

I have on occasion been quoted as suggesting that SQL code with temp tables can indicate the author lacks strong SQL skills. Consequently, I have been challenged to write single SQL statements to replace multiple statements stuck together with temp tables. One of these occasions was with Ron Soukup over a bottle of whisky, but that story deserves its own blog.

On this occasion (sometime in 1999) I was challenged to write a query to get the product of a set of numbers. On another bottle of whiskey, as it happens. My friend wanted something like the SUM() function, but he wanted the product of a set of numbers rather than the sum. The business application was quite simple. His client, a financial organisation, needed to calculate compound interes for money on overnight deposit. Each day will have a different market interest rate.

This problem had an interesting solution using a couple of maths functions - Sum(), Exp() and Log(). Here is something like the SQL that I wrote for them.

select exp(sum(LOG(OvernightInterestRate+1))) as CompoundRate
from OvernightDeposits
where TheDate Between '1999-07-23' and '2009-06-18'


Now that's really quite easy, and is much more efficient than putting the data into a temp table and reprocessing it. Having said that, it would be quite useful (and I thought quite easy) to include a Product() function in SQL Server.

For real-time SQL, OLAP and Data Mining demonstrations, goto http://RichardLees.com.au/Sites/Demonstrations

Sunday, June 14, 2009

Know SQL - Predicates on the WHERE or OUTER JOIN Clauses

Writing good SQL is fundamental in a high performing application free of bugs. SQL has been around as the standard database query language since it left QBE by the roadside in the 1980's, and the core syntax hasn't changed much. Database developers (of any rdbms) should fully understand all ANSI aspects of the language.

I was with some developers recently who hadn't appreciated the difference between a filter predicate on a WHERE and a LEFT OUTER JOIN clause. Essentially the difference is whether the predicate is working on the join or on the resultset. If a predicate is on the right side table of a LEFT OUTER JOIN clause, then it will work on the right hand records only. If the same predicate is on the WHERE clause, the entire record (left and right side) will be filtered out. This can affect the number of records returned.

Here is an example of the difference. The first query has a filter on the WHERE clause, so that any objects joined with columns that have a scale<>0 will be entirely filtered out. That is, not all objects with a name like '%sys%' will be returned.

Select o.* from sys.objects o
left outer join sys.columns c on o.object_id=c.object_id
where o.name like '%sys%'
and c.scale <> 0

This second query has had the columns predicate moved to the LEFT OUTER JOIN clause. This is logically very different since the predicate scale<>0 is working on the join. In this query all objects with a name like '%sys%' will be returned. It’s just a matter of whether they are joined with columns.
Select o.* from sys.objects o
left outer join sys.columns c on o.object_id=c.object_id
and c.scale <> 0
where o.name like '%sys%'


I have used sys.objects and sys.columns, since they are tables (well views actually, but that doesn't matter) in the master database (of either SQL Server 2005 or SQL Server 2008) so you can cut and paste the queries into your query editor and visualise the difference.

This is just one subtlety of the SQL language. If it is news to you and you are a database developer, I suggest that you spend some time learning the SQL language more thoroughly. The good news is that any time you invest in the SQL language will be useful to you for a long time, since it does not appear to threatened by a new language anytime soon. (As much as I love MDX, I can't see it replacing SQL in the rdbms.)

Saturday, June 13, 2009

Leaving Footprints All Over the Internet

Someone recently asked me if my web analytics dashboard (http://RichardLees.com.au/Sites/Demonstrations) records information on where they had been on my site. The answer is an emphatic Yes. Not that my site does anything special in the logging department. I am just using Windows IIS and by default it will log every http request. The only extra thing I do is that I put this data in a cube so that it is easy to query and visualise. You should remember that everytime you browse to a page on the internet, there is probably a log on that server recording your IP address, what you requested, date, time etc. And there will often be additional information such as your browser configuration, even the operating system and version number. If you are accessing the internet via a proxy server, then your IP address is shared amongst other proxy server users so you are somewhat camouflaged. But if you are using a proxy server, then you should consider the proxy server is logging every request you make over the internet, and possibly all the intranet browsing too.

So, as I was saying, my site puts all the web logs into a cube so that you can visualise all the activity on my site, even what I do. I built it quite simply as a demonstration of OLAP technology.

You can ad hoc query the activity on my site using ThinSlicer or PerformancePoint. Both of these tools are thin client, only requiring IE6 (or above) and there are no client controls. So if you have been to my site (anything on http://richardlees.com.au/) then you will be able to browse the cube and see where you have been and what resources you used on my server at the time. Simply go to the Interactive Chart Grid on http://RichardLees.com.au/Sites/Demonstrations and navigate to your IP address and display the requests you have made. This is using Microsoft's PerformancePoint dashboard technology, which takes a little getting used to, but it is a fully functional cube browser. For example, the picture above is a query I made, drilling into IP address 84.151.222.71, which happens to be in Munich, Germany. I then drilled into .jpg files to see what pictures had been downloaded. If you don't know your IP address, you can drill down to your geographic location and select the IP address active at the time you were on the site. You could even browse to my site requesting a resource that does not exist, such as HelenClark.jpg, since you will be the only person requesting this resource, you will be able to drill into this resource and list out your own IP address etc. You could then pivot on the IP address and see what else you requested. The cube is available for continuous querying, even though it is being updated with new logs every 15 minutes or so.

Please note, I do not use these web logs for anything other than providing a demonstration of OLAP and data mining technologies. There is no reason to feel paranoid about what I store on my server, but it might make you think about what information other web sites have about your activity.

Monday, June 8, 2009

Altavista is Back in Action

Anyone who was using the internet in the 1990s, like me, will fondly remember Altavista.com, the best search engine of that era.

I have just noticed that it is back in action and referring people to my site! It seems to have replaced referrals that have come from Yahoo. I guess that shouldn't be so surprising since Yahoo acquired Altavista in a roundabout way via Overture Services.

What is interesting to me is that altavista.com has come back and that almost all of the Altavista referrals (on my site) have come from Chicargo, Illinois. Anyone know why might be?

I have attached a couple of reports, one is simply the Top Referrers Report from http://RichardLees.com.au/Sites/Demonstrations which shows altavista suddenly appearing in Maya 2009. The other I have created using ThinSlicer showing the geographic location of the altavista users. You can do the same remotely on my site. Or, if you prefer, the default dashboard on http://RichardLees.com.au/Sites/Demonstrations has a top referrers grid at the bottom right of the dashboard, which is interactive, so you can drill down on Altavista or drill across to any other dimension.

I find it interesting participating and watching the competition between the big search engines. From the stats on my site, Google.com is by far the largest, with live.com enjoying a steady workload. By the way, the home page on the dashboard above shows the top referrers, but only for the last 5 months. If you want to see a longer range, simply drill up on the time dimension to All, then drill back down to month and you will see all months.

Saturday, June 6, 2009

Resizing .jpg files for SharePoint Server 2007

My internet site (http://richardlees.com.au/) is hosted on Sharepoint Server 2007. There are many photos on my site, particularly, but not exclusively, family snaps. In SharePoint I have resized the photos for easier display. I knew that SharePoint did not physically resize the images, but didn't appreciate the full impact it had on people browsing my site over a low speed link. Since SharePoint did not physically resize the photo (.jpg file) a 2MB file might be downloaded to show a little photo whose image only needs to be a few kilobytes. I am a bit disappointed that Sharepoint doesn't resize images automatically. I kind of feel that since it knows how big the display images are, it could resize them accordingly, and even though it didn't do it now, I thought it must be fixed in a service pack or subsequent release.

Anyway a couple of weeks ago a partner colleague mentioned to me his frustration with these images, so I spent a few minutes and resized all the .jpg images in SharePoint. I was only using about 30 images, so it only took about 30 minutes. It is interesting how much difference this has made to the performance of my web site. The home page on my web dashboard shows average response times for the major file types (.aspx, .axd, .css, and .jpg) over the last 60 days. If you look at this graph (top right on http://RichardLees.com.au/Sites/Demonstrations) you will see .jpg has dropped in response time around May 21. Althought there is quite a lot of daily variations since it is an average of many different size images. I haven't had such a large performance change since I talked to my son about his download activity (http://richardlees.blogspot.com/2009/01/is-my-web-site-getting-slower.html).

You are able to drill into the response time graph, and even take it to a new window if you know how to drive PerformancePoint (see http://richardlees.blogspot.com/2009/04/try-out-microsofts-zero-footprint-cube.html for some quick instructions). I have done this, and drilled down to the photo of Anna and myself on unicycles (one of the main MB culprits). The graph is attached. You can achieve the same, although you will see a more up to date chart since the cube is updating every few minutes.

Two takeaways from this initiative.
  1. If you are using SharePoint 2007, it pays to resize your images to the display size.
  2. It always pays to monitor and log performance of any production application so that you can see user performance and the impact of your tuning exercises
For more close to real-time performance information of my internet site and the two servers behind the site goto http://RichardLees.com.au/Sites/Demonstrations

Analysis Services Database Name and DatabaseID

Typically an Analysis Services database has a DatabaseID the same as the database name. However, if you rename a database, the DatabaseID will stay the same. This can be a source of confusion and it can lead to useful operational techniques.

Whenever anyone connects to an Analysis Services database, they do so by selecting a particular database name. So when you rename a database, users will see and use the new name. However, SSIS packages that perform Analysis Services processing tasks select the database by the DatabaseID. Also, Analysis Services keeps the DatabaseID as a unique identifier. So when you rename an Analysis Services database, existing SSIS packages will be looking for the old DatabaseID.
So if you have an Analysis Services database, say EasternMining, and you rename it to Eastern Mining, then users will only connect to it by asking for Eastern Mining, which is fine. However if you try to restore a database called EasternMining (which doesn't exist anymore) it will fail, even though you have enabled Allow Database Overwrite. The message will be something like Backup and restore errors: An object with the same 'databaseid' ID but different 'databasename' Name already exists on the instance. (The AllowOverwrite flag only applies to the Object ID, not to the object name.) I find the message a bit misleading, but what it is saying is that a database with that DatabaseID exists already.

Unfortunately, you cannot change the databaseID, without doing something like backing up and restoring the database. Interestingly, an xmla create database statement will overwrite an existing database and DatabaseID. If you want to know the DatabaseID for an Analysis Services database, just right click on the database and select properties.
I have previously blogged on how you might use this to maintain close to continuous availability when you need to deploy a new Analysis Services cube structure. http://richardlees.blogspot.com/2009/05/new-analysis-services-database-version.html

Sunday, May 31, 2009

Data Mining as an Enhancement to Basket Analysis

Big retailers all want to perform some form of basket analysis, where they analyse the mix of products that customers purchase in one basket. This tends to be an extremely hardware intensive exercise. Using OLAP technologies can help in this exercise, but even with OLAP hardware can be stressed since you really need a dimension on TransactionID. (OK, there are some optimisations you can make to minimise this dimension, but it is still very large and expensive.) However, if you want to analyse which products are bought together (along with other variables such as time of day, customer gender etc) then data mining (DM) can be a great tool and offer its analysis at a very low cost (ie you only need a small commodity server to perform the analysis). DM algorithms have been available as part of the standard edition of SQL Server since 2000.

Essentially, you ask the data mining to create a model to predict products, based on other products (and other variables such as time of day, store, customer gender etc). The data mining algorithms can scan billions of transactions and produce a data mining model in a relatively short time. The data mining algorithms search through many combinations of products/attributes and discover which products/attributes have strong associations. Essentially, the DM model will only stores relevant associations and their predictive strength. The data mining model tends to be very small since it only contains the associations, even though the work it has done to find them can be resource consuming. The data mining model does not need to store the billions of baskets, or the associations between irrelevant products. You are able to very easily browse the model to visualise products (and potentially other attributes such as gender and time of day) and their associations. Also, you can use the data mining model to predict which products might be bought in a particular basket. This data mining prediction query is extremely fast, typically a few milliseconds, so it is quite plausible that online transactions are probing the data mining model for suggested products at real time. For example, your billions of transactions might be terabytes of data, while the resultant DM model could be a couple of hundred megabytes. of data.
I have an online demonstration of a model that does something like this on http://EasternMining.com.au/sites/Demonstrations/Pages/LibrariesSuggestions.aspx I don't have a retailer's transaction history. Most companies privacy policies prevent them from me displaying them on the internet, however, if yours doesn't let me know and I will host the db on my site. What I do have is the borrowing history of a real municipal library (with names and personal identifiers fudged), which is very similar in structure to retailer transactions. The book is like a product and a borrower's loaned books are a transaction or basket. In my demonstration, you don't need to enter the books and gender, just select any borrower, and the demonstration will select the books that they have borrowed and their sex. This is a relational query and is actually embedded in the data mining (DMX) query. Here is the "product suggestion" DMX query. It looks a bit wordy, but it is not complicated, and the SQL Server wizard creates most of the query for you.

SELECT flattened
t.[BorrowerNo],t.sex,t.[decadeofbirth],
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
SHAPE {OPENQUERY([Libraries],
'SELECT
BorrowerNo,
Sex,
DecadeOfBirth
FROM Borrowers_Interesting')}
APPEND
({OPENQUERY([Libraries],
'SELECT
Title,
BorrowerNo
FROM LoansInterestingFull
WHERE Borrower = ''Damen Roxburgh'' ')}
RELATE BorrowerNo TO BorrowerNo)
AS Loans_interesting AS t
ON BookSuggestionFullDT.Sex = t.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = t.DecadeOfBirth AND
BookSuggestionFullDT.[Loans n Titles].Title = t.Loans_interesting.Title

Here is a similar query with the values coded into the query. As you can see, it is much easier to read. I have found the DMX language easier to learn than MDX, since the tasks you are requesting tend to be simpler. I.e. you are generally only asking for a prediction based on some input variables.

SELECT flattened
MyCustomer.Customer,MyCustomer.Sex,MyCustomer.DecadeOfBirth,
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
(SELECT "Richard Lees" as Customer,
"Male" as [Sex] ,
"1960s" as DecadeOfBirth,
(select 'Marie Claire' as [Title] union
select 'Cleo' as Title) as books) as MyCustomer
ON BookSuggestionFullDT.Sex = MyCustomer.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = MyCustomer.[DecadeOfBirth] AND
BookSuggestionFullDT.[Loans n Titles].Title = MyCustomer.Books.Title

So what I am suggesting here is that if you are performing basket analysis, or have a desire to, then you should also consider creating a data mining model. The DM model will help find the associations and optionally predict potential products. You could even have online shopping sites use DM models to make predictions based on the basket content so far. The query tends to be a few milliseconds, certainly sub-second.

If you require any assistance in creating or exploiting your data mining models, please don't hesitate to ask me for assistance.

For more online real-time business intelligence demonstrations see http://RichardLees.com.au/Sites/Demonstrations


Richard

Friday, May 29, 2009

Creating your first data mining model

Data Mining is a entirely new exercise using new technology for most developers. It can be awkward creating your very first model. To help people create their very first data mining model using real data, I have written a whitepaper on msdn. So if you want to create your first model, visualise it, query it, and create a Reporting Services report from it go to http://technet.microsoft.com/en-us/library/dd883232.aspx


Of course, if you require any assistance with more advanced features of data mining, or applying it to your business strategy, just ask me.

Thursday, May 28, 2009

Don't be afraid of Data Mining models and DMX

Since SQL Server 2000, Microsoft have included data mining algorithms in the standard edition. It continues to astound me that people are still not taking advantage of this technology. OK, you will have to learn something new, but it is a lot of fun and can have profound impact on an organisation. Sometimes all you want, or need, to do is build the model and visualise it.

However, it is often desirable to use the model to make predictions. To do this, you really need to write DMX (Data Mining version of MDX) although the wizard does a good job of helping you. I have some real time data mining demonstrations on my internet site. For example, http://EasternMining.com.au/sites/Demonstrations/Pages/PredictingLatenciesusingDataMining.aspx will query a data mining model with the last 50 hits to the site and predict what the response time will be. Of course there is no benefit in such a model, as we already know the outcome, but it is an interesting demonstration of data mining. The predictions could be for delivery time, probability of fraud, customer profitability, new hire longevity, medical prognosis etc. Here is the DMX query behind the report.

SELECT
Last50.[rid],
Last50.[LogTime],
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
Last50.ProcessingTime,
Last50.BytesIn,
Last50.BytesOut,
Last50.Country,
Last50.Resource,
Last50.ClientAgnt as [ClientAgent],
Last50.http_status,
Last50.Operation,
Last50.ReferringServer,
Last50.os,
Last50.City
From
ProcessingTimeDT
PREDICTION JOIN
OPENQUERY([Weblogs],
'SELECT
[rid],
[LogTime],
ProcessingTime,
BytesIn,
BytesOut,
Country,
Resource,
ClientAgnt,
http_status,
Operation,
ReferringServer,
ResourceType,
OK,
OS,
City
FROM
(SELECT TOP (50) rid, LogTime, ProcessingTime, BytesIn, BytesOut, Country, State, City, resource, resourcetype, clientagnt, OK, http_status, operation,
referringServer, calyear, Calmonth,OS
FROM InternetLog_Wide
ORDER BY RID DESC) as Last50Hits') AS Last50
ON ProcessingTimeDT.[Bytes In] = Last50.BytesIn
AND ProcessingTimeDT.[Bytes Out] = Last50.BytesOut
AND ProcessingTimeDT.Country = Last50.Country
AND ProcessingTimeDT.Resource = Last50.Resource
AND ProcessingTimeDT.ResourceType = Last50.ResourceType
AND ProcessingTimeDT.ClientAgnt = Last50.ClientAgnt
AND ProcessingTimeDT.OK = Last50.OK
AND ProcessingTimeDT.[Http Status] = Last50.http_status
AND ProcessingTimeDT.Operation = Last50.Operation
AND ProcessingTimeDT.[Referring Server] = Last50.ReferringServer
AND ProcessingTimeDT.OS = Last50.OS
AND ProcessingTimeDT.City = Last50.City


The DMX query is actually a hybrid query, with the inner query (for the last 50 hits) coming from the SQL relational database. If we take out the hybrid query and just enter some data manually, the query looks a lot cleaner. For example here is a DMX query where I type in a couple of the attributes (country, city, resource) and get a prediction based on this input.

SELECT
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
TypedData.*
From
ProcessingTimeDT
PREDICTION JOIN
(SELECT
'UNITED STATES' as Country,
'RichardsWelcomePage.aspx' as Resource,
'MSIE 8.0' as ClientAgnt,
'New York' as City )
AS TypedData
ON ProcessingTimeDT.Country = TypedData.Country
AND ProcessingTimeDT.Resource = TypedData.Resource
AND ProcessingTimeDT.ClientAgnt = TypedData.ClientAgnt
AND ProcessingTimeDT.City = TypedData.City

Notice how the data mining model does not need to have all the attributes. The more you give the query, the more accurate the query will be.

If you need any assistance designing or querying your data mining models, don't hesitate to ask for my assistance. Data mining is one of my favourite activities.

Richard

For more real-time data mining and OLAP demonstrations goto http://EasternMining.com.au/Sites/Demonstrations

For more about Richard Lees goto http://richardlees.com.au/

Wednesday, May 27, 2009

MDX for Business - Text as a calculated measure

It surprises many developers that you can have a text value as a calculated measure, since a measure usually contains an aggregated number. However, a text value is very simple and very useful in a calculated measure. It could be as simple as getting a property of a member such as a customer's mobile number -

Member Measures.MobileNumber as Customers.Customer.properties("Mobile Number")
I have a real example, in a report that shows hours on rows and the last three days on columns, and for each cell it shows the top City (customer, product, whatever) by virtue of the number of web requests in that hour. This is shown on the default page of my web dashboard on http://EasternMining.com.au/Sites/Demonstrations See down near the bottom is a table called Top Cities by Hour - Last 3 Days. It shows the city with the most activity on my site for that hour. It might look ahead of time to you, if you are in America or Europe. It is actually very close to real-time in Sydney Australia.

Here is the mdx

With Member Measures.TopCity as
Topcount([Client Host].[Client Geography].[City]
,1,measures.[request count]).item(0).member_caption
Select
tail([Date].[Year Month Day].[Day],3) on Columns,
[HoursOfDay].[HoursOfDay].[Hour Of Day] on Rows
From EasternMining
Where Measures.TopCity


See how simple the mdx is. Again, imagine having to write the same query in SQL. Also, if I want to change something in the report, such as last 5 days, last 3 months etc, it is very simple. In fact, If I chage [City] to [Region] I get the following report. This report tends to show the time zones of Asia, Europe and Americas. See how Americas tend to be early morning (Sydney time) while Europe is late at night and Asia is visiting during the online day.


It's so very easy with MDX. If the answer is in the data, then one MDX expression should be able to ask the question. Let me know if you need any assistance in writing or learning mdx.

Friday, May 22, 2009

MDX for Business - Ranking Customers/Cities

I was just asked to help with an MDX query to list out the top 50 Customers and provide ranks for the last 3 periods. This is a relatively simple MDX query, which someone with moderate MDX skills should be able to write. It might be interesting to you, so here it is.

with
Set Last3Months as tail([Date].[Year Month Day].[Month],3)
Member Measures.Last3MonthsHits as Sum(Last3Months,Measures.Hits)
Set Top50Cities as topcount(filter([Client Host].[Client Geography].[City],[Client Host].[Client Geography].member_caption<>"-"),50,Measures.Last3MonthsHits)
Member Measures.RankThisMonth as
rank([Client Host].[Client Geography].currentmember,order(Top50Cities,Measures.Hits,BDESC))
Member Measures.Movement as Measures.RankThisMonth-(Measures.RankThisMonth,[Date].[Year Month Day].prevmember)

Select Last3Months
*{Measures.Hits,Measures.RankThisMonth,Measures.Movement} on 0,
Top50Cities on 1
From EasternMining

It is a real query and I have executed it on my Weblogs cube (all hits and sessions from my web server). The results are formatted in Excel, which I have found interesting. Sydney (my home town) naturally is top of the list, but then comes New York and London. I don't do any business in these cities, which is why I find it interesting. Perhaps I should have offices there? Redmond is quite high. I guess my old Microsoft colleagues are still finding the real time demonstrations interesting.

Anyway, you can read the MDX and see what it is doing. Notice the elegance of the MDX query. It is very succinct, and easy to read. For those of you who know SQL, can you imagine writing this query in SQL? It wouldn't look anywhere near as elegant, nor would it perform as well.

If you would like assistance with your OLAP cube design or MDX query writing, please don't hesitate to ask.

Perfmon Has Everything

I think I have blogged on the Perfmon cube before, but I am blogging again since many people still do not appreciate how valuable and comprehensive the information from Perfmon can be. When you have a production performance issue, it is so much easier to diagnose if you can do so without adding further stress on the server and by looking at trend information. For example, if your server is busy and going slow, wouldn't it be good to see what cpu/memory/io metrics are now, and compare them to how they were before the performance issue? If you don't store historical information, trend analysis is not possible. And without historical information you will be less likely to preemptively act to avoid a performance problem.

My dashboard http://EasternMining.com.au/sites/Demonstrations/Shared%20Documents/Windows%20Performance%20Monitor/Perfmon.aspx shows many perfmon metrics, but the dashboard conceals how rich the cube is. Notice the headline information in the dashboard above. Now let's assume that there might be an IO bottleneck (very common in database servers nowadays). Click on the Disk tab at the top of the dashboard. This will show you the Physical Disk counters from perfmon. You can filter it by a particular machine. Now click on Disk Bytes/sec in the main display table. Notice how the two graphs on the right dynamically chart this counter for the last 60 minutes and last 60 hours. We can see that there is a large amount of IO on the B: drive during the last couple of days. Now we want to know which program (windows process) is consuming this IO. Perfmon has all the information. Click on Process tab at the top of the dashboard. Then click on IO Data Bytes/sec in the main table. Notice how the graphs on the right chart the top processes of this counter. Also try clicking on IO Write Bytes/sec and see that most of the IO activity is write IO. The chart tells us which process is consuming the IO. In my case it is msmdsrv.exe but it might be different by the time you look at it. All this analysis can be performed without increasing load on the target servers.

If you are interested in Windows performance, try clicking around the Perfmon dashboard tabs on my demonstration site above. The Perfmon metrics are very close to real time (<=180 second lag). You can also use ThinSlicer for ad hoc queries of the Perfmon cube from http://richardlees.com.au:8080/iislog/tsPerfmon.aspx?Column=[DateTime].[Year%20Month%20Day%20Hour]&Row=[Counters].[Counter%20Hierarchy]&Measure=Measures.[ValueAverage]

Also if you would like a Perfmon cube in your organisation, I would be happy to help you. The only technologies used to create the real time dashboard above is Microsoft SQL Server and Microsoft PerformancePoint (PerformancePoint is optional for an application like this, since it is just the vehicle for displaying the information from the cube).


Sunday, May 17, 2009

New Analysis Services database version with minimal downtime

It can be operationally challenging to deploy a new AS database version without cube downtime. Typically, if you deploy a new version of an AS database, the database will be offline until the database has completed a full process. This might be a long time in an environment where the large measure groups are very large and are only incrementally processed (on a daily, hourly, minute etc. schedule), since the full process may take many hours. This can be a problem in SQL Server OLAP as we have been operationally used to a cube being available 24 by 7, since SQL Server 7.0.

One workaround is to deploy a new version of the database with a slightly different name, for example the existing database might be Perfmon and the new version might be Perfmon_V2. In this way, the new database can be deployed and processing while the existing database is online and available. The trick comes when the new database has completed processing. The old database could be deleted and the new database renamed to the original name. This has a slight snag, since when you rename a database, the databaseID remains the same, and the DatbaseID is used by SSIS in AS processing tasks. So you would need to change all the SSIS packages that refer to the database. This would be confusing to manage since the database names would be out of sync with the databaseIDs. The extra trick is to back up the new AS database after processing and restore it over the original database. When you restore an AS database, the database name and databaseID are in sync.

Hence, for an AS database that takes a long time to fully process, to deploy a new database version follow these steps.
  1. Deploy the new AS database with a suffix of _V2
  2. Fully process the new AS database
  3. Backup the new AS database
  4. Restore the AS database over the original database (Allow Database Overwrite)

The downtime is limited to the duration of the restore operation.

Note, the original AS database could continue with incremental processing during the new database full process, although some consideration needs to be given to the way you manage high water marks in the measure groups. This workaround is suitable for SQL Server 2005 and SQL Server 2008.

For realtime online OLAP and data mining demonstrations see http://RichardLees.com.au/Sites/Demonstrations

Monday, May 4, 2009

SQL Server Fragmentation

There are two types of fragmentation

SQL Server Fragmentation
You can analyse SQL fragmentation with DBCC SHOWCONTIG, and use SQL Server utilities to reorganise, if necessary. Most DBAs know about this type of fragmentation and SQL Server utilities can help you reoganise your databases online.

OS Level Fragmentation
SQL Server knows nothing about this fragmentation since it is managed by the OS. You need to use the system defragmentation tool (or file copy) to defrag, which is not easy while SQL Server has the file open. A much better strategy is to avoid OS fragmentation. To do this simply ask SQL Server to grow files at a reasonable size. 1 MB is not a good increment for a terabye database. Also the default of 10% is not very good, since an active transaction that requires additional space will have to wait while the entire 10% is formatted before it can continue. Not a good thing to happen in a high OLTP environment. Notice how the Defrag utility is showing the number of fragments per file. You should be a little bit concerned about database files with over about 100 fragments, and anything over 1000 fragments may well be causing you performance issues.

The best strategy is to preallocate the space in large chunks, during quiet times, so that no transaction has to wait and you avoid OS fragmentation. However, just in case the db file is out of space give it a smallish allocation size of something like 200MB. This is a compromise between not having the file too fragmented and not having online transactions wait for several minutes for formatting.

Tuesday, April 21, 2009

Photo Switcheroo

Everyone knows that my web logs go into a cube and provide a demonstration of how powerful OLAP cubes are (http://RichardLees.com.au/Sites/Demonstrations) Well I have noticed an increasing number of hits from personal sites that have links to photos on my site. One in particular - the surfing photo on at top left. I am not sure if they do it for copyright or web performance reasons. Anyway, I have switched the top left photo for one of myself - top right. Is that an awful thing to do? If my conscience gets the better of me, I will switch them back.

Here is a list of a few of the sites that are using the photo and now have my photo. No doubt they will notice the old guy on the surf ski, at some stage, and update the photo.


There are more sites using the surfing photo. If you are interested, you can query my weblogs cube and look for Referrers to Surfing.jpg It is one 0f the top referring targets. You can also see which sites are still using it. Just look for hits after April 20.

Richard
For real time OLAP and data mining demonstrations go to http://RichardLees.com.au/Sites/Demonstrations

Thursday, April 9, 2009

Try out Microsoft's zero footprint cube browser

Many people are not aware that Microsoft has a thin client cube browser. It is part of the current version of PerformancePoint Server, (although that will change in future and become part of SharePoint).

I have PerformancePoint on my demonstration site, so you are able to try out the cube browser. All you need is Internet Explorer (version 6 or above). It doesn't appear to be supported by other cube browsers as yet. To try out the cube browser, simply go to my demonstration site, http://RichardLees.com.au/Sites/Demonstrations. The home page is a PerformancePoint page with several SharePoint webparts. All the charts on this home page are from a real time database of web activity on RichardLees.com.au. Some of these web parts, I have written MDX to get exactly the chart I want. These charts are not interactive. However many of the charts I have created by simply slicing and filtering the data, which means that you can continue to slice and dice. For example, on the demonstrations home page try right clicking on the chart Top Referrers Last Few Months column heading March and select Drill Up. That will take you to the higher level on the Date dimension. Then right click on 2009 and select drill up again, which will take you to the All member of the date dimension. Now right click on one of the row headings (these are referring sites to RichardLees.com.au) and select Drill Up. Right click on the row heading again and select Drill Up. Now you will be at the All level of the entire cube, and able to drill to any part of the cube. Note, you will also find a useful button on the right hand side of the webpart, which has an option to Open in new window. This will give you much more real estate to work with. You can right click on any of the row/column heading and select Drill Down to and select any dimension hierarchy such as Client Agent, Client Host, Date, Entry Exit, Hour, Machine, Operation, Port, Status HTTP, Status Win32 etc. Actually, there are more dimensions than the webpart can display. As you hover over each dimension the available hierarchies are displayed.
Some of the useful thin client functions you have are Show only, Remove, Sort (rows and/or columns), Report Type (grid, column, line, stacked, stacke 100%), Pivot, Filter. You really can slice and dice to any part of the cube you desire.
Try it out on http://RichardLees.com.au/Sites/Demonstrations If you are good, you should be able to navigate to filter out your own activity on this site and see where you have been. There are other cubes on this site that you can also navigate such as the New Zealand census, Perfmon stats from my demonstration machines and Foodmart. If you are having trouble understanding how to use the functionality and want a simpler cube, use the New Zealand census cube, since it has a much simpler structure and is easy to consume.

Tuesday, March 31, 2009

Internet Explorer 8 is Now Available

There has been a jump in Internet Explorer 8 activity on my site since it was release a couple of weeks ago. It jumped from about 5% of site activity to about 15%. Not surprisingly this increase has come from IE6 and IE7. However, what really intrigues me is how much Internet Explorer 6 is still being used. It still represents a little over 20% of total activity. See chart on right, or for a live graph see http://RichardLees.com.au/Sites/Demonstrations The data is kept up to within 15 minutes of real time. You are able to slice and dice the web activity via the PerformancePoint graphs or "ThinSlicer on Weblogs". Note, these thin client cube browsers work best with IE.

Sunday, March 29, 2009

Nigel Pendse' BI Survey 8

As a participant, I have just received a complimentary copy of BI Survey 8. I have been reading Nigel's surveys since 1999. While I often disagreed with some of his findings, I have always considered him the best analyst in the Business Intelligence space.

Unfortunately, I am not able to quote anything from the BI Survey. Nigel Pendse is remunerated by subscribers (not vendors) which makes his surveys refreshingly independent.

I would encourage any organization, about to embark on a data warehouse/business intelligence project, to subscribe to this analysis. Unless you are a seasoned BI developer, you will benefit from many insights and be much better armed before being caressed by a dapper BI sales team.

The current survey includes analyses on the selection process, who succeeded by vendor, biggest outstanding issues, size of deployment etc. etc. Those who are familiar with Nigel’s reports will know that he is very candid when talking about vendors and products. He really doesn’t hold back, understandably since this is what his subscribers are paying for.

I encourage openness in our BI industry. My hope is that the more informed organizations are, the better the decisions they will make, improving the effectiveness and reputation of our BI industry.

You can download a free chapter or subscribe to the full survey on http://www.bi-survey.com

Richard
For real-time online BI demonstrations goto http://RichardLees.com.au/Sites/Demonstrations

Friday, March 6, 2009

Error 44. No mapping between account names and security IDs was done

Have you ever received this error?
Error 44 The following system error occurred: No mapping between account names and security IDs was done.

You will receive this error when you have windows accounts or groups in an AS project that you are deploying to a server in a different domain. Ie. the server does not recognise these accounts.

A simple work around is to leave the windows users and groups out of the roles and add them in from SQL Server Management Studio. Another option, if it is a common deployment is to use AMO and add the appropriate accounts automatically. Also, if you use the AS Deployment wizard there are options to omit role deployment.

Richard

Wednesday, February 18, 2009

Space Considerations for Various OLAP Data Types

Data types can have huge space and performance implications on large SQL tables, and there are similar implications for large OLAP cubes. We all know that double word numbers take up more space than single word numbers, but what exactly is the difference?

I have performed a simple benchmark to compare the space used of the various data types. My tests involved a
  • 44 Million row fact table
  • 3 measures containing values between -99 and 1039.22
  • 5 dimensions
  • SQL Server 2008
  • Zero aggregations

I used 6 datatypes in SQL Server tables. The SQL data types I employed were

  • Money
  • Float
  • Decimal
  • Bigint
  • Integer
  • Real

For the Money data type, I created two cubes, since Analysis Services defaulted the Money data type to Double. I created one cube with the default Double data type, and another with the cube data type as Currency.

The results show that the best performing (in terms of size) were

  • Bigint and Integer (41% smaller than Double)
  • Currency (25% smaller than Double)
  • Single (3% smaller than Double)
  • Double

Note, Sometimes the data type you choose is determined (or constrained) by the domain of the numbers you are aggregating. For example, if your numbers contain significant digits after the decimal point, then Bigint and Integer are inappropriate, unless you are happy to round them. Likewise, for imprecise numbers, Double gives you the most precision, which you might need. But just because the underlying SQL table uses a single or double byte precision data type, if the number is always an integer, you might as well tell Analysis Services to use a Bigint and save on space.

It shouldn't surprise you that the binary numbers (Bigint and Integer) are the smallest on disk, nor that they took the same space as each other. Analysis Services compresses all the cells before writing to disk, and the binary numbers will tend to have lots of leading zeroes, which compress very well. Similarly, but to a lessor extent, Currency (really a packed decimal) compresses next best.

Just in case you are wondering why bother fiddling with data types when the cube is only a gigabyte or two in size, and disk space is plentiful. The reason is that if you can save 40% in your cube size, processing and querying will be noticeably faster. With a smaller cube, your cache will warm faster, and if the entire cube doesn't fit in RAM, you will have a greator proportion of your cube in RAM with a smaller cube. Ie. your queries will run much faster.

Richard

for real-time online OLAP demonstrations go to http://RichardLees.com.au/Sites/Demonstrations

Monday, February 9, 2009

How many rows in a table

A very common DBA/developer task is to query how many rows there are in a table. Of course the natural way to do this is to write a query such as

SELECT COUNT(*) FROM MyTable

However, this query will be slow for a very large table, particularly if it only has a clustered index. Why not just ask SQL Server for the stats on the table? If your SQL Server database has the default settings, it will be continually sampling the table and will generally have a very good idea of how many records there are. A very simple way of asking the optimiser for these stats is to write the following query


SELECT * FROM MyTable


and click on the Display Estimated Execution Plan button, which looks a bit like 3 green boxes and a caption. SQL Server will tell you, amongst other things, the number of records it expects to read as you hover over the scan object.

I encourage developers to click on this button, not just to get a quick count, but so that they start to use the showplan more often. In my view, the more often you use showplan and the more familiar you get with the optimiser, the higher performing SQL you will write.

Richard

for real time SQL, OLAP, Data Mining, PerformancePoint, Excel Services see http://RichardLees.com.au/Sites/Demonstrations

Monday, February 2, 2009

Microsoft is discontinuing PerformancePoint Server

This is hard for me to fathom, let alone beleive. http://www.informationweek.com/news/business_intelligence/analytics/showArticle.jhtml?articleID=212902915

Microsoft was a new entrant to the Financial Planning software arena with PPS back in 2007. However it's V1 product received positive reviews (see Nigel Pendse's report). However, it appears that they won't be doing any more work on the planning component after the release of PPS 2007 sp3, and will roll the dashboard software into SharePoint Server.

I am pleased to hear that they are continuing the dashboard software, since that's the component that I often use. See my demonstration site on http://RichardLees.com.au/Sites/Demonstrations and it makes sense to roll it into SharePoint. Hopefully this will mean that it will be easy to license it for Internet use.

But I can't help but feel that they are making a mistake in dropping the planning product that could have a huge future.

Richard
http://RichardLees.com.au

Friday, January 30, 2009

Is my web site getting slower

Strange things have been happening in our household over the Christmas holidays. My web dashboard tells me that as from January 1st, my web server's average respone times have deteriorated badly. A few days ago my ISP informed me that I had exceeded my monthly limit, and one of my sons has been watching a lot of movies on his laptop lately.
The chart above (you can see the latest from http://RichardLees.com.au/Sites/Demonstrations) shows the dramatic slow down of the response times since the downloads were happening. Has anyone using my site noticed the slow down?
My son confesses that he might have used up a little of our bandwidth downloading these movies and from now on he will limit it to one at a time and mostly after hours. Also, I think that since school has started he won't have so much time to watch movies on his laptop. At least I am hoping.
What would be really nice is an ability to give a lower priority to these downloads (happening in the background) over online http requests to my server. I am not much of a network engineer. Does anyone know if there is a simple way of prioritising this adsl2 traffic?
Richard

Saturday, January 10, 2009

Drill Through in Analysis Services

Drill through functionality was not included in the first version of OLAP technology from Microsoft (SQL Server 7). I found it such sought after requirement that I created my own drill through for ThinSlicer (the first thin client cube browser). It would simply generate an SQL statement based on the dimension members and assumed that the level names were the same as the column names. Then in 2000, drill though functionality was included in Analysis Services, so I removed my custom code.

However, things changed in 2005 (SQL Server 2005) when drill through architecture was changed so that a drill through request did not go to the relational database (for the atomic records) rather, it returned the most detailed cell data from the cube. This, was a huge change in functionality, and some would argue that is not "drill through" but "drill bottom". As you can imagine, there are security implications with drilling through to relational data. In the cube, you can include quite complex security functionality, which would be difficult to honor in a drill through. This might be the reason drill through was dramatically changed in SQL Server 2005.

I hope that one day SQL Server will bring back drill through to relational data. It is such a useful function for many cubes. In the meantime, if you want to include drill through to the atomic records, you will need to create a cube action and write some custom code.
One thing to take into account when you do drill through to atomic data in the relational database is SQL performance. As we know, cube queries tend to have more reliable response times than ad hoc SQL queries. If you write your own customer drill through functionality, I have two suggestions
  1. Use "Top n" in your query (or equivalent from a non ms db). The optimiser takes this into account.
  2. Create a non clustered index on each of the foreign keys
For an example of how easy drill through is simply go to my demonstrations home page http://RichardLees.com.au/Sites/Demonstrations, right click on one of the charts and select "show details". Here is an example of the "drill through" records you will be returned.

I didn't need to write any code for this drill through. The graph is presented by PerformancePoint, and like any good cube browser, it simply honors the drill through functionality of Analysis Services.
Enabling drill through on the cube, couldn't be simpler, you simply enable drill through on the role.

Richard

Sunday, December 21, 2008

Who's Reading This Blog?

I don't host this blog site, consequently I don't have access to the logs to analyse activity on the site. However, that doesn't mean that I can't monitor blog activity, and indeed, let all of you analyse the activity of this blog.



This is an old trick really. I have simply embedded an tag in my profile that has a reference to http://EasternMining.com.au/blogs.gif That resource does not exist, but it requests my own web server, which I have an automated process to add logs to a publicly available cube. You can see lots of real-time web log analysis demonstrations on my home site http://RichardLees.com.au/Sites/Demonstrations



So you and I can view how many people are reading this blog, when and where they are reading from. See Blog Activity or navigate there from the "Web Performance" page of the "Web Analytics" dashboard on my demonstrations site. Ignore all the graphs, except the chart at the bottom left of the page. This chart is showing the number of visitors by Geography for the last 60 days. It's only showing about 10 days today, since I added the tag in on December 11. You don't have to stop there, since the graph is presented by Microsoft PerformancePoint, it is interactive and you are able to drill down further, change chart type etc. for example, here is a sorted list of client IP addresses. You can ask for exactly the same, and you will be able to find your own IP address in the list from when you read this blog. Because, whenever you go to this page, you make a real-time query to the web logs cube.


I would like to point out that I have added this functionality simply to provide an interesting live demonstration. I don't have any particular reason to know anyone's IP address. Although having said that, this might be a useful reminder to many people of the tracks they are leaving behind on the Internet.


Richard


Thursday, December 11, 2008

Book Review: Data Mining with SQL Server 2008

I remember when data mining algorithms were first included in SQL Server 2000. It was very exciting and I immediately went to the municipal library and talked them into giving me an extract of their database for data mining demonstrations. That version of SQL Server had no data mining documentation. But that didn't worry me and I helped a few organisations exploit data mining. However there weren't many people that went to the trouble of learning SQL Server's data mining technolgy. SQL Server 2005 had some documentation but data mining still wasn't used to its potential. SQL Server 2008 has been greatly beefed up in its data mining capability, primarily in the ancillary tools that professional data miners demand, such as lift charts and validity testing. Now SQL Server 2008 has, arguably, the best set of data mining tools in the market. So if you have been putting off data mining, don't delay any longer.

This book by Jamie MacLennan, ZhaoHui Tang and Bogdan Crivat (all developers of the product in Redmond) is a very practical guide and quite readable by someone new to data mining. It starts with the data mining tools included in Excel 2007 and goes on to detail all the algorithms, the syntax of the DMX language and embedding data mining in your applications. Experienced data mining will also find the book useful. I found many useful tips. For example, I only just learnt that you can nest MDX in your DMX queries. It is more common to embed SQL in data mining queries.

To learn about data mining, I really believe that you need some real data to explore. The book has a download site where readers can download databases and demonstrations to experiment with.

If you are using Analysis Services and haven't yet started data mining, I suggest that you get a copy of this book and teach yourself data mining. Data mining is going to be really big.

I did have trouble with Wiley download url, but here is a direct link for the exercise data. http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html

By the way, for anyone who is interested, there are a couple of live data mining demonstrations on my site. Where data mining is embedded in Reporting Services (also covered in the book). There is a book suggestion tool using the library data I mentioned above, and the another that predicts response times for the last 50 http requests on my web server. This demonstration has no practical value, but hopefully you can draw the analogy to a similar model that predicts customer profitability etc. http://RichardLees.com.au/sites/Demonstrations/Pages/LibrariesSuggestions.aspx

Richard

http://richardlees.com.au/