Thursday, May 28, 2009

Don't be afraid of Data Mining models and DMX

Since SQL Server 2000, Microsoft have included data mining algorithms in the standard edition. It continues to astound me that people are still not taking advantage of this technology. OK, you will have to learn something new, but it is a lot of fun and can have profound impact on an organisation. Sometimes all you want, or need, to do is build the model and visualise it.

However, it is often desirable to use the model to make predictions. To do this, you really need to write DMX (Data Mining version of MDX) although the wizard does a good job of helping you. I have some real time data mining demonstrations on my internet site. For example, http://EasternMining.com.au/sites/Demonstrations/Pages/PredictingLatenciesusingDataMining.aspx will query a data mining model with the last 50 hits to the site and predict what the response time will be. Of course there is no benefit in such a model, as we already know the outcome, but it is an interesting demonstration of data mining. The predictions could be for delivery time, probability of fraud, customer profitability, new hire longevity, medical prognosis etc. Here is the DMX query behind the report.

SELECT
Last50.[rid],
Last50.[LogTime],
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
Last50.ProcessingTime,
Last50.BytesIn,
Last50.BytesOut,
Last50.Country,
Last50.Resource,
Last50.ClientAgnt as [ClientAgent],
Last50.http_status,
Last50.Operation,
Last50.ReferringServer,
Last50.os,
Last50.City
From
ProcessingTimeDT
PREDICTION JOIN
OPENQUERY([Weblogs],
'SELECT
[rid],
[LogTime],
ProcessingTime,
BytesIn,
BytesOut,
Country,
Resource,
ClientAgnt,
http_status,
Operation,
ReferringServer,
ResourceType,
OK,
OS,
City
FROM
(SELECT TOP (50) rid, LogTime, ProcessingTime, BytesIn, BytesOut, Country, State, City, resource, resourcetype, clientagnt, OK, http_status, operation,
referringServer, calyear, Calmonth,OS
FROM InternetLog_Wide
ORDER BY RID DESC) as Last50Hits') AS Last50
ON ProcessingTimeDT.[Bytes In] = Last50.BytesIn
AND ProcessingTimeDT.[Bytes Out] = Last50.BytesOut
AND ProcessingTimeDT.Country = Last50.Country
AND ProcessingTimeDT.Resource = Last50.Resource
AND ProcessingTimeDT.ResourceType = Last50.ResourceType
AND ProcessingTimeDT.ClientAgnt = Last50.ClientAgnt
AND ProcessingTimeDT.OK = Last50.OK
AND ProcessingTimeDT.[Http Status] = Last50.http_status
AND ProcessingTimeDT.Operation = Last50.Operation
AND ProcessingTimeDT.[Referring Server] = Last50.ReferringServer
AND ProcessingTimeDT.OS = Last50.OS
AND ProcessingTimeDT.City = Last50.City


The DMX query is actually a hybrid query, with the inner query (for the last 50 hits) coming from the SQL relational database. If we take out the hybrid query and just enter some data manually, the query looks a lot cleaner. For example here is a DMX query where I type in a couple of the attributes (country, city, resource) and get a prediction based on this input.

SELECT
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
TypedData.*
From
ProcessingTimeDT
PREDICTION JOIN
(SELECT
'UNITED STATES' as Country,
'RichardsWelcomePage.aspx' as Resource,
'MSIE 8.0' as ClientAgnt,
'New York' as City )
AS TypedData
ON ProcessingTimeDT.Country = TypedData.Country
AND ProcessingTimeDT.Resource = TypedData.Resource
AND ProcessingTimeDT.ClientAgnt = TypedData.ClientAgnt
AND ProcessingTimeDT.City = TypedData.City

Notice how the data mining model does not need to have all the attributes. The more you give the query, the more accurate the query will be.

If you need any assistance designing or querying your data mining models, don't hesitate to ask for my assistance. Data mining is one of my favourite activities.

Richard

For more real-time data mining and OLAP demonstrations goto http://EasternMining.com.au/Sites/Demonstrations

For more about Richard Lees goto http://richardlees.com.au/

4 comments:

KoreanTiger said...
This comment has been removed by the author.
KoreanTiger said...

Mr. Ricardson.

I'm a also MS business intelligence developer.(very short experience) and I really want to do some datamining task through MS technologies.

I have read some books but I realized that's not enough.

Currently, I'm studying Algebra, Calculus 1,2,3 (Again!) and Information theory. But I'm not quite sure whether It is a right way to be a data-miner.

Could you give me any suggestions? Any suggestions/advices are delightful!

(As English is my second language, plz forgive me if I made any mistakes.)

Richard Lees said...

Studying algebra and calculus is a very good foundataion. Probability theory is also useful.

I find that there aren't many full time data mining professionals, although this is changing and there will be more going forward. In companies I work with, data mining activities tend to emerge in the business intelligence teams with a team member/consultant interested in data mining. So I would encourage you to learn about the traditional BI technologies (OLAP and the MDX query language). Other technologies that will help in this area are very strong SQL skills and data transformation technologies like Integration Services.

One good thing for you is that all these technologies are part of SQL Server 2008, which I think is all the software you need to learn these skills. There are many books, tutorials, videos and MCP exams that will help you.

Don't be daunted, there is a lot to learn, and it may take a few years.

Vijay said...

Richard - I'm trying to generate a probability output for a decision tree model.

Basically it is:

ID Rule Probability
A 1 75%
A 3 58%
A 28 33%
A 145 31%


I can write a prediction query that returns the probability for ID "A". How can I write a prediction query that returns the table above with probability for ID "A" for each rule of a decision tree?

Thanks!

Vijay