
Since SQL Server 2000, Microsoft have included data mining algorithms in the standard edition. It continues to astound me that people are still not taking advantage of this technology. OK, you will have to learn something new, but it is a lot of fun and can have profound impact on an organisation. Sometimes all you want, or need, to do is build the model and visualise it.
However, it is often desirable to use the model to make predictions. To do this, you really need to write DMX (Data Mining version of MDX) although the wizard does a good job of helping you. I have some real time data mining demonstrations on my internet site. For example,
http://EasternMining.com.au/sites/Demonstrations/Pages/PredictingLatenciesusingDataMining.aspx will query a data mining model with the last 50 hits to the site and predict what the response time will be. Of course there is no benefit in such a model, as we already know the outcome, but it is an interesting demonstration of data mining. The predictions could be for delivery time, probability of fraud, customer profitability, new hire longevity, medical prognosis etc. Here is the DMX query behind the report.
SELECT
Last50.[rid],
Last50.[LogTime],
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
Last50.ProcessingTime,
Last50.BytesIn,
Last50.BytesOut,
Last50.Country,
Last50.Resource,
Last50.ClientAgnt as [ClientAgent],
Last50.http_status,
Last50.Operation,
Last50.ReferringServer,
Last50.os,
Last50.City
From
ProcessingTimeDT
PREDICTION JOIN
OPENQUERY([Weblogs],
'SELECT
[rid],
[LogTime],
ProcessingTime,
BytesIn,
BytesOut,
Country,
Resource,
ClientAgnt,
http_status,
Operation,
ReferringServer,
ResourceType,
OK,
OS,
City
FROM
(SELECT TOP (50) rid, LogTime, ProcessingTime, BytesIn, BytesOut, Country, State, City, resource, resourcetype, clientagnt, OK, http_status, operation,
referringServer, calyear, Calmonth,OS
FROM InternetLog_Wide
ORDER BY RID DESC) as Last50Hits') AS Last50
ON ProcessingTimeDT.[Bytes In] = Last50.BytesIn
AND ProcessingTimeDT.[Bytes Out] = Last50.BytesOut
AND ProcessingTimeDT.Country = Last50.Country
AND ProcessingTimeDT.Resource = Last50.Resource
AND ProcessingTimeDT.ResourceType = Last50.ResourceType
AND ProcessingTimeDT.ClientAgnt = Last50.ClientAgnt
AND ProcessingTimeDT.OK = Last50.OK
AND ProcessingTimeDT.[Http Status] = Last50.http_status
AND ProcessingTimeDT.Operation = Last50.Operation
AND ProcessingTimeDT.[Referring Server] = Last50.ReferringServer
AND ProcessingTimeDT.OS = Last50.OS
AND ProcessingTimeDT.City = Last50.City
The DMX query is actually a hybrid query, with the inner query (for the last 50 hits) coming from the SQL relational database. If we take out the hybrid query and just enter some data manually, the query looks a lot cleaner. For example here is a DMX query where I type in a couple of the attributes (country, city, resource) and get a prediction based on this input.
SELECT
Predict(ProcessingTimeDT.[Processing Time]) as [Prediction],
TypedData.*
From
ProcessingTimeDT
PREDICTION JOIN
(SELECT
'UNITED STATES' as Country,
'RichardsWelcomePage.aspx' as Resource,
'MSIE 8.0' as ClientAgnt,
'New York' as City )
AS TypedData
ON ProcessingTimeDT.Country = TypedData.Country
AND ProcessingTimeDT.Resource = TypedData.Resource
AND ProcessingTimeDT.ClientAgnt = TypedData.ClientAgnt
AND ProcessingTimeDT.City = TypedData.City
Notice how the data mining model does not need to have all the attributes. The more you give the query, the more accurate the query will be.
If you need any assistance designing or querying your data mining models, don't hesitate to ask for my assistance. Data mining is one of my favourite activities.
Richard