posted 6/5/2012 by MMilligan - Views: [1286]
I recently had the pleasure of working with a very talented group of professionals on a very challenging project. Everything during the project went so smoothly we had the luxury of adding an additional layer of icing on the cake in the form of a data validation layer unlike anything I had ever implemented or seen in the field. One piece of that layer was in developing a process based on Kimball's Design Tip #10: Is Your Data Correct? (http://www.kimballgroup.com/html/designtips.html) It would be worth your while to take a minute and read this design tip. It's only two pages long.
To summarize, the design tip describes a process in which outliers are identified after a daily ETL load completes for further investigation by the data steward(s) of the data warehouse. This is done using a query that finds entries in the fact table that are outside of three standard deviations of the mean. In a normal distribution, 99% of the data falls above or below that threshold. This requires us to look at the entire time history of data and can be very expensive to compute.
In an effort to improve the efficiency of this process a table is used to track three accumulating numbers to aid in the calculation of the formula: number of days, the sum of a metric, and sum square of the metric. In the example below I am tracking these three values using sales by store for each day from the Contoso Retail sample database. (http://www.microsoft.com/en-us/download/details.aspx?id=18279) The query can be used in an SSRS report set up with a data driven subscription to automatically email the interested parties when outliers occur.
The example in the design tip uses stores and departments. Contoso doesn't have departments and my initial attempts of just using store and day of the week were not enough for me to drill down into the detail of what caused the outlier exception. I decided to use three components to slice up the total sales amount instead of two: Store, day of the week, and Product Category Name.
The following code below populates the OutliersFactSales table. It took approximately six hours to run on a quad core laptop with 8 Gig of RAM. Thank goodness this part only needs to be run once to build the history. Subsequent runs would be for only the new days added to the data warehouse with each incremental load. The end result is a table with about 17,000 rows: one row for every store for every day of the week (Sunday - Saturday) for every Product Category. The size of the table will not grow unless new product categories or stores are added. I've included a script to insert the data into a table so that you don't have to run the code yourself. Click here to download.
--SQL SNIPPET - KIMBALL OUTLIER METHOD /* --COMMENTS START create nonclustered index ix_MMFactSales on factSales (DateKey,StoreKey,ProductKey,SalesKey); drop table dbo.OutliersFactSales truncate table dbo.OutliersFactSales CREATE THE Outlier TABLE AND SEED WITH FIRST DATE IN FACT TABLE SELECT StoreKey ,p3.ProductCategoryKey ,DayOfTheWeek = DATEPART(weekday,dateKey) ,NumberOfDays = 1 ,SUMMetric = SUM(SalesAmount) ,SUMMetricSquared = SQUARE(SUM(SalesAmount)) INTO [ContosoRetailDW].dbo.OutliersFactSales FROM [ContosoRetailDW].[dbo].[FactSales] f JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey WHERE DateKey = (SELECT MIN(DATEKEY) FROM [dbo].[FactSales]) --1/1/2007 GROUP BY StoreKey,p3.ProductCategoryKey,datekey; CREATE TABLE [dbo].[OutliersFactSales] ( [StoreKey] [int] NOT NULL ,[ProductCategoryKey] [int] NOT NULL ,[DayOfTheWeek] [int] NOT NULL ,[NumberOfDays] [int] NOT NULL ,[SUMMetric] [money] NULL ,[SUMMetricSquared] [float] NULL ,CONSTRAINT [PK_OutliersFactSales] PRIMARY KEY CLUSTERED ( [StoreKey] ASC ,[ProductCategoryKey] ASC ,[DayOfTheWeek] ASC ) WITH ( PAD_INDEX = OFF ,STATISTICS_NORECOMPUTE = OFF ,IGNORE_DUP_KEY = OFF ,ALLOW_ROW_LOCKS = ON ,ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY] ) ON [PRIMARY] --COMMENTS END */ DECLARE @DateToRun INT ,@EndingDate INT; --get starting date SELECT @DateToRun = CAST(CONVERT(VARCHAR, MIN(DateKey), 112) AS INT) FROM DBO.FactSales; SELECT @EndingDate = CAST(CONVERT(VARCHAR, MAX(DateKey), 112) AS INT) FROM DBO.FactSales; SET @EndingDate = @EndingDate - 1; WHILE @DateToRun <= @EndingDate BEGIN -- update current records UPDATE a SET NumberOfDays = NumberOfDays + 1 ,SumMetric = a.sumMetric + m.sumMetric ,SumMetricSquared = a.SumMetricSquared + m.SumMetricSquared FROM dbo.OutliersFactSales a INNER JOIN ( SELECT StoreKey ,p3.ProductCategoryKey ,DayOfTheWeek = DATEPART(weekday, dateKey) ,SUMMetric = SUM(SalesAmount) ,SUMMetricSquared = SQUARE(SUM(SalesAmount)) FROM [ContosoRetailDW].[dbo].[FactSales] f INNER JOIN [dbo].DimProduct p ON p.ProductKey = f.ProductKey INNER JOIN [dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey INNER JOIN [dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey WHERE CAST(CONVERT(VARCHAR, DateKey, 112) AS INT) = @DateToRun --1/1/2007 GROUP BY StoreKey ,p3.ProductCategoryKey ,datekey ) m ON m.StoreKey = a.StoreKey AND m.DayOfTheWeek = a.DayOfTheWeek AND m.ProductCategoryKey = a.ProductCategoryKey; --add new records INSERT INTO dbo.OutliersFactSales SELECT StoreKey ,p3.ProductCategoryKey ,DayOfTheWeek = DATEPART(weekday, dateKey) ,NumberOfDays = 1 ,SUMMetric = SUM(SalesAmount) ,SUMMetricSquared = SQUARE(SUM(SalesAmount)) FROM [ContosoRetailDW].[dbo].[FactSales] f INNER JOIN [ContosoRetailDW].[dbo].DimProduct p ON p.ProductKey = f.ProductKey INNER JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey INNER JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey WHERE CAST(CONVERT(VARCHAR, DateKey, 112) AS INT) = @DateToRun --1/1/2007 AND cast(StoreKey AS VARCHAR(10)) + ' ' + cast(DATEPART(weekday, dateKey) AS CHAR(1)) + ' ' + cast(p3.ProductCategoryKey AS VARCHAR(10)) NOT IN ( SELECT cast(StoreKey AS VARCHAR(10)) + ' ' + cast(dayOfTheWeek AS CHAR(1)) + ' ' + cast(ProductCategoryKey AS VARCHAR(10)) FROM dbo.OutliersFactSales ) GROUP BY StoreKey ,p3.ProductCategoryKey ,DateKey; -- increment date SET @DateToRun = CONVERT(INT, CONVERT(VARCHAR, DATEADD(dd, 1, CONVERT(DATE, convert(VARCHAR, @DateToRun))), 112)) PRINT @DateToRun END --- six hours later...
The next block of code is what you would base your SSRS report on. It returns the store, day of the week, product categories, and metrics for a specific date in the data warehouse for outliers of the Sales Amount metric. These results could be further refined to only include outliers that meet a certain threshold. This report could be set up with a regular or data driven subscription so that the outliers can be analyzed after each refresh.
--execute and view outliers --select * from drop table dbo.OutliersFactSales SELECT f.StoreKey ,p3.ProductCategoryName ,dayoftheweek ,f2.NumberOfDays ,f2.sumMetric ,f2.SumMetricSquared ,sumMetricAdding = SUM(SalesAmount) ,SumMetricSquaredAdding = square(sum(SalesAmount)) --(sum(fueladded)*SUM(FuelAdded)) ,leftside = ABS(sum(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric) ,rightside = 3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric)))) --summetric*summetric)) ,LeftRightDifference = (ABS(sum(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric)) - (3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric))))) --summetric*summetric)) ,AvgSalesByDay = f2.sumMetric / f2.NumberOfDays ,Mean = SQRT((1.0 / (f2.NumberOfDays - 1.0)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric)))) --,STDEV(f2.sumMetric) FROM [ContosoRetailDW].[dbo].[FactSales] f INNER JOIN [ContosoRetailDW].[dbo].DimProduct p ON p.ProductKey = f.ProductKey INNER JOIN [ContosoRetailDW].[dbo].DimProductSubcategory p2 ON p2.ProductSubcategoryKey = p.ProductSubcategoryKey INNER JOIN [ContosoRetailDW].[dbo].DimProductCategory p3 ON p3.ProductCategoryKey = p2.ProductCategoryKey INNER JOIN dbo.OutliersFactSales f2 ON f2.StoreKey = f.StoreKey AND f2.dayoftheweek = DATEPART(weekday, dateKey) AND f2.ProductCategoryKey = p3.ProductCategoryKey WHERE CAST(CONVERT(VARCHAR, f.DateKey, 112) AS INT) = 20090329 AND f2.NumberOfDays > 1 GROUP BY f.storekey ,p3.ProductCategoryName ,dayoftheweek ,f2.NumberOfDays ,f2.sumMetric ,f2.SumMetricSquared HAVING ABS(SUM(SalesAmount) - (1 / f2.NumberOfDays) * f2.sumMetric) > 3.0 * SQRT((1.0 / (f2.NumberOfDays - 1)) * (f2.SumMetricSquared - (1.0 / f2.NumberOfDays) * (SQUARE(SUMMetric)))) --SUMMetric*SUMMetric)); ORDER BY LeftRightDifference DESC;
These are the outliers for Sunday, March 29th 2009 sorted in descending order by the difference between the total sales and three standard deviations from the mean.
Let's look at the first row. Store 307, Cameras and camcorders, Sunday there are 156 days of sample data in our OutlierFactSales table for this Store/Day of Week/Product Category. The SumMetric is the total sales for those 156 days and the SumMetricSquared is the accumulating square of those sums. The SumMetricAdding column is the amount that was added to SumMetric for this days sales and likewise for SumMetricSquaredAdding. The leftside column is the left portion of our HAVING clause (which incidentally is the same as the SumMetricAdding column.) The right side is the value that is three standard deviations from the mean. The filter in the having clause states that the left side must be greater than the right side for the record to be returned. The LeftRightDifference is the left side minus the right side. The AverageSalesByDay divides SumMetric by Number of Days.
If we look at the record at the top of the results we see the row in which the difference between the left side and right side is the greatest. This is for store 307, Cameras and camcorders. Total Sales for that Sunday were $235,261 and the average Sunday sales for cameras and camcorders for that store are only $160,040.
The query below shows the details for that record. (Careful of the embedded space after Product Category Name! LTRIM(RTRIM('anyone? '))
select * from dbo.FactSales f JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey JOIN [dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey JOIN [dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey JOIN dbo.DimPromotion x on f.PromotionKey = x.PromotionKey WHERE CAST(CONVERT(VARCHAR,f.DateKey,112) AS INT) = 20090329 and StoreKey = 306 and ProductCategoryName = 'Cameras and camcorders ' order by SalesAmount Desc;
This query shows us some pretty big orders. They sold 108 units of cameras for a total of almost $35,000. Next on the list are 18 HD TVs for a total of $17,000. It looks like the Asian Spring Promotion really brought some business in. Looking at store 306 for the same product category we see they sold some large quantities of camcorders and cameras for their European Spring Promotion.
Next, let's add an outlier and see if our query picks it up! I chose store 5, Computers because that store/category wasn't returned by our original query. This statement changes the sales amount for a record from $2,208.80 to $22,088.00.
update f set SalesAmount = 22088 --2208.8 from dbo.FactSales f JOIN [ContosoRetailDW].[dbo].DimProduct p on p.ProductKey = f.ProductKey JOIN [dbo].DimProductSubcategory p2 on p2.ProductSubcategoryKey = p.ProductSubcategoryKey JOIN [dbo].DimProductCategory p3 on p3.ProductCategoryKey = p2.ProductCategoryKey JOIN dbo.DimPromotion x on f.PromotionKey = x.PromotionKey WHERE CAST(CONVERT(VARCHAR,f.DateKey,112) AS INT) = 20090329 and StoreKey = 5 and ProductCategoryName = 'Computers';
Running our query again... nothing happened! I updated the record again and added a zero, re-ran our query and it picked it up. Looking at the record we see why $22,088 wasn't enough to trigger a response. Three standard deviations from the mean is $28,945.
I think this is a really cool and fairly simple way to add additional value to the data warehouse. I especially want to thank Cory Anderson, a member of my last project team; he went the extra mile helping me polish the data validation layer for our solution. I should also thank Ralph Kimball for publishing his design tips for all to share. Helping identify bad or suspect data in the source system once again makes the BI team company heroes.