Anomaly Detection using the Reporting API

While we’re looking at the Reporting API and little-known features, let’s use the new anomaly detection that was introduced in February 2013. Like real-time reporting, it is currently only available in the Reporting API, giving developers a real edge!

What is it?

The Reporting API can now calculate upper and lower bounds for the metric you request as well as a forecast.

I cannot explain the maths behind this, but I can point you to the documentation, which spells out the algorithms being used.

To sum it up: the “upper_bound” value basically means “we are 95% confident that the actual value will be below this”, and the “lower_bound” value means “we are 95% confident that the actual value will be above this”.

If your traffic is high and relatively stable, you can read the upper and lower bounds as “my metric should usually be within these bounds”, or, turned around, “values outside the bounds are anomalies! I should look at them!”

Maybe an image says more than words:

[Screenshot]
Visits Report with Anomaly Detection
The red line represents the Visits metric. These are the visits since July 7th this year on this blog. The red line ends on the 16th, with a (so far) pretty low number.

The green line is the “upper bound” calculated by the anomaly detection algorithm. You can see that the red line passes above it a couple of times. The really big one last week was due to a couple of retweets, specifically one mention by Adam Greco that led to some more RTs and a lot of traffic.

The lower bound is invisible here. The value is 0 for my blog across the whole period, which makes sense, I guess. Just look at the traffic on weekends.

The light blue line represents what the system forecasts. At roughly 13 Visits per day, I think it is slightly on the low end: if I sum up the whole period, the forecast says 920 Visits whereas I actually had 1220.

My guess is that for my specific (low) traffic pattern, the algorithms have a tendency to forecast low that they wouldn’t have for proper, high-traffic sites.

Note: The bounds and forecast lines continue past this day! I pulled the report on the 16th, but I asked for data until the 30th. There are obviously no Visits after today, but you can see the upper bound and forecast until the end of the month! How cool is that?

How do I use it?

If you know how to pull a report out of the Reporting API, you know all you need.

Anomaly detection can be switched on by setting the “anomalyDetection” flag in the reportDescription to — wait for it — “true”.

This works for “trended” and “overtime” reports, not for “ranked” ones. It works with one or more metrics, but only with “dateGranularity” set to “day”.

Let’s look at a reportDescription:

{
	"reportDescription":{
		"metrics":[ {
			"id":"visits"
		} ],
		"anomalyDetection":true,
		"dateTo":"2013-09-30",
		"locale":"en_US",
		"dateGranularity":"day",
		"dateFrom":"2013-07-07",
		"reportSuiteID":"jexnerweb4dev"
	}
}

It really is as easy as that one additional line.
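If you want to try it end to end, here is a minimal Python sketch of the queue-and-fetch flow. The endpoint URL, the method names (Report.QueueOvertime, Report.GetStatus, Report.GetReport) and the WSSE details are my assumptions based on the 1.3 REST API, so verify them against the documentation for your environment; the credentials are obviously placeholders.

import base64
import hashlib
import json
import time
import urllib.request
import uuid
from datetime import datetime, timezone

# Assumptions: 1.3 REST endpoint and WSSE authentication; adjust for your data centre.
ENDPOINT = "https://api.omniture.com/admin/1.3/rest/"
USERNAME = "user:company"       # placeholder
SECRET = "your-shared-secret"   # placeholder

def wsse_header():
    # Standard WSSE token: digest = base64(sha1(nonce + created + secret))
    nonce = uuid.uuid4().hex
    created = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    digest = base64.b64encode(
        hashlib.sha1((nonce + created + SECRET).encode()).digest()).decode()
    return ('UsernameToken Username="{0}", PasswordDigest="{1}", '
            'Nonce="{2}", Created="{3}"'.format(
                USERNAME, digest,
                base64.b64encode(nonce.encode()).decode(), created))

def call(method, payload):
    # POST the JSON payload to the given API method and parse the JSON response.
    req = urllib.request.Request(
        ENDPOINT + "?method=" + method,
        data=json.dumps(payload).encode(),
        headers={"X-WSSE": wsse_header(), "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as response:
        return json.load(response)

report_description = {
    "reportDescription": {
        "metrics": [{"id": "visits"}],
        "anomalyDetection": True,
        "dateFrom": "2013-07-07",
        "dateTo": "2013-09-30",
        "dateGranularity": "day",
        "reportSuiteID": "jexnerweb4dev",
    }
}

# Queue the overtime report, wait until it is done, then fetch the data blocks.
report_id = call("Report.QueueOvertime", report_description)["reportID"]
while call("Report.GetStatus", {"reportID": report_id})["status"] != "done":
    time.sleep(2)
days = call("Report.GetReport", {"reportID": report_id})["report"]["data"]

The days list at the end is what the snippets further down work with.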

I guess what I’m saying is: if you are already pulling overtime or trended reports with a granularity of “day”, you really should add anomaly detection to them!

If you output a graph, you should really overlay the three lines you get out of anomaly detection: upper bound, lower bound and forecast.
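A minimal matplotlib sketch, using the days list from the sketch above (each entry is structured like the example blocks further down; index 0 picks the first requested metric):

import matplotlib.pyplot as plt

# One value per requested metric and day; index 0 is the first metric (Visits here).
labels = [d["name"] for d in days]
actual = [float(d["counts"][0]) for d in days]
upper = [float(d["upper_bounds"][0]) for d in days]
forecast = [float(d["forecasts"][0]) for d in days]

plt.plot(labels, actual, color="red", label="Visits")
plt.plot(labels, upper, color="green", label="upper bound")
plt.plot(labels, forecast, color="lightblue", label="forecast")
plt.xticks(rotation=90, fontsize=6)
plt.legend()
plt.show()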

And if you have time, add some kind of alert, colour coding or email to the mix: when the metric goes above the upper bound or below the lower bound, trigger something! Maybe it’s just a normal outlier, but if there is something weird happening, people will want to know.
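For example, a hypothetical little helper that scans the days list and yields every day where the actual value leaves the band:

def find_anomalies(days, metric_index=0):
    # Yield (day name, actual value, (lower, upper)) for every out-of-bounds day.
    for d in days:
        actual = float(d["counts"][metric_index])
        lower = float(d["lower_bounds"][metric_index])
        upper = float(d["upper_bounds"][metric_index])
        if actual < lower or actual > upper:
            yield d["name"], actual, (lower, upper)

for name, actual, (lower, upper) in find_anomalies(days):
    # Replace the print with whatever fits: colour coding, an email, a chat ping, ...
    print("{0}: {1:g} outside [{2:g}, {3:g}]".format(name, actual, lower, upper))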

Can we have an example, please?

Sure.

Let’s push it a bit and use two metrics in an overtime report, like so:

{
	"reportDescription":{
		"metrics":[
			{"id":"pageviews"},
			{"id":"visits"}
		],
		"anomalyDetection":true,
		"dateTo":"2013-09-30",
		"locale":"en_US",
		"dateGranularity":"day",
		"dateFrom":"2013-07-07",
		"reportSuiteID":"jexnerweb4dev"
	}
}

We’re essentially pulling two values (PVs & Visits) for each day from July 7th until the end of September. That gives us 86 days’ worth of data, or 86 blocks like the following:

{
	"name":"Wed. 14 Aug. 2013",
	"year":2013,
	"month":8,
	"day":14,
	"counts":["66","41"],
	"upper_bounds":["74.4432","33.7564"],
	"lower_bounds":["0","0"],
	"forecasts":["27","13"]
}

Since we requested two metrics (PVs & Visits), all the values come in pairs.

We had 66 PVs on the 14th of August as well as 41 Visits. Based on the previous days (the “training period”), the system expected the metrics to be between 0 and 74.4 (PVs) and between 0 and 33.8 (Visits). It also forecast 27 PVs and 13 Visits for that day. Note that the 41 Visits are above the upper bound of 33.8, so the system would flag this day as an anomaly for Visits.
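If you prefer one plain series per metric, you can unzip the pairs again; a small sketch, assuming the days list from earlier:

# Column i belongs to the i-th metric in the reportDescription.
metrics = ["pageviews", "visits"]
counts = {m: [float(d["counts"][i]) for d in days]
          for i, m in enumerate(metrics)}
# counts["visits"] is now the plain Visits series, ready for graphing or alerting.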

Let’s look at a Friday in my data set. Fridays are very average.

{
	"name":"Fri. 16 Aug. 2013",
	"year":2013,
	"month":8,
	"day":16,
	"counts":["23","16"],
	"upper_bounds":["74.4432","33.7564"],
	"lower_bounds":["0","0"],
	"forecasts":["27","13"]
}

Not too bad — 23 PVs actual versus 27 forecast, and 16 Visits actual versus 13 forecast.

As I said: if you have high traffic, your numbers should be a lot better.
