Someone asked me the other day: if I cannot currently run tests to improve data quality, is there anything else I can do to make my data better?
Now there’s a good question!
The obvious, easy, and somewhat lame answer is: well, why can’t you run tests? You’d be able to catch and prevent, like, 90% of your bad data! Maybe more!
I am easy and somewhat lame, but not that easy and lame, so I’ll ignore the obvious, easy and somewhat lame answer. Instead, let’s dive into the different aspects of “data quality as an afterthought”.
A colleague of mine in the Netherlands, while working on a similar request, came up with 5 stages, or different points at which you can do something. I like his stages, so I’ll follow his lead. Here they are:
- Amending, and
Because the question above explicitly rules out preventing, we shall ignore that one for this post.
We shall also pass over detecting, at least partly. On one hand, tools like ObservePoint, Hub’Scan, and countless others can do a good job at finding bad data. On the other hand, detecting is often very specific to your business.
If you are selling electric cars online, you can check that revenue on any given transaction should be between $30k and $200k, ruling out glitches and spikes.
If, on the other hand, you sell electronic parts, chewing gum, or wrapping paper, revenue on a transaction might be small, and the upper limit might be EUR5000.
Two different businesses, two different detection thresholds. Figuring out what you need to detect flawed data is going to be an interesting project.
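To make that a bit more tangible, here is a minimal sketch of such a range check. The names `REVENUE_LIMITS` and `isPlausibleRevenue` are made up for this post, and the lower limit of 0 for the small-goods shop is my assumption; only the other numbers come from the two examples above.

```javascript
// Hypothetical per-business revenue limits; figures taken from the examples above.
const REVENUE_LIMITS = {
  electricCars: { min: 30000, max: 200000 }, // USD
  smallGoods: { min: 0, max: 5000 },         // EUR, lower limit is an assumption
};

// Returns true when the revenue of a single transaction looks plausible.
function isPlausibleRevenue(revenue, limits) {
  return Number.isFinite(revenue) && revenue >= limits.min && revenue <= limits.max;
}

console.log(isPlausibleRevenue(45000, REVENUE_LIMITS.electricCars)); // true
console.log(isPlausibleRevenue(45000, REVENUE_LIMITS.smallGoods));   // false
```

The same transaction is fine for one business and a glitch for the other, which is why the limits live in configuration rather than in the check itself.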
That leaves us with patching, hiding, amending, and correcting.
Patching, in the context of this post, shall mean changing, masking, re-routing, or dropping of data before it arrives in the Analytics processing queue.
There are four points where you can patch.
- During collection of data, i.e. on the page, or in your tag manager
- before the tracking call is sent, i.e. in doPlugins,
- at the inlet of data using Processing Rules, and
- at the inlet of data using VISTA Rules.
The crucial part is that detection has to happen in real time, on the individual tracking call. The acceptable range of revenue above is a good example. The flip side is that if an error cannot be detected on an isolated tracking call, then we cannot patch it.
If you want to decide between the four points above, ask yourself the following questions:
- Do I have access to a JS developer and tag management, i.e. can I quickly change the tracking on my site? If so, you should probably patch on the page, or in doPlugins. Otherwise, you need to patch server-side, i.e. use Processing or VISTA Rules.
- Is the patch needed relatively simple? If so, chances are Processing Rules can do it. Otherwise, you need VISTA Rules.
Processing Rules or browser-side patching are usually preferable to VISTA Rules. Building and deploying VISTA Rules is always an engagement with my colleagues from the Engineering Services department. As such, they come with a fee, and they take some time to put into place.
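For the browser-side variant, a patch could look roughly like the sketch below. It does not drop anything; it only flags a purchase whose revenue is out of range, so the call can be segmented later. The helper names, the use of `eVar50`, and the limits are all hypothetical. I am also assuming the usual products syntax (`category;product;quantity;price`, products separated by commas) with the price field holding the total for that product.

```javascript
// Sum the price fields of a products string:
// "category;product;quantity;price[;events[;eVars]]", products separated by commas.
function revenueFromProducts(products) {
  return String(products || "")
    .split(",")
    .filter(Boolean)
    .reduce(function (sum, product) {
      const price = parseFloat(product.split(";")[3]);
      return sum + (Number.isFinite(price) ? price : 0);
    }, 0);
}

// Flag a purchase call whose revenue is outside the accepted range.
// `s` is the tracking object; eVar50 is a hypothetical, otherwise unused variable.
function flagSuspectRevenue(s, limits) {
  const events = String(s.events || "").split(",");
  if (events.indexOf("purchase") === -1) {
    return s; // not a purchase, nothing to check
  }
  const revenue = revenueFromProducts(s.products);
  if (revenue < limits.min || revenue > limits.max) {
    s.eVar50 = "suspect-revenue";
  }
  return s;
}

// Assumed wiring for an AppMeasurement-style setup:
// s.usePlugins = true;
// s.doPlugins = function (s) { flagSuspectRevenue(s, { min: 30000, max: 200000 }); };
```

Note that the check only looks at the one tracking call it is handed, which is exactly the constraint described above.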
Data can be hidden away, in the sense that you can use segments to mask it in the reports.
The segmentation can be on demand (users can choose to use the segment), or mandatory (users have no choice, this uses the “Virtual Report Suite” feature).
An Analytics administrator would create a segment (maybe called “Cleaned Data”), and she would keep that segment up to date. If needed, she would assign that segment as a “Virtual Report Suite” for all or for specific users.
The technical and organisational part of hiding is pretty straightforward.
Now, how does your Analytics administrator define the segment? How can we see whether data is bad, just by looking at the data?
Since you are reading this blog, I’m assuming you are a developer. So let me take a step back.
Segmentation is a feature in Analytics, which works on the data that Analytics has tracked. For you, that means your friendly Marketer is going to ask you to track data that will help with the segmentation if a situation should occur.
The first thing that comes to mind: track a version (or use the snippet from Useful Data Elements). The next thing would be: try to figure out whether the bad data is tracked alongside something else that can be identified. A good example would be if your bad data is only tracked on a specific page. If that was the case, your friendly marketer could very easily exclude that page by pageName, and the faulty data would automatically be hidden.
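Tracking a version can be as small as the sketch below. The function name, the `tracking:` prefix, and the choice of `eVar49` are assumptions for illustration; use whatever spare variable your friendly marketer gives you.

```javascript
// Copy the version of the tracking code into a spare variable on every call.
// eVar49 is a hypothetical choice; `s` is the tracking object.
function stampTrackingVersion(s, version) {
  s.eVar49 = "tracking:" + version;
  return s;
}

// Example with a made-up version number:
console.log(stampTrackingVersion({}, "1.4.2").eVar49); // "tracking:1.4.2"
```

If a release later turns out to be faulty, a segment that excludes its version value is enough to hide everything that release tracked.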
One more thing on hiding: segmentation happens at tracking call level.
If there is bad data on a tracking call that records a purchase, and if that bad data only concerns one of the products that the customer bought, know that you can hide the whole purchase, but not the individual product.
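A toy model of that behaviour, with made-up data: each object stands for one tracking call, and the filter plays the role of a “Cleaned Data” segment. None of this is how Analytics works internally; it only illustrates the granularity.

```javascript
// Simplified model: one object per tracking call (hit). All values are invented.
const exampleHits = [
  {
    pageName: "checkout:thanks",
    products: [
      { name: "Cable", revenue: 12 },
      { name: "Broken SKU", revenue: 999999 }, // the bad data
    ],
  },
  { pageName: "checkout:thanks", products: [{ name: "Gum", revenue: 3 }] },
];

// A "Cleaned Data" style filter: keep only hits without a suspicious product.
function cleanedHits(allHits, maxProductRevenue) {
  return allHits.filter(function (hit) {
    return hit.products.every(function (p) {
      return p.revenue <= maxProductRevenue;
    });
  });
}

// Total revenue across all products of the given hits.
function totalRevenue(someHits) {
  return someHits.reduce(function (sum, hit) {
    return sum + hit.products.reduce(function (s, p) { return s + p.revenue; }, 0);
  }, 0);
}

console.log(totalRevenue(cleanedHits(exampleHits, 5000))); // 3: the valid 12 for "Cable" is hidden too
```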
In my experience, hiding often works alongside patching, with patching setting some information which can later be used to segment / hide.
Sometimes, hiding data is not necessary, because you can amend or modify what Analytics shows to your stakeholders. Sometimes, you’ll combine hiding and amending to get to your goal.
In order to amend data, you can use three Analytics features:
- Derived Metrics,
- Classifications, and
- Transactional Data Sources
The first one allows you to add or recalculate missing data. My favourite and easy example is the “Internal Searches” event that was missing on the search results page. You can easily create a Derived Metric based on Page Views on that page, which can almost completely replace the “Internal Searches” metric.
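In Analytics you would build this in the Derived Metrics UI, not in code, but the logic behind it is simple enough to sketch. The report rows and page names below are invented.

```javascript
// Hypothetical page-level report rows, as they might come out of an export.
const pageReport = [
  { pageName: "home", pageViews: 1200 },
  { pageName: "search:results", pageViews: 340 },
  { pageName: "product:detail", pageViews: 780 },
];

// Stand-in for the missing "Internal Searches" event:
// Page Views restricted to the search results page.
function derivedInternalSearches(rows, searchPageName) {
  return rows
    .filter(function (row) { return row.pageName === searchPageName; })
    .reduce(function (sum, row) { return sum + row.pageViews; }, 0);
}

console.log(derivedInternalSearches(pageReport, "search:results")); // 340
```

Presumably the replacement is only “almost” complete because things like reloads of the results page count as Page Views without being new searches.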
Classifications can sometimes be used to create or repair dimensional data, as long as there is some other data point in the system that allows you to add a classification on top of it.
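Conceptually, a classification is a lookup from a value that was tracked correctly to the value you wish you had tracked. A minimal sketch, with hypothetical product IDs and categories, and `"unclassified"` as an assumed fallback label:

```javascript
// Hypothetical lookup: correctly tracked key -> repaired dimension value.
const categoryByProductId = {
  "sku-1001": "Cables",
  "sku-2002": "Chewing Gum",
};

// Returns the classified value, or a fallback label when the key is unknown.
function classify(keyValue, lookup) {
  return Object.prototype.hasOwnProperty.call(lookup, keyValue)
    ? lookup[keyValue]
    : "unclassified";
}

console.log(classify("sku-1001", categoryByProductId)); // "Cables"
console.log(classify("sku-9999", categoryByProductId)); // "unclassified"
```

The catch is visible right away: without a reliable key on the tracking call, there is nothing to hang the repaired value on.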
And transactional data sources are awesome if you need to repair order data retrospectively, but within 90 days. If you have a transaction ID for an order that was tracked badly, you can upload data against the ID, which will make most metrics look good again.
The one situation where a Transactional Data Source doesn’t help is when the bad data only affects one out of a lot of products. You can fix data at order level, but if your stakeholders break down into products, they’ll still see bad data. In those cases, hiding might be a good idea.
As a general rule, and if it is possible to amend the data, I would always rate amending higher than hiding.
The last of our stages is correcting, or changing the data in Analytics.
It is very likely that in the past, you have been told Analytics data cannot be changed.
That is very close to the truth, but not exactly correct.
There is a team at Adobe called “Engineering Services”. Those guys have access to Analytics in ways no one else has. They can do what can only be called surgery on data, and one of the things they can do is change data.
As far as I know, they export data, modify it, mask the original, then upload the changed data.
That sounds awesome, and it is awesome in the truest sense of the word: inspiring awe.
It also comes with a price tag, a pretty impressive effort, and an effect called “scarring”, where the data at the boundaries of the “incision” can be affected.
In my experience, correcting very rarely makes sense.
For lack of a better term, I called these 5 “stages”.
“Stages” suggest some sort of chronology, or sequence.
In reality, that is not how this works.
Instead, you will always want to combine patching with amending and/or hiding, and sometimes you’ll even throw some correcting into the mix.
I would opine that amending should trump hiding, that those two should be used over correcting wherever possible, and that patching is a must for any case of bad data that is still ongoing when you detect it.
Hiding is often a good measure to be used alongside patching, and if something can be patched with Processing Rules, chances are you can somehow amend existing data using Classifications.
Since you are a developer, your heart beats for fixing things on the page, in the TMS, or in doPlugins, I presume. I agree.
Technical reasons aside, patching at that level means someone with enough technical understanding (you!) will look at what happens on the site. It is very likely that while doing that, you will find the root cause, and that is still the most important find.
And, to round it all up: preventing it from happening makes all of this obsolete, which is why you really should devote resources to it.