Bad Data – Prevention & other Aspects

Someone asked me the other day: if I cannot currently run tests to improve data quality, is there anything else I can do to make my data better?

Now there’s a good question!

The obvious, easy, and somewhat lame answer is: well, why can’t you run tests? You’d be able to catch and prevent, like, 90% of your bad data! Maybe more!

I am easy and somewhat lame, but not that easy and lame, so I’ll ignore the obvious, easy and somewhat lame answer. Instead, let’s dive into the different aspects of “data quality as an afterthought”.

Stages

A colleague of mine in the Netherlands, while working on a similar request, came up with six stages, or different points at which you can do something. I like his stages, so I’ll follow his lead. Here they are:

  1. Detecting,
  2. Preventing,
  3. Patching,
  4. Hiding,
  5. Amending, and
  6. Correcting

Because the question above explicitly rules out preventing, we shall ignore that one for this post.

We shall also pass over detecting, at least partly. On one hand, tools like ObservePoint, Hub’Scan, and countless others can do a good job of finding bad data. On the other hand, detecting is often very specific to your business.

If you are selling electric cars online, you can check that revenue on any given transaction is between $30k and $200k, ruling out glitches and spikes.

If, on the other hand, you sell electronic parts, chewing gum, or wrapping paper, revenue on a transaction might be small, and the upper limit might be EUR5000.

Two different businesses, two different detection thresholds. Figuring out what you need in order to detect flawed data is going to be an interesting project.
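
To make this concrete, here is a minimal sketch of such a range check in JavaScript. The thresholds and the function name are my own assumptions; you would replace them with whatever your business considers plausible:

    // Sketch: plausibility check for a single transaction's revenue.
    // The thresholds are hypothetical; adjust them to your business.
    var REVENUE_MIN = 0.01; // anything at or below zero is suspicious
    var REVENUE_MAX = 5000; // e.g. EUR 5000 for small-ticket retail

    function revenueLooksPlausible(revenue) {
        var value = parseFloat(revenue);
        // Reject non-numeric values as well as out-of-range ones.
        return !isNaN(value) && value >= REVENUE_MIN && value <= REVENUE_MAX;
    }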

That leaves us with patching, hiding, amending, and correcting.

Patching

Patching, in the context of this post, shall mean changing, masking, re-routing, or dropping data before it arrives in the Analytics processing queue.

[Figure: Processing of Data in the Analytics Cloud]
There are four points where you can patch.

  1. during collection of data, i.e. on the page or in your tag manager,
  2. before the tracking call is sent, i.e. in doPlugins,
  3. at the inlet of data, using Processing Rules, and
  4. at the inlet of data, using VISTA Rules.

Processing Rules are limited to changing data (“set revenue to 0 if it is > 2000000”), but the tracking on the page, code in doPlugins, as well as VISTA Rules can in principle do anything you want.

The crucial part is that detection has to happen in real time; the acceptable range of revenue above is a good example of a check that can run on a single call. The flip side is that if an error cannot be detected on an isolated tracking call, then we cannot patch it.
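
To illustrate, here is a minimal sketch of patching in doPlugins, reusing the hypothetical revenueLooksPlausible() check from above. The single-product parsing and the eVar number are assumptions; a real implementation would handle multiple products and use whatever variable your marketer has reserved:

    // Sketch: neutralise implausible revenue before the call is sent.
    s.usePlugins = true;
    s.doPlugins = function(s) {
        if (s.products && s.events && s.events.indexOf("purchase") > -1) {
            // Naive single-product case: products = "category;name;units;revenue"
            var parts = s.products.split(";");
            if (parts.length >= 4 && !revenueLooksPlausible(parts[3])) {
                parts[3] = "0"; // drop the bad revenue rather than report it
                s.products = parts.join(";");
                s.eVar42 = "patched"; // hypothetical flag, useful for hiding later
            }
        }
    };

Note how the sketch also sets a flag; we will come back to that idea in the section on hiding.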

If you want to decide between the four points above, ask yourself the following questions:

  • Do I have access to a JS developer and tag management, i.e. can I quickly change the tracking on my site? If so, you should probably patch on the page, or in doPlugins. Otherwise, you need to patch server-side, i.e. use Processing or VISTA Rules.
  • Is the patch you need relatively simple? If so, chances are Processing Rules can do it. Otherwise, you need VISTA Rules.

Processing Rules or browser-side patching are usually preferable to VISTA Rules. Building and deploying VISTA Rules always means an engagement with my colleagues from the Engineering Services department. As such, they come with a fee, and they take some time to put into place.

Hiding

Data can be hidden away, in the sense that you can use segments to mask it in the reports.

The segmentation can be on demand (users can choose to use the segment) or mandatory (users have no choice; this uses the “Virtual Report Suite” feature).

An Analytics administrator would create a segment (maybe called “Cleaned Data”), and she would keep that segment up to date. If needed, she would use that segment to build a “Virtual Report Suite” for all or for specific users.

The technical and organisational part of hiding is pretty straightforward.

Now, how does your Analytics administrator define the segment? How can we see whether data is bad, just by looking at the data?

Since you are reading this blog, I’m assuming you are a developer. So let me take a step back.

Segmentation is a feature in Analytics that works on the data Analytics has tracked. For you, that means your friendly marketer is going to ask you to track data that will help with segmentation should a bad-data situation occur.

The first thing that comes to mind: track a version (or use the snippet from Useful Data Elements). The next thing would be to figure out whether the bad data is tracked alongside something else that can be identified. A good example would be if your bad data is only tracked on a specific page. If that is the case, your friendly marketer can very easily exclude that page by pageName, and the faulty data will automatically be hidden.
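
As a sketch, the version idea is a one-liner in your tracking code or doPlugins; the prop number and the version string are arbitrary placeholders:

    // Sketch: stamp every tracking call with the release that produced it,
    // so a bad release can later be isolated with a segment.
    s.prop12 = "release-42"; // hypothetical variable and value; bump on each deployment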

One more thing on hiding: segmentation happens at tracking call level.

If there is bad data on a tracking call that records a purchase, and if that bad data only concerns one of the products that the customer bought, know that you can hide the whole purchase, but not the individual product.

In my experience, hiding often works alongside patching, with patching setting some information which can later be used to segment / hide.

Amending

Sometimes, hiding data is not necessary, because you can amend or modify what Analytics shows to your stakeholders. Sometimes, you’ll combine hiding and amending to get to your goal.

In order to amend data, you can use three Analytics features:

  1. Derived Metrics,
  2. Classifications, and
  3. Transactional Data Sources

The first one allows you to add or recalculate missing data. My favourite and easy example is the “Internal Searches” event that was missing on the search results page. You can easily create a Derived Metric based on Page Views on that page, which can almost completely replace the “Internal Searches” metric.
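
Spelled out, such a Derived Metric would be defined roughly like this, with the page name being a placeholder for your actual search results page:

    Internal Searches (repaired) =
        Page Views, restricted to hits where pageName = "search results"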

Classifications can sometimes be used to create or repair dimensional data, as long as there is some other data point in the system that allows you to add a classification on top of it.
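
Conceptually, a classification upload is just a mapping table from the tracked key to the repaired, human-friendly values. The actual import file (“SAINT”) adds some header rows, and the keys and columns below are made up for illustration:

    Key        Product Name (classified)   Category (classified)
    SKU-123    Wrapping Paper, 5m          Stationery
    SKU-456    Chewing Gum, Mint           Food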

And Transactional Data Sources are awesome if you need to repair order data retrospectively, within a window of 90 days. If you have a transaction ID for an order that was tracked badly, you can upload data against the ID, which will make most metrics look good again.
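
Again conceptually, such an upload is one row per transaction ID carrying the corrected values. Column names and date format depend on the Data Sources template you set up; this is purely illustrative:

    Date          transactionID   Revenue   Units
    03/15/2017    ORDER-78912     49.99     1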

The one situation where a Transactional Data Source doesn’t help is when the bad data only affects one out of a lot of products. You can fix data at order level, but if your stakeholders break down by product, they’ll still see bad data. In those cases, hiding might be a good idea.

As a general rule, and if it is possible to amend the data, I would always rate amending higher than hiding.

Correcting

The last of our stages is correcting, or changing the data in Analytics.

It is very likely that in the past, you have been told Analytics data cannot be changed.

That is very close to the truth, but not exactly correct.

There is a team at Adobe called “Engineering Services”. Those guys have access to Analytics in ways no one else has. They can do what can only be called surgery on data, and one of the things they can do is change data.

As far as I know, they export data, modify it, mask the original, then upload the changed data.

That sounds awesome, and it is awesome in the truest sense of the word: inspiring awe.

It also comes with a price tag, a pretty impressive effort, and an effect called “scarring”, where the data at the boundaries of the “incision” can be affected.

In my experience, correcting very rarely makes sense.

Notes

For lack of a better term, I called these six “stages”.

“Stages” suggests some sort of chronology, or sequence.

In reality, that is not how this works.

Instead, you will always want to combine patching with amending and/or hiding, and sometimes you’ll even throw some correcting into the mix.

I would opine that amending should trump hiding, that those two should be used if possible over correcting, and that patching is a must for any case of bad data that is still ongoing when you detect it.

Hiding is often a good measure to be used alongside patching, and if something can be patched with Processing Rules, chances are you can somehow amend existing data using Classifications.

Since you are a developer, your heart beats for fixing things on the page, in the TMS, or in doPlugins, I presume. I agree.

Technical reasons aside, patching at that level means someone with enough technical understanding (you!) will look at what happens on the site. It is very likely that while doing that, you will find the root cause, and that is still the most important find.

And, to round it all up: preventing it from happening makes all of this obsolete, which is why you really should devote resources to it.

One thought on “Bad Data – Prevention & other Aspects”

  1. Leigh

    Great article! Many people are told that they are pouring concrete with Adobe Analytics and that their data can never be removed. Until recently I thought this was pretty much true, apart from some negative-data hacks, which sounded like a pretty dirty operation.
    I just recently found out that Engineering Services quite frequently work on projects to amend, correct, and reverse data.
    An example of when it does make sense, and where Engineering can really save the day: clients who have accidentally pushed staging-environment data into a production report suite. This can be corrected, even if it happened over an extended period of time. Engineering can also step in to scrub PII if a client has passed any into Analytics.
