Data Warehouse & Data Feeds

One aspect of technology is that everybody can use designations just the way they see fit. We saw that in the article on “variables” in SiteCatalyst, and today we show another example: Data Warehouse.

We have mentioned before that Adobe Analytics collects data, processes it and then aggregates it for convenient reporting. This is the case for SiteCatalyst (“Digital Analytics”) and Discover (“Ad-hoc Analytics”).

So what if you ever need very specific data, like a log of hit-level data or maybe a report specific to one single email campaign or even a single visitor? What if someone at Adobe wants to look at what data was actually received on our tracking servers for troubleshooting purposes?

For those use cases the Analytics solution includes Data Warehouse (“Data Repository”).

I like the name “Data Repository” much better than Data Warehouse, but it has been known by the latter for years now and probably will be for some time.

Data Warehouse

Technically, Data Warehouse (DWH) records all the data that comes in on a tracking call in a simple log. It does this after Processing Rules and VISTA Rules have been applied. No further processing takes place, the data is just stored away safely for access at any point. Usually, DWH stores the data for 36 months, but it depends on the contract, so in your case it could be longer or shorter.

DWH also has a basic interface that allows you to query the stored data. You can select metrics and dimensions, apply date ranges and segments. The resulting tab-separated file will be sent via email or delivered to an FTP server.

What can you do with that data?

Three big themes come to mind:

  1. Create complex reports – complex beyond the reach of SiteCatalyst or even Discover. Data Warehouse has the ability to correlate as many dimensions (“variables”) and apply as many metrics as you need. Those reports can also be longer than 50000 lines.
  2. Push hit-level, raw data out via FTP for processing in your own ETL or data warehouse solution.
  3. Generate debug logs that Adobe employees (mainly those in ClientCare) can use to help you troubleshoot data collection issues.

If you, as a developer, ever come into contact with DWH, we reckon it would be for use case 2, the output of raw data for further processing. Let’s look at that then.

Data Feeds

Data Warehouse can be instructed to send a daily feed of data, aptly called a “Data Feed”.

It usually starts processing data for a given day once the day is over, i.e. just after midnight. It assembles all data for the day then sends it to an FTP server.

Once it has completed the FTP send, it will create a file with the same name as the data feed file but a “.fin” extension. You can poll for the “.fin” file on your end and start whatever ETL process you need once it exists, because its existence means the data push has been completed.

The data comes in a tab separated file with up to a couple hundred columns. Each row represents exactly one hit.

As an example, let’s look at a simple tracking call and check what it will look like in the data feed.

[Screenshot]

Simple Tracking Call in Debugger

As we can see looking at the tracking call in a debugger, this call transports the pageName, props, eVars and three events ‐ prodView, event2 & event12. As an example, let’s look at eVar6.

Now in the data feed, we would see the following:

[Screenshot]

The Tracking Call in Data Feed

We have deliberately omitted a whole bunch of columns and rows here to make it easier to see what you can expect.

In the above example, the event_list column shows six distinct values: 2, 211, 201, 100, 101, 105 and 000. Those are the prodView event (2), event2 (201), event12 (211) and “instance of eVar6” (105), and some more. The latter tells us that a value has been set in eVar6, and we can also see that in the eVar6 column.

See how there are two columns for eVar6? We have eVar6 and post_eVar6. The former is what the system received in the tracking call, while the latter is what it sees after applying Procesing Rules, VISTA Rules and the visitor profile.

In other words, due to processing and because eVars are persistent, post_eVar6 can have a value even if no value was sent in.

Let’s look at an example where that happens:

[Screenshot]

2 Simple Tracking Calls in the Debugger

We can see here that on the first hit, the visitor came via a link with a tracking code (“test_cmp”). That tracking code was written into the s.campaign variable (“v0” in the debugger), which in this case had been set to expire after a month.

On the second hit that we see, the visitor looked at a blog article, which the system tracks in event7.

The data feed shows this:

[Screenshot]

2 Tracking Calls in Data Feed

Now because the post_campaign column on the second hit contained the tracking code, the event (event7 or 206) will be counted against this code. In the report, we can see this:

[Screenshot]

Campaign Report with event7

No Metrics!

One big thing to keep in mind when you work with Data Feeds: there are no metrics! The concept of a metric doesn’t really exist at hit level.

We know customers who use their own ETL process to try and calculate “Visits” and “Visitors” metrics based on the visid_high and visid_low columns which store the visitor ID also used in SiteCatalyst for the same purpose. It is possible to get close, but unlikely that you will ever get the exact same numbers.

(That is a rule for any tool in web analytics, actually. No two tools will ever show you the same numbers for a given scenario. On one hand, they all process data slightly differently. On the other hand, the Javascript-based tracking itself is prone to hickups and little inaccuracies.)

About

German expat living in Switzerland (formerly UK and France). Consultant and member of the Multi-Solutions Group at Adobe, working with the Digital Marketing Suite. Father of 4 girls.

Posted in Automation, Integration, Principles
6 comments on “Data Warehouse & Data Feeds
  1. […] a customer, you can extract the data back out using Data Feeds or a couple of other methods. The main thing with SaaS is, though, that you don’t have to. […]

    Like

  2. Jordan says:

    Thanks Jan, very helpful for our development team.

    Jordan.

    Like

  3. Manu says:

    Always useful to remind ourselves what are these for! Thanks Jan!

    Like

  4. […] scheduled (every day, every week and so on) and the data is either raw hit-level data (“Data Feed“) or partly aggregated (from Data […]

    Like

  5. […] is the point where data gets split: it goes into Data Warehouse (which means that Data Feeds contain what we have now), plus it is sent into a further processing […]

    Like

  6. […] Live Stream is a feature very similar to Data Feeds. It spits out raw data, hit per hit, containing whatever has been […]

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 1,398 other followers

%d bloggers like this: