One aspect of technology is that everybody can use designations just the way they see fit. We saw that in the article on “variables” in SiteCatalyst, and today we show another example: Data Warehouse.
We have mentioned before that Adobe Analytics collects data, processes it and then aggregates it for convenient reporting. This is the case for SiteCatalyst (“Digital Analytics”) and Discover (“Ad-hoc Analytics”).
So what if you ever need very specific data, like a log of hit-level data or maybe a report specific to one single email campaign or even a single visitor? What if someone at Adobe wants to look at what data was actually received on our tracking servers for troubleshooting purposes?
For those use cases the Analytics solution includes Data Warehouse (“Data Repository”).
I like the name “Data Repository” much better than Data Warehouse, but it has been known by the latter for years now and probably will be for some time.
Technically, Data Warehouse (DWH) records all the data that comes in on a tracking call in a simple log. It does this after Processing Rules and VISTA Rules have been applied. No further processing takes place; the data is simply stored away safely for access at any point. DWH usually retains the data for 36 months, but retention depends on the contract, so in your case it could be longer or shorter.
DWH also has a basic interface that allows you to query the stored data. You can select metrics and dimensions, apply date ranges and segments. The resulting tab-separated file will be sent via email or delivered to an FTP server.
What can you do with that data?
Three big themes come to mind:
- Create complex reports – reports beyond the reach of SiteCatalyst or even Discover. Data Warehouse can correlate as many dimensions (“variables”) as you need and apply as many metrics as you like. Those reports can also be longer than 50,000 rows.
- Push hit-level, raw data out via FTP for processing in your own ETL or data warehouse solution.
- Generate debug logs that Adobe employees (mainly those in ClientCare) can use to help you troubleshoot data collection issues.
If you, as a developer, ever come into contact with DWH, we reckon it would be for use case 2, the output of raw data for further processing. Let’s look at that then.
Data Warehouse can be instructed to send a daily feed of data, aptly called a “Data Feed”.
It usually starts processing data for a given day once that day is over, i.e. just after midnight. It assembles all the data for the day and then sends it to an FTP server.
Once it has completed the FTP send, it will create a file with the same name as the data feed file but a “.fin” extension. You can poll for the “.fin” file on your end and start whatever ETL process you need once it exists, because its existence means the data push has been completed.
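The polling step described above can be sketched like this. Note that the directory and file names are hypothetical; the actual feed file naming depends on how your feed is configured:

```python
import time
from pathlib import Path

def wait_for_fin(feed_dir: Path, feed_name: str,
                 poll_seconds: int = 60, timeout: int = 6 * 3600) -> bool:
    """Poll for the '.fin' marker that signals the feed upload is complete.

    Returns True once the marker exists, False if the timeout is reached.
    """
    # The marker has the same base name as the feed file, with a .fin extension.
    fin_file = feed_dir / (Path(feed_name).stem + ".fin")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if fin_file.exists():
            return True
        time.sleep(poll_seconds)
    return False
```

In production you would likely poll the FTP location itself (or have the FTP server land files in a local directory, as assumed here) and kick off your ETL job as soon as `wait_for_fin` returns True.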
The data comes in a tab separated file with up to a couple hundred columns. Each row represents exactly one hit.
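A minimal reader for such a file might look like this. The sketch assumes the column names arrive in a separate tab-separated lookup file (named `column_headers.tsv` here for illustration) rather than as a header row in the data file itself – check your own feed configuration:

```python
import csv

def read_feed(headers_path: str, data_path: str):
    """Yield one dict per hit, keyed by the feed's column names."""
    # Column names come from a separate lookup file, one tab-separated line.
    with open(headers_path, newline="", encoding="utf-8") as f:
        columns = next(csv.reader(f, delimiter="\t"))
    # The data file itself is plain TSV, one row per hit.
    with open(data_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield dict(zip(columns, row))
```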
As an example, let’s look at a simple tracking call and check what it will look like in the data feed.

As we can see when looking at the tracking call in a debugger, this call transports the pageName, props, eVars and three events: prodView, event2 and event12. As an example, let’s look at eVar6.
Now in the data feed, we would see the following:

We have deliberately omitted a whole bunch of columns and rows here to make it easier to see what you can expect.
In the above example, the event_list column shows the values 2, 211, 201, 100, 101, 105 and 000. Those are the prodView event (2), event2 (201), event12 (211) and “instance of eVar6” (105), among others. The latter tells us that a value has been set in eVar6, and we can also see that in the eVar6 column.

See how there are two columns for eVar6? We have eVar6 and post_eVar6. The former is what the system received in the tracking call, while the latter is what it sees after applying Processing Rules, VISTA Rules and the visitor profile. In other words, due to processing and because eVars are persistent, post_eVar6 can have a value even if no value was sent in.
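Decoding an event_list cell is straightforward string work. In the sketch below, only the prodView code (2) is a fixed core code; the codes for custom events and for “instance of eVar” entries vary by report suite, so the mapping is illustrative, taken from the example above:

```python
# Illustrative mapping of numeric event codes to names; custom event and
# "instance of eVar" codes differ per report suite.
CODE_NAMES = {
    "2": "prodView",
    "201": "event2",
    "211": "event12",
    "105": "instance of eVar6",
}

def decode_event_list(cell: str) -> list:
    """Turn a comma-separated event_list cell into readable event names."""
    codes = [c.strip() for c in cell.split(",") if c.strip()]
    return [CODE_NAMES.get(c, "code " + c) for c in codes]
```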
Let’s look at an example where that happens:

We can see here that on the first hit, the visitor came in via a link with a tracking code (“test_cmp”). That tracking code was written into the s.campaign variable (“v0” in the debugger), which in this case had been set to expire after a month.
On the second hit that we see, the visitor looked at a blog article, which the system tracks in event7.
The data feed shows this:

Now because the post_campaign column on the second hit contained the tracking code, the event (event7, or code 206) will be counted against this code, and the report reflects exactly that.
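That attribution can be replicated over a feed. Here is a minimal sketch that counts event7 occurrences per persisted tracking code; it assumes rows are dicts keyed by the feed's column names and that event7's numeric code is 206, as in the example above:

```python
from collections import Counter

def event7_by_campaign(rows) -> Counter:
    """Count hits carrying event7 (code 206) per post_campaign value."""
    counts = Counter()
    for row in rows:
        codes = [c.strip() for c in row.get("event_list", "").split(",")]
        if "206" in codes:
            counts[row.get("post_campaign", "")] += 1
    return counts
```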
One big thing to keep in mind when you work with Data Feeds: there are no metrics! The concept of a metric doesn’t really exist at hit level.
We know customers who use their own ETL process to try and calculate “Visits” and “Visitors” metrics based on the visid_high and visid_low columns, which store the visitor ID that SiteCatalyst also uses for the same purpose. It is possible to get close, but unlikely that you will ever get exactly the same numbers.
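To illustrate why the numbers only come close: a typical home-grown approach groups hits by the combined visitor ID and starts a new “visit” after a 30-minute gap, as sketched below. SiteCatalyst's own visit logic also considers other signals, which is exactly where the discrepancies creep in:

```python
from collections import defaultdict

VISIT_GAP = 30 * 60  # seconds of inactivity before a new visit starts

def count_visits(hits) -> int:
    """Approximate a Visits count from (visid_high, visid_low, unix_ts) tuples."""
    by_visitor = defaultdict(list)
    for high, low, ts in hits:
        by_visitor[(high, low)].append(ts)
    visits = 0
    for timestamps in by_visitor.values():
        timestamps.sort()
        last = None
        for ts in timestamps:
            # A gap longer than VISIT_GAP starts a new visit for this visitor.
            if last is None or ts - last > VISIT_GAP:
                visits += 1
            last = ts
    return visits
```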