Using Data Feeds for Debugging

When I wrote about debugging mobile apps some time ago, I mentioned that one way to get data out of SiteCatalyst / Adobe Analytics for troubleshooting is the so-called “Data Feeds” — flat files of raw data delivered via FTP, containing one line per hit sent.

BIG data

When you last looked at the URL of a tracking request, did you notice how long it was? A thousand characters is not uncommon, especially if you use Context Data and define namespaces and the like. The URL can get so long that IE chokes on it and doesn’t track at all! The dreaded 2048-character limit lurks… though that doesn’t happen frequently.

So think of thousands or millions of hits and how much data that would be if you just pulled it in raw format. Yup, a Data Feed file can get big. So big, in fact, that the system will not allow you to send more than a day’s worth of data in a single file. So big that the file is zipped (or gzipped) before it’s sent out. So big that you should closely watch your FTP server disk space!
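By the way, if you are comfortable on a command line (more on that in a minute), unpacking the compressed file before you work with it is a one-liner. A minimal sketch, with placeholder file names:

	$> unzip hits.zip
	$> gunzip hits.tsv.gz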

Every line in a Data Feed file can contain 500+ columns, each column representing one bit of data that has been or potentially could have been sent into Analytics.

If you send “DE” into s.prop2, for example, you will see a row somewhere in the Data Feed that contains “DE” in the column that represents that “variable”. The same is true for all values you send, with one notable exception: Context Data is NOT part of the Data Feed unless you have a Processing Rule that copies the data into a “variable”.

You are likely sick of buzz words, and I am sure “big data” is high on the list of terms that you and I would place on a bullshit bingo sheet. Possibly multiple times. Well, Data Feed files are big data, literally.

If in the past your strategy has been “just open the log file and scroll through it” (a strategy made popular in the 80s by emacs), I’m afraid you have to reconsider. Only a few editors can open files this size (Excel is certainly not one of them, though emacs is).

You need alternatives.

I have in the past worked with Unix and Linux systems and I’d like to suggest a couple of easy-to-use tools that will help you find the data you actually need in those big files.

If you’re on Unix, Linux, or OS X, you have access to these little gems on your command line. Should you be on Windows, feel free to download Cygwin; it comes with a “Cygwin Terminal” that includes all of these tools as well.

Search

Let’s start with simple search.

Use case: you ran a test and you want to see what the data looked like once it ended up in Analytics.

How do you find “your” data? Easiest would be to note down the IP address you had when you tested, then look for that IP address in the file.
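If you don’t remember what your public IP address is, one quick way to check the current one is to ask one of the “what is my IP” services right from the terminal. This assumes curl is installed and the service is reachable; any similar service will do:

	$> curl ifconfig.me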

So, open the file — woah, wait! Didn’t I say it was likely too big? Yup, too big for an editor.

But not too big for a “pager”. The pager is called more or less, depending on your system. Sometimes, both names work and it is really the same pager. Yes, those Unix people are jokers.

Try this command:

	$> less hits.tsv

The system will show you the first 25 or so lines of the file and you can scroll up and down with your cursor keys. Some versions allow you to scroll using the mouse and a scroll bar, but usually it’s cursor up and down.

Page Up and Page Down should also work as expected.

Try typing g or G. That should move to the top or bottom of the file, respectively.

So you could try and find your IP address just moving up and down, but there is an easier way: hit the / key and the lowest line of the screen will go blank, with just a “/” on the left. That is a search prompt!

You can now type your IP address, hit enter and the pager will try to find what you typed and move to that location.
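In our example, the search would look like this (the IP address is made up, obviously). Once you have a match, pressing n jumps to the next one:

	/222.111.5.2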

[Screenshot]
Less/More after a Successful Search
If the pager does not find your term, it’ll look like this:

[Screenshot]
Less/More after a Failed Search
Pretty easy to do, no?

Here’s an important key: q. This is how you exit less/more. Sometimes the ESC key works as well.

Let’s move on to more powerful tools and to the concept of “piping”!

Filter

In the Unix/Linux world, pretty much everything is a file. Some of these things do not really seem to be or behave like a file, one example being the screen that you see in front of you. But conceptually, even the screen is a file.

There are tools that work on files, but not like an editor, more like a black box. You put a file into the tool and it spits out a new, modified file on the other side. This is how most tools on a Unix/Linux command line work.

Now the powerful thing is that you can “redirect” that output, tell it to go somewhere specific.

In most cases, you will “redirect it into a file”, which essentially means saving it. But you can also use the output of one tool as the input of another one. We call it “piping the output” into some other tool.

Bear with me, we’ll get to an example soon. Just remember: a file goes in, the tool modifies it line by line, and the tool spits out the resulting lines, which you can redirect, say into a new file.

grep

One tool that helps searching is the incredibly simple, yet powerful grep.

In its easiest form, grep checks a file, line by line, and outputs all those lines that match your search term. Example — a file called “characters.tab” that looks like this:

	ID  Name         Short Name  Species
	1   September    Tem         Human
	2   Saturday     N/A         Marid
	3   A-Through-L  Ell         Wyverary
	4   Spoke        N/A         Taxicrab
	5   Aroostook    N/A         Ford Model A

If I run this command:

	$> grep em characters.tab

I’ll get back a single line:

	1   September    Tem         Human

If instead I run:

	$> grep s characters.tab

I get

	ID  Name         Short Name  Species
	5   Aroostook    N/A         Ford Model A

You what?

Well, the header line has an “s” at the end (“Species”), line 4 (“A-Through-L”) has no “s” at all, and lines 2, 3 & 5 do have “S” in them, but not a lowercase “s”. Line 6 makes it in because “Aroostook” contains one. You guessed it: grep is case-sensitive by default.

Run this one instead:

	$> grep -i s characters.tab

and you’ll get everyone but the Wyverary (a Wyvern who thinks his dad is a library, in case you were wondering).

You can of course look for longer search terms, or terms including special characters like the space character. Perfectly valid:

	$> grep "Model A" characters.tab

Back to debugging the Data Feed file. You know your IP address, so go grep for it:

	$> grep "222.111.5.2" hits.tsv

Note: the file won’t actually be called hits.tsv, but rather something like “jexnerinstructional.nonsense_2014-08-05.tsv”, a combination of report suite ID and the date of the data inside. But for this posting, I’ll call it hits.tsv, otherwise the lines would get too long.

The result should be a bunch of lines on your screen. Hard to read. Time to get back to “piping” and “redirect”.

	$> grep "222.111.5.2" hits.tsv > myhits.tab

This will put every line that contains my IP address into a new file called “myhits.tab”.
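A quick sanity check at this point: wc -l counts the lines in a file, so you can see right away how many hits matched.

	$> wc -l myhits.tab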

Note that there is a header file in the “lookup_data” folder that you can add for clarity:

	$> head -1 ../hits-lookup_data/column_headers.tsv > myhits.tab
	$> grep "222.111.5.2" hits.tsv >> myhits.tab

The first command takes the first line off the top of “column_headers.tsv” and redirects it into “myhits.tab”. The second looks for all lines containing the IP in “hits.tsv” and appends them (that’s what the “>>” does) to “myhits.tab”.

The new file “myhits.tab” might now be small enough to be opened in Excel for analysis, or in any other tool you’re happy with.
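By the way, grep happily searches more than one file at a time, so if you have several days’ worth of feeds lying around, a wildcard does the trick. grep then prefixes each match with the name of the file it came from; add -h if you’d rather not see that. A sketch, using the naming pattern from above:

	$> grep "222.111.5.2" jexnerinstructional.nonsense_2014-08-*.tsv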

Note: the “>” & “>>” symbols work with every single tool on a Unix/Linux system (or in an OS X or Cygwin terminal). In fact, it’s the shell itself that handles them, so the tools just have to output their results.

What if you want to filter lines for a specific IP address, but you also only want those that contain a specific value in one of the variables? You need to grep twice, so to speak.

	$> grep "222.111.5.2" hits.tsv | grep -i english

This construct will first go through the file and spit out all lines containing the IP, then those lines are fed into the second grep which only passes on those that contain the word “english” (case insensitive this time).

The “|” in this command is the “pipe” symbol. It tells the shell to take the output of the tool on the left and use it as input for the tool on the right. Pretty nifty, hm?

With grep and the pipe symbol, you can run some pretty complex searches and most of the time, you won’t need anything else.
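One more flag worth knowing in this context is -v, which inverts the match, i.e. it only passes on lines that do NOT contain the term. Chained together, that makes for quick “everything from my IP that mentions this but not that” filters (the search terms here are just placeholders):

	$> grep "222.111.5.2" hits.tsv | grep -i english | grep -v -i checkout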

A pretty useful thing to do is to pipe the result of a grep into the pager, like so:

	$> grep "222.111.5.2" hits.tsv | less

No need to save this, but you would like to be able to scroll and move around, wouldn’t you? You’re welcome.

Select Columns

The Data Feed files potentially contain all possible “variables”. That is a lot of columns. And most of the time, you only want to look at a couple of them. That’s where cut comes in.

Take the “characters.tab” file and run:

	$> cut -f2,4 characters.tab

The result will be:

	Name         Species
	September    Human
	Saturday     Marid
	A-Through-L  Wyverary
	Spoke        Taxicrab
	Aroostook    Ford Model A

I use that a lot when I’m only interested in what values were passed into a couple of variables, not all of them. I would look at the header file (“column_headers.tsv”), write down the positions of the columns I wanted, then use cut to get them out.
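To save yourself the counting, you can let the command line find the positions for you: print the header line, turn the tabs into line breaks, and grep -n reports the line number, which is exactly the field number you feed into cut -f. A small sketch, using the lookup file path from above and eVar15 as an example:

	$> head -1 ../hits-lookup_data/column_headers.tsv | tr '\t' '\n' | grep -n -i "evar15"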

There are some other tools that I use, but in the interest of actually ever finishing this article, I shall refrain from mentioning them for now.

Where are the Events?

All of the Success Events go into one column: “event_list”.

The values in that list look funny; they are essentially numbers:

[Screenshot]
Events in a Data Feed File
To decode what you see there, you need to know what those numbers mean.

  • 1 — a purchase event, usually sent from a “Thank you!” page (s.events="purchase";)
  • 2 — a product view (s.events="prodView";)
  • 10 — shopping cart open event (scOpen)
  • 11 — checkout event (scCheckout)
  • 12 — add to cart event (scAdd)
  • 13 — remove from cart event (scRemove)
  • 14 — cart view event (scView)
  • 100 – 174 — instances of eVar1 – eVar75 (“instances” being a metric I need to explain at some point)
  • 200 – 299 — event1 – event100
  • 500 – 5xx — instances of (mobile) solution variables
  • 700 – 7xx — solution-specific events (for mobile, video, …)

The “event.tsv” lookup file will tell you exactly which event each of those numbers stands for in your Data Feed file.

Example:

[Screenshot]
Sample event.tsv File
The hit highlighted in yellow in the example tells you that one event was sent (event2) as well as values into 2 eVars (eVar1 & eVar2) along with some mobile-specific solution variables (501, 507, 508, 509 are instances of “mobileappid”, “mobiledevice”, “mobileosversion”, and “mobileresolution”, respectively).

Also note that “event_list” is completely empty while “post_event_list” has values! We’ll get to that next.

See Events in the help or KPIs and Success Events for more.
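And if you just want to decode a single ID without scrolling through the whole lookup file, grep works here, too. A minimal sketch, assuming the lookup files sit in the same folder as before; the -w makes sure “200” doesn’t also match “1200”:

	$> grep -w "200" ../hits-lookup_data/event.tsv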

What is post_xyz?

Data Feed files can contain more or fewer columns; it really depends on how they were originally set up. Sometimes, you will have two columns per eVar:

[Screenshot]
Header with eVar and post_eVar Columns
So what’s that all about?

Remember when I wrote about persistence? How eVars are like stamps that can stay with the visitor for some time, until they expire? And how events are counted against whatever the eVars currently contain, taking into account the persistence?

Well, this is how that works.

When you send a value into – say – eVar15, like so:

	s.eVar15="September";

You will see that exact value in the column for “eVar15” in the Data Feed file.

Processing Rules or VISTA might change the value after it has been sent. And depending on how eVar15 is configured, specifically its allocation setting (“Original Value (First)” vs. “Most Recent”), the value might or might not overwrite whatever is currently stored for eVar15 in the backend.

And when the Data Feed file is created, the system takes the value from the backend, checks it against the expiry setting, then writes it into the “post_eVar15” column.

So those two might have different values!

And if you want to know what you should see in the report with a specific metric, you have to check what “post_eVar15” was when that event came in!

Same for the events, by the way! Processing Rules or VISTA could change them, and so could the “serialization” settings.

So you really want to look at the “post_event_list” column along with “post_eVar15”.
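A handy way to see that side by side is to cut both the “normal” and the “post” columns out of the feed and compare them. The column numbers below are made up, of course; use the header trick from the cut section to find the real positions in your file:

	$> grep "222.111.5.2" hits.tsv | cut -f120,121 | less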

In our mobile example, you will never see any value in the “normal” columns. Why not? Well, since v4 of the Mobile SDKs, all values are sent as Context Data and only assigned to props and eVars by Processing Rules. It looks like this:

[Screenshot]
Mobile – eVar vs post_eVar

Notes

One command you must know on a Unix/Linux system is man, short for “manual”. Try this:

	$> man grep

As a result, you should see a pager that contains the manual page (“man page”) for the grep utility for your specific system, command line options and all.

If you do not have man pages on your machine, you can always google “man grep”.

Each one of these little tools has an indecent amount of options and command line parameters. They have been around for more than 20 years and have grown incredibly powerful. And there are tons of them. You might want to read up on awk, tail, cat (and tac), as well as some of the built-in commands that your shell offers, like variable assignments and loops.
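Just to whet your appetite, here is a one-liner in that spirit: it counts how often each value shows up in a given column, giving you a quick top ten of whatever sits in, say, column 3. The column number is arbitrary; awk, sort, uniq, and head are all standard tools:

	$> awk -F '\t' '{ print $3 }' hits.tsv | sort | uniq -c | sort -rn | head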

What if the settings of your report suite disallow IP addresses? How do you find your test data then?

One way that works well is to use one “variable” as a sort of debug flag, like so:

	s.prop75 = "JEDEBUG";

You would never set that prop in your production code, but you would set it when you run a test. And you’d use grep to find all lines containing “JEDEBUG” in the Data Feed file. Easy.
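The extraction then works exactly like the IP-based one from above, header line included (file names as before):

	$> head -1 ../hits-lookup_data/column_headers.tsv > mytest.tab
	$> grep "JEDEBUG" hits.tsv >> mytest.tab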

What about other ways to crunch through those big files?

I know a consultant who imports Data Feed files into Access, then uses SQL queries to find what he’s looking for. Not a bad idea.

Other colleagues have discussed importing the file(s) into Hadoop, but I’m not sure anyone has done that. I would argue that the files are reasonably structured and therefore maybe better placed into a row-oriented database rather than something like Hadoop. But I’m no expert by any means.

Do you have a good workflow that you want to share? How do you find relevant stuff in big files?

11 thoughts on “Using Data Feeds for Debugging”

  1. Hi,

    If I’m using a numeric event in the data feed as below, will there be any issue in passing the data feed value?

    var events = 'event15,event18=' + score + ',event20=' + timeTaken;

    Thanks,
    Devisree.


    1. Hiya,

      Data Feeds cannot be used to pass data in; they are there for extracting data _out_.

      So either the answer to your question is “it doesn’t quite work that way”, or I may have misunderstood you. Could you rephrase your question, please, or tell us what you are trying to achieve?

      Cheers,
      jan


  2. Thanks for sharing it Jan.
    How to send/Configure only the events that we want in event_list? Where can I make that change?
    How can I find the list_vars in Datafeed column?
    For example: event_list: “2,200,236,227,20,100,101,102,111,113,114,115,129,131,162,172,173,175”


    1. I’m not sure I fully understand your question. You cannot limit what is sent in event_list as it is the raw data that is collected.

      Feel free to reach out to me and elaborate further via email. Might be more efficient than here in the comments.

      kasper(a)accrease.com

