When I wrote about debugging mobile apps some time ago I mentioned that one way to get data out of Adobe Analytics (formerly SiteCatalyst) for troubleshooting is the so-called “Data Feeds”: flat files of raw data delivered by FTP, containing one line per hit sent.
When you looked at the URL of a tracking request the last time, did you see how long it was? A thousand characters are not uncommon, especially if you use Context Data and define name spaces and things like that. The URL can get so long that IE chokes on it and doesn’t track at all! The dreaded 2048-character limit lurks… though that doesn’t happen frequently.
So think of thousands or millions of hits and how much data that would be if you just pulled it in raw format. Yup, a Data Feed file can get big. So big, in fact, that the system will not allow you to send more than a day’s worth of data in a single file. So big that the file is zipped (or gzipped) before it is sent out. So big that you should closely watch your FTP server disk space!
Every line in a Data Feed file can contain 500+ columns, each column representing one bit of data that has been or potentially could have been sent into Analytics.
If you send “DE” into s.prop2, for example, you will see a row somewhere in the Data Feed that contains “DE” in the column that represents that “variable”. The same is true for all values you are sending, with a notable exception: Context Data is NOT part of the Data Feed unless you have a Processing Rule that copies the data into a “variable”.
You are likely sick of buzz words, and I am sure “big data” is high on the list of terms that you and I would place on a bullshit bingo sheet. Possibly multiple times. Well, Data Feed files are big data, literally.
If in the past your strategy has been “just open the log file and scroll through it” (a strategy made popular in the 80s by emacs), I’m afraid you have to reconsider. Only a few editors can open files this size (Excel is certainly not one of those, though emacs is).
You need alternatives.
I have in the past worked with Unix and Linux systems and I’d like to suggest a couple of easy-to-use tools that will help you find the data you actually need in those big files.
If you’re on Unix, Linux, or OS X, you have access to these little gems on your command line. Should you be on Windows, feel free to download Cygwin. You’ll have a “Cygwin Terminal” that comes with them as well.
Let’s start with simple search.
Use case: you ran a test and you want to see what the data looked like once it ended up in Analytics.
How do you find “your” data? Easiest would be to note down the IP address you had when you tested, then look for that IP address in the file.
So, open the file — woah, wait! Didn’t I say it was likely too big? Yup, too big for an editor.
But not too big for a “pager”. The pager is called more or less, depending on your system. Sometimes, both names work and it is really the same pager. Yes, those Unix people are jokers.
Try this command:
$> less hits.tsv
The system will show you the first 25 or so lines of the file and you can scroll up and down with your cursor keys. Some versions allow you to scroll using the mouse and a scroll bar, but usually it’s cursor up and down.
Page Up and Page Down should also work as expected.
Try typing g or G. That should move to the top or bottom of the file, respectively.
So you could try and find your IP address just moving up and down, but there is an easier way: hit the / key and the lowest line of the screen will go blank, with just a “/” on the left. That is a search prompt!
You can now type your IP address, hit enter and the pager will try to find what you typed and move to that location.
If the pager does not find your term, it will tell you so on that same bottom line; less, for instance, prints “Pattern not found”.
Pretty easy to do, no?
Here’s an important key: q. This is how you exit more. Sometimes the ESC key works as well.
Let’s move on to more powerful tools and to the concept of “piping”!
In the Unix/Linux world, pretty much everything is a file. Some of these things do not really seem to be or behave like a file, one example being the screen that you see in front of you. But conceptually, even the screen is a file.
There are tools that work on files, but not like an editor, more like a black box. You put a file into the tool and it spits out a new, modified file on the other side. This is how most tools on a Unix/Linux command line work.
Now the powerful thing is that you can “redirect” that output, tell it to go somewhere specific.
In most cases, you will “redirect it into a file”, which essentially means saving it. But you can also use the output of one tool as the input of another one. We call it “piping the output” into some other tool.
Bear with me, we’ll get to an example soon. Just remember: file goes in, tool modifies file line by line, tool spits out resulting lines which you can redirect, say into a new file.
One tool that helps searching is the incredibly simple, yet powerful grep. In its easiest form, grep checks a file, line by line, and outputs all those lines that match your search term. Example: a file called “characters.tab” that looks like this:
ID  Name         Short Name  Species
1   September    Tem         Human
2   Saturday     N/A         Marid
3   A-Through-L  Ell         Wyverary
4   Spoke        N/A         Taxicrab
5   Aroostook    N/A         Ford Model A
If I run this command:
$> grep em characters.tab
I’ll get back a single line:
1 September Tem Human
If instead I run:
$> grep s characters.tab
I’ll get back two lines:
ID  Name         Short Name  Species
5   Aroostook    N/A         Ford Model A
Why only those two? Well, the header line has an “s” at the end, line 4 (“A-Through-L”) has no “s” at all, and lines 2, 3 & 5 do have “S” in them, but not “s”. You guessed it: grep is case-sensitive by default.
Run this one instead:
$> grep -i s characters.tab
and you’ll get everyone but the Wyverary (a Wyvern who thinks his dad is a library, in case you were wondering).
You can of course look for longer search terms, or terms including special characters like the space character. Perfectly valid:
$> grep "Model A" characters.tab
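If you want to try these searches yourself, here is a small sketch that recreates “characters.tab” in a throwaway directory and counts the matches with -c instead of printing them. The temporary directory and the variable names are my own choices for illustration, nothing more.

```shell
# Work in a throwaway directory so nothing on disk gets overwritten.
workdir=$(mktemp -d)

# Recreate the sample file. printf reuses the format for every group of
# four arguments; \t is a tab. The ">" saves the output into a file.
printf '%s\t%s\t%s\t%s\n' \
  ID Name "Short Name" Species \
  1 September Tem Human \
  2 Saturday N/A Marid \
  3 A-Through-L Ell Wyverary \
  4 Spoke N/A Taxicrab \
  5 Aroostook N/A "Ford Model A" > "$workdir/characters.tab"

# -c prints the number of matching lines instead of the lines themselves.
em_count=$(grep -c em "$workdir/characters.tab")       # case-sensitive "em"
s_count=$(grep -c s "$workdir/characters.tab")         # case-sensitive "s"
si_count=$(grep -c -i s "$workdir/characters.tab")     # case-insensitive "s"
model_line=$(grep "Model A" "$workdir/characters.tab") # term with a space

echo "em: $em_count, s: $s_count, -i s: $si_count"
echo "$model_line"
```

On this sample the counts come out as 1, 2 and 5, which matches what the three searches above return.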
Back to debugging the Data Feed file. You know your IP address, so go grep for it:
$> grep "184.108.40.206" hits.tsv
Note: the file won’t actually be called hits.tsv, but rather something like “jexnerinstructional.nonsense_2014-08-05.tsv”, a combination of report suite ID and date of the data inside. But for this posting, I’ll call it hits.tsv, otherwise the lines will get too long.
The result should be a bunch of lines on your screen. Hard to read. Time to get back to “piping” and “redirect”.
$> grep "184.108.40.206" hits.tsv > myhits.tab
This will put every line that contains my IP address into a new file called “myhits.tab”.
Note that there is a header file in the “lookup_data” folder that you can add for clarity:
$> head -1 ../hits-lookup_data/column_headers.tsv > myhits.tab
$> grep "184.108.40.206" hits.tsv >> myhits.tab
The first command takes 1 line off the top of “column_headers.tsv” and redirects it into “myhits.tab”. The second looks for all lines containing the IP in “hits.tsv” and appends them (the “>>”) into “myhits.tab”.
The new file “myhits.tab” might now be small enough to be opened in Excel for analysis, or in any other tool you’re happy with.
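One caveat worth knowing: grep treats its search term as a regular expression, and in a regular expression a dot means “any character”. An IP pattern can therefore match more than you intended. The sketch below uses -F (take the term literally) and -w (match whole words only, available in GNU and BSD grep) to avoid that. The three-column file layout and the 192.0.2.x addresses are invented for illustration; a real Data Feed looks different.

```shell
workdir=$(mktemp -d)

# Invented stand-ins for the real files: a header file and a tiny hit file.
printf 'ip\tpagename\tprop2\n' > "$workdir/column_headers.tsv"
printf '%s\t%s\t%s\n' \
  192.0.2.10  home     DE \
  192.0.2.100 home     EN \
  192a0b2c10  broken   XX \
  192.0.2.10  checkout DE > "$workdir/hits.tsv"

my_ip="192.0.2.10"

# Header first: ">" creates the file or overwrites it.
head -1 "$workdir/column_headers.tsv" > "$workdir/myhits.tab"

# Then the matching hits: ">>" appends.
# -F: no regular expression, so a dot is just a dot.
# -w: whole words only, so 192.0.2.100 does not count as a match.
grep -F -w "$my_ip" "$workdir/hits.tsv" >> "$workdir/myhits.tab"

cat "$workdir/myhits.tab"
```

Without -F the pattern would also have picked up the “broken” line, and without -w the hit from 192.0.2.100.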
Note: Every single tool on a Unix/Linux system (or in an OS X or Cygwin terminal) understands the “>” & “>>” symbols. In fact, it’s the terminal itself that handles them, so the tools just have to output their results.
What if you want to filter lines for a specific IP address, but you also only want those that contain a specific value in one of the variables? You need to grep twice, so to speak.
$> grep "184.108.40.206" hits.tsv | grep -i english
This construct will first go through the file and spit out all lines containing the IP, then those lines are fed into the second grep which only passes on those that contain the word “english” (case insensitive this time).
The “|” in this command is the “pipe” symbol. It tells the terminal to take the output of the tool on the left and use it as input for the tool on the right. Pretty nifty, hm?
With grep and the pipe symbol, you can run some pretty complex searches and most of the time, you won’t need anything else.
A pretty useful thing to do is to pipe the result of a grep into the pager, like so:
$> grep "184.108.40.206" hits.tsv | less
No need to save this, but you would like to be able to scroll and move around, wouldn’t you? You’re welcome.
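A pipe does not have to end in the pager. As a rough sketch, the chain below ends in wc -l, which simply counts the lines that made it through, and a second chain uses -v to turn a filter around. The sample hits are made up, and -F and -w are there so the IP is taken literally and as a whole word.

```shell
workdir=$(mktemp -d)

# Made-up hits: IP, page name, and a language value.
printf '%s\t%s\t%s\n' \
  192.0.2.10 home     English \
  192.0.2.10 checkout german \
  192.0.2.77 home     ENGLISH \
  192.0.2.10 search   english > "$workdir/hits.tsv"

# Stage 1 keeps my IP, stage 2 keeps "english" in any case,
# stage 3 (wc -l) counts the lines that are left.
english_hits=$(grep -F -w "192.0.2.10" "$workdir/hits.tsv" | grep -i english | wc -l)

# -v inverts a match: my hits that do NOT mention "english".
other_hits=$(grep -F -w "192.0.2.10" "$workdir/hits.tsv" | grep -v -i english)

echo "english hits:" $english_hits
echo "$other_hits"
```

Counting first is a cheap way to find out whether a result is small enough to read on screen or better sent to a file.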
The Data Feed files potentially contain all possible “variables”. That is a lot of columns. And most of the time, you only want to look at a couple of them. That’s where cut comes in.
Take the “characters.tab” file and run:
$> cut -f2,4 characters.tab
The result will be:
Name         Species
September    Human
Saturday     Marid
A-Through-L  Wyverary
Spoke        Taxicrab
Aroostook    Ford Model A
I use that a lot when I’m only interested in what values were passed into a couple of variables, not all of them. I would look at the headers of the “hits.tsv” file, write down the position of those I wanted, then use cut to get them out.
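Writing down column positions by hand gets old quickly with 500+ columns. Here is one possible way to let the tools do the counting, assuming the header file is a single tab-separated line. The column names and values in the sample are invented, and tr and paste are two small tools I have not introduced above: tr swaps one character for another, paste -s glues lines back together.

```shell
workdir=$(mktemp -d)

# Invented header line and two invented hits with five columns each.
printf 'hit_time_gmt\tip\tpagename\teVar15\tpost_eVar15\n' > "$workdir/column_headers.tsv"
printf '%s\t%s\t%s\t%s\t%s\n' \
  1407225600 192.0.2.10 home     spring-sale spring-sale \
  1407225660 192.0.2.10 checkout ""          spring-sale > "$workdir/hits.tsv"

# Put one header name on each line. grep -n prefixes every match with its
# line number, which here equals the column position. -x only accepts
# whole-line matches, so "eVar15" is not mistaken for "post_eVar15".
positions=$(head -1 "$workdir/column_headers.tsv" | tr '\t' '\n' \
  | grep -n -x -e pagename -e post_eVar15)
echo "$positions"

# Keep the numbers in front of the colon and join them with commas for cut.
fields=$(echo "$positions" | cut -d: -f1 | paste -s -d, -)
cut -f"$fields" "$workdir/hits.tsv"
```

For the sample this boils down to cut -f3,5, but you never had to count to five yourself.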
There are some other tools that I use, but in the interest of actually ever finishing this article, I shall refrain from mentioning them for now.
Where are the Events?
All of the Success Events go into one column: “event_list”.
The values in that list look funny; they are essentially numbers.
To decode what you see there, you need to know what those numbers mean.
- 1 — a purchase event, usually sent from a “Thank you!” page
- 2 — a product view
- 10 — shopping cart open event
- 11 — checkout event
- 12 — add to cart event
- 13 — remove from cart event
- 14 — cart view event
- 100 – 174 — instance of eVar1 – eVar75, I need to explain that metric at some point
- 200 – 299 — event1 – event100
- 500 – 5xx — instances of (mobile) solution variables
- 700 – 7xx — solution-specific events (for mobile, video, …)
The “event.tsv” file will tell you the exact events that are found in your Data Feed file.
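Decoding a list of numbers by eye is tedious, so here is a hedged sketch of how the lookup could be automated. It assumes that “event.tsv” has two tab-separated columns (ID, then name) and that the values in the event list are separated by commas; check both against your own files. The lookup entries and the list value below are invented, and awk is a tool I have not covered above; here it only compares the first column with an ID and prints the second.

```shell
workdir=$(mktemp -d)

# Invented lookup file in the assumed layout: ID, tab, name.
printf '%s\t%s\n' \
  1   purchase \
  100 "Instance of eVar1" \
  101 "Instance of eVar2" \
  201 event2 > "$workdir/event.tsv"

# A made-up value from the event list column, assumed to be comma-separated.
post_event_list="201,100,101"

# Put one ID on each line, then look each one up in column 1 of event.tsv.
decoded=$(echo "$post_event_list" | tr ',' '\n' | while read -r id; do
  awk -F'\t' -v id="$id" '$1 == id { print id " = " $2 }' "$workdir/event.tsv"
done)

echo "$decoded"
```

An ID that is missing from the lookup file produces no output line at all, which is itself a useful hint when debugging.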
The hit highlighted in yellow in the example tells you that one event was sent (event2) as well as values into 2 eVars (eVar1 & eVar2) along with some mobile-specific solution variables (501, 507, 508, 509 are instances of “mobileappid”, “mobiledevice”, “mobileosversion”, and “mobileresolution”, respectively).
Also note that “event_list” is completely empty while “post_event_list” has values! We’ll get to that next.
Data Feed files can contain more or fewer columns; it really depends on how they were originally set up. Sometimes, you will have two columns per eVar, for example “eVar15” and “post_eVar15”.
So what’s that all about?
Remember when I wrote about persistence? How eVars are like stamps that can stay with the visitor for some time, until they expire? And how events are counted against whatever the eVars currently contain, taking into account the persistence?
Well, this is how that works.
When you send a value into, say, eVar15, you will see that exact value in the column for “eVar15” in the Data Feed file.
Processing Rules or VISTA might change the value after it has been sent. Depending on how eVar15 is configured, and specifically looking at attribution (“Original Value (First)” vs. “Most Recent”) the value might or might not overwrite whatever is currently in eVar15 in the backend.
And when the Data Feed file is created, the system takes the value from the backend, checks it against the expiry setting, then writes it into the “post_eVar15” column.
So those two might have different values!
And if you want to know what you should see in the report with a specific metric, you have to check what “post_eVar15” was when that event came in!
Same for the events, by the way! Processing Rules or VISTA could change them, or even the “serialization” settings.
So you really want to look at the “post_event_list” column along with “post_eVar15”.
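To make the difference between the two columns visible, the sketch below lists only those hits where “eVar15” and “post_eVar15” disagree. The four-column sample file is invented: the value is sent on the first hit only and then persists. awk (not covered above) does the comparison, and the column positions 2 and 3 apply to this sample, not to a real feed.

```shell
workdir=$(mktemp -d)

# Invented sample: eVar15 is sent once, post_eVar15 keeps the value.
# 114 would be the instance of eVar15, 2 a product view, 1 a purchase.
printf 'pagename\teVar15\tpost_eVar15\tpost_event_list\n' > "$workdir/myhits.tab"
printf '%s\t%s\t%s\t%s\n' \
  home     spring-sale spring-sale 114 \
  product  ""          spring-sale 2 \
  thankyou ""          spring-sale 1 >> "$workdir/myhits.tab"

# Skip the header (NR > 1) and print only rows where columns 2 and 3 differ.
differing=$(awk -F'\t' 'NR > 1 && $2 != $3 {
  print $1 ": sent=[" $2 "] post=[" $3 "] events=" $4
}' "$workdir/myhits.tab")

echo "$differing"
```

In this made-up visit the purchase on the “thankyou” hit would be credited to “spring-sale”, even though nothing was sent into eVar15 on that hit.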
In our mobile example, you will never see any value in the “normal” columns. Why not? Well, since v4 of the Mobile SDKs, all values are being sent in Context Data and only assigned into props and eVars by Processing Rules.
One command you must know on a Unix/Linux system is man, short for “manual”. Try this:
$> man grep
As a result, you should see a pager that contains the manual page (“man page”) for the grep utility for your specific system, command line options and all.
If you do not have man pages on your machine, you can always google “man grep”.
Each one of the little tools has an indecent amount of options and command line parameters. They have been around for more than 20 years and grown into very powerful little tools. And there are tons of them. You might want to read up on more of them (tac, for instance), as well as some built-in commands that your terminal offers, like variable assignments and loops.
What if the settings of your report suite disallow IP addresses? How do you find your test data then?
One way that works well is to use one “variable” as a sort of debug flag, like so:
s.prop75 = "JEDEBUG";
You would never set that prop in your production code, but you would set it when you run a test. And you’d use grep to find all lines containing “JEDEBUG” in the Data Feed file. Easy.
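Since the files arrive compressed, you do not even have to unpack them to look for the flag. A minimal sketch, assuming the feed is a plain gzip file (a zip or tar archive needs a different first step); the three-column sample is invented:

```shell
workdir=$(mktemp -d)

# Invented hits: page name, language, and the debug prop in column 3.
printf '%s\t%s\t%s\n' \
  home     DE "" \
  home     DE JEDEBUG \
  checkout EN JEDEBUG > "$workdir/hits.tsv"

# Compress the sample; this leaves hits.tsv.gz and removes hits.tsv.
gzip "$workdir/hits.tsv"

# gzip -dc unpacks to standard output only, the file on disk stays
# compressed. grep then filters for the flag, cut keeps columns 1 and 2.
debug_hits=$(gzip -dc "$workdir/hits.tsv.gz" | grep -F JEDEBUG | cut -f1,2)

echo "$debug_hits"
```

That keeps the big uncompressed file off your disk entirely; only the handful of test hits ever gets written out, and only if you redirect them into a file.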
What about other ways to crunch through those big files?
I know a consultant who imports Data Feed files into Access, then uses SQL queries to find what he’s looking for. Not a bad idea.
Other colleagues have discussed importing the file(s) into Hadoop, but I’m not sure anyone has done that. I would argue that the files are reasonably structured and therefore maybe better placed into a row-oriented database rather than something like Hadoop. But I’m no expert by any means.
Do you have a good workflow that you want to share? How do you find relevant stuff in big files?