Shortly after setting up this blog, I embedded Google Analytics tracking, primarily because I like numbers but also in the hope of discovering that at least one other person who isn’t me or one of my supervisors is interested in my adventures. Blogging is also great writing practice and gives me the chance to properly think through the things that I am doing, to avoid looking wrong on the internet.
I was already in the habit of spamming links to my posts via various social networks so it wasn’t a long wait for the warm, fuzzy feeling of confirmation that people were actually reading my work. Or at the very least clicking on it.
However, after a few days, I noticed several strange entries amongst my lovely numbers[1]:
| Source | Sessions | % of Referrals | Bounce Rate | Pages / Session | Avg. Session Duration |
Curses. All my non-social referrals are ghost referrals! Disreputable publishers liberally use spambots to remotely execute Google Analytics tracking scripts[2], so that they appear to be providing a stream of referrals to your website. I’m unsure of the aim of this apparent data pollution[3] attack, though. Beyond a poor attempt at driving confused hostmasters to the sources to increase organic traffic, I don’t really see what the benefit to the executor is[4]. Perhaps the sites attempt to install malware on unsuspecting visitors’ machines, or to capture more valuable information from them.
Oddly, page-specific metrics (such as landing/exit pages) are polluted too. Bots copy the hostname of the referral source to the page name of the false hit, giving hostmasters the impression something more worrying is afoot. It’s easy to forget that falsification of data is a possibility, especially when one is not responsible for the collection and management of that data.
None of this is particularly important or bothersome unless, like me, you like numbers and find numbers that are wrong upsetting. So how can normality be restored? These spambots attack indiscriminately and remotely, so they are unaware of the actual target and have no option but to spoof the hostname of the hit (or leave the field unset), when it should in fact match that of the website under attack.
A helpful blog post details how to set up a simple filter in your Google Analytics control panel to remove future[5] spurious data by ignoring hits which fail to provide a valid, expected hostname[6]. The remaining 52.25% of my traffic appears genuine, hooray!
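To make the idea concrete, here is a rough Python sketch of the logic such a hostname filter applies. The real filter is configured in the Google Analytics control panel rather than in code, so this is only an illustration: the hostnames, the shape of the hit records and the function names are all hypothetical. It does show why escaping matters, as an unescaped `.` in the pattern would match any character.

```python
import re

# Hypothetical hostnames, for illustration only: substitute the domains
# that legitimately serve your own pages.
VALID_HOSTNAMES = ["example.org", "www.example.org"]


def build_hostname_pattern(hostnames):
    """Build an anchored regex that matches only the given hostnames.

    re.escape turns "." into "\\." so that it matches a literal dot
    rather than any character.
    """
    escaped = [re.escape(h) for h in hostnames]
    return re.compile(r"^(?:" + "|".join(escaped) + r")$", re.IGNORECASE)


def is_genuine(hit, pattern):
    """Keep a hit only if it reports a hostname matching the pattern.

    A missing or empty hostname counts as spoofed.
    """
    hostname = hit.get("hostname") or ""
    return pattern.fullmatch(hostname) is not None


def filter_hits(hits, pattern):
    """Return only the hits that carry a valid, expected hostname."""
    return [h for h in hits if is_genuine(h, pattern)]


# Made-up hits: one real visit and three ghost referrals.
hits = [
    {"hostname": "www.example.org", "source": "twitter.com"},
    {"hostname": "spam-referrer.invalid", "source": "spam-referrer.invalid"},
    {"hostname": None, "source": "spam-referrer.invalid"},
    {"hostname": "wwwXexample.org", "source": "spam-referrer.invalid"},
]

pattern = build_hostname_pattern(VALID_HOSTNAMES)
genuine = filter_hits(hits, pattern)
print(pattern.pattern)
print(genuine)
```

The last fake hit is the interesting one: without escaping, `www.example.org` used as a pattern would happily accept `wwwXexample.org`.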
- Spambots performed a seemingly pointless data pollution attack on my Google Analytics records.
- One should always be as suspicious of data as possible, especially if it was collected by somebody else.
- I like numbers. I really don’t like people messing with my numbers.
1. I haven’t censored the source column as it may be potentially useful to others having the same problem[3].
2. Presumably by guessing or spidering Google Analytics tracking IDs.
3. Electing to use “pollution” over “poison” here as the result is less directly toxic and more of a confusing annoyance.
4. Although by mentioning the domains here perhaps I’ve done exactly what they wanted…
5. If you are as fussy as me when it comes to data, that same blog has another helpful post which offers a method to clean up historic data too.
6. Be sure to correctly escape the regular expression!