The Problem With Client-Side Analytics
Today’s client-side analytics services are trivially spoofable, by which we mean that the numbers they present can be inflated by making no or very few requests to the host website.
This post highlights the problem and proposes a partial solution that substantially mitigates it with minimal effort. Our proposed solution is to include a digital signature in each message sent to the analytics provider. Apart from the server-side generation of the signature, all other components remain the same. This can be implemented in a couple of lines of code (snippets provided), and gives confidence that every entry in your analytics is the result of a request to your web servers.
It is unclear whether nefarious characters exploit the current disconnect between client-side analytics services and their associated host websites. However, given the ease with which spoofing is possible, we suggest implementing a solution as soon as possible.
The analytics landscape
There is an increasing number of web analytics companies; they seem to spring up on an almost daily basis. Google Analytics provides a cheap (free!) and cheerful, one-size-fits-all approach. It is very popular: it is used by half of the top 1 million websites and by over 12.3 million websites in total. For an indication of how many users make how many page views on your site, it is more than adequate.
More recently a series of new premium and real-time analytics companies have launched. These include ChartBeat, Coremetrics, MixPanel, GoSquared, LuckyOrange… (the list goes on). High-volume, high-performance solutions include SASS, Unica and, most recently, Google Analytics Premium. All of these services have one thing in common: in their recommended and most common mode of operation, the only integration required is the embedding of a small snippet of JavaScript.
How client-side analytics works
Typically a JavaScript file hosted on the analytics company’s servers (or, more commonly, on the analytics company’s content distribution network) is embedded in every webpage to be monitored. On page load this JavaScript is requested. It in turn requests an image pixel with a series of URL-encoded parameters, typically including the time, a random number and some interesting data about the client machine (screen and browser dimensions, etc.). Additionally, as part of this request, the user’s browser sends information in the request headers, including the current URL, the User Agent (a short description of the browser and operating system version) and a Cookie (a small file stored on the user’s machine that uniquely identifies them).
The analytics company writes all the data contained in the image-pixel request to a database and the aggregate numbers are your analytics. Simple!
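To make the shape of such a request concrete, here is a minimal sketch of how an image-pixel URL might be assembled. The endpoint, the parameter names and the account ID format are hypothetical; each real provider uses its own. The point to notice is that nothing in the URL ties the request to a page actually served by the host website.

```python
import random
import time
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
PIXEL_ENDPOINT = "http://analytics.example.com/pixel.gif"


def build_pixel_url(account_id, screen_width, screen_height, now=None, nonce=None):
    """Return the URL a tracking script might request as an image pixel."""
    params = {
        "account": account_id,
        # Time of the page view, in seconds since the Unix epoch.
        "ts": int(time.time()) if now is None else int(now),
        # Random number, typically used to defeat caching.
        "r": random.getrandbits(31) if nonce is None else nonce,
        # Client data that only JavaScript can collect.
        "screen": "%dx%d" % (screen_width, screen_height),
    }
    return PIXEL_ENDPOINT + "?" + urlencode(params)


# Fixed time and random number so the result is reproducible.
url = build_pixel_url("UA-XXXXX-X", 1280, 800, now=1300000000, nonce=12345)
print(url)
```

The request headers (current URL, User Agent, Cookie) are added by the browser and are not shown here.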
The problem
Ease of installation and use has dramatically sped up the uptake of such services, but it leads to a fundamental weakness: it is trivially easy to spoof these analytics services, making no or very few requests to the host website. To illustrate we implemented a spoofer for Google Analytics and we show its mischievous efforts below.
If business decisions such as ad spend, A/B testing, product choice, or even company valuation are being made off the back of these analytics, those decisions could be based on very bad data.
How to spoof an analytics service
We reckon a script to spoof most analytics services can be written by a competent programmer in less than one hour. It involves opening the developer console in Safari or Chrome, or installing Firebug in Firefox and looking at the image pixel being sent to the analytics server. Hitting refresh 15 times will generally give you enough information to work out what request headers and URL variables are being sent.
If any fields remain a mystery, the JavaScript source code is transmitted in plain text and is easy to reverse engineer.
This script can then be run locally, on a legitimate network of rented servers, or on an illegitimate bot network to make hundreds or millions of requests.
If you are a customer of an analytics company, a malicious user need make only a single request to your website to find out which analytics services you use and what your account ID is. After that, all further requests go to the analytics company, leaving no clues on your servers.
Current solutions
Three solutions to this problem are currently employed:
- Ignore it
- Install more analytics: 3, 4, 5 providers. One has to be right! Take the average of the closest two. But, if you can spoof one, you can spoof them all.
- Server-side requests. Some providers let you make requests straight from your server to the analytics provider via an API (e.g. Mixpanel); alternatively, you can hack this solution together by writing your own spoofer. The issue is that analytics engines lose the additional data gathered in JavaScript (screen size etc.). Furthermore, they use the fact that JavaScript was executed as a crude filter to remove bots and crawlers, so data gathered server-side is much noisier. More thorough providers offer hybrid solutions that marry log files with client-side requests (the approach taken at spider.io); however, this can be overkill for small websites.
Required solution
There is a need for an analytics solution where an automated agent cannot spoof your analytics without your knowledge. Spider.io separates automated and systematic traffic from real users; however, we’re not saying you must work with us to get an accurate picture of your analytics. There is a much simpler solution that, although less comprehensive, could easily be implemented by current analytics companies. Client-side analytics is prone both to over-reporting, because of spoofers, and to under-reporting of traffic from bots. Most websites are primarily concerned with their real user traffic, which makes over-reporting their primary concern. We propose a solution to this below.
Proposed solution - signed analytics
We believe that along with the image pixel request the publisher site needs to send a digital signature. A digital signature is a verifiable digest of the data being sent that guarantees the source and freshness of the data. If the digital signature doesn’t match, then data has been tampered with.
A unique digital signature of a random number and a timestamp is embedded in each page, as part of the request for the analytics provider’s JavaScript include. The signature is then sent from the user’s browser to the analytics provider’s servers with the image pixel request. The analytics provider simply checks that each signature is seen only once and that the timestamp it covers is fresh. Note that there is no need to sign the whole message: signing the random number and timestamp is sufficient for uniqueness and freshness.
By its nature, generating a digital signature requires a private key (a secret code). This key must be stored on the publisher’s server, where it cannot be viewed by JavaScript. This means signed analytics requires a change to server-side code, making installation a little harder (but not much!).
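The provider-side check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes an HMAC-SHA1 over the timestamp followed by the random number, a key known to both publisher and provider, a five-minute freshness window, and an in-memory set of seen signatures. The names `sign`, `SignatureChecker` and `MAX_AGE_SECONDS` are ours, chosen for illustration.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"SECRET_KEY"  # assumed to be known to publisher and provider
MAX_AGE_SECONDS = 300       # assumed freshness window


def sign(ts, r, key=SECRET_KEY):
    """Publisher side: HMAC-SHA1 over the timestamp followed by the random number."""
    message = ("%s%s" % (ts, r)).encode("ascii")
    return hmac.new(key, message, hashlib.sha1).hexdigest()


class SignatureChecker:
    """Provider side: accept each valid, fresh signature exactly once."""

    def __init__(self, key=SECRET_KEY, max_age=MAX_AGE_SECONDS):
        self.key = key
        self.max_age = max_age
        self.seen = set()  # in-memory only; a real service would expire old entries

    def accept(self, ts, r, ds, now=None):
        now = time.time() if now is None else now
        expected = sign(ts, r, self.key)
        # Constant-time comparison of the expected and received signatures.
        if not hmac.compare_digest(expected.encode(), ds.encode()):
            return False  # signature does not match the timestamp and random number
        if abs(now - int(ts)) > self.max_age:
            return False  # timestamp is not fresh
        if ds in self.seen:
            return False  # signature has already been used
        self.seen.add(ds)
        return True


checker = SignatureChecker()
ts, r = 1300000000, 12345
ds = sign(ts, r)
print(checker.accept(ts, r, ds, now=ts + 10))  # True: valid, fresh, first use
print(checker.accept(ts, r, ds, now=ts + 11))  # False: replayed signature
```

Because every accepted entry consumes one signature, and signatures can only be generated by the publisher’s server, each analytics entry corresponds to at least one request to that server.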
Possible implementation
There are many ways to implement the inclusion of a digital signature in messages sent by client-side analytics. Here we propose generating a SHA-1 HMAC server-side and including it in the URL of the JavaScript include as a fragment. The included JavaScript simply needs to grab the fragment and send it along with any other messages sent to the analytics endpoint. The most common languages in which large-scale websites are written are PHP, Python and ASP. Below are examples of the required installation changes to incorporate a digital signature:
PHP
$r = rand();
$ts = time();
// Sign the timestamp followed by the random number.
$ds = hash_hmac("sha1", $ts . $r, "SECRET_KEY");
echo "<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts=$ts&r=$r&ds=$ds\"></script>";
Python
import hashlib
import hmac
import random
import time

r = random.getrandbits(31)
ts = int(time.time())
# Sign the timestamp followed by the random number, as in the PHP example.
message = ("%s%s" % (ts, r)).encode("ascii")
ds = hmac.new(b"SECRET_KEY", message, hashlib.sha1).hexdigest()
print('<script id="example-com-analytics" src="http://example.com/analytics.js#ts=%s&r=%s&ds=%s"></script>' % (ts, r, ds))
ASP.NET – C#
// Requires: using System; using System.Security.Cryptography;
Random rand = new Random();
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
byte[] keyBytes = encoding.GetBytes("SECRET_KEY");
String ts = DateTime.Now.Ticks.ToString();
String r = rand.Next().ToString();
// Sign the timestamp followed by the random number, as in the PHP example.
byte[] messageBytes = encoding.GetBytes(ts + r);
HMACSHA1 hmacsha1 = new HMACSHA1(keyBytes);
byte[] dsBytes = hmacsha1.ComputeHash(messageBytes);
// BitConverter emits "AB-CD-..."; strip the hyphens and lowercase to get plain hex.
String ds = BitConverter.ToString(dsBytes).Replace("-", "").ToLower();
Response.Write(String.Format("<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts={0}&r={1}&ds={2}\"></script>", ts, r, ds));
Advantages of signed analytics
Digital signatures have the following advantages over the traditional JavaScript approach:
- Spoofing is much harder. At least one request to your servers must be made for every entry in your analytics.
- If your analytics numbers are spoofed, you have evidence locally in your server logs for post-hoc analysis.
What signed analytics does not provide
- A record of users who do not execute JavaScript (scrapers and older mobile browsers)
- Analytics numbers which are free from sophisticated spoofing/JavaScript-enabled web robots
For these, a heavier-weight hybrid (client-side and server-side) solution is required.
Trusting your numbers
Having analytics you don’t trust is as bad as having no analytics at all. Bad decisions are based on bad data. If your analytics numbers are based on digital signatures, you can at least be confident your analytics numbers are not being trivially spoofed.