Testing JavaScript with Amazon’s Mechanical Turk

This is the first in a short series of blog posts covering our approach to JavaScript testing.

The need for systematic JavaScript testing in the wild

Like most companies, we test our JavaScript thoroughly in-house: we deploy to a staging setup identical to our production setup, then run a suite of Selenium tests on a mixture of real and virtual machines covering the common browser–operating system pairings we see in the wild. Assuming the JavaScript passes these tests, we then test on iOS and Android.
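
As a flavour of what that Selenium stage might look like, here is a minimal sketch in Python using the older desired_capabilities style of the Selenium bindings; the hub URL, capability list, page URL and the window.__analyticsLoaded flag are all hypothetical, not our actual suite.

# Minimal sketch: run the same check against several browser-OS pairings
# via a Selenium Grid hub. Hub URL, capabilities and the success flag are
# illustrative only.
from selenium import webdriver

PAIRINGS = [
    {"browserName": "firefox", "platform": "WINDOWS"},
    {"browserName": "chrome", "platform": "MAC"},
    {"browserName": "internet explorer", "platform": "WINDOWS"},
]

for caps in PAIRINGS:
    driver = webdriver.Remote(
        command_executor="http://selenium-hub.internal:4444/wd/hub",
        desired_capabilities=caps,
    )
    try:
        driver.get("http://stage.example.com/instrumented-page")
        # Assume the embedded JavaScript sets a flag once it has run successfully.
        assert driver.execute_script("return window.__analyticsLoaded === true;")
    finally:
        driver.quit()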

Most companies would stop there and push to production. We don’t.

There are a great many subtle effects and unexplained artifacts that occur in the wild but just don’t happen in a testing environment: unusual combinations of browser, operating system, hardware, connection and load, among other things.

Creating every possible combination of these in a test environment is not tractable, so we take a statistical approach and sample from the population (users of the Internet).

Amazon’s Mechanical Turk

If you haven’t come across Amazon’s Mechanical Turk, or MTurk for short, we highly recommend researching it. Essentially it is a crowd-sourcing platform. You pay lots of people (Workers) small amounts of money to work on simple tasks (Human Intelligence Tasks or HITs).

Workers form our sample of real world users.

We have two types of HIT: those that set the Worker a scripted task and those that set a natural, unscripted task.

In both cases the JavaScript is executed in a debug mode that streams raw data, partially computed data and error messages back to us. These results allow us to troubleshoot data collection and provide ground truth for testing our data processing.

Handling bias

Clearly the pool of MTurk Workers is not strictly representative of our customers’ users. To handle this we split our tasks into two pools: Random HITs, which any Worker may pick up, and Stratified HITs, which require a specific browser–operating system setup.

HITs requiring a specific setup are common on MTurk, as many tasks require esoteric applets, environments or settings. Stratified HITs are assigned in the ratio in which we see users, with each browser–operating system pair making up a stratum.
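
As a sketch of what that proportional assignment might look like, here is a toy allocation in Python; the traffic figures and the HIT budget are made up for illustration.

# Sketch: split a HIT budget across browser-OS strata in proportion to the
# share of real-user traffic each stratum represents. Figures are invented.
observed_traffic = {
    ("Chrome", "Windows"): 45200,
    ("Firefox", "Windows"): 21300,
    ("Safari", "OS X"): 12800,
    ("IE8", "Windows"): 9700,
}

def allocate_stratified_hits(traffic, total_hits):
    total_users = sum(traffic.values())
    return {
        stratum: round(total_hits * count / total_users)
        for stratum, count in traffic.items()
    }

# Rounding means the allocations may not sum exactly to the budget.
print(allocate_stratified_hits(observed_traffic, total_hits=100))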

Continuous testing

As well as being a good pre-production test, MTurk is fantastic for continuous testing. As new browsers are released, existing browsers and operating systems are updated and environments change, the changes are reflected in your diverse pool of Workers. Changes in browsers and operating systems can automatically feed back into your strata for Stratified HITs. This lets you spot problems early and isn’t reliant on updating your in-house test setup.

How it fits together

Pages are served from our customers’ web servers with a link to the production spider.io JavaScript embedded. The production JavaScript is downloaded from our web servers and sends back an anonymised digest of the browser’s behaviour.

Periodically we update the HIT Allocator and Verifier’s list of common user environments. We allocate a mixture of Stratified and Random HITs, each of which requires either a scripted or a natural task. A verbose debug trace is sent up from the Worker’s browser, allowing us to test and improve our code.

 

Cost vs value

There is no doubt that using MTurk is an expensive way to test your JavaScript; however, as an addition to systematic in-house testing we believe it pays dividends. We pay about $0.05 for each 30-second HIT (the majority of our HITs actually take less than 30 seconds). This means we can get our script tested in 100 different real-world settings for $5, or roughly £3.

Contrast this with our test box: a £400 Mac mini running OS X with three Windows virtual machines, giving us a total of 13 browser–operating system pairs on one connection, one set of hardware and under constant load.

MTurk clearly has a place in the JavaScript tester’s toolbox.

Silicon Milkroundabout and the Spider.io Challenge

To celebrate spider.io’s attendance at Silicon Milkroundabout, the recruitment fair for startups, we’ve launched the Spider.io Challenge.

Can you hack it?

The Spider.io Challenge is a treasure hunt for hackers.

We’ve hidden fourteen codes in and around challenge.spider.io. With each code that you find, we’ll give you a clue to help you find the next code. As soon as you sign in to the challenge, the clock starts ticking. It’s a race against time. It’s a race against other hackers. Best of luck!

Searching for clues across the web stack is a lot like what we’re doing when we’re searching for web robots. Clues as to the true nature of a website visitor can be hidden anywhere. It’s our job to work as detectives, to solve the visitor puzzle.

Silicon Milkroundabout

Spider.io will have a stand at the Silicon Milkroundabout careers fair on Sunday, 30 October. If you’re interested in helping us catch bad people doing bad things, come over to our stand and have a chat.

Calling Out To Researchers/Academics

Esteemed researchers/academics,

Spider.io is eager to work closely with you, through joint research, sponsored research, CASE studentships, etc. We currently have close ties to the Department of Computing at Imperial College London, and we are looking to build up relationships with academics/researchers at other institutions. For more information, please get in touch at: research@spider.io.

If you are curious about our efforts, a sample bibliography is provided below.

Research Bibliography

 

The Problem With Client-Side Analytics

Today’s client-side analytics services are trivially spoofable, by which we mean that the numbers they present can be inflated by making no or very few requests to the host website.

This post highlights the problem and proposes a partial solution that substantially mitigates the issue with minimal effort. Our proposed solution is to include a digital signature in each message sent to the analytics provider. Apart from the server-side generation of the signature, all other components remain the same. This can be implemented in a couple of lines of code (snippets provided) and gives confidence that every line in your analytics is the result of a request to your web servers.

It is unclear whether nefarious characters exploit the current disconnect between client-side analytics services and their associated host websites. However, given the ease with which spoofing is possible, we suggest implementing a solution as soon as possible.

The analytics landscape

There are an increasing number of web analytics companies; they seem to be springing up on an almost daily basis. Google Analytics provides a cheap (free!) and cheerful, one-size-fits-all approach. It is very popular, used by half of the top 1 million websites and by over 12.3 million websites in total. For an indication of how many users make how many page views on your site, it is more than adequate.

More recently a series of new premium and real-time analytics companies have launched. These include ChartBeat, Coremetrics, MixPanel, GoSquared, LuckyOrange… (the list goes on); additionally, high-volume, high-performance solutions include SAS, Unica and, most recently, Google Analytics Premium. All of these services have one thing in common: in their recommended and most common mode of operation, the only integration required is the embedding of a small snippet of JavaScript.

How client-side analytics works

Typically a JavaScript file hosted on the analytics company’s servers (or, more commonly, the analytics company’s content distribution network) is embedded in every webpage to be monitored. On page load this JavaScript is requested and executed, which leads to an image pixel being requested with a series of URL-encoded parameters, typically including the time, a random number and some interesting data about the client machine (screen and browser dimensions, etc.). Additionally, as part of this request, the user’s browser sends information in the request headers, including the current URL, the User-Agent (a short description of the browser and operating system version) and a cookie (a small piece of data stored on the user’s machine that uniquely identifies them).

The analytics company writes all the data contained in the image-pixel request to a database and the aggregate numbers are your analytics. Simple!
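
To make the mechanism concrete, the sketch below builds the kind of image-pixel URL such a script typically ends up requesting. The endpoint and parameter names are invented for illustration; every provider uses its own scheme.

# Sketch: the shape of a typical tracking-pixel request. The endpoint and
# parameter names are hypothetical.
import random
import time
from urllib.parse import urlencode

params = {
    "acct": "UA-XXXXXX-X",            # the site's analytics account ID
    "dl": "http://example.com/page",  # current URL
    "t": int(time.time() * 1000),     # timestamp
    "rnd": random.randint(0, 2**31),  # cache-busting random number
    "sr": "1920x1080",                # screen resolution
    "vp": "1280x720",                 # browser viewport size
}
pixel_url = "http://analytics.example.com/__pixel.gif?" + urlencode(params)
print(pixel_url)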

The problem

Ease of installation and use has dramatically sped up the uptake of such services, but it leads to a fundamental weakness: it is trivially easy to spoof these analytics services while making no or very few requests to the host website. To illustrate, we implemented a spoofer for Google Analytics; we show its mischievous efforts below.

If business decisions, such as ad spend, A/B testing, product choices or even company valuations, are being made off the back of these analytics, those decisions could be based on very bad data.

How to spoof an analytics service

We reckon a script to spoof most analytics services can be written by a competent programmer in less than one hour. It involves opening the developer console in Safari or Chrome, or installing Firebug in Firefox, and looking at the image-pixel request being sent to the analytics server. Hitting refresh 15 times will generally give you enough information to work out which request headers and URL variables are being sent.

If any fields remain a mystery, the JavaScript source code is transmitted in plain text and is easy to reverse engineer.

This script can then be run locally, on a legitimate network of rented servers, or on an illegitimate bot network to make hundreds or millions of requests.

If you are a customer of an analytics company, a malicious user need make only a single request to your website to find out which analytics services you use and what your account ID is. After that, all further requests go to the analytics company, leaving no clues.
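
A minimal sketch of the kind of spoofing script described above is shown below. The endpoint, parameter names and header values are all invented; a real spoofer would simply replay whatever it observed in the developer console.

# Sketch: registering fake page views without ever touching the host website.
# Endpoint, parameters and headers are illustrative only.
import random
import time
import urllib.request
from urllib.parse import urlencode

def fire_fake_pageview(account_id, page_url):
    params = urlencode({
        "acct": account_id,
        "dl": page_url,
        "t": int(time.time() * 1000),
        "rnd": random.randint(0, 2**31),
    })
    req = urllib.request.Request(
        "http://analytics.example.com/__pixel.gif?" + params,
        headers={
            "Referer": page_url,
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0",
            "Cookie": "visitor_id=%d" % random.randint(0, 2**31),
        },
    )
    urllib.request.urlopen(req)

# Each call is counted as a page view by the analytics provider, yet the
# host website sees no traffic at all.
fire_fake_pageview("UA-XXXXXX-X", "http://example.com/")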

Current solutions

There are three solutions to this problem currently employed:

  1. Ignore it
  2. Install more analytics: three, four, five providers; one has to be right! Take the average of the closest two. But if you can spoof one, you can spoof them all.
  3. Server-side requests. Some solutions let you make requests straight from your server to the analytics provider via an API (e.g. Mixpanel); alternatively, you can hack this together by writing your own spoofer. The issue is that analytics engines lose the additional data gathered in JavaScript (screen size, etc.); furthermore, they use the fact that JavaScript was executed as a crude filter to remove bots and crawlers, so data gathered server-side is much noisier. More thorough providers offer hybrid solutions marrying log files with client-side requests (the approach taken at spider.io), but this can be overkill for small websites.

Required solution

There is a need for an analytics solution where an automated agent cannot spoof your analytics without your knowledge. Spider.io separates automated and systematic traffic from real users; however, we’re not saying you must work with us to get an accurate picture of your analytics. There is a much simpler solution that, although less comprehensive, could easily be implemented by current analytics companies. Client-side analytics is prone both to over-reporting numbers from spoofers and to under-reporting traffic from bots. Most websites are primarily concerned with their real user traffic, which makes over-reporting their primary concern. We propose a solution to this below.

Proposed solution: signed analytics

We believe that along with the image pixel request the publisher site needs to send a digital signature. A digital signature is a verifiable digest of the data being sent that guarantees the source and freshness of the data. If the digital signature doesn’t match, then data has been tampered with.

A unique digital signature of a random number and timestamp is embedded in each page alongside the request for the analytics provider’s JavaScript include. The signature is then sent from the user’s browser to the analytics provider’s servers with the image-pixel request. The analytics provider simply checks that each signature is seen only once and that the timestamp it contains is fresh. Note that there is no need to sign the whole message: signing the random number and timestamp is sufficient for uniqueness and freshness.

By its nature, generating a digital signature requires a secret key to be stored on the publisher’s server, where it cannot be read by JavaScript. This means signed analytics requires a change to server-side code, making installation a little harder (but not much!).

Possible implementation

There are many ways to implement the inclusion of a digital signature in the messages sent by client-side analytics. Here we propose generating a SHA-1 HMAC server-side and including it in the fragment of the JavaScript include’s src URL. The included JavaScript simply needs to grab the fragment and send it along with any other messages sent to the analytics endpoint. The most common languages that large-scale websites are written in are PHP, Python and ASP. Below are examples of the required installation changes to incorporate a digital signature:

PHP

$r = rand();
$ts = time();
// Sign the timestamp concatenated with the random number.
$ds = hash_hmac("sha1", $ts . $r, "SECRET_KEY");
echo "<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts=$ts&r=$r&ds=$ds\"></script>";

Python

import hashlib
import hmac
import random
import time

r = random.random()
ts = int(time.time())
# Sign the timestamp concatenated with the random number (as in the PHP example).
ds = hmac.new(b"SECRET_KEY", ("%s%s" % (ts, r)).encode(), hashlib.sha1).hexdigest()
print("<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts=%s&r=%s&ds=%s\"></script>" % (ts, r, ds))

ASP.NET – C#

Random rand = new Random();
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
byte[] keyBytes = encoding.GetBytes("SECRET_KEY");
String ts = DateTime.Now.Ticks.ToString();
String r = rand.Next().ToString();
byte[] messageBytes = encoding.GetBytes(r + ts);
System.Security.Cryptography.HMACSHA1 hmacsha1 = new System.Security.Cryptography.HMACSHA1(keyBytes);
byte[] dsBytes = hmacsha1.ComputeHash(messageBytes);
// Hex-encode the HMAC, dropping BitConverter's hyphens.
String ds = BitConverter.ToString(dsBytes).Replace("-", "");
Response.Write(
    String.Format(
        "<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts={0}&r={1}&ds={2}\"></script>", ts, r, ds
    )
);
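
For completeness, here is a sketch of the check the analytics provider would run on each incoming image-pixel request, in Python. It assumes the timestamp-then-random-number message order of the PHP and Python snippets above (generation and verification simply have to agree), and the freshness window and in-memory store are illustrative.

# Sketch: provider-side verification of a signed pixel request. Recompute
# the HMAC, check the timestamp is fresh and reject signatures already seen.
# The 300-second window and in-memory set are illustrative; production
# would use a shared store.
import hashlib
import hmac
import time

SECRET_KEY = b"SECRET_KEY"
FRESHNESS_WINDOW = 300  # seconds
seen_signatures = set()

def verify(ts, r, ds):
    expected = hmac.new(SECRET_KEY, ("%s%s" % (ts, r)).encode(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(expected, ds):
        return False  # not signed with the publisher's key, or tampered with
    if abs(time.time() - float(ts)) > FRESHNESS_WINDOW:
        return False  # stale timestamp: possibly a replay
    if ds in seen_signatures:
        return False  # this signature has already been counted once
    seen_signatures.add(ds)
    return True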

Advantages of signed analytics

Digital signatures have the following advantages over the traditional JavaScript approach:

  1. Every line in your analytics is guaranteed to be the result of a request to your web servers, so page-view counts cannot be inflated without touching your site.
  2. Replayed or fabricated image-pixel requests are rejected, because each signature is accepted only once and its timestamp must be fresh.
  3. Installation requires only a couple of lines of server-side code; the rest of the analytics pipeline is unchanged.

What signed analytics does not provide

Signed analytics does not address under-reporting: automated agents that genuinely request your pages still produce validly signed requests and will still be counted. For this, a heavier-weight hybrid (client-side and server-side) solution is required.

Trusting your numbers

Having analytics you don’t trust is as bad as having no analytics at all: bad decisions are based on bad data. If your analytics are based on digital signatures, you can at least be confident that your numbers are not being trivially spoofed.

How To Catch A Bot

At spider.io we’re in the business of catching automated web traffic. This is a short post introducing some of the clues we analyse and why.

In its simplest form a bot has two components: a priority queue of web pages to crawl, and a loop that pops the next item off the queue and downloads it. A trivial example, which downloads the Alexa top 1 million webpages as images using Paul Hammond’s webkit2png project, can be seen below.

#!/bin/bash
# Screenshot every site in the Alexa top-1m list ("rank,domain" per line).
cat top-1m.csv | while read f; do
    ./webkit2png -s 1 -C -D ./out "http://$(echo "$f" | cut -d, -f2)"
done

More complex bots add some processing either in or after the loop and periodically update their priority queue.
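
A sketch of that two-component structure with a priority queue in place of the flat Alexa list is given below; the scoring and the link extraction are placeholders for whatever the bot operator actually cares about.

# Sketch: a priority queue of URLs plus a download loop that refills it.
# The fixed scores and the regex link extraction are placeholders.
import heapq
import re
import urllib.request

queue = [(-1.0, "http://example.com/")]  # (negative priority, url): highest priority first
seen = set()

while queue:
    neg_priority, url = heapq.heappop(queue)
    if url in seen:
        continue
    seen.add(url)
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")

    # "Processing" step: extract links and push them back with some score.
    for link in re.findall(r'href="(http[^"]+)"', html):
        heapq.heappush(queue, (-0.5, link))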

By analysing several clues it is possible to identify these two bot components.

Catching the priority queue

Bots don’t navigate sites like people do. Typically they make more requests, at a higher speed, and rather than navigating a site with purpose they select pages systematically. This makes the click trace (the list of pages a user has visited) of a bot substantially different from that of a typical user.

Click-trace analysis is the traditional approach to detecting bots. The advantage of catching a bot based on its click trace is that this is independent of the technology used by the assailant. The disadvantage is that you need to have seen enough clicks from the same bot before you can classify it, and you need to be able to identify those clicks as all coming from the same bot.
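
As a toy illustration of the sort of features click-trace analysis looks at, here is a crude sketch; the features and thresholds are invented and bear no relation to our actual classifier.

# Sketch: two crude click-trace features, request rate and how systematically
# pages are visited. Thresholds are purely illustrative.
def looks_like_a_bot(click_trace):
    """click_trace: list of (timestamp_seconds, url) pairs for one visitor."""
    if len(click_trace) < 2:
        return False
    timestamps = [t for t, _ in click_trace]
    duration = max(timestamps) - min(timestamps) or 1
    requests_per_minute = 60.0 * len(click_trace) / duration

    # Systematic crawling: pages requested in near-perfect lexicographic order.
    urls = [u for _, u in click_trace]
    in_order = sum(a <= b for a, b in zip(urls, urls[1:])) / (len(urls) - 1)

    return requests_per_minute > 120 or in_order > 0.95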

Catching the downloader

Real users download web pages with web browsers, onto a computer or mobile device. There are several distinctive activity streams that accompany such a download at the different levels of the OSI model, and by identifying these activity streams we can check that a normal download is taking place.

We catch the majority of bots based on how they download individual pages. This has a number of advantages: we only need to see a single page request from a bot to be able to catch it; we can catch bots that distribute themselves across multiple IPs; and we can recognise bot requests hidden amongst legitimate user traffic from the same IP.

Find the motive, find the perpetrator

The line gets quoted at least once in any decent cop show, and the same applies to bots: once you know what the villain is likely to do, they are much easier to spot. Unfortunately this is an arms race; once the bot creators know where you’re looking, the bots become harder to find.