Testing JavaScript with Amazon’s Mechanical Turk

This is the first in a short series of blog posts covering our approach to JavaScript testing.

The need for systematic JavaScript testing in the wild

Like most companies, we test our JavaScript thoroughly in-house: we deploy to a staging setup identical to our production setup, then run a suite of Selenium tests on a mixture of real and virtual machines covering the common browser–operating system pairings we see in the wild. Assuming the JavaScript passes these tests, we then test on iOS and Android.
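
As a flavour of what that Selenium stage might look like, here is a minimal sketch in Python using the older desired_capabilities style of the Selenium bindings; the hub URL, capability list, page URL and the window.__analyticsLoaded flag are all hypothetical, not our actual suite.

# Minimal sketch: run the same check against several browser-OS pairings
# via a Selenium Grid hub. Hub URL, capabilities and the success flag are
# illustrative only.
from selenium import webdriver

PAIRINGS = [
    {"browserName": "firefox", "platform": "WINDOWS"},
    {"browserName": "chrome", "platform": "MAC"},
    {"browserName": "internet explorer", "platform": "WINDOWS"},
]

for caps in PAIRINGS:
    driver = webdriver.Remote(
        command_executor="http://selenium-hub.internal:4444/wd/hub",
        desired_capabilities=caps,
    )
    try:
        driver.get("http://stage.example.com/instrumented-page")
        # Assume the embedded JavaScript sets a flag once it has run successfully.
        assert driver.execute_script("return window.__analyticsLoaded === true;")
    finally:
        driver.quit()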

Most companies would stop there and push to production. We don’t.

There are a great many subtle effects and unexplained artifacts that occur in the wild but just don’t happen in a testing environment: unusual combinations of browser, operating system, hardware, connection and load, among other things.

Creating every possible combination of these in a test environment is not tractable, so we take a statistical approach and sample from the population (users of the Internet).

Amazon’s Mechanical Turk

If you haven’t come across Amazon’s Mechanical Turk, or MTurk for short, we highly recommend researching it. Essentially it is a crowd-sourcing platform. You pay lots of people (Workers) small amounts of money to work on simple tasks (Human Intelligence Tasks or HITs).

Workers form our sample of real world users.

We have two types of HIT: those that set the Worker a scripted task and those that set a natural, unscripted task.

In both cases the JavaScript is executed in a debug mode that streams raw data, partially computed data and error messages back to us. These results allow us to troubleshoot data collection and provide ground truth for testing our data processing.

Handling bias

Clearly the pool of MTurk Workers is not strictly representative of our customers’ users. To handle this we split our tasks into two pools: Random HITs, which any Worker may pick up, and Stratified HITs, which require a specific browser–operating system setup.

HITs requiring a specific setup are common on MTurk, as many tasks require esoteric applets, environments or settings. Stratified HITs are assigned in the ratio in which we see users, with each browser–operating system pair making up a stratum.
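
As a sketch of what that proportional assignment might look like, here is a toy allocation in Python; the traffic figures and the HIT budget are made up for illustration.

# Sketch: split a HIT budget across browser-OS strata in proportion to the
# share of real-user traffic each stratum represents. Figures are invented.
observed_traffic = {
    ("Chrome", "Windows"): 45200,
    ("Firefox", "Windows"): 21300,
    ("Safari", "OS X"): 12800,
    ("IE8", "Windows"): 9700,
}

def allocate_stratified_hits(traffic, total_hits):
    total_users = sum(traffic.values())
    return {
        stratum: round(total_hits * count / total_users)
        for stratum, count in traffic.items()
    }

# Rounding means the allocations may not sum exactly to the budget.
print(allocate_stratified_hits(observed_traffic, total_hits=100))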

Continuous testing

As well as being a good pre-production test, MTurk is fantastic for continuous testing. As new browsers are released, existing browsers and operating systems are updated and environments change, the changes are reflected in your diverse pool of Workers. Changes in browsers and operating systems can automatically feed back into your strata for Stratified HITs. This lets you spot problems early and isn’t reliant on updating your in-house test setup.

How it fits together

Pages are served from our customers’ web servers with a link to the production spider.io JavaScript embedded. The production JavaScript is downloaded from our web servers and sends back an anonymised digest of the browser’s behaviour.

Periodically we update the HIT Allocator and Verifier’s list of common user environments. We allocate a mixture of Stratified and Random HITs, each of which requires either a scripted or a natural task. A verbose debug trace is sent up from the Worker’s browser, allowing us to test and improve our code.

 

Cost vs value

There is no doubt that using MTurk is an expensive way to test your JavaScript; however, as an addition to systematic in-house testing we believe it pays dividends. We pay about $0.05 for each 30-second HIT (the majority of our HITs actually take less than 30 seconds). This means we can get our script tested in 100 different real-world settings for $5, or roughly £3.

Contrast this with our test box: a £400 Mac mini running OS X with three Windows virtual machines, giving us a total of 13 browser–operating system pairs on one connection, one set of hardware and under constant load.

MTurk clearly has a place in the JavaScript tester’s toolbox.

Silicon Milkroundabout and the Spider.io Challenge

To celebrate spider.io’s attendance at Silicon Milkroundabout, the recruitment fair for startups, we’ve launched the Spider.io Challenge.

Can you hack it?

The Spider.io Challenge is a treasure hunt for hackers.

We’ve hidden fourteen codes in and around challenge.spider.io. With each code that you find, we’ll give you a clue to help you find the next code. As soon as you sign in to the challenge, the clock starts ticking. It’s a race against time. It’s a race against other hackers. Best of luck!

Searching for clues across the web stack is a lot like what we’re doing when we’re searching for web robots. Clues as to the true nature of a website visitor can be hidden anywhere. It’s our job to work as detectives, to solve the visitor puzzle.

Silicon Milkroundabout

Spider.io will have a stand at the Silicon Milkroundabout careers fair on Sunday, 30 October. If you’re interested in helping us catch bad people doing bad things, come over to our stand and have a chat.

Calling Out To Researchers/Academics

Esteemed researchers/academics,

Spider.io is eager to work closely with you, through joint research, sponsored research, CASE studentships, etc. We currently have close ties to the Department of Computing at Imperial College London, and we are looking to build up relationships with academics/researchers at other institutions. For more information, please get in touch at: research@spider.io.

If you are curious about our efforts, a sample bibliography is provided below.

Research Bibliography

 

The Problem With Client-Side Analytics

Today’s client-side analytics services are trivially spoofable, by which we mean that the numbers they present can be inflated by making no or very few requests to the host website.

This post highlights the problem and proposes a partial solution that substantially mitigates the issue with minimal effort. Our proposed solution is to include a digital signature in each message sent to the analytics provider. Apart from the server-side generation of the signature, all other components remain the same. This can be implemented in a couple of lines of code (snippets provided) and gives confidence that every line in your analytics is the result of a request to your web servers.

It is unclear whether nefarious characters exploit the current disconnect between client-side analytics services and their associated host websites. However, given the ease with which spoofing is possible, we suggest implementing a solution as soon as possible.

The analytics landscape

There are an increasing number of web analytics companies; they seem to be springing up on an almost daily basis. Google Analytics provides a cheap (free!) and cheerful, one-size-fits-all approach. It is very popular, used by half of the top 1 million websites and by over 12.3 million websites in total. For an indication of how many users make how many page views on your site, it is more than adequate.

More recently a series of new premium and real-time analytics companies have launched. These include ChartBeat, Coremetrics, MixPanel, GoSquared, LuckyOrange… (the list goes on); additionally, high-volume, high-performance solutions include SAS, Unica and, most recently, Google Analytics Premium. All of these services have one thing in common: in their recommended and most common mode of operation, the only integration required is the embedding of a small snippet of JavaScript.

How client-side analytics works

Typically a JavaScript file hosted on the analytics company’s servers (or, more commonly, the analytics company’s content distribution network) is embedded in every webpage to be monitored. On page load this JavaScript is requested and executed, which leads to an image pixel being requested with a series of URL-encoded parameters, typically including the time, a random number and some interesting data about the client machine (screen and browser dimensions, etc.). Additionally, as part of this request, the user’s browser sends information in the request headers, including the current URL, the User-Agent (a short description of the browser and operating system version) and a cookie (a small piece of data stored on the user’s machine that uniquely identifies them).

The analytics company writes all the data contained in the image-pixel request to a database and the aggregate numbers are your analytics. Simple!
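
To make the mechanism concrete, the sketch below builds the kind of image-pixel URL such a script typically ends up requesting. The endpoint and parameter names are invented for illustration; every provider uses its own scheme.

# Sketch: the shape of a typical tracking-pixel request. The endpoint and
# parameter names are hypothetical.
import random
import time
from urllib.parse import urlencode

params = {
    "acct": "UA-XXXXXX-X",            # the site's analytics account ID
    "dl": "http://example.com/page",  # current URL
    "t": int(time.time() * 1000),     # timestamp
    "rnd": random.randint(0, 2**31),  # cache-busting random number
    "sr": "1920x1080",                # screen resolution
    "vp": "1280x720",                 # browser viewport size
}
pixel_url = "http://analytics.example.com/__pixel.gif?" + urlencode(params)
print(pixel_url)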

The problem

Ease of installation and use has dramatically sped up the uptake of such services, but it leads to a fundamental weakness: it is trivially easy to spoof these analytics services while making no or very few requests to the host website. To illustrate, we implemented a spoofer for Google Analytics; we show its mischievous efforts below.

If business decisions, such as ad spend, A/B testing, product choices or even company valuations, are being made off the back of these analytics, those decisions could be based on very bad data.

How to spoof an analytics service

We reckon a script to spoof most analytics services can be written by a competent programmer in less than one hour. It involves opening the developer console in Safari or Chrome, or installing Firebug in Firefox, and looking at the image-pixel request being sent to the analytics server. Hitting refresh 15 times will generally give you enough information to work out which request headers and URL variables are being sent.

If any fields remain a mystery, the JavaScript source code is transmitted in plain text and is easy to reverse engineer.

This script can then be run locally, on a legitimate network of rented servers, or on an illegitimate bot network to make hundreds or millions of requests.

If you are a customer of an analytics company, a malicious user need make only a single request to your website to find out which analytics services you use and what your account ID is. After that, all further requests go to the analytics company, leaving no clues.
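
A minimal sketch of the kind of spoofing script described above is shown below. The endpoint, parameter names and header values are all invented; a real spoofer would simply replay whatever it observed in the developer console.

# Sketch: registering fake page views without ever touching the host website.
# Endpoint, parameters and headers are illustrative only.
import random
import time
import urllib.request
from urllib.parse import urlencode

def fire_fake_pageview(account_id, page_url):
    params = urlencode({
        "acct": account_id,
        "dl": page_url,
        "t": int(time.time() * 1000),
        "rnd": random.randint(0, 2**31),
    })
    req = urllib.request.Request(
        "http://analytics.example.com/__pixel.gif?" + params,
        headers={
            "Referer": page_url,
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0",
            "Cookie": "visitor_id=%d" % random.randint(0, 2**31),
        },
    )
    urllib.request.urlopen(req)

# Each call is counted as a page view by the analytics provider, yet the
# host website sees no traffic at all.
fire_fake_pageview("UA-XXXXXX-X", "http://example.com/")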

Current solutions

There are three solutions to this problem currently employed:

  1. Ignore it
  2. Install more analytics: three, four, five providers; one has to be right! Take the average of the closest two. But if you can spoof one, you can spoof them all.
  3. Server-side requests. Some solutions let you make requests straight from your server to the analytics provider via an API (e.g. Mixpanel); alternatively, you can hack this together by writing your own spoofer. The issue is that analytics engines lose the additional data gathered in JavaScript (screen size, etc.); furthermore, they use the fact that JavaScript was executed as a crude filter to remove bots and crawlers, so data gathered server-side is much noisier. More thorough providers offer hybrid solutions marrying log files with client-side requests (the approach taken at spider.io), but this can be overkill for small websites.

Required solution

There is a need for an analytics solution where an automated agent cannot spoof your analytics without your knowledge. Spider.io separates automated and systematic traffic from real users; however, we’re not saying you must work with us to get an accurate picture of your analytics. There is a much simpler solution that, although less comprehensive, could easily be implemented by current analytics companies. Client-side analytics is prone both to over-reporting numbers from spoofers and to under-reporting traffic from bots. Most websites are primarily concerned with their real user traffic, which makes over-reporting their primary concern. We propose a solution to this below.

Proposed solution: signed analytics

We believe that along with the image pixel request the publisher site needs to send a digital signature. A digital signature is a verifiable digest of the data being sent that guarantees the source and freshness of the data. If the digital signature doesn’t match, then data has been tampered with.

A unique digital signature of a random number and timestamp is embedded in each page alongside the request for the analytics provider’s JavaScript include. The signature is then sent from the user’s browser to the analytics provider’s servers with the image-pixel request. The analytics provider simply checks that each signature is seen only once and that the timestamp it contains is fresh. Note that there is no need to sign the whole message: signing the random number and timestamp is sufficient for uniqueness and freshness.

By its nature, generating a digital signature requires a secret key to be stored on the publisher’s server, where it cannot be read by JavaScript. This means signed analytics requires a change to server-side code, making installation a little harder (but not much!).

Possible implementation

There are many ways to implement the inclusion of a digital signature in the messages sent by client-side analytics. Here we propose generating a SHA-1 HMAC server-side and including it in the fragment of the JavaScript include’s src URL. The included JavaScript simply needs to grab the fragment and send it along with any other messages sent to the analytics endpoint. The most common languages that large-scale websites are written in are PHP, Python and ASP. Below are examples of the required installation changes to incorporate a digital signature:

PHP

$r = rand();
$ts = time();
// Sign the timestamp concatenated with the random number.
$ds = hash_hmac("sha1", $ts . $r, "SECRET_KEY");
echo "<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts=$ts&r=$r&ds=$ds\"></script>";

Python

import hashlib
import hmac
import random
import time

r = random.random()
ts = int(time.time())
# Sign the timestamp concatenated with the random number (as in the PHP example).
ds = hmac.new(b"SECRET_KEY", ("%s%s" % (ts, r)).encode(), hashlib.sha1).hexdigest()
print("<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts=%s&r=%s&ds=%s\"></script>" % (ts, r, ds))

ASP.NET – C#

Random rand = new Random();
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
byte[] keyBytes = encoding.GetBytes("SECRET_KEY");
String ts = DateTime.Now.Ticks.ToString();
String r = rand.Next().ToString();
byte[] messageBytes = encoding.GetBytes(r + ts);
System.Security.Cryptography.HMACSHA1 hmacsha1 = new System.Security.Cryptography.HMACSHA1(keyBytes);
byte[] dsBytes = hmacsha1.ComputeHash(messageBytes);
// Hex-encode the HMAC, dropping BitConverter's hyphens.
String ds = BitConverter.ToString(dsBytes).Replace("-", "");
Response.Write(
    String.Format(
        "<script id=\"example-com-analytics\" src=\"http://example.com/analytics.js#ts={0}&r={1}&ds={2}\"></script>", ts, r, ds
    )
);
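
For completeness, here is a sketch of the check the analytics provider would run on each incoming image-pixel request, in Python. It assumes the timestamp-then-random-number message order of the PHP and Python snippets above (generation and verification simply have to agree), and the freshness window and in-memory store are illustrative.

# Sketch: provider-side verification of a signed pixel request. Recompute
# the HMAC, check the timestamp is fresh and reject signatures already seen.
# The 300-second window and in-memory set are illustrative; production
# would use a shared store.
import hashlib
import hmac
import time

SECRET_KEY = b"SECRET_KEY"
FRESHNESS_WINDOW = 300  # seconds
seen_signatures = set()

def verify(ts, r, ds):
    expected = hmac.new(SECRET_KEY, ("%s%s" % (ts, r)).encode(), hashlib.sha1).hexdigest()
    if not hmac.compare_digest(expected, ds):
        return False  # not signed with the publisher's key, or tampered with
    if abs(time.time() - float(ts)) > FRESHNESS_WINDOW:
        return False  # stale timestamp: possibly a replay
    if ds in seen_signatures:
        return False  # this signature has already been counted once
    seen_signatures.add(ds)
    return True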

Advantages of signed analytics

Digital signatures have the following advantages over the traditional JavaScript approach:

  1. Every line in your analytics is guaranteed to be the result of a request to your web servers, so page-view counts cannot be inflated without touching your site.
  2. Replayed or fabricated image-pixel requests are rejected, because each signature is accepted only once and its timestamp must be fresh.
  3. Installation requires only a couple of lines of server-side code; the rest of the analytics pipeline is unchanged.

What signed analytics does not provide

Signed analytics does not address under-reporting: automated agents that genuinely request your pages still produce validly signed requests and will still be counted. For this, a heavier-weight hybrid (client-side and server-side) solution is required.

Trusting your numbers

Having analytics you don’t trust is as bad as having no analytics at all: bad decisions are based on bad data. If your analytics are based on digital signatures, you can at least be confident that your numbers are not being trivially spoofed.

How To Catch A Bot

At spider.io we’re in the business of catching automated web traffic. This is a short post introducing some of the clues we analyse and why.

In its simplest form a bot has two components: a priority queue of web pages to crawl, and a loop that pops the next item off the queue and downloads it. A trivial example, which downloads the Alexa top 1 million webpages as images using Paul Hammond’s webkit2png project, can be seen below.

#!/bin/bash
# Screenshot every site in the Alexa top-1m list ("rank,domain" per line).
cat top-1m.csv | while read f; do
    ./webkit2png -s 1 -C -D ./out "http://$(echo "$f" | cut -d, -f2)"
done

More complex bots add some processing either in or after the loop and periodically update their priority queue.
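
A sketch of that two-component structure with a priority queue in place of the flat Alexa list is given below; the scoring and the link extraction are placeholders for whatever the bot operator actually cares about.

# Sketch: a priority queue of URLs plus a download loop that refills it.
# The fixed scores and the regex link extraction are placeholders.
import heapq
import re
import urllib.request

queue = [(-1.0, "http://example.com/")]  # (negative priority, url): highest priority first
seen = set()

while queue:
    neg_priority, url = heapq.heappop(queue)
    if url in seen:
        continue
    seen.add(url)
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")

    # "Processing" step: extract links and push them back with some score.
    for link in re.findall(r'href="(http[^"]+)"', html):
        heapq.heappush(queue, (-0.5, link))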

By analysing several clues it is possible to identify these two bot components.

Catching the priority queue

Bots don’t navigate sites like people do. Typically they make more requests, at a higher speed, and rather than navigating a site with purpose they select pages systematically. This makes the click trace (the list of pages a user has visited) of a bot substantially different from that of a typical user.

Click-trace analysis is the traditional approach to detecting bots. The advantage of catching a bot based on its click trace is that this is independent of the technology used by the assailant. The disadvantage is that you need to have seen enough clicks from the same bot before you can classify it, and you need to be able to identify those clicks as all coming from the same bot.
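
As a toy illustration of the sort of features click-trace analysis looks at, here is a crude sketch; the features and thresholds are invented and bear no relation to our actual classifier.

# Sketch: two crude click-trace features, request rate and how systematically
# pages are visited. Thresholds are purely illustrative.
def looks_like_a_bot(click_trace):
    """click_trace: list of (timestamp_seconds, url) pairs for one visitor."""
    if len(click_trace) < 2:
        return False
    timestamps = [t for t, _ in click_trace]
    duration = max(timestamps) - min(timestamps) or 1
    requests_per_minute = 60.0 * len(click_trace) / duration

    # Systematic crawling: pages requested in near-perfect lexicographic order.
    urls = [u for _, u in click_trace]
    in_order = sum(a <= b for a, b in zip(urls, urls[1:])) / (len(urls) - 1)

    return requests_per_minute > 120 or in_order > 0.95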

Catching the downloader

Real users download web pages with web browsers, onto a computer or mobile device. There are several distinctive activity streams that accompany such a download at the different levels of the OSI model, and by identifying these activity streams we can check that a normal download is taking place.

We catch the majority of bots based on how they download individual pages. This has a number of advantages: we only need to see a single page request from a bot to be able to catch it; we can catch bots that distribute themselves across multiple IPs; and we can recognise bot requests hidden amongst legitimate user traffic from the same IP.

Find the motive, find the perpetrator

The line gets quoted at least once in any decent cop show, and the same applies to bots: once you know what the villain is likely to do, they are much easier to spot. Unfortunately this is an arms race; once the bot creators know where you’re looking, the bots become harder to find.