What’s in an IP address?

At spider.io our aim is to pick out all deviant website traffic. This involves trying to spot anomalies in enormous data streams in real time, where these data streams are informationally sparse and—what’s worse—much of what we see is not as it initially appears. This is because, as much as we are trying to identify unwanted website activity, the perpetrators are trying just as hard to evade our efforts. In this cat-and-mouse game almost every information source can be spoofed/masked/reverse-engineered/riddled with misdirection.

In our hunt for content-scrapers, spam injectors, click jackers, mouse jackers, cookie stuffers and phantom traffic, we look for visitor fingerprints, which are difficult to spoof—in terms of, for example, client-side behaviour, access patterns and consistency across the OSI layers, from the application layer to below the TCP layer.

In this post we consider a much simpler type of visitor fingerprint which is particularly difficult to fake: the requesting IP address. (NB The requesting IP address may not be the originating IP address, but that is for another post.) The IP address, the core routing key for the Internet, is present in every single HTTP request, and if any person who sends a request wants to receive the associated response, then they have to give their IP address.

In this post we consider four helpful clues provided to us by an IP address.

Geolocation

Firstly, given an IP address, we can say something about the real-world geographic location or geolocation of the requesting website visitor.

Whilst it’s well known that many web services over the years have been geolocating IP addresses, it may come as a surprise just how cheap (free) and easy it is to get hold of reliable geolocation data for IPs. Consider Maxmind, for example. It offers a downloadable database of geolocation data, which allows an IP address to be located in milliseconds, accurate to city level. This database is absolutely free. And to really help you get up and running, Maxmind even throws in an API in the language of your choice.
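
As a rough illustration of how little code this takes, here is a minimal sketch using MaxMind's geoip2 Python package and a downloaded GeoLite2 City database file. Neither the package nor the file name comes from the post itself, and the IP looked up is just a well-known public resolver used as an example.

```python
# Minimal sketch: city-level geolocation against a local MaxMind database.
# Assumes `pip install geoip2` and a downloaded GeoLite2-City.mmdb file.
import geoip2.database

reader = geoip2.database.Reader("GeoLite2-City.mmdb")
response = reader.city("8.8.8.8")  # example lookup: a well-known public DNS resolver

print(response.country.iso_code)                              # e.g. "US"
print(response.city.name)                                     # city, where known
print(response.location.latitude, response.location.longitude)

reader.close()
```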

Whois

Okay, so we’ve got our first clue: IP geolocation. We can also say something about who or what is associated with any particular IP address. To do so we turn first to an aptly named service, Whois. Originally devised in the early 1980s as a human-readable directory of IP address allocations, the Whois service is almost as old as the internet itself. To make a Whois query, we open a TCP connection to a Whois server, send across the IP, and get back such informational goodies as the company/organisation that owns the IP/range and associated contact information.
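
The protocol really is that simple. Below is a bare-bones sketch using nothing but Python's socket module; whois.arin.net is used as an illustrative server (in practice you follow referrals to whichever regional registry, such as RIPE or APNIC, actually holds the range).

```python
# A bare-bones Whois query: open TCP port 43, send the IP, read the reply.
import socket

def whois(ip, server="whois.arin.net", port=43):
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((ip + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

print(whois("8.8.8.8"))  # prints the owning organisation and contact details
```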

DNS

Next we turn to DNS, the phonebook of the internet. DNS records map domain names to IP addresses, but it’s actually trivially easy to do the reverse: i.e. look up a domain name from just an IP address. Whenever a domain name is registered for, say, spider.io, a public DNS record is added by the registrar for that domain (indeed you wouldn’t be reading this page without the help of such a record). But in addition to that, most organisations register reverse DNS records, which point from the owner’s IPs back to the owning domain (enthusiastic readers may notice that this website is hosted on Amazon’s EC2).

So with a quick reverse DNS lookup, can we put a domain name to the IP address? Well, not so fast. For a start, the reverse DNS record often isn’t there. And remember what we said earlier about faking it? Well, reverse DNS is something that, unfortunately for us, goes on the list as ‘easily faked’. It’s often the case that the owner of the IP address is responsible for maintaining the reverse DNS entry, which is rather handy if, for instance, you want your homemade crawler to look like it’s registered to googlebot.com. With that in mind, a common way to verify a reverse DNS entry is to use the result (the domain name referenced by the reverse lookup) to perform a subsequent forward DNS lookup, a technique called forward-confirmed reverse DNS. If the forward lookup brings you back to the IP in question, you can be satisfied that this is a valid reverse DNS record.
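
A minimal sketch of that verification round trip, using the standard library only; the IP used here is simply an example from Google's published crawler range, not anything specific to our systems.

```python
# Forward-confirmed reverse DNS: take the PTR result, resolve it forwards
# again, and only trust it if we arrive back at the same IP.
import socket

def forward_confirmed_reverse_dns(ip):
    try:
        host, _, _ = socket.gethostbyaddr(ip)              # reverse (PTR) lookup
    except socket.herror:
        return None                                        # no reverse record at all
    try:
        _, _, forward_ips = socket.gethostbyname_ex(host)  # forward (A) lookup
    except socket.gaierror:
        return None
    return host if ip in forward_ips else None             # confirmed only on a round trip

# A genuine Googlebot address should confirm to a googlebot.com hostname;
# a homemade crawler with a forged PTR record will not survive the round trip.
print(forward_confirmed_reverse_dns("66.249.66.1"))
```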

DNSBLs

DNS blacklists, or DNSBLs, are another service built on top of the DNS. DNSBLs contain IP addresses flagged for deviant/nefarious activity—spamming, in particular. By querying various DNSBLs we can put together an online character reference for an IP address, to the extent that we may be able to identify known Tor exit nodes, or machines that have been flagged as scanning drones.
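
Queries are just ordinary DNS lookups: reverse the octets of the IP, append the blacklist zone, and ask for an A record. The sketch below uses zen.spamhaus.org purely as an example zone (note that some lists restrict queries made via large public resolvers), and the test address 127.0.0.2 is the conventional always-listed entry.

```python
# Querying a DNSBL: an answer means the IP is listed; NXDOMAIN means it is not.
import socket

def dnsbl_listed(ip, zone="zen.spamhaus.org"):
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        return socket.gethostbyname(query)   # e.g. "127.0.0.2" encodes the listing reason
    except socket.gaierror:
        return None                          # not listed (or the lookup failed)

print(dnsbl_listed("127.0.0.2"))  # the conventional test address, always listed
```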

So from a single IP address—without even considering behaviour associated with this IP—we can already start to build a compelling picture of who our mystery visitor might be and why they have visited.

Physical Hack Day

The Introduction

The first spider.io hack day, led with diabolical cleverness by Dr Overell, produced our doomsday device.

For hack day 2, we decided to make the natural move from bits and bytes to yeast and spouts. We decided to build a bar in the office—with Octobeer: The GitHub Kegerator Project providing some early inspiration.

Arriving in the office in the early hours of the morning, Vegard took to the whiteboard wall like a Norwegian to a bonfire during midsummer madness, and led our brainstorming effort. The theme for the day: Grand Designs meets Scrapheap Challenge—construct a beautiful bar out of whatever materials are available to us. Vegard’s attempts at drawing parallels between software development and the hack day were met with suitable jeering and dismissive disdain (“blah, blah, blah…”, “Dude, we’re not in some kind of management consulting pep-talk situation here!”). Right, moving on swiftly then…

The Schedule

The Design

As one might expect, Spider Towers is not usually particularly well stocked with building materials, bar equipment or power tools, so some pre-planning was required. Wickes, eBay, Screwfix, Anagramltd.com and a1barstuff.co.uk were scoured beforehand for items that might come in handy. We’re quite sure the owners of the antique mahogany table were not expecting it to be drilled and sawn, its extension turned into a beer engine cover and its legs turned into magnificent bar stools. However, despite the purchasing being done in advance, designing the bar was left entirely for the hack day.

Iterating and pivoting through several designs, we finally arrived at one with two towers set symmetrically on top of the table, which itself sits on a hexagonal base (to maximise the storage space under the rounded table—thoughts of creating an oval base were quickly dismissed by our resident builder/designer/wood hacker). The base would feature doors on the back for easy access, and insulation to keep the pre-chilled ale cool for as long as possible.

Building the Bar

Three distinct tasks were identified:

  1. Build the base
  2. Create a tower to cover the beer engine and provide an anchor point for the lever action
  3. Adapt the table with suitable openings to accommodate the beer engine tower (the lager font is going to be made as part 2 of the project)

People gravitated towards their preferred job, teams formed and work began.

The Base

The Beer Tower

The Lunch

The Completion

The Enjoyment

Our First Hack Day—Black Hat

Doc Overell’s email arrived before the birds started chirping: “Some notes for Friday’s Hack day.” Opening the companion document revealed what we had all been waiting for, the brief for the first spider.io hack day—one day to spend scheming and implementing from the perspective of the enemy.

The Brief

In 12 hours from conception to proof-of-concept demo, build a product or service that can generate revenue from Internet traffic.

The Rules

The Schedule

The Day

The day kicked off at 08:20 with Simon in the chair for the brainstorming session, which lasted until 9am.

A clear favourite quickly emerged. Details will be omitted here. Suffice to say, it was a nemesis bot. The reason this was so popular was that we’d be creating our worst nightmare: a distributed legal way of crawling that is probably the hardest to detect and block.

The project was then generalised from a distributed bot to a job-distribution framework (not limited to crawling) that any service could plug into. However, the team was hungry for a target to test the product on, so why not kill two birds with one stone? Details of the precise application are also omitted.

The Architecture

Project agreed, Ashley took up the reins to design the architecture.

The system was split into 6 components:

JSON-RPC was agreed as the default spit and glue to hold the project together, with a thin wrapper converting this to other protocols where required, and the Crawler was to be implemented as a Python module for simplicity.

Work Begins

Jobs were swiftly allocated to people and, after a short break for provisions, coding began in earnest at 10:15am.

A brief but much-needed pizza break at 1:30pm topped everyone up with meat and cheese (with one vegetarian top-up). With the clock ticking, and much progress still to be made, people quickly got back to work.

Components started coming together around 6pm, with the completion of the crawler and the launch of the JobTracker on EC2. The crawler was quickly hooked up to the Scraper, which joined the JobTracker on EC2 at 7pm, with communication between the two tested. A simple centralised job requester was pulled together (literally at the 11th hour) as a POC that general Job requesters could be integrated into the system.

The final 45 minutes were spent tracking down type errors in JSON-RPC calls (integers sent as strings, etc.). The whole thing came together at 8:15pm with us happily scraping CENSORED with distributed, seemingly legitimate traffic.

Conclusions

The day produced a very sophisticated bot, which it probably makes sense to maintain for future testing of our own detection systems—as well as to test how good any other bot-detection services might be. Details of the bot will be kept under lock and key at Spider Towers.

Some Hack-Day Images

Silicon Milkroundabout, 30.10.2011

Thank you

Many, many thanks to Ian, Pete, Anaïs and their SongKick team for Sunday’s Silicon Milkroundabout. A stupendous event, much enjoyed.

And thank you to everyone who came to banter with us at the spider.io gazebo. In particular, a white hat tip to some of the visiting challenge hackers: @tackers, @PeteSpider, @chrisdarby89, @alol and @spidery_tweet.

We’re looking forward to having many of you visit us at Spider Towers. As mentioned, we’ve put some details up here.

Our one-minute pitch on the day

For those who didn’t get to hear our one-minute pitch, this is what we had to say on the day:

At spider.io, we look to catch bad people doing very bad things.

We catch botnets, browser emulators, clickjackers, traffic launderers, bots that probe for weakness, bots that learn. At spider.io, our business is to distinguish legitimate human website visitors from nefarious automated traffic.

How do we do it?

It’s a hard engineering problem. It would be a hard problem even at toy levels of traffic. We need reverse Turing tests. We need to analyse from the application layer to below the TCP layer. We need clever stateful classifiers that classify each request in the light of previously received information. And if this isn’t hard enough, imagine doing it across four times as many messages each day as Twitter receives tweets. This is where we’ll be before the year is out. And for us this is just the beginning.

If you’d like to work at the very edge of what is technically possible, come say, “Hi,” at the spider.io gazebo.

Arts and crafts in a gazebo… in a brewery

We had a lot of fun preparing for the day. Simon, who has a rather admirable building qualification from Cambridge, showed off his skills and built us some suitably excellent whiteboard stands. Ben, not to be outdone by the Cantabrigian, whipped up a rather fabulous stencil. And how best to show off their creations? Stick them in an all-weather gazebo, of course, in the Truman Brewery.

Extreme Architecting

As Simon covered in our first blog post, our quest is to detect web bots that are up to no good. Of course, this includes bots pretending to be real people, not just those with bot user agents! Otherwise it would be pretty easy, wouldn’t it?

To help our customers, we need to receive and analyse requests from their websites, making a real-time judgement as to the nature and intentions of an “actor” (a person or bot making requests to the website). Then we need to push our real-time classifications into an analytics dashboard/API. This would be tricky enough if we were only analysing low levels of traffic. But, before the year is out, we’re expecting to be analysing over a billion events per day (which would be four times the number of tweets currently sent each day). Due to the volumes of traffic we process, and our customers’ mitigation requirements, we need our systems to be highly available and massively scalable.

Don’t get attached

As a startup, we have very little technological baggage, but we do have lots of systems to build and grow very quickly. As a result, we’re constantly watching the tech landscape for new or improved ‘off-the-shelf’ (or ‘straight-from-github’) software. We have a strong preference for open source wherever possible, so we are free to make changes that we can contribute back to the community. Sometimes this means throwing away a home-grown system in the early stages of its life, to replace it with a recently released community alternative. We aren’t afraid to do this. Just because something looked right 3 months ago doesn’t mean it’s still right now. It’s all too easy to get attached to the current way of doing things, an attachment that must be overcome when a better solution presents itself.

The key to this ‘extreme architecting’ is to keep systems as loosely coupled as possible, so they are freely interchangeable. The common pattern of abstraction used in software (with Java interfaces, for example) can easily be overlooked when building large, tightly coupled systems that require high performance. In a stream-based architecture it’s fairly easy to hang new components off the message-queuing servers and run candidate systems in parallel to compare results and performance, before selecting the best for the job.
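
To make the idea concrete, here is a toy, in-process illustration of running two interchangeable implementations against the same stream and logging where they disagree. The classifiers, thresholds and field names are invented for the sketch; in practice the fan-out happens at the message-queue layer rather than in a single process.

```python
# Toy sketch: feed the same events to two candidate classifiers and compare.
from queue import Queue

def classifier_a(event):
    return "bot" if event.get("requests_per_sec", 0) > 50 else "human"

def classifier_b(event):
    return "bot" if event.get("requests_per_sec", 0) > 80 else "human"

stream = Queue()
for rate in (5, 60, 120):
    stream.put({"requests_per_sec": rate})

while not stream.empty():
    event = stream.get()
    a, b = classifier_a(event), classifier_b(event)
    if a != b:
        print("disagreement on", event, "->", a, "vs", b)
```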

Storm

A recent example of this is Storm, released by Nathan Marz of Twitter. Storm is a distributed stream-processing framework, and something we’d been keeping an eye on for some time. One look at the slides showed that the architecture that led to the creation of Storm was the same architecture we had internally—in fact, we’d recently started our own project to build something similar in light of our experiences. We were well aware that in the coming months our load was going to move from a few hundred requests a second to potentially tens of thousands (and more!). As a result, we needed to ensure our new system was reliable and massively scalable.

When the Storm project was open-sourced, I spent a few hours seeing how our system could make use of it. We had parts of our system converted in a matter of days. This was due partly to the excellent documentation, and partly to the fact that our system was already structured in a similar way, so we did not have to take a great conceptual leap. Rather than spending months writing our own code, we were able to spend weeks adapting our existing systems to Storm’s framework and carrying out exploratory testing—we like to know how our systems will respond under heavy (and crushing) loads. It’s better than being surprised when something goes wrong!

We’ve been running Storm on production loads in our test environment for a couple of weeks now, and are preparing to roll out to our full production environment—one of the first companies to do so outside of Twitter. We’re really excited about the project and hope to contribute back to it. There’s a growing community already making contributions to the ecosystem, and we’re grateful to Nathan for all his hard work.

Storm – Uses

Our new Storm cluster provides the base for several components in our system. Our stateful classifiers consume multiple streams of information about incoming requests and join them together to identify bot and user behaviours. We use the Esper stream-processing engine to perform the analysis, while Storm provides the fault-tolerance and message-distribution layer required to make Esper scale. We also have a set of stateless classifiers that make decisions based on single requests. These are manually scaled (and written in Python) at present, but will be moved under Storm’s management using its ‘multilang’ feature.
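
As a rough sketch of what that move might look like, here is a Python bolt written against the storm.py multilang helper that ships with Storm’s examples. The field layout and the classification rule are invented for illustration; this is not one of our actual classifiers.

```python
# Sketch of a stateless classifier running as a Storm multilang (shell) bolt.
# Assumes storm.py (from Storm's multilang resources) is on the bolt's path.
import storm

class StatelessClassifierBolt(storm.BasicBolt):
    def process(self, tup):
        # tup.values carries the fields declared by the upstream component;
        # for this sketch we assume a single JSON-decoded request dict.
        request = tup.values[0]
        verdict = "suspect" if request.get("user_agent", "") == "" else "ok"
        storm.emit([request.get("ip"), verdict])

StatelessClassifierBolt().run()
```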

Storm is also finding uses within our database architecture, providing first-line aggregation to reduce the frequency of writes to the HBase cluster that runs our analytics dashboard. In time, we plan to take advantage of Storm’s distributed RPC (remote procedure call) features to enhance our customer-facing APIs.
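
The first-line aggregation itself is just a buffer-and-flush pattern: accumulate per-key counters in memory and write them to the store in periodic batches rather than one row at a time. The sketch below is illustrative only; it is not our actual bolt, and write_batch_to_store is a stand-in for a batched HBase put.

```python
# Illustrative buffer-and-flush aggregator: fewer, larger writes downstream.
import time
from collections import Counter

def write_batch_to_store(batch):
    # Stand-in for a batched write to the analytics store (e.g. an HBase put).
    print("writing batch of", len(batch), "aggregated counters")

class FirstLineAggregator:
    def __init__(self, flush_interval=10.0):
        self.counts = Counter()
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def record(self, key):
        self.counts[key] += 1
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.counts:
            write_batch_to_store(dict(self.counts))
            self.counts.clear()
        self.last_flush = time.time()
```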


At the cutting edge

We came to the realisation that most of our systems run “0.x”-version software. With proper testing, and with a focus on ensuring graceful service degradation in the face of component failure, we don’t consider this a problem—rather, a necessary part of using community software at the cutting edge of data processing. If you want to work with big data, using some technologies you’ll have heard of and some you won’t, why not have a look at our careers page?