Our First Hack Day—Black Hat
Doc Overell’s email arrived before the birds started chirping: “Some notes for Friday’s Hack day.” Opening the companion document revealed what we had all been waiting for: the brief for the first spider.io hack day, one day to spend scheming and implementing from the perspective of the enemy.
The Brief
In 12 hours from conception to proof-of-concept demo, build a product or service that can generate revenue from Internet traffic.
The Rules
- Everyone must code;
- We will all work on the same project;
- All coding and planning must be done within the 12 hours;
- All genres will be selected on Spotify; no one can skip until the final commit (Spotify Roulette).
The Schedule
- 8am Brainstorm. Come up with as many ideas as possible. Whittle them down. Vote on the final few. The project with the majority vote that best fits the brief and the rules will be chosen.
- 9am Split project into components and start work!
- Lunch time. Have a suitably hack-day-style lunch (pizza, burritos or similar)
- 6:30pm Begin the search for the Ballmer Peak (http://xkcd.com/323/)
- 7pm Start pulling things together and testing, testing, testing
- 8pm Stand back and marvel at the Hack produced before retiring to the pub.
The Day
The day kicked off at 8:20am with Simon in the chair for the brainstorming session, which lasted until 9am.
A clear favourite quickly emerged. Details will be omitted here; suffice it to say, it was a nemesis bot. It was so popular because we would be creating our own worst nightmare: a distributed, legal way of crawling that is probably the hardest kind to detect and block.
The project was then generalised from a distributed bot to a job-distribution framework (not limited to crawling) that any service could plug into. However, the team was hungry for a target to test the product on, so why not kill two birds with one stone? Details on the precise application are also omitted.
The Architecture
With the project agreed, Ashley took up the reins to design the architecture.
The system was split into 6 components:
- The JobTracker API (which maintained job queues, incomplete jobs, etc.)
- The Scraper API (which collected scraped data and submitted new jobs)
- The Crawler (a site-specific information-extraction script which, given a block of HTML, identified the links of interest to crawl next; see the sketch after this list)
- The API wrappers
- The Job requester
- CENSORED
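For flavour, a site-specific Crawler might look something like the sketch below. This is a rough illustration rather than the hack-day code: the function name, the fields extracted and the use of lxml are all assumptions.

```python
# A minimal sketch of a site-specific Crawler module (illustrative only; the
# function name, extracted fields and use of lxml are assumptions, not the
# hack-day code).
from urllib.parse import urljoin

from lxml import html


def extract(page_url, page_html):
    """Given a block of HTML, return the data of interest and the links to crawl next."""
    tree = html.fromstring(page_html)

    # Site-specific extraction: pull out whichever fields the target exposes.
    data = {
        "title": (tree.findtext(".//title") or "").strip(),
    }

    # Identify the links of interest to crawl next. Here every anchor is kept,
    # resolved against the page URL; a real module would filter down to the
    # target-specific URL patterns.
    next_links = [urljoin(page_url, href) for href in tree.xpath("//a/@href")]

    return data, next_links
```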
JSON-RPC was agreed on as the default spit and glue to hold the project together, with a thin wrapper converting this to other protocols where required, and the Crawler was to be implemented as a Python module for simplicity.
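To give a feel for that glue, a call from one component to another might look roughly like this. The endpoint, method name and parameters are illustrative assumptions, not the real (censored) API.

```python
# A rough sketch of the JSON-RPC glue between components. The endpoint,
# method name and parameters are illustrative assumptions, not the real API.
import json
import urllib.request

JOBTRACKER_URL = "http://localhost:8080/rpc"  # placeholder endpoint


def submit_job(url, depth):
    """Ask the JobTracker to queue a new crawl job."""
    payload = {
        "jsonrpc": "2.0",
        "method": "submit_job",
        "params": {"url": url, "depth": depth},  # depth sent as an int, not a string
        "id": 1,
    }
    request = urllib.request.Request(
        JOBTRACKER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["result"]
```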
Work Begins
Jobs were swiftly allocated to people and, after a short break for provisions, coding began in earnest at 10:15am.
A brief but much needed pizza break at 1:30pm topped everyone up with meat and cheese (with one vegetarian top-up). With the clock ticking, and much progress still to be made, people quickly got back to work.
Components started coming together around 6pm, with the completion of the Crawler and the launch of the JobTracker on EC2. The Crawler was quickly hooked up to the Scraper, which joined the JobTracker on EC2 at 7pm, with communication between the two tested. A simple centralised job requester was pulled together (literally at the 11th hour) as a proof of concept that general job requesters could be integrated into the system.
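The general shape of such a job requester is a simple loop: ask the JobTracker for a job, fetch the page, hand the HTML to the Crawler, then return the results and any newly discovered jobs. The helper names below are assumptions; the real APIs are, of course, censored.

```python
# The general shape of a job requester (helper names are assumptions; the real
# APIs are censored).
import time
import urllib.request

import crawler  # the hypothetical site-specific Crawler module sketched above


def run(jobtracker, scraper):
    """Repeatedly fetch a job, do the work and hand back results plus new jobs."""
    while True:
        job = jobtracker.get_job()               # JSON-RPC call to the JobTracker API
        if job is None:
            time.sleep(5)                        # nothing queued; back off briefly
            continue

        with urllib.request.urlopen(job["url"]) as response:
            page_html = response.read()

        data, next_links = crawler.extract(job["url"], page_html)

        scraper.submit_results(job["id"], data)  # collected scraped data...
        for link in next_links:
            scraper.submit_job(link)             # ...and new jobs for the queue
```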
The final 45 minutes were spent tracking down type errors in JSON-RPC calls (integers sent as strings, etc.). The whole thing came together at 8:15pm with us happily scraping CENSORED with distributed, seemingly legitimate traffic.
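A defensive coercion step on the receiving side would have caught most of those early; something along these lines, where the field names are illustrative:

```python
# The class of bug that ate the final 45 minutes: JSON-RPC params arriving as
# strings where integers were expected. Coercing known integer fields on the
# receiving side (field names here are illustrative) flushes them out early.
def coerce_int_params(params, int_fields=("id", "depth")):
    """Coerce known integer fields that may have been serialised as strings."""
    cleaned = dict(params)
    for field in int_fields:
        if field in cleaned:
            cleaned[field] = int(cleaned[field])
    return cleaned
```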
Conclusions
The day produced a very sophisticated bot, which it probably makes sense to maintain for future testing of our own detection systems—as well as to test how good any other bot-detection services might be. Details of the bot will be kept under lock and key at Spider Towers.