Crawling HackerNews

The new HN Search also provides a great REST API. You should go check it out!

It's been a while now since I started developing HN2JSON, and I'm actually still quite happy with that achievement. If you haven't heard of HN2JSON yet, it's basically a simple Ruby interface to HackerNews.

But my dream has always been to collect a complete copy of the HackerNews database! That wasn't possible with HN2JSON: it's too slow, and I didn't want to hammer HN with millions of requests. What I was left with was the HNSearch API.

This was going to be a hard battle to win. And it was, I can tell you. I won't trouble you with the details, but I coded and pondered for a whole week until I finally figured it out.

The battle begins

To understand why I thought this was such a hard problem to solve, you must know about the most complained-about "feature" of the HN API: it restricts your searches, and you can't retrieve HN posts by their natural id. Instead, the db holding the data assigns a signed_id to every entry. You can get a post by its signed_id, but signed_ids are near-impossible to guess.

Confronted with this problem, I thought about writing a distributed system (one that would guess entries based on pre-crawled data) and about simply brute-force guessing all possible signed_ids. Looking back, these approaches are ludicrous compared to the simple solution I found in the end.

To understand this simple solution, you must know that when calling the API URL, you can pass in filters based on various available parameters. Let me show you an example:

http://api.thriftdb.com/api.hnsearch.com/items/_search? \
  filter[fields][username]=pg
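The API answers with JSON, so issuing such a query from Ruby takes just a few lines. Here's a minimal sketch; note that the response field names (results and item) are my assumptions about the JSON the API returns:

require 'net/http'
require 'json'
require 'uri'

# Fetch items submitted by a given user. The response layout
# (a "results" array wrapping "item" hashes) is an assumption
# about the HNSearch/ThriftDB JSON format.
uri = URI('http://api.thriftdb.com/api.hnsearch.com/items/_search')
uri.query = URI.encode_www_form('filter[fields][username]' => 'pg')

JSON.parse(Net::HTTP.get(uri))['results'].each do |result|
  item = result['item']
  puts "#{item['id']}: #{item['title'] || item['text']}"
end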

As you can see, such a call gives you access to 10 posts and comments on HN where the submitter's username is "pg". Of course 10 entries aren't very helpful, apart from the fact that you can up that limit to 100. But what I realized is that there's also a filter on the submission date, and it takes a date range:

http://api.thriftdb.com/api.hnsearch.com/items/_search? \
  filter[fields][create_ts]=[2013-01-01T00:00:00Z + TO + *]

This URL will return you 10 posts and comments from the start of 2013 up to any time. That might not look helpful at first, but when we combine it with a sorting filter, we're able to collect all items in the database.
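By the way, the + signs in that filter are simply URL-encoded spaces; the raw filter value is [2013-01-01T00:00:00Z TO *]. In Ruby, the encoding is a one-liner (a sketch, using the same stdlib helper as above):

require 'uri'

# encode_www_form turns the spaces in the range filter into "+"
start_date = '2013-01-01T00:00:00Z'
puts URI.encode_www_form('filter[fields][create_ts]' => "[#{start_date} TO *]")
# => filter%5Bfields%5D%5Bcreate_ts%5D=%5B2013-01-01T00%3A00%3A00Z+TO+*%5D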

Now let me explain that last bit again.

Let's say we sort all entries by date in ascending order. When we call the corresponding URL, we're presented with the first 10 (or 100) entries ever posted to HN. Now let's take the last entry we get from the API and set a date-range filter going from that entry's date to any time. This gives us the next 10 (or 100) entries from HN. If you keep updating the URL like this, you're able to collect the complete HN database. An example URL for that would be something like:

http://api.thriftdb.com/api.hnsearch.com/items/_search? \
  sortby=create_ts+asc \
  &limit=100 \
  &filter[fields][create_ts]=[<PUT LAST ENTRY'S DATE HERE> + TO + *]
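And that's really all there is to it. Here's a minimal Ruby sketch of the full loop. As before, the field names (results, item, create_ts, id) are my assumptions about the API's JSON, and the date range is inclusive, so each page can start with items we already saw at the end of the previous one:

require 'net/http'
require 'json'
require 'uri'

BASE = URI('http://api.thriftdb.com/api.hnsearch.com/items/_search')

# Walk the whole database, oldest first, 100 items per request.
# The field names ("results", "item", "create_ts", "id") are
# assumptions about the JSON the API returns.
def each_item
  last_date = '2006-01-01T00:00:00Z' # safely before HN existed
  prev_ids  = []

  loop do
    uri = BASE.dup
    uri.query = URI.encode_www_form(
      'sortby' => 'create_ts asc',
      'limit'  => 100,
      'filter[fields][create_ts]' => "[#{last_date} TO *]"
    )
    items = JSON.parse(Net::HTTP.get(uri))['results'].map { |r| r['item'] }

    # The range filter is inclusive, so a page can start with items
    # we already saw at the end of the previous page; skip those.
    items.reject! { |i| prev_ids.include?(i['id']) }
    break if items.empty? # nothing new left: we're done

    items.each { |item| yield item }
    last_date = items.last['create_ts']
    prev_ids  = items.map { |i| i['id'] }
    sleep 1 # be polite, don't hammer the API
  end
end

each_item { |item| puts "#{item['create_ts']} #{item['id']}" }

Remembering only the previous page's ids is enough to dedupe here, because repeats can only come from items sharing the boundary timestamp.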

Conclusion

So there you are, simple after all. Coding this mechanism up and making it fault-tolerant didn't even take me two hours. I plan to collect the whole db in the very near future and then open-source it, so that you and everyone else can benefit from it.