Feb. 7, 2023 was the 15th anniversary of the creation of my username on Hacker News. Except for a three-year hiatus, it's been a site I visit daily to keep up with new projects, learn about other areas of tech, and understand more of our world. Hacker News was launched in 2007 and contains a trove of knowledge, hot takes, and history.

These notes are my personal observations on assembling a dataset for this content. Since I'm not affiliated with YCombinator, Microsoft, or Google, I'm not claiming this is the most efficient approach, but it works for me and my purposes.

We could scrape the actual website, but with close to 39.8 million records (as of March 2024), at 10 records per second we're looking at roughly 46 days to get the full dataset.
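The back-of-the-envelope math behind that estimate:

```python
# Rough scrape-time estimate for pulling every item one request at a time.
items = 39_800_000            # approximate item count as of March 2024
rate = 10                     # requests per second
days = items / rate / 86_400  # 86,400 seconds per day
print(f"{days:.0f} days")     # -> 46 days
```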

Some research shows there is a backup on Google's BigQuery, and a Firebase database that is updated in near real time. We can use the BigQuery data to download a historical chunk and then get more recent data from Firebase.

Schema

column       data type  notes
id           int        numeric id for the submission
type         string     item type; one of story, comment, job, poll, pollopt
score        int        story points; comment points are not publicly available
by           string     username of submitter
title        string     title of submission; null if type is not story
url          string     link for stories; null if type is not story
text         string     comment text; generally null if type is story
time         int        unix epoch time of publishing
parent       int        parent item; null if type is story
descendants  int        number of item descendants

Data Locations

The Archive

You can find a historical archive of Hacker News hosted on Google's BigQuery. Unfortunately, the data stopped being updated at 2022-11-16 09:12:32 UTC. We'll call that our data epoch: everything in that set is fixed and assumed to be good.

good and cheap but not fast
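A sketch of pulling that archived chunk with the BigQuery client library. The public dataset name (`bigquery-public-data.hacker_news.full`), the column names, and the output path are assumptions; verify them in the BigQuery console first, and note that a full-table query like this is billed against your own project.

```python
import json

# The cutoff where the BigQuery archive stops being updated.
DATA_EPOCH = "2022-11-16 09:12:32"

# Assumption: the archive lives in the public dataset
# `bigquery-public-data.hacker_news.full` with these column names.
QUERY = f"""
SELECT id, type, score, `by`, title, url, text, time, parent, descendants
FROM `bigquery-public-data.hacker_news.full`
WHERE timestamp <= TIMESTAMP('{DATA_EPOCH} UTC')
"""

def download_archive(path="hn_archive.jsonl"):
    # Imported here so the sketch reads without the library installed.
    from google.cloud import bigquery  # assumption: google-cloud-bigquery
    client = bigquery.Client()         # needs GCP credentials configured
    with open(path, "w") as out:
        for row in client.query(QUERY).result():
            out.write(json.dumps(dict(row), default=str) + "\n")
```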

Hacker News API (Firebase)

Firebase provides client libraries that can capture changes as near-real-time events. The goal of this project is to build a dataset, not mirror the data in real time, so we'll go with a quick-and-dirty daily download of new data (i.e., items we haven't seen before). We'll also upsert stories and comments that are older than a few days so we capture the "final" story scores and comment edits. "Final" because points can technically change at any time, but they rarely do more than a few days after posting.
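The daily download can be driven by the official HN API's plain JSON endpoints, including `maxitem.json` for the newest item id. A minimal sketch, with no retries, batching, or rate limiting:

```python
import json
import urllib.request

V0 = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id: int) -> str:
    return f"{V0}/item/{item_id}.json"

def fetch_json(url: str):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def new_items_since(last_seen_id: int):
    """Yield every item published after the last id we've already stored."""
    max_id = fetch_json(f"{V0}/maxitem.json")  # id of the newest item
    for item_id in range(last_seen_id + 1, max_id + 1):
        yield fetch_json(item_url(item_id))    # may be None for missing ids
```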

Building our dataset

  • Download the BigQuery data
  • Download item data from Firebase from data epoch to present
  • Merge data into a database
  • Scripts to upsert database with latest data
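The merge and upsert steps above can be sketched against SQLite. The table layout follows the schema section; the choice of which columns to refresh on conflict is an assumption about which fields we want to keep current:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY,
    type TEXT,
    score INTEGER,
    "by" TEXT,
    title TEXT,
    url TEXT,
    text TEXT,
    time INTEGER,
    parent INTEGER,
    descendants INTEGER
)
"""

# Insert new items; on re-download of an existing id, refresh the
# fields that can change after publication (needs SQLite >= 3.24).
UPSERT = """
INSERT INTO items (id, type, score, "by", title, url, text, time, parent, descendants)
VALUES (:id, :type, :score, :by, :title, :url, :text, :time, :parent, :descendants)
ON CONFLICT(id) DO UPDATE SET
    score = excluded.score,
    text = excluded.text,
    descendants = excluded.descendants
"""

def upsert_items(conn, items):
    # Fill in any missing keys so the named parameters always bind.
    keys = ("id", "type", "score", "by", "title", "url",
            "text", "time", "parent", "descendants")
    rows = [{k: item.get(k) for k in keys} for item in items]
    with conn:  # one transaction per batch
        conn.executemany(UPSERT, rows)
```

Re-running the upsert on items that are a few days old is what locks in the "final" scores without duplicating rows.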

Publishing the dataset

I had an idea to publish the files to a git repository on GitHub, but this isn't feasible:

  • The compressed size of all the files is around 5 GB
  • One file per day or per year seems wrong... one file per month seems right
  • Git stores changes to files (how could it not?), so 5 GB is really a lower bound on how big the repo will get
    • We publish changes every day, and each publish gets tracked as a change, growing the .git directory
  • I could use git-lfs, but then we'd still have to host the files somewhere, and other users would have to install the command-line extension to access them
  • Users won't be able to download portions of the dataset... it would be all or nothing.

Git/GitHub is just a bad fit.

Keep it simple

The end design is to publish the data to a file store and then populate an information page to hold metadata about the downloads.

Find the links to the data at the hn-chrono GitHub repository or at patf.com/hn-data.

If this ever gets some attention, we can switch this whole process over to a torrent.