Y Combinator Dataset Of Posts
56 points by xirium on April 24, 2008 | hide | past | favorite | 22 comments
If anyone is considering making a YCombinator site search or wants to perform an analysis of historical posts, we've made a dataset available at http://www.xirium.com/ycombinator-news20080424.tar.gz

The dataset is 100MB, so only download it if you need it. This dataset may be removed in the next week or so.



> so only download it if you need it

> This dataset may be removed in the next week or so

The latter cancels the former.


I've set up a mirror, should be quite a bit faster ;)

http://weblava.net/ycombinator-news20080424.tar.gz


Thank you. You may also want to mirror user data ( http://news.ycombinator.com/item?id=173045 ).




Thank you.

You may also want to mirror the update utility ( http://news.ycombinator.com/item?id=173354 ).



If anyone wants to turn that user data into a CSV of username,age,karma, run this Ruby script in the folder with all the HTML files in it:

  Dir['*.html'].each do |user_file|
    user_data = File.read(user_file)
    user_name = user_file.chomp('.html')  # filename minus extension is the username
    age = user_data[/created:<\/td><td>(\d+)/, 1]         # account age in days
    karma = user_data[/karma:<\/td><td>([\d\-]+)/, 1]     # karma may be negative
    puts "#{user_name},#{age},#{karma}"
  end


Once you've done that, you can load it into Excel (or Numbers, or whatever) and play with sorting it, adding formulas, and whatnot. The only semi-interesting one I found was this:

Top users by "karma earned per day of membership":

            Username   Age Karma   K/A
  ------------------------------------
                 dhh     1    48    48
                  pg   563 17544    31
            oldgregg     3    87    29
               nickb   429 11672    27
                donw     3    52    17
                pius   210  2803    13
              edw519   428  5316    12
                 rms   427  5017    11
              drm237   271  2851    10
                 hhm   246  2475    10
             keating    10   103    10
                 sah    42   411     9
               freax     5    45     9
           further08     2    19     9
The ones who got their karma very quickly are the interesting ones to check out.
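The ranking above can also be reproduced programmatically from the username,age,karma CSV. A minimal Ruby sketch (the `users.csv` filename and `rank_by_kpd` helper are assumptions for illustration; integer division reproduces the whole-number K/A column in the table):

```ruby
require 'csv'

# Rank [name, age_days, karma] rows by karma earned per day of membership.
# Integer division matches the whole-number K/A column shown above.
def rank_by_kpd(rows)
  rows.map { |name, age, karma|
    kpd = age.to_i.zero? ? karma.to_i : karma.to_i / age.to_i
    [name, age.to_i, karma.to_i, kpd]
  }.sort_by { |row| -row.last }
end

# Usage with the CSV produced by the script above (filename is an assumption):
if File.exist?('users.csv')
  rank_by_kpd(CSV.read('users.csv')).first(10).each do |name, age, karma, kpd|
    puts format('%20s %5d %7d %5d', name, age, karma, kpd)
  end
end
```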


I'm trying to pull it down to one of our university boxes so we can mirror it. It's going a little slow at the moment though (eta 10 hours).

I'll update with the link as soon as it's done.


I canceled my download, so hopefully part of 2.5KB/s will trickle down to your connection.


Me too :)


I've cancelled mine too. Was getting 1.5 kb/s


Hrm. I can't seem to edit my post.

Anyway, the mirror is:

http://sadiq.uwcs.co.uk/ycombinator-news20080424.tar.gz


I think post editing gets disabled after an hour or two, presumably to create a permanent record.


cool! why not set up a torrent and seed?


Firstly, that would require installing it :). Secondly, this is being served by a very old server which, from empirical testing, wouldn't cope with this type of protocol. Thirdly, I have to keep borrowed bandwidth to a manageable level during business hours, which ideally means making it tend to zero in the long term. I was anticipating a mirror to continue serving this data but it would have been rude to ask.

Anyhow, I was expecting at most 20 concurrent connections before traffic decayed. I wasn't expecting concurrent connections from 56 unique IP addresses. The server is in London and it is transferring 1.7MB/s, mostly to European users. However, latency and routing to US clients seems to drastically reduce throughput to those users.

Most of the HTTP 206 [Partial] requests seem to be from Internet Explorer, despite this being a relatively obscure choice on this forum. I can only suppose that IE is quite inclined to re-establish connections after a connection briefly stalls. The latter would be because the httpd state crept 39MB into virtual memory on this 64MB RAM server. This would also be why TCP window scaling didn't occur.

Anyhow, I knew it was risky to use this server to serve relatively large files but it will be used in the future for posting smaller tidbits.


I used to run 50 bittorrent downloads simultaneously off of my kuro just fine. A wrist watch could handle it.


This is awesome, thanks.

One suggestion: it would be even more useful (for my purposes at least) if you had another version that only included the full posts, rather than having the full posts in addition to having separate files for each comment subthread. The way it is now, there is a lot of data duplication, since a comment of depth n will appear in n separate files.


It would be neat if there were comparable datasets available for other sites. For example, I'd be excited about getting my hands on a dataset describing Twitter's user/following graph.


Can I import the posts into jaanix so it is available for searching / tagging / saving / editing ?


Why not?



