Y Combinator Dataset Of Posts
56 points by xirium on April 24, 2008 | hide | past | favorite | 22 comments
If anyone is considering making a YCombinator site search or wants to perform an analysis of historical posts, we've made a dataset available at http://www.xirium.com/ycombinator-news20080424.tar.gz

The dataset is 100MB, so only download it if you need it. This dataset may be removed in the next week or so.



> so only download it if you need it

> This dataset may be removed in the next week or so

The latter cancels the former.


I've set up a mirror, should be quite a bit faster ;)

http://weblava.net/ycombinator-news20080424.tar.gz


Thank you. You may also want to mirror user data ( http://news.ycombinator.com/item?id=173045 ).




Thank you.

You may also want to mirror the update utility ( http://news.ycombinator.com/item?id=173354 ).



If anyone wants to turn that user data into a CSV of username,age,karma, run this Ruby script in the folder with all the HTML files in it:

  Dir['*.html'].each do |user_file|
    user_data = File.read(user_file)
    user_name = user_file.chomp('.html')  # filename minus extension is the username
    age = user_data[/created:<\/td><td>(\d+)/, 1]         # account age in days
    karma = user_data[/karma:<\/td><td>([\d\-]+)/, 1]     # karma may be negative
    puts "#{user_name},#{age},#{karma}"
  end


Once you've done that, you can load it into Excel (or Numbers, or whatever) and play with sorting it, adding formulas, and whatnot. The only semi-interesting one I found was this:

Top users by "karma earned per day of membership":

            Username   Age Karma   K/A
  ------------------------------------
                 dhh     1    48    48
                  pg   563 17544    31
            oldgregg     3    87    29
               nickb   429 11672    27
                donw     3    52    17
                pius   210  2803    13
              edw519   428  5316    12
                 rms   427  5017    11
              drm237   271  2851    10
                 hhm   246  2475    10
             keating    10   103    10
                 sah    42   411     9
               freax     5    45     9
           further08     2    19     9
The ones who got their karma very quickly are the interesting ones to check out.
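The ranking above can also be reproduced programmatically from the username,age,karma CSV. A minimal Ruby sketch (the `users.csv` filename and `rank_by_kpd` helper are assumptions for illustration; integer division reproduces the whole-number K/A column in the table):

```ruby
require 'csv'

# Rank [name, age_days, karma] rows by karma earned per day of membership.
# Integer division matches the whole-number K/A column shown above.
def rank_by_kpd(rows)
  rows.map { |name, age, karma|
    kpd = age.to_i.zero? ? karma.to_i : karma.to_i / age.to_i
    [name, age.to_i, karma.to_i, kpd]
  }.sort_by { |row| -row.last }
end

# Usage with the CSV produced by the script above (filename is an assumption):
if File.exist?('users.csv')
  rank_by_kpd(CSV.read('users.csv')).first(10).each do |name, age, karma, kpd|
    puts format('%20s %5d %7d %5d', name, age, karma, kpd)
  end
end
```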


I'm trying to pull it down to one of our university boxes so we can mirror it. It's going a little slow at the moment though (eta 10 hours).

I'll update with the link as soon as it's done.


I canceled my download, so hopefully part of 2.5KB/s will trickle down to your connection.


Me too :)


I've cancelled mine too. Was getting 1.5 kb/s


Hrm. I can't seem to edit my post.

Anyway, the mirror is:

http://sadiq.uwcs.co.uk/ycombinator-news20080424.tar.gz


I think post editing gets disabled after an hour or two, presumably to create a permanent record.


cool! why not set up a torrent and seed?


Firstly, that would require installing it :). Secondly, this is being served by a very old server which, from empirical testing, wouldn't cope with this type of protocol. Thirdly, I have to keep borrowed bandwidth to a manageable level during business hours, which ideally means making it tend to zero in the long term. I was anticipating a mirror to continue serving this data but it would have been rude to ask.

Anyhow, I was expecting at most 20 concurrent connections before traffic decayed. I wasn't expecting concurrent connections from 56 unique IP addresses. The server is in London and it is transferring 1.7MB/s, mostly to European users. However, latency and routing to US clients seems to drastically reduce throughput to those users.

Most of the HTTP 206 [Partial] requests seem to be from Internet Explorer, despite this being a relatively obscure choice on this forum. I can only suppose that IE is quite inclined to re-establish connections after a connection briefly stalls. The latter would be because the httpd state crept 39MB into virtual memory on this 64MB RAM server. This would also be why TCP window scaling didn't occur.

Anyhow, I knew it was risky to use this server to serve relatively large files but it will be used in the future for posting smaller tidbits.


I used to run 50 bittorrent downloads simultaneously off of my kuro just fine. A wrist watch could handle it.


This is awesome, thanks.

One suggestion: it would be even more useful (for my purposes at least) if you had another version that only included the full posts, rather than having the full posts in addition to having separate files for each comment subthread. The way it is now, there is a lot of data duplication, since a comment of depth n will appear in n separate files.


It would be neat if there were comparable datasets available for other sites. For example, I'd be excited about getting my hands on a dataset describing Twitter's user/following graph.


Can I import the posts into jaanix so it is available for searching / tagging / saving / editing ?


Why not?



