We're looking at implementing a tagging system for navigating through a large proprietary datastore of oil and gas well data.
The problem we're running into is scaling conjunction queries without blowing out our hardware budget: hundreds of tags per item across tens of millions of items, updated daily.
I hacked up a prototype last night that creates a reverse index of:
tags => item GUIDs => a precomputed Bloom filter of all the tags on that item
Each tag has its own reverse-index, which is stored in a separate file assembled using O_APPEND. The entries in each file are pre-sorted based on a predefined ranking algorithm.
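A minimal Python sketch of that structure, assuming toy 64-bit filters and made-up GUIDs and tags (a real deployment would size the filters for the actual tag cardinality):

```python
import hashlib

BLOOM_BITS = 64  # tiny, for illustration only

def bloom(tags, bits=BLOOM_BITS):
    """Toy Bloom filter: OR together a few hash-derived bit positions per tag."""
    mask = 0
    for tag in tags:
        for seed in (b"a", b"b", b"c"):  # three hash functions
            h = int.from_bytes(hashlib.sha1(seed + tag.encode()).digest()[:4], "big")
            mask |= 1 << (h % bits)
    return mask

# Build the per-tag reverse index: tag -> list of (guid, Bloom filter of ALL
# tags on that item). In the scheme above each list lives in its own file.
items = {
    "guid-1": ["well", "texas", "gas"],
    "guid-2": ["well", "oklahoma"],
}
index = {}
for guid, tags in items.items():
    filt = bloom(tags)  # precomputed once per item
    for tag in tags:
        index.setdefault(tag, []).append((guid, filt))
```

Because each item's filter is the OR of its tags' bits, any subset of an item's tags produces a filter whose bits are contained in the item's filter — which is what makes the conjunction test below work.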
Queries are mapped to cluster nodes using a CRC32 hash modulo across the number of cluster nodes.
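The routing step is essentially a one-liner; a sketch assuming a hypothetical four-node cluster:

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(query_key: str) -> int:
    """Route a query to a node: CRC32 of the key, modulo the node count."""
    return zlib.crc32(query_key.encode()) % NUM_NODES
```

The usual caveat applies: a plain modulo remaps most keys whenever NUM_NODES changes, which is fine for stateless query routing but worth noting.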
Tag conjunction queries are run by computing a Bloom filter for the tags in the query. The tag with the fewest GUIDs is determined (approximately) through a lookup table held in main memory. The query Bloom filter is then compared sequentially, using a bitwise AND, with each of the Bloom filters in the appropriate reverse-index file.
After 10 matches are found, the results are rendered.
No check is done for false positives.
The problem I'm having now is prepending new entries to the start of a reverse-index file. This operation is slow, but since it's out of band with the query stream, it's tolerable for the time being.
You might try Redis's server-side set intersections. This could help, in theory at least, if you have a way to sort the output (which will come back unsorted).
Basically you need a lot of RAM, and you put all the IDs of the products carrying a given tag into a key named something like "tag:<tagid>".
Then, if you want the IDs of all products tagged with both foo and bar, you ask Redis: SINTER tag:10 tag:20
Using SINTERSTORE, it's also possible to put the result into a new key and then use SORT to order it.
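For anyone without a Redis server handy, here is the same logic simulated with plain Python sets (the tag IDs 10/20 match the example above; the product IDs are made up):

```python
# Simulated Redis keyspace: "tag:<tagid>" -> set of product IDs with that tag
store = {
    "tag:10": {"prod-1", "prod-2", "prod-3"},  # products tagged foo
    "tag:20": {"prod-2", "prod-3", "prod-4"},  # products tagged bar
}

def sinter(*keys):
    """What SINTER does: intersect the sets stored at the given keys."""
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def sinterstore(dest, *keys):
    """What SINTERSTORE does: store the intersection at dest, return its size."""
    store[dest] = sinter(*keys)
    return len(store[dest])

print(sorted(sinter("tag:10", "tag:20")))  # → ['prod-2', 'prod-3']
```

The sorting step at the end stands in for a follow-up SORT on the SINTERSTORE destination key.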
I had seen your comment on the original article and was very interested. At the time I checked your site just in case you had written on the subject. I think this would be very useful.
This matches all my intuitions about how I was going to do this; now I have a cookbook (several recipes!).
This is why Hacker News is so valuable. Nothing beats experienced advisors.
I tried this method for my app, but it turned out that I needed a lot of fields in their own columns, since I had to use them in WHERE or ORDER BY clauses.
So be careful while choosing the columns you would like to serialize or json-ify and dump.
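A sketch of that split using sqlite3 (the table and field names are made up): fields hit by WHERE or ORDER BY get real columns, and everything else is dumped to a JSON blob:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Queryable fields get real columns; the long tail of attributes goes in `attrs`.
db.execute("CREATE TABLE wells (id INTEGER PRIMARY KEY,"
           " state TEXT, depth_ft INTEGER, attrs TEXT)")

extras = {"operator": "Acme Oil", "spud_date": "2009-04-01"}  # never filtered on
db.execute("INSERT INTO wells (state, depth_ft, attrs) VALUES (?, ?, ?)",
           ("TX", 9000, json.dumps(extras)))

# WHERE and ORDER BY only ever touch the real columns.
row = db.execute("SELECT attrs FROM wells WHERE state = ? ORDER BY depth_ft",
                 ("TX",)).fetchone()
print(json.loads(row[0])["operator"])  # → Acme Oil
```

The cost of guessing wrong is visible here: if you later need to filter on `operator`, you have to promote it out of the blob and backfill a real column.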
Part of this article talks about the index tables they use. The rationale for having separate index tables is that they can be created or dropped without locking the main tables. It takes a little more manual work to keep them updated, though.
I too use a hybrid approach on my site: columns that are referenced in indexes stay as real columns, and everything else goes into the serialized attribute hash.