We're looking at implementing a tagging system for navigating through a large proprietary datastore of oil and gas well data.
The problem we're running into is scaling conjunction queries without blowing out our hardware budget: hundreds of tags per item across tens of millions of items, updated daily.
I hacked up a prototype last night that creates a reverse index of:
tags => item GUIDs => a precomputed Bloom filter of all the tags on that item
Each tag has its own reverse-index, which is stored in a separate file assembled using O_APPEND. The entries in each file are pre-sorted based on a predefined ranking algorithm.
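A minimal Python sketch of that structure, assuming toy 64-bit filters and made-up GUIDs and tags (a real deployment would size the filters for the actual tag cardinality):

```python
import hashlib

BLOOM_BITS = 64  # tiny, for illustration only

def bloom(tags, bits=BLOOM_BITS):
    """Toy Bloom filter: OR together a few hash-derived bit positions per tag."""
    mask = 0
    for tag in tags:
        for seed in (b"a", b"b", b"c"):  # three hash functions
            h = int.from_bytes(hashlib.sha1(seed + tag.encode()).digest()[:4], "big")
            mask |= 1 << (h % bits)
    return mask

# Build the per-tag reverse index: tag -> list of (guid, Bloom filter of ALL
# tags on that item). In the scheme above each list lives in its own file.
items = {
    "guid-1": ["well", "texas", "gas"],
    "guid-2": ["well", "oklahoma"],
}
index = {}
for guid, tags in items.items():
    filt = bloom(tags)  # precomputed once per item
    for tag in tags:
        index.setdefault(tag, []).append((guid, filt))
```

Because each item's filter is the OR of its tags' bits, any subset of an item's tags produces a filter whose bits are contained in the item's filter — which is what makes the conjunction test below work.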
Queries are mapped to cluster nodes using a CRC32 hash modulo across the number of cluster nodes.
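The routing step is essentially a one-liner; a sketch assuming a hypothetical four-node cluster:

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(query_key: str) -> int:
    """Route a query to a node: CRC32 of the key, modulo the node count."""
    return zlib.crc32(query_key.encode()) % NUM_NODES
```

The usual caveat applies: a plain modulo remaps most keys whenever NUM_NODES changes, which is fine for stateless query routing but worth noting.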
Tag conjunction queries are run by computing a Bloom filter for the tags in the query. The tag with the fewest GUIDs is determined (approximately) through a lookup table held in main memory. The query Bloom filter is then compared sequentially, using a bitwise AND, with each of the Bloom filters in the appropriate reverse-index file.
After 10 matches are found, the results are rendered.
No check is done for false positives.
The problem I'm having now is prepending new entries to the start of a reverse-index file. This operation is slow, but since it's out of band with the query stream, it's tolerable for the time being.
You might try Redis's server-side set intersections. This could help, in theory at least, if you have a way to sort the output (which will come back unsorted).
Basically you need a lot of RAM, and you put all the IDs of the products carrying a given tag into a key named something like "tag:<tagid>".
Then, if you want the IDs of all products tagged with both foo and bar, you ask Redis: SINTER tag:10 tag:20
Using SINTERSTORE, it's also possible to put the result into a new key and then use SORT to order it.
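For anyone without a Redis server handy, here is the same logic simulated with plain Python sets (the tag IDs 10/20 match the example above; the product IDs are made up):

```python
# Simulated Redis keyspace: "tag:<tagid>" -> set of product IDs with that tag
store = {
    "tag:10": {"prod-1", "prod-2", "prod-3"},  # products tagged foo
    "tag:20": {"prod-2", "prod-3", "prod-4"},  # products tagged bar
}

def sinter(*keys):
    """What SINTER does: intersect the sets stored at the given keys."""
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def sinterstore(dest, *keys):
    """What SINTERSTORE does: store the intersection at dest, return its size."""
    store[dest] = sinter(*keys)
    return len(store[dest])

print(sorted(sinter("tag:10", "tag:20")))  # → ['prod-2', 'prod-3']
```

The sorting step at the end stands in for a follow-up SORT on the SINTERSTORE destination key.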
I had seen your comment on the original article and was very interested. At the time I checked your site just in case you had written on the subject. I think this would be very useful.
This matches all my intuitions about how I was going to do this; now I have a cookbook (several recipes!).
This is why Hacker News is so valuable. Nothing beats experienced advisors.
I tried this method for my app, but it turned out that I needed a lot of fields in their own columns, since I had to use them in WHERE or ORDER BY clauses.
So be careful while choosing the columns you would like to serialize or json-ify and dump.
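A sketch of that split using sqlite3 (the table and field names are made up): fields hit by WHERE or ORDER BY get real columns, and everything else is dumped to a JSON blob:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Queryable fields get real columns; the long tail of attributes goes in `attrs`.
db.execute("CREATE TABLE wells (id INTEGER PRIMARY KEY,"
           " state TEXT, depth_ft INTEGER, attrs TEXT)")

extras = {"operator": "Acme Oil", "spud_date": "2009-04-01"}  # never filtered on
db.execute("INSERT INTO wells (state, depth_ft, attrs) VALUES (?, ?, ?)",
           ("TX", 9000, json.dumps(extras)))

# WHERE and ORDER BY only ever touch the real columns.
row = db.execute("SELECT attrs FROM wells WHERE state = ? ORDER BY depth_ft",
                 ("TX",)).fetchone()
print(json.loads(row[0])["operator"])  # → Acme Oil
```

The cost of guessing wrong is visible here: if you later need to filter on `operator`, you have to promote it out of the blob and backfill a real column.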
Part of this article talks about the index tables they use. The rationale for having separate index tables is that they can be created or dropped without locking the main tables. It takes a little more manual work to keep them updated, though.
I too use a hybrid approach on my site: columns that are referenced in indexes stay as real columns, and everything else goes into the serialized attribute hash.