> For our geo-search API, we used PostgreSQL for many months, but once our Media entries were sharded, we moved over to Apache Solr. It has a simple JSON interface, so as far as our application is concerned, it’s just another API to consume.
Does anyone have particular insight to share on this? Last I checked, Solr's geospatial search methods were rather inefficient: haversine across all documents, bounding boxes that still rely on haversine, and Solr 4 was looking into geohashes (better, but with serious edge cases where they fall apart).
Meanwhile, PostgreSQL offers R-tree indexing for spatial queries and is blazing fast.
Am I missing some hidden power in Solr's geospatial lookups that makes it faster/better than an R-tree implementation?
It was probably the database sharding. If the Solr setup can handle the geo-search-related data without needing sharding, it can probably beat out a sharded Postgres.
Having this exposed through an API that is standardized and maintained by someone else is also nothing to sneeze at. I'd trade a bit of performance for that kind of standardization and turnkey use in the right scenario.
The reason we use Solr for this specific task is that PostgreSQL cannot efficiently and quickly merge two index queries (time & space). It can do this to a limited degree, but each of these dimensions potentially matches tens of millions of documents, and PG falls over at that scale.
So you make the R-tree three-dimensional (lat, lng, time). PostgreSQL supports this.
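To make the idea concrete, here's a minimal Python sketch (with made-up records, not the poster's actual data) of why treating time as a third dimension turns the combined query into a single box test; a real three-dimensional R-tree prunes whole subtrees by these same boxes instead of scanning:

```python
def in_box(point, box):
    """True if a (lat, lng, ts) point falls inside a 3-D box."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))

# Illustrative records: (latitude, longitude, unix timestamp)
records = [
    (40.7, -74.0, 1_300_000_000),
    (51.5,  -0.1, 1_310_000_000),
    (40.8, -73.9, 1_320_000_000),
]

# "Media near a point during a time window" as one box over all three axes:
box = [(40.5, 41.0), (-74.5, -73.5), (1_295_000_000, 1_305_000_000)]
matches = [r for r in records if in_box(r, box)]
# Only the first record satisfies both the spatial and temporal bounds.
```

The point is that space and time are answered by one index traversal rather than two separate index scans merged afterwards.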
I dunno, I can't envision Solr being more efficient than a properly designed RDBMS for these situations. If you were integrating full-text search I'd absolutely believe that to be the case, but...
We need independent time & geo searches as well. The indexes are vastly smaller in Solr. We use PostgreSQL extensively and prefer it, so it's not a matter of simply wanting to use something different.
That's very interesting. Could you share your story on the pgsql-hackers mailing list? The folks who work on indexing are quite active there.
Also, there's some new thing I don't understand super well, SP-GiST -- do you have any thoughts on that?
I'm no Solr expert, but issue SOLR-2155 has a patch [1] that does geospatial search using geohash prefixes [2].
As far as I can tell, you take the point's latitude and longitude and interleave the binary bits -- so if your record's latitude is 11111111 and your longitude is 10000000, your geohash is 1110101010101010. You index on that. Then, when you do a spatial search for the point nearest 11111110,10000011, you look up key 1110101010101101, and a prefix search finds that the closest value in the index is the record you inserted earlier. Presumably you then realize there could be an even closer record at 11111111,01111111, which would have ended up at 1011111111111111 in the index, so you look there too just in case, take the closer of the two search results, and bob's your mother's brother.
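The interleaving described above is easy to sketch (bit strings kept as text for readability; this mirrors the commenter's construction, not necessarily Solr's actual encoding):

```python
def interleave(lat_bits: str, lng_bits: str) -> str:
    """Interleave latitude and longitude bits, latitude first."""
    return "".join(a + b for a, b in zip(lat_bits, lng_bits))

# The record and query point from the example above:
record_key = interleave("11111111", "10000000")    # "1110101010101010"
query_key = interleave("11111110", "10000011")     # "1110101010101101"
# The wrap-around neighbour lands far away in key order:
neighbor_key = interleave("11111111", "01111111")  # "1011111111111111"
```

Note how a point one bit away in longitude produces a key that shares no useful prefix with the record -- exactly the edge case that forces the second lookup.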
Hrm, so for a proximity search it basically has to take the union of all potentially encompassing geohash prefixes and then do a second pass (over a substantially reduced data set) using a haversine approach or something.
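That second pass can be sketched like this, assuming the prefix lookup has already produced a small candidate set (the coordinates here are made up for illustration):

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/lng points."""
    R = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Candidates surviving the geohash-prefix first pass (illustrative values):
candidates = [(40.71, -74.00), (40.75, -73.99), (40.80, -73.96)]
query = (40.70, -74.01)

# Exact-distance refinement only runs over the reduced set:
nearest = min(candidates, key=lambda p: haversine_km(*query, *p))
```

The expensive trigonometry is confined to the handful of survivors, which is the whole point of the prefix pre-filter.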