The Short Comings of Sphinx GEO Search Posted on September 24th, 2010
When I first got Sphinx installed I was thrilled. I had found a simple easy solution to all of my problems. I started churning out indexes left and right. One day my boss asked me "Hey can we put some geo distance information on here?" ... "No problem!". I had been reading the docs and saw how easy it was to add in geo distance and to sort by geo distance and to limit within a geo distance. Heck it was even easy to limit with in a range of distance(x with in 40 to 50 miles). Needless to say I added the columns added the 3 lines of code and poof there it was. What used to take stupid mathematical queries is poof done in 3 lines of code. At this point I was pounding the Sphinx Kool Aid, and boy did it tasted great.
A couple days later my boss pointed out to me, that we didn't have a 1:1 relationship between our products and their locations, we had products distributed throughout the United States, and that my solution while correct(we had a featured location) only worked on a 1:1 correlation and so could i please implement a 1:N relationship. Now completely drunk on the Kool Aid my response "OF COURSE!". I proudly opened up my fully customized conf file and set about to change the latitude and longitude columns to beMulti-valued attributes . But before I could press the first key I realized that Latitude and Longitude need to be compared as ordered pairs, if I made them mva's they'd be compared as n:n of latitude to longitude comparisons(I didn't try this I know a bad idea when it see it, sometimes).
I then began to do some thinking, and then some reading, and then some more thinking and more reading. I came to the realization that addresses must be stored in their own index, not a problem I quickly create my new index. Knowing that I could run multiple queries on different indexes at the same time, I wrote up a quick sphinx query and executed. My jaw nearly hit the floor, there was a storm trooper telling me "These aren't the results I was looking for". I did some quick reading and find out that it is impossible in a multi query to pass results from the first query to the second query with in sphinx. Down hearted but still convinced this is possible with Sphinx, I decided I can hack it.
And Hack It I Did. I would run two queries, one would limit our products by location, the second would limit actual products by product specific constraints and I would pass in the product ids using php. This looked pretty damn ugly, slightly better than sql solution, but it was working. I had to tweak the memory usage up on the search daemon to accommodate for the larger data set, but all in all it seemed to work fast the code was straight forward and everything was in one area Sphinx. I began to fill up my cup with kool aid. A week had passed, now was the big moment the push live. We pushed live and instantaneously I saw memory issue errors over flow my error log. I said not a problem, the live server has more ram than dev, I'll throw some machine at it. Then I saw the PHP memory errors coming. And that's when the KoolAid went bad. As it was late, the push live had been backed up for over a month now, I patched code to stop the error messages at the expense of quality data and I started looking in svn for the old trusted distance formulas.
In this specific instance I end up using straight SQL, there were a couple joins, but performance was not horrible with our traffic load. In other instances where I needed the full text search, I did the same distance constraints and passing the specific role ids into sphinx. But there was always that pain in my neck, the jabbing in my side, knowing that this code should be cleaner.