Sphinx Search Sorting/Ranking - php

I recently discovered Sphinx Search, which I want to use for my PHP application. I have a table of geolocations where every record stores a country code. For every user who uses the search function to look up geopositions, I know which country they are from.
How would I reweight the results so that the matches are ordered by ascending distance from the user's country? I have already calculated a distance matrix between every pair of countries, which I can access via SQL. The country information in the geolocation database is stored as a 2-letter ISO country code.
What is a good solution for this problem? I have heard about UDFs; are they applicable here? Or is it possible to solve this more easily by reformatting my table?
Thank you very much.

The "easiest" way to solve this is to have coordinates for each country. You then store the coordinates for each record in the Sphinx index, and when searching you look up the coordinates of the user's country and use them in the query. This way Sphinx calculates the distance dynamically.
Did you have coordinates like this when you created the matrix? This also presupposes that you are using just one 'point' per country. If your matrix is more advanced, e.g. taking the closest points on the borders of each pair (to make distances between oddly shaped countries better), then it won't work so well.
In theory you could perhaps do this with payloads, by using the country name as keywords and the distance as a payload (arranged so that close distances get a high weight), but it will probably be expensive to index and might not work all that well in practice.
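A minimal sketch of the coordinate approach, written in Python for brevity (the resulting query string could equally be sent from PHP over Sphinx's MySQL-protocol interface). It assumes a hypothetical index named geolocations with float attributes lat_rad and lon_rad stored in radians, plus a hypothetical lookup table of rough country centroids. GEODIST() is Sphinx's built-in great-circle distance function; by default it takes radians and returns meters.

```python
import math

# Hypothetical, approximate country centroids in degrees (latitude, longitude).
COUNTRY_CENTROIDS = {
    "DE": (51.0, 9.0),
    "FR": (46.0, 2.0),
}


def build_geodist_query(country_code, index="geolocations"):
    """Return a SphinxQL query that sorts full-text matches by distance
    from the centroid of the user's country, nearest first."""
    lat_deg, lon_deg = COUNTRY_CENTROIDS[country_code]
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    # %s is a placeholder for the search text, to be bound and escaped
    # by whatever client library sends the query.
    return (
        f"SELECT *, GEODIST(lat_rad, lon_rad, {lat:.6f}, {lon:.6f}) AS dist "
        f"FROM {index} WHERE MATCH(%s) ORDER BY dist ASC"
    )


print(build_geodist_query("DE"))
```

This only reproduces the one-point-per-country model; a border-to-border distance matrix cannot be expressed this way.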

Related

Generating statistics using PHP and SQL

I run a SQL database with some user information.
On the main page, I would like to show some statistics about the database, and what I thought would be easy turned out to be complicated for me (I'm a newbie).
To give a practical example of what I'm trying to achieve, I will use a real situation:
In my CLIENTS table, clients come from different countries (represented by a country code). One of the statistics I'm trying to show is WHICH COUNTRY HAS THE MOST CLIENTS.
Is there a simple way to find this kind of information? I understand I can simply count how many occurrences of a certain country I have in the table, but I would need to compare every country to check which one hosts the most clients.
I guess that sums up my question.
EDIT: I came up with a solution using PHP, but I'm not sure it's the best. I loop over each country, count its clients, and compare the count to the highest one so far. If the count is higher, I update the $higher_country var; if not, I just move on to the next country. Would that be my only option?
You can do something like...
SELECT country_id, COUNT(country_id) AS nmbr
FROM clients
GROUP BY country_id
ORDER BY nmbr DESC
LIMIT 1
This counts the clients for each country_id, orders the counts in descending order (so the highest comes first), and picks just the first record.
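To try the query without touching a real database, here is a small self-contained sketch using Python's built-in sqlite3 module. The table layout and the sample rows are made up for illustration; only the query itself comes from the answer above.

```python
import sqlite3


def top_country(conn):
    """Return (country_id, client_count) for the country with the most clients."""
    return conn.execute(
        "SELECT country_id, COUNT(country_id) AS nmbr "
        "FROM clients "
        "GROUP BY country_id "
        "ORDER BY nmbr DESC "
        "LIMIT 1"
    ).fetchone()


# In-memory database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, country_id TEXT)")
conn.executemany(
    "INSERT INTO clients (country_id) VALUES (?)",
    [("BR",), ("DE",), ("BR",), ("US",), ("BR",)],
)

print(top_country(conn))  # ('BR', 3)
```

If two countries tie for first place, LIMIT 1 returns only one of them, and which one is not guaranteed.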

Include distance in search form

I would like to implement a search by distance on a website.
A user living in a given city should be able to find all users living within, for example, 100 or 200 km.
I have a table in my database that stores all the cities and their coordinates.
I thought of creating another table that would store the distance between every pair of cities, but my database contains 36,000 cities, which would make a lot of records...
How could I make this search simpler, knowing that my project will be developed with Symfony and Doctrine?
Thank you in advance.
You can use the accepted answer to the following question to calculate the distance between two coordinates:
Measuring the distance between two coordinates in PHP
For performance reasons, you need a geospatial index to query such a database efficiently. MongoDB, for example, has a feature for this.
If performance is not an issue, you can simply store the locations in a relational database table and calculate the distances in SQL. See this question for some information about this solution: Geo-Search (Distance) in PHP/MySQL (Performance)
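For reference, a minimal sketch of the distance calculation itself, using the haversine formula in Python. The function names and the sample cities are illustrative assumptions, and the Earth is treated as a sphere with a mean radius of 6371 km, so results are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cities_within(origin, cities, radius_km):
    """Return the names of cities within radius_km of origin, nearest first.

    origin is a (lat, lon) pair; cities is an iterable of (name, lat, lon).
    """
    hits = []
    for name, lat, lon in cities:
        distance = haversine_km(origin[0], origin[1], lat, lon)
        if distance <= radius_km:
            hits.append((distance, name))
    return [name for _, name in sorted(hits)]


paris = (48.8566, 2.3522)
cities = [("London", 51.5074, -0.1278), ("Madrid", 40.4168, -3.7038)]
print(cities_within(paris, cities, 400))  # ['London']
```

Scanning every city like this is fine for a single lookup over 36,000 rows, but it is exactly the work a geospatial index avoids.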

Getting city boundaries from openstreetmap

I'm developing a website and I need to get the boundaries of an area chosen by the user.
For example, the user wants to know the boundaries of a city named x. How can I get them from OpenStreetMap? I've heard of XAPI and Osmosis but couldn't find any examples anywhere.
Thanks!
I took a stab at doing this with JavaScript here:
https://github.com/pgkelley4/city-boundaries-google-maps
Basically it comes down to finding the relation that OpenStreetMap stores the city boundaries as.
I used something like the following query to get the area:
area[name="Seattle"]["is_in:state_code"="WA"];foreach(out;);
Or if that doesn't find anything, going through the node to find any associated areas:
node[name="New York"][is_in~"NY"];foreach(out;is_in;out;);
To get the relation ID, subtract 3600000000 from the area ID returned by the above queries. Then get the relation from its ID (here relationID is the variable holding that ID, concatenated into the query string):
(relation(" + relationID + ");>;);out;
You can test out queries here, mine could probably be improved on:
http://overpass-api.de/query_form.html
That is how to get the city boundaries; processing them is another matter, as nothing is in order within the relation. For that, see my GitHub project and:
http://wiki.openstreetmap.org/wiki/Relation:multipolygon/Algorithm
I would also note that the OpenStreetMap data for city boundaries is spotty. From what I can tell, it is missing for big cities like Dallas and LA.
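The area-to-relation step described above can be sketched as a pair of small helpers. This is Python rather than the JavaScript used in the GitHub project, the function names are made up for illustration, and the area ID in the example is a placeholder, not a real city.

```python
# Offset the answer above says to subtract from an area ID to get the relation ID.
AREA_ID_OFFSET = 3600000000


def relation_id_from_area(area_id):
    """Convert an Overpass area ID (derived from a relation) to the relation ID."""
    if area_id < AREA_ID_OFFSET:
        raise ValueError("area ID is too small to be derived from a relation")
    return area_id - AREA_ID_OFFSET


def boundary_query(area_id):
    """Build the Overpass query that fetches the relation and its members."""
    relation_id = relation_id_from_area(area_id)
    return f"(relation({relation_id});>;);out;"


print(boundary_query(3600123456))  # (relation(123456);>;);out;
```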

Levenshtein search

I work on a site which sells let's say stuff and offers a "vendors search". On this search you enter your city, or postal code, or region and a distance (in km or miles) then the site gives you a list of vendors.
To do that, I have a database with the vendors. In the form to save these vendors, you enter their full address and when you click on the save button, a request to google maps is made in order to get their latitude and longitude.
When someone does a search, I look on a table where I store all the search terms and their lat/lng.
This table looks like
+------+-----+-----+
| term | lat | lng |
+------+-----+-----+
So the first query is something very simple
select lat, lng from my_search_table where term = "the term"
If I find a result, I then search with a nice method for all the vendors in the range the visitor wants and print the result on a map.
If I don't find a result, I search with a Levenshtein function, because people writing bruxelle or bruxeles instead of bruxelles is really common and I don't want to make a request to Google Maps every time (I also have a "how many times searched" column in my table to get some stats).
So I query my_search_table with no WHERE clause and loop through all results to find the smallest Levenshtein distance. If the smallest distance is greater than 2, I request the coordinates from Google Maps.
Here is my problem. For some countries (we have several sites all around the world), my_search_table has 15-20k+ entries, and PHP doesn't (really) like looping over that much data (which I perfectly understand), so my request hits the PHP timeout. I could increase the timeout, but the problem will be the same in a few months.
So I tried a Levenshtein MySQL function (found on Stack Overflow, by the way), but it's also very slow.
So my question is: is there any way to make this search fast even on very large datasets?
My suggestion is based on three things:
First, your data set is big. That is, big enough to rule out the idea of "select all" + "run levenshtein() in the PHP application".
Second, you have control over your database, so you can adjust some architecture-related things.
Finally, the performance of SELECT queries is the most important thing, while the performance of adding new data doesn't matter.
The thing is, you cannot perform a fast Levenshtein search, because calculating the Levenshtein distance is itself slow. Thus, you won't be able to resolve the issue with a "smart search" alone. You'll have to prepare some data.
A possible solution: create a group index and assign it when adding or updating data. That is, you store an additional column holding some hash (numeric, for example). When adding new data, you:
Perform a search by Levenshtein distance (using either your application or the MySQL function you already mentioned) over all records in your table against the inserted data.
Set the group index of the new row to the value that the rows found in the previous step have.
If nothing is found, assign a new group index value (it's the first row and there are no similar rows yet), different from any group index value already present in the table.
To find the desired rows, you then just select the rows with the same group index value, so your SELECT queries will be very fast. But yes, this causes a huge overhead when adding or changing data, so it isn't applicable when insert/update performance matters.
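The insert-time steps above can be sketched as follows. This is an illustrative Python version that keeps the terms in a plain dict instead of a database table; the function names and the threshold of 2 (taken from the question) are assumptions, not part of the answer.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def assign_group(term, groups, max_distance=2):
    """Store term in groups (term -> group index) and return its group index.

    The term joins the group of the first stored term within max_distance;
    otherwise it starts a new group with an unused index.
    """
    group_id = next(
        (gid for known, gid in groups.items()
         if levenshtein(term, known) <= max_distance),
        None,
    )
    if group_id is None:
        group_id = max(groups.values(), default=0) + 1
    groups[term] = group_id
    return group_id


groups = {}
for term in ["bruxelles", "bruxelle", "bruxeles", "paris"]:
    assign_group(term, groups)
print(groups)  # {'bruxelles': 1, 'bruxelle': 1, 'bruxeles': 1, 'paris': 2}
```

The slow scan runs once per insert; a later lookup for an already stored term is then a plain equality match on the group index column.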
You could try MySQL's SOUNDS LIKE operator:
SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term"
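You can use a k-d tree or a ternary search tree to speed up the search. The idea is to narrow down the candidates with a tree lookup, much like a binary search, instead of comparing against every row.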

Determining location from Lat/Lng to match second table

I've done some reading so far and I am at a crossroads.
My situation is this:
Table with a list of lat/lng values ( we can take these to be "cities" ) with a radius
Table with a list of movement values, including a lat/lng
My requirement is to return the list of movements and include the nearest city (if it's within the radius). So far I've used the haversine formula in PHP to do this for each record I return, but it's not particularly efficient.
The two options I've found are:
1/ Create a stored procedure in MySQL to do the haversine calculation on that side (something like this: Proximity Search)
2/ Use a "bounding box" method of positioning the cities instead of a circle. This is not a big problem and would allow the SQL to be simplified. However, in some cases the typical logic of checking whether the point lies between the top-left and bottom-right corners will not work if the lat and lng are negative. In PHP, to work around this, I would do a quick "if" to check whether the top-left lat/lng is greater than the bottom-right and use AND/OR depending on the result.
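For option 2, the special case for negative coordinates can be avoided by normalizing the corners first. A minimal sketch in Python (the function name is illustrative; it assumes the box does not cross the 180° meridian, which would need separate handling):

```python
def in_bounding_box(lat, lng, corner_a, corner_b):
    """Return True if (lat, lng) lies inside the box spanned by two opposite corners.

    Corners are (lat, lng) pairs in any order. Sorting each axis puts the
    smaller value first, so negative coordinates need no special case.
    """
    lat_lo, lat_hi = sorted((corner_a[0], corner_b[0]))
    lng_lo, lng_hi = sorted((corner_a[1], corner_b[1]))
    return lat_lo <= lat <= lat_hi and lng_lo <= lng <= lng_hi


# A box in the southern hemisphere, given as "top-left" and "bottom-right".
top_left = (-33.0, 18.0)
bottom_right = (-35.0, 19.0)
print(in_bounding_box(-33.9, 18.4, top_left, bottom_right))  # True
print(in_bounding_box(-36.0, 18.4, top_left, bottom_right))  # False
```

The same normalization works in SQL by storing the box as min/max columns and using BETWEEN on each axis.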
After some extra reading I found this question here on SO.
https://stackoverflow.com/a/5548877/1749630
This answer was exactly what I needed to point me in the right direction. In this way I am now using a sub-select in order to find the city.
If someone posts a better answer than this one I'll mark that instead of this one :)
