One of the sites I work on is a social networking site of sorts, and the content would be greatly enhanced by using some sort of location service to recommend "friends" based on proximity. The site focuses on the US, but with potential users worldwide.
I've considered creating an associative array or relational database with countries, states/provinces/territories, counties, and cities to provide a rough way to drill down to their relative proximity, but this can be extremely unwieldy and complicated very quickly.
I've also considered IP geolocation, but the results tend to be unreliable (some services place my company's IP some 600 miles to the northeast), and I would at least need some sort of fallback to look up, for instance, a ZIP/postal code.
Can you tell me a clearly defined way to do this sort of lookup effectively and locally, without the use of third-party APIs, preferably with at least some reference to where to gather the basic information in the first place? I'm currently running PHP 5.3.2 and MySQL 5.1.44, if it makes any difference.
Thank you!
EDIT:
Added a bounty to try to get better ideas, or other ways of handling the problem, perhaps more efficiently. As it is, the load time due to the huge database size is insane. I figure I definitely need to improve my caching, but I'm trying to see if there's anything I should be doing with regard to improving my location system.
This might be a bit obvious... but the only way that you can know the location of a user, with the best degree of accuracy is to actually:
Ask the User where they are!
Once you have asked the user where they are, you can then use third party applications to figure out distances.
If you don't want to use any third party application as your question mentioned, then you could download and integrate one of the Geo databases into your own service.
The source which I use is Yahoo Geo Planet.
You can download the entire GeoPlanet Data file which comes in TSV format. When I downloaded it I just imported it to mysql using mysqlimport.
http://developer.yahoo.com/geo/geoplanet/data/
It contains a record for every distinct geographic location in the world: a tonne of postal codes, districts, regions, countries, practically everything you would ever need.
In addition to that, it contains neighbours, so you can query for geographic regions that are close to one another.
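A sketch of what an adjacency lookup against the imported data might look like. The table and column names here follow my reading of the GeoPlanet readme (the adjacencies file has a pair of WOE_ID columns); adjust them to whatever your import actually produced, and the WOE_ID shown is just an example:

```sql
-- Sketch: list the places adjacent to a given WOE_ID, assuming the
-- adjacency table kept the two WOE_ID columns from the readme.
SELECT p.Name, p.PlaceType
FROM geoplanet_adjacencies a
JOIN geoplanet_places p ON p.WOE_ID = a.Neighbour_WOE_ID
WHERE a.Place_WOE_ID = 2487956;  -- example WOE_ID
```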
Unfortunately, simply asking where they are isn't quite good enough, and while GeoPlanet is a good option, and I have decided to use it, I didn't feel it was a complete answer. Yes, it works, but -how-. Aliases don't cover misspellings, and while most outsiders call San Francisco things like "San Fran" or "Frisco", locals use "The City", so aliases don't always work. I needed some level of exactitude.
Well, after some work, here's the approach I've used, which is a bit intensive, and may not be an option for everybody, but works for me:
First thing, grab a copy of the GeoPlanet db in TSV format from http://developer.yahoo.com/geo/geoplanet/data/ (105 MB Zipped)
To import this into my MySQL db, I created the tables with columns named according to the Readme file located in the zip. Geoplanet_places was the only one given a primary key associated to the WOE_ID. This and geoplanet_adjacencies are really the only tables I need at this moment. For me, importation was done locally to my DB using:
mysqlimport --socket=/PATH/TO/SOCKET/mysql.sock --user=EXAMPLE --password=EXAMPLE DATABASE_NAME /PATH/TO/DOWNLOADED/GEOPLANET/DATA/geoplanet_places.tsv
I stripped the version number from the .tsv, and used the filename as the table name. Your experience may be significantly different, but I'm adding it for clarity. Import all the files you want.
I decided to have two options for people entering their profile data: You always have to select your country (from an option list, using ISO 3166 Alpha-2 Codes as the value), but we can then use either the postal (ZIP/PIN) code to look up where they are; or, for countries like Ireland lacking a national postal code system, they can enter their city and province name.
To search using country and postal code, I can do something like this:
SELECT Parent_ID FROM geoplanet_places WHERE ISO = "$ctry" AND Name="$zip" AND PlaceType="ZIP";
I count the results. If there are zero, I have no result; the place is not known, and I assume a problem (an error is logged accordingly to confirm it is not a fluke). If there is more than one, the results are enumerated and a second screen pops up asking the user to confirm which location they reside in. Ideally this should never happen with the postal-code system, but it may occur when asking based on location name. If there is exactly one, I store the Parent_ID to their profile as I continue to query back, passing the Parent_ID in as a comparator to the WOE_ID, like so:
SELECT Name, WOE_ID, Parent_ID FROM geoplanet_places WHERE WOE_ID="$pid";
Here $pid is the previous Parent_ID. I'll use this later on when rendering the page to determine location, and Town/City is a low enough level to apply proximity checks against the adjacencies table. Trying to JOIN the results was significantly slower than issuing multiple queries when I tested it in MySQL Workbench. I continue the queries until Parent_ID = "1", meaning that its parent is the world (i.e. it is a country).
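The parent-walking loop can be sketched in PHP like this. To keep the sketch self-contained I use an in-memory array standing in for the geoplanet_places table (in production you would run the SELECT per step); the array layout, WOE_IDs, and helper name are mine, not GeoPlanet's:

```php
<?php
// Walk WOE_ID -> Parent_ID until we reach the world (Parent_ID == 1),
// collecting names along the way. $places stands in for the
// geoplanet_places table: WOE_ID => array('Name' => ..., 'Parent_ID' => ...).
function placeChain(array $places, $woeid) {
    $chain = array();
    while (isset($places[$woeid])) {
        $chain[] = $places[$woeid]['Name'];
        if ($places[$woeid]['Parent_ID'] == 1) {
            break; // parent is the world: this is a country
        }
        $woeid = $places[$woeid]['Parent_ID'];
    }
    return implode(', ', $chain);
}

// Tiny fake hierarchy with made-up WOE_IDs for illustration
$places = array(
    10 => array('Name' => 'Irvine', 'Parent_ID' => 20),
    20 => array('Name' => 'Orange', 'Parent_ID' => 30),
    30 => array('Name' => 'CA',     'Parent_ID' => 40),
    40 => array('Name' => 'USA',    'Parent_ID' => 1),
);

echo placeChain($places, 10); // Irvine, Orange, CA, USA
```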
I decided that when searching by text entry for city, state/province, and country, I'll have to guard against inaccurate entry by running the input through a Metaphone processor to determine the likely selection if it can't be found the first time. Unfortunately, some people either can't spell, or the site's primary language is not their primary language.
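PHP ships a built-in metaphone() function, so the fuzzy fallback can be as simple as comparing phonetic keys. The helper name and city list below are illustrative, not part of GeoPlanet:

```php
<?php
// Fuzzy city matching: if an exact lookup fails, compare metaphone keys.
function fuzzyCityMatch($input, array $knownCities) {
    $key = metaphone($input);
    foreach ($knownCities as $city) {
        if (metaphone($city) === $key) {
            return $city; // likely what the user meant
        }
    }
    return null; // still unknown: ask the user to confirm
}

$cities = array('Phoenix', 'Irvine', 'Neustadt');
echo fuzzyCityMatch('Feenix', $cities); // Phoenix
```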
To display location, I start with the WOE_ID stored in their profile, get the name, then look up its parent. I comma-separate the names to get a result like Irvine, Orange, CA, USA. I can look up based on any one of these names to determine other members in proximity using the adjacencies and places tables.
Again, this probably isn't the best way to go about it, and using Geolocation can change if, for instance, you're on a trip using the hotel wifi; however, this method seems "close enough for government work", so I thought I'd share my solution as worthless as it may be.
This solution is generally more accurate & useful than the only matching at the city level, but it will require you to use third-party services for geocoding when a user signs up if you only have their address. Hope it still helps.
1) Get the user's location. Use as much information as you can get:
Ask them where they are when they register
Use the HTML5/JS navigator.geolocation API http://merged.ca/iphone/html5-geolocation (works well with iPhones and the like)
Use IP geolocation database like http://www.maxmind.com/app/geolitecity (it's free, can be downloaded once and used locally, though it should be updated monthly for best results)
2) You need to store the location's latitude and longitude along with the user. If you don't already have it from a sensor lookup or Geo IP database, you will need to do a geocode lookup on the address. You asked not to use a third party service, but there really isn't a way around it (that's why the services exist; rolling your own is very complicated and expensive). See http://en.wikipedia.org/wiki/Geocoding#List_of_some_geocoding_systems for a list of geocoding services you can use.
// Google Maps geocoding example (V3 web service, JSON output)
$address = "$line1, $city, $state $zip, $country";
$query = http_build_query(array(
    'oe' => 'utf8',
    'sensor' => 'false', // set this to 'true' if you used navigator.geolocation
    'key' => 'YOUR_GMAPS_API_KEY_HERE',
    'address' => $address
));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://maps.google.com/maps/api/geocode/json?' . $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);
if ($response === null || $response['status'] !== 'OK') {
    // handle the failure instead of pretending nothing ever goes wrong
    die('Geocoding failed');
}
// lat/lng of the first (best) match
$latLong = $response['results'][0]['geometry']['location'];
// $latLong['lat'], $latLong['lng']
3) You can now search users by calculating the distance from your search location to each user's location and putting a limit on it for the proximity. Example:
(reference: http://jehiah.cz/a/spatial-proximity-searching-using-latlongs)
$myLat = 45.5;
$myLong = -73.5833;
$range = 2; // miles
$sql = "SELECT *,
truncate((degrees(acos(
sin(radians(latitude))
* sin( radians({$myLat}))
+ cos(radians(latitude))
* cos( radians({$myLat}))
* cos( radians(longitude - {$myLong}) )
) ) * 69.09),1) as distance
FROM users
HAVING distance < {$range}";
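A PHP equivalent is handy for sanity-checking the SQL or for filtering in application code. This sketch uses the haversine form rather than the law-of-cosines form in the query above; 3959 is an approximate mean Earth radius in miles:

```php
<?php
// Haversine great-circle distance in miles between two lat/long points.
function distanceMiles($lat1, $lon1, $lat2, $lon2) {
    $r = 3959; // mean Earth radius in miles (approximation)
    $dLat = deg2rad($lat2 - $lat1);
    $dLon = deg2rad($lon2 - $lon1);
    $a = pow(sin($dLat / 2), 2)
       + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * pow(sin($dLon / 2), 2);
    return 2 * $r * asin(sqrt($a));
}

// One degree of latitude is roughly 69 miles:
echo round(distanceMiles(45.5, -73.5833, 46.5, -73.5833), 1); // ~69.1
```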
Related
Say I have a database table representing users with potentially millions of records (Wishful thinking). This table contains a whole bunch of information about each user including information about their location:
City
County/State etc
Country
Latitude
Longitude
Geohash based on the latitude/longitude values.
I would like to implement a feature where by a logged in user can search for other users that are nearby.
Ideally, I would like to grab say the 20 users that are geographically closest to the user, followed by the next 20, and the next 20 etc. So essentially I want to be able to order my users table by the distance from a certain point.
Approach 1
I have some previous experience with the haversine formula which I used to calculate the distance between one point and a few hundred others. This approach would be ideal on a relatively small record set but I fear it would become incredibly slow with such a large record set.
Approach 2
I've additionally done some research into geohashing and I understand how the hash is calculated and I get the theory behind how it represents a location and how precision is lost with shorter resolutions. I could of course grab the users that are located near the user's geographical area by grabbing users that have a similar beginning to their geohash (Based on a precision I specify - and potentially looking in the neighbouring regions) but that doesn't solve the problem of needing to sort by location. This approach is also not great for edge cases where 2 users may be very close to one another but lie close to the edges of 2 regions represented by the geohash.
Any ideas/suggestion towards the approach would be greatly appreciated. I'm not looking for code in particular but links to good examples and resources would be helpful.
Thanks,
Jonathon
Edit
Approach 3
After some thought I've come up with another potential solution to consider. Upon receiving each user's location information, I would store information about the location (town/city, area, country, latitude, longitude, geohash maybe) in a separate table (say locations). I would then connect the user to the location by a foreign key. This would give me a much smaller dataset to work with. To find nearby users I could then simply find other locations that are close to the user's location and then use their IDs to find other users. Perhaps some sort of caching could be then implemented by storing a list of the nearby location IDs for each location.
You can try a space-filling curve: translate each coordinate to binary, interleave the bits, and treat the result as a base-4 number. Also, you're mistaken that a geohash can't be used to sort by location; it can. Most likely you'd use a bounding box to filter candidates and then refine with the haversine formula.
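The interleaving described above is the same trick geohash uses. A sketch in PHP, where the encoding ranges and bit count are my own choices: alternately halve the longitude and latitude ranges, emitting one bit per halving, so nearby points share a common prefix and the strings sort roughly by location.

```php
<?php
// Z-order / geohash-style bit string for a lat/long point.
function interleaveBits($lat, $lon, $bits = 30) {
    $latRange = array(-90.0, 90.0);
    $lonRange = array(-180.0, 180.0);
    $out = '';
    for ($i = 0; $i < $bits; $i++) {
        if ($i % 2 == 0) {            // even bits: longitude
            $mid = ($lonRange[0] + $lonRange[1]) / 2;
            if ($lon >= $mid) { $out .= '1'; $lonRange[0] = $mid; }
            else              { $out .= '0'; $lonRange[1] = $mid; }
        } else {                      // odd bits: latitude
            $mid = ($latRange[0] + $latRange[1]) / 2;
            if ($lat >= $mid) { $out .= '1'; $latRange[0] = $mid; }
            else              { $out .= '0'; $latRange[1] = $mid; }
        }
    }
    return $out; // read two bits at a time, this is the base-4 number
}

// Two points a few hundred metres apart share a long common prefix:
$a = interleaveBits(48.8566, 2.3522);
$b = interleaveBits(48.8600, 2.3500);
```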
I want to put together a PHP script to resolve city name (nothing else is needed) with a good resolution just for a single country (IRAN). As I have to query the DB for multiple times, its better to go through a downloadable local version.
I have read most of posts on stackoverflow and since now I have tested these:
GeoIP City from maxmind sounds good, but is not free.
GeoIP from maxmind, has a low level of accuracy (about 50-60%)
ip2country.net has an IP-2-City Database but not free and does not resolve city names for Iran.
I also tried the DB#.Lite from ipinfodb.com which has an API here without any success. The problem is that, it does not detect many city names.
I also tried hostip.info API, but it seems to be too slow.
There is a free php class with local DB which resolves only the country name.
I don't know if there is a chance of using Piwik with this GeoIP plugin. Ideas would be appreciated if someone knows about it.
ipinfo.io is another service which does not resolve city names with accuracy.
I don't know if there is a way to use Google Analytics to resolve city names; I think Google would be better than any other service for countries like Iran.
Any good idea would be really appreciated.
This is a tough one and hard to do reliably. I have given it a go in the past and it went something like this
Obtain a database of IP addresses, plus cities and countries (http://lite.ip2location.com/database-ip-country-region-city OR http://ipinfodb.com/ip_database.php)
Get the IP address and query for it against those tables to find the city and country
Finally, check whether it's in Iran using the country column.
There are paid-for services that can do this really quickly for you. It might take you ages to get something working that is still unreliable, because you simply do not have the data. I would seriously consider http://www.maxmind.com/en/city_per - unless of course this is a completely non-commercial project and $ is a no-no.
If you can get the lat and long from an IP table, even without the city data, then you may want to use something like this to check for the nearest city, if JavaScript is an option - Finding nearest listed (array?) city from known location.
What about the browser's Share Location feature?
If a browser-based solution works for your use case, you might want to look at MaxMind's GeoIP2 JavaScript API. It first attempts to locate the user using HTML5 geolocation and if that fails or is inaccurate, it reverts to MaxMind's GeoIP2 City data (not GeoLite). MaxMind provides a free attribution version.
Sometimes we need to use a local Geo-IP database instead of web services for particular purposes. This is my experience:
I downloaded a database from https://db-ip.com. I searched a lot and finally found this one to be the most reliable, but there were still two problems:
1. The "IP address to city" database is too huge to upload to a MySQL database on shared hosting, as the import exceeds the time-out limit.
2. The database is keyed on IP addresses, whereas the http://lite.ip2location.com database is keyed on IP numbers.
So I developed a simple .NET app to solve those problems.
The solution can be downloaded from here: https://github.com/atoosi/IP2Location-Database-Luncher
I am designing a web app where I need to determine which places listed in my DB are within the user's driving distance.
Here is a broad overview of the process that I am currently using -
Get the user's current location via Google's Maps API
Run through each place in my database (approx. 100), checking whether the place is within the user's driving distance using the Google Places API. I return and parse the JSON with PHP to see if any locations exist given the user's coordinates.
If a place is within the user's driving distance, display the top locations (limited to 20 by Google Places); otherwise don't display it.
This process works fine when I am running through a handful of places, but running through 100 places is much slower and makes 100 API calls. With Google's current limit of 100,000 calls per day, this could become an issue down the road.
So is there a better way to determine which places in my database are within a user's driving distance? I do not want to keep track of addresses in my DB; I want to rely on Google for that.
Thanks.
You can use the formula found here to calculate the distance between zip codes:
http://support.sas.com/kb/5/325.html
This is not precise (door-step to door-step) but you can calculate the distance from the user's zip code to the location's zip code.
Using this method, you won't even have to hit Google's API.
I have an unconventional idea for you. This will be very, very odd when you think about it for the first time, as it does exactly the opposite order of what you will expect to do. However, you might get to see the logic.
In order to put it in action, you'll need a broad category of stuff that you want the user to see. For instance, I'm going to go with "supermarkets".
There is a wonderful API as part of google places called nearbySearch. Its true wonder is to allow you to rank places by distance. We will make use of this.
Pre-requisites
Modify your database to store the unique ID returned for nearbySearch places. This isn't against the ToS, and we'll need it.
Get a list of those IDs.
The plan
When you get the user's location, query nearbySearch for your category, and loop through results with the following constraints:
If the result's ID matches something in your database, you have that result. Bonus #1: it's sorted by distance ascending! Bonus #2: you already get the lat-loc for it!
If the result's ID does not match, you can either silently skip it or use it and add it to your database. This means that you can quite literally update your database on-the-fly with little to no manual work as an added bonus.
When you have run through the request, you will have IDs that never came up in the results. Calculate the point-to-point distance of the furthest result in Google's data and you will have the max distance from your point. If this is too small, use the technique I described here to do a compounded search.
The only requirement is: you need to know roughly what you are searching for. However, consider this: your normal query cycle takes you anywhere between 1 and 100 google Queries. My method takes 1 for a 50km radius. :-)
To calculate distances, you will need the haversine formula rather than a zip-code lookup, by the way. This has the added advantage of being truly international.
Important caveats
This search method directly depends on the trade-off between the places you know about and the distance. If you are looking for less than 10km radii, use this method to only generate one request.
If, however, you have to do compounded searching, bear in mind that each request cycle will cost you 3N, where N is the number of queries generated on the last cycle. Therefore, if you only have 3 places in a 100km radius, it makes more sense to look up each place individually.
I have a whitelist of cities. Let's say, Seattle, Portland, Salem. Using GeoIP, I'd detect user city. Let's call it $user_city. Based on $user_city, I want to display classified-listings from nearest city from my whitelist (Seattle || Portland || Salem) with in 140 miles. If city is not listed in 140 miles, I'd just show a drop-down and ask user to manually select a city.
There are a few ways of doing this:
calculate this on the fly (I found an algorithm in one of SO answers)
with help of DB (let me explain):
create a table called regions
regions will have
city 1 | city 2 | distance (up to 140 miles)
city 1= cities from whitelist
city 2= any city within 140 miles from city 1
This would create a reasonable sized table. If my whitelist has 200 cities, and there are 40 cities (or towns) within 140 miles of each city. This would create 8000 rows.
Now, when a user comes to my site:
1) I check if user is from whitelist city already (city 1 column). If so, display that city
2). If not, check if $user_city is in "city 2" column
2a) if it is, get whitelist city with lowest distance
2b) if it is not, display drop-down for manual input
Final constraint: whichever method we select, it has to work from within iFrame. I mean, can I create this page on my mysite1.com and embed this page inside someothersite2.com inside an iframe? Will it still be able to get user_city and find nearest whitelisted city? I know there are some cross-domain scripting rules so I am not sure if iFrame would be able to get user-ip address, pass it to GeoIP, and resolve it to $user_city
So, my question:
How best to do this? If a lot of people embed my page in their page (using iframe) then my server would get pounded 10000s of times per second (wishful thinking, but let's assume that's the case). I don't know if a DB would be able to handle so much pounding. I don't want to have to pay for more DB servers or web-servers. I want to minimize resource-requirement at my end. So, I don't mind offloading a bit of work to user's browser via JavaScript.
EDIT:
Some answers have recommended storing lat, long and then doing the Math. The reason I suggested creating a 'regions' table is that this way all math is precomputed. If I have a "whitelist" of cities, and if I precompute all possible nearby city for each whitelisted city. Then I don't have to compute distance (using Haversine algorithm for eg) everytime.
Is it possible to offload all of this to user's browser via some crafty use of Java Script? I don't want to overload my server for a free service. It might make money but I am very close to broke and I am afraid my server would go down before I make enough money to pay for the upgrades.
So, the three constraints of this problem are 1) should work from inside iframe (I am hoping this will go viral and every blogger would want to embed my site into their page's iframe. 2) should be very fast 3) should minimize load on my server
Use one table, City, and do the math in MySQL for every query, with the addition of a cache layer, e.g. memcached. Fair performance and very flexible!
Use two tables, City (id, lat, lng, name) and Distance (city_id1, city_id2, dist), and get your result with a traditional JOIN. (Could use a cache layer too.) Not very flexible.
Custom data structure: CityObj (id, lat, lng, data[blob]). Just serialize and compress a PHP array of the cities and store it. This might raise your eyebrows, but as we know the bottleneck is rarely CPU or memory; it's disk I/O. This is one read from an index on an INT, as opposed to the JOIN, which uses a temporary table. It's not very flexible, but it will be fast and scalable, and it's easy to shard and cluster.
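The serialized-blob option is a one-liner in PHP. This sketch assumes the zlib extension (almost always present); the city array is illustrative, and in practice the blob would live in a BLOB column keyed by an INT:

```php
<?php
// CityObj-style storage: serialize and compress the whole city array,
// store the blob in one row, and read it back with a single indexed lookup.
$cities = array(
    array('id' => 1, 'name' => 'Seattle',  'lat' => 47.6062, 'lng' => -122.3321),
    array('id' => 2, 'name' => 'Portland', 'lat' => 45.5152, 'lng' => -122.6784),
    array('id' => 3, 'name' => 'Salem',    'lat' => 44.9429, 'lng' => -123.0351),
);

$blob = gzcompress(serialize($cities));       // store this in a BLOB column
$restored = unserialize(gzuncompress($blob)); // one read, no JOIN, no tmp-table

var_dump($restored === $cities); // bool(true)
```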
Is it possible to offload all of this to user's browser via some crafty use of Java Script? I don't want to overload my server for a free service. It might make money but I am very close to broke and I am afraid my server would go down before I make enough money to pay for the upgrades.
Yes, it is possible...using Google Maps API and the geometry library. The function you are looking for is google.maps.geometry.spherical.computeDistanceBetween. Here is an example that I made a while ago that might help get you started. I use jQuery here. Take a look at the source to see what's happening and modify as needed. Briefly:
supplierZips is an Array of zip codes comparable to your city whitelist.
The first thing I do on page load is geocode the whitelist locations. You can actually do this ahead of time and cache the results, if your city whitelist is constant. This'll speed up your app.
When the user enters a zip code, I first check if it's a valid zip from a json dataset of all valid zip codes in the U.S.( http://ampersand.no.de/maps/validUSpostalCodes.json, 352 kb, data generated from zip code data at http://www.geonames.org).
If the zip is valid, I compute the location between that zip and each location in the whitelist, using the aforementioned computeDistanceBetween in the Google Maps API.
Hope this helps get you started.
You just have to get the lat and long of each city and add them to the database.
So every city has only one record; no distances are stored, only the position on the globe.
Once you have that, you can easily run a query using the haversine formula ( http://en.wikipedia.org/wiki/Haversine_formula ) to get the nearest cities within a range.
know there are some cross-domain scripting rules so I am not sure if iFrame would be able to get user-ip address
It will be possible to get the user ip or whatever if you just get the info from the embedded page.
I don't know if a DB would be able to handle so much pounding
If you have that many requests you should have by then found a way to make a buck with it :-) which you can use for upgrades :D
Your algorithm seems generally correct. What I would do is use PostGIS (a PostgreSQL extension, and easier to set up than it looks :-D). I believe the additional learning curve is totally worth it; it is THE standard for geodata.
If you put the whitelist cities in as POINTs, with latitudes and longitudes, you can actually ask PostGIS to sort by distance to a given lat/lon. It should be much more efficient than doing it yourself (PostGIS is very optimized).
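The sort-by-distance query might look like this, assuming a whitelist_cities table with a geom point column in SRID 4326 (table and column names are mine). On PostGIS 2.x with a GiST index on geom, the <-> operator makes the ORDER BY index-assisted:

```sql
-- Nearest whitelist city to a given point, closest first.
-- Note ST_MakePoint takes (longitude, latitude).
SELECT name,
       ST_Distance(geom::geography,
                   ST_SetSRID(ST_MakePoint(-122.33, 47.61), 4326)::geography) AS meters
FROM whitelist_cities
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-122.33, 47.61), 4326)
LIMIT 1;
```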
You could get lats and longs of your user cities (and the whitelist cities) by using a geocoding API like Yahoo Placefinder or Google Maps. What I would do would be to have a table (either the same as the whitelist cities or not) that stores city name, lat, and lon, and do lookups on that. If the city name isn't found though, hit the API you are using, and cache the result in the table. This way you'll quickly not need to hit the API except for obscure places. The API is fast too.
If you're really going to see that kind of server load, you may want to look into something besides PHP (such as node.js). Incidentally, you shouldn't have any trouble geocoding from an iframe; from the server's point of view, it's just like the browser requesting the page normally.
I'm connecting to the Google Maps API from PHP to geocode some starting points for a rental station locator application.
Those starting points don't have to be exact addresses; city names are enough. Geocoding responses with an accuracy equal to or greater than 4 (city/locality level) are used as starting points, and the surrounding rental stations are searched.
The application is supposed to work in Germany. When a locality name is ambiguous (i.e. there are more than one place of that name) I want to display a list of possibilities.
That is not a problem in general: If you make an ambiguous search, Google's XML output returns a list of <PlaceMark> elements instead of just one.
Obviously, I need to bias the geocoding towards Germany, so if somebody enters a postcode or the name of a locality that exists in other countries as well, only hits in Germany actually come up.
I thought I could achieve this by adding , de or , Deutschland to the search query. This works mostly fine, but produces bizarre and intolerable results in some cases.
There are, for example, 27 localities in Germany named Neustadt. (Wikipedia)
When I search for Neustadt alone:
http://maps.google.com/maps/geo/hl=de&output=xml&key=xyz&q=Neustadt
I get at least six of them, which I could live with (it could be that the others are not incorporated, or parts of a different locality, or whatever).
When, however, I search for Neustadt, de, or Neustadt, Deutschland, or Neustadt, Germany, I get only one of the twenty-seven localities, for no apparent reason: it is not the biggest, nor does it have the highest accuracy value, nor any other unique characteristic.
Does anybody know why this is, and what I can do about it?
I tried the region parameter, but to no avail: when I do not use , de, postcodes (like 50825) are resolved to their US counterparts, not the German ones.
My current workaround idea is to add the country name when the input is numeric only, and otherwise filter out only german results, but this sounds awfully kludgy. Does anybody know a better way?
This is definitely not an exhaustive answer, but just a few notes:
You are using the old V2 version of the Geocoding API, which Google have recently deprecated in favour of the new V3 API. Google suggests to use the new service from now on, and while I have no real experience with the new version, it seems that they have improved the service on various points, especially with the structure of the response. You do not need an API key to use the new service, and you simply need to use a slightly different URL:
http://maps.google.com/maps/api/geocode/xml?address=Neustadt&sensor=false
You mentioned that you were filtering for placemarks on their accuracy property. Note that this field does not appear anymore in the results of the new Geocoding API, but in any case, I think it was still not very reliable in the old API.
You may want to try to use the bounds and region parameters of the new API, but I still doubt that they will solve your problem.
I believe that the Google Geocoder is a great tool for when you give it a full address in at least a "street, locality, country" format, and it is also very reliable in other formats when it doesn't have to deal with any ambiguities (Geocoding "London, UK" always worked for me).
However, in your case, I would really consider pre-computing all the coordinates of each German locality and simply handle the geocoding yourself from within your database. I think this is quite feasible especially since your application is localized to just one country. Each town in the Wikipedia "List of German Towns" appears to have the coordinates stored inside a neat little <span> which looks very easy to parse:
<span class="geo">47.84556; 8.85167</span>
There are sixteen Neustadts in that list, which may be better than Google's six :)
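Pulling the coordinates out of that span is a short regex job in PHP. A real scraper should use a DOM parser, but for this one fixed format a regex sketch works; the function name is mine:

```php
<?php
// Extract "lat; lon" pairs from Wikipedia's <span class="geo"> markup.
function parseGeoSpans($html) {
    preg_match_all(
        '#<span class="geo">\s*(-?[\d.]+);\s*(-?[\d.]+)\s*</span>#',
        $html,
        $m,
        PREG_SET_ORDER
    );
    $coords = array();
    foreach ($m as $match) {
        $coords[] = array((float) $match[1], (float) $match[2]);
    }
    return $coords;
}

$html = '<span class="geo">47.84556; 8.85167</span>';
print_r(parseGeoSpans($html));
```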
I found that Google's autocomplete works better than their geocoder:
http://code.google.com/apis/maps/documentation/javascript/places.html#places_autocomplete
Found this question while searching for this exact issue. Then I realized the Bing Maps API works much better. It has its quirks, though: for example, if you pass an airport code, it will show you six different results, one for each terminal of the airport.