I have a group of users. The user count could be 50 or could be 2000. Each should have a long/lat that I have retrieved from Google Geo api.
I need to query them all, and group them by proximity and a certain count. Say the count is 12 and I have 120 users in the group. I want to group people by how close they are (long/lat) to other people. So that I wind up with 10 groups of people who are close in proximity.
I currently have the Google Geocoding API set up and would prefer to use that.
TIA.
-- Update
I have been googling this for a while and it appears that I am looking for a spatial query that returns groups by proximity.
Keep in mind that this problem grows quadratically (not linearly) with every user you add, because the number of distance calculations is tied to the square of the number of users. It's actually N*(N-1) distances, so a 2000-user base would mean almost 4 million distance calculations on every pass. Keep that in mind when sizing the resources you need.
Are you looking to group them based on straight-line (actually great circle) distance or based on walking/driving distance?
If the former, the great circle distance can be approximated with simple math if you're able to tolerate a small margin of error and wish to assume the earth is a sphere. From GCMAP.com:
Earth's hypothetical shape is called the geoid and is approximated by
an ellipsoid or an oblate sphereoid. A simpler model is to use a
sphere, which is pretty close and makes the math MUCH easier. Assuming
a sphere of radius 6371.2 km, convert longitude and latitude to
radians (multiply by pi/180) and then use the following formula:
theta = lon2 - lon1
dist = acos(sin(lat1) × sin(lat2) + cos(lat1) × cos(lat2) × cos(theta))
if (dist < 0) dist = dist + pi
dist = dist × 6371.2
The resulting distance is in kilometers.
Now, if you need precise calculations and are willing to spend the CPU cycles needed for much more complex math, you can use Vincenty's formulae, which use the WGS-84 reference ellipsoid model of the earth, the same model used for navigation, mapping and whatnot. More info HERE
As to the algorithm itself, you need to build a to-from matrix with the result of each calculation. Each row and column would represent each node. Two simplifications you may consider:
Distance does not depend on direction of travel, so $dist[n][m] == $dist[m][n] (no need to calculate the whole matrix, just half of it)
Distance from a node to itself is always 0, so there is no need to calculate it. But since you intend to group by proximity, to avoid a user being grouped with itself you may want to always force $dist[m][m] to an arbitrarily defined and abnormally large constant ($dist[m][m] = 22000, for instance, which exceeds any possible distance in miles or kilometers as long as all your users are on the planet).
After making all the calculations, use an array sorting method to find the X closest nodes to each node and there you have it
(you may or may not want to prevent a user being grouped on more than one group, but that's just business logic)
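A rough sketch of the steps above, in Python rather than PHP. Everything here is illustrative: the function names are made up, the grouping rule (take the lowest-numbered unassigned user as a seed, then attach its closest unassigned neighbours) is just one possible piece of "business logic", and it keeps each user in exactly one group. Because the seed is removed from the candidate set before sorting, the large-constant trick for the diagonal is not needed in this variant.

```python
import math

EARTH_RADIUS_KM = 6371.2  # same sphere radius as the quoted formula

def distance_km(a, b):
    """Great-circle distance between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    return math.acos(max(-1.0, min(1.0, cos_angle))) * EARTH_RADIUS_KM

def build_matrix(points):
    """To-from matrix; only the upper half is computed, then mirrored."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = distance_km(points[i], points[j])
    return dist

def group_by_proximity(points, group_size):
    """Greedy grouping: each group is a seed plus its nearest unassigned users."""
    dist = build_matrix(points)
    unassigned = set(range(len(points)))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        nearest = sorted(unassigned, key=lambda j: dist[seed][j])[:group_size - 1]
        unassigned.difference_update(nearest)
        groups.append([seed] + nearest)
    return groups
```

The groups come back as lists of indexes into the input list. A greedy pass like this is order dependent, so the last groups formed can be less compact than the first ones.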
Actual code would be a little too much to provide at this time without seeing some of your progress first, but this is basically what you need to do algorithmically.
... it appears that I am looking for a spatial query that returns groups by proximity. ...
You could use hdbscan. Your groups are actually clusters in hdbscan wording. You would need to work with min_cluster_size and min_samples to get your groups right.
https://hdbscan.readthedocs.io/en/latest/parameter_selection.html
https://hdbscan.readthedocs.io/en/latest/
It appears that hdbscan runs under Python.
Here are two links on how to call Python from PHP:
Calling Python in PHP,
Running a Python script from PHP
Here is some more information on which clustering algorithm to choose:
http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb
http://scikit-learn.org/stable/modules/clustering.html#clustering
Use GeoHash algorithm[1]. There is a PHP implementation[2]. You may pre-calculate geohashes with different precision, store them in SQL database alongside lat-lon values and query using native GROUP BY.
https://en.wikipedia.org/wiki/Geohash
https://github.com/lvht/geohash
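To show what a geohash actually is, here is a small Python sketch of the standard encoding (this is not the linked PHP library, and the function name is my own). It repeatedly halves the longitude and latitude ranges, alternating between the two, and emits one base-32 character per 5 bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair (degrees) as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Points sharing a prefix fall in the same cell, which is why a GROUP BY on a stored prefix works. One caveat to keep in mind: two points can be very close and still sit on opposite sides of a cell boundary, so they end up with different prefixes.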
Related
I have a set of places to choose from. I want to pick one as my source place and select every place whose driving time from it is less than 30 minutes, so that only the places I can reach by car in under 30 minutes are shown.
So, what is the best way to store all this place data and query it under such conditions?
Before asking this question, I tried saving the latitude and longitude of every place. Whenever a new place is saved to the database, I call the HERE map routing API to calculate the distance and drive time between the new place and every existing place in the database, then save the results in the distance table.
When a user wants to query places as in the example above, I join the places table and the distance table like this:
SELECT place.id, place.name FROM place JOIN distance ON place_id = place.id WHERE distance.cost_time < 30;
One problem bothers me: if the number of existing places is large (and it will be), saving a new place to the database hangs for longer and longer.
I know this is a poor way to reach my goal, but I don't know what to do instead. Can someone help me with this problem?
Last but not least, please forgive my poor English; if something is unclear, I'll do my best to describe it. Thank you.
You probably need to build a connected graph and compute the distances to other points on the fly.
When a new point is added, compute its distance with the X nearest neighbours only and store them in a database.
Then you can use an algorithm like Dijkstra's to find all the points less than 30 units from your source.
You will lose some precision, as the cost of driving from A to C and then from C to B will usually be greater than the direct path from A to B. And the time you save when adding a new point will be "lost" again to running Dijkstra's algorithm at query time.
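Something along these lines, sketched in Python. The graph layout (a dict mapping each place to a list of (neighbour, drive_time) pairs) and the function name are assumptions for illustration; in practice the edges would be loaded from the stored nearest-neighbour rows.

```python
import heapq

def reachable_within(graph, source, limit):
    """Dijkstra's algorithm that never expands paths costing `limit` or more.

    graph: {node: [(neighbour, cost), ...]} (hypothetical in-memory layout).
    Returns {node: cheapest_cost} for every node strictly closer than limit.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < limit and new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    del best[source]  # the source itself is not a result
    return best
```

The cutoff keeps the search local: places further than the limit are never pushed onto the queue, so the work depends on how many places are nearby rather than on the size of the whole table.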
I'm building a php web app with Laravel 5.5 and I need to display a list of places (eg. stores) sorted by their distance from a user-specified location.
The places will be stored in a MySQL database and should be retrieved as Eloquent ORM model instances.
Doing some research I found many posts and questions on this topic (presenting different solutions), but, having very little experience with databases and geolocation/geospatial analysis, they mostly confused me, and I'd like to know what approach to follow and what are the best practices in this case.
Most answers I read suggest using the haversine formula or the spherical law of cosines in the SQL query, which would look something like (example taken from this answer):
$sf = 3.14159 / 180; // scaling factor
$sql = "SELECT * FROM table
WHERE lon BETWEEN '$minLon' AND '$maxLon'
AND lat BETWEEN '$minLat' AND '$maxLat'
ORDER BY ACOS(SIN(lat*$sf)*SIN($lat*$sf) + COS(lat*$sf)*COS($lat*$sf)*COS((lon-$lon)*$sf))";
This post points out the fact that, over short distances, assuming the Earth flat and computing a simple euclidean distance is a good approximation and is faster than using the haversine formula.
Since I only need to sort places within a single city at a time, this seems to be a good solution.
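To make the flat-Earth idea concrete, here is a small Python sketch (not Laravel or Eloquent code; the function names and the dict layout of a place are made up). It scales the longitude difference by the cosine of the mean latitude and then applies plain Pythagoras, which is usually called the equirectangular approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def flat_distance_km(lat1, lon1, lat2, lon2):
    """Approximate distance over short ranges, treating the area as flat."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    # A degree of longitude is shorter away from the equator, hence the cosine.
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return math.hypot(x, y) * EARTH_RADIUS_KM

def sort_by_distance(places, lat, lon):
    """places: list of dicts with 'lat' and 'lon' keys (hypothetical layout)."""
    return sorted(places,
                  key=lambda p: flat_distance_km(lat, lon, p["lat"], p["lon"]))
```

If only the ordering matters, the multiplication by the radius and even the square root could be dropped, since neither changes the sort order.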
However, most of these posts and SO answers are a few years old and I was wondering if there is now (MySQL 5.7) a better solution.
For example, none of those posts use any of MySQL's “Spatial Analysis Functions”, like ST_Distance_Sphere and ST_Distance, which seem to be made exactly for that purpose.
Is there any reason (eg. performance, precision) not to use these functions instead of writing the formula in the query? (I don't know which algorithm is internally used for these functions)
I also don't know how I should store the coordinates of each place.
Most of the examples I've seen assume the coordinates to be stored in separate lat, lon columns as doubles or as FLOAT(10,6) (as in this example by google), but also MySQL POINT data type seems appropriate for storing geographic coordinates.
What are the pros and cons of these two approaches?
How can indexes be used to speed up this kind of query? For example, I've read about “spatial indexes”, but I think they can only be used for limiting the results with something like MBRContains(), not to actually order the results by distance.
So, how should I store the coordinates of places and how should I query them to be ordered by distance?
Other than the ST_Distance_Sphere, 5.7 does not bring anything extra to the table. (SPATIAL was already implemented.)
For 'thousands' of points, the code you have is probably the best. Include
INDEX(lat, lng),
INDEX(lng, lat)
And I would not worry about the curvature of the earth unless you are stretching thousands of miles (kms). Even then the code and that function should be good enough.
Do not use FLOAT(m,n), use only FLOAT. The link below gives the precision available to FLOAT and other representations.
If you have so many points that you can't cache the table and its indexes entirely (many millions of points), you could use this, which uses a couple of tricks to avoid lengthy scans like the above solution. Because of PARTITION limitations, lat/lng are represented as scaled integers. (But that is easy enough to convert in the input/output.) The earth's curvature, poles, and dateline are all handled.
I use a table that has lat & long associate with zip codes that I found. I use the haversine formula to find all zipcodes within a certain range. I then use that list of zip codes that are returned from that query and find all business with those zip codes. Maybe that solution will work for you. It was pretty easy to implement. This also eliminates you having to know the lat and long for the each business as long as you know the zip code.
Use ST_DISTANCE_SPHERE or MBRContains to get the distance between points, or the points within a bound. This is much faster than the haversine formula, which can't use indexes and is not built for querying distances, and MySQL is slow with range queries. Refer to the MySQL documentation.
The haversine formula is probably fine for small applications, and most of the older answers refer to that solution because older versions of MySQL's InnoDB did not have spatial indexes.
The broad method of doing it is as follows - the below is from my working code in Java - hope you can tailor it for PHP as per your needs
First save the incoming data as a Point in database (Do note that the coordinate formula uses longitude, latitude convention)
GeometryFactory factory = new GeometryFactory();
Point point = factory.createPoint(new Coordinate(officeDto.getLongitude(), officeDto.getLatitude()));//IMP:Longitude,Latitude
officeDb.setLocation(point);
Create Spatial Indexes using the following in mysql
CREATE SPATIAL INDEX location ON office (location);
You might get the error "All parts of a SPATIAL index must be NOT NULL". That is because spatial indexes can only be created if the field is NOT NULL - in such a case convert the field to non-null
Finally, call the built-in function ST_DISTANCE_SPHERE from your code as follows.
SELECT st_distance_sphere( office.getLocation , project.getLocation)
as distance FROM ....
Note: office.getLocation and project.getLocation both return POINT types. Native SQL method is as below from documentation
ST_Distance_Sphere(g1, g2 [, radius])
which returns the minimum spherical distance between two points and/or multipoints on a sphere, in meters, or NULL if any geometry argument is NULL or empty.
I have a MySQL table with thousands of data points stored in 3 columns R, G, B. how can I find which data point is closest to a given point (a,b,c) using Euclidean distance?
I'm saving RGB values of colors separately in a table, so the values are limited to 0-255 in each column. What I'm trying to do is find the closest color match by finding the color with the smallest euclidean distance.
I could obviously run through every point in the table to calculate the distance but that wouldn't be efficient enough to scale. Any ideas?
I think the above comments are all true, but they are - in my humble opinion - not answering the original question. (Correct me if I'm wrong). So, let me here add my 50 cents:
You are asking for a select statement. Given that your table is called 'colors', that your columns are called r, g and b and hold integers in the range 0..255, and that you are looking for the value in your table closest to a given value, let's say rr, gg, bb, I would try the following:
select min(sqrt((rr-r)*(rr-r)+(gg-g)*(gg-g)+(bb-b)*(bb-b))) from colors;
Now, this answer is given with a lot of caveats, as I am not sure I got your question right, so pls confirm if it's right, or correct me so that I can be of assistance.
1. Since you're looking for the minimum distance and not the exact distance, you can skip the square root. I think Squared Euclidean Distance applies here.
2. You've said the values are bounded between 0 and 255, so each per-channel difference falls between -255 and 255, and you can make an indexed lookup table of the squares with one row per possible difference (511 rows).
3. Here is what I'm thinking in terms of SQL. r0, g0, and b0 represent the target color. The table Vector would hold the square values mentioned above in #2. This solution would visit all the records, but the result set can be limited to 1 by sorting and selecting only the first row.
select
c.r, c.g, c.b,
mR.dist + mG.dist + mB.dist as squared_dist
from
colors c,
vector mR,
vector mG,
vector mB
where
c.r-r0 = mR.point and
c.g-g0 = mG.point and
c.b-b0 = mB.point
group by
c.r, c.g, c.b
The first level of optimization that I see you can do would be square the distance to which you want to limit the query so that you don't need to perform the square root for each row.
The second level of optimization I would encourage would be some preprocessing to alleviate the need for extraneous squaring for each query (which could possibly create some extra run time for large tables of RGB's). You'd have to do some benchmarking to see, but by substituting in values for a, b, c, and d and then performing the query, you could alleviate some stress from MySQL.
Note that the performance difference between the last two lines may be negligible. You'll have to use test queries on your system to determine which is faster.
I just re-read and noticed that you are ordering by distance. In that case, the d should be removed and everything should be moved to one side. You can still plug in the constants to prevent extra processing on MySQL's end.
I believe there are two options.
Either, as you say, you iterate across the entire set, compare each point, and check against a running minimum that you initially set to an impossibly high number. This runs in linear time, O(n), since you're only comparing one point to every point in the set.
I'm still thinking of another option... something along the lines of doing a breadth first search away from the input point until a point is found in the set at the searched point, but this requires a bit more thought (I imagine the 3D space would have to be pretty heavily populated for this to be more efficient on average though).
If you run through every point and calculate the distance, don't use the square root function, it isn't necessary. The smallest sum of squares will be enough.
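For illustration, the linear scan without the square root looks like this in Python (the function name and the tuple layout are my own choices; in practice the rows would come from the colors table).

```python
def nearest_color(colors, target):
    """Return the (r, g, b) tuple with the smallest squared distance to target.

    The square root is skipped on purpose: it does not change which
    candidate is the closest, only the reported distance value.
    """
    tr, tg, tb = target
    return min(
        colors,
        key=lambda c: (c[0] - tr) ** 2 + (c[1] - tg) ** 2 + (c[2] - tb) ** 2,
    )
```

This still visits every row once, so it is the O(n) baseline that any index-based approach would have to beat.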
This is the problem you are trying to solve. (Planar case, select all points sorted by a x, y, or z axis. Then use PHP to process them)
MySQL also has a Spatial Database which may have this as a function. I'm not positive though.
I have a system which will return all users from the database and order the results by lowest distance from a reference zip code.
For example: User will come on the site, enter zip code and it will return him all other users who are nearest to his zip (ascending order)
How am i doing this now and why is it a problem ?
The system contains more than 30 million users and their zipcodes. I am retrieving all the users in a particular state and city (narrows the dataset down to about 10,000).
This is where the problem actually happens. Now, all the result sent by mysql (10,000) rows to PHP are sent to a zipcode calculator library which calculates this distance between the base zip code and user's zip code - 10,000 times. Then orders the result by the zip code nearest.
As you can see, this is very badly optimized code. And the 10,000 records are looped through twice. Not to mention the amount of RAM each httpd process takes just transferring data to and fro mysql.
What I would like to ask the gurus here is whether there is any way to optimize this.
I have a few ideas of my own, but i'm not sure how efficient they are.
Try to do all the zipcode calculation and ordering in mysql itself and return the paginated number of rows.
For this, i will need to move the distance between zipcode calculation logic to a stored procedure. This way I am preventing the processing of 10,000 records in PHP. However, there is still a problem. I would not need to calculate distance for zip codes which have already been calculated (for 2 users having the same zip code).
Secondly, how do i order rows in mysql using a stored procedure ?
What do you guys think ? Is this a good way ? Can i expect a performance boost using this ?
Do you have any other suggestions ?
I know this question is huge, and i really appreciate the time you have taken to read till the end. I would really like to hear your thoughts about this.
As I'm not overly familiar with PHP or MySQL, I can only give some basic tips but they should help. This also assumes you have no direct way of interfacing with the zip library from MySQL.
First, as it's doubtful that you have 10k zip codes in a city, take your existing query and do something like
SELECT DISTINCT ZipCode FROM Users WHERE ...
This will probably return a few dozen zip codes max, and no duplicates. Run this through your zip code library. That library itself is probably a source of slowness, as it has to look up the zip codes, and do a bunch of fancy trig to get actual distance. Take the results of this, and insert it into a temp table with just the zip code and the distance.
Once done with that list, have another query that gets the rest of the user data you want, and JOIN into the temp table on zip code to get your distance.
This should give you quite a large speedup. You can do whatever paging you need in the second query after the results have been calculated. And no more looping through 10k rows.
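The same idea, sketched in Python with an in-memory dict standing in for the temp table. The names are made up, and zip_distance is a placeholder for whatever your zip code library exposes; the point is only that the expensive call runs once per distinct zip code instead of once per user.

```python
def rank_users_by_zip_distance(users, base_zip, zip_distance):
    """Order users by the distance of their zip code from base_zip.

    users: list of (user_id, zip_code) tuples (hypothetical layout).
    zip_distance: callable (zip_a, zip_b) -> distance, standing in for
    the zip code library.
    """
    # Equivalent of SELECT DISTINCT ZipCode: one entry per zip code.
    distinct_zips = {zip_code for _, zip_code in users}
    # Equivalent of the temp table: zip code -> distance, computed once each.
    distance_by_zip = {z: zip_distance(base_zip, z) for z in distinct_zips}
    # Equivalent of the JOIN plus ORDER BY on the looked-up distance.
    return sorted(users, key=lambda u: distance_by_zip[u[1]])
```

With 10,000 users spread over a few dozen zip codes, the library is called a few dozen times rather than 10,000 times.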
I suggest that you narrow the latitude and longitude ranges before you compute the accurate distance for filtering and sorting purposes.
What I mean is if you do a full table scan and compute distances for all zip codes in the database relative to your reference point, it will be very slow.
Instead, filter zip codes by proximity. I mean, if you have latitude 10 and longitude 20, first compute the maximum angular range for the proximity you want. Let's say you want a proximity range of 10 miles. That may translate into 0.15 degrees. So you need to filter your zip codes first by latitude between 10-0.15 and 10+0.15 and longitude between 20-0.15 and 20+0.15.
Only after that you include the accurate distance clause in your SQL query condition. That will be much faster because you no longer do full scan and you can eventually use range indexes on longitude and latitude fields.
To translate miles into degrees and find the narrow range, keep in mind that the Earth has a perimeter of approximately 25,000 miles. Dividing 25,000 by 360 degrees gives roughly 70 miles per degree. If you want a range of 10 miles, your range in degrees will be at most 0.15 degrees.
Keep in mind that these calculations are not accurate (the Earth is not exactly well rounded) but that is not important. What is important is that you find a degree range value that is higher than the really accurate value.
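A small Python sketch of that pre-filter. The function names and the row layout are invented for the example. One thing goes beyond the text above and is my own assumption: a degree of longitude gets shorter as you move away from the equator, so the sketch widens the longitude range by 1/cos(latitude) to keep the box at least as large as the true range.

```python
import math

MILES_PER_DEGREE = 25000 / 360  # about 69.4, from the rough perimeter figure

def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the radius."""
    lat_delta = radius_miles / MILES_PER_DEGREE
    # Assumption beyond the text: widen the longitude range by 1/cos(lat),
    # because degrees of longitude shrink towards the poles.
    lon_delta = lat_delta / max(math.cos(math.radians(lat)), 1e-6)
    return (lat - lat_delta, lat + lat_delta, lon - lon_delta, lon + lon_delta)

def candidates_in_box(rows, box):
    """rows: list of (zip_code, lat, lon); keeps only rows inside the box."""
    min_lat, max_lat, min_lon, max_lon = box
    return [r for r in rows
            if min_lat <= r[1] <= max_lat and min_lon <= r[2] <= max_lon]
```

In SQL the same box becomes two BETWEEN conditions on the latitude and longitude columns, and the accurate distance clause then only runs on the rows that survive.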
If you can get the latitude and longitude for all zipcodes into MySQL, or have an easy way of fetching the lat/long for your base zipcode and feeding it into your MySQL query, then you can order your 10k users by distance inside MySQL. There is a very similar question and answer here which gives you the correct math for the distance function. You may also want to investigate Mysql spatial extensions which would let you insert and index your lat/longs as 2D POINT data.
I have a MySQL table of records, each with a lat/lng coordinate. Searches are conducted on this data based on a center point and a radius (any records within the radius is returned). I'm using the spherical law of cosines to calculate the distance within my query. My problem is that indexing the geodata is horribly inefficient (lat/lng values are stored as floats). Using MySQL's spatial extensions is not an option. With datasets around 100k in size the query takes an unreasonable amount of time to execute.
I've done some research and it seems like using a z-index i.e. Morton number could help. I could calculate the Morton number for each record on insertion and then calculate a high/low Morton value for a bounding box based on the Earth's radius/center point/given search radius.
I only know enough about this stuff to build my app so I'm not entirely sure if this would work, and I also don't know how I can compute the Morton number in PHP. Would this be a bitwise operation?
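On the bitwise part, for anyone else wondering: yes, a Morton number is built by interleaving the bits of two integers. Here is a rough Python sketch of that step only (not PHP; the function names and the 16-bit grid are arbitrary choices for illustration). Latitude and longitude are first scaled to non-negative integers, then their bits are woven together.

```python
def interleave_bits(x, y, bits=16):
    """Morton code: bits of x go to even positions, bits of y to odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_from_lat_lon(lat, lon, bits=16):
    """Scale degrees onto a 2^bits by 2^bits grid, then interleave."""
    max_cell = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * max_cell)  # longitude -> 0..max_cell
    y = int((lat + 90.0) / 180.0 * max_cell)   # latitude  -> 0..max_cell
    return interleave_bits(x, y, bits)
```

As far as I understand it, the low/high Morton values of a bounding box's corners give a range that contains the box but also covers cells outside it, so the results would still need a lat/lng check afterwards.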
If your radius is small compared to the size of the Earth, then you can probably get by with simple 2D Pythagoras rather than expensive 3D spherical geometry. This is probably less true the closer you get to the poles, so I hope you're not mapping penguins or polar bears!
Next, think about the bounding boxes for your problem. You know they must be within +/- $radius of the search point. Convert the search radius to degrees and find all records where lat/lon is within the box defined by the search center +/- $radiusindegrees.
If you do that search first and come up with a list of possible matches then you have only to filter out the corners of your search box from the resulting data set. If you get back the lat/lon of the matching points you can calculate the distance in PHP and avoid having to calculate it for all points in the table. Did that make sense?
Use the database to find everything that fits within a square bounding box and then use PHP to filter those points that are outside of the desired radius.