Calculating Squared Euclidean distance - php

I have an exe which returns an array of 16 elements. I have to pass this array to MySQL using PHP to calculate the Euclidean distance. My table in MySQL has this form:
id | img_id | features_1 | features_2 | features_3 | features_4 | ... | features_16
---|--------|------------|------------|------------|------------|-----|------------
 1 |      1 | 0.389      | 0.4567     | 0.8981     | 0.2345     | ... |
 2 |      2 | 0.9878     | 0.4567     | 0.56122    | 0.4532     | ... |
 3 |      3 |            |            |            |            |     |
 4 |      4 |            |            |            |            |     |
......................
So I have 16 features for each image, and 30,000 images (that is, img_id goes up to 30,000). I have to compute the Euclidean distance between the array from the exe (passed through PHP) and the rows in the database, and return the img_id of the 6 images whose distance is smallest. For example, suppose the exe returns the (truncated) array A = [0.458, 0.234, 0.4567, 0.2398]. For img_id = 1 the squared distance would be (0.458-0.389)^2 + (0.234-0.4567)^2 + (0.4567-0.8981)^2 + (0.2398-0.2345)^2. I have to repeat this for all 30,000 images and return the 6 img_ids with the least distance. What is an efficient, fast way to calculate this?

Since PHP is slow, you should do this directly in SQL, like this:
SELECT * FROM tablename
ORDER BY ABS(features_1 - :f1) + ABS(features_2 - :f2) + ... + ABS(features_16 - :f16) ASC
LIMIT 6;
Note that I used the absolute (L1) norm instead of the Euclidean norm. In finite-dimensional vector spaces all norms are equivalent, so this is a reasonable proxy when you don't need the actual distance values, though the two norms can occasionally rank points differently. SQLite, for example, does not provide a SQUARE function, and writing (f1 - :f1) * (f1 - :f1) every time is annoying; in MySQL you can use POW(features_1 - :f1, 2) if you need the true squared Euclidean ordering.
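For reference, the same top-6 computation in plain code (an illustrative sketch, not the asker's setup; the data is made up and truncated to 4 of the 16 features):

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_ids(query, rows, k=6):
    """rows: list of (img_id, feature_vector). Return the k closest img_ids."""
    ranked = sorted(rows, key=lambda row: squared_distance(query, row[1]))
    return [img_id for img_id, _ in ranked[:k]]

# Made-up 4-element vectors matching the truncated example in the question.
rows = [
    (1, [0.389, 0.4567, 0.8981, 0.2345]),
    (2, [0.9878, 0.4567, 0.56122, 0.4532]),
]
print(nearest_ids([0.458, 0.234, 0.4567, 0.2398], rows, k=2))  # → [1, 2]
```

In practice the SQL version above is preferable for 30,000 rows, since it avoids shipping the whole table to PHP.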

Related

Distance formula for MariaDB nearest 200 places without radius

I have MariaDB, Server version 10.0.23-MariaDB, with latitude and longitude columns (FLOAT(10,6)) plus a geo_location column (GEOMETRY) that was calculated from the latitude and longitude columns.
I would like to find the nearest 200 people from a person. The person at the center has a latitude and longitude that is passed to the query. Is there a way to do that without a radius? So, if the population density is high the radius would be small. If the population density is low then the radius would be large.
There are about 4 million rows, and it needs to be as fast as possible. The rows can be filtered first based on the county that they reside. Some counties are super large with low population density and others are small counties with high population density. I need the fastest way to find the nearest 200 people.
SELECT *, ST_DISTANCE(geo_location, POINT(lon, lat)) AS distance
FROM geotable
ORDER BY distance ASC
LIMIT 200;
The bad news is that this will be very slow, because ST_DISTANCE() uses no spatial index. You should try to restrict the query with a maximum radius so it selects fewer records:
SET @dist = 100;
SET @rlon1 = lon - @dist / ABS(COS(RADIANS(lat)) * 69);
SET @rlon2 = lon + @dist / ABS(COS(RADIANS(lat)) * 69);
SET @rlat1 = lat - (@dist / 69);
SET @rlat2 = lat + (@dist / 69);
SELECT *, ST_DISTANCE(geo_location, POINT(lon, lat)) AS distance
FROM geotable
WHERE ST_WITHIN(geo_location, ENVELOPE(LINESTRING(POINT(@rlon1, @rlat1), POINT(@rlon2, @rlat2))))
ORDER BY distance ASC
LIMIT 200;
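The bounding-box arithmetic above can be sketched outside SQL as well (a rough approximation assuming ~69 miles per degree of latitude, with longitude degrees shrinking by cos(latitude); the function name and test coordinates are mine):

```python
import math

def bounding_box(lat, lon, dist_miles):
    """Approximate lat/lon box of radius dist_miles around (lat, lon).

    Mirrors the SQL above: ~69 miles per degree of latitude, and
    longitude degrees scaled by cos(latitude).
    """
    dlat = dist_miles / 69.0
    dlon = dist_miles / abs(math.cos(math.radians(lat)) * 69.0)
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

# e.g. a 100-mile box around (40.0, -75.0)
lat1, lon1, lat2, lon2 = bounding_box(40.0, -75.0, 100)
```

Note this breaks down near the poles (cos approaches zero) and across the antimeridian; for moderate latitudes it is a serviceable prefilter.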
Or if you have the POLYGON coordinates of each country, you could use that instead of a maximum radius.
6 decimal places is good enough (16cm / 0.5 ft), but FLOAT (1.7m / 5.6 ft) loses some of that precision. It is essentially never good to tack (M,N) onto FLOAT or DOUBLE; you incur 2 roundings, one of which is a waste.
There is no straightforward way to "find nearest" on the globe because there are no "2-dimensional" indexes. However, by using partitioning for one dimension and a clustered PRIMARY KEY for the other, you can do a pretty good job.
The real problem with most solutions is the large number of disk blocks that need to be hit without finding valid items. In fact, usually well over 90% of the rows touched are not needed.
All of this is 'solved' in my lat/lng blog. It will touch maybe 800 rows to get the 200 you desire, and they will be well clustered, so only a few blocks need be touched. It does not need any pre-filtering on country, but it does need some radical restructuring of the table. And if you want to distinguish two people embracing each other, I suggest a scaled INT (16mm / 5/8 in): degrees * 10000000. Also, FLOAT won't work with PARTITIONing; INT will. The code in that link uses a scaled MEDIUMINT (2.7m / 8.8 ft), but that could be changed.

Order By Two Columns - Using Highest Rating Average with Most Ratings

I would like to show ratings with the highest average (rating_avg) AND the highest number of ratings (rating_count). With my current script, it shows the highest average rating (DESC) regardless of how many ratings there are, which is useless for my visitors.
For example it shows:
Item 1 - 5.0 (1 Ratings)
Item 2 - 5.0 (2 Ratings)
When it should be showing the Top 10 Highest rated items by rating avg and amount of ratings, such as:
Item 1 - 4.5 (356 Ratings)
Item 2 - 4.3 (200 Ratings)
Item 3 - 4.0 (400 Ratings)
This is what I have right now:
$result = mysql_query("SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count FROM files WHERE console_dir = '".$nodash."' ORDER BY rating_avg DESC LIMIT 10");
Thanks and I appreciate any help in advance!
This is a subtle problem and an issue in statistics. What I do is often to downgrade the ratings by one standard error for the proportion. These aren't exactly proportions, but I think the same idea can be applied.
You can calculate this using the "square root of p*q divided by n" method. If you don't understand this, google "standard error of a proportion" (or I might suggest the third chapter in "Data Analysis Using SQL and Excel" which explains this in more detail):
SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count
FROM files cross join
(select count(*) as cnt from files where console_dir = '".$nodash."') as const
WHERE console_dir = '".$nodash."'
ORDER BY rating_avg/5 - sqrt((rating_avg/5) * (1 - rating_avg/5) / const.cnt) DESC
LIMIT 10;
In any case, see if the formula works for you.
EDIT:
Okay, let's change this to the standard error of the mean. I should have done this the first time through, but I was thinking the rating_avg was a proportion. The formula is the standard deviation divided by the square root of the sample size. We can get the population standard deviation in the const subquery:
(select count(*) as cnt, STDDEV(rating_avg) as std from files where console_dir = '".$nodash."') as const
This results in:
order by rating_avg - std / sqrt(const.cnt)
This might work, but I would rather have the standard deviation within each group rather than the overall population standard deviation. But, it derates the rating by an amount proportional to the size of the sample, which should improve your results.
By the way, the idea of removing one standard deviation is rather arbitrary. I've just found that it produces reasonable results. You might prefer to take, say, 1.96 times the standard deviation to get a 95% lower bound on the confidence interval.
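A sketch of the derating idea in plain code (illustrative only; `derated_score` is a hypothetical helper that treats a 5-star average as a proportion, as the answer describes):

```python
import math

def derated_score(rating_avg, rating_count, max_rating=5.0, z=1.0):
    """Rank score: average (as a proportion) minus z standard errors.

    z=1.0 subtracts one standard error; z=1.96 would give an
    approximate 95% lower confidence bound, as noted above.
    """
    p = rating_avg / max_rating
    se = math.sqrt(p * (1 - p) / rating_count)
    return p - z * se

# Made-up items: a high average with few ratings vs. solid averages with many.
items = [("Item A", 4.9, 2), ("Item B", 4.5, 356), ("Item C", 4.3, 200)]
ranked = sorted(items, key=lambda it: derated_score(it[1], it[2]), reverse=True)
print(ranked[0][0])  # "Item B" ranks first despite the lower raw average
```

One caveat: a perfect 5.0 average gives p = 1 and a zero standard error, so this particular formula doesn't derate it at all; the standard-error-of-the-mean variant in the edit handles that case better.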

Weighted randomness. How could I give more weight to rows that have just been added to the database?

I'm retrieving 4 random rows from a table. However, I'd like it such that more weight is given to rows that had just been inserted into the table, without penalizing older rows much.
Is there a way to do this in PHP / SQL?
SELECT *, (RAND() / id) AS o FROM your_table ORDER BY o LIMIT 4
This will order by o, where o is a random value between 0 and 1/id. The higher (newer) the id, the smaller o tends to be, so newer rows sort toward the front of the ascending order (but still in a random order).
I think an agreeable solution would be to use an asymptotic function (1/x) in combination with weighting.
The following has been tested:
SELECT *, (RAND() * 10 + (1 / (max_id - id + 1))) AS weighted_random
FROM tbl1
ORDER BY weighted_random DESC
LIMIT 4
If you want to get the max_id within the query above, just replace max_id with:
(SELECT id FROM tbl1 ORDER BY id DESC LIMIT 1)
Examples:
Let's say your max_id is 1000 ...
For each of several ids I will calculate out the value:
1/(1000 - id + 1) , which simplifies out to 1/(1001 - id):
id = 1000: 1/(1001 - 1000) = 1/1 = 1
id = 999: 1/(1001 - 999) = 1/2 = .5
id = 998: 1/(1001 - 998) = 1/3 = .333
id = 991: 1/(1001 - 991) = 1/10 = .1
id = 901: 1/(1001 - 901) = 1/100 = .01
The nature of this 1/x makes it so that only the numbers close to max have any significant weighting.
You can see a graph of + more about asymptotic functions here:
http://zonalandeducation.com/mmts/functionInstitute/rationalFunctions/oneOverX/oneOverX.html
Note that the right side of the graph with positive numbers is the only part relevant to this specific problem.
Manipulating our equation to do different things:
(Rand()*a + (1/(b*(max_id - id + 1/b))))
I have added two values, "a", and "b"... each one will do different things:
The larger "a" gets, the less influence order has on selection. It is important to have a relatively large "a", or pretty much only recent ids will be selected.
The larger "b" gets, the more quickly the asymptotic curve will decay to insignificant weighting. If you want more of the recent rows to be weighted, I would suggest experimenting with values of "b" such as: .5, .25, or .1.
The 1/b at the end of the equation offsets problems you have with smaller values of b that are less than one.
Note:
This is not a very efficient solution when you have a large number of ids (just like the other solutions presented so far), since it calculates a value for each separate id.
... ORDER BY (RAND() + 0.5 * id/maxId) DESC
This adds half of the id/maxId ratio to the random value: the newest entry gets 0.5 added (since id/maxId = 1) and the oldest almost nothing, so sorting in descending order favours newer rows.
Similarly you can also implement other weighting functions. This depends on how exactly you want to weight the values.
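The same weighting done outside SQL, as a rough sketch (the helper name and the 0.5 recency bonus mirror the ORDER BY above; this is not from the original answers):

```python
import random

def pick_weighted(ids, k=4):
    """Pick k ids at random, biased toward higher (newer) ids.

    Score = random() + 0.5 * id/max_id, mirroring the SQL ORDER BY;
    the k highest scores win.
    """
    max_id = max(ids)
    scored = sorted(ids, key=lambda i: random.random() + 0.5 * i / max_id,
                    reverse=True)
    return scored[:k]

sample = pick_weighted(list(range(1, 101)))  # 4 ids, skewed toward 100
```

Like the SQL versions, this scores every row, so it is O(n) per draw; fine for modest tables, wasteful for huge ones.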

searching and sorting through huge array of latitude and longitude

so I have an array of latitude/longitude (it's fake latitude/longitude as you can see, but just to illustrate the point & the original array size is MUCH larger than this):
<?php
$my_nodes = array(
1=> array(273078.139,353257.444),
2=> array(273122.77,352868.571),
3=> array(272963.687,353782.863),
4=> array(273949.566,353370.127),
5=> array(274006.13,352910.551),
6=> array(273877.095,353829.704),
7=> array(271961.898,353388.245),
8=> array(272839.07,354303.863),
9=> array(273869.141,354417.432),
10=> array(273207.173,351797.405),
11=> array(274817.901,353466.462),
12=> array(274862.533,352958.718),
13=> array(272034.812,351852.642),
14=> array(274128.978,354676.828),
15=> array(271950.85,354370.149),
16=> array(275087.902,353883.617),
17=> array(275545.711,352969.325));
?>
I want to be able to find the closest node (in this case a node is 1, 2, 3, 4, 5, ...) for a given coordinate pair (X, Y). I know the easiest way is a for loop that computes a distance measure (abs(X - x_i) + abs(Y - y_i)) for every node, but this becomes very inefficient as the array grows.
I was thinking of doing a binary search, but a binary search requires the array to be sorted first, and it's hard to sort latitude/longitude pairs when, in the end, we're finding the CLOSEST point in the array to a given (X, Y). What approach should I take here?
UPDATE:
Mark has a valid point, this data could be stored in a database. However, how do I get such info from the db if I want the closest one?
Have a read of this article which explains all about finding the closest point using latitude and longitude from records stored in a database, and also gives a lot of help on how to make it efficient.... with PHP code examples.
I had a similar problem when I wanted to re-sample a large number of lat/long points to create a heightfield grid. The simplest way I found was like this:
divide the lat/long space up into a regular grid
create a bucket for each grid square
go through the list, adding each point to the bucket for the grid square it falls in
then find the grid square your X,Y point falls in, and search outwards from there
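Those four steps can be sketched as follows (an illustrative implementation under my own assumptions: square cells of an arbitrary size, and the ring search continues one ring past the first hit to cover closer points in diagonal cells):

```python
from collections import defaultdict

def build_buckets(nodes, cell=500.0):
    """nodes: {id: (x, y)}. Group node ids into square grid cells."""
    buckets = defaultdict(list)
    for node_id, (x, y) in nodes.items():
        buckets[(int(x // cell), int(y // cell))].append(node_id)
    return buckets

def nearest(nodes, buckets, x, y, cell=500.0):
    """Search outward, ring of cells by ring of cells, from the query's cell."""
    gx, gy = int(x // cell), int(y // cell)
    best_id, best_d2 = None, float("inf")
    ring, stop_ring = 0, None
    while stop_ring is None or ring <= stop_ring:
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue  # visit only the cells on the current ring
                for node_id in buckets.get((gx + dx, gy + dy), ()):
                    nx, ny = nodes[node_id]
                    d2 = (nx - x) ** 2 + (ny - y) ** 2
                    if d2 < best_d2:
                        best_id, best_d2 = node_id, d2
        if best_id is not None and stop_ring is None:
            stop_ring = ring + 1  # one more ring catches closer diagonal cells
        ring += 1
        if ring > 10000:  # safety stop for empty input
            break
    return best_id

# Three nodes taken from the question's array.
nodes = {1: (273078.139, 353257.444), 2: (273122.77, 352868.571),
         7: (271961.898, 353388.245)}
buckets = build_buckets(nodes)
print(nearest(nodes, buckets, 273120.0, 352870.0))  # → 2
```

Building the buckets is O(n) once; each query then only touches a handful of cells instead of the whole array.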
I'm assuming you're storing your data in a DB table like this?
id | lat | long | data
------------------------------------------------
1 | 123.45 | 234.56 | A description of the item
2 | 111.11 | 222.22 | A description of another item
In this case you can use SQL to narrow your result set down.
If you want to find items close to grid ref 20,40, you can do the following query:
SELECT *
FROM locations
WHERE lat BETWEEN 19 AND 21
AND long BETWEEN 39 AND 41
This will return all the items in a 2x2 grid near your specified grid ref.
Several databases also provide spatial datatypes (MySQL and Postgres both do), and they might be worth investigating for this job. I don't have experience with them, however, so I'm afraid I couldn't help with those.
To sort a multidimensional array in PHP you'd have to compare elements pairwise; sorting an array of size n takes O(n log n) comparisons. A binary search in the sorted array then needs only O(log n) steps, but the up-front sort only pays off if you run many queries against the same array. If you simply iterate over all elements, calculate the distance to the target node, and remember the closest one, you are done with O(n) distance calculations, which is hard to beat for a single query. Note also that a binary search over one coordinate cannot directly answer a two-dimensional nearest-point query: a node that is close in x may be far away in y.
Depending on the distribution of your nodes, it could be even faster to sort the list into buckets. While filling the buckets you could also throw away any node that falls in a bucket which could never hold the closest node. I can explain this in more detail if you want.

how to calculate the network diameter

I have data stored in a MySQL relational database, accessed with PHP.
I have a table called "rel" which has two fields:
from_node | to_node
====================
    1     |    2
    1     |    3
    2     |    3
and so on......
How can I calculate the network diameter? I know it is the longest of the shortest paths between all pairs of nodes, but how do I calculate it?
Is there any PHP script or function which can help me to do it?
Assuming you have a connected graph (otherwise the max distance is infinite) and all your node points are numbers....
Seed a table (from, to, distance) with (from_node, to_node, 1). For each tuple, you must ensure that the value of from_node is always less than the value of to_node
CREATE TABLE hops (
from_node int not null,
to_node int not null,
distance int not null default 0,
primary key (from_node, to_node, distance)
)
-- first load:
INSERT INTO hops (from_node, to_node, distance)
SELECT from_node, to_node, 1 FROM rel;
-- iterative step
INSERT INTO hops (from_node, to_node, distance)
SELECT a.from_node, b.to_node, min(a.distance+b.distance)
FROM hops a, hops b
WHERE a.to_node = b.from_node
AND a.from_node <> b.to_node -- no loops back to the start
AND NOT EXISTS ( -- avoid duplicates
SELECT * FROM hops c
WHERE c.from_node = a.from_node
AND c.to_node = b.to_node
AND c.distance = a.distance+b.distance)
GROUP BY a.from_node, b.to_node
Execute the insert repeatedly until no rows are inserted. Then select the maximum distance to get your diameter.
EDIT: For a graph with weighted edges, you would just seed the distance field with the weights rather than using 1.
See the Wikipedia article on graph (network) terms related to distance and diameter. It mentions a few notes on how to find the diameter. This section of the article on connected components of graphs also suggests an algorithm to discover these connected components, which could be adapted very easily to tell you the diameter of the graph. (If there is more than one component, then the diameter is infinite, I believe.) The algorithm is a simple one based on a breadth-/depth-first search, so it shouldn't be much trouble to implement, and efficiency shouldn't be a great problem either.
If you're not up to writing this (though I don't think it would take that much effort), I recommend looking for a good network/graph analysis library. There's a few out there, though I'm not sure which ones you'd want to look at using PHP. (You'd probably have to use some sort of interop.)
Hope that helps, anyway.
I really think you meant that you want to find the clustering coefficient of a network. Furthermore, you'd like to do it in PHP. I don't know how many good network analysis libraries have been ported to PHP extensions.
However, if you follow the article, it should not be (too) difficult to come up with your own. You don't need to produce pretty graphs, you just have to find the coefficient.
If that's not what you meant, please update / clarify your question.
A network is a connected graph, so why don't you build a graph representation from your data and perform BFS or DFS on it? That will give you exactly what you are looking for.
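A sketch of that suggestion (BFS from every node, taking the maximum eccentricity; exact but O(V·E), which is fine for modest tables like rel):

```python
from collections import deque, defaultdict

def diameter(edges):
    """Exact diameter via one BFS per node.

    edges: iterable of (from_node, to_node) pairs, treated as undirected.
    Returns None (infinite) if the graph is disconnected.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    best = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        if len(dist) < len(graph):
            return None  # some node unreachable: diameter is infinite
        best = max(best, max(dist.values()))
    return best

print(diameter([(1, 2), (1, 3), (2, 3)]))  # → 1 (the triangle from the question)
```

The edge list could come straight from `SELECT from_node, to_node FROM rel`; the same traversal is easy to port to PHP.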
That's simple:
Prepare
Add a colum named distance
Give all nodes the distance of -1
First Iteration
Pick any node (e.g. the first)
give it the distance of 1
Now iterate while there are still nodes with distance -1:
UPDATE table SET distance = :i + 1 WHERE distance = -1 AND from_node IN (SELECT to_node FROM table WHERE distance = :i)
Second Iteration
Pick a node that has the maximum distance (any) - remember it
Set all distances back to -1
Set your remembered node to 1
Call the iteration a second time
This time the maximum distance, minus the starting value of 1, is the diameter of your graph/network. (Note: this two-sweep BFS is exact for trees; for general graphs it gives a lower bound on the diameter.)
In your example you show that each node links to every other node. If this is the case throughout your setup, then the diameter is 1.
If your setup is a line like so in a linear formation:
n1 - n2 - n3 - ... - nn
Then your diameter is (n+1)/3.
If your setup is more irregular, with a series of N nodes and K links, then your diameter is at least log N / log K.
Edit: To clarify, I'm calculating the average shortest distance between pairs of nodes.
n1 - n2 - n3
(n+1)/3 = 4/3
n1 - n2: 1 hop
n2 - n3: 1 hop
n1 - n3 (via n2): 2 hops
(1 + 1 + 2)/3 = 4/3
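The arithmetic above can be checked mechanically (a hypothetical helper for a path graph of n nodes):

```python
from itertools import combinations

def average_distance_path(n):
    """Average shortest-path distance over all node pairs in a path of n nodes."""
    dists = [abs(i - j) for i, j in combinations(range(n), 2)]
    return sum(dists) / len(dists)

# For n = 3 (n1 - n2 - n3): (1 + 1 + 2) / 3 = 4/3, matching (n + 1) / 3.
print(average_distance_path(3))
```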