I have data stored in a MySQL relational database, which I access from PHP.
I have a table called "rel" which has two fields:
from_node | to_node
=====================
1 2
1 3
2 3
and so on......
How can I calculate the diameter of a network? I know it is the longest of the shortest paths between any two nodes, but how do I calculate it?
Is there any PHP script or function which can help me do it?
Assuming you have a connected graph (otherwise the max distance is infinite) and all your node points are numbers....
Seed a table (from, to, distance) with (from_node, to_node, 1). For each tuple, you must ensure that the value of from_node is always less than the value of to_node
CREATE TABLE hops (
from_node int not null,
to_node int not null,
distance int not null default 1, -- direct links are one hop
primary key (from_node, to_node, distance)
);
-- first load:
INSERT INTO hops (from_node, to_node)
SELECT from_node, to_node FROM rel;
-- iterative step
INSERT INTO hops (from_node, to_node, distance)
SELECT a.from_node, b.to_node, min(a.distance+b.distance)
FROM hops a, hops b
WHERE a.to_node = b.from_node
AND a.from_node <> b.from_node -- not self
AND a.to_node <> b.to_node -- still not self
AND a.from_node <> b.to_node -- no loops
AND NOT EXISTS ( -- skip pairs already reachable in as few or fewer hops
SELECT * FROM hops c
WHERE c.from_node = a.from_node
AND c.to_node = b.to_node
AND c.distance <= a.distance+b.distance)
GROUP BY a.from_node, b.to_node;
Execute the insert repeatedly until no rows are inserted. Then select max distance to get your diameter.
EDIT: For a graph with weighted edges, you would just seed the distance field with the weights rather than using 1.
See the Wikipedia article on graph (network) terms related to distance and diameter. It includes a few notes on how to find the diameter. This section of the article on connected components of graphs also suggests an algorithm to discover these connected components, which could be adapted very easily to tell you the diameter of the graph. (If there is more than one component then the diameter is infinite, I believe.) The algorithm is a simple one based on a breadth-first or depth-first search, so it shouldn't be much trouble to implement, and efficiency shouldn't be a great problem either.
If you're not up to writing this (though I don't think it would take that much effort), I recommend looking for a good network/graph analysis library. There are a few out there, though I'm not sure which ones you could use from PHP. (You'd probably have to use some sort of interop.)
Hope that helps, anyway.
I really think you meant you wanted to find the clustering coefficient of a network. Furthermore, you'd like to do it in PHP. I don't know how many good network analysis libraries have been ported to PHP extensions.
However, if you follow the article, it should not be (too) difficult to come up with your own. You don't need to produce pretty graphs, you just have to find the coefficient.
If that's not what you meant, please update / clarify your question.
A network is a connected graph. So why don't you build a graph representation from your data and perform BFS or DFS on it? You will get exactly what you are looking for.
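A minimal sketch of that idea, in Python for brevity (the logic ports directly to PHP arrays). It assumes the rows of rel are undirected links and have already been fetched as (from_node, to_node) pairs; the function names are made up for illustration. It runs one BFS per node and keeps the largest hop count seen.

```python
from collections import defaultdict, deque


def bfs_distances(adjacency, source):
    """Return the hop count from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(edges):
    """Longest shortest path over all node pairs; infinite if disconnected."""
    adjacency = defaultdict(set)
    for a, b in edges:
        # Assumption: links are undirected, so store both directions.
        adjacency[a].add(b)
        adjacency[b].add(a)
    nodes = list(adjacency)
    longest = 0
    for source in nodes:
        dist = bfs_distances(adjacency, source)
        if len(dist) < len(nodes):
            return float("inf")  # some node cannot be reached
        longest = max(longest, max(dist.values()))
    return longest
```

With the three rows from the question, diameter([(1, 2), (1, 3), (2, 3)]) gives 1. The cost is one BFS per node, which is fine for small and medium graphs.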
That's simple:
Prepare
Add a column named distance
Give all nodes the distance of -1
First Iteration
Pick any node (e.g. the first)
Give it the distance of 1
Now iterate, incrementing :i from 1, until no nodes with distance -1 remain
UPDATE table SET distance=:i+1 WHERE distance=-1 AND from_node IN (SELECT to_node FROM table WHERE distance=:i)
Second Iteration
Pick a node that has the maximum distance (any) - remember it
Set all distances back to -1
Set your remembered node to 1
Call the iteration a second time
This time the maximum distance, minus 1 because the start node was given distance 1, is the diameter if your graph/network is a tree. For a graph with cycles it is only a lower bound on the diameter.
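The two passes above can be sketched in memory as follows (Python, assuming undirected links given as pairs; the names are hypothetical). Note that this "double sweep" is exact on trees but in general only yields a lower bound on the diameter.

```python
from collections import defaultdict, deque


def farthest(adjacency, source):
    """BFS from source; return (a farthest node, its hop count)."""
    dist = {source: 0}
    queue = deque([source])
    last = source
    while queue:
        node = queue.popleft()
        last = node  # BFS dequeues nodes in non-decreasing distance order
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return last, dist[last]


def double_sweep(edges):
    """Two BFS passes: exact diameter on a tree, a lower bound otherwise."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(adjacency))              # "pick any node"
    far_node, _ = farthest(adjacency, start)   # first iteration
    _, hops = farthest(adjacency, far_node)    # second iteration
    return hops
```

Here the start node gets distance 0 rather than 1, so the returned value is already a hop count and no subtraction is needed.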
In your example you show that each node links to every other node. If this is the case throughout your setup, then the diameter is 1.
If your setup is a line like so in a linear formation:
n=1, n = 2, n = 3, ... n
Then the average shortest distance is (n+1)/3. (The diameter in the strict sense, the longest shortest path, is n-1 for such a line.)
If your setup is more irregular, with N nodes and K links, then your diameter is at least logN/LogK
Edit: To clarify, I'm calculating the average shortest distance between pairs of nodes.
n1 - n2 - n3
(n+1)/3 = 4/3
n1-n2 = 1 hop
n2 - n3 = 1 hop
n1- n2 - n3 = 2 hops
(1+1+2)/3 = 4/3
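The (n+1)/3 figure for a line can be checked by brute force. This small Python sketch (the function name is made up) averages the hop counts over all pairs of nodes on a line of n nodes, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations


def average_hops_on_line(n):
    """Average shortest distance over all node pairs of a line n1 - n2 - ... - nn."""
    pairs = list(combinations(range(1, n + 1), 2))
    # On a line, the shortest path between nodes a < b is simply b - a hops.
    total = sum(b - a for a, b in pairs)
    return Fraction(total, len(pairs))
```

For n = 3 this returns 4/3, matching the hand calculation above.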
I have a group of users. The user count could be 50 or could be 2000. Each should have a long/lat that I have retrieved from Google Geo api.
I need to query them all, and group them by proximity and a certain count. Say the count is 12 and I have 120 users in the group. I want to group people by how close they are (long/lat) to other people. So that I wind up with 10 groups of people who are close in proximity.
I currently have the google geo coding api setup and would prefer to use that.
TIA.
-- Update
I have been googling about this for awhile and it appears that I am looking for a spatial query that returns groups by proximity.
Keep in mind that the cost of this problem grows quadratically with the number of users, as the number of distance calculations is linked to the square of the number of users. It's actually N*(N-1) distances, so a 2000 user base would mean almost 4 million distance calculations on every pass (half that if you exploit symmetry). Just keep that in mind when sizing the resources you need.
Are you looking to group them based on straight-line (actually great circle) distance or based on walking/driving distance?
If the former, the great circle distance can be approximated with simple math if you're able to tolerate a small margin of error and wish to assume the earth is a sphere. From GCMAP.com:
Earth's hypothetical shape is called the geoid and is approximated by
an ellipsoid or an oblate spheroid. A simpler model is to use a
sphere, which is pretty close and makes the math MUCH easier. Assuming
a sphere of radius 6371.2 km, convert longitude and latitude to
radians (multiply by pi/180) and then use the following formula:
theta = lon2 - lon1
dist = acos(sin(lat1) × sin(lat2) + cos(lat1) × cos(lat2) × cos(theta))
if (dist < 0) dist = dist + pi
dist = dist × 6371.2
The resulting distance is in kilometers.
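As a rough illustration, the quoted formula translates to the Python function below (the name great_circle_km is made up). The "dist < 0" correction is left out because math.acos already returns a value between 0 and pi; a clamp is added instead, since rounding can push the cosine slightly outside [-1, 1].

```python
import math

EARTH_RADIUS_KM = 6371.2  # spherical-earth radius used in the quoted text


def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines; inputs are in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    theta = lon2 - lon1
    cos_angle = (math.sin(lat1) * math.sin(lat2)
                 + math.cos(lat1) * math.cos(lat2) * math.cos(theta))
    # Guard against tiny floating-point overshoot before calling acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM
```

A quarter of the equator, great_circle_km(0, 0, 0, 90), comes out at pi/2 times the radius, roughly 10008 km.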
Now, if you need precise calculations and are willing to spend the CPU cycles needed for much more complex math, you can use Vincenty's formulae, which use the WGS-84 reference ellipsoid model of the earth that is used for navigation, mapping and whatnot. More info HERE
As to the algorithm itself, you need to build a to-from matrix with the result of each calculation. Each row and column would represent each node. Two simplifications you may consider:
Distance does not depend on direction of travel, so $dist[n][m] == $dist[m][n] (no need to calculate the whole matrix, just half of it)
Distance from a node to itself is always 0, so there is no need to calculate it. But since you're intending to group by proximity, to avoid a user being grouped with itself, you may want to always force $dist[m][m] to an arbitrarily defined and abnormally large constant ($dist[m][m] = 22000 (miles), for instance; this will work as long as all your users are on the planet).
After making all the calculations, use an array sorting method to find the X closest nodes to each node and there you have it
(you may or may not want to prevent a user being grouped on more than one group, but that's just business logic)
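One possible reading of those steps, as a Python sketch: take any unassigned user, attach the closest unassigned users until the group is full, and repeat. This is a greedy heuristic, not an optimal clustering, and it assumes each user ends up in exactly one group. The function name and the distance parameter are made up; math.dist (Python 3.8+) treats coordinates as planar, and a great-circle function can be passed instead.

```python
import math


def group_by_proximity(points, group_size, distance=math.dist):
    """Greedily split users into groups of up to group_size nearby members.

    points maps a user id to a coordinate pair.
    """
    unassigned = dict(points)
    groups = []
    while unassigned:
        seed_id = next(iter(unassigned))
        seed = unassigned.pop(seed_id)
        # Sort the remaining users by distance to the seed, keep the closest.
        nearest = sorted(
            unassigned, key=lambda uid: distance(seed, unassigned[uid])
        )[: group_size - 1]
        for uid in nearest:
            del unassigned[uid]
        groups.append([seed_id] + nearest)
    return groups
```

With 120 users and group_size=12 this yields 10 groups. Groups formed late tend to be more spread out, because they are built from whoever is left over.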
Actual code would be a little too much to provide at this time without seeing some of your progress first, but this is basically what you need to do algorithmically.
... it appears that I am looking for a spatial query that returns groups by proximity. ...
You could use hdbscan. Your groups are actually clusters in hdbscan wording. You would need to work with min_cluster_size and min_samples to get your groups right.
https://hdbscan.readthedocs.io/en/latest/parameter_selection.html
https://hdbscan.readthedocs.io/en/latest/
It appears that hdbscan runs under Python.
Here are two links on how to call Python from PHP:
Calling Python in PHP,
Running a Python script from PHP
Here is some more information on which clustering algorithm to choose:
http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb
http://scikit-learn.org/stable/modules/clustering.html#clustering
Use the Geohash algorithm [1]. There is a PHP implementation [2]. You may pre-calculate geohashes with different precisions, store them in the SQL database alongside the lat-lon values, and query using a native GROUP BY.
[1] https://en.wikipedia.org/wiki/Geohash
[2] https://github.com/lvht/geohash
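To show what the pre-calculated value looks like, here is a small self-contained Python sketch of standard Geohash encoding, plus the in-memory equivalent of the GROUP BY. It does not use the PHP library linked above, and the function names are made up for illustration.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet


def geohash_encode(lat, lon, precision=9):
    """Interleave longitude/latitude bisection bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = bits * 2 + int(bit)
        bit_count += 1
        use_lon = not use_lon
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def group_by_cell(users, precision):
    """Group user ids by shared geohash, like GROUP BY on a stored hash column."""
    cells = defaultdict(list)
    for user_id, (lat, lon) in users.items():
        cells[geohash_encode(lat, lon, precision)].append(user_id)
    return dict(cells)
```

A shorter precision means larger cells and therefore bigger groups. One caveat: two users who are close together can still fall on opposite sides of a cell boundary and end up in different groups.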
With all of the daily fantasy games out there, I am looking to see if I can easily implement a platform that will help identify the optimal lineup for a fantasy league based on a salary cap and projected points for each player.
Given a pool of ~500 players, you need to find the highest scoring lineup within the maximum salary cap constraints.
1 Quarter Back
2 Running Back
3 Wide Receiver
1 Tight End
1 Kicker
1 Defense
Each player is assigned a salary (that changes weekly) and I will assign projected points for those players. I have this information in a MySQL DB and would prefer to use PHP/Pear or JQuery if that's the best option for calculating this.
The Table looks something like this
player_id name position salary ranking projected_points
1 Joe Smith QB 1000 2 21.7
2 Jake Plummer QB 2500 6 11.9
I've tried sorting by projected points and filling in the roster. That obviously provides the highest scoring team, but it also exceeds the salary cap. I cannot think of a way to have it intelligently remove players and continue to loop through and find the highest scoring lineup based on the salary constraints.
So, is there any PHP or Pear class that you know of that will help "solve" this type of problem? Any articles you can point me to for reference? I'm not asking for someone to do this, but I've been Googling for a while and the best solution I currently have is this: http://office.microsoft.com/en-us/excel-help/pick-your-fantasy-football-team-with-solver-HA001124603.aspx and that's using Excel and limited to 200 objects.
I'll suggest two approaches to this problem.
The first is dynamic programming. For brute force, we could initialize a list containing the empty partial team, then, for each successive player, for each partial team currently in the list, add a copy of that partial team with the new player, assuming that this new partial team respects the positional and budget constraints. This is an exponential-time algorithm, but we can reduce the running time by quite a lot (to O(#partial position breakdowns * budget * #players), assuming that all monetary values are integer) if we throw away all but the best possibility so far for each combination of partial position breakdown and budget.
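A rough Python sketch of that pruned search, assuming integer salaries and players given as (name, position, salary, projected_points) tuples; best_lineup and slots are names made up for this example. For each combination of position breakdown and money spent, only the best-scoring partial team is kept.

```python
def best_lineup(players, slots, budget):
    """Return (points, names) of the best full lineup, or None if none fits.

    slots maps a position to the number of players required there.
    """
    positions = sorted(slots)
    empty = tuple(0 for _ in positions)
    # State: (players taken per position, salary spent) -> best (points, names)
    best = {(empty, 0): (0.0, ())}
    for name, position, salary, points in players:
        if position not in slots:
            continue
        idx = positions.index(position)
        additions = {}
        for (counts, spent), (total, names) in best.items():
            if counts[idx] >= slots[position] or spent + salary > budget:
                continue  # position already full, or over the cap
            new_counts = counts[:idx] + (counts[idx] + 1,) + counts[idx + 1:]
            key = (new_counts, spent + salary)
            candidate = (total + points, names + (name,))
            incumbent = best.get(key)
            if incumbent is None or candidate[0] > incumbent[0]:
                additions[key] = candidate
        best.update(additions)  # applied after the loop: each player used once
    full = tuple(slots[p] for p in positions)
    complete = [entry for (counts, _), entry in best.items() if counts == full]
    return max(complete, default=None)
```

For the roster in the question, slots would be {"QB": 1, "RB": 2, "WR": 3, "TE": 1, "K": 1, "DEF": 1}. The number of states grows with the budget, so dividing all salaries by a common unit (say 100) before running this keeps the table small.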
The second is to find an integer programming library callable from PHP that works like Excel's solver. It looks like (e.g.) lpsolve comes with a PHP interface. Then we can formulate an integer program like so.
maximize sum_{player p} value_p x_p
subject to
sum_{quarterback player p} x_p <= 1
sum_{running back player p} x_p <= 2
...
sum_{defense player p} x_p <= 1
sum_{player p} cost_p x_p <= budget
for each player p, x_p in {0, 1} (i.e., x_p is binary)
I have a MySQL table with thousands of data points stored in 3 columns R, G, B. how can I find which data point is closest to a given point (a,b,c) using Euclidean distance?
I'm saving RGB values of colors separately in a table, so the values are limited to 0-255 in each column. What I'm trying to do is find the closest color match by finding the color with the smallest euclidean distance.
I could obviously run through every point in the table to calculate the distance but that wouldn't be efficient enough to scale. Any ideas?
I think the above comments are all true, but they are - in my humble opinion - not answering the original question. (Correct me if I'm wrong). So, let me here add my 50 cents:
You are asking for a select statement. Given that your table is called 'colors', that your columns are called r, g and b and are integers in the range 0..255, and that you are looking for the value in your table closest to a given value, let's say rr, gg, bb, I would try the following:
select min(sqrt((rr-r)*(rr-r)+(gg-g)*(gg-g)+(bb-b)*(bb-b))) from colors;
Now, this answer is given with a lot of caveats, as I am not sure I got your question right, so please confirm if it's right, or correct me so that I can be of assistance.
Since you're looking for the minimum distance and not exact distance you can skip the square root. I think Squared Euclidean Distance applies here.
You've said the values are bounded between 0 and 255, so each per-channel difference lies between -255 and 255 and you can make an indexed lookup table of the 511 possible differences and their squares.
Here is what I'm thinking in terms of SQL. r0, g0, and b0 represent the target color. The table Vector would hold the squared values from the lookup table described above (columns point and dist). This solution would visit all the records but the result set can be reduced to 1 by sorting on squared_dist and selecting only the first row.
select
c.r, c.g, c.b,
mR.dist + mG.dist + mB.dist as squared_dist
from
colors c,
vector mR,
vector mG,
vector mB
where
c.r-r0 = mR.point and
c.g-g0 = mG.point and
c.b-b0 = mB.point
group by
c.r, c.g, c.b
The first level of optimization that I see you can do would be square the distance to which you want to limit the query so that you don't need to perform the square root for each row.
The second level of optimization I would encourage would be some preprocessing to alleviate the need for extraneous squaring for each query (which could possibly create some extra run time for large tables of RGB's). You'd have to do some benchmarking to see, but by substituting in values for a, b, c, and d and then performing the query, you could alleviate some stress from MySQL.
Note that the performance difference between the last two lines may be negligible. You'll have to use test queries on your system to determine which is faster.
I just re-read and noticed that you are ordering by distance. In that case, the d should be removed and everything should be moved to one side. You can still plug in the constants to prevent extra processing on MySQL's end.
I believe there are two options.
You have to either, as you say, iterate across the entire set, comparing each distance against a running minimum that you initially set to an impossibly high number (any squared distance here is at most 3*255*255 = 195075). This runs in linear time, O(n), since you're only comparing one point to every point in the set.
I'm still thinking of another option... something along the lines of doing a breadth first search away from the input point until a point is found in the set at the searched point, but this requires a bit more thought (I imagine the 3D space would have to be pretty heavily populated for this to be more efficient on average though).
If you run through every point and calculate the distance, don't use the square root function, it isn't necessary. The smallest sum of squares will be enough.
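For reference, the linear scan without the square root is only a few lines. This Python sketch assumes the rows have already been fetched as (r, g, b) tuples; the function name is made up.

```python
def nearest_color(colors, target):
    """Return the (r, g, b) tuple with the smallest squared Euclidean distance."""
    tr, tg, tb = target
    # The square root is monotonic, so comparing squared distances
    # picks the same winner as comparing true distances.
    return min(
        colors,
        key=lambda c: (c[0] - tr) ** 2 + (c[1] - tg) ** 2 + (c[2] - tb) ** 2,
    )
```

For thousands of rows this is typically fast enough. It stays O(n) per lookup, though, so a spatial index only becomes worth considering at much larger scales.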
This is the problem you are trying to solve. (Planar case, select all points sorted by a x, y, or z axis. Then use PHP to process them)
MySQL also has a Spatial Database which may have this as a function. I'm not positive though.
so I have an array of latitude/longitude (it's fake latitude/longitude as you can see, but just to illustrate the point & the original array size is MUCH larger than this):
<?php
$my_nodes = array(
1=> array(273078.139,353257.444),
2=> array(273122.77,352868.571),
3=> array(272963.687,353782.863),
4=> array(273949.566,353370.127),
5=> array(274006.13,352910.551),
6=> array(273877.095,353829.704),
7=> array(271961.898,353388.245),
8=> array(272839.07,354303.863),
9=> array(273869.141,354417.432),
10=> array(273207.173,351797.405),
11=> array(274817.901,353466.462),
12=> array(274862.533,352958.718),
13=> array(272034.812,351852.642),
14=> array(274128.978,354676.828),
15=> array(271950.85,354370.149),
16=> array(275087.902,353883.617),
17=> array(275545.711,352969.325));
?>
I want to be able to find the closest node (in this case a node is one of 1, 2, 3, 4, 5, ...) for a given latitude X and longitude Y. I know the easiest way to do this is to do a for loop and then compute a margin-of-error difference (abs(latitude_X - latitude_X_array) + abs(latitude_Y - latitude_Y_array)), but this will be very inefficient as the size of the array grows.
I was thinking of doing a binary search, but that requires the array to be sorted first, and it's hard to sort latitude/longitude pairs when what we ultimately want is the CLOSEST latitude/longitude in the array for a given lat X, long Y. What approach should I take here?
UPDATE:
Mark has a valid point, this data could be stored in a database. However, how do I get such info from the db if I want the closest one?
Have a read of this article which explains all about finding the closest point using latitude and longitude from records stored in a database, and also gives a lot of help on how to make it efficient.... with PHP code examples.
I had a similar problem when I wanted to re-sample a large number of lat/long points to create a heightfield grid. The simplest way I found was like this:
divide the lat/long space up into a regular grid
create a bucket for each grid square
go through the list, adding each point to the bucket for grid square it falls in
then find the grid square your X,Y point falls in, and search outwards from there
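The bucket approach above might look like this in Python (a sketch under assumptions: planar coordinates like the ones in the question, a non-empty node list, and a hand-picked cell size; the function names are made up). The search walks outward ring by ring and stops once no farther ring can contain a closer point.

```python
import math
from collections import defaultdict


def build_grid(nodes, cell):
    """Bucket node ids by the grid square their coordinates fall in."""
    grid = defaultdict(list)
    for node_id, (x, y) in nodes.items():
        grid[(math.floor(x / cell), math.floor(y / cell))].append(node_id)
    return grid


def nearest_node(nodes, grid, cell, x, y):
    """Search the query's square, then successive rings of squares around it."""
    qx, qy = math.floor(x / cell), math.floor(y / cell)
    max_ring = max(max(abs(cx - qx), abs(cy - qy)) for cx, cy in grid)
    best_id, best_d = None, math.inf
    for ring in range(max_ring + 1):
        # Every point in this ring is at least (ring - 1) * cell away.
        if best_d <= (ring - 1) * cell:
            break
        for cx in range(qx - ring, qx + ring + 1):
            for cy in range(qy - ring, qy + ring + 1):
                if max(abs(cx - qx), abs(cy - qy)) != ring:
                    continue  # interior squares were handled by earlier rings
                for node_id in grid.get((cx, cy), ()):
                    d = math.dist(nodes[node_id], (x, y))
                    if d < best_d:
                        best_id, best_d = node_id, d
    return best_id
```

The grid is built once and reused for every lookup. A cell size close to the typical spacing between nodes keeps most searches to one or two rings.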
I'm assuming you're storing your data in a DB table like this?
id | lat | long | data
------------------------------------------------
1 | 123.45 | 234.56 | A description of the item
2 | 111.11 | 222.22 | A description of another item
In this case you can use SQL to narrow your result set down.
If you want to find items close to grid ref 20,40, you can do the following query (long is a reserved word in MySQL, so it is quoted with backticks):
SELECT *
FROM locations
WHERE lat BETWEEN 19 AND 21
AND `long` BETWEEN 39 AND 41
This will return all the items in a 2x2 grid around your specified grid ref.
Several databases also provide spatial datatypes (MySQL and Postgres both do) and they might be worth investigating for this job. I don't, however, have experience with such things so I'm afraid I couldn't help with those.
To sort a multidimensional array in PHP you'd have to iterate over all elements and compare two at a time. For an array of size n, a comparison sort needs O(n log n) comparisons. Finding a position in the sorted array then needs only O(log n) steps, although sorting by latitude alone (or longitude alone) does not guarantee that the neighbouring entry is the closest node in two dimensions.
If you iterate over all elements, calculate the distance to the target node and remember the closest element, you'd be done with O(n) distance calculations.
So for a single lookup the naive search, at O(n), beats sorting followed by a binary search, at O(n log n + log n). Sorting only pays off when the array is sorted once and then queried many times.
Depending on the distribution of your nodes it could be even faster to sort the list into buckets. During filling the buckets you could also throw away all nodes that would go in a bucket that could never hold the closest node. I can explain this in more detail if you want.
Not sure of the best way to go about this?
I want to create a tournament bracket of 2,4,8,16,32, etc teams.
The winner of the first two will play winner of the next 2 etc.
All the way until there is a winner.
Like this
Can anyone help me?
OK so more information.
Initially I want to come up with a way to create the tournament with the 2,4,8,16,etc.
Then when I have all the users in place, if they are 16 players, there are 8 fixtures.
At this point I will send the fixture to the database.
When all the players that won are through to the next round, I would want another SQL query for each pair of winners that meet.
Can you understand what I mean?
I did something like this a few years ago. This was quite a while ago and I'm not sure I'd do it the same way (it doesn't really scale to double-elimination or the like). How you output it might be a different question. I resorted to tables as it was in 2002-2003. There are certainly better techniques today.
The number of rounds in the tournament is log2(players) + 1, as long as players is one of the numbers you specified above. Using this information you can calculate how many rounds there are. The last round contains the final winner.
I stored the player information something like this (tweak this for best practices)
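As a quick check of that count, here is a small Python sketch (the function names are made up). It follows the convention above, where the final "round" is just the slot holding the overall winner, so 16 players give 5 rounds but only 4 rounds of actual matches.

```python
def round_count(players):
    """log2(players) + 1, valid only when players is a power of two."""
    if players < 2 or players & (players - 1):
        raise ValueError("players must be a power of two (2, 4, 8, 16, ...)")
    # For a power of two, bit_length() equals log2(players) + 1.
    return players.bit_length()


def matches_per_round(players):
    """Number of fixtures in each round that is actually played."""
    return [players >> r for r in range(1, round_count(players))]
```

For 16 players, matches_per_round(16) is [8, 4, 2, 1], which matches the 8 first-round fixtures mentioned in the question.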
Tournament
Name
Size
Players
Tournament
Name
Position (0 to tournament.size - 1)
Rounds
Tournament
Round
Position (max halves for each round)
Winner (player position)
Note in all my queries below, I don't include the "Tournament = [tournament]" to identify the tournament. They all need it.
It's rather simple to query this with one query and to split it out as needed for the different rounds. You could do something like this to get the next opponent (assuming there is one). For round 1, you'd simply need to get the next/previous player based on if it was even or odd:
SELECT * FROM Players WHERE Position = PlayerPosition + 1
SELECT * FROM Players WHERE Position = PlayerPosition - 1
For the next round, if the user's last Round.Position was even, you'll need to make sure that the next position up has a winner:
SELECT Player FROM Rounds WHERE Position = [playerRoundPosition] - 1
If not, the next player isn't decided, or there's a gap (don't allow gaps!)
If the user's last Round.Position was odd, you'll need to make sure there's a user below them AND that there's a winner below them, otherwise they should automatically be promoted to the next round (as there is no one to play)
SELECT COUNT(*) FROM Players WHERE Position > [Player.Position]
SELECT Player FROM Rounds WHERE Position = [playerRoundPosition] + 1
On a final note, I'm pretty sure you could use something like the following to reduce the queries you write by using something like:
SELECT Player FROM Rounds WHERE Position + Position % 2 = [playerRoundPosition]
SELECT Player FROM Rounds WHERE Position - Position % 2 = [playerRoundPosition]
Update:
Looking over my original post, I find that the Rounds table was a little ambiguous. In reality, it should be named Matches. A match is a competition between two players with a winner. The final table should look more like this (only the name changed):
Matches
Tournament
Round
Position (max halves for each round)
Winner (player position)
Hopefully that makes it a bit more clear. When the two players go up against each other (in a match), you store that information in this Matches table. This particular implementation depends on the position of the Match to know which players participated.
I started numbering the rounds at 1 because that was clearer in my implementation. You may choose 0 (or even do something completely different, like going backwards) if you prefer.
In the first round, match 1 means players 1 and 2 participated. In match 2, players 3 and 4 participated. Essentially, in the first round the players at position and position + 1 face each other. You could also store this information in the rounds table if you need more access to it. Every time I used this data in the program, I needed all the round and player information anyway.
After the first round, you look at the previous round of matches. In round 2, match 1, the winners from matches 1 and 2 participate. In round 2, match 2, the winners from matches 3 and 4 participate. It should look pretty familiar, except that it uses the match table after round 1. I'm sure there's a more efficient way to do this repetitive task, I just never got enough time to refactor that code (it was refactored, just not that much).
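The position-based pairing described here can be sketched in memory like this (Python; the function names and the pick_winner callback are made up, and the number of players is assumed to be a power of two). Adjacent positions meet, and the winners keep their relative order for the next round.

```python
def pair_up(entrants):
    """Positions 1 and 2 meet, 3 and 4 meet, and so on."""
    return [(entrants[i], entrants[i + 1]) for i in range(0, len(entrants), 2)]


def run_bracket(players, pick_winner):
    """Play every round; return (list of rounds, overall winner).

    pick_winner(a, b) stands in for however a match result is recorded.
    """
    rounds = []
    entrants = list(players)
    while len(entrants) > 1:
        matches = pair_up(entrants)
        rounds.append(matches)
        # Winners stay in match order, so match 1 and match 2 winners meet next.
        entrants = [pick_winner(a, b) for a, b in matches]
    return rounds, entrants[0]
```

In a database-backed version, each inner list would correspond to the rows of one round in the Matches table, written once the previous round's winners are known.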
Use arrays and remove the losing teams from the main array. (But keep 'em on a separate array, for reference and reuse purposes).