php - Remove duplicates from current array while looping through the same array?

I'm playing with cURL to crawl pages and extract links. This is some code that illustrates my issue:
for ($i = 0; $i < sizeof($links); $i++) {
    $response = crawl($links[$i]);
    // inside this loop I extract links from each crawled HTML page
    $newLinks = getLinks($response);
    // I need to append these new links to the current array in the loop
    $links = array_values(array_unique(array_merge($links, $newLinks)));
}
I need to prevent duplicate links so I don't crawl anything twice. I wonder if this is a safe approach, or even correct at all, since array_values() reindexes the elements of the array, and while in the loop the crawl could run twice for some link.
I could test with in_array() against $links and $newLinks to avoid duplicates, but I wonder what happens when doing it like my sample here.

This example works if getLinks() returns only one link:
$newLink = getLinks($response);
if (!in_array($newLink, $links)) {
    $links = array_merge($links, array($newLink));
}
If you get more than one link, you can use a foreach over the links:
foreach ($newLinks as $newLink) {
    if (!in_array($newLink, $links)) {
        $links = array_merge($links, array($newLink));
    }
}

TL/DR: Use array_merge($links, array_diff($newLinks, $links));
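Applied to the loop from the question, that looks like this (a sketch, assuming crawl() and getLinks() behave as in the question):
// Only links not already present are appended, so nothing is crawled twice,
// and array_merge() keeps the numeric keys sequential for the for loop.
for ($i = 0; $i < count($links); $i++) {
    $response = crawl($links[$i]);
    $newLinks = getLinks($response);
    $links = array_merge($links, array_diff($newLinks, $links));
}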
My justification:
Welcome to the land of quadratic growth. As your collected links array grows, the time it takes will go through the roof. Give array_merge($links, array_diff($new_links, $links)) a chance instead. Using this benchmarking code:
function get($num)
{
    $links = array();
    for ($i = 0; $i < rand(5, 20); $i++) {
        $links[] = rand(1, $num);
    }
    return $links;
}
function test($iter, $num_total_links)
{
    $unique_time = 0;
    $unique_links = array();
    $diff_time = 0;
    $diff_links = array();
    for ($i = 0; $i < $iter; $i++) {
        $new_links = get($num_total_links);

        $start = microtime(true);
        $unique_links = array_values(array_unique(array_merge($unique_links, $new_links)));
        $end = microtime(true);
        $unique_time += $end - $start;

        $start = microtime(true);
        $diff_links = array_values(array_merge($diff_links, array_diff($new_links, $diff_links)));
        $end = microtime(true);
        $diff_time += $end - $start;
    }
    echo $unique_time . ' - ' . $diff_time;
}
You can tweak the values; if you expect to surf a large number of pages with relatively few links in common, pick a large (not too large or it'll take forever) $iter and $num_total_links. If you're likely to see many of the same link, reduce the $num_total_links accordingly.
What it boils down to is that a merge and then unique operation always requires you to merge in the same number of links, while a diff and then merge only requires you to merge in the links you want to add. This is nearly always more efficient, even at numbers that aren't exactly huge; surfing 500 pages with between 5 and 20 links makes a huge difference in time, and even a small number of pages can show a marked difference.
Looking at the data, foreach isn't a bad way to go as compared with array_unique; doing a similar benchmark, foreach consistently beat array_unique. But both are O(n^2), with foreach just growing more slowly. array_diff preserves O(n) time, which is your best case; your algorithm is never going to get any faster than some multiple of the number of pages you visit. PHP's built-in array functions are going to be faster than pretty much any solution written in PHP.
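As an aside (not part of the benchmark above): if you are free to restructure, using the URLs as array keys turns each membership test into a single O(1) hash lookup, since PHP arrays are hash tables underneath:
// Sketch: $seen acts as a set; isset() on a key is a constant-time lookup.
$seen = array();
$queue = array();
foreach ($newLinks as $link) {
    if (!isset($seen[$link])) {
        $seen[$link] = true;
        $queue[] = $link; // each URL enters the crawl queue exactly once
    }
}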
Here is the data, with the random factors taken out because all they did was cause deviations, not affect the results. I've also ramped up the possible "universe" of URLs to a large number for illustration purposes; at a smaller number, both array_unique and foreach produce graphs which look almost linear, albeit that they continue to be outperformed by array_diff. If you plug this into Google Charts, you can see just what I'm talking about.
['Number of Pages', 'array_unique', 'array_diff', 'foreach'],
[1, 6.9856643676758E-5, 6.1988830566406E-5, 0.00012898445129395],
[11, 0.0028481483459473, 0.00087666511535645, 0.0014169216156006],
[21, 0.0091345310211182, 0.0017409324645996, 0.0029785633087158],
[31, 0.019546031951904, 0.0023491382598877, 0.0046005249023438],
[41, 0.036402702331543, 0.0032360553741455, 0.006026029586792],
[51, 0.055278301239014, 0.0039372444152832, 0.0078754425048828],
[61, 0.082642316818237, 0.0048537254333496, 0.010209321975708],
[71, 0.11405396461487, 0.0054631233215332, 0.012364625930786],
[81, 0.15123820304871, 0.0062053203582764, 0.014509916305542],
[91, 0.19236493110657, 0.007127046585083, 0.017033576965332],
[101, 0.24052715301514, 0.0080602169036865, 0.01974892616272],
[111, 0.29827189445496, 0.0085773468017578, 0.023083209991455],
[121, 0.35718178749084, 0.0094895362854004, 0.025837421417236],
[131, 0.42515468597412, 0.010404586791992, 0.029412984848022],
[141, 0.49908661842346, 0.011186361312866, 0.033211469650269],
[151, 0.56992983818054, 0.011844635009766, 0.036608695983887],
[161, 0.65314698219299, 0.012562274932861, 0.039996147155762],
[171, 0.74602556228638, 0.013403177261353, 0.04484486579895],
[181, 0.84450364112854, 0.014075994491577, 0.04839038848877],
[191, 0.94431185722351, 0.01488733291626, 0.052026748657227],
[201, 1.0460951328278, 0.015958786010742, 0.056291818618774],
[211, 1.2530679702759, 0.016806602478027, 0.060890197753906],
[221, 1.2901678085327, 0.017560005187988, 0.065101146697998],
[231, 1.4267380237579, 0.018605709075928, 0.070043087005615],
[241, 1.5581474304199, 0.018914222717285, 0.075717210769653],
[251, 1.8255474567413, 0.020106792449951, 0.08226203918457],
[261, 1.8533885478973, 0.020873308181763, 0.085562705993652],
[271, 1.999392747879, 0.021762609481812, 0.15557670593262],
[281, 2.1670596599579, 0.022242784500122, 0.098419427871704],
[291, 2.4296963214874, 0.023237705230713, 0.10490798950195],
[301, 3.0475504398346, 0.031109094619751, 0.13519287109375],
[311, 3.0027780532837, 0.02937388420105, 0.13496232032776],
[321, 2.9123396873474, 0.025942325592041, 0.12607669830322],
[331, 3.0720682144165, 0.026587963104248, 0.13313007354736],
[341, 3.3559355735779, 0.028125047683716, 0.14407730102539],
[351, 3.5787575244904, 0.031508207321167, 0.15093517303467],
[361, 3.6996841430664, 0.028955698013306, 0.15273785591125],
[371, 3.9983749389648, 0.02990198135376, 0.16448092460632],
[381, 4.1213915348053, 0.030521154403687, 0.16835069656372],
[391, 4.3574469089508, 0.031461238861084, 0.17818260192871],
[401, 4.7959914207458, 0.032914161682129, 0.19097280502319],
[411, 4.9738960266113, 0.033754825592041, 0.19744348526001],
[421, 5.3298072814941, 0.035082101821899, 0.2117555141449],
[431, 5.5753719806671, 0.035769462585449, 0.21576929092407],
[441, 5.7648482322693, 0.035907506942749, 0.2213134765625],
[451, 5.9595069885254, 0.036591529846191, 0.2318480014801],
[461, 6.4193341732025, 0.037969827651978, 0.24672293663025],
[471, 6.7780020236969, 0.039541244506836, 0.25563311576843],
[481, 7.0454154014587, 0.039729595184326, 0.26160192489624],
[491, 7.450076341629, 0.040610551834106, 0.27283143997192]

Related

Solution for calling a function doing lots of stuff in it by Cron?

function cronProcess() {
    # > 100,000 users
    $users = $this->UserModel->getUsers();
    foreach ($users as $user) {
        # Do lots of database Insert/Update/Delete, HTTP request stuff
    }
}
The problem happens when the number of users reaches ~100,000.
I call the function via cURL from a crontab entry.
So what is the best solution for this?
I do a lot of bulk tasks in CakePHP, some processing millions of records. It's certainly possible to do; the key, as others suggested, is small batches in a loop.
If this is something you're calling from Cron, it's probably easier to use a Shell (< v3.5) or the newer Command class (v3.6+) than cURL.
Here's generally how I paginate large batches, including some helpful optional things like a progress bar, turning off hydration to speed things up slightly, and showing how many users/second the script was able to process:
<?php
namespace App\Command;

use Cake\Console\Arguments;
use Cake\Console\Command;
use Cake\Console\ConsoleIo;

class UsersCommand extends Command
{
    public function execute(Arguments $args, ConsoleIo $io)
    {
        // I'd guess a Finder would be a more Cake-y way of getting users than a custom "getUsers" function:
        // See https://book.cakephp.org/3.0/en/orm/retrieving-data-and-resultsets.html#custom-finder-methods
        $usersQuery = $this->UserModel->find('users');

        // Get a total so we know how many we're going to have to process (optional)
        $total = $usersQuery->count();
        if ($total === 0) {
            $this->abort("No users found, stopping..");
        }

        // Hydration takes extra processing time & memory, which can add up in bulk.
        // Optionally, if able, skip it & work with $user as an array, not an object:
        $usersQuery->enableHydration(false);

        $this->info("Grabbing $total users for processing");

        // Optionally show the progress so we can visually see how far we are in the process
        $progress = $io->helper('Progress')->init([
            'total' => $total
        ]);

        // Tune this page value to a size that solves your problem:
        $limit = 1000;
        $offset = 0;

        // Simply drawing the progress bar every loop can slow things down; optionally draw it
        // only every n loops. This sets it to 1/5th the page size:
        $progressInterval = $limit / 5;

        // Optionally track the rate so we can evaluate the speed of the process,
        // helpful when tuning $limit and evaluating enableHydration effects
        $startTime = microtime(true);

        do {
            $users = $usersQuery->limit($limit)->offset($offset)->toArray();
            $count = count($users);
            $index = 0;
            foreach ($users as $user) {
                $progress->increment(1);
                // Only draw occasionally, for speed
                if ($index % $progressInterval === 0) {
                    $progress->draw();
                }
                $index++;

                ### WORK TIME
                # Do your lots of database Insert/Update/Delete, HTTP request stuff etc. here
                ###
            }
            $progress->draw();
            $offset += $limit; // Increment your offset to the next page
        } while ($count > 0);

        $totalTime = microtime(true) - $startTime;
        $this->out("\nProcessed an average " . ($total / $totalTime) . " Users/sec\n");
    }
}
Check out these sections in the CakePHP docs:
Console Commands
Command Helpers
Using Finders & Disabling Hydration
Hope this helps!
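As a sketch of the cron side (the command name users is assumed from the UsersCommand class above, and the paths are illustrative): the crontab entry invokes the Command through CakePHP's console binary instead of cURL:
# Run the bulk user processing every night at 02:15
15 2 * * * cd /var/www/app && bin/cake users >> logs/cron_users.log 2>&1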

Array Insert Time Jump

While doing some deep research into the hash and zval structures and how PHP arrays are built on them, I ran into some strange insert times.
Here is an example:
$array = array();
$someValueToInsert = 100;
for ($i = 0; $i < 10000; ++$i) {
    $time = microtime(true);
    array_push($array, $someValueToInsert);
    echo $i . " : " . (int)((microtime(true) - $time) * 100000000) . "</br>";
}
So, I found that every 1024th, 2048th, 4096th... element is inserted using much more time (>~10x).
It doesn't depend on whether I use array_push(), array_unshift(), or simply $array[] = $someValueToInsert.
I suspect it relates to this field in the HashTable structure:
typedef struct _hashtable {
...
uint nNumOfElements;
...
} HashTable;
nNumOfElements has a default max value, but that doesn't answer why inserting takes more time at those particular counts (1024, 2048...).
Any thoughts ?
While I would suggest double-checking my answer on the PHP internals list, I believe the answer lies in zend_hash_do_resize(). When more elements are needed in the hash table, this function is called and the extant hash table is doubled in size. Since the table starts life at 1024, this doubling explains the results you've observed. Code:
} else if (ht->nTableSize < HT_MAX_SIZE) { /* Let's double the table size */
    void *old_data = HT_GET_DATA_ADDR(ht);
    Bucket *old_buckets = ht->arData;

    HANDLE_BLOCK_INTERRUPTIONS();
    ht->nTableSize += ht->nTableSize;
    ht->nTableMask = -ht->nTableSize;
    HT_SET_DATA_ADDR(ht, pemalloc(HT_SIZE(ht), ht->u.flags & HASH_FLAG_PERSISTENT));
    memcpy(ht->arData, old_buckets, sizeof(Bucket) * ht->nNumUsed);
    pefree(old_data, ht->u.flags & HASH_FLAG_PERSISTENT);
    zend_hash_rehash(ht);
    HANDLE_UNBLOCK_INTERRUPTIONS();
I am uncertain whether the reallocation is the performance hit, or the rehashing, or the fact that the whole block is uninterruptible. It would be interesting to put a profiler on it. I think some might have already done that for PHP 7.
Side note: the thread-safe version does things differently. I'm not overly familiar with that code, so there may be a different issue going on if you're using ZTS.
I think it is related to the implementation of dynamic arrays.
See "Geometric expansion and amortized cost" here: http://en.wikipedia.org/wiki/Dynamic_array
To avoid incurring the cost of resizing many times, dynamic arrays resize by a large amount, **such as doubling in size**, and use the reserved space for future expansion
You can read about arrays in PHP here as well https://nikic.github.io/2011/12/12/How-big-are-PHP-arrays-really-Hint-BIG.html
It is standard practice for dynamic arrays. E.g., see this C++ example of increasing a dynamic array's capacity:
capacity = capacity * 2; // doubles the capacity of the array
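To see that geometric growth from userland, here is a minimal sketch (my own illustration, not from the answers above; behaviour shown is for PHP 7+) that watches memory_get_usage() instead of timing inserts; each reported jump is roughly double the previous one:
<?php
$array = array();
$last = memory_get_usage();
for ($i = 0; $i < 10000; ++$i) {
    $array[] = 100;
    $now = memory_get_usage();
    if ($now !== $last) {
        // A jump here marks a reallocation of the backing table.
        echo "resize after element $i: +" . ($now - $last) . " bytes\n";
        $last = $now;
    }
}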

php RRD graph separation

I am trying to create RRD graphs with the help of PHP in order to keep track of the inoctets, outoctets and counter of a server.
So far the script operates as expected, but my problem comes when I try to produce 2 or more separate graphs. I am trying to produce (hourly, weekly, etc.) graphs. I thought creating a loop would solve my problem, since I have split the RRA into hours and days. Unfortunately I end up with 2 graphs that update simultaneously as expected but plot the same thing. Has anyone encountered a similar problem? I have implemented the same program in Perl with RRD::Simple, where it is extremely easy and everything is adjusted almost automatically.
I have supplied below a working example of my code with the minimum possible data, because the code is a bit long:
<?php
$file = "snmp-2";
$rrdFile = dirname(__FILE__) . "/snmp-2.rrd";
$in = "ifInOctets";
$out = "ifOutOctets";
$count = "sysUpTime";
$step = 5;
$rounds = 1;
$output = array("Hourly", "Daily");

while (1) {
    sleep(6);
    $options = array(
        "--start", "now -15s", // now -15 seconds
        "--step", "" . $step . "",
        "DS:" . $in . ":GAUGE:10:U:U",
        "DS:" . $out . ":GAUGE:10:U:U",
        "DS:" . $count . ":ABSOLUTE:10:0:4294967295",
        "RRA:MIN:0.5:12:60",
        "RRA:MAX:0.5:12:60",
        "RRA:LAST:0.5:12:60",
        "RRA:AVERAGE:0.5:12:60",
        "RRA:MIN:0.5:300:60",
        "RRA:MAX:0.5:300:60",
        "RRA:LAST:0.5:300:60",
        "RRA:AVERAGE:0.5:300:60",
    );
    if (!isset($create)) {
        $create = rrd_create($rrdFile, $options);
        if ($create === FALSE) {
            echo "Creation error: " . rrd_error() . "\n";
        }
    }

    $t = time();
    $ifInOctets = rand(0, 4294967295);
    $ifOutOctets = rand(0, 4294967295);
    $sysUpTime = rand(0, 4294967295);
    $update = rrd_update(
        $rrdFile,
        array($t . ":" . $ifInOctets . ":" . $ifOutOctets . ":" . $sysUpTime)
    );
    if ($update === FALSE) {
        echo "Update error: " . rrd_error() . "\n";
    }

    $start = $t - ($step * $rounds);
    foreach ($output as $test) {
        $final = array(
            "--start", $start . " -15s",
            "--end", "" . $t . "",
            "--step", "" . $step . "",
            "--title=" . $file . " RRD::Graph",
            "--vertical-label=Byte(s)/sec",
            "--right-axis-label=latency(min.)",
            "--alt-y-grid", "--rigid",
            "--width", "800", "--height", "500",
            "--lower-limit=0",
            "--alt-autoscale-max",
            "--no-gridfit",
            "--slope-mode",
            "DEF:" . $in . "_def=" . $file . ".rrd:" . $in . ":AVERAGE",
            "DEF:" . $out . "_def=" . $file . ".rrd:" . $out . ":AVERAGE",
            "DEF:" . $count . "_def=" . $file . ".rrd:" . $count . ":AVERAGE",
            "CDEF:inbytes=" . $in . "_def,8,/",
            "CDEF:outbytes=" . $out . "_def,8,/",
            "CDEF:counter=" . $count . "_def,8,/",
            "COMMENT:\\n",
            "LINE2:" . $in . "_def#FF0000:" . $in . "",
            "COMMENT:\\n",
            "LINE2:" . $out . "_def#0000FF:" . $out . "",
            "COMMENT:\\n",
            "LINE2:" . $count . "_def#FFFF00:" . $count . "",
        );
        $outputPngFile = rrd_graph($test . ".png", $final);
        if ($outputPngFile === FALSE) {
            echo "<b>Graph error: </b>" . rrd_error() . "\n";
        }
    } /* End of foreach */

    $debug = rrd_lastupdate($rrdFile);
    if ($debug === FALSE) {
        echo "<b>Graph result error: </b>" . rrd_error() . "\n";
    }
    var_dump($debug);
    $rounds++;
} /* End of while loop */
?>
A couple of issues.
Firstly, your definition of the RRD has a step of 5 seconds, and RRAs with steps of 12x5s=1min and 300x5s=25min. They also have a length of only 60 rows, so they cover 1 hour and 25 hours respectively. You'll never get a weekly graph this way! You need to add more rows; the step also seems rather short, and you might need a smaller-step RRA for hourly graphs and a larger-step one for weekly graphs.
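For illustration, RRAs sized to actually cover those windows might look like this (the figures are a sketch, not taken from the question; rows x step = covered span at a 5 s base step):
"RRA:AVERAGE:0.5:1:720",   // 5 s resolution, 720 rows  = 1 hour
"RRA:AVERAGE:0.5:60:288",  // 5 min resolution, 288 rows = 1 day
"RRA:AVERAGE:0.5:720:168", // 1 h resolution, 168 rows  = 1 week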
Secondly, it is not clear how you're calling the graph function. You seem to be specifying:
"--start","".$start." -15s",
"--end", "".$t."",
"--step","".$step."",
... which would force it to use the 5 s interval (unavailable, so the 1 min one would always get used) and force the graph to cover only the time window from the start to the last update, not an 'hourly' or 'daily' one as you were asking for.
Note that the RRAs you have defined do not determine the time window of the graph you are asking for. Also, just because you have more than one RRA defined doesn't mean you'll get more than one graph, unless you call the graph function twice with different arguments.
If you want an hourly graph, use
"--start", "end - 1 hour",
"--end", $t,
Do not specify a step, as the most appropriate available will be used anyway. For a daily graph, use
"--start", "end - 1 day",
"--end", $t,
Similarly, no need to specify a step.
Hopefully this will make it a little clearer. Most of the RRD graph options have sensible defaults, and RRDTool is pretty good at picking the correct RRA to use based on the graph size, time window, and DEF statements.
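Putting that together, here is a minimal sketch (reusing $rrdFile, $in and $out from the question; styling options omitted) that produces one PNG per time window:
<?php
// One rrd_graph() call per window; the window, not the RRA list,
// determines what each graph shows.
$windows = array(
    "Hourly" => "end - 1 hour",
    "Daily"  => "end - 1 day",
);
$t = time();
foreach ($windows as $name => $start) {
    $ok = rrd_graph($name . ".png", array(
        "--start", $start,
        "--end", (string)$t,
        "--title=" . $name,
        "--vertical-label=Byte(s)/sec",
        "DEF:in_def=" . $rrdFile . ":" . $in . ":AVERAGE",
        "DEF:out_def=" . $rrdFile . ":" . $out . ":AVERAGE",
        "LINE2:in_def#FF0000:" . $in,
        "LINE2:out_def#0000FF:" . $out,
    ));
    if ($ok === FALSE) {
        echo "Graph error: " . rrd_error() . "\n";
    }
}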

PHP code reaching execution time limit

I need to go through an array containing points on a map and check their distance from one another. I need to count how many nodes are within 200 m and 50 m of each one. It works fine for smaller numbers of values. However, when I tried to run more values through it (around 4000, for scalability testing), an error occurred saying that I have reached the maximum execution time of 300 seconds. It needs to be able to handle at least this much within 300 seconds if possible.
I have read around and found out that there is a way to disable/change this limit, but I would like to know if there is a simpler way of executing the following code so that the time it takes to run it will decrease.
for ($i = 0; $i <= count($data) - 1; $i++) {
    $amount200a = 0;
    $amount200p = 0;
    $amount50a = 0;
    $amount50p = 0;
    $distance = 0;
    for ($_i = 0; $_i <= count($data) - 1; $_i++) {
        $distance = 0;
        if ($data[$i][0] !== $data[$_i][0]) {
            //echo "Comparing ".$data[$i][0]." and ".$data[$_i][0]." ";
            $lat_a = $data[$i][1] * PI() / 180;
            $lat_b = $data[$_i][1] * PI() / 180;
            $long_a = $data[$i][2] * PI() / 180;
            $long_b = $data[$_i][2] * PI() / 180;
            $distance =
                acos(
                    sin($lat_a) * sin($lat_b) +
                    cos($lat_a) * cos($lat_b) * cos($long_b - $long_a)
                ) * 6371;
            $distance *= 1000;
            if ($distance <= 50) {
                $amount50a++;
                $amount200a++;
            } else if ($distance <= 200) {
                $amount200a++;
            }
        }
    }
    $amount200p = 100 * number_format($amount200a / count($data), 2, '.', '');
    $amount50p = 100 * number_format($amount50a / count($data), 2, '.', '');
    /*
    $dist[$i][0] = $data[$i][0];
    $dist[$i][1] = $amount200a;
    $dist[$i][2] = $amount200p;
    $dist[$i][3] = $amount50a;
    $dist[$i][4] = $amount50p;
    //*/
    $dist .= $data[$i][0] . "&&" . $amount200a . "&&" . $amount200p . "&&" . $amount50a . "&&" . $amount50p . "%%";
}
Index 0 contains the unique ID of each node, 1 contains the latitude of each node and
index 2 contains the longitude of each node.
The error occurs at the second for loop inside the first loop. This loop is the one comparing the selected map node to other nodes. I am also using the Haversine Formula.
First of all, in big-O notation you are performing O(data^2), which is going to be slow as hell, and there are really two possible solutions: find a proven algorithm that solves the same problem in better time, or, if you can't, start moving work out of the inner for loop and mathematically prove which parts of it can be reduced to simple calculations done up front, which is often possible.
After some rewriting, I see some possibilities:
If $data is not an SplFixedArray (which has far better access time), then make it one, since you are accessing that data so many times ((4000^2)*2).
Second, write cleaner code. Although the optimizer will do its best, if you don't try to minimize the code (which also makes it more readable), it might not be able to do as well as possible.
Also move intermediate results out of the loops, including things like the size of the array.
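For reference, the SplFixedArray conversion is a one-liner (a sketch; SplFixedArray::fromArray() is standard SPL, and reads like $data[$i] keep working since the keys here are integers):
// Convert once, before the nested loops.
$data = SplFixedArray::fromArray($data);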
Currently you're checking all points against all other points, where in fact you only need to check the current point against all remaining points. The distance from A to B is the same as the distance from B to A, so why calculate it twice?
I would probably build an adjacency array that counts how many nodes are within range of each other, and increment pairs of entries in that array after I've calculated that two nodes are within range of each other.
You should probably come up with a very fast approximation of the distance that can be used to disregard as many nodes as possible before calculating the real distance (which is never going to be super fast).
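For example, a cheap latitude-only prefilter (my sketch, not from the original answer): one degree of latitude is roughly 111 km, so two points whose latitudes differ by more than ~0.0018 degrees cannot be within 200 m, and the trigonometry can be skipped inside the inner loop:
// 200 m expressed in degrees of latitude (~111,320 m per degree)
$maxLatDiff = 200 / 111320;
if (abs($data[$i][1] - $data[$_i][1]) > $maxLatDiff) {
    continue; // already too far apart; skip the acos() entirely
}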
Generally speaking, beyond algorithmic optimisations, the basic rules of optimisation are:
Don't do any processing that you don't have to: like not multiplying $distance by 1000. Just change the values you're testing against from 200 and 50 to 0.2 and 0.05, respectively.
Don't call any function more often than you have to: You only need to call count($data) once before any processing starts.
Don't calculate constant values more than once: PI()/180, for example.
Move all possible processing outside of loops. I.e. precalculate as much as possible.
Another minor point which will make your code a little easier to read:
for( $i = 0; $i <= count( $data ) - 1; $i++ ) is the same as:
for( $i = 0; $i < count( $data ); $i++ )
Try this:
$max = count($data);
$CONST_PI = PI() / 180;
for ($i = 0; $i < $max; $i++) {
    $amount200a = 0;
    $amount50a = 0;
    $long_a = $data[$i][2] * $CONST_PI;
    $lat_a = $data[$i][1] * $CONST_PI;
    // or use for ($_i = $i + 1; $_i < $max; $_i++) if you do not need to
    // recalculate distances already calculated in the other direction
    for ($_i = 0; $_i < $max; $_i++) {
        if ($data[$i][0] === $data[$_i][0]) continue;
        $lat_b = $data[$_i][1] * $CONST_PI;
        $long_b = $data[$_i][2] * $CONST_PI;
        $distance =
            acos(
                sin($lat_a) * sin($lat_b) +
                cos($lat_a) * cos($lat_b) * cos($long_b - $long_a)
            ) * 6371;
        if ($distance <= 0.2) {
            $amount200a++;
            if ($distance <= 0.05) {
                $amount50a++;
            }
        }
    } // for $_i
    $amount200p = 100 * number_format($amount200a / $max, 2, '.', '');
    $amount50p = 100 * number_format($amount50a / $max, 2, '.', '');
    $dist .= $data[$i][0] . "&&" . $amount200a . "&&" . $amount200p . "&&" . $amount50a . "&&" . $amount50p . "%%";
} // for $i
It should be easier to read, I think, and if you switch to the commented-out version of the $_i loop it will be faster still :)

How to find the centre of a grid of lines on google maps

I'm struggling with a problem with some GIS information that I am trying to put into a KML file to be used by Google Maps and Google Earth.
I have an SQL database containing a number of surveys, which are stored as lines. When all the lines are drawn on the map, they create a grid. What I am trying to do is work out the centre of these 'grids' so that I can put a placemark reference into the KML to show all the grid locations on a map.
Lines are stored in the database like this:
118.718318,-19.015803,0 118.722449,-19.016919,0 118.736223,-19.020637,0 118.749936,-19.024023,0 118.763897,-19.027722,0 118.777705,-19.031277,0 118.791416,-19.034826,0 118.805276,-19.038367,0 118.818862,-19.041962,0 118.832862,-19.045582,0 118.846133,-19.049563,0 118.859801,-19.053851,0 118.873322,-19.058145,0 118.887022,-19.062349,0 118.900595,-19.066594,0 118.914066,-19.070839,0 118.927885,-19.075151,0 118.941468,-19.079354,0 118.955064,-19.083658,0 118.968766,-19.087896,0 118.982247,-19.092054,0 118.995795,-19.096324,0 119.009192,-19.100448,0 119.022805,-19.104787,0 119.036414,-19.10893,0 119.049625,-19.113166,0 119.063155,-19.11738,0 119.076626,-19.121563,0 119.090079,-19.125738,0 119.103679,-19.129968,0 119.117009,-19.134168,0 119.130637,-19.138418,0 119.144134,-19.142613,0 119.157749,-19.146767,0 119.171105,-19.151058,0 119.184722,-19.155252,0 119.19844,-19.159399,0 119.211992,-19.163737,0 119.225362,-19.167925,0 119.239109,-19.17218,0 119.252552,-19.176239,0 119.265975,-19.180483,0 119.279718,-19.184901,0 119.2931,-19.189118,0 119.306798,-19.193267,0 119.320561,-19.197631,0 119.33389,-19.201898,0 119.34772,-19.206063,0 119.361322,-19.210282,0 119.374522,-19.214505,0 119.38825,-19.218753,0 119.401948,-19.22299,0 119.415609,-19.227171,0 119.42913,-19.231407,0 119.44269,-19.235604,0 119.456221,-19.240254,0 119.469772,-19.244165,0 119.483159,-19.248341,0 119.496911,-19.252483,0 119.510569,-19.256818,0 119.524034,-19.261068,0 119.537822,-19.265332,0 119.551409,-19.269529,0 119.564744,-19.273764,0 119.578357,-19.277874,0 119.591935,-19.282171,0 119.605472,-19.286424,0 119.619009,-19.290593,0 119.632777,-19.294862,0 119.646394,-19.299146,0 119.660107,-19.303469,0 119.673608,-19.307602,0 119.6872,-19.31182,0 119.700838,-19.31605,0 119.714555,-19.320301,0 119.728202,-19.324615,0 119.741511,-19.328794,0 119.755299,-19.333098,0 119.768776,-19.337206,0 119.782216,-19.341528,0 119.796141,-19.345829,0 119.809691,-19.349879,0 119.822889,-19.354112,0 119.836822,-19.358414,0 119.850077,-19.362618,0 119.864069,-19.366916,0 119.8773,-19.371092,0 119.891263,-19.37536,0 119.904612,-19.379556,0 119.918522,-19.38394,0 119.932101,-19.388108,0 119.945577,-19.392184,0 119.959304,-19.396544,0 119.973042,-19.400809,0 119.986433,-19.405015,0 119.999196,-19.408981,0 120.011959,-19.412946,0 120.014512,-19.41374,0
118.990393,-19.523933,0 118.990833,-19.522296,0 118.994368,-19.509069,0 118.997911,-19.49589,0 119.001249,-19.48258,0 119.004939,-19.469658,0 119.008437,-19.456658,0 119.01183,-19.443361,0 119.015332,-19.430439,0 119.01869,-19.417491,0 119.022303,-19.40421,0 119.025853,-19.391381,0 119.029387,-19.377966,0 119.032821,-19.365006,0 119.036247,-19.352047,0 119.039935,-19.338701,0 119.043376,-19.325781,0 119.046824,-19.312713,0 119.050223,-19.299634,0 119.053822,-19.286711,0 119.05736,-19.273441,0 119.060722,-19.260467,0 119.064357,-19.247382,0 119.067863,-19.234069,0 119.071361,-19.221168,0 119.07486,-19.208268,0 119.078303,-19.19503,0 119.081721,-19.181915,0 119.085296,-19.169016,0 119.088851,-19.155937,0 119.092245,-19.142828,0 119.095887,-19.12944,0 119.099337,-19.116291,0 119.102821,-19.103421,0 119.106333,-19.090384,0 119.109885,-19.077356,0 119.113395,-19.063976,0 119.116866,-19.050973,0 119.120467,-19.037886,0 119.123751,-19.024888,0 119.127385,-19.011622,0 119.130775,-18.998856,0 119.134503,-18.98547,0 119.13873,-18.9728,0 119.14289,-18.959666,0 119.146076,-18.946541,0 119.149065,-18.933693,0 119.152281,-18.920522,0 119.155414,-18.907033,0 119.158442,-18.894058,0 119.161421,-18.88097,0 119.164632,-18.867484,0 119.167668,-18.854501,0 119.170744,-18.841401,0 119.173939,-18.827999,0 119.17703,-18.815028,0 119.17984,-18.802218,0 119.182902,-18.789334,0 119.186088,-18.776136,0 119.189031,-18.762862,0 119.192179,-18.749824,0 119.19539,-18.7365,0 119.198255,-18.723279,0 119.201476,-18.710134,0 119.204345,-18.69705,0 119.20746,-18.683768,0 119.210555,-18.670607,0 119.213641,-18.657632,0 119.216727,-18.644301,0 119.21982,-18.630956,0 119.223094,-18.617243,0 119.22625,-18.603885,0 119.229368,-18.590534,0 119.232494,-18.577364,0 119.235477,-18.564287,0 119.238496,-18.550953,0 119.241392,-18.53789,0 119.244609,-18.524881,0 119.247551,-18.512017,0 119.250532,-18.498916,0 119.253729,-18.485859,0 119.256428,-18.473845,0 119.259288,-18.461645,0 119.262064,-18.4491,0 119.265024,-18.437048,0 119.267877,-18.424992,0 119.270457,-18.4136,0 119.273429,-18.401305,0 119.276346,-18.388484,0 119.279506,-18.375299,0 119.282307,-18.362197,0 119.285494,-18.34918,0 119.288695,-18.336115,0 119.291008,-18.323772,0 119.294026,-18.310635,0 119.297332,-18.297714,0 119.300499,-18.284553,0 119.303442,-18.271193,0 119.307081,-18.258278,0 119.309945,-18.245139,0 119.313121,-18.232137,0 119.31642,-18.218993,0 119.319499,-18.205722,0 119.322801,-18.192774,0 119.325986,-18.179558,0 119.329173,-18.166332,0 119.33236,-18.15312,0 119.335558,-18.140071,0 119.338761,-18.126696,0 119.342007,-18.113502,0 119.345238,-18.100349,0 119.34808,-18.088343,0 119.350259,-18.079138,0
119.912412,-18.177179,0 119.910223,-18.183223,0 119.905619,-18.195973,0 119.901171,-18.208645,0 119.896641,-18.221452,0 119.89198,-18.234323,0 119.887547,-18.246912,0 119.882922,-18.259925,0 119.878421,-18.272522,0 119.873909,-18.285223,0 119.869264,-18.298194,0 119.864746,-18.310939,0 119.860203,-18.323628,0 119.855692,-18.336434,0 119.850997,-18.349262,0 119.846561,-18.362014,0 119.841916,-18.374866,0 119.837406,-18.387565,0 119.832778,-18.400657,0 119.828259,-18.413072,0 119.82364,-18.426273,0 119.819088,-18.438992,0 119.814546,-18.451696,0 119.810103,-18.464425,0 119.80548,-18.477041,0 119.800935,-18.48989,0 119.796359,-18.502741,0 119.791812,-18.515489,0 119.787314,-18.528536,0 119.782724,-18.540965,0 119.778079,-18.553994,0 119.773558,-18.56663,0 119.768931,-18.57955,0 119.76446,-18.592177,0 119.759793,-18.605255,0 119.755393,-18.617736,0 119.750648,-18.630757,0 119.746479,-18.643371,0 119.741779,-18.656176,0 119.737071,-18.669135,0 119.732572,-18.681798,0 119.727954,-18.694669,0 119.723453,-18.707473,0 119.718855,-18.720259,0 119.714169,-18.733198,0 119.70946,-18.746355,0 119.704998,-18.759005,0 119.700581,-18.771793,0 119.696226,-18.784893,0 119.691577,-18.797373,0 119.68662,-18.810064,0 119.682182,-18.822772,0 119.677711,-18.835621,0 119.672955,-18.848475,0 119.668536,-18.861252,0 119.663856,-18.873901,0 119.659412,-18.88663,0 119.654815,-18.899592,0 119.650185,-18.912338,0 119.645586,-18.925118,0 119.640925,-18.937923,0 119.636617,-18.950753,0 119.631874,-18.963446,0 119.627376,-18.976275,0 119.622845,-18.98893,0 119.618175,-19.001942,0 119.613727,-19.014668,0 119.608878,-19.027398,0 119.60445,-19.039954,0 119.599954,-19.053249,0 119.595066,-19.066008,0 119.590562,-19.078866,0 119.586062,-19.091703,0 119.58155,-19.104199,0 119.576948,-19.116924,0 119.572431,-19.129926,0 119.567449,-19.142699,0 119.563227,-19.155474,0 119.558555,-19.168309,0 119.553956,-19.181053,0 119.549545,-19.193649,0 119.544854,-19.20646,0 119.540133,-19.21929,0 119.535606,-19.232162,0 119.53108,-19.245054,0 119.526509,-19.257772,0 119.522079,-19.270427,0 119.517419,-19.283013,0 119.512755,-19.296126,0 119.508159,-19.309031,0 119.503557,-19.321821,0 119.498966,-19.334394,0 119.494487,-19.347232,0 119.489815,-19.359907,0 119.485201,-19.372962,0 119.48067,-19.385811,0 119.476151,-19.398326,0 119.471526,-19.411138,0 119.466856,-19.424266,0 119.462284,-19.436788,0 119.457752,-19.449628,0 119.452975,-19.462718,0 119.448612,-19.47499,0 119.444249,-19.48726,0 119.439886,-19.499531,0 119.437704,-19.505666,0
Each point is where a measurement took place.
Is there a PHP library or a formula that can be used to work this out without being too intensive?
Thanks
It depends very much on how you define "center". One somewhat sophisticated approach you might like to try:
Extract a list of points from the lines
Find the convex hull of the points:
Find the centroid of the convex hull
The convex hull is simply described as the simplest polygon that passes through some of the points, forming an envelope that encloses all of the points within the polygon. Wikipedia has a more rigorous description.
The centroid provides a simple way of finding the "center of gravity" of a polygon. Wikipedia and a nice tutorial provide more information.
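For completeness, here is a sketch of the standard polygon-centroid ("shoelace") formula in PHP (my own illustration, assuming $poly is an ordered array of [x, y] vertex pairs, such as the convex hull):
// Cx = (1/6A) * sum((x_i + x_{i+1}) * cross_i), where
// cross_i = x_i * y_{i+1} - x_{i+1} * y_i, and A is the signed area.
function polygonCentroid(array $poly)
{
    $a = 0.0;
    $cx = 0.0;
    $cy = 0.0;
    $n = count($poly);
    for ($i = 0; $i < $n; $i++) {
        list($x0, $y0) = $poly[$i];
        list($x1, $y1) = $poly[($i + 1) % $n];
        $cross = $x0 * $y1 - $x1 * $y0;
        $a += $cross;
        $cx += ($x0 + $x1) * $cross;
        $cy += ($y0 + $y1) * $cross;
    }
    $a /= 2;
    return array($cx / (6 * $a), $cy / (6 * $a));
}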
Alternatively, a very simple implementation might find the minimum bounding rectangle of the lines, and then find the geometric centre of that box. The following code assumes the lines you posted above are represented as strings and are in an array. I'm no PHP performance expert, but this algorithm should execute in O(n) asymptotic time:
<?php
$line_strs = array(
    "118.718318,-19.015803,0 118.722449,-19.016919,0 118.736223,-19.020637,0 118.749936,-19.024023,0 118.763897,-19.027722,0 118.777705,-19.031277,0", "118.791416,-19.034826,0 118.805276,-19.038367,0 118.818862,-19.041962,0 118.832862,-19.045582,0 118.846133,-19.049563,0 118.859801,-19.053851,0",
    [SNIP majority of lines...]
    "119.448612,-19.47499,0 119.444249,-19.48726,0 119.439886,-19.499531,0 119.437704,-19.505666,0");

$xs = array();
$ys = array();
foreach ($line_strs as $line) {
    $points = explode(' ', $line);
    foreach ($points as $pt) {
        $xyz = explode(',', $pt);
        $xs[] = (float)$xyz[0];
        $ys[] = (float)$xyz[1];
    }
}
$x_mean = (max($xs) + min($xs)) / 2;
$y_mean = (max($ys) + min($ys)) / 2;
echo "$x_mean,$y_mean\n";
?>
This outputs 119.366415,-18.8015355.
Another very simple approach finds the mean location of all vertices. It favours areas that are densely populated by points, so it returns a slightly different answer to the bounding-box method. You might want to try them both to see which provides a "better" result for your data.
<?php
$line_strs = array(
    "118.718318,-19.015803,0 118.722449,-19.016919,0 118.736223,-19.020637,0 118.749936,-19.024023,0 118.763897,-19.027722,0 118.777705,-19.031277,0", "118.791416,-19.034826,0 118.805276,-19.038367,0 118.818862,-19.041962,0 118.832862,-19.045582,0 118.846133,-19.049563,0 118.859801,-19.053851,0",
    [SNIP majority of lines...]
    "119.448612,-19.47499,0 119.444249,-19.48726,0 119.439886,-19.499531,0 119.437704,-19.505666,0");

$x_sum = 0.0;
$y_sum = 0.0;
$n = 0;
foreach ($line_strs as $line) {
    $points = explode(' ', $line);
    foreach ($points as $pt) {
        $xyz = explode(',', $pt);
        $x_sum += (float)$xyz[0];
        $y_sum += (float)$xyz[1];
        $n++;
    }
}
$x_mean = $x_sum / $n;
$y_mean = $y_sum / $n;
echo "$x_mean,$y_mean\n";
?>
This outputs 119.402034114,-18.9427670536.
