I have a PHP array that I use to draw a graph.
JSON format:
{"y":24.1,"x":"2017-12-04 11:21:25"},
{"y":24.1,"x":"2017-12-04 11:32:25"},
{"y":24.3,"x":"2017-12-04 11:33:30"},
{"y":24.1,"x":"2017-12-04 11:34:25"},
{"y":24.2,"x":"2017-12-04 11:35:35"},.........
{"y":26.2,"x":"2017-12-04 11:36:35"}, ->goes up for about a minute
{"y":26.3,"x":"2017-12-04 11:37:35"},.........
{"y":24.1,"x":"2017-12-04 11:38:25"},
{"y":24.3,"x":"2017-12-04 11:39:30"}
y is the temperature and x is the date and time.
As you can see, the temperature doesn't change very often, and when it does, it changes by at most 0.4. But sometimes, after a long period of similar values, it changes by more than 0.4.
I would like to merge those similar values, so the graph would not have 200k similar values but only the "important" ones.
I need advice on how to do this, or on which algorithm would be suitable for producing an optimized array like the one below.
Ideal output:
{"y":24.1,"x":"2017-12-04 11:21:25"},.........
{"y":24.1,"x":"2017-12-04 11:34:25"},
{"y":24.2,"x":"2017-12-04 11:35:35"},.........
{"y":26.2,"x":"2017-12-04 11:36:35"}, ->goes up for about a minute
{"y":26.3,"x":"2017-12-04 11:37:35"},.........
{"y":24.1,"x":"2017-12-04 11:38:25"}
Any help?
As you specified php I'm going to assume you can handle this on the output side.
Basically, you want logic like "if the temperature differs from the last recorded temperature by at least some threshold, or the time is more than x minutes after the last recorded time, then output a point on the graph". If that's the case, you can get the result with the following:
$temps = array(); // your data in the question
$temp = 0;
$time = 0;
$time_max = 120; // two minutes
$temp_important = .4; // max you'll tolerate
$output = [];
foreach ($temps as $point) {
    if (strtotime($point['x']) - $time > $time_max || abs($point['y'] - $temp) >= $temp_important) {
        // add it to output
        $output[] = $point;
    }
    // update our data points
    if (strtotime($point['x']) - $time > $time_max) {
        $time = strtotime($point['x']);
    }
    if (abs($point['y'] - $temp) >= $temp_important) {
        $temp = $point['y'];
    }
}
// and out we go..
echo json_encode($output);
Hmm, that's not exactly what you're asking for: if the temperature spiked for a short time and then dropped again immediately, you'd need to change the logic. But think of it in terms of requirements.
If you're RECEIVING data on the output side, I'd write something in JavaScript to buffer these points and apply the same logic. You might need to buffer 2-3 points to make your decision. This logic performs an important task, so you'd want to encapsulate it and make sure its parameters are easy to specify.
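As a rough illustration of that encapsulation, here is a minimal sketch in PHP. The function name thinPoints() and the parameters $minDelta and $maxGap are made up for this example. Unlike the loop above, it resets both the reference value and the reference time every time a point is kept, which is an assumption about the desired behaviour rather than the only reasonable choice.

```php
<?php
/**
 * Keep a point when its value differs from the last kept value by at least
 * $minDelta, or when more than $maxGap seconds have passed since the last
 * kept point. The first point is always kept.
 * (thinPoints, $minDelta and $maxGap are illustrative names.)
 */
function thinPoints(array $points, float $minDelta, int $maxGap): array
{
    $output = [];
    $lastValue = null;
    $lastTime = null;

    foreach ($points as $point) {
        $time = strtotime($point['x']);
        $keep = $lastValue === null
            || abs($point['y'] - $lastValue) >= $minDelta
            || ($time - $lastTime) > $maxGap;

        if ($keep) {
            $output[] = $point;
            $lastValue = $point['y'];
            $lastTime = $time;
        }
    }
    return $output;
}

$points = [
    ['y' => 24.1, 'x' => '2017-12-04 11:21:25'],
    ['y' => 24.1, 'x' => '2017-12-04 11:32:25'],
    ['y' => 24.3, 'x' => '2017-12-04 11:33:30'],
    ['y' => 26.2, 'x' => '2017-12-04 11:36:35'],
    ['y' => 26.3, 'x' => '2017-12-04 11:37:35'],
    ['y' => 24.1, 'x' => '2017-12-04 11:38:25'],
];
// Threshold of 0.4 degrees, and at least one point every 10 minutes.
$thinned = thinPoints($points, 0.4, 600);
echo json_encode($thinned), "\n";
```

Because the values are floats, a difference that looks like exactly 0.4 may compare as slightly less, so it may be worth choosing a threshold a little below the smallest change you care about.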
You have a function that always takes an interval (natural numbers in this case) as input. It returns a result but is quite expensive on the processor, which is simulated by sleep in this example:
function calculate($start, $end) {
    $result = 0;
    for ($x = $start; $x <= $end; $x++) {
        $result++;
        usleep(250000);
    }
    return $result;
}
To be more efficient, there is an array of old results that contains each interval used and the result of the function for that interval:
$oldResults = [
['s'=>1, 'e'=>2, 'r' => 1],
['s'=>2, 'e'=>6, 'r' => 4],
['s'=>4, 'e'=>7, 'r' => 3]
];
If I call calculate(1,10), the function should be able to work out new intervals based on old results and accumulate them. In this particular case it should take the old result from 1 to 2, add the old result from 2 to 6, then do a new calculate(6,10) and add that too. Note that the function ignores the old saved interval from 4 to 7, since it was more convenient to use 2-6.
This is a visual representation of the problem:
Of course in this example, calculate() is quite simple and you can just find particular ways to solve this problem around it, but in the real code calculate() is complex and the only thing I know is that calculate(n0,n3)==calculate(n0,n1)+calculate(n1,n2)+calculate(n2,n3).
I cannot find a way to solve the reuse of the old data without using a bunch of IF and foreach, I'm sure there is a more elegant approach to solve this.
You can play with the code here.
Note: I'm using PHP but I can read JS, Python, C and similar languages.
If you are certain that calculate(n0,n3)==calculate(n0,n1)+calculate(n1,n2)+calculate(n2,n3), then it seems to me that one approach might simply be to establish a database cache.
You can pre-calculate each discrete interval and store its result in a record.
$start = 0;
$end = 1000;
for ($i = 1; $i <= $end; $i++) {
    $result = calculate($start, $i);
    $sql = "INSERT INTO calculated_cache (start, end, result) VALUES ($start,$i,$result)";
    // execute statement via whatever dbms api
    $start++;
}
Now, whenever new requests come in, a database lookup should be significantly faster. Note that you may need to tinker with the boundary cases in this rough example.
function fetch_calculated_cache($start, $end) {
    $sql = "
        SELECT SUM(result)
        FROM calculated_cache
        WHERE (start BETWEEN $start AND $end)
        AND (end BETWEEN $start AND $end)
    ";
    // placeholder: run $sql through whatever dbms api you chose and read the sum
    $result = null;
    return $result;
}
There are a couple of obvious considerations, such as:
Cache invalidation. How often will the results of your calculate function change? You'll need to repopulate the database then.
How many intervals do you want to store? In my example, I arbitrarily picked 1000.
Will you ever need to retrieve non-sequential interval results? You'll need to apply the above procedure in chunks.
I wrote this:
function findFittingFromCache($from, $to, $cache){
    $totalLength = abs($to - $from);
    // keep cached chunks that lie inside [$from, $to]; the 0.1 measures the
    // usefulness of a chunk (it must cover more than 10% of the total length)
    $candidates = array_filter($cache, function($val) use ($from, $to, $totalLength){
        $chunkLength = abs($val['e'] - $val['s']);
        if($from <= $val['s'] && $to >= $val['e'] && ($chunkLength/$totalLength > 0.1)){
            return true;
        }
        return false;
    });
    // sort so that the values of $x['s'] are non-decreasing
    usort($candidates, function($a, $b){ return $a['s'] - $b['s']; });
    $flowCheck = $from;
    $needToCompute = array();
    foreach($candidates as $key => $val){
        if($val['s'] < $flowCheck){
            // already using something that covers this interval
            unset($candidates[$key]);
        } else {
            if($val['s'] > $flowCheck){
                // save the gap that will need to be computed
                $needToCompute[] = array('s'=>$flowCheck, 'e'=>$val['s']);
            }
            // advance the starting position for the next iteration
            $flowCheck = $val['e'];
        }
    }
    // the rest needs to be computed as well
    if($flowCheck < $to){
        $needToCompute[] = array('s'=>$flowCheck, 'e'=>$to);
    }
    return array("computed"=>$candidates, "missing"=>$needToCompute);
}
The function returns two arrays: "computed" holds the already computed pieces that were found, and "missing" holds the gaps between them that still have to be computed.
Inside the function there is a 0.1 threshold, which disqualifies chunks shorter than 10% of the total searched length. You can rewrite the function to take the threshold as a parameter, or omit it completely.
I presume the results will be stored and, after computing, added to the cache ($oldResults), which might take any form (for example a database, as Jeff Puckett suggested). Do not forget to add all computed chunks and the whole requested interval to the cache.
I am sorry, but I can't find a way without loops and ifs.
Working demo:
link
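To show how the cached pieces and the gaps combine into one result, here is a self-contained sketch. It does not call the function above; calculateWithCache() is a hypothetical name, the cache rows use the 's', 'e', 'r' keys from the question, and the $compute callback is a cheap stand-in for the expensive calculate(). The selection is greedy (earliest start first, longer chunk on ties), so it is not guaranteed to reuse the largest possible amount of cached work.

```php
<?php
/**
 * Sum an interval by reusing cached sub-intervals where possible and calling
 * $compute only for the gaps. Relies on the additivity stated in the question.
 * (calculateWithCache is an illustrative name.)
 */
function calculateWithCache(int $start, int $end, array $cache, callable $compute): array
{
    // Keep only cached intervals that lie fully inside [$start, $end].
    $usable = array_filter($cache, function ($c) use ($start, $end) {
        return $c['s'] >= $start && $c['e'] <= $end && $c['e'] > $c['s'];
    });
    // Sort by start; for equal starts, put the longer interval first.
    usort($usable, function ($a, $b) {
        return ($a['s'] <=> $b['s']) ?: ($b['e'] <=> $a['e']);
    });

    $total = 0;
    $pos = $start;
    $computed = []; // gaps that had to be calculated from scratch
    foreach ($usable as $c) {
        if ($c['s'] < $pos) {
            continue; // overlaps a piece that is already in use
        }
        if ($c['s'] > $pos) {
            $total += $compute($pos, $c['s']);
            $computed[] = [$pos, $c['s']];
        }
        $total += $c['r'];
        $pos = $c['e'];
    }
    if ($pos < $end) {
        $total += $compute($pos, $end);
        $computed[] = [$pos, $end];
    }
    return ['result' => $total, 'computed' => $computed];
}

$oldResults = [
    ['s' => 1, 'e' => 2, 'r' => 1],
    ['s' => 2, 'e' => 6, 'r' => 4],
    ['s' => 4, 'e' => 7, 'r' => 3],
];
// Stand-in for the expensive function: additive over adjacent intervals.
$cheap = function ($s, $e) { return $e - $s; };

$out = calculateWithCache(1, 10, $oldResults, $cheap);
print_r($out);
```

With the question's cache, this reuses 1-2 and 2-6, skips 4-7 because it overlaps, and only computes 6-10.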
I have scraped 5000 pages and stored them in individual files (0-4999.txt); now I need to find duplicate content in them. So I am comparing each file with every other file in a nested loop (ETA 82 hours). This approach will definitely take hours to complete. My main concern here is the number of iterations. Can anyone suggest a better approach to cut down the iterations and reduce the time taken?
Current code (NCD algorithm):
function ncd_new($sx, $sy, $prec=0, $MAXLEN=9000) {
    # NCD with gzip artifact correction and percentage return.
    # sx,sy = strings to compare.
    # Use $prec=-1 for result range [0-1], $prec=0 for percentage.
    # For NCD definition see http://arxiv.org/abs/0809.2553
    $x = $min = strlen(gzcompress($sx));
    $y = $max = strlen(gzcompress($sy));
    $xy= strlen(gzcompress($sx.$sy));
    $a = $sx;
    if ($x>$y) { # swap min/max
        $min = $y;
        $max = $x;
        $a = $sy;
    }
    $res = ($xy-$min)/$max; # NCD definition.
    if ($MAXLEN<0 || $xy<$MAXLEN) {
        $aa= strlen(gzcompress($a.$a));
        $ref = ($aa-$min)/$min;
        $res = $res - $ref; # correction
    }
    return ($prec<0)? $res: 100*round($res,2+$prec);
}
Looping over each file:
$totalScraped = 5000;
$stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
for ($fileC = 0; $fileC < $totalScraped; $fileC++)
{
    $f1 = file_get_contents($fileC.".txt");
    $file1 = preg_replace($stripstr, '', $f1);
    // 1+fileC => exclude already compared files (and the file itself)
    // eg. if fileC=10 , start loop 11 to 4999
    for ($fileD = (1 + $fileC); $fileD < $totalScraped; $fileD++)
    {
        $f2 = file_get_contents($fileD.".txt", FILE_USE_INCLUDE_PATH);
        $file2 = preg_replace($stripstr, '', $f2);
        $total = ncd_new($file1, $file2);
        echo "$fileC.txt vs $fileD.txt is: $total%\n";
    }
}
You may want to find a way to distinguish likely candidates from unlikely ones.
So, maybe there is a way that you can compute a value for each file (say: a word count, a count of sentences / paragraphs... maybe even a count of individual letters), to identify the unlikely candidates beforehand.
If you could achieve this, you could reduce the amount of comparisons by ordering your arrays by this computed number.
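As one possible reading of this idea, the sketch below uses plain string length as the cheap per-file value and only proposes pairs whose lengths are close. The name candidatePairs() and the $tolerance parameter are made up for this example, and whether length is a safe signal depends on what counts as a duplicate in your data. The expensive comparison would then run only on the returned pairs.

```php
<?php
/**
 * Return index pairs worth comparing in detail: documents whose lengths
 * differ by at most $tolerance, as a fraction of the longer one.
 * (candidatePairs and $tolerance are illustrative names.)
 */
function candidatePairs(array $docs, float $tolerance = 0.2): array
{
    $lengths = array_map('strlen', $docs);
    asort($lengths);               // shortest first, original indexes kept as keys
    $order = array_keys($lengths);
    $n = count($order);
    $pairs = [];

    for ($i = 0; $i < $n; $i++) {
        $short = $lengths[$order[$i]];
        for ($j = $i + 1; $j < $n; $j++) {
            $long = $lengths[$order[$j]];
            if ($long - $short > $tolerance * $long) {
                break; // every later document is longer still
            }
            $pairs[] = [min($order[$i], $order[$j]), max($order[$i], $order[$j])];
        }
    }
    return $pairs;
}

$docs = ['aaaaaaaaaa', 'aaaaaaaaab', str_repeat('x', 100), 'short'];
$pairs = candidatePairs($docs);
print_r($pairs);
```

Because the lengths are sorted, the inner loop can stop at the first document that is too long, so files of very different sizes are never paired at all.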
Another process that I tried was:
Strip HTML tags from the page.
Replace \s{2,} with \s and \n{2,} with \n, so that the text between each pair of tags ends up on a single line (almost).
Compare two such generated files by taking a line and preg_matching it: if found -> duplicate; otherwise break the line into an array of words and calculate array_intersect, and if the count is 70% or more of the line length -> duplicate.
This was very efficient, and I could compare 5000 files in ~10 minutes,
but that was still too slow for my requirements.
So I implemented the first approach, the "NCD algo" method, in C, and it completes the task within 5-10 seconds (depending on the average page size).
I am trying to calculate an average without being thrown off by a small set of far-off numbers (e.g., in 1,2,1,2,3,4,50 the single 50 throws off the entire average).
If I have a list of numbers like so:
19,20,21,21,22,30,60,60
The average is 31
The median is 30
The mode is 21 & 60 (averaged to 40.5)
But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that major range it's 20.6 (very different from any of the numbers above).
I am thinking that you can get this like so:
c+d-r
Where c is the count of numbers, d is the number of distinct values, and r is the range. Then you can apply this to all the possible ranges, and the highest score marks the optimal range to take an average from.
For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6
If you applied this to the entire number list it would be 8+6-41=-27
I think this works pretty well, but I have to create a huge loop to test all possible ranges. In just my small example there are 21 possible ranges:
19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
I am wondering if there is a more efficient way to get an average like this.
Or does someone have a better algorithm altogether?
You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
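A minimal sketch of that idea follows. The function name meanWithoutOutliers() and the parameter $k (how many standard deviations to allow) are illustrative, and the population standard deviation is used here as an assumption; the sample version would work just as well for this purpose.

```php
<?php
/**
 * Mean of the values that lie within $k standard deviations of the overall mean.
 * (meanWithoutOutliers and $k are illustrative names.)
 */
function meanWithoutOutliers(array $values, float $k = 1.0): float
{
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one value');
    }
    $mean = array_sum($values) / $n;

    // Population standard deviation.
    $sumSq = 0.0;
    foreach ($values as $v) {
        $sumSq += ($v - $mean) ** 2;
    }
    $sd = sqrt($sumSq / $n);

    // Throw out everything further than $k standard deviations from the mean.
    $kept = array_filter($values, function ($v) use ($mean, $sd, $k) {
        return abs($v - $mean) <= $k * $sd;
    });
    if (count($kept) === 0) {
        return $mean; // a very small $k can reject everything
    }
    return array_sum($kept) / count($kept);
}

$avg = meanWithoutOutliers([19, 20, 21, 21, 22, 30, 60, 60]);
echo $avg, "\n";
```

With the list from the question and one standard deviation, both 60s are dropped and the remaining six values (19 to 30) average to about 22.17.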
Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number-summary often used to figure these things out.
function get_median($arr) {
    sort($arr);
    $n = count($arr);
    $mid = intdiv($n, 2);
    if ($n % 2 == 0) {
        // even count: average the two middle values
        return ($arr[$mid - 1] + $arr[$mid]) / 2;
    }
    return $arr[$mid];
}
function get_five_number_summary($arr) {
    sort($arr);
    $n = count($arr);
    // lower and upper halves; the middle value is left out when the count is odd
    $lower_half = array_slice($arr, 0, intdiv($n, 2));
    $upper_half = array_slice($arr, intdiv($n + 1, 2));
    // minimum, lower quartile, median, upper quartile, maximum
    return array($arr[0], get_median($lower_half), get_median($arr), get_median($upper_half), $arr[$n - 1]);
}
function find_outliers($arr) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    // fences at 1.5 interquartile ranges beyond the quartiles
    $low = $fns[1] - 1.5 * $interquartile_range;
    $high = $fns[3] + 1.5 * $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low)
            echo "$v is an outlier<br>";
    }
}
//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);
Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!
To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm
This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
Why don't you use the median? It's not 30, it's 21.5.
You could put the values into an array, sort the array, and then find the median, which is usually a better number than the average anyway because it discounts outliers automatically, giving them no more weight than any other number.
You might sort your numbers, choose your preferred subrange (e.g., the middle 90%), and take the mean of that.
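This is usually called a trimmed mean. A minimal sketch, where trimmedMean() and $trim (the fraction dropped at each end) are illustrative names:

```php
<?php
/**
 * Mean of the middle part of a list: drop the fraction $trim of the values
 * at each end after sorting. $trim = 0.05 keeps the middle 90%.
 * (trimmedMean and $trim are illustrative names.)
 */
function trimmedMean(array $values, float $trim = 0.05): float
{
    $n = count($values);
    if ($n === 0 || $trim < 0 || $trim >= 0.5) {
        throw new InvalidArgumentException('Need values and 0 <= trim < 0.5');
    }
    sort($values);
    $cut = (int) floor($n * $trim);              // values to drop at each end
    $middle = array_slice($values, $cut, $n - 2 * $cut);
    return array_sum($middle) / count($middle);
}

$list = [19, 20, 21, 21, 22, 30, 60, 60];
echo trimmedMean($list, 0.25), "\n"; // keeps 21, 21, 22, 30
```

On a list as short as the one in the question, trimming 5% per end removes nothing, so a larger fraction such as 25% per end is needed before the result moves away from the plain mean.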
There is no one true answer to your question, because there are always going to be distributions that will give you a funny answer (e.g., consider a biased bi-modal distribution). This is why many statistics are often presented using box-and-whisker diagrams showing mean, median, quartiles, and outliers.
I need rand(1,N), but excluding array(a,b,c,..).
Is there already a built-in function that I don't know of, or do I have to implement it myself (and how)?
UPDATE
A qualifying solution should perform well whether the excluded array is big or small.
No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
    sort($exceptions); // lets us use break; in the foreach reliably
    $number = rand($from, $to - count($exceptions)); // or mt_rand()
    foreach ($exceptions as $exception) {
        if ($number >= $exception) {
            $number++; // make up for the gap
        } else /*if ($number < $exception)*/ {
            break;
        }
    }
    return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
Eye-candy: Somewhat minimalistic visualisation of the algorithm.
I don't think there's such a function built in; you'll probably have to code it yourself.
To code this, you have two solutions:
Use a loop that calls rand() or mt_rand() until it returns a valid value. This means calling rand() several times in the worst case, but it should work OK if N is big and you don't have many forbidden values.
Build an array that contains only legal values and use array_rand to pick one value from it. This will work fine if N is small.
Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random numbers from the same set, this could be a good way to go.
The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
    $num = mt_rand($min, $max);
    return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>
What you need to do is calculate an array of skipped locations, so you can pick a random position in a continuous array of length M = N - (number of exceptions) and easily map it back to the original array with holes. This will require time and space proportional to the size of the exceptions array. I don't know PHP from a hole in the ground, so forgive the textual semi-pseudocode example.
Make a new array Offset[] the same length as the Exceptions array.
In Offset[i], store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now, to pick a random element, select a random number r in 0..M-1, where M is the number of remaining elements.
Find i such that Offset[i] <= r < Offset[i+1]; this is easy with a binary search.
Return r + i.
Now, that is just a sketch; you will need to deal with the ends of the arrays, whether things are indexed from 0 or 1, and all that jazz. If you are clever you can actually compute the Offset array on the fly from the original, though it is a bit less clear that way.
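One way the steps above might look in PHP is sketched below. The function names mapToAllowed() and randExcluding() are made up for this example, and the offsets are computed on the fly as $exceptions[$i] - $from - $i instead of being stored in a separate array. It assumes integer bounds and treats duplicate or out-of-range exceptions as noise to be filtered out.

```php
<?php
/**
 * Map a position $r in the compacted range (exceptions removed) back to the
 * real value. $exceptions must be sorted, unique and inside the range.
 * (mapToAllowed and randExcluding are illustrative names.)
 */
function mapToAllowed(int $r, int $from, array $exceptions): int
{
    // Binary search for how many exceptions lie before the result:
    // exception $i is skipped once its offset is <= $r.
    $lo = 0;
    $hi = count($exceptions);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($exceptions[$mid] - $from - $mid <= $r) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }
    return $from + $r + $lo;
}

function randExcluding(int $from, int $to, array $exceptions): int
{
    // Normalise the exceptions: in range, unique, sorted, re-indexed from 0.
    $exceptions = array_filter($exceptions, function ($e) use ($from, $to) {
        return $e >= $from && $e <= $to;
    });
    $exceptions = array_values(array_unique($exceptions));
    sort($exceptions);

    $m = ($to - $from + 1) - count($exceptions); // number of allowed values
    if ($m <= 0) {
        throw new RangeException('No allowed values left');
    }
    return mapToAllowed(mt_rand(0, $m - 1), $from, $exceptions);
}

// Every compacted position for 1..10 without 3, 4 and 7:
$mapped = [];
for ($r = 0; $r < 7; $r++) {
    $mapped[] = mapToAllowed($r, 1, [3, 4, 7]);
}
echo implode(',', $mapped), "\n";
echo randExcluding(1, 10, [3, 4, 7]), "\n";
```

Each draw makes exactly one call to the random generator and does a binary search over the exceptions, so the cost per draw does not depend on N, only on the sorting done up front.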
Maybe it's too late for an answer, but I came up with this piece of code when trying to get random data from a database based on a random ID while excluding some numbers.
$excludedData = array(); // This is your excluded number
$maxVal = $this->db->count_all_results("game_pertanyaan"); // Get the maximum number based on my database
$randomNum = rand(1, $maxVal); // Make first initiation, I think you can put this directly in the while > in_array paramater, seems working as well, it's up to you
while (in_array($randomNum, $excludedData)) {
$randomNum = rand(1, $maxVal);
}
$randomNum; //Your random number excluding some number you choose
A short way to do it, which also lets you draw several distinct values at once (note that it builds the whole range in memory):
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);