In summary I am using stream_get_line to read a line of a file, replace a string and then write the line to another file.
I am using stream_get_line and supplying the "ending" parameter to instruct the function to read lines, or if there is no new line then read 130 bytes.
What I would like to know is how can I know if the 3rd parameter (PHP_EOL) was found, as I need to write exactly the same line (except for my string replacement) to the new file.
For reference...
string stream_get_line ( resource $handle , int $length [, string $ending ] )
It's mainly needed for the last line, sometimes it will contain a newline character and sometimes it doesn't.
My initial idea is to seek to the last line of the file and search the line for a new line character to see if I need to attach a newline to my edited line or not.
You could try using fgets if the stream is in ASCII mode (which only matters on Windows). That function will include the newline if it is found:
$line = fgets(STDIN, 131);
Otherwise, you could use ftell to see how many bytes were read and thus determine whether there was a line ending. For example, if foo.php contains
<?php
while (!feof(STDIN)) {
    $pos = ftell(STDIN);
    $line = stream_get_line(STDIN, 74, "\n");
    // Any bytes consumed beyond strlen($line) mean the delimiter was found and skipped.
    $ended = (bool)(ftell(STDIN) - strlen($line) - $pos);
    echo ($ended ? "YES " : "NO ") . $line . "\n";
}
executing echo -ne {1..100} '\n2nd to last line\nlast line' | php foo.php will give this output:
NO 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
NO 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 5
NO 3 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
YES 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
YES 2nd to last line
NO last line
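The fgets suggestion can also be turned into a small copy loop. This is only a sketch: copy_with_replace() and split_line_ending() are hypothetical helper names (not PHP built-ins), and the 131 limit simply mirrors the 130 bytes from the question. Because fgets keeps the line ending in the returned string, whatever ending was read is written back unchanged, including a last line that has no newline at all.

```php
<?php
/**
 * Copies $in to $out, replacing $search with $replace in each chunk.
 * fgets() keeps the trailing newline (if any), so endings are preserved.
 * Caveat: a line longer than $chunk - 1 bytes is read in several pieces,
 * so a search string straddling two pieces would not be replaced.
 */
function copy_with_replace($in, $out, string $search, string $replace, int $chunk = 131): void
{
    while (($line = fgets($in, $chunk)) !== false) {
        fwrite($out, str_replace($search, $replace, $line));
    }
}

/** Splits a chunk into [body, ending]; ending is "", "\n" or "\r\n". */
function split_line_ending(string $line): array
{
    $body = rtrim($line, "\r\n");
    return [$body, substr($line, strlen($body))];
}
```

If you need to know explicitly whether the delimiter was present, split_line_ending() returns the ending separately, and an empty ending means the chunk stopped without a newline.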
I am trying to create Presigned URLs for users to access content from an S3 bucket.
The below code was working fine and all of a sudden I am getting the below error when opening any pre-signed URL that is created.
public function getPresignedUri($p)
{
$s3 = new S3Client([
'region' => getenv('S3_REGION'),
'version' => 'latest',
]);
$cmd = $s3->getCommand('GetObject', [
'Bucket' => getenv('S3_BUCKET'),
'Key' => 'casts/'. $p['file']
]);
$request = $s3->createPresignedRequest($cmd, '+1 hour');
return (string) $request->getUri();
}
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>ASIA3DM6Y5GJC4FYJAFC</AWSAccessKeyId><StringToSign>AWS4-HMAC-SHA256
20180821T072223Z
20180821/ap-southeast-2/s3/aws4_request
fc4f1139d3b146ae027bd0bfc0b3d6dacda81d711b062e0d93a65d04a61aa268</StringToSign><SignatureProvided>5f5d3ae9ef3d9cdfc0d039c39302c584dcfc93f5a94a0f1770bf6781d6958198</SignatureProvided><StringToSignBytes>41 57 53 34 2d 48 4d 41 43 2d 53 48 41 32 35 36 0a 32 30 31 38 30 38 32 31 54 30 37 32 32 32 33 5a 0a 32 30 31 38 30 38 32 31 2f 61 70 2d 73 6f 75 74 68 65 61 73 74 2d 32 2f 73 33 2f 61 77 73 34 5f 72 65 71 75 65 73 74 0a 66 63 34 66 31 31 33 39 64 33 62 31 34 36 61 65 30 32 37 62 64 30 62 66 63 30 62 33 64 36 64 61 63 64 61 38 31 64 37 31 31 62 30 36 32 65 30 64 39 33 61 36 35 64 30 34 61 36 31 61 61 32 36 38</StringToSignBytes><CanonicalRequest>GET
/casts/5B735D22BCB17.mp4
X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=ASIA3DM6Y5GJC4FYJAFC%2F20180821%2Fap-southeast-2%2Fs3%2Faws4_request&X-Amz-Date=20180821T072223Z&X-Amz-Expires=3600&X-Amz-Security-Token=FQoGZXIvYXdzEPn%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDEMaw7h8OwK6f6QN0SLBA4%2B9LzXs7OMNjW7HDqr1jhuK%2FshbOMDBoF00GHqTWUuJWXQuL4ptYpvWRjwpris0USWPMTx0O3WeKacvtw6oN2M1KRoUe3IcNOpFwaixKw8%2Fo5FKXK%2BSCo%2F7U%2B76V4aEFuuWEZkC5qhm9R7ChB7vDNTlmYXx2GOzL2uZYV8dZrAnrUfU5qWpyI4IQb8DvPnpDWB0OgA2SRvuGzkwkVLEtMmHS2SMU32gwX2Oy6YnMswWZeqVQ%2FovfWbxd5AA4O%2BFNQfcNM5l4jsuR2zV8FiKZ3jQRLgfQx5uvydv6FFzb90SDbvUjZd0aAsR1Mre%2FnoQodezAm0xoA5618%2FWd%2BIh3jouN2RflRM3II8UXCWzFFq2NL%2FxweJu2mYXfKNpTkqOEls5dFMo2OWQa3IGXJqT3EZEZKXcQ3z%2F2aOP%2Fyw%2F2GtPdQrdJziwN4lTXyl6%2FGZYd968yjlU6pIk6vB0NVq9q3wKjBiwlsfGTlaJnFJH7DD%2FIY4U6fYOmvAcGnoozAbIcqDZpDPNrvZX75tzSatHHLyQoF56STZPhWK7cCWEo2JWAzg6NE4xBmypFG%2Bkxtv0QtrcUNYD35FvFGbjheUhMnyOKOTz7tsF&X-Amz-SignedHeaders=host
host:app-assets-dev-ap-southeast-2-cmpny.s3.ap-southeast-2.amazonaws.com
host
UNSIGNED-PAYLOAD</CanonicalRequest>
AWS SDK version: aws/aws-sdk-php (3.65.0)
One small difference I can see in the URL is that when it was working it had
X-Amz-SignedHeaders=host
and now it has
X-Amz-SignedHeaders=host%3Bx-amz-security-token
Although not sure what could be causing that extra string??
EDIT 1:
I identified that this issue was due to SDK version 3.65; when I rolled back to 3.31 there was no issue.
However, I am not marking this as resolved, as I would like to know why a small version change like this caused such a significant error.
I can see there are major differences in the src/Signature/SignatureV4.php file, specifically:
$parsed['query']['X-Amz-SignedHeaders'] = 'host'; (V3.31)
and
$parsed['query']['X-Amz-SignedHeaders'] = implode(';', $this->getPresignHeaders($parsed['headers'])); (V3.65)
However, changing that line alone didn't fix it; replacing the whole file did fix the error.
I logged this issue on Github - https://github.com/aws/aws-sdk-php/issues/1609
And it was tested, confirmed and resolved very quickly - https://github.com/aws/aws-sdk-php/pull/1610
This is resolved.
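For anyone comparing working and broken URLs the way I did above, a small helper makes the difference easier to see. This is just a diagnostic sketch in plain PHP: signed_headers_of() is a hypothetical name, it does not use the AWS SDK, and example.com stands in for the real bucket host.

```php
<?php
/**
 * Returns the list of headers named in X-Amz-SignedHeaders of a presigned URL.
 * parse_str() URL-decodes the value, so "host%3Bx-amz-security-token"
 * becomes "host;x-amz-security-token" before it is split.
 */
function signed_headers_of(string $presignedUrl): array
{
    $query = parse_url($presignedUrl, PHP_URL_QUERY);
    if (!is_string($query)) {
        return [];                      // no query string at all
    }
    parse_str($query, $params);
    $raw = $params['X-Amz-SignedHeaders'] ?? '';
    return $raw === '' ? [] : explode(';', $raw);
}

// Made-up URL for illustration only.
$url = 'https://example.com/casts/file.mp4'
     . '?X-Amz-Algorithm=AWS4-HMAC-SHA256'
     . '&X-Amz-SignedHeaders=host%3Bx-amz-security-token';
print_r(signed_headers_of($url));
```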
I have looked for a year to try to figure this one out. I am trying to build a bracket running system, for running bowling brackets.
I have a table with an ID column and a BowlerID column, call it bowling_bracket_entries. The ID is unique, but there can be multiple entries of the same BowlerID, ranging from 8 to 1 entry. What I want to do is make pairs from the BowlerID row, but never repeat the same pair, then from those pairings, put them in groups of 4 pairs where no BowlerID repeats within that group of 4 pairings.
Structure of the bowling.bracket_entries table
ID | BowlerID
766 151
767 230
768 201
769 202
770 140
771 205
772 62
773 75
774 56
775 140
759 129
760 60
761 165
762 223
763 145
764 131
765 145
704 197
705 230
706 202
707 167
708 223
709 205
710 217
711 217
712 56
713 60
714 141
715 60
716 193
717 181
718 217
719 75
720 218
721 151
722 223
723 202
724 197
725 140
726 220
727 203
728 56
729 62
730 218
731 160
732 205
733 141
734 167
735 165
736 151
737 205
738 224
739 203
740 142
741 181
742 60
743 60
744 218
745 217
746 224
747 160
748 218
749 223
750 203
751 193
752 202
753 62
754 60
755 142
756 201
757 151
758 203
I tried randomly selecting 2 BowlerID's and putting them together with a delimiter (ie 22~100), then inserting into a Pairings table, then pull the next pairing (ie 36~92), create a variable reverse of that pair (ie 92~36), and check the Pairing table for values that match either, if not found, it inserts, removes the ID of those BowlerIDs from the Entries table and repeats until it runs out of values. Problem is sometimes I get a BowlerID paired with itself. Occasionally, I will get a complete list with no BowlerID's paired with themselves.
SELECT bracket_entries.ID, bracket_entries.BowlerID FROM bracket_entries ORDER BY rand() LIMIT 2
Then put them together and create a pairing (ie 36~68)
$i = 0;
while($pairing=$rsNewPair->fetch_assoc()) {
//Build Pairing List
$thisPairing .= $pairing['BowlerID'];
$IDS .= $pairing['ID'];
$i++;
if($i < 2){
$thisPairing .= "~";
$IDS .= "~";
}
}
$flipFlop = explode('~', $thisPairing);
$reversePairing = $flipFlop[1].'~'.$flipFlop[0];
if($flipFlop[0] == $flipFlop[1]){
header("Refresh:0");
}
And compare to what is already in there.
SELECT bracket_pairings.Pairing FROM bracket_pairings WHERE bracket_pairings.Pairing = '".$thisPairing."' OR bracket_pairings.Pairing = '".$reversePairing."'"
If it doesn't find anything, then insert the pairing into the Pairings table and move on to the next 2
bowling_bracket_pairings table structure
1 203~218
2 193~218
3 217~129
4 201~60
5 60~141
6 141~165
7 197~202
8 230~203
9 220~167
10 60~62
11 151~140
12 151~230
13 193~205
14 60~140
15 217~223
16 203~142
17 60~205
18 197~151
19 205~201
20 218~62
21 56~223
22 217~167
23 56~202
24 217~75
25 224~223
26 160~203
27 151~60
28 131~145
29 140~205
30 202~75
31 62~160
32 142~181
33 224~181
34 145~223
35 165~56
36 218~202
SELECT
PairingID, SUBSTRING_INDEX(Pairing, '~', 1) AS entry1,
SUBSTRING_INDEX(SUBSTRING_INDEX(Pairing, '~', 2), '~', -1) AS entry2
FROM bracket_pairings
Then use a while loop to display the pairings in brackets and push each entry into an array for the 4 pairs until it is full and then compare to make sure any user is not duplicated.
while(($pairings=$rsEntries->fetch_assoc())&&($loop < 5)){
$thisBowlerID1 = $pairings['entry1'];
$thisBowlerID2 = $pairings['entry2'];
if((!in_array($thisBowlerID1, $thisBracket)) || (!in_array($thisBowlerID2, $thisBracket))){
while($players=$rsPlayers->fetch_assoc()){
if($players['BowlerID'] == $thisBowlerID1){
echo $players['BowlerID'].'<br>';
//echo $players['Name'].'('.$players['CurrentAvg'].')<br>';
}
} mysqli_data_seek($rsPlayers, 0);
array_push($thisBracket, $thisBowlerID1);
while($players=$rsPlayers->fetch_assoc()){
if($players['BowlerID'] == $thisBowlerID2){
echo $players['BowlerID'].'<br><br>';
//echo $players['Name'].'('.$players['CurrentAvg'].')<br><br>';
}
} mysqli_data_seek($rsPlayers, 0);
array_push($thisBracket, $thisBowlerID2);
$removeSQL="DELETE FROM bracket_pairings WHERE bracket_pairings.PairingID = ".$pairings['PairingID'];
$removePairing = $connAdmin->query($removeSQL);
$loop++;
}
$thisBracket = array();
}
}
I have 72 entries. When I try to put them in groups of 4 pairings (8 entries), it never seems to fill up the 9 brackets, only about 7.5, and it leaves a random assortment of pairings in the table that didn't get placed, even though I still have openings.
Result
Bracket 1
62
141
142
151
131
218
140
56
Bracket 2
145
201
193
160
56
205
129
203
Bracket 3
167
75
217
201
224
217
230
140
Bracket 4
60
193
203
197
141
167
223
220
Bracket 5
60
165
202
142
181
60
202
202
Bracket 6
205
140
62
218
217
60
230
223
Bracket 7
165
223
205
218
205
75
56
151
Bracket 8
202
203
As you can see, the result leaves Bracket 8 unfilled.
Here is what is left over that didn't get included:
5 197~218
10 60~223
15 181~62
20 203~60
25 160~217
30 151~151
35 145~224
I'm not sure why every fifth pairing was skipped. I think I am on the right track, but any help or ideas on how to fix the issues I am having would be great.
Okay, first: don't save the pairings as a string like "203~60". It makes the database harder to work with when you have to combine and split the values all the time. Your tables should be in third normal form (3NF), so each BowlerID of a pairing belongs in its own column.
Second: don't save the pairings in the database while you are still building them. Keep them in memory in your PHP script to avoid unnecessary database calls just to see whether a pairing has already been added; it is much faster that way.
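As a rough sketch of what "keep them in memory" could look like (pair_key() and try_add_pair() are hypothetical helper names, and I'm assuming the BowlerIDs are integers): build an order-independent key for each pairing, so the reversed pairing and a self-pairing are both rejected without any query.

```php
<?php
/** Order-independent key, so 36/92 and 92/36 count as the same pairing. */
function pair_key(int $a, int $b): string
{
    return min($a, $b) . ':' . max($a, $b);
}

/** Adds the pairing unless it is a self-pairing or already present. */
function try_add_pair(array &$pairs, int $a, int $b): bool
{
    if ($a === $b) {
        return false;               // a bowler cannot be paired with themselves
    }
    $key = pair_key($a, $b);
    if (isset($pairs[$key])) {
        return false;               // same pairing already stored, in either order
    }
    $pairs[$key] = [$a, $b];
    return true;
}
```

Once the list is complete you can write it to the database in one go, with one column per BowlerID.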
That being said, there are some algorithms you can lookup for your problem. You should check the following links:
Is there a known algorithm for scheduling tournament matchups? and its related questions on softwareengineering.stackexchange.com (this might even be a better place to ask, but check for duplicates first)
https://en.wikipedia.org/wiki/Matching_%28graph_theory%29
https://en.wikipedia.org/wiki/Round-robin_tournament#Scheduling_algorithm
https://en.wikipedia.org/wiki/Backtracking
I can think of an algorithm, but it fails in some situations. It would work like this:
You use the algorithm from https://en.wikipedia.org/wiki/Round-robin_tournament#Scheduling_algorithm to create a schedule for 9 teams with 8 members each. Let's assume they are called "a" to "i". The pairings will look like this:
abcd aibc ahib aghi afgh
hgfe gfed fedc edcb dcbi
aefg adef acde bcfg
cbih bihg ihgf dehi
You get this seeding by holding the "a" team in place and rotating the remaining teams around the table of pairings. However, you have to skip one team in each group, since you have 9 teams for 4*2 possible seeds. In the ninth group the "a" team is missing, and it contains the remaining seedings of "b" to "i".
When we have these 9 teams with 8 members each they could be represented as this:
aaaaaaaa
bbbbbbbb
cccccccc
dddddddd
eeeeeeee
ffffffff
gggggggg
hhhhhhhh
iiiiiiii
When you have more than 9 teams, you should try to combine them as if they belonged to one pseudo-team of size 8. This could look like this:
aaaaaabb
ccccddde
fffffggg
hhiijjjj
kkkkklll
mmmmnnnn
ooopppqq
rrrrrsss
ttuuvvww
Since these teams would be on "the same" pseudo team, they don't match against each other and the algorithm still works.
However, the algorithm fails when you cannot put the teams into pseudo-teams of size 8. Assume you have 2 teams of 8 members and 8 teams of size 7. The pseudo-teams would look like this:
aaaaaaaa
bbbbbbbb
cccccccj
dddddddj
eeeeeeej
fffffffj
gggggggj
hhhhhhhj
iiiiiiij
In this situation, the "8th" player of row "c" might eventually play against the "8th" player of row "d", although they are actually on the same team. You could try to work around this by moving the "8th" player of row "c" to a different place in the "c" row, but once you are down this road of patching, you might as well use a backtracking algorithm instead.
With backtracking you brute-force all the combinations and abandon a combination as soon as you find that it cannot lead to a valid solution. Check the URL above to understand backtracking (the animated GIF might be helpful).
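To make the backtracking idea a bit more concrete, here is a minimal sketch for the grouping step only. It assumes the pairings are already in memory as [bowlerA, bowlerB] arrays of integers; fill_brackets() and extend_bracket() are hypothetical names, and nothing here is tuned for speed. It returns null when no valid grouping exists, for example when the number of pairings is not a multiple of the bracket size or when a bowler is paired with themselves.

```php
<?php
/** Groups pairings into brackets of $perBracket pairings with no repeated bowler. */
function fill_brackets(array $pairs, int $perBracket = 4): ?array
{
    if ($pairs === []) {
        return [];                                  // everything placed
    }
    $keys = array_keys($pairs);
    $firstKey = array_shift($keys);                 // first remaining pairing opens the bracket
    if ($pairs[$firstKey][0] === $pairs[$firstKey][1]) {
        return null;                                // self-pairing can never be placed
    }
    $rest = $pairs;
    unset($rest[$firstKey]);
    return extend_bracket([$pairs[$firstKey]], $rest, $keys, 0, $perBracket);
}

function extend_bracket(array $bracket, array $rest, array $keys, int $from, int $perBracket): ?array
{
    if (count($bracket) === $perBracket) {
        $others = fill_brackets($rest, $perBracket);
        return $others === null ? null : array_merge([$bracket], $others);
    }
    $used = array_merge(...$bracket);               // bowlers already in this bracket
    for ($i = $from; $i < count($keys); $i++) {
        [$a, $b] = $rest[$keys[$i]];
        if ($a === $b || in_array($a, $used, true) || in_array($b, $used, true)) {
            continue;                               // would repeat a bowler: skip this candidate
        }
        $next = $rest;
        unset($next[$keys[$i]]);
        $result = extend_bracket(array_merge($bracket, [[$a, $b]]), $next, $keys, $i + 1, $perBracket);
        if ($result !== null) {
            return $result;                         // this choice led to a full solution
        }
        // otherwise fall through and try the next candidate (the backtracking step)
    }
    return null;
}
```

With 36 pairings this can still take a while in unlucky cases, since it is brute force with early exits.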
I have a textarea containing this string:
__A 59.202x5p.
__B 611.08 500p
__C 991,70p.66.113.552.77.88.10p 199x200p
__C2 33 44x100p 55 161x150p 25 33 85x60p 727 77 373 22x220p
__C3 44 16 59x10p 343 x15p 172 200p
I want output like this:
__A 59.20 02x5p.
__B 61 11.08 500p
__C 99 91,70p.66.11 13.55 52.77.88.10p 19 99x200p
__C2 33 22 44x100p 55 16 61 x150p 25 33 85x60p 72 27 77 37 73 22x220p
__C3 44 16 59x10p 34 43 x15p 17 72 200p
If a number is in the hundreds (three digits) and is not the number in an "x?p" or " ?p" token (where ? is an arbitrary number that must not be split), it should be split, so the lines become:
__A 59.202x5p. >>> __A 59.20 02x5p.
__B 611.08 500p >>> __B 61 11.08 500p
__C 991,70p.66.113.552.77.88.10p 199x200p >>> __C 99 91,70p.66.11 13.55 52.77.88.10p 19 99x200p
...
I have tried preg_match, preg_replace and substr, but I can't work out how to locate the three-digit numbers that come before "x?p" or " ?p" (where ? is an arbitrary number that must not be split).
I also don't understand how to split a number like this:
__A 59."202"x5p. ( 202 to 20 02 ) >>> __A 59.20 02x5p.
__B 611.08 500p ( 611 to 61 11 ) >>> __B 61 11.08 500p
My English is not good; I hope whoever reads my question can understand it and help me solve it.
Thank you very much.
Check the following code..
<?php
echo "<u>CURRENT STRING</u><br/>";
echo $value ="__A 59.202x5p.
__B 611.08 500p
__C 991,70p.66.113.552.77.88.10p 199x200p
__C2 33 44x100p 55 161x150p 25 33 85x60p 727 77 373 22x220p
__C3 44 16 59x10p 343 x15p 172 200p";echo '<br>';
for($i=0;$i<=(strlen($value)-4); $i++ ) {
$myvar = $value[$i].$value[$i+1].$value[$i+2];
if (preg_match("/\d{3}/u", $myvar) > 0 && $myvar>100 && strpos($myvar.$value[$i+3], 'p') == 0)
$value = substr($value,0,$i+2).' '.$value[$i+1].substr($value,$i+2,strlen($value));
}
echo "<u>DESIRED STRING</u><br/>";
echo $value;
?>
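A regex-only alternative, as a sketch. I'm reading the rule as "split every standalone three-digit number abc into ab bc unless it is immediately followed by p"; that reading reproduces the __A, __B, __C and __C3 lines of the desired output, but the __C2 line in the question contains extra characters that this rule does not produce, so treat it as an assumption. split_three_digit_numbers() is a hypothetical name.

```php
<?php
/**
 * Rewrites each standalone three-digit number "abc" as "ab bc",
 * unless the number is directly followed by "p" (as in "500p" or "x200p").
 */
function split_three_digit_numbers(string $text): string
{
    // (?<!\d)   not preceded by a digit (so not the tail of a longer number)
    // ([1-9])   first digit, non-zero so the number is really in the hundreds
    // (\d)(\d)  second and third digit
    // (?!\d|p)  not followed by another digit or by "p"
    $pattern = '/(?<!\d)([1-9])(\d)(\d)(?!\d|p)/';
    return preg_replace($pattern, '$1$2 $2$3', $text) ?? $text;
}

echo split_three_digit_numbers('__B 611.08 500p'), "\n";
```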
I have a file that looks like this (yes the line breaks are right):
39 9
30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A 00014002326018..
39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D 900014002326054.
0A .
39 30 30 30 31 34 30 30 32 33 32 36 30 39 31 0D 900014002326091.
0A .
39 30 30 30 31 34 30 30 32 33 32 36 31 36 33 0D 900014002326163.
0A .
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 32 30 30 0D 0A 26200..
39 9
30 30 30 31 34 30 30 32 33 32 36 32 30 30 0D 0A 00014002326200..
39 30 30 30 31 34 30 30 32 33 32 36 31 32 32 0D 900014002326122.
0A .
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 31 35 34 0D 0A 26154..
39 30 30 30 31 34 30 30 32 33 9000140023
32 36 31 33 31 0D 0A 26131..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 31 30 34 0D 0A 26104..
39 30 30 30 31 34 30 30 32 33 32 36 30 39 30 0D 900014002326090.
0A .
39 30 30 30 31 34 30 30 32 33 32 36 31 39 37 0D 900014002326197.
0A .
39 9
30 30 30 31 34 30 30 32 33 32 36 32 30 38 0D 0A 00014002326208..
39 30 30 30 31 34 30 30 32 33 9000140023
32 36 31 31 35 0D 0A 26115..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 31 36 34 0D 0A 26164..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 30 31 36 0D 0A 39 30 30 30 31 34 30 30 32 26016..900014002
33 3
32 36 32 34 36 0D 0A 26246..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 32 34 36 0D 0A 26246..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 30 37 39 0D 0A 26079..
39 9
30 30 30 31 34 30 30 32 33 000140023
32 36 31 32 30 0D 0A 26120..
39 9
30 30 30 31 34 30 30 32 33 32 36 32 32 38 0D 0A 00014002326228..
39 30 30 30 31 34 30 30 32 33 9000140023
32 36 31 38 36 0D 0A 26186..
I have this code that grabs the EID tags (the numbers that start with 9000) but I can't figure out how to get it to do multiple lines.
$data = file_get_contents('tags.txt');
$pattern = "/(\d{15})/i";
preg_match_all($pattern, $data, $tags);
$count = 0;
foreach ( $tags[0] as $tag ){
echo $tag . '<br />';
$count++;
}
echo "<br />" . $count . " total head scanned";
For example the first and second line should return 900014002326018 instead of ignoring the first and second line
I am not good with regular expressions, so if you could explain so I learn and stop having to have someone help me with simple regex, that would be awesome.
EDIT: The whole number is 15 digits starting with 9000
You can do this:
$result = preg_replace('~\R?(?:[0-9A-F]{2}\h+)+~', '', $data);
$result = explode('..', rtrim($result, '.'));
pattern details:
\R? # optional newline character
(?: # open a non-capturing group
[0-9A-F]{2} # two hexadecimal characters
\h+ # horizontal white characters (spaces or tabs)
)+ # repeat the non-capturing group one or more times
After this replacement the only content you must remove are the two dots. After removing the trailing dots, you can use these to explode the string to an array.
Another way
Since you know that there are always 48 characters before the part with the integers (and dots), you can use this pattern too:
$result = preg_replace('~(?:^|\R).{48}~', '', $data);
Another way, without regex
The idea is to read the file line by line and, since the length before the content is always the same (i.e. 16*3 characters -> 48 characters), extract the substring with the integer and concatenate it into the $data temporary variable.
ini_set("auto_detect_line_endings", true);
$data = '';
// "@" suppresses the warning if the file cannot be opened; the failure is handled below
$handle = @fopen("tags.txt", "r");
if ($handle) {
    while (($buffer = fgets($handle, 128)) !== false) {
        // skip the 48-character hex column and drop the trailing newline
        $data .= substr($buffer, 48, -1);
    }
    if (!feof($handle)) {
        echo "Error: fgets() has failed\n";
    }
    fclose($handle);
} else {
    echo "Error opening the file\n";
}
$result = explode('..', rtrim($data, '.'));
Note: if the file is in Windows format (with \r\n as the end of line), you must change the third parameter of the substr() function to -2. If you are interested in how to detect the newline type, you can take a look at this post.
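If you would rather not care about the newline type at all, one option is to strip the line ending with rtrim() before cutting off the hex column. A sketch under the same assumption as before (exactly 48 characters before the text column); extract_tags() is a hypothetical helper name.

```php
<?php
/**
 * Concatenates the text column of each dump line and splits it into tags.
 * rtrim() removes "\n", "\r\n" or nothing, so no -1 / -2 bookkeeping is needed.
 */
function extract_tags(iterable $lines, int $offset = 48): array
{
    $data = '';
    foreach ($lines as $line) {
        $data .= substr(rtrim($line, "\r\n"), $offset);
    }
    $data = rtrim($data, '.');          // drop the dots of the final CRLF
    return $data === '' ? [] : explode('..', $data);
}
```

You would call it as extract_tags(file('tags.txt')), since file() returns the lines with their endings still attached.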
I don't think it's even possible to do this with a single regex, but your code will be far more legible and maintainable if you approach this one step at a time.
This works, and it shouldn't be too hard to figure out how it works:
$eid_tag_src = <<<END_EID_TAGS
39 9
30 30 30 31 34 30 30 32 33 32 36 30 31 38 0D 0A 00014002326018..
39 30 30 30 31 34 30 30 32 33 32 36 30 35 34 0D 900014002326054.
:
etc.
:
39 30 30 30 31 34 30 30 32 33 9000140023
32 36 31 38 36 0D 0A 26186..
END_EID_TAGS;
/* Remove hex data from first 48 characters of each line */
$eid_tag_src = preg_replace('/^.{48}/m','',$eid_tag_src);
/* Remove all white space */
$eid_tag_src = preg_replace('/\s+/','',$eid_tag_src);
/* Replace dots (CRLF) with spaces */
$eid_tag_src = str_replace('..',' ',$eid_tag_src);
/* Convert to array of EID tags */
$eid_tags = explode(' ',trim($eid_tag_src));
print_r($eid_tags);
Here's the output:
Array
(
[0] => 900014002326018
[1] => 900014002326054
[2] => 900014002326091
[3] => 900014002326163
[4] => 900014002326200
[5] => 900014002326200
[6] => 900014002326122
[7] => 900014002326154
[8] => 900014002326131
[9] => 900014002326104
[10] => 900014002326090
[11] => 900014002326197
[12] => 900014002326208
[13] => 900014002326115
[14] => 900014002326164
[15] => 900014002326016
[16] => 900014002326246
[17] => 900014002326246
[18] => 900014002326079
[19] => 900014002326120
[20] => 900014002326228
[21] => 900014002326186
)
Here's an approach using effective grabbing (without replacing):
RegEx: /(?:^.{48}|\.)([0-9]+\.?)/m - explained demo
Which means (in plain English): start grabbing digits followed by an optional dot if, counted from the start of the line, there are 48 characters in front of them, or if they are preceded by a dot (special case).
And your code could look like this:
$pattern = '/(?:^.{48}|\.)([0-9]+\.?)/m';
preg_match_all($pattern, $data, $tags);
//join all the bits belonging to the number
$data=implode("", $tags[1]);
//count the dots to have a correct count of the numbers grabbed
//since each number was grabbed with an ending dot initially
$count=substr_count($data, ".");
//replace the dots with a html <br> tag (avoiding a split and a foreach loop)
$tags=str_replace('.', "<br>", $data);
print $tags . "<br>" . $count . " total scanned";
See the code live at http://3v4l.org/Z4EhI
Suppose I sample a selection of database records that return the following numbers:
20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77
Is there an algorithm that can be efficiently implemented in PHP to find the outliers (if there are any) from an array of floats based on how far they deviate from the mean?
Ok let's assume you have your data points in an array like so:
<?php $dataset = array(20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77); ?>
Then you can use the following function (see comments for what is happening) to remove all numbers that fall outside of the mean +/- the standard deviation times a magnitude you set (defaults to 1):
<?php
function remove_outliers($dataset, $magnitude = 1) {
$count = count($dataset);
$mean = array_sum($dataset) / $count; // Calculate the mean
$deviation = sqrt(array_sum(array_map("sd_square", $dataset, array_fill(0, $count, $mean))) / $count) * $magnitude; // Calculate standard deviation and times by magnitude
return array_filter($dataset, function($x) use ($mean, $deviation) { return ($x <= $mean + $deviation && $x >= $mean - $deviation); }); // Return filtered array of values that lie within $mean +- $deviation.
}
function sd_square($x, $mean) {
return pow($x - $mean, 2);
}
?>
For your example this function returns the following with a magnitude of 1:
Array
(
[1] => 80.3
[2] => 70.95
[5] => 85.56
[6] => 69.77
)
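For a normally distributed data set, this removes values more than 3 standard deviations from the mean. Note that stats_standard_deviation() comes from the PECL stats extension, which has to be installed.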
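<?php
function remove_outliers($array) {
    if (count($array) == 0) {
        return $array;
    }
    $ret = array();
    $mean = array_sum($array) / count($array);
    $stddev = stats_standard_deviation($array);
    $outlier = 3 * $stddev;
    foreach ($array as $a) {
        // keep the value only if it lies within 3 standard deviations of the mean
        // (the negation has to apply to the whole comparison, not to abs() alone)
        if (!(abs($a - $mean) > $outlier)) {
            $ret[] = $a;
        }
    }
    return $ret;
}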
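If the stats extension is not available, the same idea works in plain PHP. A sketch, with std_dev() and remove_outliers_3sigma() as hypothetical names; it uses the population standard deviation by default (divisor n), with an optional switch to the sample version (divisor n - 1).

```php
<?php
/** Standard deviation without the PECL stats extension. */
function std_dev(array $values, bool $sample = false): float
{
    $n = count($values);
    if ($n === 0 || ($sample && $n < 2)) {
        return 0.0;
    }
    $mean = array_sum($values) / $n;
    $sumSq = 0.0;
    foreach ($values as $v) {
        $sumSq += ($v - $mean) ** 2;
    }
    return sqrt($sumSq / ($sample ? $n - 1 : $n));
}

/** Keeps the values that lie within $k standard deviations of the mean. */
function remove_outliers_3sigma(array $values, float $k = 3.0): array
{
    if ($values === []) {
        return [];
    }
    $mean = array_sum($values) / count($values);
    $limit = $k * std_dev($values);
    return array_values(array_filter(
        $values,
        fn($v) => abs($v - $mean) <= $limit
    ));
}
```

One thing to keep in mind: with the population standard deviation no value can be further than sqrt(n - 1) deviations from the mean, so with 10 values or fewer (such as the seven in the question) a 3-sigma rule never removes anything.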
Topic: Detecting local, additive outliers in unordered arrays by walking a small window through the array and calculating the standard deviation for a certain range of values.
Good morning folks,
here is my solution, much too late. Since I was looking for a way to detect outliers in PHP and couldn't find anything basic, I decided to analyse a given data set on a 24 h timeline by simply moving a window of 5 consecutive items through an unordered array and calculating the local standard deviation to detect the additive outliers.
The first function simply calculates the average and the deviation of a given array, where $col is the column holding the values. $freegrades is subtracted from the count in the divisor to account for the degrees of freedom (German: Freiheitsgrade): in an incomplete data set (a sample) of 5 values you only have 4 degrees of freedom:
function analytics_stat ($arr,$col,$freegrades = 0) {
// calculate average called mu
$mu = 0;
foreach ($arr as $row) {
$mu += $row[$col];
}
$mu = $mu / count($arr);
// calculate empiric standard deviation called sigma
$sigma = 0;
foreach ($arr as $row) {
$sigma += pow(($mu - $row[$col]),2);
}
$sigma = sqrt($sigma / (count($arr) - $freegrades));
return [$mu,$sigma];
}
Now it's time for the core function, which moves through the given array and creates a new array with the result. $margin is the factor to multiply the deviation by, since a single sigma detects too many outliers, whereas more than 1.7 seems too high:
function analytics_detect_local_outliers ($arr,$col,$range,$margin = 1.0) {
$count = count($arr);
if ($count < $range) return false;
// the initial state of each value is NOT OUTLIER
$arr_result = [];
for ($i = 0;$i < $count;$i++) {
$arr_result[$i] = false;
}
$max = $count - $range + 1;
for ($i = 0;$i < $max;$i++) {
// calculate mu and sigma for current interval
// remember that 5 values give a divisor of 4 for sigma,
// since we only look at a part of the whole data set
$stat = analytics_stat(array_slice($arr,$i,$range),$col,1);
// a value in this interval counts, if it's found outside our defined sigma interval
$range_max = $i + $range;
for ($j = $i;$j < $range_max;$j++) {
if (abs($arr[$j][$col] - $stat[0]) > $margin * $stat[1]) {
$arr_result[$j] = true;
// this would be the place to add a counter to isolate
// real outliers from sudden steps in our data set
}
}
}
return $arr_result;
}
And finally comes the test function with random values in an array with length 24.
As for the margin, I was curious and chose the golden ratio PHI = 1.618..., since I really like this number and some Excel tests led me to a margin of 1.7, above which outliers were very rarely detected. The range of 5 is variable, but for me this was enough. So for every 5 consecutive values there will be one calculation:
function test_outliers () {
// create 2 dimensional data array with items [hour,value]
$arr = [];
for ($i = 0;$i < 24;$i++) {
$arr[$i] = [$i,rand(0,500)];
}
// set parameter for detection algorithm
$result = [];
$col = 1;
$range = 5;
$margin = 1.618;
$result = analytics_detect_local_outliers ($arr,$col,$range,$margin);
// display results
echo "<p style='font-size:8pt;'>";
for ($i = 0;$i < 24;$i++) {
if ($result[$i]) echo "♦".$arr[$i][1]."♦ "; else echo $arr[$i][1]." ";
}
echo "</p>";
}
After 20 calls of the test function I got these results:
417 140 372 131 449 26 192 222 320 349 94 147 201 ♦342♦ 123 16 15
♦490♦ 78 190 ♦434♦ 27 3 276
379 440 198 135 22 461 208 376 286 ♦73♦ 331 358 341 14 112 190 110 266
350 232 265 ♦63♦ 90 94
228 ♦392♦ 130 134 170 ♦485♦ 17 463 13 326 47 439 430 151 268 172 342
445 477 ♦21♦ 421 440 219 95
88 121 292 255 ♦16♦ 223 244 109 127 231 370 16 93 379 218 87 ♦335♦ 150
84 181 25 280 15 406
85 252 310 122 188 302 ♦13♦ 439 254 414 423 216 456 321 85 61 215 7
297 337 204 210 106 149
345 411 308 360 308 346 ♦451♦ ♦77♦ 16 498 331 160 142 102 ♦496♦ 220
107 143 ♦241♦ 113 82 355 114 452
490 222 412 94 2 ♦480♦ 181 149 41 110 220 ♦477♦ 278 349 73 186 135 181
♦39♦ 136 284 340 165 438
147 311 246 449 396 328 330 280 453 374 214 289 489 185 445 86 426 246
319 ♦30♦ 436 290 384 232
442 302 ♦436♦ 50 114 15 21 93 ♦376♦ 416 439 ♦222♦ 398 237 234 44 102
464 204 421 161 330 396 461
498 320 105 22 281 168 381 216 435 360 19 ♦402♦ 131 128 66 187 291 459
319 433 86 84 325 247
440 491 381 491 ♦22♦ 412 33 273 256 331 79 452 314 485 66 138 116 356
290 190 336 178 298 218
394 439 387 ♦80♦ 463 369 ♦104♦ 388 465 455 ♦246♦ 499 70 431 360 ♦22♦
203 280 241 319 ♦34♦ 238 439 497
485 289 249 ♦416♦ 228 166 217 186 184 ♦356♦ 142 166 26 91 70 ♦466♦ 177
357 298 443 307 387 373 209
338 166 90 122 442 429 499 293 ♦41♦ 159 395 79 307 91 325 91 162 211
85 189 278 251 224 481
77 196 37 326 230 281 ♦73♦ 334 159 490 127 365 37 57 246 26 285 468
228 181 74 ♦455♦ 119 435
328 3 216 149 217 348 65 433 164 473 465 145 341 112 462 396 168 251
351 43 320 123 181 198
216 213 249 219 ♦29♦ 255 100 216 181 233 33 47 344 383 ♦94♦ 323 440
187 79 403 139 382 37 395
366 450 263 160 290 ♦126♦ 304 307 335 396 458 195 171 493 270 434 222
401 38 383 158 355 311 150
402 339 382 97 125 88 300 332 250 ♦86♦ 362 214 448 67 114 ♦354♦ 140 16
♦354♦ 109 0 168 127 89
450 5 232 155 159 264 214 ♦416♦ 51 429 372 230 298 232 251 207 ♦322♦
160 148 206 293 446 111 338
I hope, this will help anyone in the present or future.
Greetings
P.S. To further improve this algorithm you could add a counter which makes sure that a value must be found at least 2 times, that is, in 2 different intervals (windows), before it is labelled an outlier. That way a sudden jump in the following values does not make the first value the villain. Let me give you an example:
In 3,6,5,9,37,40,42,51,98,39,33,45 there is an obvious step from 9 to 37 and an isolated value 98. I would like to detect 98, but not 9 or 37.
The first interval 3,6,5,9,37 would detect 37, the second interval 6,5,9,37,40 would not. So we would not flag 37, since there is only one problematic interval, i.e. one match. It should now be clear that 98 counts in several intervals and is therefore an outlier. So let's declare a value an outlier if it "counts" at least 2 times.
As so often, we have to look closely at the borders: the first and last values belong to only one interval, so an exception has to be made for them.
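The counter idea from the P.S. could be sketched like this. It is not the code I ran for the results above: detect_outliers_with_hits() is a hypothetical name, it works on a plain one-dimensional array for brevity, and the border exception is my own assumption (a value needs as many hits as $minHits, or as many as the number of windows it belongs to if that is smaller).

```php
<?php
/**
 * Sliding-window outlier detection with a hit counter.
 * Returns an array of booleans, true where a value counts as an outlier.
 */
function detect_outliers_with_hits(array $values, int $range = 5, float $margin = 1.618, int $minHits = 2): array
{
    $values = array_values($values);
    $count = count($values);
    if ($range < 2 || $count < $range) {
        return array_fill(0, $count, false);
    }
    $hits = array_fill(0, $count, 0);
    $lastStart = $count - $range;
    for ($i = 0; $i <= $lastStart; $i++) {
        $window = array_slice($values, $i, $range);
        $mu = array_sum($window) / $range;
        $sumSq = 0.0;
        foreach ($window as $v) {
            $sumSq += ($v - $mu) ** 2;
        }
        $sigma = sqrt($sumSq / ($range - 1));       // sample deviation, divisor n - 1
        foreach ($window as $offset => $v) {
            if (abs($v - $mu) > $margin * $sigma) {
                $hits[$i + $offset]++;              // one more window votes "outlier"
            }
        }
    }
    $result = [];
    for ($j = 0; $j < $count; $j++) {
        // number of windows that contain index $j (fewer near the borders)
        $windows = min($j, $lastStart) - max(0, $j - $range + 1) + 1;
        $result[$j] = $hits[$j] >= min($minHits, $windows);
    }
    return $result;
}

$flags = detect_outliers_with_hits([3, 6, 5, 9, 37, 40, 42, 51, 98, 39, 33, 45]);
```

For the example series and the default parameters only 98 should end up flagged; 9 and 37 each collect a single hit, which is below the threshold of 2.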