Splitting string in key value with regex


I'm having some trouble parsing plain text output from samtools stats.
Example output:
45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I'd like to parse the file line-by-line and get the following output in a PHP array like this:
Array(
"in total" => [45205768,0],
...
)
So, long story short, I'd like to get the numerical values from the front of the line as an array of integers and the following string (without the brackets) as key.

^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$
This regex will put first value, second value and the following string without the brackets in the match groups 1, 2 and 3 respectively.
Regex101 demo
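A minimal sketch of how that pattern could be applied line by line in PHP. The function name parse_stat_lines is made up for illustration, and trim() is used because group 3 also captures the space that precedes an opening bracket.

```php
<?php
// Hypothetical helper: parse "N + M label (...)" lines into label => [N, M].
function parse_stat_lines(string $text): array
{
    $result = [];
    foreach (preg_split('/\R/', trim($text)) as $line) {
        if (preg_match('/^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$/', $line, $m)) {
            // Group 3 may end with a space (just before a bracket), so trim it.
            $result[trim($m[3])] = [(int) $m[1], (int) $m[2]];
        }
    }
    return $result;
}

$sample = "45205768 + 0 in total (QC-passed reads + QC-failed reads)\n"
        . "44647359 + 0 mapped (98.76% : N/A)\n"
        . "0 + 0 read1";
var_export(parse_stat_lines($sample));
```

Because the label is used as the array key, two lines that share the same label text before the bracket would overwrite each other.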

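I think this is what you're after:
^(\d+)(\s\+\s)(\d+)(.+)
See it work here on Regex101
Pick up the first and third groups for the two numbers; the fourth group holds the rest of the line, including the leading space and any bracketed part.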

This can be solved with just two capture groups and the fullstring match.
My pattern accurately extracts the desired substrings and trims the trailing spaces from the to-be-declared "keys": Pattern Demo
^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s) #244steps
PHP Code: (Demo)
$txt='45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)';
preg_match_all('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/m', $txt, $out);
$result = [];
foreach ($out[0] as $k => $v) {
    $result[$v] = [(int)$out[1][$k], (int)$out[2][$k]]; // re-casting strings as integers
}
var_export($result);
Output:
array (
'in total' => array (0 => 45205768, 1 => 0),
'secondary' => array (0 => 0, 1 => 0),
'supplementary' => array (0 => 0, 1 => 0),
'duplicates' => array (0 => 5203838, 1 => 0),
'mapped' => array (0 => 44647359, 1 => 0),
'paired in sequencing' => array (0 => 0, 1 => 0),
'read1' => array (0 => 0, 1 => 0),
'read2' => array (0 => 0, 1 => 0),
'properly paired' => array (0 => 0, 1 => 0),
'with itself and mate mapped' => array (0 => 0, 1 => 0),
'singletons' => array (0 => 0, 1 => 0),
'with mate mapped to a different chr' => array ( 0 => 0, 1 => 0)
)
Note that the last two lines of the input text generate the same key in the $result array, so the earlier line's data is overwritten by the later line's data. If this is a concern, you might restructure your input data or keep the parenthetical portion as part of the key for uniqueness.

Related

GFS Grib Wind Values Decode And Convert (U & V)

I'm writing a Grib2 decoder in PHP, starting from a half-written library that I found. Everything works except that the values I get from the data are incorrect after converting the integer values to real values. I think I am converting everything correctly, and when I test with cloud data the result looks right when I check it in Panoply. I suspect the problem is with this formula, which is all over the internet. Below I'm using the 10 m above ground GFS data from https://nomads.ncep.noaa.gov
Y*10^D = R+(X1+X2)*2^E
I'm not sure I'm plugging in the values correctly, but again, it works with cloud cover percentages.
These are the "Data Representation Values" I get from Grib Section 5:
'Reference value (R)' => 886.25067138671875,
'Binary Scale Factor (E)' => 0,
'Decimal Scale Factor (D)' => 2,
'Number of bits used for each packed value' => 11,
'exp' => pow(2, $E), //(Equals 1) (The Library used these as the 2^E)
'base' => pow(10, $D), //(Equals 100) (And the 10^D)
'template' => 0,
As you can see below, the numbers definitely have a connection to the reference value. The number closest to 886 (R) is 892, and its actual value should be about 0.05, as shown below (EX.). The numbers higher than 892 are positive and the ones lower than 892 are negative. But when I use the formula (886 + 892 * 1) / 100 it gives me 17.78, not 0.05. I seem to be missing something pretty obvious. Am I misunderstanding the formula, where Y is the value I want?
X1 = 0 (documentation says)
X2 = 892 (documentation says is scaled value, the value in the Grib from bits?)
2^0 = 1
10^2 = 100
R = 886.25067138671875
Y * 10^D = R + (X1 + X2) * 2^E
Y * 100 = R + (X1 + X2) * 1
Y = (886 + (0 + 892) * 1) / 100
Y = (886 + 892 * 1) / 100
Y = 17.78
Int Values of wind from Grib (After converting from Bits)
0 => 695,
1 => 639,
2 => 631,
3 => 0,
4 => 436,
5 => 513,
6 => 690,
7 => 570,
8 => 625,
9 => 805,
10 => 892,<-----------(EX.)
11 => 1044,
12 => 952,
13 => 1081,
14 => 1414,
15 => 997,
16 => 1106,
17 => 974,
18 => 1135,
19 => 1069,
20 => 912,
Actual decoded wind values shown in Panoply (Well known Grib App)
-1.9125067
-2.4725068
-2.5525067
-8.862507
-4.5025067
-3.7325068
-1.9625068
-3.1625068
-2.6125066
-0.81250674
0.057493284 <-----------(EX.)
1.5774933
0.6574933
1.9474933
5.2774935
1.1074933
2.1974933
0.87749326
2.4874933
1.8274933
0.2574933

y = 0.01 * (x - 886.25067138671875) seems to work for all points
so 0.01 * (892 - 886.25067138671875) = 0.0574
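The relation above can be checked against the Panoply values with a short PHP sketch. It is the question's formula solved for Y with X1 = 0, E = 0 and D = 2, but with the reference value entering as a negative number. That sign is an assumption inferred from the sample data, not something confirmed from the file itself, and the function name decode_packed is made up for illustration.

```php
<?php
// Y = (R + X * 2^E) / 10^D, i.e. the question's formula solved for Y.
function decode_packed(int $x, float $r, int $e, int $d): float
{
    return ($r + $x * 2 ** $e) / 10 ** $d;
}

// Assumption: the reference value is -886.25..., not +886.25...
$r = -886.25067138671875;

foreach ([695, 0, 892, 1044] as $packed) {
    printf("%4d => %.7f\n", $packed, decode_packed($packed, $r, 0, 2));
}
```

If this matches the whole grid, the place to look would be how the sign of the reference value is read from Section 5, rather than the formula.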

Regex to split string into array of numbers and characters using PHP

I have an arithmetic string that will be similar to the following pattern.
a. 1+2+3
b. 2/1*100
c. 1+2+3/3*100
d. (1*2)/(3*4)*100
Points to note are that
1. the string will never contain spaces.
2. the string will always be a combination of Numbers, Arithmetic symbols (+, -, *, /) and the characters '(' and ')'
I am looking for a regex in PHP to split the characters based on their type and form an array of individual string characters like below.
(Note: I cannot use str_split because multi-digit numbers such as 100 must not be split.)
a. 1+2+3
output => [
0 => '1'
1 => '+'
2 => '2'
3 => '+'
4 => '3'
]
b. 2/1*100
output => [
0 => '2'
1 => '/'
2 => '1'
3 => '*'
4 => '100'
]
c. 1+2+3/3*100
output => [
0 => '1'
1 => '+'
2 => '2'
3 => '+'
4 => '3'
5 => '/'
6 => '3'
7 => '*'
8 => '100'
]
d. (1*2)/(3*4)*100
output => [
0 => '('
1 => '1'
2 => '*'
3 => '2'
4 => ')'
5 => '/'
6 => '('
7 => '3'
8 => '*'
9 => '4'
10 => ')'
11 => '*'
12 => '100'
]
Thank you very much in advance.

Use this regex :
(?<=[()\/*+-])(?=[0-9()])|(?<=[0-9()])(?=[()\/*+-])
It matches every position between a digit or parenthesis on one side and an operator or parenthesis on the other.
(?<=[()\/*+-])(?=[0-9()]) matches a position with a parenthesis or an operator on the left and a digit or parenthesis on the right.
(?<=[0-9()])(?=[()\/*+-]) is the same but with left and right reversed.
Demo here
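A short sketch of that pattern used with preg_split() in PHP. The wrapper function split_expression is a made-up name for illustration; since every match is zero-width, nothing is removed from the string and each piece becomes one array element.

```php
<?php
// Split at the zero-width positions described above.
function split_expression(string $expression): array
{
    $pattern = '/(?<=[()\/*+-])(?=[0-9()])|(?<=[0-9()])(?=[()\/*+-])/';
    return preg_split($pattern, $expression);
}

var_export(split_expression('(1*2)/(3*4)*100'));
```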

Since you state that the expressions are "clean", no spaces or such, you could split on
\b|(?<=\W)(?=\W)
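It splits on all word boundaries and on boundaries between non-word characters (using positive lookarounds that match a position between two non-word characters). Note that \b also matches at the very start or end of a string that begins or ends with a digit, so with preg_split() you may want the PREG_SPLIT_NO_EMPTY flag to drop any empty strings there.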
See an illustration here at regex101

As I said, I will help you with that if you can show some of the work you did yourself to solve the problem.
However, if your objective in building a one-dimensional array out of an arithmetic expression is to parse and compute it, then you should build a tree instead, with the operators as nodes and the operands as branches:
'(1*2)/(3*4)*100'
Array
(
    [operator] => '*',
    [left] => Array
    (
        [operator] => '/',
        [left] => Array
        (
            [operator] => '*',
            [left] => 1,
            [right] => 2
        ),
        [right] => Array
        (
            [operator] => '*',
            [left] => 3,
            [right] => 4
        )
    ),
    [right] => 100
)

There is no need to use regex for this. You just loop through the string and build the array as you want.
Edit, just realized it can be done much faster with a while loop instead of two for loops and if().
$str = "(10*2)/(3*40)*100";
$str = str_split($str); // make $str an array of single characters
$n = count($str);
$arr = array();
$j = 0; // index into the new array
for ($i = 0; $i < $n; $i++) {
    if (is_numeric($str[$i])) { // the item is a digit
        $arr[$j] = $str[$i]; // start a new number
        $k = $i + 1;
        // While the next character is still a digit, append it to the current number.
        // Checking $k < $n first avoids reading past the end of the array.
        while ($k < $n && is_numeric($str[$k])) {
            $arr[$j] .= $str[$k];
            $k++;
        }
        $j++; // we are done with this number
        $i = $k - 1; // resume after the last digit consumed
    } else {
        // not a digit: add it to the new array as its own item
        $arr[$j] = $str[$i];
        $j++;
    }
}
var_dump($arr);
https://3v4l.org/p9jZp

You can also use this matching regex: [()+\-*\/]|\d+
Demo
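Because this pattern matches the tokens themselves rather than the gaps between them, it fits preg_match_all() instead of preg_split(). A minimal sketch, with match_tokens as a made-up wrapper name:

```php
<?php
// Collect every operator, parenthesis, or run of digits as one token.
function match_tokens(string $expression): array
{
    preg_match_all('/[()+\-*\/]|\d+/', $expression, $matches);
    return $matches[0]; // index 0 holds the full-pattern matches
}

var_export(match_tokens('1+2+3/3*100'));
```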

I was doing something similar to this for a php calculator demo. A related post.
Consider this pattern for preg_split():
~-?\d+|[()*/+-]~ (Pattern Demo)
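This has the added benefit of allowing negative numbers such as the -10 in +-10. The first "alternative" matches positive or negative integers, while the second "alternative" (after the |) matches parentheses and operators, one at a time. Be aware that -?\d+ is tried first, so a binary minus directly before a number (for example 3-2) is also absorbed into the number, giving '3' and '-2' with no operator between them.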
In the php implementation, I place the entire pattern in a capture group and retain the delimiters. This way no substrings are left behind. ~ is used as the pattern delimiter so that the slash in the pattern doesn't need to be escaped.
Code: (Demo)
$expression = '(1*2)/(3*4)*100+-10';
var_export(
    preg_split(
        '~(-?\d+|[()*/+-])~',
        $expression,
        0,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    )
);
Output:
array (
0 => '(',
1 => '1',
2 => '*',
3 => '2',
4 => ')',
5 => '/',
6 => '(',
7 => '3',
8 => '*',
9 => '4',
10 => ')',
11 => '*',
12 => '100',
13 => '+',
14 => '-10',
)
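If subtraction also has to be tokenized, one possible variation (not taken from the answer above) is to treat a minus sign as part of a number only when it is not preceded by a digit or a closing parenthesis. A sketch using preg_match_all(); the function name tokenize_signed is made up for illustration:

```php
<?php
// A '-' is a sign only if it does not follow a digit or ')';
// otherwise it falls through to the operator alternative.
function tokenize_signed(string $expression): array
{
    preg_match_all('~(?<![\d)])-\d+|\d+|[()*/+-]~', $expression, $matches);
    return $matches[0];
}

var_export(tokenize_signed('(1*2)/(3*4)*100+-10'));
var_export(tokenize_signed('3-2'));
```

With this pattern, 3-2 yields three tokens while +-10 still keeps -10 together.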

Count distinct occurrences of items among multiple columns in a pattern

I am working on a survey application using MySQL and PHP.
The responses will be in the following format:
+-c1_1-+-c1_2-+-c1_3-+-c2_1-+-c2_2-+-c2_3-+-....
+ red  + blue + pink + cyan + red  + gray + ....
+ black+ pink + plum + red  + blue + gray + ....
+ cyan + red  + blue + blue + pink + plum + ....
+------+------+------+------+------+------+ ....
c1_1 represents Column_For_Question_1_With_Rank_1
c1_2 represents Column_For_Question_1_With_Rank_2
c1_3 represents Column_For_Question_1_With_Rank_3
c2_1 represents Column_For_Question_2_With_Rank_1
c2_2 represents Column_For_Question_2_With_Rank_2
c2_3 represents Column_For_Question_2_With_Rank_3
Scoring is like this:
Rank 1 = color in column cX_1 = gets 3 marks (c1_1,c2_1,c3_1..)
Rank 2 = color in column cX_2 = gets 2 marks (c1_2,c2_2,c3_2..)
Rank 3 = color in column cX_3 = gets 1 mark (c1_3,c2_3,c3_3..)
Score of Red:
appears in cX_1 two times = 3x2=6
appears in cX_2 two times = 2x2=4
So Red gets a score of 6+4=10
Score of Blue:
appears in cX_1 one time = 3x1=3
appears in cX_2 two times = 2x2=4
appears in cX_3 two times = 1x2=2
So blue gets a score of 3+4+2 = 9
Is it possible to write an effective query to arrive at a result like:
+-color-+-score-+
+ red + 10 +
+ blue + 9 +
+ xxx + # +
+ xxx + # +
+ xxx + # +
+-------+-------+
If that is not possible, at least the number of occurrences, like:
+-color-+-n_cX_1-+-n_cX_2-+-n_cX_3-+
+ red + 2 + 2 + 0 +
+ blue + 1 + 2 + 2 +
+ xxx + # + # + # +
+ xxx + # + # + # +
+ xxx + # + # + # +
+ xxx + # + # + # +
+-------+--------+--------+--------+
Actually the colors will be replaced by people names.
Each 'set of three consecutive columns' (cX_1,cX_2,cX_3) represent first, second and third ranks rated for each of 9 questions. So there will be 3x9=27 columns
Can someone please help me with this? I am thinking on using count(*) repeatedly but I am sure it is a wrong approach. Searched a lot before posting but could not solve it.
Edit 1:
Want to mention that there might be almost 50 people names in these columns. And each row would represent response from one examiner doing the survey. There will be about 100 such examiners and hence about 100 rows.

Here is my solution. I work more with PHP, so my answer is on the PHP side; if you want an SQL solution, you may have to wait for another user to add one.
You will get this kind of array in PHP from MySQL:
$result = [
'c1_1'=>['red','black','cyan'],
'c1_2'=>['blue','pink','red'],
'c1_3'=>['pink','plum','blue'],
'c2_1'=>['cyan','red','blue'],
'c2_2'=>['red','blue','pink'],
'c2_3'=>['gray','gray','plum']
];
Now, generate a $users array holding each user's achievements in each category:
$users = [];
foreach ($result as $k => $v) {
    foreach ($v as $user) {
        $users[$user][] = $k;
    }
}
Now, $users array will look like,
array (size=7)
'red' =>
array (size=4)
0 => string 'c1_1' (length=4)
1 => string 'c1_2' (length=4)
2 => string 'c2_1' (length=4)
3 => string 'c2_2' (length=4)
'black' =>
array (size=1)
0 => string 'c1_1' (length=4)
'cyan' =>
array (size=2)
0 => string 'c1_1' (length=4)
1 => string 'c2_1' (length=4)
'blue' =>
array (size=4)
0 => string 'c1_2' (length=4)
1 => string 'c1_3' (length=4)
2 => string 'c2_1' (length=4)
3 => string 'c2_2' (length=4)
'pink' =>
array (size=3)
0 => string 'c1_2' (length=4)
1 => string 'c1_3' (length=4)
2 => string 'c2_2' (length=4)
'plum' =>
array (size=2)
0 => string 'c1_3' (length=4)
1 => string 'c2_3' (length=4)
'gray' =>
array (size=2)
0 => string 'c2_3' (length=4)
1 => string 'c2_3' (length=4)
Now, define a function that calculates the marks from the per-user array we just generated:
function marks_of($input)
{
    $marks_index = ['_1' => 3, '_2' => 2, '_3' => 1]; // define marks here
    $marks = 0;
    foreach ($input as $marking) {
        // the last two characters of the column name ('_1', '_2', '_3') give the rank
        $marks += $marks_index[substr($marking, -2)];
    }
    return $marks;
}
You need to define the marks for each rank, as I commented in the code above.
Now, use it like,
$marks_of_red = marks_of($users['red']);
will give
int 10
To generate an array having each user's marks by name,
$all_users_marks = [];
foreach ($users as $name => $achievements) {
    $all_users_marks[$name] = marks_of($achievements);
}
Now, $all_users_marks is
array (size=7)
'red' => int 10
'black' => int 3
'cyan' => int 6
'blue' => int 8
'pink' => int 5
'plum' => int 2
'gray' => int 2
As I already said, wait for someone else if you want a MySQL-side answer.

Normalize the table like this:
question | rank | name
Then use this query to get the total scores (rank is a reserved word as of MySQL 8.0, so it is quoted with backticks):
SELECT name, SUM(4 - `rank`) AS score FROM mytable GROUP BY name ORDER BY score DESC
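If the wide table has to stay as it is for now, the same 4 - rank scoring can be applied in PHP while reading the rows. This is only a sketch: it assumes each fetched row is an associative array keyed by column names of the form cX_1, cX_2, cX_3, and the function name score_rows is made up for illustration.

```php
<?php
// Sum (4 - rank) per name, reading the rank from the column name suffix.
function score_rows(array $rows): array
{
    $scores = [];
    foreach ($rows as $row) {
        foreach ($row as $column => $name) {
            if (!preg_match('/^c\d+_([123])$/', $column, $m)) {
                continue; // skip columns that are not rank columns
            }
            $scores[$name] = ($scores[$name] ?? 0) + (4 - (int) $m[1]);
        }
    }
    arsort($scores); // highest score first, keys preserved
    return $scores;
}

$rows = [
    ['c1_1' => 'red',   'c1_2' => 'blue', 'c1_3' => 'pink', 'c2_1' => 'cyan', 'c2_2' => 'red',  'c2_3' => 'gray'],
    ['c1_1' => 'black', 'c1_2' => 'pink', 'c1_3' => 'plum', 'c2_1' => 'red',  'c2_2' => 'blue', 'c2_3' => 'gray'],
    ['c1_1' => 'cyan',  'c1_2' => 'red',  'c1_3' => 'blue', 'c2_1' => 'blue', 'c2_2' => 'pink', 'c2_3' => 'plum'],
];
var_export(score_rows($rows));
```

The sample rows are the six columns shown in the question; any further columns would simply add to the totals.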

Using unpack() to convert to a byte array in PHP

I'm trying to convert a binary string to a byte array of a specific format.
Sample binary data:
ê≤ÚEZêK
The hex version of the binary string looks like this:
00151b000000000190b2f20304455a000003900000004b0000
The Python script uses struct package and unpacks the above string (in binary) using this code:
data = unpack(">hBiiiiih",binarydata)
The desired byte array looks like this. It is also the content of the data tuple in Python:
(21, 27, 0, 26260210, 50611546, 912, 75, 0)
How can I unpack the same binary string using PHP's unpack() function and get the same output? That is, what's the >hBiiiiih equivalent in PHP?
So far my PHP code
$hex = "00151b000000000190b2f20304455a000003900000004b0000";
$bin = pack("H*",$hex);
print_r(unpack("x/c*", $bin));
Which gives:
Array ( [*1] => 21 [*2] => 27 [*3] => 0 [*4] => 0 [*5] => 0 [*6] => 0 [*7] => 1 [*8] => -112 [*9] => -78 [*10] => -14 [*11] => 3 [*12] => 4 [*13] => 69 [*14] => 90 [*15] => 0 [*16] => 0 [*17] => 3 [*18] => -112 [*19] => 0 [*20] => 0 [*21] => 0 [*22] => 75 [*23] => 0 [*24] => 0 )
Would also appreciate links to a PHP tutorial on working with pack/unpack.

This produces the same result as Python, but it treats signed values as unsigned, because unpack() has no format codes for signed values with an explicit byte order. Also note that Python's i (int) fields are read with PHP's N (unsigned long) code; this is fine because both are 4 bytes wide.
$hex = "00151b000000000190b2f20304455a000003900000004b0000";
$bin = pack("H*", $hex);
$x = unpack("nbe_unsigned_1/Cunsigned_char/N5be_unsigned_long/nbe_unsigned_2", $bin);
print_r($x);
Array
(
[be_unsigned_1] => 21
[unsigned_char] => 27
[be_unsigned_long1] => 0
[be_unsigned_long2] => 26260210
[be_unsigned_long3] => 50611546
[be_unsigned_long4] => 912
[be_unsigned_long5] => 75
[be_unsigned_2] => 0
)
Because this data is treated as unsigned, you will need to detect whether the original data was negative, which can be done for 2 byte shorts with something similar to this:
if ($x["be_unsigned_1"] >= pow(2, 15)) {
    $x["be_unsigned_1"] = $x["be_unsigned_1"] - pow(2, 16);
}
and for longs using
if ($x["be_unsigned_long2"] >= pow(2, 31)) {
    $x["be_unsigned_long2"] = $x["be_unsigned_long2"] - pow(2, 32);
}
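Putting the two steps together, a sketch of a helper that unpacks the record and applies the sign correction to every signed field. The names to_signed and unpack_record are made up for illustration, and the 32-bit correction assumes a 64-bit PHP build so that 1 << 32 fits in an integer.

```php
<?php
// Reinterpret an unsigned value of the given bit width as two's complement.
function to_signed(int $value, int $bits): int
{
    return $value >= (1 << ($bits - 1)) ? $value - (1 << $bits) : $value;
}

// PHP counterpart of Python's unpack(">hBiiiiih", ...), returned as a list.
function unpack_record(string $bin): array
{
    $raw = unpack('nshort1/Cbyte/N5int/nshort2', $bin);
    $out = [to_signed($raw['short1'], 16), $raw['byte']]; // B is unsigned already
    for ($i = 1; $i <= 5; $i++) {
        $out[] = to_signed($raw["int$i"], 32);
    }
    $out[] = to_signed($raw['short2'], 16);
    return $out;
}

$bin = hex2bin('00151b000000000190b2f20304455a000003900000004b0000');
print_r(unpack_record($bin));
```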

Best practice to implement blank tile search to anagram solver

I currently have an anagram solver on my website that works well and quickly.
I use an array structure to hold the count of each letter used in each word. So when someone enters the letters "fghdywkjd", my solver goes through each word in its database and compares the number of each letter in the word against the counts of the letters entered, i.e. "fghdywkjd".
I build the array like this
$a = array('a' => 1, 'b' => 1, 'c' => 1, 'd' => 1, 'e' => 1, 'f' => 1, 'g' => 1, 'h' => 1, 'i' => 1, 'j' => 1, 'k' => 1, 'l' => 1, 'm' => 1, 'n' => 1, 'o' => 1, 'p' => 1, 'q' => 1, 'r' => 1, 's' => 1, 't' => 1, 'u' => 1, 'v' => 1, 'w' => 1, 'x' => 1, 'y' => 1, 'z' => 1);
It counts the values as it goes through each word.
I am trying to think of the best way to add a blank tile feature to it that is not going to slow it down.
The only way I can figure out how to add this feature is to wait until I have all my results, then take each word found, add the letter "a" and find the possibilities, then add the letter "b", and so on. Doing that for each word would be enormous.
Anyways some ideas?

Here's probably how I would do it. I would set up the word database table structure like this: (The main reason for this is speed. We could split the names by letter each query but I think this way is faster though I haven't benchmarked).
name a b c d e f g h i j k l m n o p q r s t u v w x y z
---- - - - - - - - - - - - - - - - - - - - - - - - - - -
test 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0
tests 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0
foo 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
and then in the PHP I'd do this: This assumes that the number of letters in the word has to match the anagram exactly (no extra letters).
<?php
$letters = array_fill_keys(range('a', 'z'), 0);
$word = 'set'; // start with the word 'set'
// lowercase, remove invalid characters, and convert to an array of letters
$wordLetters = str_split(preg_replace("/[^a-z]/", '', strtolower($word)));
$numberOfWildcards = 1; // Change this to the number of wildcards you want
foreach ($wordLetters as $letter) {
    $letters[$letter]++;
}
$query = 'SELECT `name`, 0';
foreach ($letters as $letter => $num) {
    // $query .= "+ABS(`$letter`-$num)";
    $query .= "+IF(`$letter` > $num, `$letter` - $num, 0)";
}
// append (.=) rather than assign, so the SELECT list built above is kept
$query .= ' AS difference
FROM `word_table`
WHERE
LENGTH(`name`) = ' . (strlen($word) + $numberOfWildcards) . '
HAVING
difference = ' . $numberOfWildcards;
If you want to see the difference between the word you are checking and all the words in the database get rid of the where and having clauses.
Let me know how this works out for you.
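For comparison, the same "missing letters" count that the query builds with IF(...) can be written in plain PHP, for example to check a handful of candidate words in memory. This is only a sketch of the counting idea with made-up function names; it accepts a word when the number of letters missing from the rack is at most the number of blank tiles, whereas the query above requires an exact match.

```php
<?php
// Number of letters in $word that the rack cannot supply.
function missing_letters(string $word, string $rack): int
{
    $need = array_count_values(str_split($word));
    $have = $rack === '' ? [] : array_count_values(str_split($rack));
    $missing = 0;
    foreach ($need as $letter => $count) {
        $missing += max(0, $count - ($have[$letter] ?? 0));
    }
    return $missing;
}

// A word can be formed if blank tiles cover every missing letter.
function can_form(string $word, string $rack, int $blanks): bool
{
    return missing_letters($word, $rack) <= $blanks;
}

var_dump(can_form('test', 'set', 1));  // one extra 't' is covered by the blank
var_dump(can_form('tests', 'set', 1)); // needs two blanks
```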
