PHP preg_split() pattern - php
I need help finding a PCRE pattern to use with preg_split().
I'm using the regex pattern below to split a string based on its leading three-character code and semicolons. The pattern works fine in JavaScript, but now I need to use it in PHP. I tried preg_split(), but I just get junk back.
// Each group will begin with a three letter code, have three segments separated by a semi-colon. The string will not be terminated with a semi-colon.
// Pseudocode
string_to_split = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44"
// This works in JS
// https://regex101.com
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/gi";
Match 1
Full match 0-11 `AAA;RED;111`
Match 2
Full match 12-23 `BBB;BLUE;22`
Match 3
Full match 24-36 `CCC;GREEN;33`
Match 4
Full match 37-49 `DDD;WHITE;44`
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/";
$split = preg_split($pattern, $string_to_split);
returns
array(5)
0:""
1:";"
2:";"
3:";"
4:""
Based on the additional information you gave in comments on the answers, I have updated my answer to be specific to your source format.
You might want something like this:
$subject = "AAA;RED;111;AAA;Oh my dog;12.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';
preg_match_all($pattern, $subject,$matches);
var_dump($matches);
giving you
array (size=4)
0 =>
array (size=8)
0 => string 'AAA;RED;111' (length=11)
1 => string 'AAA;Oh my dog;12.34' (length=19)
2 => string 'AAA;Oh Long John;.4556' (length=22)
3 => string 'BBB;Oh Long Johnson;1.2323' (length=26)
4 => string 'BBB;Oh Don Piano;.33' (length=20)
5 => string 'CCC;Why I eyes ya;1.445' (length=23)
6 => string 'CCC;All the live long day;2.3343' (length=32)
7 => string 'DDD;Faith Hilling;.89' (length=21)
1 =>
array (size=8)
0 => string 'AAA' (length=3)
1 => string 'AAA' (length=3)
2 => string 'AAA' (length=3)
3 => string 'BBB' (length=3)
4 => string 'BBB' (length=3)
5 => string 'CCC' (length=3)
6 => string 'CCC' (length=3)
7 => string 'DDD' (length=3)
2 =>
array (size=8)
0 => string 'RED' (length=3)
1 => string 'Oh my dog' (length=9)
2 => string 'Oh Long John' (length=12)
3 => string 'Oh Long Johnson' (length=15)
4 => string 'Oh Don Piano' (length=12)
5 => string 'Why I eyes ya' (length=13)
6 => string 'All the live long day' (length=21)
7 => string 'Faith Hilling' (length=13)
3 =>
array (size=8)
0 => string '111' (length=3)
1 => string '12.34' (length=5)
2 => string '.4556' (length=5)
3 => string '1.2323' (length=6)
4 => string '.33' (length=3)
5 => string '1.445' (length=5)
6 => string '2.3343' (length=6)
7 => string '.89' (length=3)
The start marker should occur at the start of the string or immediately after a semicolon, so we use a lookbehind that looks for the start of the string or a semicolon:
(?<=;|^)
We look for one of the alternatives AAA, BBB, CCC or DDD and capture it:
(AAA|BBB|CCC|DDD)
After a semicolon we look for any character except a semicolon. The quantifier * means 0 or more times. Use + if you want at least 1.
;([^;]*)
After the next semicolon we look for a number. This task has to be split up to accept only a valid format: we first look for 0 or more digits followed by a dot:
(?:\d*\.)?
where (?:) means a non-capturing group.
After that we look for at least one digit: \d+
We want to capture both parts of the number, so we wrap them in parentheses after the semicolon:
;((?:\d*\.)?\d+)
This matches "1234", ".1234", "1.234", "12.34" and "123.4", but not "1234." or "1.2.3".
Finally we want this to immediately occur before a semicolon or the end of string. Thus we do a lookahead:
(?=;|$)
Lookaheads and lookbehinds are zero-width: the text they check is not consumed and does not become part of the match or its captures.
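As a rough sketch of how the captures above could be turned into one record per group, the following uses PREG_SET_ORDER so that each match carries its own code, text and number. The key names code, label and value are illustrative assumptions, not something the question specifies.

```php
<?php
// Sample input in the question's format.
$subject = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";

// Same pattern as above: code, free text, number, bounded by ';' or the string ends.
$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';

// PREG_SET_ORDER groups the captures per match instead of per capture group.
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

$records = [];
foreach ($matches as $m) {
    // Key names are hypothetical; pick whatever suits your data.
    $records[] = ['code' => $m[1], 'label' => $m[2], 'value' => $m[3]];
}

print_r($records);
```

With the default flag (PREG_PATTERN_ORDER) you get the column-wise layout shown in the var_dump above instead.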
I've modified your pattern a little, and added a couple of flags to preg_split.
The PREG_SPLIT_NO_EMPTY flag will exclude empty matches from the result, and PREG_SPLIT_DELIM_CAPTURE will include the captured value in the result.
$split = preg_split('/([abcd]{3};[^;]+;\d+);?/i', $string, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
Alternatively, and more suitably, you can use preg_match_all with the following pattern.
preg_match_all('/([abcd]{3};[^;]+;\d+);?/i', $string, $matches);
print_r($matches[0]);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
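To see what the two flags change, here is a small sketch that runs the same pattern with and without them. Without flags, preg_split() treats every matched group as a delimiter and returns only the empty gaps between them, which is close to the "junk" described in the question.

```php
<?php
$string  = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";
$pattern = '/([abcd]{3};[^;]+;\d+);?/i';

// No flags: the matches are consumed as delimiters, leaving only empty pieces.
$plain = preg_split($pattern, $string);

// With flags: the captured delimiter text is returned and empty pieces are dropped.
$kept = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

var_dump($plain);
print_r($kept);
```

Note that [abcd]{3} is a character class, so it accepts any three of those letters (for example ABD); (?:AAA|BBB|CCC|DDD) is the stricter choice if only those four codes are valid.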
You don't want to split your string, you want to match elements, so use preg_match_all:
$str = "AAA;RED;111;AAA;Oh my dog;2.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$res = preg_match_all('/(?:AAA|BBB|CCC|DDD);[^;]*;[^;]*;?/', $str, $m);
print_r($m[0]);
Output:
Array
(
[0] => AAA;RED;111;
[1] => AAA;Oh my dog;2.34;
[2] => AAA;Oh Long John;.4556;
[3] => BBB;Oh Long Johnson;1.2323;
[4] => BBB;Oh Don Piano;.33;
[5] => CCC;Why I eyes ya;1.445;
[6] => CCC;All the live long day;2.3343;
[7] => DDD;Faith Hilling;.89
)
Explanation:
/ : regex delimiter
(?:AAA|BBB|CCC|DDD) : non capture group AAA or BBB or CCC or DDD
; : a semicolon
[^;]* : 0 or more any character that is not a semicolon
; : a semicolon
[^;]* : 0 or more any character that is not a semicolon
;? : optional semicolon
/ : regex delimiter
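Because the optional ;? is part of the match, every group except the last keeps its trailing semicolon. A minimal sketch of one way to strip it and split each group into its three fields (the shortened input string is made up for illustration):

```php
<?php
// Shortened sample input, assumed for illustration.
$str = "AAA;RED;111;BBB;Oh Don Piano;.33;DDD;Faith Hilling;.89";

preg_match_all('/(?:AAA|BBB|CCC|DDD);[^;]*;[^;]*;?/', $str, $m);

$rows = array_map(function ($group) {
    // Remove only a single trailing semicolon so an empty last field would survive.
    $group = preg_replace('/;$/', '', $group);
    return explode(';', $group);
}, $m[0]);

print_r($rows);
```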
Related
Can not match the last group of numbers using php preg_match()
preg_match_all("/(\d{12}) (?:,|$)/","111762396541,561572500056,561729950637,561135281443",$matches);
var_dump($matches):
array (size=2)
  0 =>
    array (size=4)
      0 => string '561762396543,' (length=13)
      1 => string '561572500056,' (length=13)
      2 => string '561729950637,' (length=13)
      3 => string '561135281443' (length=12)
  1 =>
    array (size=4)
      0 => string '561762396543' (length=12)
      1 => string '561572500056' (length=12)
      2 => string '561729950637' (length=12)
      3 => string '561135281443' (length=12)
But I want $matches to look like this:
array (size=4)
  0 => string '561762396543,' (length=13)
  1 => string '561572500056,' (length=13)
  2 => string '561729950637,' (length=13)
  3 => string '561135281443' (length=12)
I want to match groups of numbers (each has 12 digits) plus a trailing comma if there is one. The exception is the last group of numbers: it doesn't have to match a comma, because it reaches the end of the line.
Try this instead:
preg_match_all("/(\d{12}(?:,|$))/","111762396541,561572500056,561729950637,561135281443",$matches);
When the $ is inside character class brackets [ ], it matches a literal $ character, not the end of the line.
EDIT: If you want to include the comma in your matches, just use the above code sample and look at $matches[0].
If you want a simpler syntax, \b asserts a word boundary, which occurs both between a digit and a comma and at the end of the line. Because \b is zero-width, the comma itself is not part of the match:
preg_match_all("/(\d{12}\b)/","111762396541,561572500056,561729950637,561135281443",$matches);
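A short sketch contrasting the two patterns on the sample input; the variable names $withComma and $digitsOnly are only for illustration.

```php
<?php
$input = "111762396541,561572500056,561729950637,561135281443";

// The comma (or end of string) is consumed, so it shows up in the full match.
preg_match_all('/\d{12}(?:,|$)/', $input, $a);
$withComma = $a[0];

// \b only asserts a boundary, so the full match is the 12 digits alone.
preg_match_all('/\d{12}\b/', $input, $b);
$digitsOnly = $b[0];

print_r($withComma);
print_r($digitsOnly);
```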
How to limit a variable search to a single line of text?
Considering this sample text:
grupo1, tiago1A, bola1A, mola1A, tijolo1A, pedro1B, bola1B, mola1B, tijolo1B, raimundo1C, bola1C, mola1C, tijolo1C, joao1D, bola1D, mola1D, tijolo1D, felipe1E, bola1E, mola1E, tijolo1E,
grupo2, tiago2A, bola2A, mola2A, tijolo2A, pedro2B, bola2B, mola2B, tijolo2B, raimundo2C, bola2C, mola2C, tijolo2C, joao2D, bola2D, mola2D, tijolo2D, felipe2E, bola2E, mola2E, tijolo2E,
grupo3, tiago3A, bola3A, mola3A, tijolo3A, pedro3B, bola3B, mola3B, tijolo3B, raimundo3C, bola3C, mola3C, tijolo3C, joao3D, bola3D, mola3D, tijolo3D, felipe3E, bola3E, mola3E, tijolo3E,
grupo4, tiago4A, bola4A, mola4A, tijolo4A, pedro4B, bola4B, mola4B, tijolo4B, raimundo4C, bola4C, mola4C, tijolo4C, joao4D, bola4D, mola4D, tijolo4D, felipe4E, bola4E, mola4E, tijolo4E,
grupo5, tiago5A, bola5A, mola5A, tijolo5A, pedro5B, bola5B, mola5B, tijolo5B, raimundo5C, bola5C, mola5C, tijolo5C, joao5D, bola5D, mola5D, tijolo5D, felipe5E, bola5E, mola5E, tijolo5E,
I would like to capture the 20 values that follow grupo3 and store them in groups of 4. I am using this:
/grupo3,((.*?),(.*?),(.*?),(.*?)),/
but this only returns the first 4 comma-separated values after grupo3. I need to generate this array structure:
Match 1
Group 1 tiago3A
Group 2 bola3A
Group 3 mola3A
Group 4 tijolo3A
Match 2
Group 1 pedro3B
Group 2 bola3B
Group 3 mola3B
Group 4 tijolo3B
Match 3
Group 1 raimundo3C
Group 2 bola3C
Group 3 mola3C
Group 4 tijolo3C
Match 4
Group 1 joao3D
Group 2 bola3D
Group 3 mola3D
Group 4 tijolo3D
Match 5
Group 1 felipe3E
Group 2 bola3E
Group 3 mola3E
Group 4 tijolo3E
You can try the following:
/,(.*?),(.*?),(.*?),(.*?),.*?$/m
The /m at the end is the multi-line flag, and the $ before it matches the end of a line.
Edit: To get every 4 elements from the 3rd paragraph only:
/grupo3,((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)),/
You can get the desired output in PHP like this:
preg_match('/grupo3,((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)),/', $str, $matches);
$groups = [];
unset($matches[0]);
$matches = array_values($matches);
$count = count($matches);
$j=0;
for($i=1;$i<$count;$i++) {
    if($i%5 == 0) {
        $j++;
        continue;
    }
    $groups[$j][] = $matches[$i];
}
var_dump($groups);
The output will be something like:
array (size=5)
  0 =>
    array (size=4)
      0 => string ' tiago3A' (length=8)
      1 => string ' bola3A' (length=7)
      2 => string ' mola3A' (length=7)
      3 => string ' tijolo3A' (length=9)
  1 =>
    array (size=4)
      0 => string 'pedro3B' (length=7)
      1 => string ' bola3B' (length=7)
      2 => string ' mola3B' (length=7)
      3 => string ' tijolo3B' (length=9)
  2 =>
    array (size=4)
      0 => string 'raimundo3C' (length=10)
      1 => string ' bola3C' (length=7)
      2 => string ' mola3C' (length=7)
      3 => string ' tijolo3C' (length=9)
  3 =>
    array (size=4)
      0 => string 'joao3D' (length=6)
      1 => string ' bola3D' (length=7)
      2 => string ' mola3D' (length=7)
      3 => string ' tijolo3D' (length=9)
  4 =>
    array (size=4)
      0 => string 'felipe3E' (length=8)
      1 => string ' bola3E' (length=7)
      2 => string ' mola3E' (length=7)
      3 => string 'tijolo3E' (length=0)
Please forgive the lateness of this answer. This is the comprehensive answer with a clean, direct solution that I would have posted earlier if this page hadn't been put on hold. This is as refined a solution as I can devise without knowing more about how your input data is generated or accessed.
The input:
$text='grupo1, tiago1A, bola1A, mola1A, tijolo1A, pedro1B, bola1B, mola1B, tijolo1B, raimundo1C, bola1C, mola1C, tijolo1C, joao1D, bola1D, mola1D, tijolo1D, felipe1E, bola1E, mola1E, tijolo1E,
grupo2, tiago2A, bola2A, mola2A, tijolo2A, pedro2B, bola2B, mola2B, tijolo2B, raimundo2C, bola2C, mola2C, tijolo2C, joao2D, bola2D, mola2D, tijolo2D, felipe2E, bola2E, mola2E, tijolo2E,
grupo3, tiago3A, bola3A, mola3A, tijolo3A, pedro3B, bola3B, mola3B, tijolo3B, raimundo3C, bola3C, mola3C, tijolo3C, joao3D, bola3D, mola3D, tijolo3D, felipe3E, bola3E, mola3E, tijolo3E,
grupo4, tiago4A, bola4A, mola4A, tijolo4A, pedro4B, bola4B, mola4B, tijolo4B, raimundo4C, bola4C, mola4C, tijolo4C, joao4D, bola4D, mola4D, tijolo4D, felipe4E, bola4E, mola4E, tijolo4E,
grupo5, tiago5A, bola5A, mola5A, tijolo5A, pedro5B, bola5B, mola5B, tijolo5B, raimundo5C, bola5C, mola5C, tijolo5C, joao5D, bola5D, mola5D, tijolo5D, felipe5E, bola5E, mola5E, tijolo5E,';
The method:
var_export(preg_match('/^grupo3, \K.*(?=,)/m',$text,$out)?array_chunk(explode(', ',$out[0]),4):'fail');
Use preg_match() to extract the single line, then use explode() to split the string on "comma space", then use array_chunk() to store the values in an array of 5 subarrays containing 4 elements each.
The pattern targets grupo3, at the start of the line, then restarts the full match using \K, then greedily matches every non-newline character and stops just before the last comma in the line. The positive lookahead (?=,) keeps the final comma out of the full string match.
My method does not retain any leading or trailing spaces, just the values themselves.
Output:
array (
  0 => array ( 0 => 'tiago3A', 1 => 'bola3A', 2 => 'mola3A', 3 => 'tijolo3A', ),
  1 => array ( 0 => 'pedro3B', 1 => 'bola3B', 2 => 'mola3B', 3 => 'tijolo3B', ),
  2 => array ( 0 => 'raimundo3C', 1 => 'bola3C', 2 => 'mola3C', 3 => 'tijolo3C', ),
  3 => array ( 0 => 'joao3D', 1 => 'bola3D', 2 => 'mola3D', 3 => 'tijolo3D', ),
  4 => array ( 0 => 'felipe3E', 1 => 'bola3E', 2 => 'mola3E', 3 => 'tijolo3E', ),
)
p.s. If the search term ($needle) is to be dynamic, you can use something like this to achieve the same result:
$needle='grupo3'; // if the needle may include any regex-sensitive characters, use preg_quote($needle,'/') at $needle
var_export(preg_match('/^'.$needle.', \K.*(?=,)/m',$text,$out)?array_chunk(explode(', ',$out[0]),4):'fail');
/* or this is equivalent...
if(preg_match('/^'.$needle.', \K.*(?=,)/m',$text,$out)){
    $singles=explode(', ',$out[0]);
    $groups=array_chunk($singles,4);
    var_export($groups);
}else{
    echo 'fail';
}
*/
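The preg_quote() advice in the comment above can be sketched as a small helper. The function name chunkLine() and the two-line sample text are hypothetical, chosen so that an unescaped needle would match the wrong line.

```php
<?php
// Hypothetical helper; wraps the preg_match/explode/array_chunk approach
// and escapes the needle so regex metacharacters are matched literally.
function chunkLine(string $text, string $needle, int $size = 4): ?array
{
    $pattern = '/^' . preg_quote($needle, '/') . ', \K.*(?=,)/m';
    if (!preg_match($pattern, $text, $out)) {
        return null; // no line starts with the needle
    }
    return array_chunk(explode(', ', $out[0]), $size);
}

// Made-up sample: an unescaped "g.1" would already match the "gx1" line.
$text = "gx1, zz, zz, zz, zz,\ng.1, a1, b1, c1, d1, e1, f1, g1, h1,";

$groups  = chunkLine($text, 'g.1');
$missing = chunkLine($text, 'g.9');

var_export($groups);
```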
preg match all get group multiple times
I am trying to get a regular expression to capture a subgroup every time it is found. This is my code:
$string2 = 'cabbba';
preg_match_all('#c(a(b)*a)#',$string2,$result3,PREG_SET_ORDER);
var_dump($result3);
My goal is to get 'b' as a captured group each time (so 3 times). This code outputs the following:
array (size=1)
  0 =>
    array (size=3)
      0 => string 'cabbba' (length=6)
      1 => string 'abbba' (length=5)
      2 => string 'b' (length=1)
I want it to show 'b' each time it appears, so something like this:
array (size=1)
  0 =>
    array (size=3)
      0 => string 'cabbba' (length=6)
      1 => string 'abbba' (length=5)
      2 =>
        array (size=3)
          0 => string 'b' (length 1)
          1 => string 'b' (length 1)
          2 => string 'b' (length 1)
This is a simplified example. In the real code the subpattern 'b' will be different each time, but it follows the same pattern.
This is possible only with the \G anchor, which matches at the position where the previous match ended:
(?:ca|\G)(b)(?=b|(a))
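A minimal sketch of running that pattern with preg_match_all(). One assumption is added here: \G also matches at the very start of the subject, so the (?!^) guard is included to stop a string that begins with b (without the ca prefix) from matching.

```php
<?php
// \G continues from the end of the previous match; (?!^) is an added guard
// (not in the answer above) that disables \G at the start of the subject.
$pattern = '/(?:ca|\G(?!^))(b)(?=b|(a))/';

// Each 'b' becomes its own match, so group 1 is captured three times.
$count = preg_match_all($pattern, 'cabbba', $m);
print_r($m[1]);

// Without the "ca" prefix nothing matches, thanks to the guard.
$none = preg_match_all($pattern, 'bba', $skipped);
var_dump($none);
```

The repeated captures end up as separate matches in $m[1] rather than as a nested array inside one match, since a single capture group only ever keeps its last value.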
Did you try using a non-greedy modifier for your b*?
$string2 = 'cabbba';
preg_match_all('#c(a(b)*?a)#', $string2, $result3, PREG_SET_ORDER);
var_dump($result3);
Excuse me if it's not what you asked, I'm not sure I really understood your needs...
UPDATE: Sorry, the previous answer is wrong, please ignore it. I'm trying to work out a correct one. I tried something like
preg_match_all('#c(a(?:(b{1}))*a)#', $string2, $result3, PREG_SET_ORDER);
but it doesn't work either.
UPDATE 2: See Avinash Raj's answer, I think it's quite good.
Regex Extracting After the Match
preg_match('/\$(\d+\.\d+)/',$message,$keywords);
dd($keywords);
Hi, I have a few questions:
1) Is it possible to detect or extract the text after the regular expression match? For example, I'm trying to detect $1.20; is it possible to detect the text after it, e.g. per hour, /hr, per hr, / hour?
1.1) Maybe something like extracting 20 characters after the match?
1.2) Is it possible to know the position of the match if I can't extract the text?
$100000/hour test test test
Extract test test tst
1) Put everything you want to extract in the regex, like this:
preg_match('#\$(\d+\.\d+)(\s+per hour|\s*/hr|\s+per hr|\s*/hour)?#',$message,$keywords);
You'll get the amount in $keywords[1] and the other piece of text in $keywords[2].
1.1) Use /\$(\d+\.\d+)(.{0,20})/ to get at most 20 characters in the second group. Write the lower bound explicitly: the PCRE versions bundled with PHP before 8.4 treat {,20} as literal text rather than as a quantifier. If you use .{20} instead, the pattern matches only if at least 20 characters follow the amount.
1.2) Use the $flags parameter of preg_match():
preg_match('/\$(\d+\.\d+)/',$message,$keywords,PREG_OFFSET_CAPTURE);
Check print_r($keywords) to see how the matched values and their offsets are returned.
You probably need to find all the occurrences. In that case use preg_match_all().
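For point 1.2, a small sketch of how the offset returned by PREG_OFFSET_CAPTURE could be combined with substr() to grab the text after the match. The sample message is made up for illustration.

```php
<?php
// Single quotes keep PHP from treating "$1" as anything special.
$message = 'Pay: $1.20 per hour';

$after = null;
if (preg_match('/\$(\d+\.\d+)/', $message, $keywords, PREG_OFFSET_CAPTURE)) {
    // With PREG_OFFSET_CAPTURE each entry is [matched text, byte offset].
    [$matched, $offset] = $keywords[0];

    // Take up to 20 bytes that follow the end of the match.
    $after = substr($message, $offset + strlen($matched), 20);
}

var_dump($offset, $after);
```

The offsets are byte offsets, so for multibyte text the substr() arithmetic stays consistent, but the 20 is a byte count rather than a character count.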
Try this:
$re = '~\$(\d+\.?\d+)/?(\w+)?~m';
$str = "$100000/hour\n$100.2000/min";
preg_match_all($re, $str, $matches);
var_dump($matches);
Output:
array (size=3)
  0 =>
    array (size=2)
      0 => string '$100000/hour' (length=12)
      1 => string '$100.2000/min' (length=13)
  1 =>
    array (size=2)
      0 => string '100000' (length=6)
      1 => string '100.2000' (length=8)
  2 =>
    array (size=2)
      0 => string 'hour' (length=4)
      1 => string 'min' (length=3)
Regex to extract weather forecast, and add to an array
I've bought RegexBuddy and given it a try, but unless I am matching on something static and simple, I just can't get regex!
What I am trying to do, from the following line of text, is extract tidal information into an associative array.
High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am
And end up with the following array:
[0] = 'Day' => 'Mon', 'Time' => '8.54pm', 'Height' => '2.0m', 'Tide' => 'High'
[1] = 'Day' => 'Tue', 'Time' => '09.18am', 'Height' => '2.4m', 'Tide' => 'High'
The concept I am struggling with most is the fact that there are multiple matches that I wish to extract (e.g. 2.0m and 2.4m). I've managed to match on 2.0m and 2.4m, but how do I determine which one is which (first high tide vs. second high tide)?
Any hints?
$string = "High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am";
preg_match_all("~((High|Low) Tide:)? (\d.\dm) on (\w{3}) at (.{7})~", $string, $matches, PREG_SET_ORDER);
var_dump($matches);
outputs
array
  0 =>
    array
      0 => string 'High Tide: 2.0m on Mon at 08.54pm' (length=33)
      1 => string 'High Tide:' (length=10)
      2 => string 'High' (length=4)
      3 => string '2.0m' (length=4)
      4 => string 'Mon' (length=3)
      5 => string '08.54pm' (length=7)
  1 =>
    array
      0 => string ' 2.4m on Tue at 09.18am' (length=23)
      1 => string '' (length=0)
      2 => string '' (length=0)
      3 => string '2.4m' (length=4)
      4 => string 'Tue' (length=3)
      5 => string '09.18am' (length=7)
I probably got the part about the low tide wrong, so here is some code without the tide:
$string = "High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am";
preg_match_all("~(\d.\dm) on (\w{3}) at (.{7})~", $string, $matches, PREG_SET_ORDER);
var_dump($matches);
outputs:
array
  0 =>
    array
      0 => string '2.0m on Mon at 08.54pm' (length=22)
      1 => string '2.0m' (length=4)
      2 => string 'Mon' (length=3)
      3 => string '08.54pm' (length=7)
  1 =>
    array
      0 => string '2.4m on Tue at 09.18am' (length=22)
      1 => string '2.4m' (length=4)
      2 => string 'Tue' (length=3)
      3 => string '09.18am' (length=7)
If the word "and" always separates the two tides, you could break the string in two and process each half separately. For example:
$str = "High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am";
$data = explode(" and ", $str);
$result = array();
foreach($data as $tide) {
    $result[] = parseWithRegex($tide);
}
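The answer leaves parseWithRegex() undefined. One possible shape for it, as a sketch only: the pattern, the named groups and the handling of the tide type are assumptions chosen to produce the array layout asked for in the question.

```php
<?php
// Hypothetical implementation of the parseWithRegex() helper named above.
// Returns null when the fragment does not look like "2.0m on Mon at 08.54pm".
function parseWithRegex(string $tide): ?array
{
    $pattern = '/(?P<Height>\d+\.\d+m) on (?P<Day>[A-Za-z]{3}) at (?P<Time>\d{1,2}\.\d{2}[ap]m)/';
    if (!preg_match($pattern, $tide, $m)) {
        return null;
    }
    return ['Day' => $m['Day'], 'Time' => $m['Time'], 'Height' => $m['Height']];
}

$str = "High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am";

// The tide type appears once, at the front, and applies to both halves.
$type = preg_match('/^(High|Low) Tide:/', $str, $t) ? $t[1] : null;

$result = [];
foreach (explode(' and ', $str) as $part) {
    $row = parseWithRegex($part);
    if ($row !== null) {
        $result[] = $row + ['Tide' => $type];
    }
}

print_r($result);
```

Since the halves are processed in order, $result[0] is the first tide in the sentence and $result[1] the second, which answers the "which one is which" part of the question.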
You can use named capturing groups to get an associative array with the result, and the pattern to match the string is pretty straightforward.
/(?P<tide>high|low)\s+tide:\s+(?P<height1>\d+\.\d+m)\s+on\s+(?P<day1>[a-z]+)\s+at\s+(?P<time1>\d+\.\d+[ap]m)\s+and\s+(?P<height2>\d+\.\d+m)\s+on\s+(?P<day2>[a-z]+)\s+at\s+(?P<time2>\d+\.\d+[ap]m)/i
Example script:
$string = "High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am";
// named groups will also assign matches associative to the matches array, e.g. (?P<tide>high|low) will set $matches["tide"] to 'low' or 'high'
preg_match(
    '/
    (?P<tide>high|low)          # match and capture string "high" or "low"
    \s+tide:\s+                 # match string "tide:" surrounded with one or more spaces on each side
    (?P<height1>\d+\.\d+m)      # match and capture one or more digits followed by a dot and one or more digits followed by an m
    \s+on\s+                    # match string "on" surrounded with one or more spaces on each side
    (?P<day1>[a-z]+)            # match and capture one or more letters
    \s+at\s+                    # match string "at" surrounded with one or more spaces on each side
    (?P<time1>\d+\.\d+[ap]m)    # match and capture one or more digits followed by a dot and one or more digits followed by an a or p, and string "m", so am or pm
    \s+and\s+                   # match string "and" surrounded with one or more spaces on each side
    (?P<height2>\d+\.\d+m)      # match and capture one or more digits followed by a dot and one or more digits followed by an m
    \s+on\s+                    # match string "on" surrounded with one or more spaces on each side
    (?P<day2>[a-z]+)            # match and capture one or more letters
    \s+at\s+                    # match string "at" surrounded with one or more spaces on each side
    (?P<time2>\d+\.\d+[ap]m)    # match and capture one or more digits followed by a dot and one or more digits followed by an a or p, and string "m", so am or pm
    /ix', $string, $matches);
print_r($matches);
This will print:
Array
(
    [0] => High Tide: 2.0m on Mon at 08.54pm and 2.4m on Tue at 09.18am
    [tide] => High
    [1] => High
    [height1] => 2.0m
    [2] => 2.0m
    [day1] => Mon
    [3] => Mon
    [time1] => 08.54pm
    [4] => 08.54pm
    [height2] => 2.4m
    [5] => 2.4m
    [day2] => Tue
    [6] => Tue
    [time2] => 09.18am
    [7] => 09.18am
)
You can use named groups and then refer to what you captured by name:
(?P<name>exp) => $yourVarName['name']
(Not tested, but this would be the idea.)
/^[^\d]+(?P<heightOne>[\d\.]+?m)\son\s(?P<dayOne>\w+?)\sat\s(?P<timeOne>.*?(am|pm))\sand\s(?P<heightTwo>[\d\.]+?m)\son\s(?P<dayTwo>\w+?)\sat\s(?P<timeTwo>.*?(am|pm))$/