PHP: split a string of alternating groups of characters into an array

I have a string whose correct syntax is the regex ^([0-9]+[abc])+$. So examples of valid strings would be: '1a2b' or '00333b1119a555a0c'
For clarity, the string is a list of (value, letter) pairs and the order matters. I'm stuck with the input string so I can't change that. While testing for correct syntax seems easy in principle with the above regex, I'm trying to think of the most efficient way in PHP to transform a compliant string into a usable array something like this:
Input:
'00333b1119a555a0c'
Output:
array (
0 => array('num' => '00333', 'let' => 'b'),
1 => array('num' => '1119', 'let' => 'a'),
2 => array('num' => '555', 'let' => 'a'),
3 => array('num' => '0', 'let' => 'c')
)
I'm having difficulty using preg_match for this. For example this doesn't give the expected result, the intent being to greedy-match on EITHER \d+ (and save that) OR [abc] (and save that), repeated until end of string reached.
$text = '00b000b0b';
$out = array();
$x = preg_match("/^(?:(\d+|[abc]))+$/", $text, $out);
This didn't work either, the intent here being to greedy-match on \d+[abc] (and save these), repeated until end of string reached, and split them into numbers and letter afterwards.
$text = '00b000b0b';
$out = array();
$x = preg_match("/^(?:\d+[abc])+$/", $text, $out);
I'd planned to check syntax as part of the preg_match, then use the preg_match output to greedy-match the 'blocks' (or keep the delimiters if using preg_split), then if needed loop through the result 2 items at a time using for (...; i+=2) to extract value-letter in their pairs.
But I can't seem to even get that basic preg_split() or preg_match() approach to work smoothly, much less explore if there's a 'neater' or more efficient way.
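A minimal sketch of why the two preg_match() attempts above come up short, assuming the same test string: a capturing group inside a repeated group keeps only its last iteration, and a pattern with no capturing group returns only the whole match. The variable names here are illustrative only.

```php
<?php
$text = '00b000b0b';

// The inner group is overwritten on every repetition,
// so only the final iteration survives in $lastOnly[1].
preg_match('/^(?:(\d+|[abc]))+$/', $text, $lastOnly);

// No capturing group at all: only the whole match is returned.
preg_match('/^(?:\d+[abc])+$/', $text, $wholeOnly);

// preg_match_all() without anchors collects every (number, letter) pair.
preg_match_all('/(\d+)([abc])/', $text, $pairs, PREG_SET_ORDER);

print_r($lastOnly);
print_r($wholeOnly);
print_r($pairs);
```

This is why the approaches discussed here pair an anchored preg_match() for validation with a separate preg_match_all() for extraction.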

Your regex needs a few matching groups
/([0-9]+?)([a-z])/i
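This captures the run of digits in one group and the letter that follows in another. preg_match_all() returns every such match.
Note that the lazy quantifier ? in [0-9]+? is harmless but not required here: the digits must be followed by a letter, so a plain [0-9]+ matches exactly the same text.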
match[0] is the whole match
match[1] is the first match group (the numbers)
match[2] is the second match group (the letter)
example below
<?php
$input = '00333b1119a555a0c';
$regex = '/([0-9]+?)([a-z])/i';
$out = [];
$parsed = [];
if (preg_match_all($regex, $input, $out)) {
foreach ($out[0] as $index => $value) {
$parsed[] = [
'num' => $out[1][$index],
'let' => $out[2][$index],
];
}
}
var_dump($parsed);
output
array(4) {
[0] =>
array(2) {
'num' =>
string(5) "00333"
'let' =>
string(1) "b"
}
[1] =>
array(2) {
'num' =>
string(4) "1119"
'let' =>
string(1) "a"
}
[2] =>
array(2) {
'num' =>
string(3) "555"
'let' =>
string(1) "a"
}
[3] =>
array(2) {
'num' =>
string(1) "0"
'let' =>
string(1) "c"
}
}

Simple solution with preg_match_all(with PREG_SET_ORDER flag) and array_map functions:
$input = '00333b1119a555a0c';
preg_match_all('/([0-9]+?)([a-z]+?)/i', $input, $matches, PREG_SET_ORDER);
$result = array_map(function($v) {
return ['num' => $v[1], 'let' => $v[2]];
}, $matches);
print_r($result);
The output:
Array
(
[0] => Array
(
[num] => 00333
[let] => b
)
[1] => Array
(
[num] => 1119
[let] => a
)
[2] => Array
(
[num] => 555
[let] => a
)
[3] => Array
(
[num] => 0
[let] => c
)
)

You can use:
$str = '00333b1119a555a0c';
$arr=array();
if (preg_match_all('/(\d+)(\p{L}+)/', $str, $m)) {
array_walk( $m[1], function ($v, $k) use(&$arr, $m ) {
$arr[] = [ 'num'=>$v, 'let'=>$m[2][$k] ]; });
}
print_r($arr);
Output:
Array
(
[0] => Array
(
[num] => 00333
[let] => b
)
[1] => Array
(
[num] => 1119
[let] => a
)
[2] => Array
(
[num] => 555
[let] => a
)
[3] => Array
(
[num] => 0
[let] => c
)
)

All of the above work. But they didn't seem to have the elegance I wanted - they needed to loop, use array mapping, or (for preg_match_all()) they needed another almost identical regex as well, just to verify the string matched the regex.
I eventually found that preg_match_all() combined with named captures solved it for me. I hadn't used named captures for that purpose before and it looks powerful.
I also added an optional extra step to simplify the output if dups aren't expected (which wasn't in the question but may help someone).
$input = '00333b1119a555a0c';
preg_match_all("/(?P<num>\d+)(?P<let>[abc])/", $input, $raw_matches, PREG_SET_ORDER);
print_r($raw_matches);
// if dups not expected this is also worth doing
$matches = array_column($raw_matches, 'num', 'let');
print_r($matches);
More complete version with input+duplicate checking
$input = '00333b1119a555a0c';
if (!preg_match("/^(\d+[abc])+$/",$input)) {
// OPTIONAL: detected $input incorrectly formatted
}
preg_match_all("/(?P<num>\d+)(?P<let>[abc])/", $input, $raw_matches, PREG_SET_ORDER);
$matches = array_column($raw_matches, 'num', 'let');
if (count($matches) != count($raw_matches)) {
// OPTIONAL: detected duplicate letters in $input
}
print_r($matches);
Explanation:
This uses preg_match_all() as suggested by @RomanPerekhrest and @exussum to break out the individual groups and split the numbers and letters. I used named groups so that the resulting array of $raw_matches is created with the correct names already.
But if dups aren't expected, then I used an extra step with array_column(), which directly extracts data from a nested array of entries and creates the desired flat array, without any need for loops, mapping, walking, or assigning item by item: from
(group1 => (num1, let1), group2 => (num2, let2), ... )
to the "flat" array:
(let1 => num1, let2 => num2, ... )
If named captures feel too advanced, they can be left out. The groups are numbered anyway and this works just as well; you just have to refer to the groups by index, which is harder to follow.
preg_match_all("/(\d+)([abc])/", $input, $raw_matches, PREG_SET_ORDER);
$matches = array_column($raw_matches, 1, 2);
If you need to check for duplicated letters (which wasn't in the question but could be useful), here's how: array_column() uses each letter as a key of the new array, and duplicate keys can't exist, so if the original matches contained more than one entry for a letter, only one of them is kept. So we just test whether the number of matches originally found is the same as the number of entries in the final array after array_column(). If not, there were duplicates.
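The validation, extraction, and duplicate check described above can be sketched as one function. The name parsePairs and the shape of its return value are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: returns null for malformed input, otherwise the
// ordered pairs, the flat letter => number map, and a duplicate flag.
function parsePairs(string $input): ?array
{
    // Validate the whole string first.
    if (!preg_match('/^(?:\d+[abc])+$/', $input)) {
        return null;
    }

    preg_match_all('/(?P<num>\d+)(?P<let>[abc])/', $input, $raw, PREG_SET_ORDER);

    // Keep only the named keys of each match.
    $pairs = array_map(function (array $m) {
        return ['num' => $m['num'], 'let' => $m['let']];
    }, $raw);

    // Duplicate letters collapse into one key, so the counts differ.
    $flat = array_column($raw, 'num', 'let');

    return [
        'pairs' => $pairs,
        'flat' => $flat,
        'hasDuplicates' => count($flat) !== count($raw),
    ];
}

print_r(parsePairs('00333b1119a555a0c'));
```

For the sample input the letter a appears twice, so the flat map keeps only the later value and the duplicate flag is set.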


Flatten array of regular expressions

I have an array of regular expressions -$toks:
Array
(
[0] => /(?=\D*\d)/
[1] => /\b(waiting)\b/i
[2] => /^(\w+)/
[3] => /\b(responce)\b/i
[4] => /\b(from)\b/i
[5] => /\|/
[6] => /\b(to)\b/i
)
When I'm trying to flatten it:
$patterns_flattened = implode('|', $toks);
I get a regex:
/(?=\D*\d)/|/\b(waiting)\b/i|/^(\w+)/|/\b(responce)\b/i|/\b(from)\b/i|/\|/|/\b(to)\b/i
When I'm trying to:
if (preg_match('/'. $patterns_flattened .'/', "I'm waiting for a response from", $matches)) {
print_r($matches);
}
I get an error:
Warning: preg_match(): Unknown modifier '(' in ...index.php on line
Where is my mistake?
Thanks.
You need to remove the opening and closing slashes, like this:
$toks = [
'(?=\D*\d)',
'\b(waiting)\b',
'^(\w+)',
'\b(response)\b',
'\b(from)\b',
'\|',
'\b(to)\b',
];
And then, I think you'll want to use preg_match_all instead of preg_match:
$patterns_flattened = implode('|', $toks);
if (preg_match_all("/$patterns_flattened/i", "I'm waiting for a response from", $matches)) {
print_r($matches[0]);
}
Printing $matches[0] instead of the whole $matches array shows only the full match for each alternative:
Array
(
[0] => I
[1] => waiting
[2] => response
[3] => from
)
Try it on 3v4l.org
<?php
$data = Array
(
0 => '/(?=\D*\d)/',
1 => '/\b(waiting)\b/i',
2 => '/^(\w+)/',
3 => '/\b(responce)\b/i',
4 => '/\b(from)\b/i',
5 => '/\|/',
6 => '/\b(to)\b/i'
);
$patterns_flattened = implode('|', $data);
$regex = str_replace("/i",'',$patterns_flattened);
$regex = str_replace('/','',$regex);
if (preg_match_all( '/'.$regex.'/', "I'm waiting for a responce from", $matches)) {
echo '<pre>';
print_r($matches[0]);
}
You have to remove the delimiter slashes and the i modifier from each pattern before joining them. The leftover delimiters were what broke the combined regex.
A really nice tool for validating a regex is this:
https://regexr.com/
I always use it when I have to write a larger-than-usual regular expression.
The output of the above code is :
Array
(
[0] => I
[1] => waiting
[2] => responce
[3] => from
)
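The two str_replace() calls above remove every slash and every "/i" anywhere in the joined string. A more targeted sketch strips only the leading delimiter and the trailing delimiter plus modifiers from each pattern. The helper name stripDelimiters is hypothetical, and it assumes every pattern uses / as its delimiter.

```php
<?php
// Hypothetical helper: assumes "/" delimiters and letter-only modifiers.
function stripDelimiters(string $pattern): string
{
    return preg_replace('~^/(.*)/[a-zA-Z]*$~s', '$1', $pattern);
}

$toks = [
    '/(?=\D*\d)/',
    '/\b(waiting)\b/i',
    '/^(\w+)/',
    '/\b(responce)\b/i',
    '/\b(from)\b/i',
    '/\|/',
    '/\b(to)\b/i',
];

// Strip each pattern, then wrap the alternation in a single delimiter pair.
$bare = array_map('stripDelimiters', $toks);
$combined = '/' . implode('|', $bare) . '/i';

preg_match_all($combined, "I'm waiting for a responce from", $matches);
print_r($matches[0]);
```

Note that the single i modifier on the combined pattern applies to every alternative, including those that were case-sensitive before.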
There are a few adjustments to make with your $toks array.
To remove the error, you need to remove the pattern delimiters and pattern modifiers from each array element.
None of the capture grouping is necessary, in fact, it will lead to a higher step count and create unnecessary output array bloat.
Whatever your intention is with (?=\D*\d), it needs a rethink. If there is a number anywhere in your input string, you are potentially going to generate lots of empty elements which surely can't have any benefit for your project. Look at what happens when I put a space then 1 after from in your input string.
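A small sketch of the lookahead problem just described, assuming the delimiter-free patterns with (?=\D*\d) kept in the list. It only counts the empty matches rather than listing them.

```php
<?php
$toks = [
    '(?=\D*\d)',
    '\bwaiting\b',
    '^\w+',
    '\bresponse\b',
    '\bfrom\b',
    '\|',
    '\bto\b',
];
$pattern = '/' . implode('|', $toks) . '/i';

// Same input with and without a trailing digit.
preg_match_all($pattern, "I'm waiting for a response from", $noDigit);
preg_match_all($pattern, "I'm waiting for a response from 1", $withDigit);

// Count the zero-length matches produced by the lookahead alternative.
$emptyWithoutDigit = count(array_keys($noDigit[0], '', true));
$emptyWithDigit = count(array_keys($withDigit[0], '', true));

var_dump($emptyWithoutDigit, $emptyWithDigit);
```

The lookahead consumes nothing, so wherever a digit lies ahead it can succeed as an empty match, which is where the empty elements come from.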
Here is my recommendation:
$toks = [
'\bwaiting\b',
'^\w+',
'\bresponse\b',
'\bfrom\b',
'\|',
'\bto\b',
];
$pattern = '/' . implode('|', $toks) . '/i';
var_export(preg_match_all($pattern, "I'm waiting for a response from", $out) ? $out[0] : null);
Output:
array (
0 => 'I',
1 => 'waiting',
2 => 'response',
3 => 'from',
)

Regex - how to split string by commas, omitting commas in brackets

I have a string, say:
$str = "myTemplate, testArr => [1868,1869,1870], testInteger => 3, testString => 'test, can contain a comma'"
It basically represents a comma delimited list of parameters I need to parse.
I need to split this string in PHP (probably using preg_match_all) by commas (but omitting those in brackets and quotes) so the end result would be array of the following four matches:
myTemplate
testArr => [1868,1869,1870]
testInteger => 3
testString => 'test, can contain a comma'
The problem is with the array and string values. So any commas inside [ ] or ' ' or " " should not be considered as a delimiter.
There are many similar questions here, but I wasn't able to get it working for this particular situation. What would be the correct regex to get this result? Thank you!
You can use this lookaround based regex:
$str = "myTemplate, testArr => [1868,1869,1870], testInteger => 3, testString => 'test, can contain a comma'";
$arr = preg_split("/\s*,\s*(?![^][]*\])(?=(?:(?:[^']*'){2})*[^']*$)/", $str);
print_r( $arr );
There are 2 lookarounds used in this regex:
(?![^][]*\]) - Asserts comma is not inside [...]
(?=(?:(?:[^']*'){2})*[^']*$) - Asserts comma is not inside '...'
PS: This is assuming we don't have unbalanced/nested/escaped quotes and brackets.
Output:
Array
(
[0] => myTemplate
[1] => testArr => [1868,1869,1870]
[2] => testInteger => 3
[3] => testString => 'test, can contain a comma'
)
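The lookaround regex above handles single quotes only, while the question also mentions double quotes. As an alternative technique (not the answer's method), here is a sketch of a small character scanner that tracks bracket depth and the currently open quote. The function name splitTopLevel is hypothetical, and like the regex it assumes no escaped quotes.

```php
<?php
// Hypothetical helper: splits on commas that are outside [ ] and quotes.
function splitTopLevel(string $str): array
{
    $parts = [];
    $current = '';
    $depth = 0;     // nesting level of square brackets
    $quote = null;  // the quote character currently open, or null

    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ch = $str[$i];
        if ($quote !== null) {
            // Inside a quoted string: only the matching quote closes it.
            if ($ch === $quote) {
                $quote = null;
            }
        } elseif ($ch === "'" || $ch === '"') {
            $quote = $ch;
        } elseif ($ch === '[') {
            $depth++;
        } elseif ($ch === ']') {
            $depth = max(0, $depth - 1);
        } elseif ($ch === ',' && $depth === 0) {
            // Top-level comma: finish the current part.
            $parts[] = trim($current);
            $current = '';
            continue;
        }
        $current .= $ch;
    }
    $parts[] = trim($current);

    return $parts;
}

$str = "myTemplate, testArr => [1868,1869,1870], testInteger => 3, testString => 'test, can contain a comma'";
print_r(splitTopLevel($str));
```

Unlike the regex, this also copes with nested brackets, at the cost of a few more lines.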
I would do it like this:
<?php
$str = "myTemplate, testArr => [1868,1869,1870], testInteger => 3, testString => 'test, can contain a comma'";
$pattern[0] = "[a-zA-Z]+,"; // textonly entry
$pattern[1] = "\w+\s*?=>\s*\[.*\]\s*,?"; // array type entry with value enclosed in square brackets
$pattern[2] = "\w+\s*?=>\s*\d+\s*,?"; // array type entry with decimal value
$pattern[3] = "\w+\s*?=>\s*\'.*\'\s*,?"; // array type entry with string value
$regex = implode('|', $pattern);
preg_match_all("/$regex/", $str, $matches);
// You can also use the one liner commented below if you dont like to use the array
//preg_match_all("/[a-zA-Z]+,|\w+\s*?=>\s*\[.*\]\s*,?|\w+\s*?=>\s*\d+\s*,?|\w+\s*?=>\s*\'.*\'\s*,?/", $str, $matches);
print_r($matches);
This is easier to manage and I can easily add/remove patterns if needed. It will output like
Array
(
[0] => Array
(
[0] => myTemplate,
[1] => testArr => [1868,1869,1870],
[2] => testInteger => 3,
[3] => testString => 'test, can contain a comma'
)
)

capturing group under capturing group?

Is it possible to nest a capturing group under another capturing group, so that I get an array like this?
regex = (asd1).(lol1),(asd2).(lol2)
string = asd1.lol1,asd2.lol2
return_array[0]=>group[0]='asd1';
return_array[0]=>group[1]='lol1';
return_array[1]=>group[0]='asd2';
return_array[1]=>group[1]='lol2';
While using regular expressions can get what you want, you could also use strtok() to iterate through what seems to simply be comma separated sets:
$results = array();
$str = 'asd1.lol1,asd2.lol2';
$token = strtok($str, ',');
while ($token !== false) {
$results[] = explode('.', $token, 2);
$token = strtok(',');
}
Output:
Array
(
[0] => Array
(
[0] => asd1
[1] => lol1
)
[1] => Array
(
[0] => asd2
[1] => lol2
)
)
With regular expressions your pattern needs to only include the two terms surrounding a period, i.e.:
$pattern = '/(?<=^|,)(\w+)\.(\w+)/';
preg_match_all($pattern, $str, $result, PREG_SET_ORDER);
The (?<=^|,) is a look-behind assertion; it makes sure to only match what comes after if preceded by either the start of your search string or a comma, but it doesn't "consume" anything.
Output:
Array
(
[0] => Array
(
[0] => asd1.lol1
[1] => asd1
[2] => lol1
)
[1] => Array
(
[0] => asd2.lol2
[1] => asd2
[2] => lol2
)
)
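To get from this output to the nested array the question asks for, one option is to drop the whole-match entry at index 0 of each set. A minimal sketch, with $groups as an illustrative variable name:

```php
<?php
$str = 'asd1.lol1,asd2.lol2';
$pattern = '/(?<=^|,)(\w+)\.(\w+)/';
preg_match_all($pattern, $str, $result, PREG_SET_ORDER);

// Remove index 0 (the whole match) and reindex the captured groups from 0.
$groups = array_map(function (array $match) {
    return array_slice($match, 1);
}, $result);

print_r($groups);
```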
You're probably looking for preg_match_all.
$regex = '/^((\w+)\.(\w+)),((\w+)\.(\w+))$/';
$string = 'asd1.lol1,asd2.lol2';
preg_match_all($regex, $string, $matches);
This function creates a 2-dimensional array. The first dimension represents the capture groups (the parentheses; index 0 holds the whole matched string), and each entry is a sub-array with one element per match (only one in this case).
[0] => ("asd1.lol1,asd2.lol2") // a view of $matches
[1] => ("asd1.lol1")
[2] => ("asd1")
[3] => ("lol1")
[4] => ("asd2.lol2")
[5] => ("asd2")
[6] => ("lol2")
Your best bet to have groups is to process the first dimension of the array that you want and to then process them further, i.e. get "asd1.lol1" from 1 and 4 and then process these further into asd1 and lol1.
You wouldn't need as many parentheses in your first run:
$regex = '/^(\w+\.\w+),(\w+\.\w+)$/';
will yield:
[0] => ("asd1.lol1,asd2.lol2")
[1] => ("asd1.lol1")
[2] => ("asd2.lol2")
Then you can split the array in 1 and 2 into more granular values.
Flags can be set to preg_match_all to order the output differently. Particularly, PREG_SET_ORDER allows you to have all matched instances in the same subarray. This is of little importance if you're only processing one string, but if you're matching a pattern in a text, it might be more convenient to have all info about one match in $matches[0], and so forth.
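A short sketch of the difference between the two orderings, using an unanchored pattern so that the sample string produces two matches. The variable names are illustrative.

```php
<?php
$string = 'asd1.lol1,asd2.lol2';
$regex = '/(\w+)\.(\w+)/';

// Default PREG_PATTERN_ORDER: one sub-array per capture group.
preg_match_all($regex, $string, $byGroup);

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($regex, $string, $byMatch, PREG_SET_ORDER);

print_r($byGroup);
print_r($byMatch);
```

With the default ordering, $byGroup[1] lists every first group; with PREG_SET_ORDER, $byMatch[0] holds everything about the first match.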
Note that if you're just separating a string by comma and then by any periods, you might not need regular expressions and could conveniently use explode() as so:
$string = 'asd1.lol1,asd2.lol2';
$matches = explode(',', $string);
foreach($matches as &$match) {
$match = explode('.', $match);
}
This will give you exactly what you want, but do note that you don't have as much control over the process as with regular expressions – for instance, asd1.lol1.lmao,asd2.lol2.rofl.hehe will also work and they'll produce bigger arrays than you may want. You can check with count() on the size of the subarray and handle the cases when the array isn't of the appropriate size, though. I still believe that's more comfortable than using regular expressions.
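The count() check mentioned above could look like the following sketch. The function name splitPairs and the choice to silently skip malformed chunks are assumptions; throwing an exception would be just as reasonable.

```php
<?php
// Hypothetical helper: keeps only chunks that split into exactly two parts.
function splitPairs(string $string): array
{
    $pairs = [];
    foreach (explode(',', $string) as $chunk) {
        $parts = explode('.', $chunk);
        if (count($parts) !== 2) {
            continue; // malformed chunk such as "asd1.lol1.lmao"
        }
        $pairs[] = $parts;
    }

    return $pairs;
}

print_r(splitPairs('asd1.lol1,asd2.lol2'));
print_r(splitPairs('asd1.lol1.lmao,asd2.lol2'));
```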

Split a single string into an array using specific Regex rules

I'm processing a single string which contains many pairs of data. Each pair is separated by a ; sign. Each pair contains a number and a string, separated by an = sign.
I thought it would be easy to process, but I've found that the string half of the pair can contain the = and ; signs, making simple splitting unreliable.
Here is an example of a problematic string:
123=one; two;45=three=four;6=five;
For this to be processed correctly I need to split it up into an array that looks like this:
'123', 'one; two'
'45', 'three=four'
'6', 'five'
I'm at a bit of dead end so any help is appreciated.
UPDATE:
Thanks to everyone for the help, this is where I am so far:
$input = '123=east; 456=west';
// split matches into array
preg_match_all('~(\d+)=(.*?);(?=\s*(?:\d|$))~', $input, $matches);
$newArray = array();
// extract the relevant data
for ($i = 0; $i < count($matches[2]); $i++) {
$type = $matches[2][$i];
$price = $matches[1][$i];
// add each key-value pair to the new array
$newArray[$i] = array(
'type' => "$type",
'price' => "$price"
);
}
Which outputs
Array
(
[0] => Array
(
[type] => east
[price] => 123
)
)
The second item is missing as it doesn't have a semicolon on the end, and I'm not sure how to fix that.
I've now realised that the numeric part of the pair sometimes contains a decimal point, and that the last string pair does not have a semicolon after it. Any hints would be appreciated as I'm not having much luck.
Here is the updated string taking into account the things I missed in my initial question (sorry):
12.30=one; two;45=three=four;600.00=five
You need a look-ahead assertion for this; the look-ahead matches if a ; is followed by a digit or the end of your string:
$s = '12.30=one; two;45=three=four;600.00=five';
preg_match_all('/(\d+(?:\.\d+)?)=(.+?)(?=;\d|$)/', $s, $matches);
print_r(array_combine($matches[1], $matches[2]));
Output:
Array
(
[12.30] => one; two
[45] => three=four
[600.00] => five
)
I think this is the regex you want:
\s*(\d+)\s*=(.*?);(?=\s*(?:\d|$))
The trick is to consider only the semicolon that's followed by a digit as the end of a match. That's what the lookahead at the end is for.
You can see a detailed visualization on www.debuggex.com.
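A minimal sketch of that regex applied from PHP to the string from the original question (the one with a trailing semicolon). The $pairs variable and the array_map step are illustrative additions.

```php
<?php
$input = '123=one; two;45=three=four;6=five;';

// A semicolon ends a pair only when a digit or the end of the string follows.
preg_match_all('~\s*(\d+)\s*=(.*?);(?=\s*(?:\d|$))~', $input, $matches, PREG_SET_ORDER);

// Keep just the number and the text of each match.
$pairs = array_map(function (array $m) {
    return [$m[1], $m[2]];
}, $matches);

print_r($pairs);
```

Because this pattern requires a semicolon after every value, it does not pick up a final pair that lacks one, as in the updated string from the question.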
You can use following preg_match_all code to capture that:
$str = '123=one; two;45=three=four;6=five;';
if (preg_match_all('~(\d+)=(.+?);(?=\d|$)~', $str, $arr))
print_r($arr);
Live Demo: http://ideone.com/MG3BaO
$str = '123=one; two;45=three=four;6=five;';
preg_match_all('/(\d+)=([a-zA-Z ;=]+)/', $str,$matches);
echo '<pre>';
print_r($matches);
echo '</pre>';
Output:
Array
(
[0] => Array
(
[0] => 123=one; two;
[1] => 45=three=four;
[2] => 6=five;
)
[1] => Array
(
[0] => 123
[1] => 45
[2] => 6
)
[2] => Array
(
[0] => one; two;
[1] => three=four;
[2] => five;
)
)
Then you can combine them:
echo '<pre>';
print_r(array_combine($matches[1],$matches[2]));
echo '</pre>';
Output:
Array
(
[123] => one; two;
[45] => three=four;
[6] => five;
)
Try this. The code is written in C#, but you can port it to PHP:
string[] res = Regex.Split("123=one; two;45=three=four;6=five;", @";(?=\d)");
--SJ

How to get everything from between parentheses in PHP?

Array(
[1] => put returns (between) paragraphs
[2] => (for) linebreak (add) 2 spaces at end
[3] => indent code by 4 (spaces!)
[4] => to make links
)
Want to get text inside brackets (for each value):
take only first match
remove this match from the value
write all matches to new array
After function arrays should look like:
Array(
[1] => put returns paragraphs
[2] => linebreak (add) 2 spaces at end
[3] => indent code by 4
[4] => to make links
)
Array(
[1] => between
[2] => for
[3] => spaces!
[4] =>
)
What is the solution?
I would use the regular expression /\((\([^()]*\)|[^()]*)\)/ (this will match one or two pairs of parentheses) together with preg_split:
$matches = array();
foreach ($arr as &$value) {
$parts = preg_split('/\((\([^()]*\)|[^()]*)\)/', $value, 2, PREG_SPLIT_DELIM_CAPTURE);
if (count($parts) > 1) {
$matches[] = current(array_splice($parts, 1, 1));
$value = implode('', $parts);
}
}
With the PREG_SPLIT_DELIM_CAPTURE flag set, preg_split includes the captured part of each separator in the result array. So if a match was found, there are at least three parts, and the second member is the one we are looking for. That member is removed with array_splice, which also returns an array of the removed members. To get the removed member itself, current is used on the return value of array_splice. The remaining members are then put back together.
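A step-by-step sketch of that explanation for a single value, so the intermediate $parts array is visible. The variable $removed is an illustrative name.

```php
<?php
$value = 'put returns (between) paragraphs';

// Limit 2 splits at the first parenthesised group only; the captured
// content is returned between the text before and the text after it.
$parts = preg_split('/\((\([^()]*\)|[^()]*)\)/', $value, 2, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);

// Pull the captured content out of the array, then rejoin the rest.
$removed = current(array_splice($parts, 1, 1));
$value = implode('', $parts);

var_dump($removed, $value);
```

Note that the spaces on both sides of the removed group stay in place, so the rejoined value contains a double space at that spot.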
Assuming you meant (between) and not ((between))
$arr = array(
0 => 'put returns (between) paragraphs',
1 => '(for) linebreak (add) 2 spaces at end',
2 => 'indent code by 4 (spaces!)',
3 => 'to make links');
var_dump($arr);
$new_arr = array();
foreach($arr as $key => &$str) {
if(preg_match('/\((.*?)\)/',$str,$m)) {
$new_arr[] = $m[1];
$str = preg_replace('/\(.*?\)/','',$str,1);
}
else {
$new_arr[] = '';
}
}
var_dump($arr);
var_dump($new_arr);
