regex split a string between [ and ] - php

My string looks like '[15][18][22]' and I would like to split it into an array of [15], [18] and [22]. I'm trying this regex:
\[\d+\]
But it only splits off the first one.
Thanks for the help.

You are better off using preg_match_all with what you want to capture:
$str = '[15][18][22]';
if (preg_match_all('/\[\d+]/', $str, $m)) {
    print_r($m[0]); // $m[0] holds the full matches
}
Output:
Array
(
[0] => [15]
[1] => [18]
[2] => [22]
)
Or else you may use this preg_split with a capture group:
$str = '[15][18][22]';
$arr = preg_split('/(\[\d+])/', $str, -1,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($arr);
Output:
Array
(
[0] => [15]
[1] => [18]
[2] => [22]
)

It just doesn't get any simpler than this: three characters in the pattern. You only need to split at the zero-width position after each ]. \K tells the regex engine to drop everything matched so far from the reported match, so the ] stays in the output.
Pattern: ~]\K~
Code:
$string = '[15][18][22]';
var_export(preg_split('~]\K~', $string, -1, PREG_SPLIT_NO_EMPTY));
Output:
array (
0 => '[15]',
1 => '[18]',
2 => '[22]',
)
This should perform very well because the pattern has no capture groups, lookarounds, or alternation to slow it down.
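As a quick sanity check, the sketch below runs the three approaches from the answers above on the same input and confirms that they agree. The variable names are only illustrative.

```php
<?php
$str = '[15][18][22]';

// 1. Collect every bracketed number directly.
preg_match_all('/\[\d+]/', $str, $m);
$viaMatchAll = $m[0];

// 2. Split on the bracketed number and keep the captured delimiter.
$viaCapture = preg_split('/(\[\d+])/', $str, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// 3. Split at the zero-width position after each ].
$viaK = preg_split('~]\K~', $str, -1, PREG_SPLIT_NO_EMPTY);

// All three produce the same list for this input.
var_dump($viaMatchAll === $viaCapture && $viaCapture === $viaK);
```

The three only agree because the input consists of nothing but bracketed numbers. If there is other text between the brackets, preg_match_all skips it, while both preg_split variants keep it in the result.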

Related

Whitespace delimiter not being captured in preg split

<?php
$text = "Testing text splitting\nWith a newline!";
$textArray = preg_split('/\s+/', $text, 0, PREG_SPLIT_DELIM_CAPTURE);
print_r($textArray);
The above code will output the following:
Array
(
[0] => Testing
[1] => text
[2] => splitting
[3] => With
[4] => a
[5] => newline!
)
However to my knowledge the PREG_SPLIT_DELIM_CAPTURE flag should be capturing the whitespace delimiters in the array. Am I missing something?
edit: Ok, after rereading the documentation I now understand PREG_SPLIT_DELIM_CAPTURE is not meant for this case. My desired output would be something like:
Array
(
[0] => Testing
[1] => ' '
[2] => text
[3] => ' '
[4] => splitting
[5] => '\n'
[6] => With
[7] => ' '
[8] => a
[9] => ' '
[10] => newline!
)
If you read the manual entry for PREG_SPLIT_DELIM_CAPTURE once again, it says:
If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
In other words, the expression in the delimiter pattern (in your case \s+) is captured (i.e. added to the result) only when it is in parentheses. So you can write:
$text = "Testing text splitting\nWith a newline!";
$textArray = preg_split('/(\s+)/', $text, 0, PREG_SPLIT_DELIM_CAPTURE);
// parentheses!
print_r($textArray);
You can also use T-Regx library:
$textArray = pattern('(\s+)')->split("Testing text splitting\nWith a newline!")->inc();
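To make the captured whitespace visible, here is a small sketch that prints the result as JSON, so spaces and the newline show up as quoted strings. Using json_encode for display is just a convenience and not part of the original answer.

```php
<?php
$text = "Testing text splitting\nWith a newline!";

// The parentheses around \s+ make preg_split return the delimiters too.
$parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// json_encode shows whitespace explicitly, e.g. " " and "\n".
echo json_encode($parts), "\n";
// ["Testing"," ","text"," ","splitting","\n","With"," ","a"," ","newline!"]
```

Words and delimiters alternate, so every odd index holds a whitespace run.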

how to split a string containing numbers in or outside parentheses using preg_match_all

I have a string that looks something like this:
535354 345356 3543674 34667 2345347 -3536 4532452 (234536 2345634 -4513453) (2345 -13254 13545)
The text between () is always at the end of the string (at least for now).
I need to split it into an array similar to this:
[0] => [0] 535354,345356,3543674,34667,2345347,-3536,4532452
[1] => [0] 234536,2345634,-4513453
=> [1] 2345,-13254,13545
What expression should I use for preg_match_all?
The best I could get with my limited knowledge is /([0-9]{1,}){1,}.*(?=(\(.*\)))/U but I still get some unwanted elements.
You may use a regex that matches chunks of numbers both inside and outside parentheses: "~(?<=\()\s*$numrx\s*(?=\))|\s*$numrx~", where $numrx stands for the number regex (which can be enhanced further).
The number regex -?\d+(?:\s+-?\d+)* matches an optional -, 1 or more digits, and then 0 or more sequences of 1 or more whitespace characters followed by an optional - and 1 or more digits. (?<=\()\s*$numrx\s*(?=\)) matches the same, but only if preceded by ( and followed by ).
See this PHP snippet:
$s = "535354 345356 3543674 34667 2345347 -3536 4532452 (234536 2345634 -4513453) (2345 -13254 13545)";
$numrx = "-?\d+(?:\s+-?\d+)*";
preg_match_all("~(?<=\()\s*$numrx\s*(?=\))|\s*$numrx~", $s, $m);
$res = array();
foreach ($m[0] as $k) {
    $res[] = explode(" ", trim($k));
}
print_r($res);
Output:
[0] => Array
(
[0] => 535354
[1] => 345356
[2] => 3543674
[3] => 34667
[4] => 2345347
[5] => -3536
[6] => 4532452
)
[1] => Array
(
[0] => 234536
[1] => 2345634
[2] => -4513453
)
[2] => Array
(
[0] => 2345
[1] => -13254
[2] => 13545
)
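The output above is a flat list of three arrays, while the question asked for the outside numbers in one element and the parenthesized groups nested in a second one. A possible two-pass sketch for that shape follows. Note that this is a different technique from the single regex in the answer, and $inside, $outside and $result are names made up for this illustration.

```php
<?php
$s = "535354 345356 3543674 34667 2345347 -3536 4532452 (234536 2345634 -4513453) (2345 -13254 13545)";

// Pass 1: capture the contents of every (...) group and split each on whitespace.
preg_match_all('~\(([^()]*)\)~', $s, $m);
$inside = array_map(function ($group) {
    return preg_split('~\s+~', trim($group), -1, PREG_SPLIT_NO_EMPTY);
}, $m[1]);

// Pass 2: remove the groups, then split whatever is left.
$rest = preg_replace('~\([^()]*\)~', '', $s);
$outside = preg_split('~\s+~', trim($rest), -1, PREG_SPLIT_NO_EMPTY);

// [0] => numbers outside parentheses, [1] => one array per parenthesized group.
$result = array($outside, $inside);
print_r($result);
```

This sketch assumes the parentheses are not nested. It does not depend on the groups being at the end of the string.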
You can use this regex in preg_match_all:
$re = '/\d+(?=[^()]*[()])/';
RegEx Breakup:
\d+ # match 1 or more digits
(?= # lookahead start
[^()]* # match anything but ( or )
[()] # match ( or )
) # lookahead end

Sscanf with regex to match even an empty string

I am using this format string in sscanf:
sscanf($seat, "%d-%[^(](%[^#]#%[^)])");
It works well when I get this kind of string:
173-9B(AA#3.45 EUR#32H)
but when I get this kind of string:
173-9B(#3.14 EUR#32H)
the result is all messed up. How can I also accept an empty string between the first ( and the first #?
You would be better off using a regex in preg_match to handle optional data presence in input:
$re = '/(\d*)-([^(]*)\(([^#]*)#([^)]*)\)/';
preg_match($re, '173-9B(#3.45 EUR#32H)', $m);
unset($m[0]);
print_r($m);
Output:
Array
(
[1] => 173
[2] => 9B
[3] =>
[4] => 3.45 EUR#32H
)
And 2nd example:
preg_match($re, '173-9B(AA#3.45 EUR#32H)', $m);
unset($m[0]);
print_r($m);
Array
(
[1] => 173
[2] => 9B
[3] => AA
[4] => 3.45 EUR#32H
)
Using ([^#]*) makes it match 0 or more characters that are not #, so an empty field is accepted.
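If you prefer named fields over numeric indexes, the same pattern can be wrapped in a small helper. This is only a sketch: parseSeat and the group names (number, seat, code, rest) are hypothetical labels chosen for illustration, and the ^ and $ anchors are an addition that rejects input with extra text around it.

```php
<?php
// Hypothetical helper: returns the four fields by name, or null if the input does not match.
function parseSeat($seat)
{
    $re = '/^(?<number>\d+)-(?<seat>[^(]*)\((?<code>[^#]*)#(?<rest>[^)]*)\)$/';
    if (!preg_match($re, $seat, $m)) {
        return null;
    }
    // preg_match returns both numeric and named keys; keep only the named ones.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

var_export(parseSeat('173-9B(#3.14 EUR#32H)'));
```

Because the code group uses *, an empty value between ( and the first # yields an empty string instead of a failed parse.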

preg_match_all all combinations with word boundaries

I've got the following string:
$string = "König Friedrich August III. von Sachsen - Adel Sachsen, Waidmannsheil, Kapitaler 16ender erlegt auf der Jagd am 2. Oktober 1905, gelaufen 30.06.1909, Verlag, Karlowa Walter, Dresden";
Now I want to find words in that string using preg_match_all:
preg_match_all("/\b(abituria)\b|\b(absolvia)\b|\b(adel sachsen)\b|\b(adel)\b|\b(sachsen)\b|\b(könig)\b/i",$string,$matches);
The string matches only for
array(
0 => "König",
1 => "Adel Sachsen"
)
but I need it to also return "Adel" in the $matches array.
How can I do that? I think my problem is that: "After the first match is found, the subsequent searches are continued on from end of the last match."
Update
That does not work:
preg_match_all('/(?=\b(adel sachsen|adel)\b)/ui', $string, $matches);
print_r($matches[1]);
Array
(
[0] => Adel Sachsen
)
preg_match_all('/(?=\b(adel|adel sachsen)\b)/ui', $string, $matches);
print_r($matches[1]);
Array
(
[0] => Adel
)
But I need the following as the result:
Array
(
[0] => Adel Sachsen,
[1] => Adel
)
I would just search for each word/combination (generate a pattern for each) and map the according match to the result array or set false, if it doesn't match. Then filter the false elements:
$arr = ["nadel", "adel", "knödel", "sachsen", "adel sachsen"];
$str = "Friedrich August III. von Sachsen - Adel Sachsen";
$res = array_filter(array_map(function ($s) use (&$str) {
    $s = '/\b' . preg_quote($s, '/') . '\b/iu';
    return preg_match($s, $str, $out) ? $out[0] : false;
}, $arr));
sort($res);
print_r($res);
Note that anonymous functions with array_map require at least PHP 5.3. Output:
Array
(
[0] => Adel
[1] => Adel Sachsen
[2] => Sachsen
)
The function can be improved further to return arrays, for example if you want every differently cased occurrence of the same word or need to capture the offsets.
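One way such an extension might look is sketched below, using preg_match_all with PREG_OFFSET_CAPTURE so that every occurrence is returned together with its byte offset. Keying the result by search term is a choice made for this illustration.

```php
<?php
$arr = ["nadel", "adel", "knödel", "sachsen", "adel sachsen"];
$str = "Friedrich August III. von Sachsen - Adel Sachsen";

$res = array();
foreach ($arr as $term) {
    $re = '/\b' . preg_quote($term, '/') . '\b/iu';
    // With PREG_OFFSET_CAPTURE each match is a pair: [matched text, byte offset].
    if (preg_match_all($re, $str, $out, PREG_OFFSET_CAPTURE)) {
        $res[$term] = $out[0];
    }
}
print_r($res);
```

Terms that do not occur ("nadel", "knödel") are simply absent from $res, so no filtering step is needed. The offsets are byte offsets, which matters for multi-byte UTF-8 input.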
You can use a lookahead to get your overlapping matches:
preg_match_all('/(?=\b(abituria|absolvia|adel sachsen|adel|sachsen|könig)\b)/ui',
$string, $matches);
print_r($matches[1]);
Array
(
[0] => König
[1] => Sachsen
[2] => Adel Sachsen
[3] => Sachsen
)
Update: Based on your updated code snippet you can do this:
preg_match_all('/(?=\b(adel sachsen)\b)(?=\b(adel)\b)/ui', $string, $matches);
unset($matches[0]);
print_r($matches);
Output:
Array
(
[1] => Array
(
[0] => Adel Sachsen
)
[2] => Array
(
[0] => Adel
)
)
As you already noticed, preg_match_all continues searching after the end of each last match, so it is not the best tool for your task.
The easy but less performant solution would be to do one preg_match for each single search term instead.
If the strings are not much longer than your example I would go for this, optimizing it seems not to be worth it.
If performance is really critical, I would group each term with the other terms that are prefixes of it, ordering each group by longest term first:
abituria
absolvia
adel sachsen, adel
sachsen
könig
Now use the regex with lookahead assertion:
preg_match_all('/(?=\b(abituria|absolvia|adel sachsen|adel|sachsen|könig)\b)/ui',
$string, $matches);
If $string contains "adel", but not "adel sachsen", it will match correctly. If it contains "adel sachsen", it will only match "adel sachsen", but from the groups that we constructed before, we know that it also matches prefixes of "adel sachsen", i.e. "adel".
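The prefix-group idea above could be sketched roughly as follows. The expansion loop and the variable names ($terms, $found) are my own illustration and not part of the original answer. The sketch assumes the source file and the input are UTF-8.

```php
<?php
$string = "König Friedrich August III. von Sachsen - Adel Sachsen, Waidmannsheil, Kapitaler 16ender erlegt auf der Jagd am 2. Oktober 1905, gelaufen 30.06.1909, Verlag, Karlowa Walter, Dresden";
$terms = array('abituria', 'absolvia', 'adel sachsen', 'adel', 'sachsen', 'könig');

// Longest terms first, so the alternation prefers "adel sachsen" over "adel".
usort($terms, function ($a, $b) {
    return strlen($b) - strlen($a);
});
$alternation = implode('|', array_map(function ($t) {
    return preg_quote($t, '/');
}, $terms));
preg_match_all('/(?=\b(' . $alternation . ')\b)/ui', $string, $matches);

$found = array();
foreach ($matches[1] as $hit) {
    $found[] = $hit;
    // Add every shorter term that is a whole-word prefix of this hit.
    foreach ($terms as $t) {
        if (strlen($t) < strlen($hit)
            && preg_match('/^' . preg_quote($t, '/') . '\b/ui', $hit, $p)) {
            $found[] = $p[0];
        }
    }
}
print_r($found);
```

For long term lists you would precompute the prefix groups once instead of testing every term against every hit, as this sketch does.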

Regex for splitting on all unescaped semi-colons

I'm using php's preg_split to split up a string based on semi-colons, but I need it to only split on non-escaped semi-colons.
<?php
$str = "abc;def\\;abc;def";
$arr = preg_split("/;/", $str);
print_r($arr);
?>
Produces:
Array
(
[0] => abc
[1] => def\
[2] => abc
[3] => def
)
When I want it to produce:
Array
(
[0] => abc
[1] => def\;abc
[2] => def
)
I've tried "/(^\\)?;/" or "/[^\\]?;/" but they both produce errors. Any ideas?
This works.
<?php
$str = "abc;def\;abc;def";
$arr = preg_split('/(?<!\\\);/', $str);
print_r($arr);
?>
It outputs:
Array
(
[0] => abc
[1] => def\;abc
[2] => def
)
You need to make use of a negative lookbehind (read about lookarounds). Think of it as "match every ';' unless it is preceded by a '\'".
I am not really proficient with PHP regexes, but try this one:
/(?<!\\);/
Since Bart asks: Of course you can also use regex to split on unescaped ; and take escaped escape characters into account. It just gets a bit messy:
<?php
$str = "abc;def\;abc\\\\;def";
preg_match_all('/((?:[^\\\\;]|\\\.)*)(?:;|$)/', $str, $arr);
print_r($arr);
?>
Array
(
[0] => Array
(
[0] => abc;
[1] => def\;abc\\;
[2] => def
[3] =>
)
[1] => Array
(
[0] => abc
[1] => def\;abc\\
[2] => def
[3] =>
)
)
What this does is take a regular expression for “(any character except \ and ;) or (\ followed by any character)” and allow any number of those, followed by a ; or the end of the string. Because the pattern can also match the empty string at the very end of the input, preg_match_all returns one extra empty trailing element when the last field is non-empty; drop it if you don't need it.
Note that $ in PHP's PCRE also matches just before a final newline by default; use the D modifier (or \z instead of $) if it must match only at the very end of the string.
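If you would rather stay with preg_split and still treat an escaped backslash before a semi-colon (\\;) as a real delimiter, one possible sketch combines the lookbehind with \K. The pattern below is my own variation and not taken from the answers above: it skips over complete pairs of backslashes and then splits only on the ; that follows them.

```php
<?php
// Single-quoted, so the string contains: abc;def\;abc\\;def
$str = 'abc;def\;abc\\\\;def';

// (?<!\\)    the position is not preceded by a backslash
// (?:\\\\)*  skip any number of complete backslash pairs
// \K         exclude those pairs from the delimiter
// ;          the semi-colon that is actually split on
$arr = preg_split('/(?<!\\\\)(?:\\\\\\\\)*\K;/', $str);
print_r($arr);
```

A ; preceded by an odd number of backslashes is left alone, while an even number (including zero) makes it a delimiter. The backslashes themselves stay in the fields, so unescaping is a separate step.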
