Why does this regex have 3 matches, not 5? - php

I wrote a pretty simple preg_match_all call in PHP:
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/[0-9][0-9]/',$fileName,$matches);
print_r($matches);
My Expected Output:
$matches = array(
[0] => array(
[0] => 09,
[1] => 91,
[2] => 14,
[3] => 41,
[4] => 10
)
)
What I got instead:
$matches = array(
[0] => array(
[0] => 09,
[1] => 14,
[2] => 10
)
)
Now, in this particular use case this was preferable, but I'm wondering why it didn't match the other substrings? Also, is a regex possible that would give me my expected output, and if so, what is it?

With a global regex (which is what preg_match_all uses), once a match is made, the regex engine continues searching the string from the end of the previous match.
In your case, the regular expression engine starts at the beginning of the string, and advances until the 0, since that is the first character that matches [0-9]. It then advances to the next position (9), and since that matches the second [0-9], it takes 09 as a match. When the engine continues matching (since it has not yet reached the end of the string), it advances its position again (to 1) (and then the above repeats).
See also: First Look at How a Regex Engine Works Internally
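To make the resume position visible, here is a small sketch (the $texts and $offsets names are mine, not from the question) that asks preg_match_all for offsets via PREG_OFFSET_CAPTURE. Each match starts exactly where the previous one ended, so the pairs beginning at the positions in between are never tried. The array destructuring in the foreach assumes PHP 7.1 or later.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// PREG_OFFSET_CAPTURE turns every match into a [text, offset] pair.
preg_match_all('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE);

$texts   = array_column($matches[0], 0); // the matched substrings
$offsets = array_column($matches[0], 1); // where each match starts

foreach ($matches[0] as [$text, $offset]) {
    echo "$text starts at offset $offset\n";
}
// 09 starts at offset 13
// 14 starts at offset 15
// 10 starts at offset 17
```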
If you must get every 2 digit sequence, you can use preg_match and use offsets to determine where to start capturing from:
$fileName = 'A_DATED_FILE_091410.txt';
$allSequences = array();
$matches = array();
$offset = 0;
while (preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, $offset))
{
    list($match, $offset) = $matches[0];
    $allSequences[] = $match;
    $offset++; // the match is 2 digits long, so resume the search at its second digit
}
Note that the offset returned with the PREG_OFFSET_CAPTURE flag is the start of the match.
I've got another solution that will get five matches without having to use offsets, but I'm adding it here just for curiosity, and I probably wouldn't use it myself in production code (it's a somewhat complex regex too). You can use a regex that uses a lookbehind to look for a number before the current position, and captures the number in the lookbehind (in general, lookarounds are non-capturing):
(?<=([0-9]))[0-9]
Let's walk through this regex:
(?<= # open a positive lookbehind
( # open a capturing group
[0-9] # match 0-9
) # close the capturing group
) # close the lookbehind
[0-9] # match 0-9
Because lookarounds are zero-width and do not move the regex position, this regular expression will match 5 times: the engine will advance until the 9 (because that is the first position which satisfies the lookbehind assertion). Since 9 matches [0-9], the engine will take 9 as a match (but because we're capturing in the lookaround, it'll also capture the 0!). The engine then moves to the 1. Again, the lookbehind succeeds (and captures the 9 as group 1), and the 1 becomes the full match (and so on, until the engine hits the end of the string).
When we give this pattern to preg_match_all, we'll end up with an array that looks like (using the PREG_SET_ORDER flag to group capturing groups along with the full match):
Array
(
[0] => Array
(
[0] => 9
[1] => 0
)
[1] => Array
(
[0] => 1
[1] => 9
)
[2] => Array
(
[0] => 4
[1] => 1
)
[3] => Array
(
[0] => 1
[1] => 4
)
[4] => Array
(
[0] => 0
[1] => 1
)
)
Note that each "match" has its digits out of order! This is because the capture group in the lookbehind becomes group 1 while the whole match is group 0. We can put it back together in the correct order though:
preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches, PREG_SET_ORDER);
$allSequences = array();
foreach ($matches as $match)
{
    $allSequences[] = $match[1] . $match[0];
}
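As a variation on the lookbehind idea (this is a different technique from the one above, offered only as a sketch), you can capture both digits inside a lookahead instead. The overall match is then empty, and group 1 holds each pair already in the right order, so no reassembly is needed:

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// The lookahead consumes nothing, so the engine retries at every position;
// the capturing group inside it records the two digits it saw.
preg_match_all('/(?=([0-9]{2}))/', $fileName, $matches);

$allSequences = $matches[1]; // group 1 of every (empty) match
print_r($allSequences);
```

Here $matches[0] contains only empty strings; the useful data is in $matches[1].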

The search for the next match starts at the first character after the previous match. So when 09 is matched in 091410, the search for the next match starts at 1410.

"Also, is a regex possible that would give me my expected output, and if so, what is it?"
A pattern that consumes both digits will not work, because the engine won't match the same section twice. But you could do something like this:
$i = 0;
while (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $i))
{
    $i = $matches[0][1] + 1; /* resume one character past the start of this match */
}
The above is not safe for the general case. If $i does not advance past the start of the previous match, the same match is found again and the loop never ends; how far to advance depends on the pattern. Also, you may not want [0][1], but instead something like [1][1] etc, again, depending on the pattern.
For this particular case, I think it would be much simpler to do it yourself:
$l = strlen($s);
$prev_digit = false;
for ($i = 0; $i < $l; ++$i)
{
    if ($s[$i] >= '0' && $s[$i] <= '9')
    {
        if ($prev_digit) { /* found match */ }
        $prev_digit = true;
    }
    else
        $prev_digit = false;
}
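A runnable version of that loop could look like the sketch below. The digitPairs function name is hypothetical, and I have swapped the character comparison for ctype_digit, which assumes the ctype extension is available (it is enabled by default).

```php
<?php
// Collect every pair of adjacent digits, including overlapping pairs.
function digitPairs(string $s): array
{
    $pairs = [];
    $prevDigit = false;
    $len = strlen($s);
    for ($i = 0; $i < $len; ++$i) {
        $isDigit = ctype_digit($s[$i]);
        if ($isDigit && $prevDigit) {
            // The previous and the current character form a two-digit pair.
            $pairs[] = $s[$i - 1] . $s[$i];
        }
        $prevDigit = $isDigit;
    }
    return $pairs;
}

print_r(digitPairs('A_DATED_FILE_091410.txt'));
```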

Just for fun, another way to do it:
<?php
$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/(?<=([0-9]))[0-9]/',$fileName,$matches);
$result = array();
foreach($matches[1] as $i => $behind)
{
    $result[] = $behind . $matches[0][$i];
}
print_r($result);
?>

Related

How to split math operators from numbers in a string

I'm trying to split a string between operators and float/int values.
Example:
$input = ">=2.54";
Output should be:
array(0=>">=",1=>"2.54");
Operator cases: >, >=, <, <=, =
I tried something like this:
$input = '0.2>';
$exploded = preg_split('/[0-9]+\./', $input);
but it's not working.
Here is a working version using preg_split:
$input = ">=2.54";
$parts = preg_split("/(?<=[\d.])(?=[^\d.])|(?<=[^\d.])(?=[\d.])/", $input);
print_r($parts);
This prints:
Array
(
[0] => >=
[1] => 2.54
)
Here is an explanation of the regex used, which says to split when:
(?<=[\d.])(?=[^\d.]) a digit/dot precedes and a non digit/dot follows
| OR
(?<=[^\d.])(?=[\d.]) a non digit/dot precedes and a digit/dot follows
That is, we split at the interface between a number, possibly a decimal, and an arithmetic symbol.
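As a quick check of that boundary rule, here is a sketch that runs the same pattern over a few sample inputs (the $inputs list is my own, covering an operator before and after the number):

```php
<?php
// Split at every boundary between number characters (digits or a dot)
// and anything else. Both alternatives are zero-width, so nothing is lost.
$pattern = '/(?<=[\d.])(?=[^\d.])|(?<=[^\d.])(?=[\d.])/';

$inputs = ['>=2.54', '=5', '0.2>'];
$results = [];
foreach ($inputs as $input) {
    $results[$input] = preg_split($pattern, $input);
}
print_r($results);
```

Because the split points are zero-width, the operator and the number both survive intact, whichever comes first.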
Try:
$input = ">=2.54";
preg_match("/([<>]?=?) ?(\d*(?:\.\d+)?)/",$input,$exploded);
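To show what that call leaves in $exploded, here is a short sketch (the $parts variable is my own addition). Index 0 holds the whole match, so the operator and the number are in groups 1 and 2:

```php
<?php
$input = '>=2.54';
preg_match('/([<>]?=?) ?(\d*(?:\.\d+)?)/', $input, $exploded);

// Drop the full match at index 0 and keep only the two captured groups.
$parts = array_slice($exploded, 1);
print_r($parts);
```

Note that this pattern expects the operator first. For an input such as '0.2>' the operator group comes back empty and only the number is captured.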
If you want to split between the operators, you might use an alternation to match the variations of the operators, and use \K to reset the starting point of the reported match.
This will give you the position to split on. Then assert using lookarounds that there is a digit on the left or on the right.
\d\K(?=[<>=])|(?:>=?|<=?|=)\K(?=\d)
Explanation
\d\K(?=[<>=]) Match a digit, forget what was matched and assert either <, > or = on the right
| Or
(?:>=?|<=?|=)\K(?=\d) Match an operator, forget what was matched and assert a digit on the right
For example
$strings = [
">=2.54",
"=5",
"0.2>"
];
$pattern = '/\d\K(?=[<>=])|(?:>=?|<=?|=)\K(?=\d)/';
foreach ($strings as $string) {
    print_r(preg_split($pattern, $string));
}
Output
Array
(
[0] => >=
[1] => 2.54
)
Array
(
[0] => =
[1] => 5
)
Array
(
[0] => 0.2
[1] => >
)

How to get an array with all images numbers and images letters with regex?

While parsing HTML in PHP, I need to collect all the images referenced by an expression formatted like this:
(fig. 8a-c, 9b-c)
I would like to catch this using a regex in order to output an array such as:
array(
[8] => [a,b,c],
[9] => [b,c])
The expression can be anything like:
(fig. 8)
(fig. 8,9)
(fig. 11a, b)
Here is the regex I have at the moment, but it does not seem to work for every case:
https://regex101.com/r/ShqlnY/3/
Can you help me get an array containing all included images? Thanks
Thanks, I ended up with a regular expression like this:
'/(?:\(fig\.\h*|\G(?!^))(\d+)([a-z])?(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))/m'
used with preg_match_all
You may use
'~(?:\G(?!^),\s*|\(fig\.)\s*\K([0-9]{1,3})([a-z]-[a-z])~'
with preg_match_all to get all the char ranges from inside a (fig. ...) substring, and then use this post-process code:
$rx = "~(?:\G(?!^),\s*|\(fig\.)\s*\K([0-9]{1,3})([a-z]-[a-z])~";
$s = "(fig. 8a-c, 9b-c)";
preg_match_all($rx, $s, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER, 0);
foreach ($matches as $m) {
    $result = [];
    $result[] = $m[0][1]; // Position of the match
    $result[] = $m[1][0]; // The number
    $kv = explode("-", $m[2][0]);
    $result = array_merge($result, buildNumChain($kv));
    print_r($result);
}
function buildNumChain($arr) {
    $ret = [];
    foreach(range($arr[0], $arr[1]) as $letter) {
        $ret[] = $letter;
    }
    return $ret;
}
Output:
Array ( [0] => 6 [1] => 8 [2] => a [3] => b [4] => c )
Array ( [0] => 12 [1] => 9 [2] => b [3] => c )
Regex details
(?:\G(?!^),\s*|\(fig\.) - (fig. or end of the previous match + , and 0+ whitespaces
\s* - 0+ whitespaces
\K - match reset operator
([0-9]{1,3}) - Group 1: 1 to 3 digits
([a-z]-[a-z]) - Group 2: a lowercase letter, - and a lowercase letter.
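The question asks for an array keyed by figure number, so as a sketch building on the same pattern, the matches could be folded into that shape like this. The $figures name is my own, and like the pattern itself this only handles entries written as a range such as 8a-c:

```php
<?php
$rx = '~(?:\G(?!^),\s*|\(fig\.)\s*\K([0-9]{1,3})([a-z]-[a-z])~';
$s = '(fig. 8a-c, 9b-c)';

preg_match_all($rx, $s, $matches, PREG_SET_ORDER);

$figures = [];
foreach ($matches as $m) {
    // $m[1] is the figure number, $m[2] is a range such as "a-c".
    [$from, $to] = explode('-', $m[2]);
    $figures[(int) $m[1]] = range($from, $to);
}
print_r($figures);
```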
Perhaps for your example data you might use a range and a pattern with 3 capturing groups where the third group is optional.
If the third group does not exist, you return the single value in an array, or else you use the second and the third group to create a range.
(?:^\(fig\.\h*|\G(?!^))(\d+)([a-z])(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))
(?: Non capturing group
^\(fig\.\h* Match the start of the string (or of a line, with the m flag) and (fig. followed by 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of the previous match, not at the start
) Close non capturing group
(\d+)([a-z]) Capture 1+ digits in group 1, Capture a-z in group 2
(?: Non capturing group
-([a-z])? Match - and optionally capture a-z in group 3
)? Close non capturing group and make optional
(?:,\h*)? Match optional , and 0+ horizontal whitespace chars
(?=[^)]*\)) Assert what is on the right is a closing parenthesis
For example:
$pattern = "/(?:^\(fig\.\h*|\G(?!^))(\d+)([a-z])(?:-([a-z])?)?(?:,\h*)?(?=[^)]*\))/m";
$str = '(fig. 8a-c, 9b-c)
(fig. 8)
(fig. 8,9)
(fig. 11a, b)';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, 0);
$matches = array_map(function($x){
    if (isset($x[3][0])) {
        return [
            $x[1][0] => range($x[2][0], $x[3][0]),
            "start" => $x[1][1],
            "end" => $x[3][1]
        ];
    }
    // No range end captured: return the single letter, with the offsets of the number and the letter
    return [
        $x[1][0] => [$x[2][0]],
        "start" => $x[1][1],
        "end" => $x[2][1]
    ];
}, $matches);
print_r($matches);
Result
Array
(
[0] => Array
(
[8] => Array
(
[0] => a
[1] => b
[2] => c
)
[start] => 6
[end] => 9
)
[1] => Array
(
[9] => Array
(
[0] => b
[1] => c
)
[start] => 12
[end] => 15
)
)

Regular expression to extract a numeric value on a changing position within a variable string

How can I extract the numeric part of a string that follows /data/, when most of the rest of the string can change? /data/ is always present and followed by the relevant, variable, numeric part (in this case 123456).
differentcontentLocationhttps://example.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484
$str = "differentcontentLocationhttps://example.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484";
$str2 = "differentcontentLocationhttps://example.com/api/result/13548/data/123456";
In this example I need 123456. The only constant parts in the string are /data/ and maybe the first part of the URL, like https://.
preg_match("#/data/([0-9]+)([^0-9]+)#siU", $str, $matches);
Results in Array ( [0] => /data/123456d [1] => 123456 [2] => d ), which would be acceptable. But if there's nothing following the relevant numeric part, like in $str2, this expression fails. I've tried to make the trailing part optional with preg_match("#/data/([0-9]+)(([^0-9]+)?)#siU", $str2, $matches);, but it fails, too, returning only the first digit of the numeric part.
The U greediness-swapping modifier makes all greedy subpatterns lazy here; you should remove it together with ([^0-9]+). You also do not need the DOTALL modifier because there is no . in your pattern whose behavior could be modified by that s flag.
preg_match("#/data/([0-9]+)#i", $str, $matches);
Now, the pattern will match:
/data/ - a sequence of literal chars
([0-9]+) - Group 1 capturing 1+ digits (same as (\d+))
$str = "differentcontentLocationhttps://e...content-available-to-author-only...e.com/api/result/13548/data/123456differentstuffincludingwhitespacesandnewlines8484";
$str2 = "differentcontentLocationhttps://e...content-available-to-author-only...e.com/api/result/13548/data/123456";
preg_match("#/data/([0-9]+)#i", $str, $matches);
print_r($matches); // Array ( [0] => /data/123456 [1] => 123456 )
preg_match("#/data/([0-9]+)#i", $str2, $matches2);
print_r($matches2); // Array ( [0] => /data/123456 [1] => 123456 )
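To see the effect of the U modifier described above, here is a small comparison sketch (the $lazy and $greedy names are mine). With U, [0-9]+ becomes lazy, and since nothing in the pattern follows it, the engine stops after a single digit:

```php
<?php
$str2 = 'differentcontentLocationhttps://example.com/api/result/13548/data/123456';

// With U the quantifier is lazy: one digit is enough to satisfy the pattern.
preg_match('#/data/([0-9]+)#U', $str2, $lazy);

// Without U the quantifier is greedy and takes the whole run of digits.
preg_match('#/data/([0-9]+)#', $str2, $greedy);

echo $lazy[1], "\n";   // 1
echo $greedy[1], "\n"; // 123456
```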

Split string into associative array (while maintaining characters)

I'm trying to figure out how to split a string that looks like this :
a20r51fx500fy3000
into an associative array that will look like this :
array(
'a' => 20,
'r' => 51,
'fx' => 500,
'fy' => 3000,
);
I don't think I can use preg_split as this will drop the character I'm splitting on (I tried /[a-zA-Z]/ but obviously that didn't do what I wanted it to). I'd prefer if I could do it using some kind of built-in function, but I don't really mind looping if that's required.
Any help would be much appreciated!
Multiple Matches and PREG_SET_ORDER
Do this:
$yourstring = "a20r51fx500fy3000";
$regex = '~([a-z]+)(\d+)~';
preg_match_all($regex,$yourstring,$matches,PREG_SET_ORDER);
$yourarray=array();
foreach($matches as $m) {
    $yourarray[$m[1]] = $m[2];
}
print_r($yourarray);
Output:
Array ( [a] => 20 [r] => 51 [fx] => 500 [fy] => 3000 )
If your string can contain upper-case letters, make the regex case-insensitive by adding the i flag after the closing delimiter: $regex = '~([a-z]+)(\d+)~i';
Explanation
([a-z]+) captures letters to Group 1
(\d+) captures digits to Group 2
$yourarray[$m[1]] = $m[2]; creates an index for the letters, and assigns the digits
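If you want integer values, as in the array the question asks for, one possible variation (a sketch, not part of the answer above) skips PREG_SET_ORDER and pairs the two group columns with array_combine, casting the digits with intval:

```php
<?php
$yourstring = 'a20r51fx500fy3000';

// Default (pattern) order: $m[1] lists all letter runs, $m[2] all digit runs.
preg_match_all('~([a-z]+)(\d+)~i', $yourstring, $m);

// Use the letters as keys and the digits, cast to int, as values.
$yourarray = array_combine($m[1], array_map('intval', $m[2]));
var_dump($yourarray);
```

As with the loop version, if the same letters occur twice in the string, the later value overwrites the earlier one.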

Two or more matches in expression

Is it possible to get two kinds of matches from one text: /123/123/123?edit
I need to match 123, 123, 123 and edit.
For the first (123, 123, 123) the pattern is ([^\/]+)
For the second (edit) the pattern is ([^\?=]*$)
Is it possible to match both in one preg_match_all call, or do I need to do it twice, once per pattern?
Thanks!
You can do this with a single preg_match_all call:
$string = '/123/123/123?edit';
$matches = array();
preg_match_all('#(?<=[/?])\w+#', $string, $matches);
/* $matches will be:
Array
(
[0] => Array
(
[0] => 123
[1] => 123
[2] => 123
[3] => edit
)
)
*/
See this in action at http://www.ideone.com/eb2dy
The pattern ((?<=[/?])\w+) uses a lookbehind to assert that either a slash or a question mark must precede a sequence of word characters (\w is a shorthand class equivalent to [A-Za-z0-9_]).
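Since \w also matches upper-case letters and the underscore, the same call works on mixed-case segments. A quick sketch with a made-up input string:

```php
<?php
// Hypothetical input with upper-case letters and an underscore.
$string = '/Abc_1/123?Edit';

// Each match must be preceded by / or ?, which the lookbehind checks
// without including that character in the match.
preg_match_all('#(?<=[/?])\w+#', $string, $matches);
print_r($matches[0]);
```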
