How to work around PHP lookbehind fixed width limitation? - php

I ran into a problem when trying to match all numbers found between specific words on my page. How would you match all the numbers in the following text, but only those between the words "begin" and "end"?
11
a
b
13
begin
t
899
y
50
f
end
91
h
This works:
preg_match("/begin(.*?)end/s", $text, $out);
preg_match_all("/[0-9]{1,}/", $out[1], $result);
But can it be done in one expression?
I tried this, but it doesn't do the trick:
preg_match_all("/begin.*([0-9]{1,}).*end/s", $text, $out);

You can make use of the \G anchor like this, and some lookaheads to make sure that you're not going 'out of territory' (out of the area between the two words):
(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)
(?: # Begin of first non-capture group
begin # Match 'begin'
| # Or
(?!^)\G # Start the match from the previous end of match
) # End of first non-capture group
(?: # Second non-capture group
(?= # Positive lookahead
(?:(?!begin).)* # Negative lookahead to prevent running into another 'begin'
end # And make sure that there's an 'end' ahead
) # End positive lookahead
\D # Match non-digits
)*? # Second non-capture group repeated many times, lazily
(\d+) # Capture digits
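As a quick check, the pattern can be run against the sample text from the question with preg_match_all. This is a minimal sketch: the variable names and the ~ delimiters are illustrative choices, and the s modifier is assumed so that the dot inside the lookahead can cross newlines.

```php
<?php
// Sample text from the question: numbers appear before, inside and after the region.
$text = "11\na\nb\n13\nbegin\nt\n899\ny\n50\nf\nend\n91\nh";

// The s modifier lets "." match newlines inside the lookahead.
$pattern = '~(?:begin|(?!^)\G)(?:(?=(?:(?!begin).)*end)\D)*?(\d+)~s';

preg_match_all($pattern, $text, $out);

// Group 1 holds the digits found between "begin" and "end".
$numbers = $out[1];
print_r($numbers);
```

Only 899 and 50 end up in $numbers; 11, 13 and 91 lie outside the region and are skipped.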

Ideal solution
What is really needed here is a positive lookbehind with variable width. The regex would end up like this:
~(?<=begin.*)\d+(?=.*end)~s
However, as of this writing, the PHP regex flavor doesn't support this feature; only fixed-width lookbehind is supported (the .NET flavor does support it, though).
Workaround
To achieve our goal, we can use preg_replace_callback with the following regex:
~(?<token>begin|end)|(?<number>\d+)|.*?~s
Sample code
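function extract_number($input) {
    $in_region = false;
    // A closure (instead of a nested named function) lets extract_number() be called repeatedly.
    $matchNumbers = function ($match) use (&$in_region) {
        // Unmatched trailing groups are absent from $match, hence the ?? fallbacks.
        switch ($match['token'] ?? '') {
            case 'begin':
                $in_region = true;
                break;
            case 'end':
                $in_region = false;
                break;
        }
        if ($in_region && ($match['number'] ?? '') !== '') {
            return $match['number'].',';
        }
        return '';
    };
    $ret = preg_replace_callback('~(?<token>begin|end)|(?<number>\d+)|.*?~s', $matchNumbers, $input);
    // 'strlen' drops only empty strings, so a number "0" is kept.
    return array_filter(explode(',', $ret), 'strlen');
}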
echo '<pre>';
var_dump(extract_number($str)); // $str holds the OP's sample text
echo '</pre>';
Output (with OP's example)
array(2) {
[0]=>
string(3) "899"
[1]=>
string(2) "50"
}

Assuming your project data only has one begin and end "marker" in the text, you can build a more direct and efficient pattern...
Code: (PHP Demo) (Pattern Demo)
$text = "11
a
b
13
begin
t
899
y
50
f
end
91
h";
var_export(preg_match_all('~(?:begin|\G(?!^))(?:(?!end)\D)+\K\d+~s', $text, $out) ? $out[0] : 'no matches');
Output:
array (
0 => '899',
1 => '50',
)
Layman's Breakdown:
(?:begin|\G(?!^)) #match "begin" or continue matching from the position immediately after previous match
(?:(?!end)\D)+ #match one or more occurrences of any non-digit character while screening for "end". If "end" is found, the match attempt fails at that point.
\K #restart the fullstring match from this position; this avoids the expense of using a capture group on the desired digits
\d+ #match one or more digits (as much as possible)
See the Pattern Demo link for a more academic breakdown of the pattern.

Related

Get the last letter and everything after it in PHP

I have a few hundred thousand strings that are laid out like the following
AX23784268B2
LJ93842938A1
MN39423287S
IY289383N2
With PHP I'm racking my brain how to return B2, A1, S, and N2.
Tried all sorts of substr, strstr, strlen manipulation and am coming up short.
substr('MN39423287S', -2); // returns 7S, not S
This is a simpler regexp than the other answer:
preg_match('/[A-Z][^A-Z]*$/', $token, $matches);
echo $matches[0];
[A-Z] matches an uppercase letter, [^A-Z] matches anything that is not an uppercase letter. * makes the preceding pattern match any number of times (including 0), and $ matches the end of the string.
So this matches a letter followed by any number of non-letters at the end of the string.
$matches[0] contains the portion of the string that the entire regexp matched.
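Applied to the four sample strings from the question, this might look as follows. It is a small sketch; the $tokens and $suffixes names are illustrative, and it assumes the strings contain only uppercase letters and digits, as in the examples.

```php
<?php
$tokens = ['AX23784268B2', 'LJ93842938A1', 'MN39423287S', 'IY289383N2'];

$suffixes = [];
foreach ($tokens as $token) {
    // Last uppercase letter plus whatever non-letters follow it up to the end.
    if (preg_match('/[A-Z][^A-Z]*$/', $token, $matches)) {
        $suffixes[] = $matches[0];
    }
}

print_r($suffixes); // B2, A1, S, N2
```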
There are many ways to do this.
One example would be a regex:
<?php
$regex = "/.+([A-Z].?+)$/";
$tokens = [
'AX23784268B2',
'LJ93842938A1',
'MN39423287S',
'IY289383N2',
];
foreach($tokens as $token)
{
preg_match($regex, $token, $matches);
var_dump($matches[1]);
// B2, A1, S, N2
}
How the regex works:
.+ - one or more of any character except newline (greedy, so the group lands on the last letter)
( - create a group
[A-Z] - match any A-Z character
.?+ - optionally match one more character after it (possessive), if any
) - end group
$ - match the end of the string

PHP - preg_replace_callback for camelCasing

I have the following content
"aa_bb" : "foo"
"pp_Qq" : "bar"
"Xx_yY_zz" : "foobar"
And I want to convert the content on the left side to camelCase
"aaBb" : "foo"
"ppQq" : "bar"
"xxYyZz" : "foobar"
And the code:
// selects the left part
$newString = preg_replace_callback("/\"(.*?)\"(.*?):/", function($matches) {
// selects the characters following underscores
$matches[1] = preg_replace_callback("/_(.?)/", function($matches) {
//removes the underscore and uppercases the character
return strtoupper($matches[1]);
}, $matches[1]);
// lowercases the first character before returning
return "\"".lcfirst($matches[1])."\" : ".$matches[2];
}, $string);
Can this code be simplified?
Note: The content will always be a single string.
First, since you already have working code that you want to improve, consider posting your question on Code Review instead of Stack Overflow next time.
Let's start by improving your original approach:
$result = preg_replace_callback('~"[^"]*"\s*:~', function ($m) {
return preg_replace_callback('~_+(.?)~', function ($n) {
return strtoupper($n[1]);
}, strtolower($m[0]));
}, $str);
pro: patterns are relatively simple and the idea is easy to understand.
cons: nested preg_replace_callback calls may hurt the eyes.
After this eye warm-up exercise, we can try a \G-based pattern approach:
$pattern = '~(?|\G(?!^)_([^_"]*)|("(?=[^"]*"\s*:)[^_"]*))~';
$result = preg_replace_callback($pattern, function ($m) {
return ucfirst(strtolower($m[1]));
}, $str);
pro: the code is shorter, with no need for two preg_replace_callback calls.
cons: the pattern is by far more complicated.
notice: when you write a long pattern, nothing prevents you from using free-spacing mode with the x modifier and adding comments:
$pattern = '~
(?| # branch reset group: capture groups in each branch share the same number
\G # contiguous to the last successful match
(?!^) # but not at the start of the string
_
( [^_"]* ) # capture group 1
|
( # capture group 1
"
(?=[^"]*"\s*:) # lookahead to check if it is the "key part"
[^_"]*
)
)
~x';
Are there compromises between these two extremes, and which one is good? Two suggestions:
$result = preg_replace_callback('~"[^"]+"\s*:~', function ($m) {
return array_reduce(explode('_', strtolower($m[0])), function ($c, $i) {
return $c . ucfirst($i);
});
}, $str);
pro: minimal use of regex.
cons: needs two callback functions, although this time the second one is called by array_reduce and not by preg_replace_callback.
$result = preg_replace_callback('~["_][^"_]*(?=[^"]*"\s*:)~', function ($m) {
return ucfirst(strtolower(ltrim($m[0], '_')));
}, $str);
pro: the pattern is relatively simple and the callback function stays simple too. It looks like a good compromise.
cons: the pattern isn't very restrictive (but should suffice for your use case)
pattern description: the pattern looks for a _ or a " and matches following characters that aren't a _ or a ". A lookahead assertion then checks that these characters are inside the key part looking for a closing quote and colon. The match result is always like _aBc or "aBc (underscores are trimmed on the left in the callback function and " stays the same after applying ucfirst).
pattern details:
["_] # one " or _
[^"_]* # zero or more characters that aren't " or _
(?= # open a lookahead assertion (followed with)
[^"]* # all that isn't a "
" # a literal "
\s* # optional whitespace
: # a literal :
) # close the lookahead assertion
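To see the compromise pattern at work, it can be run against the three sample lines from the question. This is only a sketch: $str is built here with implode for readability, which is an assumption about how the input is assembled.

```php
<?php
// The three sample lines from the question, joined into one string.
$str = implode("\n", [
    '"aa_bb" : "foo"',
    '"pp_Qq" : "bar"',
    '"Xx_yY_zz" : "foobar"',
]);

// Each match is either "abc (opening quote plus first word) or _abc (a later word).
$result = preg_replace_callback('~["_][^"_]*(?=[^"]*"\s*:)~', function ($m) {
    // Drop the underscore, lowercase the word, then uppercase its first character.
    // For "abc the first character is the quote, so ucfirst leaves it lowercase.
    return ucfirst(strtolower(ltrim($m[0], '_')));
}, $str);

echo $result, "\n";
```

The values on the right-hand side are left alone because no closing quote followed by a colon comes after them.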
There's no single good answer; what looks simple or complicated really depends on the reader.
You might make use of preg_replace_callback in combination with the \G anchor and capturing groups.
(?:"\K([^_\r\n]+)|\G(?!^))(?=[^":\r\n]*")(?=[^:\r\n]*:)_?([a-zA-Z])([^"_\r\n]*)
In parts
(?: Non capturing group
"\K([^_\r\n]+) Match ", capture group 1 match 1+ times any char except _ or newline
| Or
\G(?!^) Assert position at the previous match, not at the start
) Close group
(?=[^":\r\n]*") Positive lookahead, assert "
(?=[^:\r\n]*:) Positive lookahead, assert :
_? Match optional _
([a-zA-Z]) Capture group 2 match a-zA-Z
([^"_\r\n]*) Capture group 3 match 0+ times any char except ", _ or newline
In the replacement concatenate a combination of strtolower and strtoupper using the 3 capturing groups.
For example
$re = '/(?:"\K([^_\r\n]+)|\G(?!^))(?=[^":\r\n]*")(?=[^:\r\n]*:)_?([a-zA-Z])([^"_\r\n]*)/';
$str = '"aa_bb" : "foo"
"pp_Qq" : "bar"
"Xx_yY_zz" : "foobar"
"Xx_yYyyyyyYyY_zz_a" : "foobar"';
$result = preg_replace_callback($re, function($matches) {
return strtolower($matches[1]) . strtoupper($matches[2]) . strtolower($matches[3]);
}, $str);
echo $result;
Output
"aaBb" : "foo"
"ppQq" : "bar"
"xxYyZz" : "foobar"
"xxYyyyyyyyyyZzA" : "foobar"

preg_replace how to remove all numbers except alphanumeric

How can I remove all numbers except those that are part of an alphanumeric run? For example, if I have a string like this:
Abs_1234abcd_636950806858590746.lands
I want it to become this:
Abs_1234abcd_.lands
It is probably done like this
Find (?i)(?<![a-z\d])\d+(?![a-z\d])
Replace with nothing.
Explained:
Note that the class [a-z\d] inside the assertions includes digits.
Without \d there, part of the digit run in "abc901234def" could match, because the match could start or stop in the middle of the digits.
(?i) # Case insensitive
(?<! [a-z\d] ) # Behind, not a letter nor digit
\d+ # Many digits
(?! [a-z\d] ) # Ahead, not a letter nor digit
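In PHP this pattern could be applied with preg_replace as sketched below. The i modifier stands in for the inline (?i), and the variable names are illustrative.

```php
<?php
// Digits that are neither preceded nor followed by a letter or digit.
$pattern = '/(?<![a-z\d])\d+(?![a-z\d])/i';

// The standalone number is removed; "1234abcd" is kept because a letter follows the digits.
$cleaned = preg_replace($pattern, '', 'Abs_1234abcd_636950806858590746.lands');

// Digits embedded between letters are left untouched.
$untouched = preg_replace($pattern, '', 'abc901234def');

echo $cleaned, "\n", $untouched, "\n";
```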
Note - a speedier version exists (?i)\d(?<!\d[a-z\d])\d*(?![a-z\d])
Regex1: (?i)\d(?<!\d[a-z\d])\d*(?![a-z\d])
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 0.53 s, 530.56 ms, 530564 µs
Matches per sec: 188,478
Regex2: (?i)(?<![a-z\d])\d+(?![a-z\d])
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 0.91 s, 909.58 ms, 909577 µs
Matches per sec: 109,941
In this specific example, we can simply use _ as a left boundary and . as the right boundary, collect our digits, and replace:
Test
$re = '/(.+[_])[0-9]+(\..+)/m';
$str = 'Abs_1234abcd_636950806858590746.lands';
$subst = '$1$2';
$result = preg_replace($re, $subst, $str);
echo $result;
For your example data, you could also match a non-word character or an underscore using the character class [\W_]. Then forget what was matched using \K.
Match 1+ digits that you want to replace with an empty string and assert that what is on the right is again a non-word character or an underscore.
[\W_]\K\d+(?=[\W_])
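A short sketch of this pattern with preg_replace follows; the variable names are illustrative. Because the pattern requires a boundary character on both sides, digits at the very start or end of the string are not removed, which the second call shows.

```php
<?php
// A boundary character, then \K resets the match start so only the digits are replaced.
$pattern = '/[\W_]\K\d+(?=[\W_])/';

$cleaned = preg_replace($pattern, '', 'Abs_1234abcd_636950806858590746.lands');

// No boundary character precedes "123", so nothing is removed here.
$edgeCase = preg_replace($pattern, '', '123_abc');

echo $cleaned, "\n", $edgeCase, "\n";
```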

php preg_replace_callback blockquote regex

I am trying to create a regex that handles the following.
Input
> quote
the rest of it
> another paragraph
the rest of it
And OUTPUT
quote
the rest of it
another paragraph
the rest of it
with a resulting HTML of
<blockquote>
<p>quote
the rest of it</p>
<p>another paragraph
the rest of it</p>
</blockquote>
This is what I have below
$text = preg_replace_callback('/^>(.*)(...)$/m',function($matches){
return '<blockquote>'.$matches[1].'</blockquote>';
},$text);
Any help or suggestion would be appreciated
Here is a possible solution for the given example.
$text = "> quote
the rest of it
> another paragraph
the rest of it";
preg_match_all('/^>([\w\s]+)/m', $text, $matches);
$out = $text ;
if (!empty($matches[1])) { // $matches itself is never empty, so check the captured group
$out = '<blockquote>';
foreach ($matches[1] as $match) {
$out .= '<p>'.trim($match).'</p>';
}
$out .= '</blockquote>';
}
echo $out ;
Outputs :
<blockquote><p>quote
the rest of it</p><p>another paragraph
the rest of it</p></blockquote>
Try this regex:
(?s)>((?!(\r?\n){2}).)*+
meaning:
(?s) # enable dot-all option
> # match the character '>'
( # start capture group 1
(?! # start negative look ahead
( # start capture group 2
\r? # match the character '\r' and match it once or none at all
\n # match the character '\n'
){2} # end capture group 2 and repeat it exactly 2 times
) # end negative look ahead
. # match any character
)*+ # end capture group 1 and repeat it zero or more times, possessively
The \r?\n matches Windows, *nix and (newer) macOS line breaks. If you need to account for really old Mac computers, add the single \r to it: \r?\n|\r
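One way to turn those matches into the requested HTML is sketched below. The function name blockquoteToHtml is hypothetical, and the sketch assumes that quoted paragraphs are separated by a blank line, which is what the pattern relies on. Note that the repeated capture group only keeps its last character, so the full match ($matches[0]) is used instead.

```php
<?php
// Hypothetical helper: wraps each ">"-prefixed paragraph in <p> inside one <blockquote>.
function blockquoteToHtml(string $text): string
{
    // Pattern from the answer: ">" followed by everything up to the next blank line.
    if (!preg_match_all('/(?s)>((?!(\r?\n){2}).)*+/', $text, $matches)) {
        return $text; // nothing quoted, leave the input alone
    }

    $paragraphs = array_map(function ($quoted) {
        // Drop the leading ">" and the surrounding whitespace.
        return '<p>' . trim(substr($quoted, 1)) . '</p>';
    }, $matches[0]);

    return "<blockquote>\n" . implode("\n", $paragraphs) . "\n</blockquote>";
}

// Assumed input: the two quoted paragraphs separated by a blank line.
$text = "> quote\nthe rest of it\n\n> another paragraph\nthe rest of it";
$html = blockquoteToHtml($text);
echo $html, "\n";
```

Since the pattern is not anchored to the start of a line, a ">" inside ordinary text would also start a match; anchoring with ^ and the m modifier may be worth considering for real input.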
question: https://stackoverflow.com/a/2222331/9238511

php regular expression minimum and maximum length doesn't work as expected

I want to create a regular expression in PHP that will allow the user to enter a phone number in any of the formats below.
345-234 898
345 234-898
235-123-456
548 812 346
The minimum length of number should be 7 and maximum length should be 12.
The problem is that the regular expression doesn't care about the minimum and maximum length, and I don't know what is wrong with it. Please help me to solve it. Here is the regular expression.
if (preg_match("/^([0-9]+((\s?|-?)[0-9]+)*){7,12}$/", $string)) {
echo "ok";
} else {
echo "not ok";
}
Thanks for reading my question. I will wait for responses.
You should use the start (^) and the end ($) sign on your pattern
$subject = "123456789";
$pattern = '/^[0-9]{7,9}$/i';
if(preg_match($pattern, $subject)){
echo 'matched';
}else{
echo 'not matched';
}
You can use preg_replace to strip out non-digit symbols and check length of resulting string.
$onlyDigits = preg_replace('/\\D/', '', $string);
$length = strlen($onlyDigits);
if ($length < 7 OR $length > 12)
echo "not ok";
else
echo "ok";
Simply do this:
if (preg_match("/^\d{3}[ -]\d{3}[ -]\d{3}$/", $string)) {
Here \d means any digits from 0-9. Also [ -] means either a space or a hyphen
You can check the length with a lookahead assertion (?=...) at the beginning of the pattern:
/^(?=.{7,12}$)[0-9]+(?:[\s-]?[0-9]+)*$/
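A small sketch of how this pattern behaves on a few inputs; the helper name isValidPhone is hypothetical. The lookahead limits the total length (separators included) to 7 to 12 characters, while the rest of the pattern checks the digit and separator structure.

```php
<?php
// Hypothetical helper wrapping the pattern above.
function isValidPhone(string $input): bool
{
    return preg_match('/^(?=.{7,12}$)[0-9]+(?:[\s-]?[0-9]+)*$/', $input) === 1;
}

var_dump(isValidPhone('345-234 898'));   // 11 characters, valid structure
var_dump(isValidPhone('123456'));        // too short (6 characters)
var_dump(isValidPhone('1234567890123')); // too long (13 characters)
var_dump(isValidPhone('345--234'));      // length fits, but two separators in a row
```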
Breaking down your original regex, it can read like the following:
^ # start of input
(
[0-9]+ # any number, 1 or more times
(
(\s?|-?) # a space, or a dash.. maybe
[0-9]+ # any number, 1 or more times
)* # repeat group 0 or more times
)
{7,12} # repeat full group 7 to 12 times
$ # end of input
So, basically, you're allowing "any number, 1 or more times" followed by a group of "any number 1 or more times, 0 or more times" repeat "7 to 12 times" - which kind of kills your length check.
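The effect can be checked directly. In the sketch below (variable names are illustrative), {7,12} counts repetitions of the outer group, and each repetition consumes at least one digit but may consume many. As a result, at least seven digits are required, but there is no upper limit.

```php
<?php
// The original pattern from the question.
$original = '/^([0-9]+((\s?|-?)[0-9]+)*){7,12}$/';

// 20 digits: the group can repeat 7 times with one repetition swallowing the extra digits.
$twentyDigits = preg_match($original, '12345678901234567890');

// 6 digits: not enough characters for 7 repetitions of the group.
$sixDigits = preg_match($original, '123456');

var_dump($twentyDigits, $sixDigits);
```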
You could take a more restricted approach and write out each individual number block:
(
\d{3} # any 3 numbers
(?:[ ]+|-)? # any (optional) spaces or a hyphen
\d{3} # any 3 numbers
(?:[ ]+|-)? # any (optional) spaces or a hyphen
\d{3} # any 3 numbers
)
Simplified:
if (preg_match('/^(\d{3}(?:[ ]+|-)?\d{3}(?:[ ]+|-)?\d{3})$/', $string)) {
If you want to restrict the separators to be only a single space or a hyphen, you can update the regex to use [ -] instead of (?:[ ]+|-); if you want this to be "optional" (i.e. there can be no separator between number groups), add in a ? to the end of each.
if (preg_match('/^(\d{3}[ -]\d{3}[ -]\d{3})$/', $string)) {
This may help you out:
Validator::extend('price', function ($attribute, $value, $args) {
return preg_match('/^\d{0,8}(\.\d{1,2})?$/', $value);
});
