Irregular RegEx behavior - php

I have a string:
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";
And I have a regular expression:
$data = preg_split("/(?=[A-Z]{1,4}[\s]+[A-Z]{3}[\-][A-Z]{3}[\s]+)/", $day);
The expected $data-Array should be:
array
0 => string '11.08.2012 ' (length=11)
1 => string 'PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=62)
2 => string 'Y AMS-AMS 13:15-19:15' (length=21)
But my result is:
0 => string '11.08.2012 ' (length=11)
1 => string 'P' (length=1)
2 => string 'R' (length=1)
3 => string 'O' (length=1)
4 => string 'C BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) ' (length=59)
5 => string 'Y AMS-AMS 13:15-19:15' (length=21)
I cannot retrace what's happening here. Could someone please explain?

In short, the problem is that the (?=...) subexpression in your pattern matches a position, not characters. I understand that was exactly your intention; the catch is that the next match attempt does not start where the text described inside (?=...) ends, but one symbol after the position the lookahead matched.
Let's go through this process in detail. The first time the split is attempted, the engine walks the string until it gets to the position marked by the asterisk:
11.08.2012 *PROC BRE-AMS 08:00-12:00
... where it can match the pattern given. For the next attempt, the starting position 'bumps along' one symbol, so now we're here:
11.08.2012 P*ROC BRE-AMS 08:00-12:00
... and voila, we again can match this pattern, because of that {1,4} quantifier! That's how you got these 'irregular' P, R and O symbols.
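To make the 'bump along' visible, here is a small sketch that lists every position where the lookahead succeeds. It uses preg_match_all with PREG_OFFSET_CAPTURE; the variable names ($offsets, $hitChars) are mine, not from the question.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";
$pattern = '/(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/';

// Collect the offset of every zero-width match the lookahead produces.
preg_match_all($pattern, $day, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

// The character sitting right after each split position.
$hitChars = [];
foreach ($offsets as $offset) {
    $hitChars[] = $day[$offset];
    echo $offset, ' => ', $day[$offset], PHP_EOL;
}
```

Each letter of PROC shows up as its own split position, followed by the Y of the second block.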
That's it for the explanation; now for the "how to fix" part. The easiest way out of this, I suppose, is adding this little twist to your split pattern:
$data = preg_split('/\b(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
We still match a position, but now it has to be one that separates a 'word' symbol from a non-word one. The same idea can be expressed with a negative lookbehind:
$data = preg_split('/(?<![A-Z])(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/', $day);
... which is actually more precise, but less elegant, I suppose.
Two side notes here: 1) don't use character class syntax when you need to specify a single symbol (either a plain one, like -, or a 'shortcut' one, like \s); 2) use single quotation marks to delimit your pattern unless you want to interpolate some variables in it.
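As a quick check, this sketch runs the original pattern and both fixed variants on the string from the question and compares the results. The variable names are only for illustration.

```php
<?php
$day = "11.08.2012 PROC BRE-AMS 08:00-12:00 ( MIETWAGEN MIT BAK RES 6049687886 ) Y AMS-AMS 13:15-19:15";
$lookahead = '(?=[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)';

// Original pattern: splits before every letter of PROC.
$original = preg_split('/' . $lookahead . '/', $day);

// Fix 1: only split at a word boundary.
$boundary = preg_split('/\b' . $lookahead . '/', $day);

// Fix 2: only split where no capital letter precedes the position.
$lookbehind = preg_split('/(?<![A-Z])' . $lookahead . '/', $day);

printf("original: %d parts, \\b: %d parts, lookbehind: %d parts\n",
    count($original), count($boundary), count($lookbehind));
print_r($boundary);
```

For this input both fixes give the same three parts; they could differ on other inputs, for example when a digit directly precedes the code, since \b treats digits as word characters.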

A hyphen is a metacharacter inside a character class. If you want a literal hyphen there, backslash-escape it or put it first or last in the class (in this specific case it works anyway, since your character class contains nothing but a hyphen).
If you need to include the split string, anchor the start of the lookahead to a word boundary, so that only the first letter of the first 1-4 character sequence is tested:
/(?=\b[A-Z]{1,4}\s+[A-Z]{3}-[A-Z]{3}\s+)/

Related

preg_match $matches does not return special characters

I have a placeholder in my content which has the following format
{{label1#label2_label3}}
And I'm correctly matching it with this regex
preg_match('/\{\{(\w+|\d+|_+|#+)*\}\}/i', $content, $matches);
The problem is that the $matches array which PHP returns has the following data (preg_match docs)
array (size=2)
0 => string '{{label1#label2_label3}}' (length=24)
1 => string 'label2_label3' (length=13)
While my expected output is the following
array (size=2)
0 => string '{{label1#label2_label3}}' (length=24)
1 => string 'label1#label2_label3' (length=20)
My solution was to use a replace to simply get rid of the braces, like so:
$matches[1] = str_replace("}", "", (str_replace("{","",$matches[0])));
which works but I'm concerned about the performance while rendering a page with many placeholders.
Is there any flag or function I'm missing to just tell PHP to return the entire string inside {{ }} in $matches[1]?
Since \w already matches digits (\d) and the underscore, the alternation only needs \w and #.
label1# is missing from the result because you repeat a capture group: a repeated group keeps only the value of its last iteration.
Since you want label1#label2_label3, use a single character class that matches word characters and the # character, and put the quantifier inside a capture group that is not itself repeated:
{{([\w#]+)}}
$content = "{{label1#label2_label3}}";
preg_match('/{{([\w#]+)}}/i', $content, $matches);
print_r($matches);
Output
Array
(
[0] => {{label1#label2_label3}}
[1] => label1#label2_label3
)
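To see the "last iteration wins" behaviour side by side with the fix, here is a small sketch. The repeated pattern is a trimmed-down version of the one in the question (only the \w+ and #+ alternatives), which is my simplification.

```php
<?php
$content = '{{label1#label2_label3}}';

// Repeated capture group: every iteration overwrites group 1,
// so only the last piece survives.
preg_match('/\{\{(\w+|#+)*\}\}/', $content, $repeated);

// One group around a repeated character class: group 1 spans everything.
preg_match('/\{\{([\w#]+)\}\}/', $content, $single);

echo $repeated[1], PHP_EOL;
echo $single[1], PHP_EOL;
```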
If # and _ cannot appear at the start or at the end:
{{([^\W_]+(?:[_#][^\W_]+)*)}}
The pattern in parts:
{{ Match literally
( Capture group 1
[^\W_]+ Match 1+ word characters without _
(?:[_#][^\W_]+)* Optionally repeat matching either _ or # and 1+ word characters without _
) Close group 1
}} Match literally
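A short sketch of how the stricter pattern behaves; the two rejected sample strings are made up for illustration, and the braces are escaped here only to make their literal meaning explicit.

```php
<?php
$pattern = '/\{\{([^\W_]+(?:[_#][^\W_]+)*)\}\}/';

// Hypothetical samples: one valid placeholder, two with a separator at an edge.
$samples = ['{{label1#label2_label3}}', '{{#label1}}', '{{label1_}}'];

$captured = [];
foreach ($samples as $sample) {
    // Store group 1 on a match, null otherwise.
    $captured[] = preg_match($pattern, $sample, $m) ? $m[1] : null;
}

var_dump($captured);
```

Only the first sample yields a capture; a leading # or a trailing _ makes the whole placeholder fail to match.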

Make a negative lookahead case-insensitive

I have the following expression:
$exp = "/^(?!.*?that).*$/";
which is meant to match any line that does not contain "that".
I have the following three sentences:
$str = array(
"I like this sentence.", #line1
"I like that sentence.", #line2
"I link THAT sentence." #line3
);
The match is case-sensitive and therefore only lines 1 and 3 are matched. So far so good.
However, I would like to make it case-insensitive, so that it only matches line 1. I have tried with an inline modifier, i.e. "(?i: ... )":
$exp = "/^(?!.*?(?i:that)).*$/";
and as a flag, i.e. "/ ... /i":
$exp = "/^(?!.*?that).*$/i";
but to no avail.
I run the search with the following loop:
foreach($str as $s) {
preg_match_all($exp, $s, $matches);
var_dump($matches);
}
with output:
array (size=1)
0 =>
array (size=1)
0 => string 'I like this sentence.' (length=21)
array (size=1)
0 =>
array (size=0)
empty
array (size=1)
0 =>
array (size=1)
0 => string 'I link THAT sentence.' (length=21)
and an online demo is available here: https://regex101.com/r/bs9rzF/1
I would be grateful for any tips about how I can make my regular expression case-insensitive.
EDIT: I was incorrectly using "?-i" instead of "?i", as some contributors correctly point out. Fixed now.
Your first regex, ^(?!.*?that).*$, is case-sensitive because it uses no case-insensitivity modifier.
It matches the first and third sentences because it requires that the sentence not contain the word that (case-sensitive), which is true for both of them: the third sentence has THAT, which is not the same as that.
To match only the first sentence, you can use the inline modifier (?i) like
(?i)^(?!.*?that).*$
BTW, your /^(?!.*?that).*$/i regex is also correct.
You were close:
^(?!.*?(?i)that).*$
See a demo on regex101.com. In your expression ((?-i)) you were turning the modifier off.
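For what it's worth, here is a small sketch comparing the variants on the three sentences from the question. It uses preg_grep instead of the loop from the question, which is my choice for brevity; the array keys ('flag', 'inline', 'sensitive') are just labels.

```php
<?php
$lines = [
    "I like this sentence.",
    "I like that sentence.",
    "I link THAT sentence.",
];

$patterns = [
    'flag'      => '/^(?!.*?that).*$/i',     // whole pattern case-insensitive
    'inline'    => '/^(?!.*?(?i)that).*$/',  // only "that" case-insensitive
    'sensitive' => '/^(?!.*?that).*$/',      // original, case-sensitive
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // preg_grep keeps the entries that match, preserving their keys.
    $results[$name] = preg_grep($pattern, $lines);
}

print_r($results);
```

Both case-insensitive variants keep only the first sentence, while the case-sensitive one keeps the first and the third.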

PHP find multiple currency numbers in string

I'm writing php script which will recognise bank payment reports.
For example, I have this code:
$str = "Customer Name /First Polises number - SAT431223 (5.20 eur), BOS32342 (33,85 euro), (32,10 eiro), (78.66 €), €1232,2, (11.45)";
And I need to find all these currency amounts in the string, so the output would look like this:
5.20
33.85
32.10
78.66
1232.20
11.45
How can I do that? I know the function preg_match(), but I don't understand how to write pattern for that case.
preg_match will give you only the first match found. But you can use preg_match_all to get an array of all matches.
Here's everything you need to know about how to build regex patterns:
http://php.net/manual/en/reference.pcre.pattern.syntax.php
You need a pattern like this: /[0-9]+[,.]{1}[0-9]{2}/
/ - delimiter; it can be another character, but you need it at the beginning and end of the pattern.
[0-9] - matches digits
+, {1} and {2} - they define the number of characters. + is "one or more"; a number in {} is an exact number of characters.
[,.]{1} - this matches exactly one ({1}) character from the set , and .
Example code:
$matches = array();
preg_match_all('/[0-9]+[,.]{1}[0-9]{2}/', $str, $matches);
var_dump($matches);
Result:
array (size=1)
0 =>
array (size=5)
0 => string '5.20' (length=4)
1 => string '33,85' (length=5)
2 => string '32,10' (length=5)
3 => string '78.66' (length=5)
4 => string '11.45' (length=5)
I would do this with:
/([0-9]+[,.][0-9]+)/g
Matching:
Digits (one or more times)
Dot or comma
Digits (one or more times)
Note the g: global, to get all matches. PHP has no g modifier; there, preg_match_all does the same job.
Example and more detailed break-down of the regex: https://regex101.com/r/eH6aX6/1
That will match any decimal values in the provided sentence, which are not necessarily currency amounts...
Hope it points you in the right direction.
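Neither snippet above produces exactly the list the question asks for (dots instead of commas, and 1232,2 shown as 1232.20). A minimal sketch of one way to get there, assuming every decimal number in the string is an amount and that a comma is always a decimal separator, never a thousands separator:

```php
<?php
$str = "Customer Name /First Polises number - SAT431223 (5.20 eur), BOS32342 (33,85 euro), (32,10 eiro), (78.66 €), €1232,2, (11.45)";

// Digits, a dot or comma, then one or more digits.
preg_match_all('/[0-9]+[,.][0-9]+/', $str, $matches);

// Normalise each match: comma to dot, then always two decimals.
$amounts = array_map(
    function ($raw) {
        return number_format((float) str_replace(',', '.', $raw), 2, '.', '');
    },
    $matches[0]
);

print_r($amounts);
```

Going through float is fine for display here; for real bookkeeping you would probably want to keep the amounts as strings or integer cents.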

Split String With preg_match

I have a string:
$productList="
Saluran Dua(Bothway)-(TAN007);
Speedy Password-(INET PASS);
Memo-(T-Memo);
7-pib r-10/10-(AM);
FBI (R/N/M)-(Rr/R(A));
";
I want the result to look like this:
Array(
[0]=>TAN007
[1]=>INET PASS
[2]=>T-Memo
[3]=>AM
[4]=>Rr/R(A)
);
I used :
$separator = '/\-\(([A-z ]*)\)/';
preg_match_all($separator, $productList, $match);
$value=$match[1];
but the result:
Array(
[0]=>INET PASS
[1]=>AM
);
There must be something wrong with my code. Can anybody help with this?
Your regex does not include all the characters that can appear in the piece of text you want to capture.
The correct regex is:
$match = array();
preg_match_all('/-\((.*)\);/', $productList, $match);
Explanation (from the inside to outside):
.* matches anything;
(.*) is the expression above put into parenthesis to capture the match in $match[1];
-\((.*)\); is the above in context: it matches if it is preceded by -( and followed by );; the parentheses are escaped to use their literal values and not their special regex interpretation;
there is no need to escape - in a regex; it has a special interpretation only when it is used inside character ranges ([A-Z], for example), but even there, if the dash character (-) is right after the [ or right before the ] then it has no special meaning; e.g. [-A-Z] means: dash (-) or any capital letter (A to Z).
Now, print_r($match[1]); looks like this:
Array
(
[0] => TAN007
[1] => INET PASS
[2] => T-Memo
[3] => AM
[4] => Rr/R(A)
)
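One caveat, as far as I can tell: the greedy .* works here because each entry sits on its own line and . does not match a newline. If the entries ever arrived on a single line (an assumption, that is not the case in the question), a lazy .*? would be the safer choice. A small sketch with a made-up one-line input:

```php
<?php
// Hypothetical variant of the input with all entries on one line.
$oneLine = 'Memo-(T-Memo);7-pib r-10/10-(AM);FBI (R/N/M)-(Rr/R(A));';

// Greedy: .* runs to the last ");" on the line, so everything is one match.
preg_match_all('/-\((.*)\);/', $oneLine, $greedy);

// Lazy: .*? stops at the first ");" that lets the match succeed.
preg_match_all('/-\((.*?)\);/', $oneLine, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```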
For the 1st line you need 0-9,
for the 3rd line you need a -, and
for the last line you need ( and ).
Try this:
#\-\(([a-zA-Z/0-9(\)\- ]*)\)#
Try this regex:
$separator = '#\-\(([A-Za-z0-9/\-\(\) ]*)\)#';

Split string with regular expressions

I have this string:
EXAMPLE|abcd|[!PAGE|title]
I want to split it like this:
Array
(
[0] => EXAMPLE
[1] => abcd
[2] => [!PAGE|title]
)
How to do it?
Thank you.
If you don't need anything more than what you described, this is like parsing a CSV with | as the separator and [ ] as the quote characters, so (\[.*?\]+|[^\|]+)(?=\||$) should do the job, I think.
EDIT: Changed the regex; it now accepts strings like [asdf]].[]asf]
Explanation:
(\[.*?\]+|[^\|]+) -> This part has two alternatives (it matches either 1.1 or 1.2):
1.1 \[.*?\]+ -> Matches everything between [ and ]
1.2 [^\|]+ -> Matches everything that is enclosed by |
(?=\||$) -> This requires the match to be followed by a | or by the end of the string, which is what lets the regex accept strings like the earlier example.
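A quick sketch of that pattern in PHP, run on the string from the question and on a second string built around the [asdf]].[]asf] example from the edit (the surrounding EXAMPLE| and |abcd are my additions):

```php
<?php
$pattern = '/(\[.*?\]+|[^\|]+)(?=\||$)/';

// The string from the question.
preg_match_all($pattern, 'EXAMPLE|abcd|[!PAGE|title]', $simple);

// A bracketed field containing extra brackets, as in the edit note.
preg_match_all($pattern, 'EXAMPLE|[asdf]].[]asf]|abcd', $nested);

print_r($simple[0]);
print_r($nested[0]);
```

The lookahead is what forces the bracketed alternative to keep extending until it is followed by a | or the end of the string.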
Given your example, you could use (\[.*?\]|[^|]+).
preg_match_all("#(\[.*?\]|[^|]+)#", "EXAMPLE|abcd|[!PAGE|title]", $matches);
print_r($matches[0]);
// output:
Array
(
[0] => EXAMPLE
[1] => abcd
[2] => [!PAGE|title]
)
Use this regex: (?<=\||^)(((\[.*\|?.*\])|(.+?)))(?=\||$)
(?<=\||^) Positive LookBehind
1st alternative: \|Literal `|`
2nd alternative: ^Start of string
1st Capturing group (((\[.*\|?.*\])|(.+?)))
2nd Capturing group ((\[.*\|?.*\])|(.+?))
1st alternative: (\[.*\|?.*\])
3rd Capturing group (\[.*\|?.*\])
\[ Literal `[`
.* Any character (except newline), 0 to infinite times
\|? Literal `|`, 0 to 1 times
.* Any character (except newline), 0 to infinite times
\] Literal `]`
2nd alternative: (.+?)
4th Capturing group (.+?)
.+? Any character (except newline), 1 to infinite times [lazy]
(?=\||$) Positive LookAhead
1st alternative: \|Literal `|`
2nd alternative: $End of string
g modifier: global. All matches (don't return on first match)
A Non-regex solution:
$str = str_replace('[', ']', "EXAMPLE|abcd|[!PAGE|title]");
$arr = str_getcsv($str, '|', ']');
If you expect things like "[[]]", you would have to escape the inner brackets with backslashes, in which case a regex might be the better option.
http://de2.php.net/manual/en/function.explode.php
$array= explode('|', $string);
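Keep in mind, though, that a plain explode also splits at the | inside the brackets, so on the string from the question it gives four pieces rather than the three asked for. A quick sketch:

```php
<?php
$string = 'EXAMPLE|abcd|[!PAGE|title]';

// explode knows nothing about the brackets and splits at every |.
$array = explode('|', $string);

print_r($array);
```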
