Regular expression to match | but not || - php

My goal is to split a string such as a|b||c|d into a, b||c and d.
I tried several methods, but each one still splits the string at the double pipe:
Lookbehind:
var_dump(preg_split("/\\|(?<!\\|\\|)/", 'a|b||c|d'));
array (size=4)
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string '|c' (length=2)
3 => string 'd' (length=1)
Lookahead:
var_dump(preg_split("/(?!\\|\\|)\\|/", 'a|b||c|d'));
array (size=4)
0 => string 'a' (length=1)
1 => string 'b|' (length=2)
2 => string 'c' (length=1)
3 => string 'd' (length=1)
How can I just ignore double pipes?

Split your input on the regex below, which uses a negative lookbehind and a negative lookahead.
(?<!\|)\|(?!\|)
| is a special metacharacter in regex that acts as the alternation (logical OR) operator. To match a literal |, escape it in your regex as \|
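A minimal sketch of that split in PHP, using the sample string from the question. The variable name $parts is only illustrative.

```php
<?php
// Split on a pipe that is neither preceded nor followed by another pipe.
// Single quotes keep the backslashes intact for PCRE.
$parts = preg_split('/(?<!\|)\|(?!\|)/', 'a|b||c|d');

var_dump($parts); // three elements: 'a', 'b||c', 'd'
```

Both lookarounds are needed: the lookbehind rejects the second pipe of a pair, and the lookahead rejects the first.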

You can use this regex for splitting:
(?<!\|)\|(?!\|)

Related

PHP preg_split() pattern

I need help finding a PCRE pattern using preg_split().
I'm using the regex pattern below to split a string based on its starting three-character code and semicolons. The pattern works fine in JavaScript, but now I need to use it in PHP. I tried preg_split() but I'm just getting back junk.
// Each group will begin with a three letter code, have three segments separated by a semi-colon. The string will not be terminated with a semi-colon.
// Pseudocode
string_to_split = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44"
// This works in JS
// https://regex101.com
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/gi";
Match 1
Full match 0-11 `AAA;RED;111`
Match 2
Full match 12-23 `BBB;BLUE;22`
Match 3
Full match 24-36 `CCC;GREEN;33`
Match 4
Full match 37-49 `DDD;WHITE;44`
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/";
$split = preg_split($pattern, $string_to_split);
returns
array(5)
0:""
1:";"
2:";"
3:";"
4:""
Based on the additional information you gave in comments on the answers, I have updated my answer to be specific to your source format.
You might want something like this:
$subject = "AAA;RED;111;AAA;Oh my dog;12.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';
preg_match_all($pattern, $subject,$matches);
var_dump($matches);
giving you
array (size=4)
0 =>
array (size=8)
0 => string 'AAA;RED;111' (length=11)
1 => string 'AAA;Oh my dog;12.34' (length=19)
2 => string 'AAA;Oh Long John;.4556' (length=22)
3 => string 'BBB;Oh Long Johnson;1.2323' (length=26)
4 => string 'BBB;Oh Don Piano;.33' (length=20)
5 => string 'CCC;Why I eyes ya;1.445' (length=23)
6 => string 'CCC;All the live long day;2.3343' (length=32)
7 => string 'DDD;Faith Hilling;.89' (length=21)
1 =>
array (size=8)
0 => string 'AAA' (length=3)
1 => string 'AAA' (length=3)
2 => string 'AAA' (length=3)
3 => string 'BBB' (length=3)
4 => string 'BBB' (length=3)
5 => string 'CCC' (length=3)
6 => string 'CCC' (length=3)
7 => string 'DDD' (length=3)
2 =>
array (size=8)
0 => string 'RED' (length=3)
1 => string 'Oh my dog' (length=9)
2 => string 'Oh Long John' (length=12)
3 => string 'Oh Long Johnson' (length=15)
4 => string 'Oh Don Piano' (length=12)
5 => string 'Why I eyes ya' (length=13)
6 => string 'All the live long day' (length=21)
7 => string 'Faith Hilling' (length=13)
3 =>
array (size=8)
0 => string '111' (length=3)
1 => string '12.34' (length=5)
2 => string '.4556' (length=5)
3 => string '1.2323' (length=6)
4 => string '.33' (length=3)
5 => string '1.445' (length=5)
6 => string '2.3343' (length=6)
7 => string '.89' (length=3)
The start marker should occur at the start of the string or immediately after a semicolon, so we use a lookbehind that looks for the start or a semicolon:
(?<=;|^)
We look for an alternative of AAA,BBB,CCC or DDD and capture it:
(AAA|BBB|CCC|DDD)
After a semicolon we look for any character except a semicolon. The quantifier * means 0 or more times. Use + if you want at least 1.
;([^;]*)
After the next semicolon we look for a number. This task has to be split into two parts to accept only valid formats. We first look for 0 or more digits followed by a dot, all of it optional:
(?:\d*\.)?
where (?:) means a non-capturing group.
After that we look for at least one digit: \d+
We want to capture both parts of the number, so we put parentheses around them after the semicolon:
;((?:\d*\.)?\d+)
This matches "1234", ".1234", "1.234", "12.34", "123.4" but not "1234." or "1.2.3"
Finally we want this to immediately occur before a semicolon or the end of string. Thus we do a lookahead:
(?=;|$)
Lookaheads and lookbehinds are zero-width: the text they check before or after the match is not part of the matched or captured result.
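To check the number part in isolation, here is a small sketch. It assumes the ^ and $ anchors are an acceptable stand-in for the lookarounds, and the $candidates list simply repeats the examples named above.

```php
<?php
// The number sub-pattern on its own, anchored to the whole string.
$numberPattern = '/^(?:\d*\.)?\d+$/';

$candidates = ['1234', '.1234', '1.234', '12.34', '123.4', '1234.', '1.2.3'];

// preg_grep keeps only the entries that match the pattern.
$accepted = array_values(preg_grep($numberPattern, $candidates));

print_r($accepted); // 1234, .1234, 1.234, 12.34, 123.4
```

"1234." is rejected because at least one digit must follow the dot, and "1.2.3" is rejected because only one dot is allowed.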
I've modified your pattern a little, and added a couple of flags to preg_split.
The PREG_SPLIT_NO_EMPTY flag will exclude empty matches from the result, and PREG_SPLIT_DELIM_CAPTURE will include the captured value in the result.
$split = preg_split('/([abcd]{3};[^;]+;\d+);?/i', $string, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
Alternatively, and more suitably, you can use preg_match_all with the following pattern.
preg_match_all('/([abcd]{3};[^;]+;\d+);?/i', $string, $matches);
print_r($matches[0]);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
You don't want to split your string, you want to match elements, so use preg_match_all:
$str = "AAA;RED;111;AAA;Oh my dog;2.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$res = preg_match_all('/(?:AAA|BBB|CCC|DDD);[^;]*;[^;]*;?/', $str, $m);
print_r($m[0]);
Output:
Array
(
[0] => AAA;RED;111;
[1] => AAA;Oh my dog;2.34;
[2] => AAA;Oh Long John;.4556;
[3] => BBB;Oh Long Johnson;1.2323;
[4] => BBB;Oh Don Piano;.33;
[5] => CCC;Why I eyes ya;1.445;
[6] => CCC;All the live long day;2.3343;
[7] => DDD;Faith Hilling;.89
)
Explanation:
/ : regex delimiter
(?:AAA|BBB|CCC|DDD) : non capture group AAA or BBB or CCC or DDD
; : a semicolon
[^;]* : 0 or more any character that is not a semicolon
; : a semicolon
[^;]* : 0 or more any character that is not a semicolon
;? : optional semicolon
/ : regex delimiter
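If you also need the three fields separately, one possible variation (not part of the pattern above) is to turn the groups into capturing groups and pass PREG_SET_ORDER. This is only a sketch: the keys 'code', 'name' and 'value' are made-up labels, and the input is the short string from the question.

```php
<?php
$str = 'AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44';

// Same structure as above, but each field is captured.
// PREG_SET_ORDER returns one array per record instead of one per group.
preg_match_all('/(AAA|BBB|CCC|DDD);([^;]*);([^;]*);?/', $str, $m, PREG_SET_ORDER);

$records = [];
foreach ($m as $row) {
    // $row[0] is the full match; $row[1..3] are the captured fields.
    $records[] = ['code' => $row[1], 'name' => $row[2], 'value' => $row[3]];
}

print_r($records);
```

Capturing the fields this way also keeps the optional trailing semicolon out of the values.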

Can not match the last group of numbers using php preg_match()

preg_match_all("/(\d{12})(?:,|$)/","111762396541,561572500056,561729950637,561135281443",$matches);
var_dump($matches);
array (size=2)
0 =>
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
1 =>
array (size=4)
0 => string '561762396543' (length=12)
1 => string '561572500056' (length=12)
2 => string '561729950637' (length=12)
3 => string '561135281443' (length=12)
But I want the $matches like this:
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
I want to match groups of numbers (each has 12 digits) and a trailing comma if there is one. The exception is the last group of numbers: it doesn't have to match a comma, because it reaches the end of the line.
Try this instead:
preg_match_all("/(\d{12}(?:,|$))/","111762396541,561572500056,561729950637,561135281443",$matches);
When the $ is inside character class brackets [ ], it matches a literal $ character, not the end of the line.
EDIT: If you want to include the comma in your matches, then just use the above code sample and look at $matches[0].
If you want a simpler syntax, the word boundary \b also matches at the position just before a comma and at end-of-line (it does not consume the comma itself):
preg_match_all("/(\d{12}\b)/","111762396541,561572500056,561729950637,561135281443",$matches);
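A short sketch of reading the result when the comma should be part of each match, using the input string from the question. Here the outer capture group is dropped, so $matches[0] is the only entry.

```php
<?php
$input = '111762396541,561572500056,561729950637,561135281443';

// Twelve digits followed by either a comma or the end of the string.
// With no capturing group, $matches contains only the full matches.
preg_match_all('/\d{12}(?:,|$)/', $input, $matches);

print_r($matches[0]);
// 111762396541,  561572500056,  561729950637,  561135281443
```

The last entry has no comma because the $ alternative matched there instead.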

preg match all get group multiple times

I am trying to get a regular expression to capture a subgroup every time it is found. This is my code:
$string2 = 'cabbba';
preg_match_all('#c(a(b)*a)#',$string2,$result3,PREG_SET_ORDER);
var_dump($result3);
My goal is to get 'b' as a captured group each time (so 3 times). This code outputs the following:
array (size=1)
0 =>
array (size=3)
0 => string 'cabbba' (length=6)
1 => string 'abbba' (length=5)
2 => string 'b' (length=1)
I want it to show 'b' each time it appears, so something like this:
array (size=1)
0 =>
array (size=3)
0 => string 'cabbba' (length=6)
1 => string 'abbba' (length=5)
2 => array (size=3)
0 => string 'b' (length 1)
1 => string 'b' (length 1)
2 => string 'b' (length 1)
This is a simplified example, in the real code the subpattern 'b' will be different each time, but it follows the same pattern.
This is only possible with the \G anchor.
(?:ca|\G)(b)(?=b|(a))
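A minimal sketch of running that pattern against the string from the question. Each 'b' becomes its own match, so the repeated captures end up in $m[1].

```php
<?php
$string2 = 'cabbba';

// \G continues exactly where the previous match ended, so after the
// first match ('cab') every following 'b' is matched on its own.
preg_match_all('/(?:ca|\G)(b)(?=b|(a))/', $string2, $m);

print_r($m[1]); // 'b', 'b', 'b'
```

One caveat: \G also matches at the very start of the subject, so a string that begins with 'b' would match even without the leading 'ca'.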
Did you try using a non-greedy modifier for your b*?
$string2 = 'cabbba';
preg_match_all('#c(a(b)*?a)#', $string2, $result3, PREG_SET_ORDER);
var_dump($result3);
Excuse me if it's not what you asked, I'm not sure I really understood your needs...
UPDATE:
Sorry, the previous answer is wrong, please ignore it...
I'm trying to work out a correct one...
Just trying something like
preg_match_all('#c(a(?:(b{1}))*a)#', $string2, $result3, PREG_SET_ORDER);
but it doesn't work, either... :-(
UPDATE 2:
See Avinash Raj's answer, I think it's quite good...

preg_match Regex Matching Full String

I have a simple regex, but it's matching more than I want...
Basically, I'm trying to match certain operators (eg. > < != =) followed by a string.
Regex:
/^(<=|>=|<>|!=|=|<|>)(.*)/
Example subject:
>42
What I'm getting:
array (size=3)
0 => string '>42' (length=3)
1 => string '>' (length=1)
2 => string '42' (length=2)
What I'm trying to get:
array (size=2)
0 => string '>' (length=1)
1 => string '42' (length=2)
What I don't understand is that my regex works perfectly on Regex101
Edit: To clarify, how can I get rid of the full string match?
Your regex is correct. Group 0 is the whole match, group 1 is the first capture group and group 2 is the second capture group.
You are getting all 3 groups: \0, \1 and \2. See the group matching at the bottom of the Regex101 page.
Assuming your matches are in $matches, you can run array_shift($matches) to remove the \0 match if you wish.
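A minimal sketch of that, using the regex and subject from the question:

```php
<?php
preg_match('/^(<=|>=|<>|!=|=|<|>)(.*)/', '>42', $matches);

// Drop index 0 (the full match); the remaining groups are re-indexed from 0.
array_shift($matches);

var_dump($matches); // 0 => '>', 1 => '42'
```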

strange behavior of preg_match_all()

The following code:
$string ='۱۲۳۴۵۶۷۸۹۰';
$regex ='#۱#';
preg_match_all($regex,$string,$match);
var_dump($match);
will output:
array(1) {
[0] =>
array(1) {
[0] =>
string(2) "۱"
}
}
but
$regex2 ='#[۱]#';
preg_match_all($regex2,$string,$match);
var_dump($match);
will output
array (size=1)
0 =>
array (size=11)
0 => string '�' (length=1)
1 => string '�' (length=1)
2 => string '�' (length=1)
3 => string '�' (length=1)
4 => string '�' (length=1)
5 => string '�' (length=1)
6 => string '�' (length=1)
7 => string '�' (length=1)
8 => string '�' (length=1)
9 => string '�' (length=1)
10 => string '�' (length=1)
What I actually want is to use a regex like [۱۲۳۴۵۶۷۸۹۰], but the function outputs strange results with such patterns. I am using PHP 5.4.
Try adding the Unicode flag:
$regex = '#[۱]#u';
The reason is that ۱ is actually several bytes long. On its own it's harmless, because the pattern only matches that exact byte sequence. In a character class without the u flag, however, each byte is treated as a separate character, so any of those bytes can match a byte of the other characters, which it does here because the characters are close together in the code map and share a byte.
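A small sketch comparing the two modes. It assumes the source file and the string are UTF-8 encoded; the variable names are only illustrative.

```php
<?php
$string = '۱۲۳۴۵۶۷۸۹۰';

// Without u: the class contains the two bytes of ۱, and the first of
// them is also the first byte of every other digit in the string.
$byteCount = preg_match_all('#[۱]#', $string, $bytes);

// With u: pattern and subject are treated as UTF-8 characters.
preg_match_all('#[۱]#u', $string, $match);
$digitCount = preg_match_all('#[۱۲۳۴۵۶۷۸۹۰]#u', $string, $all);

var_dump($byteCount);  // 11 single-byte matches, as in the question
var_dump($match[0]);   // one match: '۱'
var_dump($digitCount); // 10
```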
