Regular expression doesn't work as expected: '/=(\w+\s*)+=/' - php

This is what I have:
<?php
preg_match_all('/=(\w+\s*)+=/', 'aaa =bbb ccc ddd eee= zzz', $match);
print_r($match);
It matches only eee:
Array
(
[0] => Array
(
[0] => =bbb ccc ddd eee=
)
[1] => Array
(
[0] => eee
)
)
I need it to match bbb, ccc, ddd, eee, e.g.:
...
[1] => Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
...
Where is the problem?

Try this regex:
(\w+)(?=[^=]*=[^=]*$)
Explaining:
(\w+) # group all words
(?= # only if right after can be found:
[^=]* # regardless of non '=' character
= # one '=' character
[^=]*$ # non '=' character till the end makes sure the first words are eliminated... You can try remove it from regex101 to see what happens.
)
Regex live here.
Hope it helps.

Thats is expected behaviour. Group captures are overwritten on repetition.
1 group, 1 capture
Instead of trying to get them in 1 match attempt, you should match one token on each attempt. Use \G to match the end of last match.
Something like this should work:
/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/
regex101 Demo
Regex break-down
(?(?=^) ... | ... ) IF on start of string
[^=]*+= consume everything up to the first =
(?=.*=) and check there's a closing = as well
ELSE
\G\s+ only match if the last match ended here, consuming preceding spaces
([^\s=]+) Match 1 token, captured in group 1.
If you're also interested in matching more than 1 set of tokens, you need to match the text in between sets as well:
/(?(?=^)[^=]*+=(?=.*=)|\G\s*+(?:=[^=]*+=(?=.*=))?)([^\s=]+)/
regex101 Demo

Your regex starts and ends with an =, so the only possible match is:
=bbb ccc ddd eee=

You can use preg_replace with preg_split, i.e.:
$string = "aaa =bbb ccc ddd eee= zzz";
$matches = preg_split('/ /', preg_replace('/^.*?=|=.*?$/', '', $string));
print_r($matches);
OUTPUT:
Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
DEMO:
http://ideone.com/pAmjbk

Related

Finding sentences between characters

I am trying to find sentences between pipe | and dot ., e.g.
| This is one. This is two.
The regex pattern I use :
preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);
So far I could not manage to capture both sentences. The regex I use captures only the first sentence.
How can I solve this problem?
EDIT: as it may seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?)
Column or pipe indicates starting point for sentences.
The sentences might be:
: Sentence one. Sentence two. Sentence three.
| Sentence one. Sentence two?
| Sentence one. Sentence two! Sentence three?
I would keep it simple and just match on:
\s*[^.|]+\s*
This says to match any content not consisting of pipes or full stops, and it also trims optional whitespace before/after each sentence.
$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);
print_r($matches[0]);
This prints:
Array
(
[0] => This is one
[1] => This is two
)
This does the job:
$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m)
Output:
Array
(
[0] => Array
(
[0] => | This is one
[1] => This is two
)
[1] => Array
(
[0] => This is one
[1] => This is two
)
)
Demo & explanation
Another option is to make use of \G to get iterative matches asserting the position at the end of the previous match and capture the values in a capturing group matching a dot and 0+ horizontal whitespace chars after.
(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*
In parts
(?: Non capturing group
\|\h* Match | and 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of previous match
) Close group
( Capture group 1
- [^.\r\n]+ Match 1+ times any char other than . or a newline
) Close group
\.\h* Match 1 . and 0+ horizontal whitespace chars
Regex demo | Php demo
For example
$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => | This is one.
[1] => This is one
)
[1] => Array
(
[0] => This is two
[1] => This is tw
)
)
To keep it simple, find everything between | and . and then split:
$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if ($matches) {
foreach($matches[1] as $match) {
print_r(preg_split('/\.\s*/', $match));
}
}
Prints:
Array
(
[0] => This is one
[1] => This is two
)
Array
(
[0] => Sentence 1
[1] => Sentence 2
)

need some help on regex in preg_match_all()

so I need to extract the ticket number "Ticket#999999" from a string.. how do i do this using regex.
my current regex is working if I have more than one number in the Ticket#9999.. but if I only have Ticket#9 it's not working please help.
current regex.
preg_match_all('/(Ticket#[0-9])\w\d+/i',$data,$matches);
thank you.
In your pattern [0-9] matches 1 digit, \w matches another digit and \d+ matches 1+ digits, thus requiring 3 digits after #.
Use
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches);
This will match:
Ticket# - a literal string Ticket#
([0-9]+) - Group 1 capturing 1 or more digits.
PHP demo:
$data = "Ticket#999999 ticket#9";
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches, PREG_SET_ORDER);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => Ticket#999999
[1] => 999999
)
[1] => Array
(
[0] => ticket#9
[1] => 9
)
)

PHP preg_match_all does not match everything

Consider the following code snippet:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $matches);
$matches array becomes:
array:3 [
0 => array:2 [
0 => "DELIM1test1DELIM2"
1 => "DELIM1test3DELIM2"
]
1 => array:2 [
0 => "test1"
1 => "test3"
]
2 => array:2 [
0 => ""
1 => ""
]
]
As you can see, it fails to get test2 and test4. Any reason why that happens and what could be a possible solution? Thank you.
.*? is non-greedy; if you have no constraint after it, it will match the minimum necessary: zero characters. You need a constraint after it to force it to match more than trivially. For example:
/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/
Lazy subpatterns at the end of the patter match either 0 (*?) or 1 (+?) characters because they match as few as possible.
You can still use lazy matching and append a lookahead that will require a DELIM1 to appear after the value or the end of string:
/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/
See demo. It is very close in terms of performance with a tempered greedy token (DELIM1(.*?)DELIM2((?:(?!DELIM1).)*) - demo).
However, the best approach is to unroll it:
DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)
See another demo
preg_split would be better:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
$keywords = preg_split("/DELIM1|DELIM2/", $example,0,PREG_SPLIT_NO_EMPTY);
print_r($keywords);
output:
Array
(
[0] => test1
[1] => test2
[2] => test3
[3] => test4
)
demo: http://ideone.com/s5nC0k
Those values are OUTSIDE of your anchors, so they won't get matched. e.g. (with some extra spaces)
str: DELIM1 test1 DELIM2 test2 DELIM1 test3 DELIM2 test4
pat: DELIM1 (.*?) DELIM2 (.*?) DELIM1 (.*?) DELIM2 (.*?)
match #1 match #2
(.*?) is a non-greedy match, and can/will match a 0-length string. Since the boundary between M2 and te is a 0-length string, that invisible zero-length character matches and the pattern terminates there.
You can use this negative lookahead regex:
preg_match_all('/DELIM1((?:(?!DELIM1|DELIM2).)*)DELIM2((?:(?!DELIM1|DELIM2).)*)/',
$example, $matches);
(?:(?!DELIM1|DELIM2).)* will match 0 or more of any character that doesn't have DELIM1 or DELIM2 at next position.
Output:
print_r($matches);
Array
(
[0] => Array
(
[0] => DELIM1test1DELIM2test2
[1] => DELIM1test3DELIM2test4
)
[1] => Array
(
[0] => test1
[1] => test3
)
[2] => Array
(
[0] => test2
[1] => test4
)
)

split string by spaces and colon but not if inside quotes

having a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf"
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
Also tried with preg_split("/[\s]+/", $str) but no clue how to escape if value is between quotes. Can anyone show me how and also please explain the regex. Thank you!
I would use PCRE verb (*SKIP)(*F),
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
DEMO
Often, when you are looking to split a string, using preg_split isn't the best approach (that seems a little counter intuitive, but that's true most of the time). A more efficient way consists to find all items (with preg_match_all) using a pattern that describes all that is not the delimiter (white-spaces here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without to have to test each branch of the
# following alternation at each position in the strings
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # eventual quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
demo
Note that with this a little long pattern, the five items of your example string are found in only 60 steps. You can use this shorter/more simple pattern too:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.
For your example, you can use preg_split with negative lookbehind (?<!\d), i.e.:
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»

regex match between 2 strings

For example I have the text
a1aabca2aa3adefa4a
I want to extract 2 and 3 with a regex between abc and def, so 1 and 4 should be not included in the result.
I tried this
if(preg_match_all('#abc(?:a(\d)a)+def#is', file_get_contents('test.txt'), $m, PREG_SET_ORDER))
print_r($m);
I get this
> Array
(
[0] => Array
(
[0] => abca1aa2adef
[1] => 3
)
)
But I want this
Array
(
[0] => Array
(
[0] => abca1aa2adef
[1] => 2
[2] => 3
)
)
Is this possible with one preg_match_all call? How can I do it?
Thanks
preg_match_all(
'/\d # match a digit
(?=.*def) # only if followed by <anything> + def
(?!.*abc) # and not followed by <anything> + abc
/x',
$subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
works on your example. It assumes that there is exactly one instance of abc and def per line in your string.
The reason why your attempt didn't work is that your capturing group (\d) that matches the digit is within another, repeated group (?:a(\d)a)+. With every repetition, the result of the capture is overwritten. This is how regular expressions work.
In other words - see what's happening during the match:
Current position Current part of regex Capturing group 1
--------------------------------------------------------------
a1a no match, advancing... undefined
abc abc undefined
a2a (?:a(\d)a) 2
a3a (?:a(\d)a) (repeated) 3 (overwrites 2)
def def 3
You ask if it is possible with a single preg_match_all.
Indeed it is.
This code outputs exactly what you want.
<?php
$subject='a1aabca2aa3adefa4a';
$pattern='/abc(?:a(\d)a+(\d)a)def/m';
preg_match_all($pattern, $subject, $all_matches,PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
$res[0]=$all_matches[0][0][0];
$res[1]=$all_matches[1][0][0];
$res[2]=$all_matches[2][0][0];
var_dump($res);
?>
Here is the output:
array
0 => string 'abca2aa3adef' (length=12)
1 => string '2' (length=1)
2 => string '3' (length=1)

Categories