This is my current regex (used in parsing an iCal file):
/(.*?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$))([\w\W]*)/
The current output using preg_match() is this:
//Output 1 - `preg_match()`
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
I would like to extend my regex to output this (i.e. find multiple matches):
//Output 2
Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
[1] => VALUE=DATE
[2] => RSVP=FALSE
[3] => LANGUAGE=en-gb
)
The regex should search for each semicolon not contained within a quoted substring and provide that as a match.
I cannot just swap to preg_match_all(), as it gives this unwanted output:
//Output 3 - `preg_match_all()`
Array
(
[0] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
[1] => Array
(
[0] => TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London"
)
[2] => Array
(
[0] => VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb
)
)
You need to use preg_match_all to get all the matches in the string.
The pattern you use isn't designed to get several results, since [\w\W]* matches everything until the end of the string.
But that is only one of your problems: a pattern designed like this needs to check, for each semicolon, whether the number of quotes is odd or even until the end of the string: (?=(?:[^"]*"[^"]*")*[^"]*$). Imagine for a minute how many times the whole string is parsed with this lookahead.
To avoid the problem, you can use a different approach that doesn't try to find semicolons but instead describes everything that is not a semicolon. So you are looking for every part of the text that contains no quotes or semicolons, plus quoted parts whatever their content.
You can use this kind of pattern:
$pattern = '~[^\r\n";]+(?:"[^"\\\]*(?:\\\.[^"\\\]*)*"[^\r\n";]*)*~';
if (preg_match_all($pattern, $str, $matches))
    print_r($matches[0]);
pattern details:
~ # pattern delimiter
[^\r\n";]+ #" # all that is not a newline, a double quote or a semicolon
(?: # non-capturing group: to include eventual quoted parts
" #"# a literal quote
[^"\\\]* #"# all that is not a quote or a backslash
(?:\\\.[^"\\\]*)* #"# optional group to deal with escaped characters
" #"#
[^\r\n";]* #"#
)* # repeat zero or more times
~
demo
(.+?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$)|$)
Try this. See demo.
https://regex101.com/r/pG1kU1/18
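As a rough check of the pattern above, here is a minimal sketch that runs it through preg_match_all() on the line from the question. The variable names $line and $params are my own, and the expected values are simply the "Output 2" list the question asks for.

```php
<?php
// Sample property line taken from the question.
$line = 'TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb';

// Group 1 lazily collects text up to a semicolon that is followed by an
// even number of double quotes (i.e. a semicolon outside quotes), or up
// to the end of the string.
$pattern = '/(.+?)(?:;(?=(?:[^"]*"[^"]*")*[^"]*$)|$)/';

preg_match_all($pattern, $line, $matches);

// $matches[1] holds the captured pieces without the separating semicolons.
$params = $matches[1];
print_r($params);
```

Note that this still runs the quote-counting lookahead at every semicolon, so it keeps the cost described in the earlier answer.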
You can use the following to match:
(.*?(?:;|$))(?![^"]*")
See DEMO
or split by:
;(?![^"]*")
See DEMO
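For the split variant, a small sketch with preg_split() might look like this. The names $line and $parts are placeholders of mine, and the second input string is made up to show a limitation.

```php
<?php
// Sample property line taken from the question.
$line = 'TZID="Greenwich Mean Time:Dublin; Edinburgh; Lisbon; London";VALUE=DATE;RSVP=FALSE;LANGUAGE=en-gb';

// Split on a semicolon that is NOT followed by "non-quotes, then a quote".
$parts = preg_split('/;(?![^"]*")/', $line);
print_r($parts);

// Caveat: the lookahead only asks whether any quote follows, so a
// semicolon that comes BEFORE a quoted value is left unsplit as well.
$unsplit = preg_split('/;(?![^"]*")/', 'VALUE=DATE;TZID="a;b"');
print_r($unsplit);
```

So this shorter lookahead seems fine when the quoted value comes first, as in the question, but it is probably not safe for parameters in arbitrary order.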
I've had a good look around for a question that asked this before; alas, my search for a PHP preg_match search returned no results (maybe my searching skills fell short, I suppose justified considering it's a Regex question!).
Consider the text below:
The quick __("brown ") fox jumps __('over the') lazy __("dog")
Currently I need to 'scan' for the method call __('') shown above, where it could include spacing and either kind of quotation mark (' or "). My best attempt after numerous 'iterations':
(__\("(.*?)"\))|(__\('(.*?)'\))
Or at its simplest form:
__\((.*?)\)
To break this down:
Anything that starts with __
Escaped ( and quotation mark " or '. Thus, \(\"
(.*?) Non-greedy match of all characters
Escaped closing " and last bracket.
| between the two expressions match either/or.
However, this only gets partial matches, and spaces are throwing off the search entirely. Apologies if this has been asked before, please link me if so!
Tester Link for the pattern provided above:
PHP Live Regex Test Tool
When the searched method string uses single quotes it will end up in another capture group than if it has double quotes. So in fact, your regular expression works (except for the spaces, see further down), but you'd have to look at a different index in your result array:
$input = 'The quick __("brown ") fox jumps __(\'over the\') lazy __("dog")';
// using your regular expression:
$res = preg_match_all("/(__\(\"(.*?)\"\))|(__\('(.*?)'\))/", $input, $matches);
print_r ($matches);
Note that you need preg_match_all instead of preg_match to get all matches.
Output:
Array
(
[0] => Array
(
[0] => __("brown ")
[1] => __('over the')
[2] => __("dog")
)
[1] => Array
(
[0] => __("brown ")
[1] =>
[2] => __("dog")
)
[2] => Array
(
[0] => brown
[1] =>
[2] => dog
)
[3] => Array
(
[0] =>
[1] => __('over the')
[2] =>
)
[4] => Array
(
[0] =>
[1] => over the
[2] =>
)
)
So, the result array has 5 elements, the first one representing the complete match, and all the others correspond to the 4 capture groups you have in your regular expression. As the capture groups for single quotes are not those of the double quotes, you'll find the matches at different places.
To "solve" this, you could use a back reference in your regular expression, which would look back to see which was the opening quote (single or double) and require the same to be repeated at the end:
$res = preg_match_all("/__\(([\"'])(.*?)\\1\)/", $input, $matches);
Note the back reference \1 (the backslash had to be escaped with another one). This refers back to the first capture group, where we have ["'] (again an escape was necessary) to match both kinds of quotes.
You also wanted to deal with spaces. On your PHP Live Regex you used a test string that had such spaces between the brackets and quotes. To deal with these so they still match the method strings correctly, the regular expression should get two additional \s*:
$res = preg_match_all("/__\(\s*([\"'])(.*?)\\1\s*\)/", $input, $matches);
Now the output is:
Array
(
[0] => Array
(
[0] => __("brown ")
[1] => __('over the')
[2] => __("dog")
)
[1] => Array
(
[0] => "
[1] => '
[2] => "
)
[2] => Array
(
[0] => brown
[1] => over the
[2] => dog
)
)
... and the text captured by the groups is now nicely arranged.
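To pull out just the translated strings, you would read group 2. Here is a minimal sketch of that, using a test string with extra spaces that I made up and a single-quoted PHP string so the back reference needs only one backslash:

```php
<?php
// Test string with extra spaces inside the brackets (made up for this sketch).
$input = 'The quick __( "brown " ) fox jumps __(\'over the\') lazy __( "dog")';

// Group 1 remembers the opening quote, \1 requires the same quote to close,
// and \s* tolerates spaces between the brackets and the quotes.
$pattern = '/__\(\s*(["\'])(.*?)\1\s*\)/';

preg_match_all($pattern, $input, $matches);

// Group 2 is the text between the quotes, whichever quote style was used.
$strings = $matches[2];
print_r($strings);
```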
See this code run on eval.in and PHP Live Regex.
When working with stuff like this, don't forget about escaping:
<?php
ob_start();
?>
The quick __("brown ") fox jumps __( 'over the' ) lazy __("dog").
And __("everyone says \"hi\"").
<?php
$content = ob_get_clean();
$re = <<<RE
/__ \(
\s*
(?:   # group both quote styles so the prefix and suffix apply to each
" ( (?: \\\\. | [^"])+ ) "
|
' ( (?: \\\\. | [^'])+ ) '
)
\s*
\)
/x
RE;
preg_match_all($re, $content, $matches, PREG_SET_ORDER);
foreach($matches as $match)
echo end($match), "\n";
How about this:
(__(\('[^']+'\)|\("[^"]+"\)))
Instead of the non-greedy ., use any character except the quote: [^'] or [^"].
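A quick sketch of that idea, written with a non-capturing group instead of the capturing ones above (that change and the variable names are mine, not part of the answer):

```php
<?php
$input = 'The quick __("brown ") fox jumps __(\'over the\') lazy __("dog")';

// The negated classes [^'] and [^"] cannot run past the closing quote,
// so no lazy quantifier is needed.
$pattern = '/__\((?:\'[^\']+\'|"[^"]+")\)/';

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the complete __(...) calls.
$calls = $matches[0];
print_r($calls);
```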
Enclose double and single quotes with square brackets as a character class:
$str = 'The quick __( "brown ") fox jumps __(\'over the\') lazy __("dog")';
preg_match_all("/__\(\s*([\"']).*?\\1\s*\)/ium", $str, $matches);
echo '<pre>';
var_dump($matches[0]);
// the output:
array (size=3)
0 => string '__( "brown ")'
1 => string '__('over the')'
2 => string '__("dog")'
And here is example with the same solution on phpliveregex.com:
http://www.phpliveregex.com/p/exF
(section preg_match_all)
Here is my test code:
$test = '#12345 abc #12 #abd engng#geneng';
preg_match_all('/(^|\s)#([^# ]+)/', $test, $matches);
print_r($matches);
And the output $matches:
Array
(
[0] => Array
(
[0] => #12345
[1] => #12
[2] => #abd
)
[1] => Array
(
[0] =>
[1] =>
[2] =>
)
[2] => Array
(
[0] => 12345
[1] => 12
[2] => abd
)
)
My question is why does it have an empty row?
[1] => Array ( [0] => [1] => [2] => )
If I get rid of (^|\s) in the regex, the second row will disappear. However, I would then not be able to prevent matching #geneng.
Any answer will be appreciated.
The problem with your regular expression is that it matches # even when it is preceded by whitespace. Because \s will match the whitespace, it will be captured into $matches array. You can solve this problem by using lookarounds. In this case, it can be solved with a positive lookbehind:
preg_match_all('/(?<=^|\s)#([^# ]+)/', $test, $matches);
This will match the part after # only if it is preceded by a space or beginning-of-the line anchor. It's important to note that lookarounds do not actually consume characters. They just assert that the given regular expression is either followed or preceded by something.
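Here is a small sketch of what the lookbehind version returns for the test string from the question. Because the lookbehind consumes nothing, the full matches should carry no leading space and there is only one capture group left.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';

// (?<=^|\s) only asserts that a line start or whitespace precedes the #;
// it is not a capture group and adds nothing to the match.
preg_match_all('/(?<=^|\s)#([^# ]+)/', $test, $matches);

print_r($matches);
```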
Demo
It's because of the memory capture to test (^|\s):
preg_match_all('/(^|\s)#([^# ]+)/', $test, $matches);
                 ^^^^^^
It's captured as memory location #1, so to avoid that you can simply use non-capturing parentheses:
preg_match_all('/(?:^|\s)#([^# ]+)/', $test, $matches);
                  ^^
preg_match_all uses by default the PREG_PATTERN_ORDER flag. This means that you will obtain:
$matches[0] -> all substrings that matches the whole pattern
$matches[1] -> all capture groups 1
$matches[2] -> all capture groups 2
etc.
You can change this behavior using the PREG_SET_ORDER flag:
$matches[0] -> array with the whole pattern and the capture groups for the first result
$matches[1] -> same for the second result
$matches[2] -> etc.
In your code (PREG_PATTERN_ORDER by default) you obtain $matches[1] with only empty or blank items because it is the content of capture group 1, (^|\s).
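To make the difference between the two flags concrete, here is a minimal sketch that runs one pattern both ways. I use the non-capturing form (?:^|\s) so only one capture group remains; $byGroup and $byMatch are names chosen for this example.

```php
<?php
$test = '#12345 abc #12 #abd engng#geneng';
$pattern = '/(?:^|\s)#([^# ]+)/';

// Default PREG_PATTERN_ORDER: one array per group, indexed by match.
preg_match_all($pattern, $test, $byGroup);

// PREG_SET_ORDER: one array per match, indexed by group.
preg_match_all($pattern, $test, $byMatch, PREG_SET_ORDER);

print_r($byGroup);
print_r($byMatch);
```

Since (?:^|\s) still consumes the whitespace, the full matches for the second and third tags start with a space; only the capture group is clean.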
There are 2 sets of parentheses, which is why you get an empty row. PHP treats each set as a capture group and returns one array per group. Removing one of them will remove one array.
FYI: in this case you cannot use [^|\s] instead of (^|\s), because inside a character class a leading ^ negates the class, so PHP will think you want to exclude | and whitespace.
Following on from my previous question about preg_split, which was answered super fast thanks to nick, I would really like to extend the scenario so that the string is not split when a delimiter is within quotes. For example:
If I have the string foo = bar AND bar=foo OR foobar="foo bar", I'd like to split the string on every space or = character and include the = character in the returned array (which works great currently), but I don't want to split the string if either of the delimiters is within quotes.
I've got this so far:
<!doctype html>
<?php
$string = 'foo = bar AND bar=foo';
$array = preg_split('/ +|(=)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
?>
<pre>
<?php
print_r($array);
?>
</pre>
Which gets me:
Array
(
[0] => foo
[1] => =
[2] => bar
[3] => AND
[4] => bar
[5] => =
[6] => foo
)
But if I changed the string to:
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';
I'd really like the array to be:
Array
(
[0] => foo
[1] => =
[2] => bar
[3] => AND
[4] => bar
[5] => =
[6] => foo
[7] => OR
[8] => foobar
[9] => =
[10] => "foo bar"
)
Notice the "foo bar" wasn't split on the space because it's in quotes?
Really not sure how to do this within the RegEx or if there is even a better way but all your help would be very much appreciated!
Thank you all in advance!
Try
$array = preg_split('/(?: +|(=))(?=(?:[^"]*"[^"]*")*[^"]*$)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
The
(?=(?:[^"]*"[^"]*")*[^"]*$)
part is a lookahead assertion making sure that there is an even number of quote characters ahead in the string, therefore it will fail if the current position is between quotes:
(?= # Assert that the following can be matched:
(?: # A group containing...
[^"]*" # any number of non-quote characters followed by one quote
[^"]*" # the same (to ensure an even number of quotes)
)* # ...repeated zero or more times,
[^"]* # followed by any number of non-quotes
$ # until the end of the string
)
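Put together, a sketch of the split on the longer string from the question could look like this (the variable name $tokens is mine):

```php
<?php
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';

// Split on runs of spaces or on "=" (captured so it is kept), but only
// where an even number of quotes follows, i.e. outside quoted text.
$pattern = '/(?: +|(=))(?=(?:[^"]*"[^"]*")*[^"]*$)/';

$tokens = preg_split(
    $pattern,
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
print_r($tokens);
```

The space inside "foo bar" has a single quote after it, so the lookahead fails there and the quoted value stays in one piece.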
I was able to do this by adding quoted strings as a delimiter a-la
"(.*?)"| +|(=)
The quoted part will be captured. It seems like this is a bit tenuous and I did not test it extensively, but it at least works on your example.
But why bother splitting?
After a look at this old question, this simple solution comes to mind, using a preg_match_all rather than a preg_split. We can use this simple regex to specify what we want:
"[^"]*"|\b\w+\b|=
See online demo.
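A minimal sketch of the matching approach on the same string, assuming tokens are always quoted strings, plain words or = signs (the name $tokens is mine):

```php
<?php
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';

// Alternatives are tried left to right: a whole quoted string first,
// then a plain word, then a single "=".
preg_match_all('/"[^"]*"|\b\w+\b|=/', $string, $matches);

$tokens = $matches[0];
print_r($tokens);
```

Anything that fits none of the three alternatives (other punctuation, for instance) is silently skipped, which may or may not be what you want.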
I am working through a simple tutorial that picks up keywords automatically; the code is below:
$content = "#abc i love you #def #you , and you?";
preg_match_all("/[\n\r\t]*\#(.+?)\s/s",$content, $tag_matches);
print_r($tag_matches);
output:-
Array ( [0] => Array ( [0] => #abc [1] => #def [2] => #you ) [1] => Array ( [0] => abc [1] => def [2] => you ) )
Words prefixed with the '#' symbol are the keywords.
The output is correct, but if I put a punctuation symbol right after a keyword, e.g. #you, , the output becomes you, . How do I filter out punctuation symbols after keywords?
Besides this, if I put keywords together, like #def#you, , the output is def#you, . Can anyone help me separate them?
Thanks All.
Try using a word boundary \b instead of whitespace \s. That will stop the match when it reaches anything other than a word character (i.e., [a-zA-Z0-9_]).
/[\n\r\t]*\#(.+?)\b/s
Conceptually, that's what you were trying to do anyway by putting whitespace there (i.e., denote end of word).
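Here is a small sketch of the \b version on a string that combines both problem cases from the question (punctuation after a tag and two tags run together). The test string is my own variation of the original.

```php
<?php
// "#def#you," has two tags run together and a trailing comma.
$content = "#abc i love you #def#you, and you?";

// \b ends the lazy match at the first word boundary, so neither the
// comma nor the following # is swallowed.
preg_match_all('/[\n\r\t]*\#(.+?)\b/s', $content, $tag_matches);

$tags = $tag_matches[1];
print_r($tags);
```

Because . still matches any character, a # followed directly by a non-word character could produce an odd match; that case is not covered here.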
You could try:
/[\n\r\t]*\#([\w]*)\s/s
Here \w* is greedy but can only match word characters, so it stops at punctuation, whereas by matching . you are matching every character. If you have tags which are hyphenated you may want to add - inside of the brackets.
Note: See the bottom of this post for an explanation for why this wasn't originally working.
In PHP, I am attempting to match lower-case characters at the end of every line in a string buffer.
The regex pattern should be [a-z]$. But that only matches the last letter of the string. I believe this is a regex modifier issue; I have experimented with /s /m /D, but nothing appears to match as expected.
<?php
$pattern = '/[a-z]$/';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
Here's the output:
Array
(
[0] => Array
(
[0] => e
)
)
Here's what I expect the output to be:
Array (
[0] => Array (
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Any advice?
Update: The PHP source code was written on a Windows machine; text editors on Windows, by convention, represent newlines differently than text editors on Unix systems.
It appears that the byte-code representation of Windows text files (inheriting from DOS) was not respected by the PHP regex engine. Converting the end-of-line byte-code format to Unix solved the original problem.
Adam Wagner (see below) has posted a pattern that matches regardless of end-of-line byte-representation.
zerkms has the canonical regular expression, to which I am awarding the answer.
$pattern = '/[a-z]$/m';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
http://ideone.com/XkeD2
This will return exactly what you want.
As #Will points out, it appears you either want the first char of each string, or your example is wrong. If you want the last char of each line (only if it's a lower-case char) you could try this:
/[a-z](?:\n)|[a-z]$/
The first segment, [a-z](?:\n), checks for lowercase chars before newlines. Then [a-z]$ gets the last char of the string (in case it's not followed by a newline).
With your example string, the output is:
Array
(
[0] => Array
(
[0] => s
[1] => a
[2] => n
[3] => e
)
)
Note - The 's' from 'is' is not present because it is followed by a space. To capture this 's' as well (ignoring trailing spaces), you can update the regex to: /[a-z](?:[ ]*\n)|[a-z](?:[ ]*)$/, which checks for 0 or more spaces immediately before the newline (or end of string). Which outputs:
Array
(
[0] => Array
(
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Update
It appears the line-ending style wasn't liking your regex. To account for crazy line-endings (and other unsavory white-space at the end of the lines), you can use this (and still get the /m goodness).
/[a-z](?:\W*)$/m
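One thing to be aware of is that \W* is part of the match, so the matched text includes the trailing whitespace. A possible variation, which is my own adjustment rather than the pattern above, puts the letter in a capture group and reads that instead. The sketch below writes the Windows line endings out explicitly as \r\n:

```php
<?php
// Same sentence as in the question, with CRLF line endings and the
// trailing space after "is" written out explicitly.
$string = "this\r\nis \r\na\r\nbroken\r\nsentence";

// Capture the lower-case letter; \W* then absorbs spaces and the \r
// before $ matches at the end of each line (/m).
preg_match_all('/([a-z])\W*$/m', $string, $matches);

// Group 1 holds only the letters, without the trailing whitespace.
$lastLetters = $matches[1];
print_r($lastLetters);
```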
It looks like you want to match before every newline, not at the end of the file. Perhaps you want
$pattern = '/[a-z]\n/';