php regex: Use quotes for match, but don't capture them - php

I'm unsure if I should be using preg_match, preg_match_all, or preg_split with delim capture. I'm also unsure of the correct regex.
Given the following:
$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
I want to get an array with the following elems:
[0] = "ok"
[1] = "that\'s"
[2] = "yeah that's \"cool\""

You cannot do this reliably with a simple regular expression, because quotes can be escaped and one kind of quote can appear inside the other. Write a small parser.
Outline:
Read the input character by character; if you see a \, remember it.
If you see a " or ', check whether the previous character was a \. You now have your delimiting condition.
Record all the tokens in this manner.
Your desired result set seems to trim spaces, and you also lost a couple of the backslashes; perhaps this is a mistake, but it can be important.
I would expect:
[0] = " ok " // <-- spaces here
[1] = "that\\'s cool"
[2] = " \"yeah that's \\\"cool\\\"\"" // leading space here, and \" remains
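The outline above can be sketched as a small state machine. This is only a minimal sketch, not the answerer's code: the function name tokenize is hypothetical, and it assumes that tokens are separated by whitespace, that the surrounding quotes are dropped, and that backslashes are kept as written. It tracks an "escaped" flag instead of looking back one character, so a doubled backslash before a quote is handled too.

```php
<?php
// Hypothetical helper: split a string into bare words and quoted tokens.
function tokenize(string $input): array
{
    $tokens  = [];
    $current = '';
    $quote   = null;   // the quote character we are inside, or null
    $escaped = false;  // true when the previous character was an unescaped backslash

    for ($i = 0, $length = strlen($input); $i < $length; $i++) {
        $char = $input[$i];

        if ($escaped) {
            // Whatever follows a backslash is taken literally.
            $current .= $char;
            $escaped = false;
        } elseif ($char === '\\') {
            $current .= $char;
            $escaped = true;
        } elseif ($quote !== null) {
            if ($char === $quote) {
                // Unescaped closing quote: the token is complete.
                $tokens[] = $current;
                $current  = '';
                $quote    = null;
            } else {
                $current .= $char;
            }
        } elseif ($char === '"' || $char === "'") {
            // Opening quote: flush any bare word collected so far.
            if ($current !== '') {
                $tokens[] = $current;
                $current  = '';
            }
            $quote = $char;
        } elseif (ctype_space($char)) {
            // Whitespace outside quotes ends a bare word.
            if ($current !== '') {
                $tokens[] = $current;
                $current  = '';
            }
        } else {
            $current .= $char;
        }
    }

    // An unterminated quote or trailing bare word is kept as the last token.
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
print_r(tokenize($string));
```

With the string from the question, this yields the three tokens ok, that\'s cool and yeah that's \"cool\". If the surrounding spaces and quotes should be preserved instead, the flush points are the places to change.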

Actually, you might be surprised to find that you can do this in regex:
preg_match_all("((?|\"((?:\\\\.|[^\"])+)\"|'((?:\\\\.|[^'])+)'|(\w+)))",$string,$m);
The desired result array will be in $m[1].

You can do it with a regex:
$pattern = <<<'LOD'
~
(?J)
# Definitions #
(?(DEFINE)
(?<ens> (?> \\{2} )+ ) # even number of backslashes
(?<sqc> (?> [^\s'\\]++ | \s++ (?!'|$) | \g<ens> | \\ '?+ )+ ) # single quotes content
(?<dqc> (?> [^\s"\\]++ | \s++ (?!"|$) | \g<ens> | \\ "?+ )+ ) # double quotes content
(?<con> (?> [^\s"'\\]++ | \s++ (?!["']|$) | \g<ens> | \\ ["']?+ )+ ) # content
)
# Pattern #
\s*+ (?<res> \g<con>)
| ' \s*+ (?<res> \g<sqc>) \s*+ '?+
| " \s*+ (?<res> \g<dqc>) \s*+ "?+
~x
LOD;
$subject = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
var_dump($match['res']);
}
I chose to trim spaces in all results, so " abcd " will give abcd. This pattern allows as many backslashes as you want, anywhere you want. If a quoted string is not closed at the end of the string, the end of the string is treated as the closing quote (this is why I have made the closing quotes optional). So abcd " ef'gh will give you abcd and ef'gh.

Related

Regex split string on a char with exception for inner-string

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it to get all the values separated by the | character, with an exception for | characters contained in quoted strings.
The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']
I've already found an answer to a similar question here: https://stackoverflow.com/a/11457952/11260467
However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out here who is better than me with regular expressions?
This is easily done with the (*SKIP)(*FAIL) functionality pcre offers:
(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*
In PHP this could be:
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';
$splitted = preg_split($pattern, $string);
print_r($splitted);
?>
And would yield
Array
(
[0] => aa
[1] => bb
[2] => "cc | dd"
[3] => 'ee | ff'
)
See a demo on regex101.com and on ideone.com.
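If "multiple separator characters" in the question means more than one kind of delimiter, the same idea should extend by swapping the literal pipe for a character class. The following is a sketch under that assumption; the comma as a second delimiter and the sample string are made up for illustration.

```php
<?php
// Assumed input: both | and , act as separators outside quotes.
$string = "aa | bb, \"cc | dd\" , 'ee, ff'";

// Left alternative: match a whole quoted string, then (*SKIP)(*FAIL) discards it
// and resumes the search after the closing quote.
// Right alternative: the actual separator, with optional surrounding whitespace.
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*[|,]\s*~';

$parts = preg_split($pattern, $string);
print_r($parts);
```

The separators inside "cc | dd" and 'ee, ff' are left alone because the quoted strings are consumed and skipped before the right-hand alternative ever sees them.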
This is easier if you match the parts instead of splitting. Alternatives are tried from left to right and quantifiers are greedy by default, so they consume as many characters as possible. This lets you put the more complex patterns for quoted strings before the pattern for an unquoted token:
$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';
$pattern = <<<'PATTERN'
(
(?:[|[]|^) # after | or [ or string start
\s*
(?<token> # name the match
"[^"]*" # string in double quotes
|
'[^']*' # string in single quotes
|
[^\s|]+ # non-whitespace
)
\s*
)x
PATTERN;
preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);
Output:
array(4) {
[0]=>
string(2) "aa"
[1]=>
string(2) "bb"
[2]=>
string(9) ""cc | dd""
[3]=>
string(9) "'ee | ff'"
}
Hints:
The <<<'PATTERN' is called NOWDOC syntax (a heredoc with a single-quoted identifier) and cuts down on escaping
I use () as pattern delimiters - they are group 0
Naming matches makes code a lot more readable
Modifier x allows you to indent and comment the pattern
Use
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));
See PHP proof.
Results:
Array
(
[0] => aa
[1] => bb
[2] => cc | dd
[3] => ee | ff
)
EXPLANATION
(?|              branch reset group (does not capture by itself):
  \"               '"'
  (                group and capture to \1:
    [^\"]*           any character except '"' (0 or more times, as many as possible)
  )                end of \1
  \"               '"'
 |               OR
  '                '\''
  (                group and capture to \1:
    [^']*            any character except ''' (0 or more times, as many as possible)
  )                end of \1
  '                '\''
 |               OR
  (                group and capture to \1:
    [^|'\"]+         any character except '|', ''', '"' (1 or more times, as many as possible)
  )                end of \1
)                end of grouping
(?:              group, but do not capture:
  \s*              whitespace (\n, \r, \t, \f, and " ") (0 or more times, as many as possible)
  \|               '|'
  \s*              whitespace (\n, \r, \t, \f, and " ") (0 or more times, as many as possible)
 |               OR
  \z               the end of the string
)                end of grouping
It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.
(['"]).*?\1\K| *\| *
PCRE Demo
(['"]) # match a single or double quote and save to capture group 1
.*? # match zero or more characters lazily
\1 # match the content of capture group 1
\K # reset the starting point of the reported match and discard
# any previously-consumed characters from the reported match
| # or
\ * # match zero or more spaces
\| # match a pipe character
\ * # match zero or more spaces
Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.

PHP: Parse comma-delimited string outside single and double quotes and parentheses

I've found several partial answers to this question, but none that cover all my needs...
I am trying to parse a user generated string as if it were a series of php function arguments to determine the number of arguments:
This string:
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
will be inserted as the arguments of a function:
function my_function( [insert string here] ){ ... }
I need to parse the string on the commas, taking into account single- and double-quotes, parentheses, and escaped quotes and parentheses to create an array:
array(4) {
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
}
Any help with a regular expression or parser function to accomplish this is appreciated!
This cannot be solved with a classic CSV tool, since more than one character can protect parts of the string.
Using preg_split is possible, but it results in a very complicated and inefficient pattern, so the best way is to use preg_match_all. There are, however, several problems to solve:
a comma enclosed in quotes or parentheses must be ignored (treated as an ordinary character, not as a delimiter)
you need to extract the params, but you also need to check that the string is well formed, otherwise the match results may be completely wrong!
For the first point, you can define subpatterns to describe each case: the quoted parts, the parts enclosed in parentheses, and a more general subpattern that matches a complete param and uses the two previous subpatterns when needed.
Note that the parentheses subpattern needs to refer to the general subpattern too, since it can contain anything (including commas).
The second point can be solved with the \G anchor, which ensures that all matches are contiguous. But you also need to be sure that the end of the string has been reached. To do that, add an optional empty capture group at the end of the main pattern; it is created only if the end-of-string anchor \z succeeds.
$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;
$pattern = <<<'EOD'
~
# named groups definitions
(?(DEFINE) # this definition group allows to define the subpatterns you want
# without matching anything
(?<quotes>
' [^'\\]*+ (?s:\\.[^'\\]*)*+ ' | " [^"\\]*+ (?s:\\.[^"\\]*)*+ "
)
(?<brackets> \( \g<content> (?: ,+ \g<content> )*+ \) )
(?<content> [^,'"()]*+ # ' # (<-- comment for SO syntax highlighting)
(?:
(?: \g<brackets> | \g<quotes> )
[^,'"()]* # ' #
)*+
)
)
# the main pattern
(?: # two possible beginnings
\G(?!\A) , # a comma contiguous to a previous match
| # OR
\A # the start of the string
)
(?<param> \g<content> )
(?: \z (?<check>) )? # create an item "check" when the end is reached
~x
EOD;
$result = false;
if ( preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER) &&
isset(end($matches)['check']) )
$result = array_map(function ($i) { return $i['param']; }, $matches);
else
echo 'bad format' . PHP_EOL;
var_dump($result);
demo
You could split the argument string at ,$ and then prepend $ back onto the array values (this assumes every argument starts with $ and that ,$ never occurs inside a quoted value):
$args_array = explode(',$', $arg_str);
foreach($args_array as $key => $arg_raw) {
$args_array[$key] = '$'.ltrim($arg_raw, '$');
}
print_r($args_array);
Output:
(
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
)
If you want to use a regex, you can use something like this:
(.+?)(?:,(?=\$)|$)
Working demo
Php code:
$re = '/(.+?)(?:,(?=\$)|$)/';
$str = "\$arg1,\$arg2='ABC,DEF',\$arg3=\"GHI\",JKL\",\$arg4=array(1,'2)',\"3\"),\")\n";
preg_match_all($re, $str, $matches);
Match information:
MATCH 1
1. [0-5] `$arg1`
MATCH 2
1. [6-21] `$arg2='ABC,DEF'`
MATCH 3
1. [22-39] `$arg3="GHI\",JKL"`
MATCH 4
1. [40-67] `$arg4=array(1,'2)',"3\"),")`

preg_split shortcode attributes into array

I would like to parse shortcode into array via "preg_split".
This is example shortcode:
[contactform id="8411" label="This is \" first label" label2='This is second \' label']
and this should be result array:
Array
(
[id] => 8411
[label] => This is \" first label
[label2] => This is second \' label
)
I have this regexp:
$atts_arr = preg_split('~\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)~', trim($shortcode, '[]'));
Unfortunately, this works only if there is no escaping of quotes \' or \".
Thx in advance!
Using preg_split is not always handy or appropriate, in particular when you have to deal with escaped quotes. A better approach is to use preg_match_all, for example:
$pattern = <<<'EOD'
~
(\w+) \s*=
(?|
\s* "([^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
~xs
EOD;
if (preg_match_all($pattern, $yourshortcode, $matches))
$attributes = array_combine($matches[1], $matches[2]);
The pattern uses the branch reset feature (?|...(..)...|...(...)..) that gives the same number(s) to the capture groups for each branch.
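To make the effect of the branch reset visible, here is a small comparison. It is a sketch with a made-up attribute string; the variable names $plain and $reset are only for illustration.

```php
<?php
$subject = "label2='second'";

// Without branch reset: the double-quoted value is group 2,
// the single-quoted value is group 3.
preg_match('~(\w+)=(?:"([^"]*)"|\'([^\']*)\')~', $subject, $plain);

// With branch reset (?|...): whichever branch matches, the value is group 2.
preg_match('~(\w+)=(?|"([^"]*)"|\'([^\']*)\')~', $subject, $reset);

var_dump($plain[2], $plain[3]); // group 2 is empty, group 3 holds "second"
var_dump($reset[2]);            // group 2 holds "second"
```

This is why array_combine($matches[1], $matches[2]) in the code above works for both quote styles without checking which branch matched.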
I mentioned the \G anchor in my comment: this anchor succeeds if the current position is immediately after the last match. It can be useful if you want to check the syntax of your shortcode from start to end at the same time (otherwise it is totally useless). Example:
$pattern2 = <<<'EOD'
~
(?:
\G(?!\A) # anchor for the position after the last match
# it ensures that all matches are contiguous
|
\[(?<tagName>\w+) # beginning of the shortcode
)
\s+
(?<key>\w+) \s*=
(?|
\s* "(?<value>[^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
(?<end>\s*+]\z)? # check that the end has been reached
~xs
EOD;
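// $matches['end'] always exists; its last item is non-empty only if the final match reached "]"
if (preg_match_all($pattern2, $yourshortcode, $matches) && end($matches['end']) !== '')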
$attributes = array_combine($matches['key'], $matches['value']);

Get all strings matching pattern in text

I'm trying to get from a text all strings which are between t(" and ") or t(' and ').
I came up with the regexp /[^t\(("|\')]*(?=("|\')\))/, but it does not ignore the character 't' when it is not followed by '('.
For example:
$str = 'This is a text, t("string1"), t(\'string2\')';
preg_match_all('/[^t\(("|\')]*(?=("|\')\))/', $str, $m);
var_dump($m);
returns ring1 and ring2, but I need to get string1 and string2.
You can also consider this.
You need a separate alternative for each quote type.
(?<=t\(").*?(?="\))|(?<=t\(\').*?(?='\))
DEMO
Code:
$re = "/(?<=t\\(\").*?(?=\"\\))|(?<=t\\(\\').*?(?='\\))/m";
$str = "This is a text, t(\"string1\"), t('string2')";
preg_match_all($re, $str, $matches);
OR
Use capturing group along with \K
t\((['"])\K.*?(?=\1\))
DEMO
\K discards the previously matched characters from the final match result.
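In PHP, the \K variant could be used like this. It is a minimal sketch using the sample string from the question; the ~ delimiter is just a choice that avoids extra escaping.

```php
<?php
$str = 'This is a text, t("string1"), t(\'string2\')';

// ([\'"]) captures the opening quote, \K drops 't(' and the quote from the
// reported match, and the lookahead requires the same quote followed by ')'.
preg_match_all('~t\(([\'"])\K.*?(?=\1\))~', $str, $matches);

print_r($matches[0]); // the full matches, without the t(" prefix
```

Because \K only affects what is reported, the wanted strings are in $matches[0]; $matches[1] holds the quote character that opened each string.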
You can do it in few steps with this pattern:
$pattern = '~t\((?|"([^"\\\]*+(?s:\\\.[^"\\\]*)*+)"\)|\'([^\'\\\]*+(?s:\\\.[^\'\\\]*)*+)\'\))~';
if (preg_match_all($pattern, $str, $matches))
print_r($matches[1]);
It is a little long and repetitive, but it is fast and can deal with escaped quotes.
details:
t\(
(?| # Branch reset feature allows captures to have the same number
"
( # capture group 1
[^"\\]*+ # all that is not a double quote or a backslash
(?s: # non capturing group in singleline mode
\\. # an escaped character
[^"\\]* # all that is not a double quote or a backslash
)*+
)
"\)
| # OR the same with single quotes (and always in capture group 1)
'([^'\\]*+(?s:\\.[^'\\]*)*+)'\)
)
demo

regex exclude nested html tag

I have a piece of text:
<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>
<span class="style6">SOLD</span></strong>
and I want to remove every <strong>...</strong> element that contains $, euros or Euros.
So far I have:
preg_replace('#<strong>.*?(^<strong>).*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);
but it is not working... I also tried a negative lookahead (?!), but it is still not working...
Any help? Thanks
Assuming you expect two <strong> tags before your Euros, I think this may be what you want: preg_replace('#^<strong>.*?<strong>.*?(\$|euros|Euros|EUROS).*?</strong>#is', '', $result);
You can try this; you must use the 'Dot-All' modifier (s) or substitute [\S\s] for the dot:
# <strong>(?:(?!\1)(?:\$|euros|Euros|EUROS)()|(?!<strong>).)+</strong>\1
<strong>
(?:
(?! \1 )
(?: \$ | euros | Euros | EUROS )
( )
|
(?! <strong> )
.
)+
</strong>
\1
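Applied in PHP, the one-line form of this pattern could look as follows. This is a sketch using the text from the question; it relies on PCRE's default behaviour that a backreference to a group that has not been set fails to match, so the trailing \1 only succeeds once the empty group after the keyword has been reached.

```php
<?php
$html = "<strong>blalblalba</strong>blasldasdsadasdasd<strong> 3.5m Euros<br>\n"
      . '<span class="style6">SOLD</span></strong>';

// (?!\1) lets the keyword branch run only while group 1 is still unset;
// the empty group () records that a keyword was seen;
// (?!<strong>). consumes any other character without crossing a new <strong>;
// the final \1 fails unless a keyword was found inside this element.
$pattern = '~<strong>(?:(?!\1)(?:\$|euros|Euros|EUROS)()|(?!<strong>).)+</strong>\1~s';

$result = preg_replace($pattern, '', $html);
var_dump($result);
```

For this input, only the second <strong> element (the one containing Euros) is removed; the first one has no keyword, so the final \1 fails and it is left in place.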
