Parsing Blocks with Regular Expression in PHP

I am stuck with parsing a string containing key-value pairs with operators in between (like the one below) in PHP. I am planning to use a regex to parse it (I am not good at regexes, though).
key: "value" & key2 : "value2" | title: "something \"here\"..." &( key: "this value in paranthesis" | key: "another value")
Basically, the units in the above block are as follows:
key - Anything that qualifies as a JavaScript variable name.
value - Any string, long or short, enclosed in double quotes ("").
pair - (key:value) A key and a value joined by a colon, just like in JavaScript objects.
operator - (& or |) Simply indicating 'AND' or 'OR'.
There can be multiple blocks nested within parentheses ( and ).
Inspired by Matt (http://stackoverflow.com/questions/2467955/convert-javascript-regular-expression-to-php-pcre-expression), I have used the following regular expressions.
$regs[':number'] = '(?:-?\\b(?:0|[1-9][0-9]*)(?:\\.[0-9]+)?(?:[eE][+-]?[0-9]+)?\\b)';
$regs[':oneChar'] = '(?:[^\\0-\\x08\\x0a-\\x1f\"\\\\]|\\\\(?:[\"/\\\\bfnrt]|u[0-9A-Fa-f]{4}))';
$regs[':string'] = '(?:\"'.$regs[':oneChar'].'*\")';
$regs[':varName'] = '\\$(?:'.$regs[':oneChar'].'[^ ,]*)';
$regs[':func'] = '(?:{[ ]*'.$regs[':oneChar'].'[^ ]*)';
$regs[':key'] = "({$regs[':varName']})";
$regs[':value'] = "({$regs[':string']})";
$regs[':operator'] = "(&|\|)";
$regs[':pair'] = "(({$regs[':key']}\s*:)?\s*{$regs[':value']})";
if (preg_match("/^{$regs[':value']}/", $query, $matches))
{
    print_r($matches);
}
When executing the above, PHP throws an error near the IF condition
Warning: preg_match() [function.preg-match]: Unknown modifier '\' in /home/xxxx/test.xxxx.com/experiments/regex/index.php on line 23
I have tried preg_match with :string and :oneChar, but I still get the same error.
Therefore I feel there is something wrong in the :oneChar regex. Kindly help me resolve this issue.

I see at least one error in the second regular expression ($regs[':oneChar']): it contains a forward slash, which conflicts with the forward slashes used as delimiters in the preg_match() call. Use preg_match("#^{$regs[':value']}#", $query, $matches) instead.
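Note that preg_quote() is not needed on the input string: it escapes regex metacharacters in literal text that is embedded into a pattern, and $query here is the subject, not part of the pattern.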
Beyond that, I would run each of your regular expressions one at a time to see which one is throwing the error.
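The delimiter clash can be reproduced in isolation. The following is only a sketch: it uses just the escape-sequence part of :oneChar, and the names $escape and $subject as well as the sample subject are made up for illustration. It compares / with # as the delimiter:

```php
<?php
// Escape-sequence part of the :oneChar pattern; note the literal "/" in the class.
// $escape and $subject are illustrative names, not part of the original code.
$escape = '\\\\(?:["/\\\\bfnrt]|u[0-9A-Fa-f]{4})';
$subject = '\\n'; // two characters: a backslash followed by "n"

// With "/" as the delimiter the pattern is cut off at the "/" inside the class,
// so PHP treats the rest as modifiers, warns and returns false.
// The @ only hides the warning for this demonstration.
$withSlash = @preg_match("/^{$escape}/", $subject);

// With "#" as the delimiter the same fragment compiles and matches.
$withHash = preg_match("#^{$escape}#", $subject);

var_dump($withSlash, $withHash);
```

Any delimiter works as long as it does not occur unescaped inside the assembled pattern, which is worth checking whenever a pattern is built from several fragments.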


How to replace below preg_replace with preg_replace_callback?

I am having difficulty converting below preg_replace() function call
preg_replace("/\{(.*?)\}/e", '$\1', $data)
to using preg_replace_callback() (because of the removed e modifier in PHP 7.0).
I have tried this but I have no idea how to fully handle '$\1':
preg_replace_callback('/\{(.*?)\}/', function ($matches) {
    return $matches[0];
}, $data);
Any help would be highly appreciated.
I'd like to suggest the following code as an answer to the concrete question:
$vars = get_defined_vars();
$result = preg_replace_callback('/{(.*?)}/', function ($matches) use ($vars) {
    return $vars[$matches[1]];
}, $data);
unset($vars);
The remainder of this answer provides more information and references, mainly for two purposes:
Show how this can be solved with divide and conquer, which also yields a step-by-step guide on how to port such code.
Add more context, since depending on how and where the code is ported there can be differences, including in error handling and PHP version compatibility requirements.
This should make the answer applicable to similar questions about migrating preg_replace() calls with the e modifier that use backreferences as variable variables.
The e (PREG_REPLACE_EVAL) Modifier
This feature was DEPRECATED in PHP 5.5.0, and REMOVED as of PHP 7.0.0.
It was only used by preg_replace() and was ignored by other PCRE functions.
From a previous PHP manual description revision:
If this deprecated modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single quotes, double quotes, backslashes (\) and NULL chars will be escaped by backslashes in substituted backreferences.
Rationale and context why it was deprecated/removed can be found in RFC: Remove preg_replace /e modifier, mainly three issue classes:
Security issues
Overescaping of quotes
Use as obfuscation in exploit scripts
The PHP RFC Wiki page has more details, and the information is a good addition to the answer as a port at least crosses 1. and 2. for the removed PHP code evaluation.
The '$\1' Replacement
As per the e modifier's description, '$\1' is evaluated after the backreference \1 (the first matching group) has been substituted.
In the question's example, that is the contents of the curly braces {...}:
'/\{(.*?)\}/'
    ~~~~~
    1 : first matching group
For example when the subject string is "Hello {name}", the contents of the first matching group is "name". Resolving it leads to the following PHP code that then is evaluated:
$name
That is a variable named "name". The evaluation is done within the scope where preg_replace() is called.
So much for the replacement pattern.
How to make compatible with PHP 7.0.0 (and earlier/later)?
A common way to start changing away from the e modifier is to make use of preg_replace_callback() instead of preg_replace(), which is done by replacing it and using an anonymous function (or any other callback method, however anonymous functions are normally the preferable way in most cases).
This is also (thankfully) outlined on the reference question. In the following I'll first leave backslash escaping of the substituted backreferences out to simplify the solution (and address it later).
An example of what has been done so far (with only a slight correction on the $matches index - it needs to be 1 not 0):
preg_replace_callback('/\{(.*?)\}/', function ($matches) {
    return $matches[1];
}, $data);
The \1 backreference to the first matching group is expressed as $matches[1] here. It contains the contents of the curly braces {...}, e.g. "name" from the previous example.
(compare: Changing preg_replace to preg_replace_callback)
For this specific $\1 replacement it is still incomplete, as it only replaces with the name of the variable and not (yet) with its contents.
What is still missing is connecting the name with the original scope, which requires a little more work.
Obtain Variables in preg_replace() Scope
To obtain all variables defined in the same scope as the preg_replace_callback() (previously preg_replace()) call, the get_defined_vars() function is an option:
This function returns a multidimensional array containing a list of all defined variables, be them environment, server or user-defined variables, within the scope that get_defined_vars() is called.
Using that array within the anonymous callback function then makes it possible to obtain the value of a variable by using its name as the array key:
$vars = get_defined_vars(); # <1>
preg_replace_callback('/\{(.*?)\}/', function ($matches) use ($vars) { # <2>
    $name = $matches[1];
    return $vars[$name]; # <3>
}, $data);
<1> Obtain variables from the preg_replace() scope.
<2> Make the variables available to the anonymous function (the use language construct).
<3> Access the variable's value by name and return it.
This was the part missing in the question: turning the backreference, used as a variable name, into the actual value.
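To see the lookup work end to end, here is a minimal, self-contained sketch. The function greet() and its local variables $name and $day are hypothetical and exist only to give get_defined_vars() a scope to capture:

```php
<?php
// Hypothetical helper: the function name and its local variables are
// illustrative only; the technique is the get_defined_vars() lookup.
function greet($data)
{
    $name = 'World';
    $day = 'Monday';

    // Snapshot of the local scope: keys 'data', 'name' and 'day'.
    $vars = get_defined_vars();

    return preg_replace_callback('/\{(.*?)\}/', function ($matches) use ($vars) {
        // $matches[1] is the text between the braces, used as the variable name.
        return $vars[$matches[1]];
    }, $data);
}

echo greet('Hello {name}, today is {day}.'), "\n"; // Hello World, today is Monday.
```

The snapshot is taken once, before the callback runs, so variables assigned after the get_defined_vars() call are not visible to the replacement.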
As so often, there are similar ways to achieve the same, some of them more depending on context. Truly get_defined_vars() is a pretty generic way to create a "variable table" and map names to their value. But there can be circumstances for which an array is already available and there might be no need to call that function.
Alternative to get_defined_vars(): Use of $GLOBALS array
This approach has been chosen by Wiktor Stribiżew in his answer:
If the scope is the global scope (likely not, but if it is), then the $GLOBALS superglobal can be used instead:
$result = preg_replace_callback('/\{(.*?)\}/', function ($matches) {
    $name = $matches[1];
    return $GLOBALS[$name];
}, $data);
There is no need to call get_defined_vars(), nor to unset the $vars array after the call (or otherwise care about it). But this binds the code to global variable state (which may or may not be an issue for the application).
Alternative to get_defined_vars(): Re-Use of another array (if available)
If the variables were previously imported from an array into the scope where preg_replace() with the e modifier was running, then the import is redundant and the array itself can be passed through the callback function's use clause. An example:
function replace_variables(string $data, array $vars) {
    # previously here: extract($vars);
    $result = preg_replace_callback('/\{(.*?)\}/', function ($matches) use ($vars) {
        $name = $matches[1];
        return $vars[$name];
    }, $data);
    # ...
}
As extract() comes with side effects you normally want to avoid, this kills two birds with one stone: the variables array is already available, so get_defined_vars() does not need to be called. Additionally, an unsafe extract() operation can be dropped, as it is no longer necessary to create variables in the scope of the earlier preg_replace().
This should leave enough food for thought to connect the name in the backreference to the value. The PHP manual has more about variable scope in case there is a more specific context. Normally get_defined_vars() should address most issues if an array is not yet available.
Notes for the '/\{(.*?)\}/' Regular Expression Pattern
This pattern comes with some caveats, therefore I'm leaving some notes for additional information and to open up on error handling and changes of it due to porting, which will address more issues.
The backslashes "\" are redundant:
Just a minor thing to get it out of the way:
ok.....: '/\{(.*?)\}/'
correct: '/{(.*?)}/'
This change can always be made: the backslashes are redundant because the braces here do not form a quantifier.
This improves readability of the pattern.
Change in Regular Expression Pattern PHP Error Behaviour
The second thing worth noting about the search pattern is a potential incompatibility:
The pattern allows a zero-length match, that is, the empty curly braces group {} does match, leading to a zero-length (variable) name. It could be used to present a default value (e.g. null), but you may prefer that it does not match at all, or you may want to add error handling.
w/ empty.: '/{(.*?)}/'
w/ length: '/{(.+?)}/'
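One caveat with the .+? variant: because it must consume at least one character, an empty {} followed later by another placeholder is not skipped but becomes the start of a longer match. The sketch below shows this and, as an alternative that is not part of the original answer, a pattern that excludes braces from the name. It assumes that names never contain braces:

```php
<?php
$data = '{} and {x}';

// '.+?' must consume at least one character, so the match that starts at the
// first "{" swallows the "}" and runs on to the next closing brace.
preg_match('/{(.+?)}/', $data, $lazy);

// Assumption: names never contain braces. Excluding them keeps each match
// inside a single {...} pair and skips the empty {} entirely.
preg_match('/{([^{}]+)}/', $data, $strict);

var_dump($lazy[1]);   // string(8) "} and {x"
var_dump($strict[1]); // string(1) "x"
```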
Which brings up a related point: Undefined variable/index warnings.
To prevent undefined index warnings these could resolve to null silently (or you may want to add error handling). This has been done in the upfront code porting suggestion at the very beginning of the answer.
Note, though, that these errors were harsher with the previous preg_replace() call with the e modifier, as the empty name resulted in a parse error when evaluated, followed by a fatal error. Example:
PHP Parse error: syntax error, unexpected ';', expecting variable (T_VARIABLE) or '$' in ... : regexp code on line 1
PHP Fatal error: preg_replace(): Failed evaluating code:
$
To define such errors out of existence as of a PHP 7.0.0 (and above/below) compatible port:
$vars = get_defined_vars();
$result = preg_replace_callback('/{(.*?)}/', function ($matches) use ($vars) {
    $name = $matches[1];
    return isset($vars[$name]) ? $vars[$name] : null;
}, $data);
unset($vars);
Alternatively, it is possible to mimic the old error behaviour (a bit) by throwing (e.g. on an empty name), as an uncaught exception triggers a fatal error:
$vars = get_defined_vars();
$result = preg_replace_callback('/{(.*?)}/', function ($matches) use ($vars) {
    $name = $matches[1];
    if ('' === $name) {
        throw new \RuntimeException('preg_replace_callback(): callback: Expected variable name, got zero-length string.');
    }
    return isset($vars[$name]) ? $vars[$name] : null;
}, $data);
unset($vars);
(if backwards compatibility below PHP 7.0.0 is not an issue, throwing an \Error is a more matching alternative for PHP 7.0.0 and above. Alternatively use trigger_error() instead to include versions below PHP 7.0.0 as well)
However, I'd suggest looking more into how the overall process can be made more error-safe. Even though this depends a lot on the context of the original code and requires a closer look, it lets you benefit from the changes. The following discussion and example show more.
Changes in Replacement Pattern (previous Backslash Escapes for Backreferences)
Removing the e (PREG_REPLACE_EVAL) modifier not only requires a callback function but also comes with another change: backslash escapes were previously added to substituted backreferences, but this no longer happens with the callback function.
This has been kept out so far. To complete the answer, it should get some attention. First as a reminder, from the (now removed) e modifier documentation what this is about:
Single quotes, double quotes, backslashes (\) and NULL chars will be escaped by backslashes in substituted backreferences.
This can lead to code that contains one or more calls to stripslashes() within the replacement pattern. This is not the case for this question so the consequences are that backslash escapes aren't added any longer.
As mario writes in an answer to the reference question:
[...] stripslashes() often becomes redundant in literal expressions.
In this question, it is a little different: As stripslashes() is not within the replacement pattern, there is nothing to be redundant / remove in "$\1".
To demonstrate the changes with a double and single quote within a "variable name" in the absence of the escaping for preg_replace_callback() compared to using the e modifier:
Data           e Modifier         Callback
{abc}          $abc               $abc
{a"bc}         $a\"bc (E)         $a"bc (I)
${${'abc'}}    $${\'abc\' (E)     $${'abc' (I)
(E): PHP Parse error
(I): Invalid variable name (informative only)
This once more highlights that the original replacement pattern has issues with the name stored as backreference to the first matching group: as discussed above for a zero-length variable name, it is lax and allows invalid names (which could previously have led to PHP parse errors when the replacement was evaluated).
The backslash escaping added to that. As the regular expression pattern does a lazy match (.*? - the question mark after the asterisk), it was at least not completely free-form.
The port therefore has less such issues but only on a finer difference.
Therefore, porting itself does not address this issue much. Actually what was a PHP fatal error earlier now turns into an undefined index PHP warning with the consequence that the script continues to run where it stopped earlier.
This could be seen as an argument for (or against) failing early with the port - it depends.
It could be done by checking for invalid variable names (assuming those would have caused a fatal parse error during evaluation - not an undefined variable warning).
A PCRE regular expression pattern for variable names in PHP is ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$. One idea that came to my mind was to use it to classify whether a name is a valid PHP variable name or not.
Additionally, the next example is an opportunity to show how the backslash escapes and the error/warning behaviour can be preserved:
$vars = get_defined_vars();
$result = preg_replace_callback('/{(.*?)}/', function ($matches) use ($vars) {
    $name = addcslashes($matches[1], "'\"\\\0"); # <1>
    if (!preg_match('(^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$)D', $name)) { # <2>
        trigger_error("Not a variable name: $name", E_USER_ERROR);
    }
    if (!array_key_exists($name, $vars)) { # <3>
        trigger_error("Undefined variable: $name");
    }
    return isset($vars[$name]) ? $vars[$name] : null; # <4>
}, $data);
unset($vars);
<1> Backslash escape single quotes, double quotes, backslashes (\) and NULL chars for backreference \1 (as the e modifier did so).
<2> Trigger fatal error on invalid variable names as those with the e modifier would result in a parse error followed by a fatal error evaluating code with syntax error(s).
<3> Trigger warning on undefined variable.
<4> Undefined variables (now not isset() indexes) result in null values.
This more verbose example mimics even more of the original behaviour and could therefore be seen as a more complete port. However, it forgoes many of the benefits for which the e modifier was deprecated and removed in the first place. Therefore, do not apply it blindly; it is an additional example to highlight the differences between the e modifier's eval and the callback version.
This is also the reason I've kept this out of the foremost answer.
PHP Version Compatibility
The port as outlined above is done with an anonymous function and therefore it is compatible with PHP 5.3 or later.
If backwards compatibility is not necessary or as an outlook for a future migration, some comments on more recent PHP versions:
Since PHP 7.4 arrow functions can be used. They have the benefit that the scope is automatically inherited ("closed"), so the use clause becomes redundant. However, variable variables cannot be used with arrow functions, which makes the array as "variable table" still necessary - like before. It can condense the code though, especially if error conditions (see discussion above) from the pattern were already removed (not the case in the following example code, it still uses the original pattern):
$vars = get_defined_vars();
$result = preg_replace_callback('/{(.*?)}/', fn($matches) => $vars[$matches[1]] ?? null, $data);
unset($vars);
Since PHP 8.0 - as throw new \Error is an expression - throwing could be another option. For my taste it is not of much benefit, as control is not fine-grained and readability is degraded. Your mileage may vary though; it is an option since PHP 8.0:
$vars = get_defined_vars();
$result = preg_replace_callback(
    '/{(.*?)}/',
    fn($matches) =>
        $vars[$matches[1]]
        ?? throw new \Error(sprintf('Expected existing variable name, got "%s" which is undefined', $matches[1])),
    $data
);
unset($vars);
You can access global variables using $GLOBALS Superglobal array:
preg_replace_callback('/\{(.*?)\}/', function ($matches) {
    return $GLOBALS[$matches[1]];
}, $data);
See the PHP demo:
$data = 'Some {abc} here';
$abc = "Word";
echo preg_replace_callback('/\{(.*?)\}/', function ($matches) {
    return $GLOBALS[$matches[1]];
}, $data);
Output:
Some Word here

PHP extend regex to accept round brackets

I have an existing regex that checks whether a string is already wrapped in either quotes or square brackets; if it is not,
I wrap that string in quotes.
My existing regex is as follows:
if (!preg_match('/^["\[].*["\]]$/', $filter)) {
    $filter = '%22' . $filter . '%22';
}
Now I want to extend this regex so that it also accepts strings already wrapped in parentheses.
For parentheses, my $filter value would be something like (123 456)
Can anyone help me extend this regex?
Regular expressions are good for complex string matching, but for a simple check like this there is little reason to use one. A regex match takes roughly O(n) time, where n is the length of the string, and in bad cases it can take up to O(2^m), where m is the length of the regex. Checking just the first and last characters of the string with a plain if takes O(1). If you still want a regex, here is one:
/^(?(?=")(["].*["])|(?=\[)([\[].*[\]]))|(?=\()(([\(].*[\)]))$/
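A simpler way to express the same idea is one alternative per delimiter pair inside a single anchored group. This is a sketch of one possible reading of the requirement, not a drop-in from the question. The helper name isWrappedRegex() is hypothetical, and unlike the asker's original pattern it rejects mixed pairs such as "abc]:

```php
<?php
// Hypothetical helper name; the pattern is one possible reading of the requirement.
function isWrappedRegex($filter)
{
    // One alternative per delimiter pair; ^ and $ apply to all of them
    // because the alternation sits inside a single non-capturing group.
    return preg_match('/^(?:".*"|\[.*\]|\(.*\))$/', $filter) === 1;
}

var_dump(isWrappedRegex('(123 456)')); // bool(true)
var_dump(isWrappedRegex('"abc"'));     // bool(true)
var_dump(isWrappedRegex('abc'));       // bool(false)
var_dump(isWrappedRegex('"abc]'));     // bool(false)
```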
The regular expressions are a powerful tool but they are not a Swiss army knife. There are problems that simply cannot be resolved using regex and there are problems that can be resolved using regex but a simpler approach produces code that is easier to read and understand. This is such a problem.
Let's reformulate the problem. If the first character of the string is " and also its last character is " then the string is wrapped in quotes and it does not need other processing. The same when the first character is [ and the last one is ]. Or ( and ).
The first character of a string stored in the variable $filter is $filter[0]. The last one is $filter[-1]. Concatenate them into a new string and search for it in a list of combinations of quotes and parentheses:
if (! in_array($filter[0].$filter[-1], ['""', '[]', '()'])) {
    // the string is not enclosed in quotes, square brackets or parentheses
    // do something with it (enclose it, etc)
}
If you are using PHP 5 (any version) or PHP 7.0 then you are out of luck (and out of PHP updates, btw) and you cannot use $filter[-1] (because this functionality has been introduced in PHP 7.1).
The PHP function substr() comes to the rescue.
substr($filter, -1) does the same thing as $filter[-1] (it returns the last character of $filter) and works in all PHP versions.
There are two corner cases to consider:
When $filter is '"' (a string of exactly one character that is a double quote), the code above will report it as enclosed in quotes when, in fact, it is not.
When $filter is '' (the empty string), the code produces two warnings (but does not report it as being enclosed in quotes).
Both cases can be easily solved by adding a check of the string's length to avoid running the other test if the string is too short:
if (strlen($filter) < 2 || ! in_array($filter[0].$filter[-1], ['""', '[]', '()'])) {
    // the string is not enclosed in quotes, square brackets or parentheses
    // do something with it (enclose it, etc)
}
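Put together, the length check and the substr() fallback could be wrapped in a small function. This is a sketch only; the name isEnclosed() is hypothetical, and the %22 wrapping is taken from the question:

```php
<?php
// Hypothetical helper; substr() is used instead of $filter[-1] so the check
// also runs on PHP versions older than 7.1.
function isEnclosed($filter)
{
    if (strlen($filter) < 2) {
        return false; // too short to have both an opening and a closing character
    }
    return in_array($filter[0] . substr($filter, -1), array('""', '[]', '()'), true);
}

$filter = '123 456';
if (!isEnclosed($filter)) {
    $filter = '%22' . $filter . '%22';
}
echo $filter, "\n"; // %22123 456%22
```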

my regexp does not work for a simple word match

I want to see if the current request is on the localhost or not. For doing this, I use the following regular expression:
return ( preg_match("/(^localhost).*/", $url) == true ||
         preg_match("/^({http|ftp|https}://localhost).*/", $url) == true )
    ? true : false;
And here is the var_dump() of $url:
string 'http://localhost/aone/public/' (length=29)
Which keeps returning false though. What is the problem of this regular expression?
You are currently using the forward slash (/) as the delimiter, but you aren't escaping it inside your pattern string. This will result in an error and will cause your preg_match() statement to not work (if you don't have error reporting enabled).
Also, you are using alternation incorrectly. If you want to match either foo or bar, you'd write (foo|bar), and not {foo|bar}.
The updated preg_match() should look like:
preg_match("/^(http|ftp|https):\/\/localhost.*/", $url)
Or with a different delimiter (so you don't have to escape all the / characters):
preg_match("#^(http|ftp|https)://localhost.*#", $url)
Curly braces have a special meaning in a regex, they are used to quantify the preceding character(s).
So:
/^({http|ftp|https}://localhost).*/
Should probably be something like:
#^((http|ftp|https)://localhost).*#
Edit: changed the delimiters so that the forward slash does not need to be escaped
This
{http|ftp|https}
is wrong.
I suppose you mean
(http|ftp|https)
Also, if you only want to group and not capture, add ?: at the start of the group:
(?:http|ftp|https)
I would change your current code to:
return preg_match("~^(?:(?:https?|ftp)://)?localhost~", $url);
You were using { and } for grouping, but those are used for quantifying and otherwise mean literal { and } characters.
A couple of things to add:
you can use https? instead of (http|https);
you can use another delimiter for the regex when your pattern contains the delimiter character. This saves you from excessive escaping;
you can combine the two regex, since one part is optional (the (?:https?|ftp):// part) and doing so would make the later comparator unnecessary;
the .* at the end is not required.
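As a quick check, the combined pattern can be wrapped in a function and run against a few inputs. The wrapper name isLocalhost() and the extra sample URLs are assumptions made for this sketch:

```php
<?php
// Hypothetical wrapper around the combined pattern from the answer above.
function isLocalhost($url)
{
    // Optional scheme (http, https or ftp) followed by "localhost" at the start.
    return preg_match('~^(?:(?:https?|ftp)://)?localhost~', $url) === 1;
}

var_dump(isLocalhost('http://localhost/aone/public/')); // bool(true)
var_dump(isLocalhost('localhost'));                     // bool(true)
var_dump(isLocalhost('http://example.com/localhost'));  // bool(false)
```

Note that the pattern only checks a prefix, so a host such as localhost.example.com also passes. If that matters, one option is to append a negative lookahead such as (?![\w.-]) after localhost.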

preg_replace PHP not working?

Why doesn't preg_replace return anything in this scenario? I've been trying to figure it out all night.
Here is the text contained within $postContent:
Test this. Here is a quote: [Quote]1[/Quote] Quote is now over.
Here is my code:
echo "Test I'm Here!!!";
$startQuotePos = strpos($postContent,'[Quote]')+7;
$endQuotePos = strpos($postContent,'[/Quote]');
$postStrLength = strlen($postContent);
$quotePostID = substr($postContent,$startQuotePos,($endQuotePos-$postStrLength));
$quotePattern = '[Quote]'.$quotePostID.'[/Quote]';
$newPCAQ = preg_replace($quotePattern,$quotePostID,$postContent);
echo "<br />$startQuotePos<br />$endQuotePos<br />$quotePostID<br />Qpattern:$quotePattern<br />PCAQ: $newPCAQ<br />";
This is my results:
Test I'm Here!!!
35
36
1
Qpattern:[Quote]1[/Quote]
PCAQ:
For preg_replace(), "[Quote]" is a character class that matches a single character that is one of the following: Q, u, o, t, or e.
If you want that preg_replace() finds the literal "[Quote]", you need to escape it as "\[Quote\]". preg_quote() is the function you should use: preg_quote("[Quote]").
Your code is also wrong because a regular expression is expected to start with a delimiter. In the preg_replace() call I am showing at the end of my answer, that is #, but you could use another character, as long as it doesn't appear in the regular expression, and it is used also at the end of the regular expression. (In my case, # is followed by a pattern modifier, and pattern modifiers are the only characters allowed after the pattern delimiter.)
If you are going to use preg_replace(), it doesn't make sense that you first find where "[Quote]" is. I would rather use the following code:
$newPCAQ = preg_replace('#\[Quote\](.+?)\[/Quote\]#i', '\1', $postContent);
I will explain the regular expression I am using:
The final '#i' tells preg_replace() to ignore the difference between lowercase and uppercase characters; the string could contain "[QuOte]234[/QuOTE]", and that substring would match the regular expression all the same.
I use a question mark in "(.+?)" to keep ".+" from being too greedy and matching too many characters. Without it, the regular expression could include in a single match a substring like "[Quote]234[/Quote] Other text [Quote]475[/Quote]", while this should be matched as two substrings: "[Quote]234[/Quote]" and "[Quote]475[/Quote]".
The '\1' string I am using as the replacement tells preg_replace() to use the string matched by the sub-group "(.+?)" as the replacement. In other words, the call to preg_replace() removes the "[Quote]" and "[/Quote]" surrounding other text. (It doesn't replace a "[/Quote]" that isn't preceded by a matching "[Quote]", such as in "[/Quote] Other text [Quote]".)
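Run against the text from the question, and against a second string with two quotes in mixed case (the second string is made up for illustration), the call behaves as described:

```php
<?php
$postContent = 'Test this. Here is a quote: [Quote]1[/Quote] Quote is now over.';

// Replace every [Quote]...[/Quote] pair with the text between the tags.
$single = preg_replace('#\[Quote\](.+?)\[/Quote\]#i', '\1', $postContent);

// The lazy (.+?) keeps two quotes in one string as two separate matches,
// and the i modifier accepts any mix of upper and lower case in the tags.
$double = preg_replace(
    '#\[Quote\](.+?)\[/Quote\]#i',
    '\1',
    '[Quote]234[/Quote] Other text [QuOte]475[/QuOTE]'
);

echo $single, "\n"; // Test this. Here is a quote: 1 Quote is now over.
echo $double, "\n"; // 234 Other text 475
```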
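Your regex must start and end with a delimiter such as '/'. The square brackets and the slash inside the pattern then also need escaping, which preg_quote() can do:
$quotePattern = '/' . preg_quote('[Quote]'.$quotePostID.'[/Quote]', '/') . '/';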
The reason you don't see anything for the return value of preg_replace is because it has returned NULL (see the manual link for details). This is what preg_replace returns when an error occurs, which is what happened in your situation. The string value of NULL is a zero-length string. You can see this by using var_dump instead, which will tell you that preg_replace returned NULL.
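The NULL return value is easy to confirm. In this minimal sketch, the @ operator is there only to hide the warning for the demonstration:

```php
<?php
$postContent = 'Test this. Here is a quote: [Quote]1[/Quote] Quote is now over.';

// The pattern from the question is not a validly delimited regex, so it
// fails to compile; @ hides the resulting warning for this demonstration.
$result = @preg_replace('[Quote]1[/Quote]', '1', $postContent);

var_dump($result);              // NULL
echo 'PCAQ: ' . $result, "\n";  // prints only "PCAQ: "
```

Checking the result with $result === null before using it is a cheap way to catch a broken pattern early instead of silently printing an empty string.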
Your regular expression is invalid and as such PHP will throw an E_WARNING level error of Warning: preg_replace(): Unknown modifier '['
There are a couple of reasons for this. First, you need to specify an opening and closing delimiter for your regular expression, as the preg_* functions use PCRE-style regular expressions. Second, you should also consider using preg_quote() on your pattern (sans the delimiter) to ensure it is escaped properly.
$postContent = "Test this. Here is a quote: [Quote]1[/Quote] Quote is now over.";
/* Specify a delimiter for your regular expression */
$delimiter = '#';
$startQuotePos = strpos($postContent,'[Quote]')+7;
$endQuotePos = strpos($postContent,'[/Quote]');
$postStrLength = strlen($postContent);
$quotePostID = substr($postContent,$startQuotePos,($endQuotePos-$postStrLength));
/* Make sure you use the delimiter in your pattern and escape it properly */
$quotePattern = $delimiter . preg_quote("[Quote]{$quotePostID}[/Quote]", $delimiter) . $delimiter;
$newPCAQ = preg_replace($quotePattern,$quotePostID,$postContent);
echo "<br />$startQuotePos<br />$endQuotePos<br />$quotePostID<br />Qpattern:$quotePattern<br />PCAQ: $newPCAQ<br />";
The output will be:
35
36
1
Qpattern:#\[Quote\]1\[/Quote\]#
PCAQ: Test this. Here is a quote: 1 Quote is now over.

What does the Unknown modifier 'c' mean in Regex? [duplicate]

I'm a newbie with regular expressions and I need some help :).
I have this:
$url = '<img src="http://mi.url.com/iconos/oks/milan.gif" alt="Milan">';
$pattern = '/<img src="http:\/\/mi.url.com/iconos/oks/(.*)" alt="(.*)"\>/i';
preg_match_all($pattern, $url, $matches);
print_r($matches);
And I get this error:
Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'c'
I want to select that 'milan.gif'.
How can I do that?
If you’re using / as delimiter, you need to escape every occurrence of that character inside the regular expression. You didn’t:
/<img src="http:\/\/mi.url.com/iconos/oks/(.*)" alt="(.*)"\>/i
                              ^
Here the marked / is treated as the end delimiter of the regular expression and everything after it is treated as modifiers. i is a valid modifier but c isn’t (see your error message).
So:
/<img src="http:\/\/mi\.url\.com\/iconos\/oks\/(.*)" alt="(.*)"\>/i
But as Pekka already noted in the comments, you shouldn’t try to use regular expressions on a non-regular language like HTML. Use an HTML parser instead. Take a look at Best methods to parse HTML.
The problem is that you haven't escaped the forward slashes in the url string (you have escaped the ones in the http:// part, but not the url path).
Therefore, the first unescaped slash it comes across (the one after .com) is taken as the end of the regex, so everything after that slash is treated as 'modifier' codes.
The next character ('i') is a valid modifier (as you know, since you're actually using it in your example), so that passes the test. However, the character after that ('c') is not, so PHP throws an error, which is what you're seeing.
To fix it, simply escape the slashes. So your example would look like this:
$pattern = '/<img src="http:\/\/mi.url.com\/iconos\/oks\/(.*)" alt="(.*)"\\>/i';
Hope that helps.
Note, as someone has already said, it's generally not advisable to use regex to match HTML, since HTML can be too complex to match accurately. It's generally preferable to use a DOM parser. In your example, the regex could fail if the alt attribute or the end of the image URL contains unexpected characters, or if the quoting in the HTML code isn't as you expect.
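For comparison, here is a minimal sketch of the DOM-based approach using PHP's bundled DOMDocument class. It assumes the DOM extension is enabled and that the markup is a small, well-formed fragment like the one in the question; the $files array is an illustrative name:

```php
<?php
$html = '<img src="http://mi.url.com/iconos/oks/milan.gif" alt="Milan">';

$doc = new DOMDocument();
// Assumption: the markup is a small, well-formed fragment like the one in the
// question; real-world pages may need libxml error handling.
$doc->loadHTML($html);

$files = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Keep only the last path segment of the URL, e.g. "milan.gif".
    $files[] = basename(parse_url($src, PHP_URL_PATH));
}

print_r($files);
```

This way attribute order, quoting style and extra attributes on the tag no longer matter, which are exactly the cases where the regex would break.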
