Weird error using preg_match and unicode - php

if (preg_match('(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)', '2010/02/14/this-is-something'))
{
// do stuff
}
The above code works. However this one doesn't.
if (preg_match('/\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+/u', '2010/02/14/this-is-something'))
{
// do stuff
}
Maybe someone could shed some light as to why the one below doesn't work. This is the error that is being produced:
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '\'

Try this (delimit the regex with #):
if (preg_match('#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#', '2010/02/14/this-is-something'))
{
// do stuff
}
Edited

The modifier u is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.
Also, as nvl observed, you are using / as the delimiter without escaping the / characters inside the regex. So you'll have to use:
/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u
To avoid this escaping you can use a different delimiter, for example:
#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#u
or
~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u
As a tip, if your delimiter is present in your regex, it's better to choose a different delimiter not found in the regex. This keeps the regex clean and short.

In the second regex you're using / as the regex delimiter, but you're also using it in the regex. The compiler is trying to interpret this part as a complete regex:
/\p{Nd}{4}/
It expects whatever follows the second / to be a modifier like 'u' or 'm', but it sees a backslash instead, so it emits that cryptic warning.
In the first regex you're using parentheses as regex delimiters; if you wanted to add the u modifier, you would put it after the closing paren:
'(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)u'
Although it's legal to use parentheses or other bracketing characters ({}, [], <>) as regex delimiters, it's not a good idea IMO. Most people prefer to use one of the less common punctuation characters. For example:
'~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u'
'%\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+%u'
Of course, you could also escape the slashes in the regex with backslashes, but why bother?
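The options discussed in these answers can be compared side by side. This is only a minimal sketch, assuming a PHP build whose PCRE has Unicode property support; the array keys and variable names are illustrative and not from the question.

```php
<?php
// Subject string taken from the question.
$subject = '2010/02/14/this-is-something';

// The same pattern body delimited three ways; each carries the u modifier.
$patterns = [
    'escaped slashes' => '/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u',
    'hash delimiter'  => '#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#u',
    'tilde delimiter' => '~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $results[$label] = preg_match($pattern, $subject);
    echo $label, ': ', var_export($results[$label], true), PHP_EOL;
}
```

All three variants should behave identically; the only difference is how much escaping the pattern body needs.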

Related

PHP Regex: match character set OR end of string

I am porting code from Node.js to PHP and keep getting errors with this regular expression:
^/[a-z0-9]{6}([^0-9a-z]|$)
PHP complains about a dollar sign:
Unknown modifier '$'
In JavaScript I was able to check if a string was ending with [^0-9a-z] or END OF STRING.
How do I do this in PHP with preg_match()?
My PHP code looks like this:
<?
$sExpression = '^/[a-z0-9]{6}([^0-9a-z]|$)';
if (preg_match('|' . $sExpression . '|', $sUrl)) {
// ...
}
?>
The JavaScript code was similar to this:
var sExpression = '^/[a-z0-9]{6}([^0-9a-z]|$)';
var oRegex = new RegExp(sExpression);
if (oRegex.test(sUrl)) {
// ...
}
To match a string that starts with a slash, followed by six alphanumerics and is then followed by either the end-of-string or something that's not alphanumeric:
preg_match('~^/[a-z0-9]{6}([^0-9a-z]|$)~i', $str);
The original JavaScript probably used new RegExp(<expression>), but PCRE requires a proper enclosure of the expression; those are the ~ characters I've put in the above code. Btw, I've made the expression case insensitive by using the i modifier; feel free to remove it if not desired.
You were using | as the enclosure; as such, you should have escaped the pipe character inside the expression, but by doing so you would have changed the meaning. It's generally best to choose delimiters that do not have a special meaning in an expression; it also helps to choose delimiters that don't occur as such in the expression, e.g., my choice of ~ avoids having to escape any character.
Expressions in PCRE can be generalised as:
<start-delimiter> stuff <end-delimiter> modifiers
Typically, the starting delimiter is the same as the ending delimiter, except for cases such as [expression]i or {expression}i whereby the opening brace is matched with the closing brace :)
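When porting delimiter-less expressions from JavaScript, the "delimiter, stuff, delimiter, modifiers" shape can be applied with a small wrapper. The sketch below is only illustrative: wrapPattern() is a hypothetical helper, not a PHP built-in, and it assumes the expression does not already contain an escaped \~.

```php
<?php
// Hypothetical helper: wrap a delimiter-less expression (for example one
// ported from JavaScript's new RegExp(...)) in "~" delimiters.
function wrapPattern(string $expression, string $modifiers = ''): string
{
    // Escape any literal "~" so it cannot end the pattern early.
    // Assumption: the expression does not already contain an escaped "\~".
    $escaped = str_replace('~', '\~', $expression);
    return '~' . $escaped . '~' . $modifiers;
}

// Expression from the question, unchanged.
$sExpression = '^/[a-z0-9]{6}([^0-9a-z]|$)';
$pattern = wrapPattern($sExpression);

$matchesAtEnd  = preg_match($pattern, '/abc123');      // string ends after six characters
$matchesBefore = preg_match($pattern, '/abc123/more'); // followed by a non-alphanumeric
$tooLong       = preg_match($pattern, '/abc1234');     // a seventh alphanumeric follows

var_dump($pattern, $matchesAtEnd, $matchesBefore, $tooLong);
```

Choosing ~ here follows the advice above: it has no special meaning inside a pattern and rarely appears in URLs, so escaping is almost never needed.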
Fix the regular expression first:
^/[a-z0-9]{6}([^0-9a-z]|$)
Try this.
As others pointed out, I'm an idiot and saw a / as a \ ... LOL.
OK, let's have another go at this.
I’d avoid using "|" as the delimiter and just do it this way.
if (preg_match('/^\/[a-z0-9]{6}([^0-9a-z]|$)/', $sUrl)) { ... }
Reducing this to just matching a particular character or end of string (PHP),
/\D777(\D|$)/
This will match:
xxx777xxx or xxx777 but not xxx7777 or xxx7777xxx
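The four cases listed above can be checked directly. This is a small sketch that assumes the reduced pattern is written with / delimiters; the variable names are illustrative.

```php
<?php
// Reduced pattern: a non-digit, then 777, then a non-digit or end of string.
$pattern = '/\D777(\D|$)/';

$subjects = ['xxx777xxx', 'xxx777', 'xxx7777', 'xxx7777xxx'];

$results = [];
foreach ($subjects as $subject) {
    // 1 means the subject matched, 0 means it did not.
    $results[$subject] = preg_match($pattern, $subject);
    echo $subject, ' => ', $results[$subject], PHP_EOL;
}
```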

Unexpected ] error in simple preg replace script [duplicate]

I have a script that downloads the latest newsletter from a group inbox on a spare touchscreen in our office. It works fine, but people keep accidentally unsubscribing us so I want to hide the unsubscribe link from the email.
preg_replace seems like it would work because I can set up a pattern that simply removes any link with the word "unsubscribe" in it. I validated the pattern below using the tool at http://regex101.com/ , and it even picks up variations like "manage subscription" as well. It is OK if the odd legitimate link with the word subscribe also gets removed - there won't be many and it's only for internal use.
However, when I execute I get an error.
Here's my code:
line 53: $pat='<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>';
line 54: $themail[bodycontent]= preg_replace($pat, ' ',$themail[bodycontent]);
and I get this error:
preg_replace() [function.preg-replace]: Unknown modifier ']' in /home/trev/public_html/bigscreen/screen-functions.php on line 54
It must be something really simple like an unescaped char but I have gone code blind and can't for the life of me see it.
How do I get this pattern:
<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>
to run in a simple php script?
Thanks
You haven't used any delimiters, so it's treating the < character as the delimiter.
Try something like this instead:
$pat='#<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>#';
You have no delimiter. Or rather you do, but it's not the one you meant. PCRE is interpreting your first < as the opening delimiter (you can use matching brackets as delimiters - in fact, I use parentheses to help remind myself that the entire match is index 0). Then it sees the first > as the ending delimiter. Anything after that should be a modifier, but of course ] is not a modifier.
Wrap your regex with (...) to give it a proper set of delimiters.
$themail[bodycontent] should be either $themail['bodycontent'] or $themail[$bodycontent].
Without quotes, PHP treats bodycontent as a constant name rather than a string key.
Patterns used in preg_match need to be enclosed by a pair of delimiter characters.
For example, a / or a ~ at the start and end of the string.
Anything outside of these delimiters at the end of the string is considered to be a regex "modifier".
Your example doesn't have delimiters, so PHP assumes that the leading < character is the opening delimiter. Because < is a bracket-style delimiter, it treats the next > character as the closing delimiter, and anything after that as modifiers. All that stuff is supposed to be inside the pattern and isn't valid as modifiers, which is why PHP is complaining.
Solution: add a pair of delimiter characters, here a ~ at the very start and the very end of the pattern:
$pat='~<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>~';
(It doesn't have to be ~; you can choose your own delimiter character to suit your needs. The best one to use is one that doesn't occur in your pattern, although you can escape it if it does.)
Start and end the pattern with a slash /:
$pat='/<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>/';
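Putting the delimiter advice together, a run of the replacement might look like the sketch below. The sample HTML in $body is invented for illustration and stands in for $themail['bodycontent']. The i modifier is swapped in for the (a|A) alternation, which also makes the "ubscri" part case-insensitive, so this is slightly broader than the original pattern.

```php
<?php
// Hypothetical sample standing in for $themail['bodycontent'].
$body = '<p>News</p><a href="http://example.com/u">Unsubscribe here</a>'
      . '<a href="http://example.com/story">Read the story</a>';

// "~" delimiters mean the "/" in the closing tag needs no escaping;
// the i modifier replaces the (a|A) alternation.
$pat = '~<\s*a\s+[^>]*>[^<>]*ubscri[^<>]*<\s*/a\s*>~i';

// Replace every matching link with a single space.
$cleaned = preg_replace($pat, ' ', $body);
echo $cleaned, PHP_EOL;
```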

regular expression to validate URL not working correctly in PHP

I am using a regular expression to validate URL. This expression works very well in JavaScript, But in PHP it gives me this error
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '('
Filename: home/auth.php
Line Number: 1596
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '('
Filename: home/auth.php
Line Number: 1601
This is my expression
$pattern ="/^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*(\.){1}((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
This is the php function
public function valid_url($data)
{
$data = trim($data);
if(!$data)
{
return TRUE;
}
$pattern ="/^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*(\.){1}((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
$valid = preg_match($pattern,$data);
if(!$valid)
{
$data = "http://".$data;
$valid = preg_match($pattern,$data);
}
if(!$valid)
{
$this->form_validation->set_message('valid_url', 'Please enter a valid URL.');
return FALSE;
}
else
{
return TRUE;
}
}
I am not very good at regular expressions so I could not figure out the issue, please help me correct the regular expression.
Wow, that is a big expression. I found several faults in it, and I shall hopefully explain them to you. Let's break it apart:
$pattern ="/
Here was your first mistake. As a forward slash is used in multiple sections of a URL, you should use a different delimiter. I would suggest a tilde ~, as this is not used in URLs very often. This would mean you don't have to keep escaping the forward slash everywhere with \/.
^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+
This character class contains the next error. Within a character class, a dot just means a dot. There is no need to escape it. Furthermore, by placing the dash at the end, it also does not need escaping as it cannot possibly mean a range. The character class can be shortened to become [a-zA-Z0-9.-]+.
(\:[a-zA-Z0-9\.&amp;%\$\-]+
Here we have the next error, the HTML entity &amp; within the character class. This will match an & or an a or an m or a p or a ;, not just an &. You don't need to convert the ampersand to its HTML code, as doing so means matching any of the characters that the code contains. And using the previous knowledge, you don't need to escape the dot, or the dash if it is at the end. You also don't need to escape the dollar sign, as in a character class it just means a dollar. Remember, within a character class, all meta characters are just standard characters except the caret ^, the backslash \, the closing square bracket ], the dash - (but this can be left if it's at the end), and whatever you choose as your delimiter, e.g. tilde ~. This character class can then become [a-zA-Z0-9.&%$-]+.
)*#)*(\.){1}
Part of this might be an error, or it might not. Basically, is there any need to capture the dot here? If there is no need to capture it, leave the brackets out. However, there is a definite error in the repetition: {1} is completely superfluous, because anything without a quantifier is already matched exactly once. It just makes the code messy. The above can be shortened to )*#)*\..
((25[0-5]|2[0-4][0-9]|[0-1]{1}
Again, the {1} is not needed. Remove it, ((25[0-5]|2[0-4][0-9]|[0-1].
[0-9]{2}|[1-9]{1}[0-9]{1}
And again twice, this becomes [0-9]{2}|[1-9][0-9].
You keep doing this, the next block of code you have can be shortened:
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])
Into
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])
It's not amazingly better, but every little helps. Next:
|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+
The two character classes can be optimized, |([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+.
\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2})
This is very restrictive, but I assume you have it like this for a reason so I'll leave it.
)(\:[0-9]+)*(/
And here is the cause of your error. You did not escape the forward slash. However, I am going to leave it as using a different delimiter would avoid this and also tidy up your pattern.
($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
That character class can be greatly shortened now knowing that we don't need to escape everything within them. It can become, ($|[a-zA-Z0-9.,?'\\+&%$#=~_-]+))*$/";.
Using everything we now know your pattern can be made much prettier and easier to handle.
It can become instead:
$pattern = "~^(http|https|ftp)://www\.([a-zA-Z0-9.-]+(:[a-zA-Z0-9.&%$-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])|([a-zA-Z0-9-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(/($|[a-zA-Z0-9.,?'\\+&%$#=\~_-]+))*$~";
Now that you have a smaller expression, finding faults and more customization should be a little easier.
Just a quick note
I keep noticing that you have used the following syntax at the beginning of some groupings, (\:. I have removed the backslash as it is not needed for a colon. However, were you trying to make it so the group was not captured? If so, the syntax for that is, (?:.
Edit: You can also shorten the pattern further by using shorthand character classes:
\d = [0-9]
\w = [a-zA-Z0-9_]
Adding i to the end of the last pattern delimiter turns case insensitivity on too. Which means, instead of writing [a-zA-Z] you can just write [a-z] instead.
Also, the http|https can just become https?
So your pattern could be shortened further too:
$pattern = "~^(https?|ftp)://www\.([a-z\d.-]+(:[a-z\d.&%$-]+)*#)*((25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9])\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|\d)|([a-z\d-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-z]{2}))(:\d+)*(/($|[\w.,?'\\+&%$#=\~-]+))*$~i";
I see one error:
[0-9]+)*(/($
to
[0-9]+)*(\/($
or to
[0-9]+)*(($
if the / is supposed to be an ender, which it's not supposed to be.
But seriously, is there no other way you can achieve this? This string is really hard to troubleshoot.
Why not use the standard PHP function filter_var?
http://lv.php.net/manual/ru/function.filter-var.php
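A minimal sketch of the filter_var() suggestion follows. looksLikeUrl() is a hypothetical helper name, not part of PHP. Note that FILTER_VALIDATE_URL checks a more general URL syntax than the regex in the question (it does not require www. or restrict the scheme to http, https or ftp), so it is not a drop-in equivalent.

```php
<?php
// Hypothetical helper: true when filter_var() accepts the value as a URL.
function looksLikeUrl(string $candidate): bool
{
    // filter_var() returns the filtered value on success and false on failure.
    return filter_var(trim($candidate), FILTER_VALIDATE_URL) !== false;
}

$good = looksLikeUrl('http://www.example.com/path?x=1');
$bad  = looksLikeUrl('not a url');

var_dump($good, $bad);
```

If the stricter rules of the original pattern matter (scheme list, www. prefix), they would still need a separate check on top of this.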

Regular expression error: no ending delimiter

I'm trying to execute this regular expression:
<?php
preg_match("/^([^\x00-\x1F]+?){0,1}/", 'test string');
?>
But keep getting an error:
Warning: preg_match() [function.preg-match]: No ending delimiter '/' found in /var/www/preg.php on line 6
I can't understand where it is coming from. I have an ending delimiter right there... I tried changing the delimiter to other symbols and it didn't help.
I would appreciate your help on this problem.
I guess PHP chokes on the NULL character that denotes the end of a string in C.
Try it with single quotes so that \x00 is interpreted by the PCRE engine and not by PHP:
'/^([^\x00-\x1F]+?){0,1}/'
It seems that this is an already known bug (see Problems with strings containing \x00).
Like Gumbo said, preg_match is not binary safe.
Use instead:
preg_match("/^([^\\x{00}-\\x{1F}]+?){0,1}/", 'test string');
This is the correct way to specify Unicode code points in PCRE.
I am not sure about PHP, but maybe the problem is that you need to escape your backslashes?
Try "/^([^\\x00-\\x1F]+?){0,1}/"
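The difference between the quoting styles can be made visible without running the failing pattern at all. The sketch below only inspects the pattern strings and then runs the single-quoted version; the variable names are illustrative.

```php
<?php
// In double quotes PHP itself expands \x00 into a real NUL byte...
$double = "/^([^\x00-\x1F]+?){0,1}/";
// ...while in single quotes the backslash sequence reaches PCRE untouched.
$single = '/^([^\x00-\x1F]+?){0,1}/';

// Check whether a raw NUL byte ended up inside each pattern string.
$hasNulDouble = strpos($double, "\0") !== false;
$hasNulSingle = strpos($single, "\0") !== false;

// The single-quoted pattern leaves the \x00 escape for PCRE to interpret.
$result = preg_match($single, 'test string');

var_dump($hasNulDouble, $hasNulSingle, $result);
```

Doubling the backslashes inside double quotes, as suggested above, produces the same string as the single-quoted form.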

Gruber's new and improved URL recognising regex

I've been trying to use Gruber's latest URL-matching regex in a PHP project.
To test it I threw together something very simple:
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:"'.,<>?«»“”‘’]))";
$array = preg_match_all($regex, $theblockofurltext);
print_r($array);
The first problem was that the " inside the character class would terminate the string, whichever quote I wrapped the regex with, so I just removed it. This is for personal use and I will never have " anywhere near a URL anyway. This left me with a new regex.
$regex = "(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))";
Raring to go I then ran my little script and it gave me the following error:
Warning: preg_split() [function.preg-split]: Unknown modifier '\' in D:\wwwroot\xxx\index.php on line 14
Unfortunately my regex class at school didn't go anywhere near the level this regex requires, and I have no idea where to begin fixing this for use with PHP. Any help would be greatly appreciated. No doubt I'm probably doing something stupid too, so please go easy on me :)
Jon
Add # before and after your RE.
$regex = "#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'.,<>?«»“”‘’]))#";
If you use PCRE, the regular expression must be enclosed in delimiters. Now, parentheses () can also be delimiters, which is why the engine thinks your expression is only (?i) and interprets the following \ as a modifier.
You could use ~ as delimiter:
$regex = "~(?i)\b...]))~";
Update:
I don't know whether PHP supports the partial modifying of an expression with (?i). So you might have to remove this and put the modifier after the delimiter instead (you apply it to the whole expression anyway):
$regex = "~\b...]))~i";
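Both ways of requesting case-insensitive matching can be tried on a much shorter pattern. This is only a sketch: the pattern and the sample text are invented for illustration and are not Gruber's regex.

```php
<?php
// Hypothetical short pattern, used only to compare the two ways of
// switching on case-insensitive matching.
$inline   = '~(?i)\bwww\d{0,3}[.][a-z]+~';  // option set inside the pattern
$trailing = '~\bwww\d{0,3}[.][a-z]+~i';     // modifier after the closing delimiter

$text = 'Visit WWW.Example for details';

preg_match($inline, $text, $a);
preg_match($trailing, $text, $b);

print_r($a);
print_r($b);
```

With the delimiters in place, both forms should find the same match here, so either placement can be used when the option applies to the whole expression.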
