PHP Regex: match character set OR end of string - php

I am porting code from Node.js to PHP and keep getting errors with this regular expression:
^/[a-z0-9]{6}([^0-9a-z]|$)
PHP complains about a dollar sign:
Unknown modifier '$'
In JavaScript I was able to check if a string was ending with [^0-9a-z] or END OF STRING.
How do I do this in PHP with preg_match()?
My PHP code looks like this:
<?
$sExpression = '^/[a-z0-9]{6}([^0-9a-z]|$)';
if (preg_match('|' . $sExpression . '|', $sUrl)) {
// ...
}
?>
The JavaScript code was similar to this:
var sExpression = '^/[a-z0-9]{6}([^0-9a-z]|$)';
var oRegex = new RegExp(sExpression);
if (oRegex.test(sUrl)) {
// ...
}

To match a string that starts with a slash, followed by six alphanumerics and is then followed by either the end-of-string or something that's not alphanumeric:
preg_match('~^/[a-z0-9]{6}([^0-9a-z]|$)~i', $str);
The original JavaScript probably used new RegExp(<expression>), but PCRE requires a proper enclosure of the expression; those are the ~ characters I've put in the above code. Btw, I've made the expression case insensitive by using the i modifier; feel free to remove it if not desired.
You were using | as the enclosure; as such, you should have escaped the pipe character inside the expression, but by doing so you would have changed the meaning. It's generally best to choose delimiters that do not have a special meaning in an expression; it also helps to choose delimiters that don't occur as such in the expression, e.g., my choice of ~ avoids having to escape any character.
Expressions in PCRE can be generalised as:
<start-delimiter> stuff <end-delimiter> modifiers
Typically, the starting delimiter is the same as the ending delimiter, except for cases such as [expression]i or {expression}i whereby the opening brace is matched with the closing brace :)

Fix the regular expression first:
^/[a-z0-9]{6}([^0-9a-z]|$)
Try this.
As others pointed out, I'm an idiot and saw a / as a \ ... LOL.
Ok, well go at this again,
I’d avoid using the "|" and just do it this way.
if (preg_match('/^\/[a-z0-9]{6}([^0-9a-z]|$)/', $sUrl)) { ... }

Reducing this to just matching a particular character or end of string (PHP),
\D777(\D|$)\
This will match:
xxx777xxx or xxx777 but not xxx7777 or xxx7777xxx

Related

my regexp does not work for a simple word match

I want to see if the current request is on the localhost or not. For doing this, I use the following regular expression:
return ( preg_match("/(^localhost).*/", $url) == true ||
preg_match("/^({http|ftp|https}://localhost).*/", $url) == true )
? true : false;
And here is the var_dump() of $url:
string 'http://localhost/aone/public/' (length=29)
Which keeps returning false though. What is the problem of this regular expression?
You are currently using the forward slash (/) as the delimiter, but you aren't escaping it inside your pattern string. This will result in an error and will cause your preg_match() statement to not work (if you don't have error reporting enabled).
Also, you are using alternation incorrectly. If you want to match either foo or bar, you'd write (foo|bar), and not {foo|bar}.
The updated preg_match() should look like:
preg_match("/^(http|ftp|https):\/\/localhost.*/", $url)
Or with a different delimiter (so you don't have to escape all the / characters):
preg_match("#^(http|ftp|https)://localhost.*#", $url)
Curly braces have a special meaning in a regex, they are used to quantify the preceding character(s).
So:
/^({http|ftp|https}://localhost).*/
Should probably be something like:
#^((http|ftp|https)://localhost).*#
Edit: changed the delimiters so that the forward slash does not need to be escaped
This
{http|ftp|https}
is wrong.
I suppose you mean
(http|ftp|https)
Also, if you want only group and don't capture, please add ?::
(?:http|ftp|https)
I would change your current code to:
return preg_match("~^(?:(?:https?|ftp)://)?localhost~", $url);
You were using { and } for grouping, when those are used for quantifying and otherwise mean literal { and `} characters.
A couple of things to add is that:
you can use https? instead of (http|https);
you can use other delimiters for the regex when your pattern has those symbols as delimiters. This avoids you excessive escaping;
you can combine the two regex, since one part is optional (the (?:https?|ftp):// part) and doing so would make the later comparator unnecessary;
the .* at the end is not required.

Unexpected ] error in simple preg replace script [duplicate]

This question already has answers here:
preg_match() Unknown modifier '[' help
(2 answers)
Closed 8 years ago.
I have a script that downloads the latest newsletter from a group inbox on a spare touchscreen in our office. It works fine, but people keep accidentally unsubscribing us so I want to hide the unsubscribe link from the email.
$preg_replace seems like it would work because I can set up a pattern that simply removes any link withthe word "unsubscribe" in. I validated the pattern below using the tool at http://regex101.com/ , and it even picks up variations like "manage subscription" as well. It is ok if the odd legitimate link with the word subscribe also get removed - there won't be many and it's only for internal use.
However, when I execute I get an error.
Here's my code:
line 53: $pat='<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>';
line 54: $themail[bodycontent]= preg_replace($pat, ' ',$themail[bodycontent]);
and I get this error:
preg_replace() [function.preg-replace]: Unknown modifier ']' in /home/trev/public_html/bigscreen/screen-functions.php on line 54
It must be something really simple like an unescaped char but I have gone code blind and can't for the life of me see it.
How do I get this pattern:
<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>
to run in a simple php script?
Thanks
You haven't used any delimiters so it's treating the < character as the delimiter
Try something like this instead
$pat='#<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>#';
You have no delimiter. Or rather you do, but it's not the one you meant. PCRE is interpreting your first < as the opening delimiter (you can use matching brackets as delimiters - in fact, I use parentheses to help remind myself that the entire match is index 0). Then it sees the first > as the ending delimiter. Anything after that should be a modifier, but of course ] is not a modifier.
Wrap your regex with (...) to give it a proper set of delimiters.
$themail[bodycontent] should be either $themail['bodycontent'] or $themail[$bodycontent].
It's trying to parse bodycontent] ... as the array index.
Patterns used in preg_match need to be enclosed by a pair of delimiter characters.
For example, a / or a ~ at the start and end of the string.
Anything outside of these delimiters at the end of the string is considered to be a regex "modifier".
Your example doesn't have delimiters, so PHP is wrongly assuming that the < character is the delimiter. It therefore sees the next < character as the closing delimiter, and therefore, anything after that as a modifier. Obviously all that stuff is supposed to be inside the pattern and isn't valid as modifiers, which is why PHP is complaining.
Solution: Add a pair of modifier characters:
$pat='~<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>~';
^ ^
add this ...and this
(it doesn't have to be ~, you can choose your own modifier character to suit your needs. Best one to use is one that doesn't occur in your string (although you can escape it if it does)
Starting and ending of pattern with slash /
$pat='/<\s*(a|A)\s+[^>]*>[^<>]*ubscri[^<>]*<\s*\/(a|A)\s*>/';

regular expression to validate URL not working correctly in PHP

I am using a regular expression to validate URL. This expression works very well in JavaScript, But in PHP it gives me this error
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '('
Filename: home/auth.php
Line Number: 1596
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '('
Filename: home/auth.php
Line Number: 1601
This is my expression
$pattern ="/^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*(\.){1}((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
This is the php function
public function valid_url($data)
{
$data = trim($data);
if(!$data)
{
return TRUE;
}
$pattern ="/^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*(\.){1}((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
$valid = preg_match($pattern,$data);
if(!$valid)
{
$data = "http://".$data;
$valid = preg_match($pattern,$data);
}
if(!$valid)
{
$this->form_validation->set_message('valid_url', 'Please enter a valid URL.');
return FALSE;
}
else
{
return TRUE;
}
}
I am not very good at regular expressions so I could not figure out the issue, please help me correct the regular expression.
Wow, that is a big expression. I found several faults in it, and I shall hopefully explain them to you. Let's break it apart:
$pattern ="/
Here was your first mistake. As a forward slash is used in multiple sections of a url, you should use a different delimiter. I would suggest a tilde ~, as this is not used in a url very often. This would mean you don't have to keep escaping the forward slash every where with \/.
^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+
This character class contains the next error. Within a character class, a dot just means a dot. There is no need to escape it. Furthermore, with placing the dash at the end, it also does not need escaping as it cannot possibly mean a range. The character class can be shortened to become [a-zA-Z0-9.-]+.
(\:[a-zA-Z0-9\.&%\$\-]+
Here we have the next error, & within the character class. This will match an & or an a or an m or a ;, not just an &. You don't need to convert it to the html code as doing so will mean to match any of the characters that the code contains. And using the previous knowledge, you don't need to escape the dot, or the dash if it is at the end. You also don't need to escape the dollar sign, as in a character class it just means a dollar. Remember, within a character class, all meta characters are just standard characters except the caret ^, the backslash \, the closing square bracket ], the dash - (but this can be left if it's at the end), and whatever you choose as your delimiter, e.g. tilde ~. This character class can then become, [a-zA-Z0-9.&%$-]+.
)*#)*(\.){1}
Part of this might be an error, it might not be. Basically, is there any need to capture the dot here? If there is not a need to capture it, leave the brackets alone. However, there is a definite error in the repetition. {1} is completely and utterly superfluous. Everything in there has to be repeated at least once. This is just making the code messy. The above can shortened into, )*#)*\..
((25[0-5]|2[0-4][0-9]|[0-1]{1}
Again, the {1} is not needed. Remove it, ((25[0-5]|2[0-4][0-9]|[0-1].
[0-9]{2}|[1-9]{1}[0-9]{1}
And again twice, this becomes [0-9]{2}|[1-9][0-9].
You keep doing this, the next block of code you have can be shortened:
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])
Into
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])
It's not amazingly better, but every little helps. Next:
|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+
The two character classes can be optimized, |([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+.
\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2})
This is very restrictive, but I assume you have it like this for a reason so I'll leave it.
)(\:[0-9]+)*(/
And here is the cause of your error. You did not escape the forward slash. However, I am going to leave it as using a different delimiter would avoid this and also tidy up your pattern.
($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
That character class can be greatly shortened now knowing that we don't need to escape everything within them. It can become, ($|[a-zA-Z0-9.,?'\\+&%$#=~_-]+))*$/";.
Using everything we now know your pattern can be made much prettier and easier to handle.
It can become instead:
$pattern = "~^(http|https|ftp)://www\.([a-zA-Z0-9.-]+(:[a-zA-Z0-9.&%$-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])|([a-zA-Z0-9-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(/($|[a-zA-Z0-9.,?'\\+&%$#=\~_-]+))*$~";
Now that you have a smaller expression, finding faults and more customization should be a little easier.
Just a quick note
I keep noticing that you have used the following syntax at the beginning of some groupings, (\:. I have removed the backslash as it is not needed for a colon. However, were you trying to make it so the group was not captured? If so, the syntax for that is, (?:.
Edit:: You can also optimize the pattern further by utilizing character classes
\d = [0-9]
\w = [a-zA-Z0-9_]
Adding i to the end of the last pattern delimiter turns case insensitivity on too. Which means, instead of writing [a-zA-Z] you can just write [a-z] instead.
Also, the http|https can just become https?
So you pattern could be shortened further too:
$pattern = "~^(https?|ftp)://www\.([a-z\d.-]+(:[a-z\d.&%$-]+)*#)*((25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9])\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|\d)|([a-z\d-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-z]{2}))(:\d+)*(/($|[\w.,?'\\+&%$#=\~-]+))*$~i";
I see one error:
[0-9]+)*(/($
to
[0-9]+)*(\/($
or to
[0-9]+)*(($
if the / is supposed to be an ender, which it's not supposed to be.
But seriously, is there no other way you can achieve this? This string is really hard to troubleshoot.
Why don't use standard php function filter_var?
http://lv.php.net/manual/ru/function.filter-var.php

regex with special characters?

i am looking for a regex that can contain special chracters like / \ . ' "
in short i would like a regex that can match the following:
may contain lowercase
may contain uppercase
may contain a number
may contain space
may contain / \ . ' "
i am making a php script to check if a certain string have the above or not, like a validation check.
The regular expression you are looking for is
^[a-z A-Z0-9\/\\.'"]+$
Remember if you are using PHP you need to use \ to escape the backslashes and the quotation mark you use to encapsulate the string.
In PHP using preg_match it should look like this:
preg_match("/^[a-z A-Z0-9\\/\\\\.'\"]+$/",$value);
This is a good place to find the regular expressions you might want to use.
http://regexpal.com/
You can always escape them by appending a \ in front of the special characters.
try this:
preg_match("/[A-Za-z0-9\/\\.'\"]/", ...)
NikoRoberts is 100% correct.
I would only add the following suggestion: When creating a PHP regex pattern string, always use: single-quotes. There are far fewer chars which need to be escaped (i.e. only the single quote and the backslash itself needs to be escaped (and the backslash only needs to be escaped if it appears at the end of the string)).
When dealing with backslash soup, it helps to print out the (interpreted) regex string. This shows you exactly what is being presented to the regex engine.
Also, a "number" might have an optional sign? Yes? Here is my solution (in the form of a tested script):
<?php // test.php 20110311_1400
$data_good = 'abcdefghijklmnopqrstuvwxyzABCDE'.
'FGHIJKLMNOPQRSTUVWXYZ0123456789+- /\\.\'"';
$data_bad = 'abcABC012~!###$%^&*()';
$re = '%^[a-zA-Z0-9+\- /\\\\.\'"]*$%';
echo($re ."\n");
if (preg_match($re, $data_good)) {
echo("CORRECT: Good data matches.\n");
} else {
echo("ERROR! Good data does NOT match.\n");
}
if (preg_match($re, $data_bad)) {
echo("ERROR! Bad data matches.\n");
} else {
echo("CORRECT: Bad data does NOT match.\n");
}
?>
The following regex will match a single character that fits the description you gave:
[a-zA-Z0-9\ \\\/\.\'\"]
If your point is to insure that ONLY characters in this range of characters are used in your string, then you can use the negation of this which would be:
[^a-zA-Z0-9\ \\\/\.\'\"]
In the second case, you could use your regex to find the bad stuff (that you don't want to be included), and if it didn't find anything then your string pattern must be kosher, because I'm assuming that if you find one character that is not in the proper range, then your string is not valid.
so to put it in PHP syntax:
$regex = "[^a-zA-Z0-9\ \\\/\.\'\"]"
if preg_match( $regex, ... ) {
// handle the bad stuff
}
Edit 1:
I've completely ignored the fact that backslashes are special in php double-quoted strings, so here is a correcting to the above code:
$regex = "[^a-zA-Z0-9\\ \\\\\\/\\.\\'\\\"]"
If that doesn't work it shouldn't take too much for someone to debug how many of the backslashes need to be escaped with a backslash, and what other characters need also to be escaped....

Weird error using preg_match and unicode

if (preg_match('(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)', '2010/02/14/this-is-something'))
{
// do stuff
}
The above code works. However this one doesn't.
if (preg_match('/\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+/u', '2010/02/14/this-is-something'))
{
// do stuff
}
Maybe someone could shed some light as to why the one below doesn't work. This is the error that is being produced:
A PHP Error was encountered
Severity: Warning
Message: preg_match()
[function.preg-match]: Unknown
modifier '\'
Try this: (delimit the regex with ())
if (preg_match('#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#', '2010/02/14/this-is-something'))
{
// do stuff
}
Edited
The modifier u is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.
Also as nvl observed, you are using / as the delimiter and you are not escaping the / present in the regex. So you'lll have to use:
/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u
To avoid this escaping you can use a different set of delimiters like:
#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#
or
#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#
As a tip, if your delimiter is present in your regex, its better to choose a different delimiter not found in the regex. This keeps the regex clean and short.
In the second regex you're using / as the regex delimiter, but you're also using it in the regex. The compiler is trying to interpret this part as a complete regex:
/\p{Nd}{4}/
It thinks the next character after the second / should be a modifier like 'u' or 'm', but it sees a backslash instead, so it throws that cryptic exception.
In the first regex you're using parentheses as regex delimiters; if you wanted to add the u modifier, you would put it after the closing paren:
'(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)u'
Although it's legal to use parentheses or other bracketing characters ({}, [], <>) as regex delimiters, it's not a good idea IMO. Most people prefer to use one of the less common punctuation characters. For example:
'~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u'
'%\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+%u'
Of course, you could also escape the slashes in the regex with backslashes, but why bother?

Categories