Double quotes in PHP regular expression? - php

I am pretty familiar with PHP, but I am brand new to regular expressions in PHP. I am trying to figure out how to only allow a-z, A-Z, 0-9, :, ' (single quote), " (double quote), +, -, ., , (comma), &, !, *, (, and ).
I have found several working examples of what I am looking for EXCEPT how to allow the single quote and the double quote.
An Example of what I am looking for is:
Hello, this is just an example of what I am looking for: "Hello World!".
I am trying to validate a textarea $_POST['suggestion'] using:
$errors = array();
if(!preg_match('insert regular expression',$_POST['suggestion'])){
$errors['suggestion2'] = "Invalid";
}
With everything I have tried, I always get:
An Example of what I am looking for is: Hello, this is just an example of what I am looking for: \"Hello World!\".
I don't understand why the \ are in front of the quotes?

You can use the following regex:
[a-zA-Z0-9:'"+.,&!*()-]
Note that the hyphen - is placed at the end so that it does not form a range (and can match a literal -). +, *, ., ( and ) do not have to be escaped inside a character class. Generally, ^, - and ] should be escaped, although each is literal in certain positions (- at the start or end, ^ anywhere but the start, ] right at the start). \ must be escaped in the character class, but you are not allowing it.
Also, if you want to match chunks of allowed symbols, add a + quantifier after the character class: [a-zA-Z0-9:'"+.,&!*()-]+.
See demo here and here.
Sample PHP code:
$re = "/[a-zA-Z0-9:'\"+.,&!*()-]/";
$str = "a-zA-Z0-9:'\"+.,&!*()-";
preg_match_all($re, $str, $matches);
EDIT:
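Since you updated the question: the backslashes in front of the quotes come from magic quotes, a feature of earlier PHP versions (deprecated in 5.3 and removed in 5.4) that automatically escaped quotes in request data. As one of the options to turn it off, you may go to the .htaccess file and set php_flag magic_quotes_gpc Off.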
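For the validation itself, the class needs anchors and a quantifier so that the whole input is checked and not just a single character. A minimal sketch, assuming a hypothetical helper name isAllowedSuggestion and assuming that spaces should also be allowed (the example sentence contains them, although the list in the question does not):

```php
<?php
// Hypothetical helper: true when $text consists only of the allowed characters.
function isAllowedSuggestion(string $text): bool
{
    // Single-quoted PHP string, so only the ' needs a backslash for PHP.
    // The space after 0-9 is an assumption; the hyphen stays last so it is literal.
    // \z anchors at the very end of the input.
    $pattern = '/^[a-zA-Z0-9 :\'"+.,&!*()-]+\z/';
    return preg_match($pattern, $text) === 1;
}

var_dump(isAllowedSuggestion('Hello, this is just an example: "Hello World!".')); // bool(true)
var_dump(isAllowedSuggestion('Not <allowed>'));                                   // bool(false)
```

In the question's code this would replace the preg_match call, for example if (!isAllowedSuggestion($_POST['suggestion'])) { ... }.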

Related

PHP/Regex: Get content between Stars but not if there's a leading backslash

I would like to get everything between two stars, except if they have a leading backslash.
So for example:
*hello* world
should return "hello", but
*hello \* world*
should return "hello * world"
I tried the following regex:
/(?<!\\)\*(.+?)(?<!\\)\*/s
which works perfectly on http://regex101.com/ but PHP returns:
Warning: preg_replace(): Compilation failed: missing ) at offset 21
What am I doing wrong?
--
EDIT 1:
Here's my PHP-Code for that:
var_dump(preg_replace('/(?<!\\)\*(.+?)(?<!\\)\*/s', '<strong>$1</strong>', '*hello world*'));
You are not escaping the backslashes correctly which results in escaping the ) character.
To match a \ in PHP you need 4 backslashes
/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s
It must be done like this because every backslash in a C-like string
must be escaped by a backslash. That would give us a regular
expression with 2 backslashes, as you might have assumed at first.
However, each backslash in a regular expression must be escaped by a
backslash, too. This is the reason that we end up with 4 backslashes.
Or use a character class. Note that this does not save any backslashes:
/(?<![\\\\])\*(.+?)(?<![\\\\])\*/s
Some articles claim that a backslash needs no escaping inside a character class, so that [\] would match a literal backslash. That does not hold for PCRE: the backslash is still the escape character inside a class, so [\] is read as an unterminated class that starts with an escaped ]. The regex has to be [\\], which again becomes four backslashes in a PHP string.
Edit Found this article which explains why this is necessary. Also, added explanations.
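A quick way to check the four-backslash form is to run it against both inputs from the question. This is only a sketch; the variable names are illustrative:

```php
<?php
// PHP source has four backslashes; the regex engine receives
// /(?<!\\)\*(.+?)(?<!\\)\*/s, where \\ means one literal backslash.
$pattern = '/(?<!\\\\)\*(.+?)(?<!\\\\)\*/s';

$plain   = preg_replace($pattern, '<strong>$1</strong>', '*hello* world');
$escaped = preg_replace($pattern, '<strong>$1</strong>', '*hello \* world*');

echo $plain, "\n";   // <strong>hello</strong> world
echo $escaped, "\n"; // <strong>hello \* world</strong>
```

Note that the backslash in front of the inner star is kept in the captured text; removing it would be a separate step, for example with str_replace.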
You can use this regex:
\*(.*?(?<!\\))\*
Working demo

Backslash in Regex- PHP

I am trying to learn regex in PHP and am stuck here. My question may seem silly, but please explain.
I went through a link:
Extra backslash needed in PHP regexp pattern
But I just could not understand something:
In the answer he mentions two statements:
2 backslashes are used for unescaping in a string ("\\\\" -> \\)
1 backslash is used for unescaping in the regex engine (\\ -> \)
My questions:
What does the word "unescaping" actually mean? What is the purpose of unescaping?
Why do we need 4 backslashes to include it in the regex?
The backslash has a special meaning in both regexen and PHP. In both cases it is used as an escape character. For example, if you want to write a literal quote character inside a PHP string literal, this won't work:
$str = ''';
PHP would get "confused" which ' ends the string and which is part of the string. That's where \ comes in:
$str = '\'';
It escapes the special meaning of ', so instead of terminating the string literal, it is now just a normal character in the string. There are more escape sequences like \n as well.
This now means that \ is a special character with a special meaning. To escape this conundrum when you want to write a literal \, you'll have to escape literal backslashes as \\:
$str = '\\'; // string literal representing one backslash
This works the same in both PHP and regexen. If you want to write a literal backslash in a regex, you have to write /\\/. Now, since you're writing your regexen as PHP strings, you need to double escape them:
$regex = '/\\\\/';
One pair of \\ is first reduced to one \ by the PHP string escaping mechanism, so the actual regex is /\\/, which is a regex which means "one backslash".
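The two layers are easy to see by printing the pattern before using it. A small sketch (the variable names are only for illustration):

```php
<?php
$regex = '/\\\\/';          // four backslashes in the PHP source
echo $regex, "\n";          // prints /\\/ which is what the regex engine receives
echo strlen($regex), "\n";  // prints 4: two delimiters plus two backslashes

$withBackslash = 'a\\b';    // the three-character string a\b
var_dump(preg_match($regex, $withBackslash)); // int(1)
var_dump(preg_match($regex, 'ab'));           // int(0)
```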
I think you can use "preg_quote()":
http://php.net/preg_quote
This function escapes special chars, so you can give an input as it is, without escaping by yourself:
<?php
$string = "online 24/7. Only for \o/";
$escaped_string = preg_quote($string, "/"); // 2nd param is optional and used if you want to escape also the delimiter of your regex
echo $escaped_string; // $escaped_string: "online 24\/7\. Only for \\o\/"
?>

How can I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).
For example, if I have the following string assignment:
$str = '\u304a\u306f\u3088\u3046';
I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.
As per other Stack Overflow posts I saw for similar issues, I first attempted the following:
$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);
However, whenever I attempt to do this, I get the following PHP error:
Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u
I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.
Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.
Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).
Second, you are missing the capturing group in your original expression. preg_replace() searches the given string for matches of the pattern and replaces each whole match with the replacement string; inside the replacement, $1 refers to whatever the first capturing group matched, so without a group $1 is empty.
The updated regular expression with proper escaping and correct capturing groups would look like:
$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
Output:
&#x304a;&#x306f;&#x3088;&#x3046; (which a browser renders as おはよう)
Expression: \\\\u([0-9a-f]+)
\\\\ - matches a literal backslash
u - matches the literal u character
( - beginning of the capturing group
[0-9a-f]+ - character class with a + quantifier -- matches one or more hex digits (0 - 9 or a - f)
) - end of capturing group
i modifier - used for case-insensitive matching
Replacement: &#x$1;
& - literal ampersand character (&)
# - literal pound character (#)
x - literal character x
$1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
; - literal semicolon that ends the entity
RegExr Demo.
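If actual characters are wanted instead of entities, one option is to decode the entities afterwards. A minimal sketch, assuming every code point is written with exactly four hex digits (so characters outside the Basic Multilingual Plane are not handled):

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Step 1: turn each \uXXXX sequence into a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $str);
echo $entities, "\n"; // &#x304a;&#x306f;&#x3088;&#x3046;

// Step 2: decode the entities into UTF-8 characters.
$chars = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $chars, "\n"; // おはよう
```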
This page here—titled Escaping Unicode Characters to HTML Entities in PHP—seems to tackle it with this nice function:
function unicode_escape_sequences($str){
$working = json_encode($str);
$working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
return json_decode($working);
}
That works by using json_encode to turn the UTF-8 characters into \uXXXX escape sequences, rewriting those into HTML entities, and decoding the result with json_decode. Very nice technique. For your example, though, the preg_replace call alone is enough:
$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);
The output is:
&#x304a;&#x306f;&#x3088;&#x3046;
Which is:
おはよう
Which translates to:
Good morning

PHP Regex for matching a UNC path

I'm after a bit of regex to be used in PHP to validate a UNC path passed through a form. It should be of the format:
\\server\something
... and allow for further sub-folders. It might be good to strip off a trailing slash for consistency although I can easily do this with substr if need be.
I've read online that matching a single backslash in PHP requires 4 backslashes (when using a "C like string") and think I understand why that is: PHP escaping (e.g. 2 = 1, so 4 = 2), then regex engine escaping (the remaining 2 = 1). I've seen the following two quoted as equivalent suitable regex to match a single backslash:
$regex = "/\\\\/s";
or apparently this also:
$regex = "/[\\]/s";
However these produce different results, and that is slightly aside from my final aim to match a complete UNC path.
To see if I could match two backslashes I used the following to test:
$path = "\\\\server";
echo "the path is: $path <br />"; // which is \\server
$regex = "/\\\\\\\\/s";
if (preg_match($regex, $path))
{
echo "matched";
}
else
{
echo "not matched";
}
The above however seems to match on two or more backslashes :( The pattern is 8 slashes, translating to 2, so why would an input of 3 backslashes ($path = "\\\\\\server") match?
I thought perhaps the following would work:
$regex = "/[\\][\\]/s";
and again, no :(
Please help before I jump out a window lol :)
Use this little gem:
$UNC_regex = '=^\\\\\\\\[a-zA-Z0-9-]+(\\\\[a-zA-Z0-9`~!@#$%^&(){}\'._-]+([ ]+[a-zA-Z0-9`~!@#$%^&(){}\'._-]+)*)+$=s';
Source: http://regexlib.com/REDetails.aspx?regexp_id=2285 (adapted to PHP string escaping)
The RegEx shown above matches for valid hostname (which allows only a few valid characters) and the path part behind the hostname (which allows many, but not all characters)
Sidenote on the backslashes issue:
When you use double quotes (") to enclose your string, you must be aware of PHP special character escaping. "\\" is a single \ in PHP.
Important: even with single quotes (') those backslashes must be escaped.
A PHP string with single quotes takes everything in the string literally (unescaped) with a few exceptions:
A backslash followed by a backslash (\\) is interpreted as a single backslash.
('C:\\*.*' => C:\*.*)
A backslash followed by a single-quote (\') is interpreted as a single quote.
('I\'ll be back' => I'll be back)
A backslash followed by anything else is interpreted as a backslash.
('Just a \ somewhere' => Just a \ somewhere)
Also, you must be aware of PCRE escape sequences.
The RegEx parser uses \ to introduce escape sequences and character classes such as \d, so you need to escape it again for the RegEx.
To match two \\ you must write $regex = "\\\\\\\\" or $regex = '\\\\\\\\'
From the PHP docs on PCRE escape sequences:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Regarding your Question:
why would an input of 3 backslashes ($path = "\\\\\\server") match with regex "/\\\\\\\\/s"?
The reason is that you have no boundaries defined (use ^ for beginning and $ for end of string), thus it finds \\ "somewhere" resulting in a positive match. To get the expected result, you should do something like this:
$regex = '/^\\\\\\\\[^\\\\]/s';
The RegEx above has 2 modifications:
^ at the beginning to only match two \\ at the beginning of the string
[^\\] negative character class to say: not followed by an additional backslash
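The effect of the two modifications can be checked against a path with two and a path with three leading backslashes. A short sketch (the variable names are only for illustration):

```php
<?php
// PHP source '/^\\\\\\\\[^\\\\]/s' becomes the regex /^\\\\[^\\]/s:
// start of string, two literal backslashes, then anything but a backslash.
$regex = '/^\\\\\\\\[^\\\\]/s';

$twoSlashes   = '\\\\server';    // the string \\server
$threeSlashes = '\\\\\\server';  // the string \\\server

var_dump(preg_match($regex, $twoSlashes));   // int(1)
var_dump(preg_match($regex, $threeSlashes)); // int(0)
```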
Regarding your last RegEx:
$regex = "/[\\][\\]/s";
The backslash escaping is mixed up here (see above for clarification). PHP turns "/[\\][\\]/s" into /[\][\]/s, and the RegEx then fails to compile because \] is read as an escaped ], so the character class is never closed.
This variant of your RegEx would work, but it would also match any occurrence of two backslashes, for the same reason I already explained above:
$regex = '/[\\\\][\\\\]/s';
Echo your regex as well so you can see the actual pattern. Writing those slashes inside PHP can become awkward, and this lets you verify that the pattern is correct.
Also you should put ^ at the beginning of the pattern to match from string start and $ to the end to specify that the whole string has to be matched.
\\server\something
Regex:
~^\\\\server\\something$~
PHP String:
$pattern = '~^\\\\\\\\server\\\\something$~';
For the repetition, you want to say that a server exists and it's followed by one or more \something parts. If server is like something, this can be simplified:
^\\(?:\\[a-z]+){2,}$
PHP String:
$pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';
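Putting the simplified pattern to use, a validation helper might look like the sketch below. The function name isUncPath is hypothetical, the pattern is the one above and therefore only accepts lowercase letters in each segment, and the rtrim call covers the trailing backslash mentioned in the question:

```php
<?php
// Hypothetical helper built around the simplified pattern above.
function isUncPath(string $path): bool
{
    // '\\' is a single backslash here; strip it from the end for consistency.
    $path = rtrim($path, '\\');

    // Regex: ^\\(?:\\[a-z]+){2,}$
    // One backslash, then at least two "backslash + letters" segments.
    $pattern = '~^\\\\(?:\\\\[a-z]+){2,}$~';

    return preg_match($pattern, $path) === 1;
}

var_dump(isUncPath('\\\\server\\something'));    // bool(true)
var_dump(isUncPath('\\\\server\\something\\'));  // bool(true), trailing backslash stripped
var_dump(isUncPath('\\\\server'));               // bool(false), no share part
var_dump(isUncPath('\\\\\\server\\something'));  // bool(false), three leading backslashes
```

A real validator would need a wider character set per segment, such as the one in the regexlib pattern quoted earlier.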
As there was some confusion about how \ characters should be written inside single quoted strings:
# Output:
#
# * Definition as '\\' ....... results in string(1) "\"
# * Definition as '\\\\' ..... results in string(2) "\\"
# * Definition as '\\\\\\' ... results in string(3) "\\\"
$slashes = array(
'\\',
'\\\\',
'\\\\\\',
);
foreach($slashes as $i => $slashed) {
$definition = sprintf('%s ', var_export($slashed, 1));
ob_start();
var_dump($slashed);
$result = rtrim(ob_get_clean());
printf(" * Definition as %'.-12s results in %s\n", $definition, $result);
}

Help with php regex for limiting allowed characters

I'm working in php and want to set some rules for a submitted text field. I want to allow letters, numbers, spaces, and the symbols # ' , -
This is what I have:
/^(a-z,0-9+# )+$/i
That seems to work but when I add the ' or - symbols I get errors.
Almost there. What you're looking for is called character classes. These are denoted by the use of square brackets. For example
/^[-a-z0-9+#,' ]+$/i
To include the hyphen character, it needs to be the first or last character in the class.
Edit
As you want to include the single quote and you're using PHP where regular expressions must be represented as strings, be careful with how you quote the pattern. In this case, you can use either of
$pattern = "/^[-a-z0-9+#,' ]+\$/i"; // or
$pattern = '/^[-a-z0-9+#,\' ]+$/i';
You should use a character class - [a-zA-Z0-9 #',-]
Note that - should be used first or last or escaped otherwise it gets treated as denoting a range and you will get errors
I want to allow letters, numbers, spaces, and the symbols #, ', , and -.
Use this regex...
/^[-a-zA-Z\d ',#]+\z/
Note the \z. If you use $, you are allowing a trailing \n. CodePad.
Ensure to escape the ' if you are using ' as your string delimiter.
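The difference between $ and \z only shows up when the input ends with a newline. A small sketch with an illustrative input value:

```php
<?php
// Same character class, different end anchors.
$withDollar = '/^[-a-zA-Z\d \',#]+$/';
$withZ      = '/^[-a-zA-Z\d \',#]+\z/';

$input = "O'Brien #4\n"; // note the trailing newline

var_dump(preg_match($withDollar, $input)); // int(1): $ also matches before a final newline
var_dump(preg_match($withZ, $input));      // int(0): \z only matches at the very end
```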
Please use /^[a-z,0-9+\#\-,\s]+$/i
Use this regex:
/^[-a-z0-9,# ']+$/i
