Why are backslashes used in preg_match function of PHP? - php

I'm been practicing the preg_match() function in PHP. The tutorial said that it is needed to add fore slashes before the characters.
I also noticed that without the slashes, it works strangely. It gives a warning:
preg_match(): Delimiter must not be alphanumeric or backslash.
Q: What difference does the fore slashes do?
Here's the code:
$string = 'Okay, I\'m fine with it! ';
$math = 'Okay'; // I need to add fore slashes for it to work
echo preg_match($math, $string); // It supposedly echoes out 1 or 0
// depending if the former argument
// is in the latter argument

There is no particular reason, it's a syntaxic choice. This syntax has the avantage to be handy to add global modifiers to the pattern:
delimiter - pattern - delimiter - [global modifiers]
As explained in the error message and in the php manual, you can choose the delimiter between special characters, the most commonly used is the slash, but it's not always a pertinent choice in particular when the pattern contains a lot of literal slashes that need to be escaped.

It's because you can also apply switches to the regular expression (eg. m for multiline, u for Unicode) and these need to be defined outside of the delimiter, so the syntax is
opening delimiter expression closing delimiter [optional switches]
e.g.
/^[a-z]*$/mi
for the multiline (m) and case insensitive (i) switches, using a delimiter of /
The delimiter must not be a character that can be misinterpreted by the regexp parser, it must be very clear that it is a delimiter, so it cannot be alpha (e.g. i, or a \ that is used to "escape" characters in the regexp
Note that you can also use braces as delimiters, so
[^[a-z]*$]mi
is valid

Related

PHP Preg_replace() pattern not work

I wan to replace text using preg_replace.But my search string have a / so it makes problem.
How can I solve it?
$search='r/trtrt';
echo preg_replace('/\b'.addslashes($search).'\b/', 'ERTY', 'TG FRT');
I am getting error preg_replace(): Unknown modifier 'T'
Use a different delimiter and don't use addslashes, that is escaping non-regex special characters (or a mix of regex and non-regex characters, I'd say the majority of the time dont use addslashes).
$search='r/trtrt';
echo preg_replace('~\b'. $search.'\b~', 'ERTY', 'TG FRT');
You could use preg_quote as an alternative. Just changing the delimiter is the easiest solution though.
use ~ as delimiter:
$search='r/trtrt';
echo preg_replace('~\b'.addslashes($search).'\b~', 'ERTY', 'TG FRT');
I always use ~ as it is one of the least used char in a string but you can use any character you want and won't need to escape your regexp chars!
You don't need addslashes() in your case but if you have a more complex regexp and you want to escape chars you should use preg_quote($search).
Why not escape it the way it is meant to be done
$search='r/trtrt';
echo preg_replace('/\b'.preg_quote($search, '/').'\b/', 'ERTY', 'TG FRT');
http://php.net/manual/en/function.preg-quote.php
preg_quote() takes str and puts a backslash in front of every
character that is part of the regular expression syntax. This is
useful if you have a run-time string that you need to match in some
text and the string may contain special regex characters
delimiter
If the optional delimiter is specified, it will also be escaped.
This is useful for escaping the delimiter that is required by the PCRE
functions. The / is the most commonly used delimiter.
Add slashes is not the function to use here. It provides no escaping for any of the special characters in Regx.
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) {
} = ! < > | : -
Using the proper functions promote readability of the code, if at some later point in time you or another coder see the ~ delimiter they may just think its part of a personal "style" or pay it little attention. However, seeing the input properly escaped will tell any experienced coder that the input could contain characters that conflict with regular expressions.
Personally, readability is at the top of my list whenever I write code. If you cant understand it at a glance, what good is it.

PCRE regex with lookahead and lookbehind always returns true

I’m trying to create a regex for form validation but it always returns true. The user must be able to add something like {user|2|S} as input but also use brackets if they are escaped with \.
This code checks for the left bracket { for now.
$regex = '/({(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)}))|[^{]|(?<=\\\){)*/';
if (preg_match($regex, $value)) {
return TRUE;
} else {
return FALSE;
}
A possible correct input would be:
Hello {user|1|S}, you have {amount|2|D2}
or
Hello {user|1|S}, you have {amount|2|D2} in \{the_bracket_bank\}
However, this should return false:
Hello {user|1|S}, you have {amount|2}
and this also:
Hello {user|1|S}, you have {amount|2|D2} in {the_bracket_bank}
A live example can be found here: http://regexr.com?37tpu Note that there is a \ in the lookbehind at the end, PHP was giving me error messages because I had to escape it an extra time in my code.
The main error is that you do not specify that the regex should match from the beginning to the of the checked string. Use the ^ and $ assertions.
I think you have to escape { and } in your regex as they have special meaning. Together they form a quantifier.
The (?<=\\\) is better written (?<=\\\\). The backslash has to be double escaped as it has special meaning in both single-quoted string and PCRE regex. Using \\\ works too, because if single-quoted string contains any escape sequence except \\ and \', it handles it as literal backslash and letter, therefore \) is taken literally. But explicitly escaping the backslash twice seems easier to read to me.
The regex should be
$regex = '/^(\{(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\}))|[^{]|(?<=\\\\)\{)*$/';
But notice that the look-around assertions are not necessary. This regex should do the job too:
$regex = '/^([^{]|\\\{|\{[a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\})*$/';
Any non-{ characters are matched by the first alternative. When a { is read, one of the remaining two alternatives is used. Either the pattern for the brace thing matches, or the regex engine backtracks one character and tries to match \{ character sequence. If it fails, both ways, it backtracks further till it reaches string start and fails completely.
Matching without lookbehind
You can make a regex for this without using lookbehind/lookaheads (which is usually recommended).
For example, if your requirement is that you can match any character but a { and a } unless it's preceded by a \. You can also say:
Match any character but a { and a } OR match a \{ or a \}. To match any character but a { and a } use:
[^{}]
To match a \{ use:
\\\{
One backslash is for escaping the { (which might not be necessary, depending on your regex compiler) and one backslash is for escaping the other backslash.
You would end up with this:
(?:
[^{}]
|
\\\{
|
\\\}
)+
I nicely formatted this regex so that it's readable. If you want to use it in your code like this make sure to use the [PCRE_EXTENDED][1] modifier.
Looks more of a job for a lookbehind to me:
/((?<!\\\\)\{[a-zA-Z0-9]+\|[0-9]+\|[SD][0-9]*\})/
However, the obfuscation factor is so high that I would rather recognize all bracketed strings and parse them later.

How can I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).
For example, if I have the following string assignment:
$str = '\u304a\u306f\u3088\u3046';
I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.
As per other Stack Overflow posts I saw for similar issues, I first attempted the following:
$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);
However, whenever I attempt to do this, I get the following PHP error:
Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u
I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.
Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.
Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).
Second, you are missing the capturing groups in your original expression. preg_replace() searches the given string for matches to the supplied pattern and returns the string where the contents matched by the capturing groups are replaced with the replacement string.
The updated regular expression with proper escaping and correct capturing groups would look like:
$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
Output:
おはよう
Expression: \\\\u([0-9a-f]+)
\\\\ - matches a literal backslash
u - matches the literal u character
( - beginning of the capturing group
[0-9a-f] - character class -- matches a digit (0 - 9) or an alphabet (from a - f) one or more times
) - end of capturing group
i modifier - used for case-insensitive matching
Replacement: &#x$1
& - literal ampersand character (&)
# - literal pound character (#)
x - literal character x
$1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
RegExr Demo.
This page here—titled Escaping Unicode Characters to HTML Entities in PHP—seems to tackle it with this nice function:
function unicode_escape_sequences($str){
$working = json_encode($str);
$working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
return json_decode($working);
}
That seems to work with json_encode and json_decode to take pure UTF-8 and convert it into Unicode. Very nice technique. But for your example, this would work.
$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);
The output is:
おはよう
Which is:
おはよう
Which translates to:
Good morning

PHP preg_match_all strange behaviour with "/" character

Using :
preg_match_all(
"/\b".$KeyWord."\b/u",
$SearchStr,
$Array1,
PREG_OFFSET_CAPTURE);
This code works fine for all cases except when there is a / in the $KeyWord var. Then I get a warning and unsuccessful match of course.
Any idea how to work around this?
Thanks
use preg_quote() around the keyword.
http://us2.php.net/preg_quote
but also provide your delimiter, so it gets escaped: preg_quote($KeyWord, "/")
You must parse $KeyWord and add "\" before all spec symbols, you can use preg_quote()
Dynamic Values In Patterns
You are using a dynamic value inside the pattern. Like escaping for SQL or HTML, a specific escaping for the value is needed. If you do not escape meta characters inside the value are interpreted by the regex engine. The escaping function for PCRE patterns is preg_quote().
preg_match_all(
"(\b".preg_quote($KeyWord)."\b)u",
$SearchStr,
$Array1,
PREG_OFFSET_CAPTURE
);
Delimiters
The syntax of a pattern in PHPs preg_* function is:
DELIMITER PATTERN DELIMITER OPTIONS
The / is the delimiter in your pattern. So the / inside the $keyWord was recognized as the closing delimiter.
But all non alphanumeric characters can be used. In Perl and JS you can define a regular expression directly (not as string) using / so it is often the default in tutorials.
Most delimiters have to be escaped inside the pattern.
Match a \: '/\//'
The exception to this rule are brackets. You use any of the bracket pairs as delimiter. And because it is a pair, they can still be used inside the pattern.
Match a \: '(/)'
The () brackets are a good decision, you can count them as "subpattern 0".
You can use preg_quote to handle the backslash character.
From the manual:
puts a backslash in front of every
character that is part of the regular
expression syntax
You can also pass the delimiter as the second parameter and it will also be escaped. However, if you're using # as your delimiter, then there's no need to escape /
So, you can either use:
preg_match_all("/\b".preg_quote($KeyWord, "/")."\b/u", $SearchStr,$Array1,PREG_OFFSET_CAPTURE))
or, if you are sure that your keyword does not contain any other regex-special characters, you can simply change the delimiter, and use to escape the backslash:
preg_match_all("#\b".$KeyWord."\b#u", $SearchStr,$Array1,PREG_OFFSET_CAPTURE))

What's wrong with this php/regex query?

preg_replace("/(/s|^)(php|ajax|c\+\+|javascript|c#)(/s|$)/i", '$1#$2$3', $somevar);
It's meant to turn, for example, PHP into #PHP.
Warning: preg_replace(): Unknown modifier '|'
It's because you are using the forward slash (/) as your delimiter. When the regex engine gets to /s (3rd character) it thinks the regex is over and the rest of it are modifiers. But no such modifier (|) exists, thus the error.
Next time, you can either:
Change your delimiters to something you won't use in your regex, ie:
preg_replace("!(/s|^)(php|ajax|c\+\+|javascript|c#)(/s|$)!i", '$1#$2$3', $somevar);
Or escape those characters with a backslash, ie: "/something\/else/"*
I also suspect you didn't intend to use /s, but the escape character \s that matches whitespace characters.
The first character in the regular expression is the delimiter. If you need to use this inside your regular expression then you need to escape it:
"/(\/s|^)...
^
Or alternatively, choose another delimiter that isn't used anywhere in your regular expression so that you don't need to escape:
"~(/s|^)...(/s|$)~i"
I prefer to do the latter as it makes the regular expression more readable.
(Although as NullUserException points out, the actual error is that you should have used a backslash instead of a slash).

Categories