How do I put a period into a PHP regular expression?
The way it is used in the code is:
echo(preg_match("/\$\d{1,}\./", '$645.', $matches));
But apparently the period in that $645. doesn't get recognized. Requesting tips on how to make this work.
Since . is a special character, you need to escape it to match it literally: write \. instead.
Remember to also escape the escape character if you want to use it in a string. So if you want to write the regular expression foo\.bar in a string declaration, it needs to be "foo\\.bar".
Escape it. The period has a special meaning within a regular expression in that it represents any character — it's a wildcard. To represent and match a literal . it needs to be escaped which is done via the backslash \, i.e., \.
/[0-9]\.[ab]/
Matches a digit, a period, and "a" or "b", whereas
/[0-9].[ab]/
Matches a digit, any single character (see note 1), and "a" or "b".
Be aware that PHP uses the backslash as an escape character in double-quoted strings, too. In these cases you'll need to doubly escape:
$single = '\.';
$double = "\\.";
UPDATE
This echo(preg_match("/\$\d{1,}./", '$645.', $matches)); could be rewritten as echo(preg_match('/\$\d{1,}\./', '$645.', $matches)); or echo(preg_match("/\\$\\d{1,}\\./", '$645.', $matches));. They both work.
1) Not linefeeds, unless configured via the s modifier.
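A minimal sketch of the spellings discussed above, run against the subject from the question. The variable names are illustrative only. It shows that in the double-quoted original the \$ collapses to a bare $ (an end-of-subject anchor for PCRE), which is why that version never matches.

```php
<?php
$subject = '$645.';

// Double-quoted: PHP turns "\$" into "$", so PCRE sees /$\d{1,}\./ and fails.
$broken = preg_match("/\$\d{1,}\./", $subject);

// Single-quoted: the backslashes reach PCRE untouched.
$single = preg_match('/\$\d{1,}\./', $subject, $m1);

// Double-quoted with every backslash doubled for PHP.
$double = preg_match("/\\$\\d{1,}\\./", $subject, $m2);

var_dump($broken, $single, $double); // int(0) int(1) int(1)
```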
Related
I am trying to escape a string for use in a regular expression in PHP. So far I tried:
preg_quote(addslashes($string));
I thought I need addslashes in order to properly account for any quotes that are in the string. Then preg_quote escapes the regular expression characters.
However, the problem is that quotes are escaped with backslash, e.g. \'. But then preg_quote escapes the backslash with another one, e.g. \\'. So this leaves the quote unescaped once again. Switching the two functions does not work either because that would leave an unescaped backslash which is then interpreted as a special regular expression character.
Is there a function in PHP to accomplish the task? Or how would one do it?
The proper way is to use preg_quote and specify the used pattern delimiter.
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax... characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Trying to use a backslash as delimiter is a bad idea. Usually you pick a character, that's not used in the pattern. Commonly used is slash /pattern/, tilde ~pattern~, number sign #pattern# or percent sign %pattern%. It is also possible to use bracket style delimiters: (pattern)
Your regex, with the modification mentioned in the comments by @CasimiretHippolyte and @anubhava:
$pattern = '/(?<![a-z])' . preg_quote($string, "/") . '/i';
Maybe you wanted to use the \b word boundary instead. No additional escaping is needed.
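As a rough illustration of the advice above, the following sketch quotes a string that contains both the delimiter and a regex metacharacter. The input value is hypothetical, chosen only to exercise the escaping.

```php
<?php
$string = 'a/b.c'; // hypothetical input containing the delimiter and a dot

// The second argument tells preg_quote to escape the delimiter as well.
$quoted  = preg_quote($string, '/');
$pattern = '/(?<![a-z])' . $quoted . '/i';

var_dump($quoted);                            // string(7) "a\/b\.c"
var_dump(preg_match($pattern, 'see a/b.c'));  // int(1)
var_dump(preg_match($pattern, 'see a/bxc'));  // int(0): the dot is literal now
var_dump(preg_match($pattern, 'xa/b.c'));     // int(0): blocked by the lookbehind
```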
<?php
$a='/\\\/';
$b='/\\\\/';
var_dump($a);//string '/\\/' (length=4)
var_dump($b);//string '/\\/' (length=4)
var_dump($a===$b);//boolean true
?>
Why is the string with 3 backslashes equal to the string with 4 backslashes in PHP?
And can we use the 3-backslash version in regular expression?
The PHP reference says we must use 4 backslashes.
Note:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
$b='/\\\\/';
php parses the string literal (more or less) character by character. The first input symbol is the forward slash. The result is a forward slash in the result (of the parsing step) and the input symbol (one character, the /) is taken away from the input.
The next input symbol is a backslash. It's taken from the input and the next character/symbol is inspected. It's also a backslash. That's a valid combination, so the second symbol is also taken from the input and the result is a single backslash (for both input symbols).
The same with the third and fourth backslash.
The last input symbol (within the literal) is the forwardslash -> forwardslash in the result.
-> /\\/
Now for the string with three backslashes:
$a='/\\\/';
php "finds" the first backslash, the next character is a backslash - that's a valid combination resulting in one single backslash in the result and both characters in the input literal taken.
php then "finds" the third backslash, the next character is a forward-slash, this is not a valid combination. So the result is a single backslash (because php loves and forgives you....) and only one character taken from the input.
The next input character is the forward-slash, resulting in a forwardslash in the result.
-> /\\/
=> both literals encode the same string.
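The walk-through above can be sketched as a small decoder. decodeSingleQuoted is a hypothetical helper written for this illustration, not a PHP built-in; it only mimics the rule for single-quoted literals (a backslash pairs with a following backslash or apostrophe, otherwise it stands for itself). The raw inputs are built with chr(92) so that no escaping is involved in creating them.

```php
<?php
// Hypothetical helper: decodes the body of a single-quoted literal
// one symbol at a time, as described above.
function decodeSingleQuoted(string $raw): string
{
    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        $ch   = $raw[$i];
        $next = $i + 1 < $len ? $raw[$i + 1] : '';
        if ($ch === '\\' && ($next === '\\' || $next === "'")) {
            $out .= $next; // valid combination: take both symbols, emit one
            $i++;
        } else {
            $out .= $ch;   // anything else, including a lone backslash, is literal
        }
    }
    return $out;
}

$bs    = chr(92);                            // one backslash, no escaping needed
$three = '/' . str_repeat($bs, 3) . '/';     // source text of the $a literal body
$four  = '/' . str_repeat($bs, 4) . '/';     // source text of the $b literal body

var_dump(decodeSingleQuoted($three) === decodeSingleQuoted($four)); // bool(true)
var_dump(strlen(decodeSingleQuoted($three)));                       // int(4)
```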
It is explained in the documentation on the page about Strings:
Under the Single quoted section it says:
The simplest way to specify a string is to enclose it in single quotes (the character ').
To specify a literal single quote, escape it with a backslash (\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash.
Let's try to interpret your strings:
$a='/\\\/';
The forward slashes (/) have no special meaning in PHP strings, they represent themselves.
The first backslash (\) escapes the second backslash, as explained in the first sentence from the second paragraph quoted above.
The third backslash stands for itself, as explained in the last sentence of the above quote, because it is not followed by an apostrophe (') or a backslash (\).
As a result, the variable $a contains this string: /\\/.
On
$b='/\\\\/';
there are two backslashes (the second and the fourth) that are escaped by the first and the third backslash. The final (runtime) string is the same as for $a: /\\/.
Note
The discussion above is about the encoding of strings in PHP source. As you can see, there is always more than one (correct) way to encode the same string. Another option (besides string literals enclosed in single or double quotes, or written in heredoc or nowdoc syntax) is to use constants (for literal backslashes, for example) and build the strings from pieces.
For example:
define('BS', '\\'); // one literal backslash; a lone '\' would escape the closing quote
$c = '/'.BS.BS.'/';
needs the escaping only once, in the definition of BS. The constant BS contains a literal backslash and is used everywhere a backslash is needed for its intrinsic value. Where a backslash is needed for escaping, a real backslash must be used (there is no way to use BS for that).
The escaping in regex is a different thing. First, the regex is parsed at the runtime and at runtime $a, $b and $c above contain /\\/, no matter how they were generated.
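Then, in a regex, a backslash followed by a non-alphanumeric character that has no special meaning simply stands for that character, so the backslash itself disappears (note the difference from the above: in a PHP string such a backslash is kept as a literal backslash).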
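A short sketch of that difference, using the tilde as an arbitrary non-special character: PHP keeps the backslash in the string, while PCRE drops it and matches only the tilde.

```php
<?php
// PHP level: \~ is not an escape sequence, so both characters survive.
$literal = '\~';
var_dump(strlen($literal)); // int(2)

// Regex level: the same two characters match a single tilde.
var_dump(preg_match('/^\~$/', '~')); // int(1)

// The subject "\~" (backslash plus tilde) is therefore NOT matched.
var_dump(preg_match('/^\~$/', $literal)); // int(0)
```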
Combining PHP & regex
There are endless possibilities to make things complicated. Let's try to keep them simple and set some guidelines for regex in PHP:
enclose the regex string in apostrophes ('), if possible; this way there are only two characters that need to be escaped for PHP: the apostrophe and the backslash;
when parsing URLs, paths or other strings that can contain forward slashes (/), use #, ~, ! or @ as the regex delimiter (whichever one is not used in the regex itself); this way there is no need to escape the delimiter when it is used inside the regex;
don't escape characters in a regex when it's not needed; for example, the dash (-) has a special meaning only when it is used in character classes; outside them it's useless to escape it (and even in character classes it can be used unescaped without having any special meaning if it is placed as the very first or the very last character inside the [...] enclosure).
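The guidelines above might look like this in practice. The URL and the slug are made-up sample values.

```php
<?php
$url = 'https://example.com/docs/php-regex'; // sample value for illustration

// Single quotes plus # as delimiter: the slashes need no escaping at all.
preg_match('#^https?://([^/]+)/#', $url, $m);
var_dump($m[1]); // string(11) "example.com"

// A dash placed last in a character class is literal without a backslash.
var_dump(preg_match('/^[a-z0-9_-]+$/', 'php-regex')); // int(1)
var_dump(preg_match('/^[a-z0-9_-]+$/', 'php regex')); // int(0)
```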
I want to use preg_match and regular expression in PHP to check that a string starts with either "+44" or "0", but how can I do this without the + being read as matching the preceding character once or more? Would (+44|0) work?
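Use ^ to anchor the match at the start of the string and a backslash \ to escape the + character. So you'll check for
^\+44|^0
Do not put spaces around the |, because a space in a pattern is matched literally unless the x modifier is used.
In PHP, to store the regexp in a string, you don't need to double the backslash; just use single quotes (and add the delimiters that preg_match requires), like:
$regexp = '/^\+44|^0/';
In fact, double quotes work here too, because \+ is not an escape sequence that PHP recognizes:
$regexp = "/^\+44|^0/";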
The backslash is the default escape character for regular expressions. You may have to escape the backslash itself as well if it is used in a PHP string, so you'd use something like "(\\+44|0)" as string constant. The regular expression itself would then be (\+44|0).
You can do it several ways. Amongst those I know two:
One is escaping the + with the escape character (i.e. the backslash):
^(\+44|0)
The other is placing the + inside a character class [], where it is taken literally:
^([+]44|0)
^ is the anchor character that means the start of the string/line based on your flag(modifier).
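Putting the answers together, a prefix check could be sketched as follows. startsWithUkPrefix is a hypothetical function name used only for this example.

```php
<?php
// Hypothetical helper: true when $number begins with "+44" or "0".
function startsWithUkPrefix(string $number): bool
{
    // Single quotes keep \+ intact; ^ anchors both alternatives.
    return preg_match('/^(\+44|0)/', $number) === 1;
}

var_dump(startsWithUkPrefix('+447911123456')); // bool(true)
var_dump(startsWithUkPrefix('07911123456'));   // bool(true)
var_dump(startsWithUkPrefix('447911123456'));  // bool(false)
```

The character-class spelling '/^([+]44|0)/' should behave the same way.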
To match a literal backslash, many people and the PHP manual say: Always triple escape it, like this \\\\
Note:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
Here is an example string: \test
$test = "\\test"; // outputs \test;
// WON'T WORK: pattern in double-quotes double-escaped backslash
#echo preg_replace("~\\\t~", '', $test); #output -> \test
// WORKS: pattern in double-quotes with triple-escaped backslash
#echo preg_replace("~\\\\t~", '', $test); #output -> est
// WORKS: pattern in single-quotes with double-escaped backslash
#echo preg_replace('~\\\t~', '', $test); #output -> est
// WORKS: pattern in double-quotes with double-escaped backslash inside a character class
#echo preg_replace("~[\\\]t~", '', $test); #output -> est
// WORKS: pattern in single-quotes with double-escaped backslash inside a character class
#echo preg_replace('~[\\\]t~', '', $test); #output -> est
Conclusion:
If the pattern is single-quoted, a backslash has to be double-escaped \\\ to match a literal \
If the pattern is double-quoted, it depends whether
the backlash is inside a character-class where it must be at least double-escaped \\\
outside a character-class it has to be triple-escaped \\\\
Who can show me a difference, where a double-escaped backslash in a single-quoted pattern e.g. '~\\\~' would match anything different than a triple-escaped backslash in a double-quoted pattern e.g. "~\\\\~" or fail.
When/why/in what scenario would it be wrong to use a double-escaped \ in a single-quoted pattern e.g. '~\\\~' for matching a literal backslash?
If there's no answer to this question, I would continue to always use a double-escaped backslash \\\ in a single-quoted PHP regex pattern to match a literal \ because there's possibly nothing wrong with it.
A backslash character (\) is considered to be an escape character by both PHP's parser and the regular expression engine (PCRE). If you write a single backslash character, it will be considered as an escape character by PHP parser. If you write two backslashes, it will be interpreted as a literal backslash by PHP's parser. But when used in a regular expression, the regular expression engine picks it up as an escape character. To avoid this, you need to write four backslash characters, depending upon how you quote the pattern.
To understand the difference between the two types of quoting patterns, consider the following two var_dump() statements:
var_dump('~\\\~');
var_dump("~\\\\~");
Output:
string(4) "~\\~"
string(4) "~\\~"
The escape sequence \~ has no special meaning in PHP when it's used in a single-quoted string. Three backslashes do also work because the PHP parser doesn't know about the escape sequence \~. So \\ will become \ but \~ will remain as \~.
Which one should you use:
For clarity, I'd always use ~\\\\~ when I want to match a literal backslash. The other one works too, but I think ~\\\\~ is clearer.
There is no difference between the actual escaping of the backslash in single or double quoted strings in PHP, as long as you do it correctly. The reason why you're getting a WON'T WORK on your first example is, as pointed out in the comments, that PHP expands \t to the tab character in a double-quoted string.
When you're using just three backslashes, the last one in your single quoted string will be interpreted as part of \~, which as far as single quoted strings go, will be left as it is (since it does not match a valid escape sequence). It is however just a coincidence that this will be parsed as you expect in this case, and not have some sort of side effect (e.g., \\\' would not behave the same way).
The reason for all the escaping is that the regular expression also needs backslashes escaped in certain situations, as they have special meaning there as well. This leads to the large number of backslashes after each other, such as \\\\ (which takes eight backslashes for the markdown parser, as it yet again adds another level of escaping).
Hopefully that clears it up, as you seem to be confused regarding the handling of backslashes in single/double quoted strings more than the behaviour in the regular expression itself (which will be the same regardless of " or ', as long as you escape things correctly).
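To make the point concrete, here is a small sketch using the \test string from the question. It compares the four-backslash pattern in both quoting styles with the three-backslash double-quoted pattern, where PHP turns \t into a tab before PCRE ever sees it.

```php
<?php
$test = "\\test"; // the five characters \test

// Four backslashes: PHP reduces them to two, PCRE reads those as one literal backslash.
$single = preg_replace('~\\\\t~', '', $test);
$double = preg_replace("~\\\\t~", '', $test);

// Three backslashes in double quotes: "\\" becomes \ and "\t" becomes a TAB,
// so the pattern looks for a tab character and nothing is replaced.
$tabbed = preg_replace("~\\\t~", '', $test);

var_dump($single, $double, $tabbed); // "est", "est", "\test"
```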
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods. Here is the pattern:
return mb_ereg_match("^[\w\s'-\.]+$", $name);
Problem is this pattern, for some reason, returns true when there are literal asterisks in $name. This shouldn't be possible unless I'm missing something. I've done multiple searches on literal asterisks and all I found was the "\*" pattern for intentionally matching them.
The same pattern in preg_match() also returns a match when passed a string like "*John".
What the heck am I missing?
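Use a double backslash in front of these codes: the first backslash escapes the second one for PHP's double-quoted string, and the one that remains starts the regex escape sequence. (PHP happens to leave \w, \s and \. alone in double quotes, but doubling avoids surprises with sequences such as \t or \n.)
You also need to escape the -, otherwise it accepts all characters "between" ' and . (a range that includes *).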
return mb_ereg_match("^[\\w\\s'\\-\\.]+$", $name);
Have a look at a working case (using preg_match): http://ideone.com/E8afAM
When enclosed in square brackets, the hyphen acts as a special character to denote a range. In your case, it's matching all characters in the range from ' to . (which includes the asterisk).
Escaping the hyphen should return the desired result:
^[\w\s'\-\.]+$
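A brief sketch of the effect, using preg_match with / delimiters added. The first pattern keeps the accidental range from ' to . and the second moves the hyphen to the end of the class, which has the same effect as escaping it.

```php
<?php
// Accidental range: '-\. spans ' ( ) * + , - .
$loose = '/^[\w\s\'-\.]+$/';

// Hyphen last in the class, so it is a literal hyphen.
$strict = '/^[\w\s\'.-]+$/';

var_dump(preg_match($loose, '*John'));              // int(1)
var_dump(preg_match($strict, '*John'));             // int(0)
var_dump(preg_match($strict, "O'Neil-Smith Jr."));  // int(1)
```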
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods.
You are overlooking that \w matches more than letters. php.net says:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".
And, the perl definition is:
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_").
As I read it, the connecting punctuation character should mean only _, but this may be a bug in the multibyte extension.
If you use mb_ereg_match only to match whole Unicode strings, try preg_match's /u modifier together with the Unicode character properties feature, available since PHP 5.1.0.
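One possible shape for that suggestion is sketched below, assuming the name field should accept Unicode letters, white space, apostrophes, periods and hyphens only. The pattern and sample names are illustrative, and the \u{...} escapes need PHP 7 or later.

```php
<?php
// \p{L} matches any Unicode letter; the u modifier treats pattern and subject as UTF-8.
$namePattern = '/^[\p{L}\s\'.-]+$/u';

$accented = "Zo\u{00EB} M\u{00FC}ller"; // "Zoë Müller" spelled with escapes

var_dump(preg_match($namePattern, $accented)); // int(1)
var_dump(preg_match($namePattern, 'John_1'));  // int(0): underscore and digit rejected
var_dump(preg_match($namePattern, '*John'));   // int(0)
```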