Regex match fails when adding special characters


Consider this,
$uri = '/post/search/foo';
$pattern = '~/post/search/[A-Za-z0-9_-]+(?=/|$)~i';
$matches = array();
preg_match($pattern, $uri, $matches);
print_r($matches); // Success
It works fine, since every character of foo belongs to [A-Za-z0-9_-]. Since I'm writing a route plugin, I want this to be able to match special characters as well.
I imagine a regex pattern like this:
[A-Z0-9!##$%^&*()_+|\/?><~"№;:'*]+(?=/|$)
I've tried escaping each special character with a backslash, and escaping the whole pattern using preg_quote(), with no luck: I always get compilation failures.
The question is: how should A-Z0-9!##$%^&*()_+|\/?><~"№;:'* be matched properly?

Is there a reason you don't want to just use an ungreedy .?
As in:
'~/post/search/.+(?=/|$)~iU'

Escaping inside a character class is not difficult. Only ^ (only in the first position), - (except in the first or last position), \ and [] are special there, plus ' as the PHP string delimiter and, additionally, the regex delimiter.
You use ~ as the regex delimiter, and I think this is the critical point in your character class, because preg_quote() does not escape the delimiter unless you pass it as the second argument.
So this should work:
[A-Z0-9!##$%^&*()_+|\/?><\~"№;:\'*]+(?=/|$)
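A minimal sketch of building such a class with preg_quote() and the ~ delimiter passed as the second argument. The $special list and the test URI are made up for illustration, / is deliberately left out of the class so that the (?=/|$) lookahead still marks the end of a segment, and the u modifier is an assumption that the source file and the input are UTF-8 (needed for №):

```php
<?php
// Hypothetical list of extra characters a route segment may contain.
$special = '!#$%^&*()_+|?><~"№;:\'*-';

// preg_quote() escapes regex metacharacters; the second argument makes it
// escape the ~ delimiter as well.
$class = 'A-Z0-9' . preg_quote($special, '~');

$pattern = '~/post/search/[' . $class . ']+(?=/|$)~iu';

$result = preg_match($pattern, '/post/search/foo!bar', $matches);
print_r($matches);
```

Because every character goes through preg_quote(), the position of ^ or - inside the list no longer matters.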

Related

PHP how to use a variable in match

I am using the code below to check whether my password contains special characters, which works fine. But I would like to use a variable like $mySpecialChar instead of the "[\'^£$%&*()}{##~?><>,|=_+¬-]" string, and I'm not sure whether I can do that. The reason is that I want to be able to pull the string from a data table.
I've tried preg_match_all("/".$mySpecialChar."/"), but no luck.
$matches = array();
if (preg_match_all("/[\'^£$%&*()}{##~?><>,|=_+¬-]/", $pwd, $matches) > 0) {
    foreach ($matches[0] as $match) { $specialcase += strlen($match); }
}

Make sure to escape any variables you put in a regular expression
preg_match_all('/'.preg_quote($mySpecialChar, '/').'/', $pwd, $matches);
preg_quote
string preg_quote ( string $str [, string $delimiter = NULL ] )
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Note that / is not a special regular expression character.
Your string contains five or more of these special characters.
Note the last line of the quote above: / is not a special regular expression character. While not strictly necessary in your case (I don't see a / in your string), if you pass the delimiter as the second argument, preg_quote() escapes that too. That is exactly what I did above: preg_quote($mySpecialChar, '/').
If you don't quote these, it's anyone's guess what will happen. You could get an error, you could get an empty capture group (), you could match anything with the ., and so on. As you have it,
[\'^£$%&*()}{##~?><>,|=_+¬-]
this is a character set, inside which most characters lose their special meaning, if that's intentional. If you had [^\'£$%&*()}{##~?><>,|=_+¬-] you would have a negated character set.
Seeing as you are using preg_match_all, and not preg_match, I can probably assume you don't want the character set; otherwise why use preg_match_all?
It should simply be, if you want to match everything in $mySpecialChar:
preg_match('/['.preg_quote($mySpecialChar, '/').']+/', $pwd, $matches);
If you are just trying to match the characters between the [....], I would still escape them. Usually it doesn't matter, but if the string comes from a database and starts with ^, it will make a difference, and so will a - between certain characters (0-9, for example). Escaping never hurts; just remove the [] when you save the string and add them back as I have above.
[...]+ means one or more, [...]* means zero or more, and [...]+? means one or more, non-greedy. You should then be able to use just [...]+ with preg_match, which gives you a cleaner match than matching one character at a time with [...] and preg_match_all.
Most of the time \W (uppercase) will also match most symbols; it is equivalent to [^a-zA-Z0-9_], that is, anything other than a-z, A-Z, 0-9 and _.
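A rough sketch of the variable-based approach, counting special characters in a password. The values of $mySpecialChar and $pwd are invented for illustration (in the question the list would come from a database, stored without the surrounding [ ]). Only ASCII characters are used here; multi-byte characters such as £ or ¬ would additionally need the u modifier and UTF-8 input:

```php
<?php
// Hypothetical character list, stored without the [ ] wrapper.
$mySpecialChar = '^$%&*()}{#~?><,|=_+-';
$pwd = 'pa$$w0rd-#1';

// Escape the list (including the / delimiter), then wrap it in a character class.
$pattern = '/[' . preg_quote($mySpecialChar, '/') . ']/';

// preg_match_all() returns the number of full matches, one per special character.
$specialCount = preg_match_all($pattern, $pwd, $matches);
echo $specialCount, "\n";
```

Thanks to preg_quote(), the leading ^ and the trailing - in the stored list are treated as ordinary characters.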

You could always just look for characters that AREN'T the basic ones:
preg_match_all('/[^0-9A-Za-z]/', $pwd, $matches)
Much shorter and just as effective.
You can easily put this in a string if you like:
$specialChars = '[^0-9A-Za-z]';
preg_match_all("/{$specialChars}/", $pwd, $matches);
Running this on the provided password fills $matches[0] with all of the special characters from the string. To evaluate password complexity, all you need is the length of $pwd and the number of entries in $matches[0] (which is also the return value of preg_match_all), as this tells you the number of special characters.
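As a small illustration of that counting step (the password is a made-up value):

```php
<?php
$pwd = 'Tr1cky!Pass#';   // hypothetical password
$specialChars = '[^0-9A-Za-z]';

// The return value is the number of matches, i.e. the number of special characters.
$count = preg_match_all("/{$specialChars}/", $pwd, $matches);

echo strlen($pwd), " characters, ", $count, " special\n";
```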

PCRE regex with lookahead and lookbehind always returns true

I’m trying to create a regex for form validation but it always returns true. The user must be able to add something like {user|2|S} as input but also use brackets if they are escaped with \.
This code checks for the left bracket { for now.
$regex = '/({(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)}))|[^{]|(?<=\\\){)*/';
if (preg_match($regex, $value)) {
    return TRUE;
} else {
    return FALSE;
}
A possible correct input would be:
Hello {user|1|S}, you have {amount|2|D2}
or
Hello {user|1|S}, you have {amount|2|D2} in \{the_bracket_bank\}
However, this should return false:
Hello {user|1|S}, you have {amount|2}
and this also:
Hello {user|1|S}, you have {amount|2|D2} in {the_bracket_bank}
A live example can be found here: http://regexr.com?37tpu Note that there is a \ in the lookbehind at the end, PHP was giving me error messages because I had to escape it an extra time in my code.

The main error is that you do not specify that the regex should match from the beginning to the end of the checked string. Use the ^ and $ assertions.
I think you have to escape { and } in your regex as they have special meaning. Together they form a quantifier.
The (?<=\\\) is better written (?<=\\\\). The backslash has to be double escaped as it has special meaning in both single-quoted string and PCRE regex. Using \\\ works too, because if single-quoted string contains any escape sequence except \\ and \', it handles it as literal backslash and letter, therefore \) is taken literally. But explicitly escaping the backslash twice seems easier to read to me.
The regex should be
$regex = '/^(\{(?=([a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\}))|[^{]|(?<=\\\\)\{)*$/';
But notice that the look-around assertions are not necessary. This regex should do the job too:
$regex = '/^([^{]|\\\{|\{[a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\})*$/';
Any non-{ character is matched by the first alternative. When a { is read, one of the remaining two alternatives is used: either the placeholder pattern matches, or the regex engine backtracks one character and tries to match the \{ character sequence. If both fail, it backtracks further until it reaches the start of the string and fails completely.
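A quick check of the lookaround-free pattern against the four inputs from the question. This is only a sketch: the literal backslash is written here in the explicit double-escaped form (\\\\ inside the single-quoted string), and the $inputs and $results names are made up for the test:

```php
<?php
// Same pattern as above, with the literal backslash fully escaped for PHP.
$regex = '/^([^{]|\\\\\{|\{[a-zA-Z0-9]+\|[0-9]*\|(S|D[0-9]*)\})*$/';

$inputs = [
    'Hello {user|1|S}, you have {amount|2|D2}',
    'Hello {user|1|S}, you have {amount|2|D2} in \{the_bracket_bank\}',
    'Hello {user|1|S}, you have {amount|2}',
    'Hello {user|1|S}, you have {amount|2|D2} in {the_bracket_bank}',
];

$results = [];
foreach ($inputs as $value) {
    // preg_match() returns 1 for valid input and 0 for invalid input.
    $results[] = preg_match($regex, $value);
}
print_r($results);
```

The first two inputs should be accepted and the last two rejected.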

Matching without lookbehind
You can write a regex for this without lookbehinds or lookaheads (which is usually recommended).
For example, suppose the requirement is to match any character except { and }, unless it is preceded by a \. You can rephrase that as:
Match any character except { and }, OR match \{ or \}. To match any character except { and }, use:
[^{}]
To match a \{ use:
\\\{
One backslash is for escaping the { (which might not be necessary, depending on your regex compiler) and one backslash is for escaping the other backslash.
You would end up with this:
(?:
[^{}]
|
\\\{
|
\\\}
)+
I formatted this regex so that it's readable. If you want to use it in your code like this, make sure to use the PCRE_EXTENDED (x) modifier.
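In PHP, the formatted version could look roughly like this. The ^ and $ anchors and the sample strings are additions for the sake of a runnable check, and the pattern covers only plain text and escaped braces, not the {user|1|S} placeholders:

```php
<?php
// The x modifier makes PCRE ignore whitespace in the pattern and treat
// "#" as the start of a comment that runs to the end of the line.
$pattern = '/^(?:
    [^{}]      # any character except braces
  | \\\\\{     # a backslash followed by an opening brace
  | \\\\\}     # a backslash followed by a closing brace
)+$/x';

$escaped   = preg_match($pattern, 'in \{the_bracket_bank\}');
$unescaped = preg_match($pattern, 'in {the_bracket_bank}');
var_dump($escaped, $unescaped);
```

Keep the delimiter character out of the comments, since PHP looks for the closing delimiter before PCRE ever sees them.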

Looks more like a job for a lookbehind to me:
/((?<!\\\\)\{[a-zA-Z0-9]+\|[0-9]+\|[SD][0-9]*\})/
However, the obfuscation factor is so high that I would rather recognize all bracketed strings and parse them later.

PHP preg_match_all strange behaviour with "/" character

Using :
preg_match_all(
    "/\b".$KeyWord."\b/u",
    $SearchStr,
    $Array1,
    PREG_OFFSET_CAPTURE);
This code works fine for all cases except when there is a / in the $KeyWord var. Then I get a warning and unsuccessful match of course.
Any idea how to work around this?
Thanks

use preg_quote() around the keyword.
http://us2.php.net/preg_quote
but also provide your delimiter, so it gets escaped: preg_quote($KeyWord, "/")

You must go through $KeyWord and add "\" before every special symbol; you can use preg_quote() for that.

Dynamic Values In Patterns
You are using a dynamic value inside the pattern. As with escaping for SQL or HTML, the value needs its own specific escaping. If you do not escape it, meta characters inside the value are interpreted by the regex engine. The escaping function for PCRE patterns is preg_quote().
preg_match_all(
    "(\b".preg_quote($KeyWord)."\b)u",
    $SearchStr,
    $Array1,
    PREG_OFFSET_CAPTURE
);
Delimiters
The syntax of a pattern in PHP's preg_* functions is:
DELIMITER PATTERN DELIMITER OPTIONS
The / is the delimiter in your pattern, so the / inside $KeyWord was recognized as the closing delimiter.
Any character that is not alphanumeric, a backslash or whitespace can be used as the delimiter. In Perl and JS you can define a regular expression directly (not as a string) using /, so it is often the default in tutorials.
Most delimiters have to be escaped inside the pattern.
Match a /: '/\//'
The exception to this rule is brackets. You can use any of the bracket pairs as the delimiter, and because it is a pair, the brackets can still be used inside the pattern.
Match a /: '(/)'
The () brackets are a good choice: you can think of them as "subpattern 0".
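A short sketch comparing the two delimiter styles on a keyword that contains a slash. The keyword and search string are invented for illustration:

```php
<?php
$KeyWord   = 'and/or';          // hypothetical keyword containing the / delimiter
$SearchStr = 'yes and/or no';

// Slash delimiter: pass "/" to preg_quote() so the slash in the keyword is escaped.
$n1 = preg_match_all(
    '/\b' . preg_quote($KeyWord, '/') . '\b/u',
    $SearchStr,
    $Array1,
    PREG_OFFSET_CAPTURE
);

// Bracket delimiter: the slash needs no escaping, and preg_quote()
// already escapes parentheses as regex metacharacters.
$n2 = preg_match_all(
    '(\b' . preg_quote($KeyWord) . '\b)u',
    $SearchStr,
    $Array2,
    PREG_OFFSET_CAPTURE
);

print_r($Array1);
```

Both calls should find the keyword once, at byte offset 4.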

You can use preg_quote to handle the slash character.
From the manual:
puts a backslash in front of every character that is part of the regular expression syntax
You can also pass the delimiter as the second parameter and it will also be escaped. However, if you're using # as your delimiter, then there's no need to escape /.
So, you can either use:
preg_match_all("/\b".preg_quote($KeyWord, "/")."\b/u", $SearchStr, $Array1, PREG_OFFSET_CAPTURE);
or, if you are sure that your keyword does not contain any other regex-special characters, you can simply change the delimiter, so that the slash no longer needs escaping:
preg_match_all("#\b".$KeyWord."\b#u", $SearchStr, $Array1, PREG_OFFSET_CAPTURE);

regexp solution for php

I have tried to work this out myself (even bought a Kindle book!), but I am struggling with backreferences in php.
What I want is like the following example:
$html = "hello %world|/worldlink/% again";
output:
hello world again
I tried stuff like:
preg_replace('/%([a-z]+)|([a-z]+)%/', '\1', $html);
but with no joy.
Any ideas please? I am sure someone will post the exact answer but I would like an explanation as well please - so that I don't have to keep asking these questions :)

The slashes "/" are not included in your allowed range [a-z]. Instead use
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your expression:
'/%([a-z]+)|([a-z]+)%/'
Is only capturing one thing. The | in the middle means "OR". You're trying to capture both, so you don't need an OR in there. You want a literal | symbol so you need to escape it:
'/%([a-z]+)\|([a-z\/]+)%/'
The / character also needs to be included in your char set, and escaped as above.

Your regex (/%([a-z]+)|([a-z]+)%/) reads this way:
Match % followed by + (= one or more) a-z characters (and store this into backreference #1).
Or (the |):
Match + (= one or more) a-z characters (and store this into backreference #2) followed by a %.
What you are looking for is:
preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
Basically I just escaped the | regex meta character (you can do this by either surrounding it with [] like I did or just prepending a backwards slash \, personally I find the former easier to read), and added a / to the second capture group.
I also changed your delimiters from / to ~ because tildes are much more unlikely to appear in strings, if you want to keep using / as your delimiter you also have to escape their occurrences in your regex.
It's also recommended that you use the $ syntax instead of \ in your replacement backreferences:
$replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one.
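Put together as a runnable snippet. The second call, which turns the placeholder into an HTML link using both backreferences, is only an illustrative extra and not something the question asked for:

```php
<?php
$html = 'hello %world|/worldlink/% again';

// $1 is the text before the |, $2 is the link after it.
$text = preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
echo $text, "\n";

// Hypothetical variation: keep both captures and build a link from them.
$link = preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '<a href="$2">$1</a>', $html);
echo $link, "\n";
```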

Here is a version that works according to the OPs data/information provided (using a non-slash delimiter to avoid escaping slashes):
preg_replace('#%([a-z]+)\|([a-z/]+)%#', '\1', $html);
Using a non slash delimiter, would alleviate the need to escape slashes.
Outputs:
hello world again
The Explanation
Why yours did not work: first, the | is an OR operator and, in your example, should be escaped. Second, since you expect slashes in the input, it is better to use a non-slash delimiter, such as #. Third, the slash needed to be added to the list of allowed characters. As stated before, you may want to allow a few more characters, as any word containing numbers, underscores, periods or hyphens will fail to match. Hopefully that is the explanation you were looking for.

Here's what works for me:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your regular expression doesn't escape the |, and doesn't include the proper characters for the URL.
Here's a basic live example supporting only a-z and slashes:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
In reality, you're going to want to change those [a-z]+ blocks to something more expressive. Do some searches for URL-matching regular expressions, and pick one that fits what you want.

$html = "hello %world|/worldlink/% again";
echo preg_replace('/([A-Za-z_ ]*)%(.+)\|(.+)%([A-Za-z_ ]*)/', '$1$2$4', $html);
output:
hello world again
Here is working code: http://www.ideone.com/0qhZ8

Including new lines in PHP preg_replace function

I'm trying to match a string that may appear over multiple lines. It starts and ends with a specific string:
{a}some string
can be multiple lines
{/a}
Can I grab everything between {a} and {/a} with a regex? It seems the . doesn't match new lines, but I've tried the following with no luck:
$template = preg_replace( $'/\{a\}([.\n]+)\{\/a\}/', 'X', $template, -1, $count );
echo $count; // prints 0
It matches . or \n when they're on their own, but not together!

Use the s modifier:
$template = preg_replace( '/\{a\}(.+)\{\/a\}/s', 'X', $template, -1, $count );
// The trailing s makes . match newlines; the dot must sit outside [ ] to keep its special meaning.
echo $count;

I think you've got more problems than just the dot not matching newlines, but let me start with a formatting recommendation. You can use just about any punctuation character as the regex delimiter, not just the slash ('/'). If you use another character, you won't have to escape slashes within the regex. I understand '%' is popular among PHPers; that would make your pattern argument:
'%\{a\}([.\n]+)\{/a\}%'
Now, the reason that regex didn't work as you intended is because the dot loses its special meaning when it appears inside a character class (the square brackets)--so [.\n] just matches a dot or a linefeed. What you were looking for was (?:.|\n), but I would have recommended matching the carriage-return as well as the linefeed:
'%\{a\}((?:.|[\r\n])+)\{/a\}%'
That's because the word "newline" can refer to the Unix-style "\n", Windows-style "\r\n", or older-Mac-style "\r". Any given web page may contain any of those or a mixture of two or more styles; a mix of "\n" and "\r\n" is very common. But with /s mode (also known as single-line or DOTALL mode), you don't need to worry about that:
'%\{a\}(.+)\{/a\}%s'
However, there's another problem with the original regex that's still present in this one: the + is greedy. That means, if there's more than one {a}...{/a} sequence in the text, the first time your regex is applied it will match all of them, from the first {a} to the last {/a}. The simplest way to fix that is to make the + ungreedy (a.k.a, "lazy" or "reluctant") by appending a question mark:
'%\{a\}(.+?)\{/a\}%s'
Finally, I don't know what to make of the '$' before the opening quote of your pattern argument. I don't do PHP, but that looks like a syntax error to me. If someone could educate me in this matter, I'd appreciate it.
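A small sketch of the greedy versus lazy difference with the s modifier. The template text is made up and contains two {a}...{/a} blocks, one of which spans a line break:

```php
<?php
$template = "x {a}one\ntwo{/a} y {a}three{/a} z";

// Greedy: .+ runs from the first {a} to the last {/a}, so there is one big match.
$greedy = preg_replace('%\{a\}(.+)\{/a\}%s', 'X', $template, -1, $greedyCount);

// Lazy: .+? stops at the nearest {/a}, so each block is matched separately.
$lazy = preg_replace('%\{a\}(.+?)\{/a\}%s', 'X', $template, -1, $lazyCount);

echo $greedy, " (", $greedyCount, ")\n";
echo $lazy, " (", $lazyCount, ")\n";
```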

From http://www.regular-expressions.info/dot.html:
"The dot matches a single character, without caring what that character is. The only exception are newline characters."
You will need to add a trailing s modifier to your expression.
