Regex to match characters that must be escaped in a PHP regex

I've had a look at this question, which shows what characters need to be escaped. However, I'm having a lot of trouble constructing a regex that will match any instance of one of those characters in a string.
For some background on the problem, I'm implementing a simple word-for-word (or term-for-term if you prefer) translation database where users enter language pairs, and can then trigger translations on blocks of text. The problem comes when users enter strings like "Yes/No". So, in PHP, I need to escape the string to be matched, and place it like this:
"/\b".$target."\b/"
So, what do I need to be looking at in terms of a preg_replace?

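You want to use preg_quote(). Pass your delimiter (here /) as its second argument: / is not part of regex syntax, so it is not escaped by default, and a term like "Yes/No" would otherwise end the pattern early. As the documentation states: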
preg_quote() takes str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
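Alternatively, wrap the string in \Q ... \E: whatever is between \Q and \E is treated as literal characters, not regular expression syntax. This breaks if the string itself contains \E, and it does not protect the delimiter, so preg_quote() is usually the safer choice for user input.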
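
As a minimal sketch of the preg_quote() approach, assuming / as the delimiter: the helper name buildTermPattern() and the sample strings are made up for illustration, not taken from the question.

```php
<?php
// Hypothetical helper: builds a whole-term pattern from user-supplied text.
function buildTermPattern(string $term): string
{
    // The second argument tells preg_quote() to escape the delimiter as well.
    return '/\b' . preg_quote($term, '/') . '\b/';
}

$pattern = buildTermPattern('Yes/No');
// $pattern is now /\bYes\/No\b/

$translated = preg_replace($pattern, 'Oui/Non', 'Answer Yes/No please');
// $translated is now "Answer Oui/Non please"
```

Keep in mind that \b only works when the term starts and ends with a word character, so a term such as "c++" would need a different boundary test.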

Related

php preg_match mismatch

I would like to know why preg_match('/(?<=\s)[^,]+(?=\s)/',$data,$matches);
matches "List Processes 8989" in the string "20180513 List Processes 8989". The regex I am using should not match numeric characters. What is wrong?

The [^,] basically means any character except ,. If you want to exclude numeric characters as well, you can replace it with [^,0-9], or better [^,\d], so your regex would look like this:
(?<=\s)[^,\d]+(?=\s)
I'm assuming the input string in your question is only part of the actual input string you're using because the regex you provided won't match the numbers at the end unless they're followed by a whitespace.
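
To see the difference between the two character classes, here is a small sketch. It assumes, as guessed above, that the real input has whitespace after the trailing number; the trailing space in $data is that assumption, not part of the question.

```php
<?php
// Assumed input: the question's string plus a trailing space.
$data = '20180513 List Processes 8989 ';

// Original class: anything except a comma, so digits are allowed.
preg_match('/(?<=\s)[^,]+(?=\s)/', $data, $original);
// $original[0] is "List Processes 8989"

// Revised class: anything except a comma or a digit.
preg_match('/(?<=\s)[^,\d]+(?=\s)/', $data, $revised);
// $revised[0] is "List Processes"
```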
References:
Negated Character Classes.
Difference between [0-9] and \d.

Regex to find string containing special characters in text

I'm trying to formulate a regular expression that will allow me to find a string within a piece of text, if the string exists on its own i.e. not within another word (but surrounded by special characters is ok).
/\bword\b/i
The above regex works fine, and finds "word" in the text. The problem comes when the word I want to find is something like "c++". In this case it matches on any occurrence of the "c" character on its own. I've tried escaping the "+" characters but it doesn't make any difference. I'm assuming because "+" is a non-word character, I'm possibly going down the wrong route and using word boundaries is not what I should be doing.
So I guess the question is: how can I use a regular expression to find a string in a piece of text, on its own, regardless of whether the string is alphanumeric or contains special characters? In the following piece of text it should match the 3 occurrences of "c++":
c++
(c++)
perl/c++/assembly
But it should not match on the following:
maniac++
c++abc
This is intended so that my script can tell if a specific skill exists within a user's CV/resume. I'm using this with PHP's preg_match_all() function.
I've done a lot of searching but can't come up with a solution, hopefully someone with good regex knowledge can help.

Try this:
/(?<!\w)(c\+\+)(?!\w)/
The (?<!\w) is a negative lookbehind clause, meaning that a word character should not immediately precede your pattern. The (?!\w) part is negative lookahead, meaning that a word character should not immediately follow.
Hope this helps!
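
A sketch of how the lookaround pattern could be built for any skill name and used with preg_match_all(). The helper countSkill() is hypothetical, and the i modifier is carried over from the question's original pattern.

```php
<?php
// Hypothetical helper: counts standalone occurrences of a skill in a text.
function countSkill(string $skill, string $text): int
{
    // preg_quote() escapes the "+" signs; the lookarounds reject matches
    // that have a word character directly before or after them.
    $pattern = '/(?<!\w)' . preg_quote($skill, '/') . '(?!\w)/i';
    return (int) preg_match_all($pattern, $text);
}

$cv = "c++\n(c++)\nperl/c++/assembly\nmaniac++\nc++abc";
$count = countSkill('c++', $cv);
// $count is 3: the first three lines match, the last two do not
```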

Why does this regex not validate in the same way in PHP?

When I try preg_match with the expression /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.

The site you reference, myregexp.com, is focussed on Java.
Java has a specific method for matching an exact pattern (Matcher.matches(), which only succeeds if the whole input matches), without needing to use anchor characters. This is the method which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
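
A quick way to check this yourself (the sample strings and variable names are arbitrary):

```php
<?php
$long = 'abcdefghij'; // 10 characters

// Unanchored: succeeds because "abcde" is found inside the string.
$unanchored = preg_match('/.{0,5}/', $long);   // 1

// Anchored: the whole string has to fit into 0-5 characters.
$anchored = preg_match('/^.{0,5}$/', $long);   // 0
$short    = preg_match('/^.{0,5}$/', 'abc');   // 1
```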
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression; you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: by default it matches any character except a new line, so if your string contains five characters, but one in the middle is a new line, it will fail your regex when you might have expected it to pass. (A single new line at the very end is a separate quirk: $ also matches just before it, so "abcde" followed by a new line passes even though it is six characters long; use \z instead of $, or add the D modifier, to rule that out.) With this in mind, strlen() would be a safer option (or mb_strlen() if you expect multibyte characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
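
For illustration, here are the options side by side on a five-character string that contains a new line. The variable names are made up for this sketch.

```php
<?php
$text = "ab\ncd"; // 5 characters, one of them a new line

$plainDot  = preg_match('/^.{0,5}$/', $text);      // 0: "." will not cross the new line
$dotAll    = preg_match('/^.{0,5}$/s', $text);     // 1: the s modifier lets "." match it
$charClass = preg_match('/^[\s\S]{0,5}$/', $text); // 1: the class matches any character
$byLength  = strlen($text) <= 5;                   // true: no regex needed at all
```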

You need to anchor it with ^ and $. These symbols match the beginning and end of the string respectively, so there must be 0-5 characters between the beginning and the end. Without the anchors, the pattern can match anywhere inside the string, so the string itself could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective, and note that it turns the dot into a capturing group.
/^(.){0,5}$/

Regex for netbios names

I'm having trouble figuring out how to build a regexp for verifying a NetBIOS name. According to the Microsoft standard, these characters are illegal:
\/:*?"<>|
So that's what I'm trying to detect. My regex looks like this:
^[\\\/:\*\?"\<\>\|]$
But that won't work.
Can anyone point me in the right direction? (not regexlib.com please...)
And if it matters, I'm using php with preg_match.
Thanks

Your regular expression has two problems:
You insist that the match should span the entire string. As Andrzej says, you are only matching strings of length 1.
You are quoting too many characters. In a character class (i.e. []), you only need to quote characters that are special within character classes, i.e. hyphen, closing square bracket, backslash and a leading caret (plus, in PHP, the pattern delimiter).
The following call works for me. Note the four backslashes: in a single-quoted PHP string they reach the regex engine as \\, which matches one literal backslash.
preg_match('/[\\\\\/:*?"<>|]/', "foo"); /* gives 0: does not include invalid characters */
preg_match('/[\\\\\/:*?"<>|]/', "f<oo"); /* gives 1: does include invalid characters */

As it stands at the moment, your regex will match the start of the string (^), then exactly one of the characters in the square brackets (i.e. the illegal characters), then the end of the string ($).
So this likely isn't working because a string of length > 1 will trivially fail to match the regex, and thus be considered OK.
You likely don't need the start and end anchors (the ^ and $). If you remove these, then the regex should match one of the bracketed characters occurring anywhere in the input text, which is what you want.
(Depending on the exact regex dialect, you may canonically need fewer backslashes within the square brackets, but they are unlikely to do any harm in any case).
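
Putting the advice together, a check could be sketched like this. The helper name is hypothetical, and ~ is chosen as the delimiter purely so that / needs no escaping; neither comes from the question.

```php
<?php
// Hypothetical helper: true if $name contains none of \ / : * ? " < > |
function isNameFreeOfIllegalChars(string $name): bool
{
    // No anchors: one illegal character anywhere is enough to reject the name.
    // The four backslashes in this single-quoted string reach the regex
    // engine as two, which match one literal backslash.
    return preg_match('~[\\\\/:*?"<>|]~', $name) === 0;
}

$plainOk     = isNameFreeOfIllegalChars('foo');   // true
$hasBracket  = isNameFreeOfIllegalChars('f<oo');  // false
$hasBackslash = isNameFreeOfIllegalChars('a\\b'); // false: the string is a\b
```

This only covers the illegal-character rule quoted in the question, not any other naming rules.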

Why don't reg expressions from regexlib.com work in PHP?

I found a regex on http://regexlib.com/REDetails.aspx?regexp_id=73
It's for matching a telephone number with international code like so:
^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
When I use it with PHP's preg_match, the expression fails. Why is that?

You need to surround it with / delimiters:
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$/', $phoneNumber)
And make sure you don't leave out the backslashes (\).

Because preg_match expects the regex to be delimited, usually with slashes (but, as correctly noted below, other characters are possible as long as they are matched):
preg_match('/^(\(?\+?[0-9]*\)?)?[0-9_ ()-]*$/', $subject)
Apart from that, the original regex was copied wrong - several characters were unescaped. The original on regexlib has a few warts, too (some characters were escaped needlessly).
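
As an illustration of the delimiter point, here is the cleaned-up pattern with ~ as the delimiter instead of /. The phone number and variable names are made-up sample data.

```php
<?php
// Any matching pair of non-alphanumeric, non-backslash characters can
// delimit the pattern; "~" is used here.
$pattern = '~^(\(?\+?[0-9]*\)?)?[0-9_ ()-]*$~';

$ok  = preg_match($pattern, '+44 (0)20 7946 0958'); // 1
$bad = preg_match($pattern, 'call me');             // 0
```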
