Regex to find string containing special characters in text

I'm trying to formulate a regular expression that will allow me to find a string within a piece of text, if the string exists on its own i.e. not within another word (but surrounded by special characters is ok).
/\bword\b/i
The above regex works fine and finds "word" in the text. The problem comes when the word I want to find is something like "c++". In this case it matches any occurrence of the "c" character on its own. I've tried escaping the "+" characters but it doesn't make any difference. I assume that, because "+" is a non-word character, word boundaries are the wrong tool for this.
So I guess the question is: how can I use a regular expression to find a string in a piece of text, on its own, regardless of whether the string is alphanumeric or contains special characters? In the following piece of text it should match the 3 occurrences of "c++":
c++
(c++)
perl/c++/assembly
But it should not match on the following:
maniac++
c++abc
This is intended so that my script can tell if a specific skill exists within a user's CV/resume. I'm using this with PHP's preg_match_all() function.
I've done a lot of searching but can't come up with a solution, hopefully someone with good regex knowledge can help.

Try this:
/(?<!\w)(c\+\+)(?!\w)/
The (?<!\w) is a negative lookbehind clause, meaning that a word character should not immediately precede your pattern. The (?!\w) part is negative lookahead, meaning that a word character should not immediately follow.
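To make the lookaround idea concrete, here is a minimal sketch run against the sample text from the question. The helper name findSkill() is made up for illustration, and preg_quote() is added on the assumption that the skill comes from user data rather than being hard-coded.

```php
<?php
// Hypothetical helper: return the byte offsets where $skill occurs as a
// standalone token (no word character directly before or after it).
function findSkill(string $skill, string $text): array
{
    // preg_quote() escapes regex metacharacters such as "+"; the second
    // argument additionally escapes the "/" delimiter.
    $pattern = '/(?<!\w)' . preg_quote($skill, '/') . '(?!\w)/i';
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // Each entry of $matches[0] is [matched text, offset]; keep the offsets.
    return array_column($matches[0], 1);
}

$text = "c++\n(c++)\nperl/c++/assembly\nmaniac++\nc++abc";
print_r(findSkill('c++', $text)); // offsets 0, 5 and 15
```

The three hits are the standalone "c++", the one in parentheses and the one between slashes; "maniac++" and "c++abc" are skipped because a word character touches the match.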
Hope this helps!

Related

Regular Expression that matches attribute units in attribute names including special characters

I am fairly new to using regular expressions and I am stuck on a problem that I am trying to solve. I have issues understanding what's going on and I hope that someone can hint me in the right direction.
What I am trying to achieve:
To avoid duplicates in the view, I want to check whether an attribute name already contains the respective attribute unit. For example, if $attribute['name'] = "Cutting speed (in m/Min.)" and $attribute['unit'] = "m/min", the attribute unit should not be displayed, as it is already mentioned in the name.
How I am trying to achieve this:
I am checking for the attribute unit by using the following regular expression: '~\b' . $attribute['unit'] . '\b~i'
This works well for the above-mentioned example, but not so well if the unit is a special character, like % or ", for instance.
The Problems
While testing for the special character issue I came across the following phenomenon:
if I use this regex /\b%\b/ it behaves not as expected and matches the % in bla%bla but not the % if it is preceded or followed by a space: https://regex101.com/r/56iYEI/3
It seems like the % turns the behavior of the regex to its opposite. I tested with other "special characters" as well (" and &), and they seem to have the same effect.
I was directed to this question (Regular Expression Word Boundary and Special Characters) before and read the answers. I now understand that \b checks for word boundaries. But it is still unclear to me why it behaves the way it does as soon as a % or " turns up.
The questions
Why does a % invert the word-boundary check performed by \b?
How can I achieve my goal to match for alphanumeric units as well as for special character units, like % or "?
Looking forward to any hints. Thanks in advance!

A word break is a point between a string of word characters and a string of non-word characters (or start or end). The non-word characters don't have to be a space.
foo"##bar {}qux
In this string the word breaks are before and after foo, bar, and qux.
The expression /\b"##\b/ will match chars between foo and bar. However /\b"#\b/ will not because there is no word (and thus no word break) after the #.
To solve this, check for either a word break or a non-word character. The following expression matches both cases: /(^|\W|\b)"#($|\W|\b)/.
'~(^|\W|\b)' . $attribute['unit'] . '($|\W|\b)~i'
P.S. If $attribute['unit'] can contain arbitrary characters, be sure to escape it with preg_quote() before using it in the regex, passing ~ as the second argument so that the delimiter is escaped as well.
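Put together, the check could look roughly like the sketch below. The function name nameContainsUnit() and the sample attribute names other than the one from the question are invented for illustration.

```php
<?php
// Hypothetical helper: true when $name already mentions $unit.
function nameContainsUnit(string $name, string $unit): bool
{
    // Escape the unit so that characters like "%" or "." are literal;
    // "~" is passed because it is the delimiter used below.
    $quoted = preg_quote($unit, '~');
    return preg_match('~(^|\W|\b)' . $quoted . '($|\W|\b)~i', $name) === 1;
}

var_dump(nameContainsUnit('Cutting speed (in m/Min.)', 'm/min')); // bool(true)
var_dump(nameContainsUnit('Humidity in %', '%'));                 // bool(true)
var_dump(nameContainsUnit('Cutting speed', 'm/min'));             // bool(false)
var_dump(nameContainsUnit('Duration in mins', 'min'));            // bool(false)
```

The last call shows that an alphanumeric unit still has to stand on its own: "min" inside "mins" is rejected because neither a non-word character nor a word break follows it.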

Match 'exclamation mark' character 'not immediately preceded by a word'

I want to delete every ! character from a string that is not immediately preceded by a word. To accomplish this task, I was thinking about preg_replace() to perform a Regex match.
That is, I'd like the following blasphemy of a text:
search! query ! !key!words that! acc!ept exclamation! marks!
... to become:
search! query keywords that! accept exclamation! marks!
There is no need to take double+ occurrences into account, since I filter those out using (![!]+) - although if someone knows of a solution that takes double+ occurrences into consideration, I'd be more than glad to welcome it, since it removes the need for an extra lookup.
So far I have (!\b)|(\s+!\s+)|(!\s+!) which - besides being a bit whacky in my opinion - works almost perfectly, but sometimes removes spacing between words, producing the result of
search! querykeywords that! accept exclamation! marks!
EDIT
I need to take accented and/or uppercase characters into consideration when parsing the string.

You want to remove an ! when
there's no word break before it (as in foo !)
or there is a word break after it (as in !foo)
That gives:
\B!|!\b
https://regex101.com/r/xF7bG6/1
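Here is a small sketch of that pattern applied with preg_replace() to the sentence from the question. The function name is made up, and the second replacement (collapsing leftover whitespace) is an addition of mine, since removing a free-standing ! leaves two spaces behind.

```php
<?php
// Hypothetical helper: drop every "!" that does not directly follow a word.
function stripStrayBangs(string $s): string
{
    // \B! : "!" with no word break before it (e.g. after a space or another "!")
    // !\b : "!" with a word break after it (e.g. "!key" or "key!words")
    $s = preg_replace('/\B!|!\b/', '', $s);
    // Assumption: runs of whitespace left behind should become one space.
    return preg_replace('/\s{2,}/', ' ', $s);
}

$input = 'search! query ! !key!words that! acc!ept exclamation! marks!';
echo stripStrayBangs($input), "\n";
// search! query keywords that! accept exclamation! marks!
```

Because a second ! is itself preceded by a non-word character, a run like "foo!!" is reduced to "foo!" by the same pattern. Accented letters would presumably need the u modifier so that \b treats them as word characters; that is not covered by this sketch.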

([^a-z])\!+|\!+([a-z]), with a replacement of $1$2, should match runs of ! that are preceded by a non-letter or have a letter immediately after. Add the i modifier if uppercase letters should count as letters too.
If your regular expression flavor supports lookaheads/lookbehinds, then you can use (?<=[^a-z])\!+|\!+(?=[a-z]) with an empty replacement string.

Custom definition for word boundary for words that begin or end with non-word characters

I have an array of words that contains strings like "DOM", ".Net" and "C++". I'm trying to perform a whole-word match for each of these strings in some text, by using the word boundary assertion. If the words are read into a variable, it would look like:
preg_match("/\b".preg_quote($word)."\b/",...)
This works fine for an example like "DOM", but not for ".Net" or "C++" because word boundary is also seen at . in case of .Net and is already seen at + in case of C++. Is there an alternative way in regular expressions in PHP to treat .Net or C++ as "words" for word boundary?

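This cannot be done with \b, since \b is defined in terms of the fixed \w and \W classes and cannot be told to treat . or + as word characters.
What you could do instead is require that the word is surrounded by characters that are not in the set you define to be word characters (or by the start or end of the text), as shown below:
preg_match('/(^|[^a-zA-Z_.+])' . preg_quote($word, '/') . '([^a-zA-Z_.+]|$)/', ...);
Edit: The character class has to be written on both sides. A backreference such as \1 would only match the exact character captured before the word, not any character from the class.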
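A variation on the same idea uses lookarounds instead of consuming the neighbouring characters, so matches at the very start or end of the text need no special case. This is only a sketch: the helper name and the exact set of "word" characters (letters, digits, _, ., + and #) are assumptions to adapt to your own vocabulary.

```php
<?php
// Hypothetical helper: whole-"word" match with a custom notion of word character.
function containsTerm(string $term, string $text): bool
{
    // Assumed set of characters that may be part of a term.
    $wordChar = '[A-Za-z0-9_.+#]';
    $pattern = '/(?<!' . $wordChar . ')'
             . preg_quote($term, '/')
             . '(?!' . $wordChar . ')/';
    return preg_match($pattern, $text) === 1;
}

var_dump(containsTerm('C++', 'I know C++ and DOM'));  // bool(true)
var_dump(containsTerm('.Net', 'C# and .Net'));        // bool(true)
var_dump(containsTerm('.Net', 'ASP.Net developer'));  // bool(false)
var_dump(containsTerm('DOM', 'RANDOM'));              // bool(false)
```

One trade-off of counting "." as a word character: a term directly followed by a sentence-ending full stop ("I like C++.") is not matched, so you may want different sets for the left and right side.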

Character classes: let's say you only want to accept spaces and commas around the word; then you would do this:
preg_match("/[, ]".preg_quote($word)."[, ]/",...)

PHP Regex for checking space or certain characters after string

I need a regex which can basically check for space, line break etc after string.
So conditions are,
Allow special characters ., _, -, + inside the string, i.e. #hello.world, #hello_world, #helloworld, etc.
Discard anything including special characters where there is no alphanumeric string after them, i.e. #helloworld.<space>, #helloworld-<space>, #helloworld.?, etc. must be parsed as #helloworld
My existing RegEx is /#([A-Za-z0-9+_.-]+)/ which works perfectly for Condition #1, but there still seems to be a problem with Condition #2
I am using above RegEx in preg_replace()
Solution:
$str = preg_replace('/#[\w+.\-]+\b/', '[[$0]]', $str);
This works perfectly.
Tested with
http://gskinner.com/RegExr/

You can use word boundaries to easily find the position between a word character and a non-word character:
$str = preg_replace('/#[\w+.\-]+\b/', '[[$0]]', $str);
Working example: http://ideone.com/0ShCm

Here's an idea:
Use strrev to reverse the string
Use strcspn to find the longest prefix of the reversed string that does not contain any alphanumeric characters
Cut the prefix off with substr
Reverse the string again; this is your final result
See it in action.
I'm not taking into account any requirement that restricts the legal characters in the string to some subset, but you can use your regular expression for that (or even strspn, which might be faster).
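The four steps above can be sketched as follows. The function name is made up, and "alphanumeric" is assumed to mean ASCII letters and digits only; strrev() works on bytes, so this is not safe for multibyte text.

```php
<?php
// Hypothetical helper: cut off everything after the last alphanumeric character.
function trimTrailingNonAlnum(string $s): string
{
    // Assumed definition of "alphanumeric": ASCII letters and digits.
    $alnum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $reversed = strrev($s);
    // Length of the leading run of the reversed string without alphanumerics.
    $skip = strcspn($reversed, $alnum);
    // Drop that run, then restore the original order.
    return strrev(substr($reversed, $skip));
}

echo trimTrailingNonAlnum('#helloworld.'), "\n";  // #helloworld
echo trimTrailingNonAlnum('#helloworld.?'), "\n"; // #helloworld
echo trimTrailingNonAlnum('#hello.world'), "\n";  // #hello.world
```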

The reason is that it's reading the string as a whole. If you want to strip everything after the alphanumeric section, you might have to do something like end(explode()) and check whether the last piece is valid, removing it if it isn't; but then you'd have to check the end for every possible explode point, i.e. ., -, ~, etc.
Then again, another trap that you might run into is that, in the case of an item or anything with an alphanumeric value, it might just parse everything from after the last alphanumeric character on.
Sorry that this isn't much help, but I figured thinking aloud does help.

Why does this regex not validate in the same way in PHP?

When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.

The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
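A quick way to see the difference is to run both patterns against a string that is too long. The sample strings below are made up for illustration.

```php
<?php
$long = 'abcdefgh'; // 8 characters

// Unanchored: succeeds because "abcde" is found inside the string.
var_dump(preg_match('/.{0,5}/', $long));   // int(1)

// Anchored: the whole string has to fit into 0 to 5 characters.
var_dump(preg_match('/^.{0,5}$/', $long)); // int(0)
var_dump(preg_match('/^.{0,5}$/', 'abc')); // int(1)
```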
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
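The effect of a newline on the dot, and the two workarounds, can be checked with a short sketch like this one (the test string is an arbitrary example):

```php
<?php
$s = "ab\ncd"; // five characters, one of them a newline

var_dump(preg_match('/^.{0,5}$/', $s));      // int(0): the dot stops at "\n"
var_dump(preg_match('/^.{0,5}$/s', $s));     // int(1): "s" lets the dot match "\n"
var_dump(preg_match('/^[\s\S]{0,5}$/', $s)); // int(1): [\s\S] matches any character
var_dump(strlen($s) <= 5);                   // bool(true): no regex needed
```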
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.

You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/
