Regular Expression that matches attribute units in attribute names including special characters

I am fairly new to using regular expressions and I am stuck on a problem that I am trying to solve. I have trouble understanding what's going on and I hope that someone can point me in the right direction.
What I am trying to achieve:
To avoid duplicates in the view, I want to check if an attribute name already contains the respective attribute unit. For example, if $attribute['name'] = "Cutting speed (in m/Min.)" and $attribute['unit'] = "m/min", the attribute unit should not be displayed, as it is already mentioned in the name.
How I am trying to achieve this:
I am checking for the attribute unit by using the following regular expression: '~\b' . $attribute['unit'] . '\b~i'
This works well for the above-mentioned example, but not so well if the unit is a special character, like % or ", for instance.
The Problems
While testing for the special character issue I came across the following phenomenon:
If I use the regex /\b%\b/, it does not behave as expected: it matches the % in bla%bla, but not a % that is preceded or followed by a space: https://regex101.com/r/56iYEI/3
It seems like the % turns the behavior of the regex to its opposite. I tested with other "special characters" as well (" and &), and they seem to have the same effect.
I was directed to this question (Regular Expression Word Boundary and Special Characters) before and read the answers. I now understand that \b checks for word boundaries. But it is still unclear to me why it behaves the way it does as soon as a % or " turns up.
The questions
Why does a % invert the word-boundary check performed by \b?
How can I achieve my goal to match for alphanumeric units as well as for special character units, like % or "?
Looking forward to any hints. Thanks in advance!

A word break is a point between a string of word characters and a string of non-word characters (or start or end). The non-word characters don't have to be a space.
foo"##bar {}qux
In this string the word breaks are before and after foo, bar, and qux.
The expression /\b"##\b/ will match chars between foo and bar. However /\b"#\b/ will not because there is no word (and thus no word break) after the #.
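A minimal sketch of this behaviour, using the patterns and sample strings from above. The helper name hasMatch() is hypothetical and only wraps preg_match():

```php
<?php
// Hypothetical helper: reports whether $pattern finds a match in $subject.
function hasMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// % is a non-word character, so \b%\b needs a word character on BOTH sides.
var_dump(hasMatch('/\b%\b/', 'bla%bla'));   // bool(true)
var_dump(hasMatch('/\b%\b/', 'bla % bla')); // bool(false)

// The same logic applied to the foo"##bar example.
var_dump(hasMatch('/\b"##\b/', 'foo"##bar {}qux')); // bool(true)
var_dump(hasMatch('/\b"#\b/', 'foo"##bar {}qux'));  // bool(false)
```

This is why the check seems "inverted" for % or ": \b never matches between two non-word characters such as a space and a %.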
To solve this, check for either a word break or a non-word character. The following expression matches both cases: /(^|\W|\b)"#($|\W|\b)/.
'~(^|\W|\b)' . $attribute['unit'] . '($|\W|\b)~i'
P.S. If $attribute['unit'] can contain any characters, be sure to quote it with preg_quote() before using it in the regex, passing the delimiter (~ here) as the second argument.
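Put together, the check could look roughly like the sketch below. The function name nameMentionsUnit() and the sample attribute names other than the one from the question are assumptions made for illustration:

```php
<?php
// Hypothetical helper: true if $unit already appears in $name as a unit of
// its own, not as part of a longer word.
function nameMentionsUnit(string $name, string $unit): bool
{
    // Escape regex metacharacters and the ~ delimiter inside the unit.
    $quoted = preg_quote($unit, '~');

    // Accept start/end of string, a non-word character, or a word break
    // on either side of the unit.
    $pattern = '~(^|\W|\b)' . $quoted . '($|\W|\b)~i';

    return preg_match($pattern, $name) === 1;
}

var_dump(nameMentionsUnit('Cutting speed (in m/Min.)', 'm/min')); // bool(true)
var_dump(nameMentionsUnit('Humidity in %', '%'));                 // bool(true)
var_dump(nameMentionsUnit('Diameter (mm)', 'm'));                 // bool(false)
```

The unit would then only be appended in the view when the function returns false.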

Related

Match 'exclamation mark' character 'not immediately preceded by a word'

I want to delete every ! character from a string that is not immediately preceded by a word. To accomplish this task, I was thinking about preg_replace() to perform a Regex match.
That is, I'd like the following blasphemy of a text:
search! query ! !key!words that! acc!ept exclamation! marks!
... to become:
search! query keywords that! accept exclamation! marks!
There is no need to take double+ occurrences into account, since I filter those out using (![!]+) - although if someone knows of a solution that takes double+ occurrences into consideration, I'd be more than glad to welcome it, since it removes the need for an extra lookup.
So far I have (!\b)|(\s+!\s+)|(!\s+!) which - besides being a bit whacky in my opinion - works almost perfectly, but sometimes removes spacing between words, producing the result of
search! querykeywords that! accept exclamation! marks!
EDIT
I need to take accented and/or uppercase characters into consideration when parsing the string.

You want to remove an ! when
there's no word break before it (as in foo !)
or there is a word break after it (as in !foo)
That gives:
\B!|!\b
https://regex101.com/r/xF7bG6/1
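A short sketch of this with preg_replace(). The function name is hypothetical, and the u modifier is an assumption added here so that accented letters count as word characters, as the question's edit asks:

```php
<?php
// Hypothetical helper: removes every ! that has no word break before it
// or that has a word break after it.
function stripStrayExclamations(string $text): string
{
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace('/\B!|!\b/u', '', $text) ?? $text;
}

$input = 'search! query ! !key!words that! acc!ept exclamation! marks!';
echo stripStrayExclamations($input), "\n";
```

Note that only the ! characters are removed. The spaces that surrounded a stand-alone ! stay in place, so "query ! !key" ends up with two spaces between the words; collapsing runs of whitespace would be a separate step.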

([^a-z])\!+|\!+([a-z]), with a replacement of $1$2 should match multiple !'s that are not preceded by a letter (\W) or have a letter immediately after (\w).
If your regular expression language takes positive lookaheads/lookbehinds, then you can use (?<=[^a-z])\!+|\!+(?=[a-z]) with no replacement string.

Regex to find string containing special characters in text

I'm trying to formulate a regular expression that will allow me to find a string within a piece of text, if the string exists on its own i.e. not within another word (but surrounded by special characters is ok).
/\bword\b/i
The above regex works fine, and finds "word" in the text. The problem comes when the word I want to find is something like "c++". In this case it matches on any occurrence of the "c" character on its own. I've tried escaping the "+" characters but it doesn't make any difference. I'm assuming that because "+" is a non-word character, I'm possibly going down the wrong route and using word boundaries is not what I should be doing.
So I guess the question is, how can I use a regular expression to find a string in a piece of text, on its own, and regardless of whether the string is alphanumeric or contains special characters. So in the following piece of text it should match on the 3 occurrences of "c++":
c++
(c++)
perl/c++/assembly
But it should not match on the following:
maniac++
c++abc
This is intended so that my script can tell if a specific skill exists within a user's CV/resume. I'm using this with PHP's preg_match_all() function.
I've done a lot of searching but can't come up with a solution, hopefully someone with good regex knowledge can help.

Try this:
/(?<!\w)(c\+\+)(?!\w)/
The (?<!\w) is a negative lookbehind clause, meaning that a word character should not immediately precede your pattern. The (?!\w) part is negative lookahead, meaning that a word character should not immediately follow.
Hope this helps!
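For an arbitrary skill name, the same lookarounds can be combined with preg_quote(). The sketch below is one way to do that; countSkill() is a hypothetical name, and the i modifier is carried over from the question's original pattern:

```php
<?php
// Hypothetical helper: counts stand-alone occurrences of $skill in $text.
function countSkill(string $skill, string $text): int
{
    // Escape regex metacharacters (such as +) and the / delimiter.
    $pattern = '/(?<!\w)' . preg_quote($skill, '/') . '(?!\w)/i';

    // preg_match_all() returns the number of full matches, or false on error.
    return (int) preg_match_all($pattern, $text);
}

$shouldMatch = "c++\n(c++)\nperl/c++/assembly";
$shouldNotMatch = "maniac++\nc++abc";

var_dump(countSkill('c++', $shouldMatch));    // int(3)
var_dump(countSkill('c++', $shouldNotMatch)); // int(0)
```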

Validate unicode textarea for minimum length

I have to validate Russian text (utf8) entered in textarea field of the form. The number of characters (no spaces, no empty lines) should be at least 500. The text should be checked with regex and can have many lines.
I have tried:
#^.{500}.*#
This indeed makes the restriction somehow. However, it seems that this pattern does not respect unicode. 260 Russian characters are enough to pass the check. I cannot figure out how to:
check unicode characters
do not count white spaces
do not count empty lines

Okay, so firstly . by default matches bytes, because the input string is interpreted as ASCII. Using Unicode mode changes that (as Esailija correctly pointed out), so that . correctly matches (Unicode) characters:
#^.{500}#u
You don't need the trailing .*, because there is no need to match the full string in PHP. Note that this does not match if there is a line-break within the first 500 characters, because . does not match line-breaks (you should add the s modifier as well, to change that).
For the second requirement to exclude whitespace from the count, you could do something like this:
#^(?:\s*\S){500}#u
That subgroup matches as many whitespace characters as possible, and then one non-whitespace character. And that together has to be matched 500 times. Hence, you get exactly one repetition per non-whitespace character, as required.
Note that there is no need for the s modifier for this to work under all circumstances, because the pattern does not use the dot at all.
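As a rough sketch, the pattern can be wrapped in a small function with the minimum passed in as a parameter. The function name and the short sample text are assumptions for illustration, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper: true if $text contains at least $min non-whitespace
// characters, counted as Unicode code points thanks to the u modifier.
function hasMinNonWhitespace(string $text, int $min): bool
{
    return preg_match('#^(?:\s*\S){' . $min . '}#u', $text) === 1;
}

// 9 Cyrillic letters, one space and two line breaks.
$sample = "привет\n\n мир";

var_dump(hasMinNonWhitespace($sample, 9));  // bool(true)
var_dump(hasMinNonWhitespace($sample, 10)); // bool(false)
```

For the form in the question, the call would be hasMinNonWhitespace($text, 500).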
There is one caveat, though, which is explained in this article. With Unicode, some characters are made up of multiple code points. For instance, à can be written as the character a followed by another code point (U+0300 or `), which is a combining mark. So while there are two different Unicode code points, they still form only one character. However, . matches code points (because it doesn't distinguish between combining marks and "stand-alone characters"). I suppose that will not affect your situation, since Cyrillic doesn't use accents. But it's something worth being aware of. If it is relevant for you, you might want to look into a more advanced solution like Ωmega's.

You need the u flag to activate UTF-8 awareness in preg_ functions:
$regex = '#^.{500}.*#u';
If you just want to see if it's 500 characters long, you can just use mb_strlen:
mb_internal_encoding("UTF-8");
$input_without_whitespace = preg_replace( '/[\x{0009}\x{000B}\x{000C}\x{0020}\x{00A0}\x{FEFF}\x{200C}\x{200D}]/u', "", $input );
if( mb_strlen( $input_without_whitespace ) >= 500 ) {
    // the input has at least 500 characters once the listed whitespace is removed
}

Use regex pattern
/(?>\s*+\P{M}\p{M}*){500}/u
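To see what this pattern changes, the sketch below counts a base character plus its combining marks as a single unit. The function name is hypothetical, and the test string uses a decomposed à (a followed by U+0300), which needs PHP 7 or later for the \u{...} escape:

```php
<?php
// Hypothetical helper: true if $text contains at least $min "characters",
// where a character is one non-mark code point plus any combining marks
// that follow it, and leading whitespace is skipped.
function hasMinCharacters(string $text, int $min): bool
{
    return preg_match('/(?>\s*+\P{M}\p{M}*){' . $min . '}/u', $text) === 1;
}

// a + combining grave accent, a space, then b: 4 code points, 2 characters.
$decomposed = "a\u{0300} b";

var_dump(hasMinCharacters($decomposed, 2));          // bool(true)
var_dump(hasMinCharacters($decomposed, 3));          // bool(false)
var_dump(preg_match('#^.{4}#u', $decomposed) === 1); // bool(true): the dot counts code points
```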

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?

Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.

First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three. The + is also "greedy": the first [0-9]+ initially grabs every digit it can, and the engine then has to backtrack to leave one digit each for the next two.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
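A small sketch of how the anchored pattern behaves with preg_match(). The function name isValidCode() is hypothetical:

```php
<?php
// Hypothetical helper: true if $value is exactly 3 digits, a slash,
// and 5 digits.
function isValidCode(string $value): bool
{
    // The slash is escaped because / is also the pattern delimiter.
    return preg_match('/^\d{3}\/\d{5}$/', $value) === 1;
}

var_dump(isValidCode('123/45678'));   // bool(true)
var_dump(isValidCode('1234/567890')); // bool(false)
var_dump(isValidCode('123/4567'));    // bool(false)
```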

I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.

Allow + in regex email validate email [duplicate]

Regex is blowing my mind. How can I change this to validate emails with a plus sign, so I can sign up with test+spam@gmail.com?
if(!preg_match("/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*$/i", $_GET['em'])) {

It seems like you aren't really familiar with what your regex is currently doing, and understanding that would be a good first step before modifying it. Let's walk through your regex using the email address john.robert.smith@mail.com (each item below notes which part of the address it matches):
1. ^ is the start of string anchor. It specifies that any match must begin at the beginning of the string. If the pattern is not anchored, the regex engine can match a substring, which is often undesired. Anchors are zero-width, meaning that they do not capture any characters.
2. [_a-z0-9-]+ is made up of two elements, a character class and a repetition modifier:
   [...] defines a character class, which tells the regex engine that any of these characters are valid matches. In this case the class contains the characters a-z, the numbers 0-9, and the dash and underscore (in general, a dash in a character class defines a range, so you can use a-z instead of abcdefghijklmnopqrstuvwxyz; when given as the last character in the class, it acts as a literal dash).
   + is a repetition modifier that specifies that the preceding token (in this case, the character class) can be repeated one or more times. There are two other repetition operators: * matches zero or more times; ? matches exactly zero or one times (i.e. makes something optional).
   (captures "john" in john.robert.smith@mail.com)
3. (\.[_a-z0-9-]+)* again contains a repeated character class. It also contains a group and an escaped character:
   (...) defines a group, which allows you to group multiple tokens together (in this case, the group will be repeated as a whole). Let's say we wanted to match 'abc', zero or more times (i.e. abcabcabc matches, abcccc doesn't). If we tried to use the pattern abc*, the repetition modifier would only apply to the c, because c is the last token before the modifier. In order to get around this, we can group abc ((abc)*), in which case the modifier would apply to the entire group, as if it was a single token.
   \. specifies a literal dot character. The reason this is needed is because . is a special character in regex, meaning any character. Since we want to match an actual dot character, we need to escape it.
   (captures ".robert.smith" in john.robert.smith@mail.com)
4. @ is not a special character in regex, so, like all other non-special characters, it matches literally.
   (captures "@" in john.robert.smith@mail.com)
5. [a-z0-9-]+ again defines a repeated character class, like item #2 above.
   (captures "mail" in john.robert.smith@mail.com)
6. (\.[a-z0-9-]+)* is almost exactly the same pattern as #3 above.
   (captures ".com" in john.robert.smith@mail.com)
7. $ is the end of string anchor. It works the same as ^ above, except that it matches the end of the string.
With that in mind, it should be a bit clearer how to add a section that captures a plus segment. As we saw above, + is a special character, so it has to be escaped. Then, since the + has to be followed by some characters, we can define a character class with the characters we want to match and define its repetition. Finally, we should make the whole group optional, because email addresses don't need to have a + segment:
(\+[a-z0-9-]+)?
When inserted into your regex, it'd look like this:
/^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?@[a-z0-9-]+(\.[a-z0-9-]+)*$/i
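A quick way to check the extended pattern, written with @ as the separator between local part and domain, is to run it against a few addresses. The function name acceptsAddress() is hypothetical; the pattern inside it is the one built up in this answer:

```php
<?php
// Hypothetical helper wrapping the extended pattern from this answer.
function acceptsAddress(string $email): bool
{
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*(\+[a-z0-9-]+)?@[a-z0-9-]+(\.[a-z0-9-]+)*$/i';

    return preg_match($pattern, $email) === 1;
}

var_dump(acceptsAddress('test+spam@gmail.com'));        // bool(true)
var_dump(acceptsAddress('john.robert.smith@mail.com')); // bool(true)
var_dump(acceptsAddress('test+@gmail.com'));            // bool(false): + needs characters after it
```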

Save your sanity. Get a pre-made PHP RFC 822 Email address parser

I've used this regex to validate emails, and it works just fine with emails that contain a +:
/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/

\+ will match a literal + sign, but be aware: You still won't be close to matching all possible email addresses according to the RFC spec, because the actual regex for that is madness. It's almost certainly not worth it; you should use a real email parser for this.
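One readily available alternative to a hand-written pattern is PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL. It is not a full RFC parser either, so treat the sketch below as one option rather than a complete answer; the wrapper name isPlausibleEmail() is hypothetical:

```php
<?php
// Hypothetical wrapper around PHP's built-in email filter.
function isPlausibleEmail(string $email): bool
{
    // filter_var() returns the filtered string on success, false on failure.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(isPlausibleEmail('test+spam@gmail.com'));    // bool(true)
var_dump(isPlausibleEmail('no-at-sign.example.com')); // bool(false)
```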

This is another solution (similar to the solution found by David):
//Escaped for .Net
^[_a-zA-Z0-9-]+((\\.[_a-zA-Z0-9-]+)*|(\\+[_a-zA-Z0-9-]+)*)*@[a-zA-Z0-9-]+(\\.[a-zA-Z0-9-]+)*(\\.[a-zA-Z]{2,4})$
//Native
^[_a-zA-Z0-9-]+((\.[_a-zA-Z0-9-]+)*|(\+[_a-zA-Z0-9-]+)*)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$

This is another solution:
/^[_a-z0-9-+]+(\.[_a-z0-9-+]+)*(\+[a-z0-9-]+)?@[a-z0-9-.]+(\.[a-z0-9]+)$/
or, for a Razor page (where @ is written as \u0040):
/^[_a-z0-9-+]+(\.[_a-z0-9-+]+)*(\+[a-z0-9-]+)?\u0040[a-z0-9-.]+(\.[a-z0-9]+)$/
