I have to validate Russian text (UTF-8) entered in a textarea field of a form. The number of characters (not counting spaces or empty lines) should be at least 500. The text should be checked with a regex and can span many lines.
I have tried:
#^.{500}.*#
This does impose a limit of sorts. However, the pattern does not seem to respect Unicode: 260 Russian characters are enough to pass the check. I cannot figure out how to:
check unicode characters
do not count white spaces
do not count empty lines
Okay, so firstly, . matches bytes by default, because without Unicode mode the input string is treated as a sequence of single-byte characters. Using Unicode mode changes that (as Esailija correctly pointed out), so that . correctly matches (Unicode) characters:
#^.{500}#u
You don't need the trailing .*, because there is no need to match the full string in PHP. Note that this does not match if there is a line-break within the first 500 characters, because . does not match line-breaks (you should add the s modifier as well, to change that).
For the second requirement to exclude whitespace from the count, you could do something like this:
#^(?:\s*\S){500}#u
That subgroup matches as many whitespace characters as possible, and then one non-whitespace character. And that together has to be matched 500 times. Hence, you only get one repetition per non-whitespace character, as required.
Note that there is no need for the s modifier for this to work under all circumstances, because we don't use . here.
There is one caveat, though, which is explained in this article. With Unicode some characters are made up of multiple code points. For instance, à can be written as one character a followed by another code point (U+0300, the combining grave accent), which is a combining mark. So while there are two different Unicode code points, they are still only one character. However, . matches code points (because it doesn't distinguish between combining marks and "stand-alone characters"), and so does \S. I suppose that will rarely affect your situation, since Russian text normally uses precomposed letters; note, though, that й and ё can also be encoded as a base letter plus a combining mark (U+0306 and U+0308, respectively), in which case each would be counted twice. It's something worth being aware of. If it is relevant for you, you might want to look into a more advanced solution like Ωmega's.
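As a minimal sketch of the whitespace-skipping pattern from this answer, the check could be wrapped in a small helper. The function name hasMinNonSpaceChars and its $min parameter are made up for illustration (the question would pass 500); a small threshold is used here only to keep the example short.

```php
<?php
// Hypothetical helper: true if $text contains at least $min non-whitespace
// code points. The u modifier makes PCRE treat $text as UTF-8.
function hasMinNonSpaceChars($text, $min)
{
    $pattern = '#^(?:\s*\S){' . (int) $min . '}#u';
    return preg_match($pattern, $text) === 1;
}

// 9 Cyrillic letters, separated by a space and an empty line
$sample = "Привет\n\n мир";

var_dump(hasMinNonSpaceChars($sample, 9));  // bool(true)
var_dump(hasMinNonSpaceChars($sample, 10)); // bool(false)
```

Because the count is built into the quantifier, no separate whitespace-stripping step is needed.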
You need the u flag to activate UTF-8 awareness in preg_ functions:
$regex = '#^.{500}.*#u';
If you just want to see if it's 500 characters long, you can just use mb_strlen:
mb_internal_encoding("UTF-8");
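// \x{000A} and \x{000D} are included so that line breaks are not counted either
$input_without_whitespace = preg_replace( '/[\x{0009}\x{000A}\x{000B}\x{000C}\x{000D}\x{0020}\x{00A0}\x{FEFF}\x{200C}\x{200D}]/u', "", $input );
if( mb_strlen( $input_without_whitespace ) >= 500 ) {
    // the text contains at least 500 non-whitespace characters
}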
Use regex pattern
/(?>\s*+\P{M}\p{M}*){500}/u
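To illustrate what this pattern buys you, here is a small sketch with decomposed letters. The helper name hasMinVisibleChars and the $min parameter are assumptions for the example; the pattern itself is the one above with the count made configurable.

```php
<?php
// Hypothetical helper built around the pattern above: each repetition skips
// whitespace, then takes one non-mark code point plus any combining marks.
function hasMinVisibleChars($text, $min)
{
    $pattern = '/(?>\s*+\P{M}\p{M}*){' . (int) $min . '}/u';
    return preg_match($pattern, $text) === 1;
}

// Two decomposed letters separated by a space:
// U+0438 (bytes D0 B8) followed by the combining breve U+0306 (bytes CC 86)
$decomposed = "\xD0\xB8\xCC\x86 \xD0\xB8\xCC\x86";

var_dump(hasMinVisibleChars($decomposed, 2)); // bool(true)
var_dump(hasMinVisibleChars($decomposed, 3)); // bool(false)
```

The string holds four non-whitespace code points, but only two of them start a new character, so a threshold of 3 is not reached.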
I am fairly new to using regular expressions and I am stuck on a problem that I am trying to solve. I have issues understanding what's going on and I hope that someone can hint me in the right direction.
What I am trying to achieve:
To avoid duplicates in the view, I want to check if an attribute name contains the respective attribute unit. For example, if $attribute['name'] = "Cutting speed (in m/Min.)" and $attribute['unit'] = "m/min", the attribute unit should not be displayed, as it is already mentioned in the name.
How I am trying to achieve this:
I am checking for the attribute unit by using the following regular expression: '~\b' . $attribute['unit'] . '\b~i'
This works well for the above-mentioned example, but not so well if the unit is a special character, like % or ", for instance.
The Problems
While testing for the special character issue I came across the following phenomenon:
if I use this regex /\b%\b/ it behaves not as expected and matches the % in bla%bla but not the % if it is preceded or followed by a space: https://regex101.com/r/56iYEI/3
It seems like the % turns the behavior of the regex to its opposite. I tested with other "special characters" as well (" and &), and they seem to have the same effect.
I was directed to this question (Regular Expression Word Boundary and Special Characters) before and read the answers. I now understand that \b checks for word boundaries. But it is still unclear to me why it behaves the way it does as soon as a % or " turns up.
The questions
How come a % turns the word-boundary check done by \b around?
How can I achieve my goal to match for alphanumeric units as well as for special character units, like % or "?
Looking forward to any hints. Thanks in advance!
A word boundary is a position between a word character and a non-word character (or the start or end of the string, if a word character is adjacent). The non-word character doesn't have to be a space.
foo"##bar {}qux
In this string the word boundaries are before and after foo, bar, and qux.
The expression /\b"##\b/ will match the characters between foo and bar. However, /\b"#\b/ will not, because the first # is followed by another non-word character (the second #), so there is no word boundary after it.
To solve this, check for either a word boundary or a non-word character. The following expression matches both cases: /(^|\W|\b)"#($|\W|\b)/.
'~(^|\W|\b)' . $attribute['unit'] . '($|\W|\b)~i'
P.S. If $attribute['unit'] can contain any characters, be sure to escape it with preg_quote() before using it in the regex.
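Putting the pattern and the preg_quote() advice together, a minimal sketch could look like this. The function name nameMentionsUnit is made up for the example; the pattern is the one from this answer.

```php
<?php
// Hypothetical helper: true if $unit already appears in $name as a separate
// token. preg_quote() escapes regex metacharacters and the ~ delimiter.
function nameMentionsUnit($name, $unit)
{
    $pattern = '~(^|\W|\b)' . preg_quote($unit, '~') . '($|\W|\b)~i';
    return preg_match($pattern, $name) === 1;
}

var_dump(nameMentionsUnit('Cutting speed (in m/Min.)', 'm/min')); // bool(true)
var_dump(nameMentionsUnit('Efficiency in %', '%'));               // bool(true)
var_dump(nameMentionsUnit('Diameter', 'm'));                      // bool(false)
```

The last call shows that a unit made of word characters still has to stand on its own: the m inside "Diameter" is surrounded by other letters, so it is not treated as a mention of the unit.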
I'm validating phone numbers with the following regex
^((\+\d{1,3}(-|.| )?\(?\d\)?(-| |.)?\d{1,5})|(\(?\d{2,6}\)?))(-|.| )?(\d{3,4})(-|.| )?(\d{4})(( x| ext)\d{1,5}){0,1}$
and it's working perfectly.
I need to add Arabic numbers support e.g. "٠١٢٣٤٥٦٧٨٩"
I already did some research and found out that \u0660 through \u0669 are the Arabic digits 0 to 9, but I need this added to my working regex.
Thanks
Don't change the pattern. Just do:
$temp = str_replace(['٠','١','٢','٣','٤','٥','٦','٧','٨','٩'], range(0, 9), $input);
Then run the test on the temporary variable. Sorry, the first array may appear back to front visually, but the byte order should be right.
You can change your pattern so that every occurrence of \d is replaced by [\d\x{0660}-\x{0669}]. \x{....} represents a specific character with the given hex code (in PHP, code points above 0xFF require the u modifier), and you can also use it in ranges. The same can be done in JavaScript by using \u...., so your pattern would be [\d\u0660-\u0669].
You could alternatively turn on the u flag (Unicode) for your pattern, which will then cause \d to match any Unicode digit (including Latin and Arabic, but not restricted to them). It will also affect other tokens like \w and [[:alpha:]], but that should not be an issue here.
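Both approaches from the answers can be sketched in a few lines. The helper name normalizeArabicIndicDigits is hypothetical, and the digits are written as UTF-8 byte escapes so that the example does not depend on how an editor displays right-to-left text.

```php
<?php
// Hypothetical helper: map Arabic-Indic digits (U+0660..U+0669) to ASCII 0-9
// so that an existing ASCII-only pattern can be reused unchanged.
function normalizeArabicIndicDigits($input)
{
    $map = array();
    for ($i = 0; $i <= 9; $i++) {
        // U+0660 + $i is encoded in UTF-8 as the two bytes 0xD9, 0xA0 + $i
        $map["\xD9" . chr(0xA0 + $i)] = (string) $i;
    }
    return strtr($input, $map);
}

// "+1 555 1234" with the leading 1 and the last four digits in Arabic-Indic form
$input = "+\xD9\xA1 555 \xD9\xA1\xD9\xA2\xD9\xA3\xD9\xA4";
var_dump(normalizeArabicIndicDigits($input)); // string(11) "+1 555 1234"

// Alternative: widen the character class instead (needs the u modifier)
var_dump(preg_match('/^[\d\x{0660}-\x{0669}]+$/u', "7\xD9\xA3")); // int(1)
```

Normalizing first keeps the long phone pattern untouched; widening the class avoids the extra pass over the input.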
I have the following regular expression to detect mentions and extract them into string:
preg_match_all('/(?<=^|\s)#([^#\s]+)/'
this works well for detecting strings like this:
#ajksdh
#kajshd123
#12398asdd
however I wanted to make an exception so that it doesn't detect mention strings that end with 'rb', so the following shouldn't be matched
#72rb
#80rb
so the format is some numbers followed by 'rb'. Is this even possible?
Step 1
To exclude strings ending with rb, just add a closing boundary and a negative lookbehind:
(?<=^|\s)#([^#\s]+)(?<!rb)\b
See demo
Step 2
What this is missing is that the [^#\s] does not really define what you want (I am guessing). At the moment, it matches any character other than # and whitespace, which includes punctuation, for instance, and Japanese characters. This is probably closer to what you want:
(?<=^|\s)#((?:(?!#)\w)+)(?<!rb)\b
See demo
Fine-Tuning
If instead of just \w you want to allow more characters, let me know which, and we can tune this. For instance, to allow all ASCII characters except space, we could use:
(?<=^|\s)#((?:(?!#)[!-~])+)(?<!rb)\b
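As a quick check of the Step 1 pattern against the strings from the question, here is a minimal sketch. The wrapper function extractMentions is a made-up name for the example.

```php
<?php
// Hypothetical wrapper around the Step 1 pattern: returns the captured
// mention names, skipping any mention that ends in "rb".
function extractMentions($text)
{
    preg_match_all('/(?<=^|\s)#([^#\s]+)(?<!rb)\b/', $text, $matches);
    return $matches[1];
}

print_r(extractMentions("#ajksdh #72rb #kajshd123 #80rb #12398asdd"));
// Array ( [0] => ajksdh [1] => kajshd123 [2] => 12398asdd )
```

Note that the lookbehind rejects every mention ending in rb, not only digits followed by rb, so something like #herb would be skipped as well.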
I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, which are %$#&*#, are already matched, but somehow the ones I mentioned before are not matched.
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.
Use this:
[\W]+
It will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
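To match those kinds of special characters you can use their Unicode code points in a range.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top range covers the code points U+0000 to U+FFFF and requires the u modifier in PHP (PCRE does not support the \uFFFF notation); the bottom range covers single bytes only.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.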
You can use the \p{S} character class (or \p{So}), which matches symbol characters (including characters like these: ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
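A short sketch of the \p{S} approach in use follows. The wrapper name isValidLogin is an assumption for the example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical wrapper: accepts ASCII letters, digits, horizontal whitespace
// and any character in the Unicode "Symbol" category.
function isValidLogin($login)
{
    return preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login) === 1;
}

var_dump(isValidLogin("user ♠♥")); // bool(true): both are symbols
var_dump(isValidLogin("user!"));   // bool(false): ! is punctuation, not a symbol
```

Characters such as ! or % fall under punctuation (\p{P}) rather than \p{S}, so they would have to be added to the class separately if they should be allowed.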
I need a regex which can basically check for space, line break etc after string.
So conditions are,
Allow special characters ., _, -, + inside the string, i.e. #hello.world, #hello_world, #helloworld, etc.
Discard anything, including special characters, where there is no alphanumeric string after them, i.e. #helloworld.<space>, #helloworld-<space>, #helloworld.?, etc. must be parsed as #helloworld
My existing RegEx is /#([A-Za-z0-9+_.-]+)/, which works perfectly for Condition #1, but there still seems to be a problem with Condition #2.
I am using the above RegEx in preg_replace().
Solution:
$str = preg_replace('/#[\w+.\-]+\b/', '[[$0]]', $str); // / as delimiter, since the pattern itself starts with a literal #
This works perfectly.
Tested with
http://gskinner.com/RegExr/
You can use word boundaries to easily find the position between an alphanumeric character and a non-alphanumeric character:
$str = preg_replace('/#[\w+.\-]+\b/', '[[$0]]', $str); // / as delimiter, since the pattern itself starts with a literal #
Working example: http://ideone.com/0ShCm
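To see how the trailing \b trims the match, here is a minimal sketch run on the cases from the question. The function name wrapTags is made up for the example, and / is used as the delimiter so that the literal # at the start of the pattern is not mistaken for it.

```php
<?php
// Hypothetical wrapper: wraps every #tag in [[...]]. The greedy character
// class backtracks until the match ends right after a word character.
function wrapTags($str)
{
    return preg_replace('/#[\w+.\-]+\b/', '[[$0]]', $str);
}

echo wrapTags("#hello.world and #helloworld. done #helloworld-"), "\n";
// [[#hello.world]] and [[#helloworld]]. done [[#helloworld]]-
```

The dot inside #hello.world is kept because a word character follows it, while the trailing dot and hyphen are left outside the brackets.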
Here's an idea:
Use strrev to reverse the string
Use strcspn to find the longest prefix of the reversed string that does not contain any alphanumeric characters
Cut the prefix off with substr
Reverse the string again; this is your final result
See it in action.
I'm not taking into account any requirement that restricts the legal characters in the string to some subset, but you can use your regular expression for that (or even strspn, which might be faster).
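The four steps of this idea can be sketched as follows. The function name trimTrailingNonAlnum is made up for the example, and since strrev works on bytes, the sketch assumes ASCII input.

```php
<?php
// Hypothetical helper implementing the strrev/strcspn/substr idea:
// strips everything after the last ASCII letter or digit.
function trimTrailingNonAlnum($s)
{
    $alnum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $reversed = strrev($s);
    // Length of the leading run of the reversed string without alphanumerics
    $skip = strcspn($reversed, $alnum);
    // Cut that run off and restore the original order
    return strrev((string) substr($reversed, $skip));
}

var_dump(trimTrailingNonAlnum('#helloworld.?')); // string(11) "#helloworld"
var_dump(trimTrailingNonAlnum('#hello.world'));  // string(12) "#hello.world"
```

The (string) cast only guards against older PHP versions, where substr could return false instead of an empty string.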
The reason is that it's reading the string as a whole. If you want it to parse out everything after the alphanumeric section, you might have to do something like end(explode()) and run that through to check whether it is valid, and if it isn't valid, remove it from the equation. But then you'd have to check the end for every possible explode point, i.e. ., -, ~, etc.
Then again, another trap that you might run into is that in the case of an item or anything with an alphanumeric value, it might just parse everything from after the last alphanumeric character on.
Sorry that this isn't much help, but I figured thinking aloud does help.