Trying to add arabic numeric support to a working regex

I'm validating phone numbers with the following regex:
^((\+\d{1,3}(-|.| )?\(?\d\)?(-| |.)?\d{1,5})|(\(?\d{2,6}\)?))(-|.| )?(\d{3,4})(-|.| )?(\d{4})(( x| ext)\d{1,5}){0,1}$
and it's working perfectly.
I need to add support for Arabic numerals, e.g. "٠١٢٣٤٥٦٧٨٩".
I already did some research and found out that the range \u0660 to \u0669 covers the digits 0 to 9 in Arabic, but I need this added to my working regex.
Thanks

Don't change the pattern. Just do:
$temp = str_replace(['٠','١','٢','٣','٤','٥','٦','٧','٨','٩'], range(0, 9), $input);
Then run the test on the temporary variable. Sorry, the first array looks back to front because it is rendered right-to-left, but the byte order should be right.
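To make the normalize-then-validate idea concrete, here is a minimal sketch. The function names normalize_digits and is_valid_phone are made up for illustration, the pattern is copied unchanged from the question, and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not from the thread): map Arabic-Indic digits
// U+0660..U+0669 to ASCII 0-9. Writing them as "\u{...}" escapes avoids the
// right-to-left display confusion of literal digits.
function normalize_digits(string $input): string
{
    $arabic = ["\u{0660}", "\u{0661}", "\u{0662}", "\u{0663}", "\u{0664}",
               "\u{0665}", "\u{0666}", "\u{0667}", "\u{0668}", "\u{0669}"];
    $ascii  = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
    return str_replace($arabic, $ascii, $input);
}

// Hypothetical wrapper: normalize first, then apply the untouched pattern.
function is_valid_phone(string $input): bool
{
    $pattern = '/^((\+\d{1,3}(-|.| )?\(?\d\)?(-| |.)?\d{1,5})|(\(?\d{2,6}\)?))'
             . '(-|.| )?(\d{3,4})(-|.| )?(\d{4})(( x| ext)\d{1,5}){0,1}$/';
    return preg_match($pattern, normalize_digits($input)) === 1;
}
```

The advantage of this route is that the stored value can be the normalized ASCII form, so later code never has to deal with two digit sets.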

You can change your pattern so that every occurrence of \d is replaced by [\d\x{0660}-\x{0669}]. \x{....} represents a specific character by its hex code point, and it can also be used in ranges. In PHP this syntax needs the u modifier for code points above 0xFF. The same can be done in JavaScript by using \u...., so there the class would be [\d\u0660-\u0669].
Alternatively, you could just turn on the u flag (Unicode) for your pattern, which in PHP causes \d to match any Unicode digit (including Latin and Arabic, but not restricted to them). It will also affect other tokens like \w and [[:alpha:]], but that should not be an issue here.
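A small sketch of the character-class route, for the case where only ASCII and Arabic-Indic digits should be accepted. The DIGIT constant, the function name and the simplified 3-4 digit format are assumptions for illustration, not the asker's full phone pattern. It uses [0-9...] rather than [\d...] because with the u modifier \d would already accept every Unicode digit.

```php
<?php
// Hypothetical building block: one ASCII or Arabic-Indic digit.
// \x{0660}-\x{0669} requires the u modifier on the final pattern.
const DIGIT = '[0-9\x{0660}-\x{0669}]';

// Hypothetical check for a simplified "ddd-dddd" format.
function is_short_number(string $s): bool
{
    $pattern = '/^' . DIGIT . '{3}-' . DIGIT . '{4}$/u';
    return preg_match($pattern, $s) === 1;
}
```

The same substitution can be applied to each \d in the original pattern, as long as the u modifier is appended.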

Related

regex POSIX expression for half space,semi space or zero space

There is a POSIX bracket expression list like [:alnum:], [:alpha:]...
https://www.regular-expressions.info/posixbrackets.html
Which one is for a half space (semi space, zero-width space)?
EDIT1: Actually, I am using regex_replace in Smarty (PHP) code, like below:
{$title|regex_replace:'/[^[:punct:][:alnum:][:space:]]/u':''}
This code replaces every character with an empty string, except punctuation, alphanumerics and whitespace.
But unfortunately, it also removes the half space.
For example: unicode persian string $title = '☺این‌یک تست (آزمایش) است‌'
will change to 'اینیک تست (آزمایش) است‌'.
But the correct string should be 'این‌یک تست (آزمایش) است‌'
As you see, it also replace half space in 'این‌یک' with null value and convert it to 'اینیک'
I want to prevent it.
EDIT2: The half space (zero-width space) I mean is:
Decimal character code: 8204
Hexadecimal character code: 0x200c
HTML character reference: &#8204; (&zwnj;)
Java string: \u200c
A SOLUTION:
If I add a Persian (Farsi) keyboard layout to Windows and switch the keyboard language to Persian,
then use SHIFT+SPACE to insert a half space between the last two brackets ]], it works great:
{$title|regex_replace:'/[^[:punct:][:alnum:][:space:]‌]/u':''}
(There is a half space character between the last two brackets, typed with the Persian keyboard.)
But unfortunately it does not work using the hex code \x200c, and I don't know why.

The standard POSIX character classes generally capture classes of characters. If you want to match the character U+2002 then simply match exactly that character, literally or using whatever symbolic representation your programming language supports.
Python:
import re
r = re.compile('\u2002')
if r.search(somestring):
    ...
Though of course, you don't need a regex for that:
if '\u2002' in somestring:
    ...
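(I'm guessing you mean U+2002; there are a number of other spaces, none of which has a name that exactly contains "half space". A modern POSIX [:space:] should match all the whitespace ones, of course. If you mean U+200C, ZERO WIDTH NON-JOINER, as the edit to the question says, note that it is a format character rather than whitespace, so [:space:] will not match it.)
Update: If PHP's [:space:] is not properly POSIX and/or Unicode-compliant, probably simply add the code point to your expression, written with braces as \x{2002} (or \x{200C} for the zero width non-joiner). A bare \x200c is read as \x20 followed by the literal characters 0 and c, which is why it does not work.
{$title|regex_replace:'/[^[:punct:][:alnum:][:space:]\x{2002}]/u':''}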
(with kudos to Regular expressions for a range of unicode points PHP)
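For the U+200C case described in the question's EDIT2, a plain-PHP sketch of the same idea could look like this. The function name strip_unwanted is made up, and it uses preg_replace directly rather than the Smarty modifier.

```php
<?php
// Hypothetical helper: remove everything except punctuation, alphanumerics,
// whitespace and the zero width non-joiner (U+200C).
// \x{200C} must be written with braces and needs the u modifier.
function strip_unwanted(string $title): string
{
    $clean = preg_replace('/[^[:punct:][:alnum:][:space:]\x{200C}]/u', '', $title);
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $clean ?? $title;
}
```

In the Smarty template the equivalent would presumably be the same character class inside the regex_replace modifier.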

You could replace any separators (\p{Z}) with a standard full space before applying the actual regular expression. Here both are done sequentially:
preg_replace(['/\p{Z}/u', '/[^[:punct:][:alnum:][:space:]]/u'], [' ', ''], $title)

how to detect certain ending words in a mention

I have the following regular expression to detect mentions and extract them into string:
preg_match_all('/(?<=^|\s)#([^#\s]+)/'
this works well for detecting strings like this:
#ajksdh
#kajshd123
#12398asdd
however I wanted to make an exception so that it doesn't detect mention strings that end with 'rb', so the following shouldn't be matched
#72rb
#80rb
so the format is some numbers followed by 'rb'. Is this even possible?

Step 1
To exclude strings ending with rb, just add a closing boundary and a negative lookbehind:
(?<=^|\s)#([^#\s]+)(?<!rb)\b
Step 2
What this is missing is that [^#\s] does not really define what you want (I am guessing). At the moment, it matches punctuation, for instance, and Japanese characters. This is probably closer to what you want:
(?<=^|\s)#((?:(?!#)\w)+)(?<!rb)\b
Fine-Tuning
If instead of just \w you want to allow more characters, let me know which, and we can tune this. For instance, to allow all ASCII characters except space, we could use:
(?<=^|\s)#((?:(?!#)[!-~])+)(?<!rb)\b
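Here is a short sketch of how the Step 2 pattern could be plugged into preg_match_all. The function name extract_mentions is hypothetical.

```php
<?php
// Hypothetical helper: return the mention names (without '#'),
// skipping any mention that ends in "rb".
function extract_mentions(string $text): array
{
    // (?<!rb)\b : the name must end at a word boundary not preceded by "rb".
    preg_match_all('/(?<=^|\s)#((?:(?!#)\w)+)(?<!rb)\b/', $text, $matches);
    return $matches[1];
}
```

Note that this rejects every mention ending in rb, not only digits followed by rb; if only the numeric form should be excluded, the lookbehind would need to be tightened.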

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, %$#&*#, are already matched, but somehow the ones I mentioned first are not.
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.

Use this:
[\W]+
It will match any non-word character.

Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters you can use their Unicode code points.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top range covers every code point in the Basic Multilingual Plane and needs the u modifier in PHP (the \u0000-\uFFFF spelling is JavaScript's and is not accepted by PCRE); the bottom one covers single bytes only.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.

You can use the \p{S} character class (or \p{So}) that matches symbol characters (that includes this kind of characters: ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find code of characters here.
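As a rough illustration of what \p{S} does and does not cover, here is a sketch with two made-up helpers, is_allowed_login and contains_symbol. The "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical check using the pattern from this answer: letters, digits,
// horizontal whitespace and Unicode symbols only.
function is_allowed_login(string $login): bool
{
    return preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login) === 1;
}

// Hypothetical inverse use: detect whether any symbol character is present,
// e.g. in order to reject the login.
function contains_symbol(string $login): bool
{
    return preg_match('/\p{S}/u', $login) === 1;
}
```

One thing to watch: the bullet • (U+2022) from the question is classified as punctuation, not as a symbol, so it would need \p{P} (or an explicit entry) in addition to \p{S}.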

secured regular expression that restrict specific special characters

I tried to create a regular expression that meets the specification below:
any alphabetic character (at least one)
any numeric character (at least one)
no spaces
accept all special characters (except ",;&|')
^(?=.*[0-9])(?=.*[a-z])(?!.*\s)((?!.*[",;&|'])|(?=(.*\W){1,}))(?!.*[",;&|'])$
This is the one I tried.
What I can do with this?

The question is still vague; please provide some examples of accepted strings.
Just to get you started, you can use:
a character class in a negative lookahead
start and end anchors (don't forget them)
Regex:
/^(?=.*?\d)(?=.*?[a-z])(?!.*?[ ",;&|']).+$/i
This regex matches one or more characters, none of which is a space or one of ",;&|', and requires at least one digit and at least one letter a-z.
Live Demo: http://www.rubular.com/r/nxdi79ZcRx
In PHP use it like this:
'/^(?=.*?\d)(?=.*?[a-z])(?!.*?[ ",;&|\']).+$/i'
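A quick sketch of that pattern wrapped in a function, with a few sample inputs. The name is_acceptable and the sample strings are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern above.
function is_acceptable(string $value): bool
{
    // (?=.*?\d)      at least one digit
    // (?=.*?[a-z])   at least one letter (case-insensitive via /i)
    // (?!.*?[ ...])  no space and none of " , ; & | '
    return preg_match('/^(?=.*?\d)(?=.*?[a-z])(?!.*?[ ",;&|\']).+$/i', $value) === 1;
}
```

For example, is_acceptable('abc#123') passes, while 'abc 123' and 'abc;123' are rejected because of the space and the semicolon.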

Validate unicode textarea for minimum length

I have to validate Russian text (utf8) entered in textarea field of the form. The number of characters (no spaces, no empty lines) should be at least 500. The text should be checked with regex and can have many lines.
I have tried:
#^.{500}.*#
This indeed makes the restriction somehow. However, it seems that this pattern does not respect unicode. 260 Russian characters are enough to pass the check. I cannot figure out how to:
check unicode characters
do not count white spaces
do not count empty lines

Okay, so firstly, . matches single bytes by default, because without Unicode mode the input string is treated as a sequence of one-byte characters. Using Unicode mode changes that (as Esailija correctly pointed out), so that . correctly matches (Unicode) characters:
#^.{500}#u
You don't need the trailing .*, because there is no need to match the full string in PHP. Note that this does not match if there is a line-break within the first 500 characters, because . does not match line-breaks (you should add the s modifier as well, to change that).
For the second requirement to exclude whitespace from the count, you could do something like this:
#^(?:\s*\S){500}#u
That subgroup matches as many whitespace characters as possible, and then one non-whitespace character. And that together has to be matched 500 times. Hence, you only get one repetition per non-whitespace character, as required.
Note that there is no need for the s modifier for this to work under all circumstances, because we don't use the dot.
There is one caveat, though, which is explained in this article. With Unicode some characters are made up of multiple code points. For instance, à can be written as one character a followed by another code point (U+0300 or `) which is a combining mark. So while there are two different Unicode code points, they are still only one character. However, . matches code points (because it doesn't distinguish between combining marks and "stand-alone characters"). I suppose that will rarely affect your situation, since Cyrillic text mostly does not use accents (although й and ё can appear in decomposed form, as a base letter plus a combining mark). But it's something worth being aware of. If it is relevant for you, you might want to look into a more advanced solution like Ωmega's.
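To try the whitespace-skipping pattern with a smaller limit than 500, a sketch like the following may help. Both function names are hypothetical, and the minimum is passed in as a parameter.

```php
<?php
// Hypothetical helper: true if $text contains at least $min non-whitespace
// code points, using the #^(?:\s*\S){N}#u pattern from this answer.
function has_min_non_space_chars(string $text, int $min): bool
{
    $pattern = sprintf('#^(?:\s*\S){%d}#u', $min);
    return preg_match($pattern, $text) === 1;
}

// Hypothetical cross-check: count non-whitespace code points directly.
function count_non_space_chars(string $text): int
{
    return (int) preg_match_all('/\S/u', $text);
}
```

The counting variant is handy when the form should also tell the user how many characters are still missing.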

You need the u flag to activate UTF-8 awareness in preg_ functions:
$regex = '#^.{500}.*#u';
If you just want to see if it's at least 500 characters long, you can just use mb_strlen:
mb_internal_encoding("UTF-8");
// Strip whitespace, line breaks and zero-width characters before counting
$input_without_whitespace = preg_replace( '/[\x{0009}\x{000A}\x{000B}\x{000C}\x{000D}\x{0020}\x{00A0}\x{FEFF}\x{200C}\x{200D}]/u', "", $input );
if( mb_strlen( $input_without_whitespace ) >= 500 ) {
    // long enough
}

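Use this regex pattern, which skips whitespace and counts each base character together with any combining marks (\p{M}) that follow it as one: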
/(?>\s*+\P{M}\p{M}*){500}/u
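A small sketch showing how this pattern differs from counting plain code points. The function name has_min_graphemes is made up, and the sample string is a decomposed à (a followed by U+0300), a space, and b.

```php
<?php
// Hypothetical helper: true if $text contains at least $min base characters,
// where each base character absorbs the combining marks that follow it.
function has_min_graphemes(string $text, int $min): bool
{
    $pattern = sprintf('/(?>\s*+\P{M}\p{M}*){%d}/u', $min);
    return preg_match($pattern, $text) === 1;
}

// "a" + combining grave accent, a space, then "b": two visible characters,
// but three non-whitespace code points.
$sample = "a\u{0300} b";
$asGraphemes  = has_min_graphemes($sample, 3);                  // false
$asCodePoints = preg_match('#^(?:\s*\S){3}#u', $sample) === 1;  // true
```

So the code point version would accept this sample as three characters, while the mark-aware version counts only two.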
