I have a forum site and I am working on the final piece, the registration form, where I want to validate the username. It should contain only Arabic and English alphanumerics, with at most one space between words.
I've got the English alphanumeric part working, but not the Arabic part or the double-space check.
I am using the preg_match() function to match the username input against the regex.
What I currently have:
!preg_match('/\p{Arabic}/', $username) && !preg_match('/^[A-Za-z0-9]$/')
//this is currently inside an if statement, so if neither pattern matches then it is false.
You should put the Unicode properties inside your regular character class, because this can all be done with one regex. You also need to quantify that character class; otherwise you only allow one character. This regex should do it:
^[\p{Arabic}a-zA-Z\p{N}]+\h?[\p{N}\p{Arabic}a-zA-Z]*$
Use the u modifier in PHP so Unicode works as expected.
PHP Usage:
preg_match('/^[\p{Arabic}a-zA-Z\p{N}]+\h?[\p{N}\p{Arabic}a-zA-Z]*$/u', $string);
Demo: https://regex101.com/r/fsRchS/2/
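Note that the pattern above permits at most one space in the whole username, so it accepts two words but not three. If usernames with more words are wanted, a minimal sketch along the same lines could repeat a "single space plus word" group. The function name isValidUsername is made up for illustration, and the character set is simply carried over from the answer.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Accepts one or more words built from Arabic-script characters, ASCII
// letters, or Unicode digits, separated by exactly one horizontal space.
function isValidUsername(string $username): bool
{
    $word = '[\p{Arabic}a-zA-Z\p{N}]+';
    // \z anchors at the very end of the string; u enables UTF-8 handling.
    $pattern = '/^' . $word . '(?:\h' . $word . ')*\z/u';
    return preg_match($pattern, $username) === 1;
}
```

Because every space must be followed by another word, leading, trailing, and doubled spaces are all rejected.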
Related
While there have been many questions about regexes for non-English characters, I have not been able to find a working answer. Moreover, there does not seem to be any simple PHP library that would help me filter non-English input.
Could you please suggest a regular expression which would allow
all English alphabet characters (abc...)
all non-English alphabet characters (šýüčá...)
spaces
case insensitive
in validation as well as sanitization. Essentially, I want either preg_match to return false when the input contains anything other than the 4 points above, or preg_replace to get rid of everything except these 4 categories.
I was able to create
'/^((\p{L}\p{M}*)|(\p{Cc})|(\p{Z}))+$/ui' from http://www.regular-expressions.info/unicode.html. This regular expression works well when validating input but not when sanitizing it.
EDIT:
User enters 'český [jazyk]' as an input. Using '/^[\p{L}\p{Zs}]+$/u' in preg_match, the script determines that the string contains disallowed characters (in this case '[' and ']'). Next I would like to use preg_replace to delete those unwanted characters. What regular expression should I pass into preg_replace to match all characters that are not allowed by the regular expression stated above?
I think all you need is a character class like:
^[\p{L}\p{Zs}]+$
It means: The whole string (or line, with (?m) option) can only contain Unicode letters or spaces.
Have a look at the demo.
$re = "/^[\\p{L}\\p{Zs}]+$/um";
$str = "all english alphabet characters (abc...)\nall non-english alphabet characters (šýüčá...)\nspace s\nšýüčá šýüčá šýüčá ddd\nšýüčá eee 4\ncase insensitive";
preg_match_all($re, $str, $matches);
To remove all symbols that are not Unicode letters or spaces, use this code:
$re = "/[^\\p{L}\\p{Zs}]+/u";
$str = "český [jazyk]";
echo preg_replace($re, "", $str);
The output of the sample program:
český jazyk
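One caveat, offered as an assumption about the input rather than something stated in the question: if the text arrives in decomposed form (a base letter followed by a combining mark), \p{L} alone does not cover the mark, so the replacement above would strip the accent. A sketch that also keeps \p{M}, as the asker's original pattern did, might look like this; the function name stripDisallowed is hypothetical.

```php
<?php
// Hypothetical helper: removes everything except Unicode letters,
// combining marks, and space separators.
function stripDisallowed(string $input): string
{
    // preg_replace() returns null on failure (for example, invalid UTF-8),
    // so fall back to an empty string in that case.
    $clean = preg_replace('/[^\p{L}\p{M}\p{Zs}]+/u', '', $input);
    return $clean ?? '';
}
```

The validation pattern would need the same addition, i.e. [\p{L}\p{M}\p{Zs}], for the two checks to agree.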
I'm new to regular expressions, but with a little bit of searching on StackOverflow I managed to get what I want (if 2+ words are separated by commas it returns true, and it returns false if that isn't the case or if a word ends with a comma with nothing after it), except that I am having a problem with Croatian characters (č, ć, ž, đ, š, upper and lowercase). My current preg_match looks like
if (preg_match('/^(([a-zA-Z0-9]+\\s*,\\s*)+(\\s*)([a-zA-Z0-9]+))$/', $data))
{
//do stuff
}
But the problem with this approach is that it won't return true if the input has Č, ć, ž... and I know that is because [a-zA-Z] doesn't "look" for these characters. So my question is how to write a regex that will return true with Croatian characters. Also, if this could be done more easily, feel free to comment, as I would like to hear your suggestions. BTW, I have done this with the help of regex101.com
The \p{L} shorthand class and the u option make it possible to match Unicode letters.
This program prints FOUND!:
$data = "Čdd, ćdd, žddd";
if (preg_match('/^(([\\p{L}0-9]+\\s*,\\s*)+(\\s*)([\\p{L}0-9]+))$/u', $data))
{
echo "<h1>FOUND!</h1>";
}
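Since the question also asks whether this could be written more simply, here is one possible reading, sketched under the assumption that the goal is "two or more comma-separated words made of letters and digits". The helper name isCommaSeparatedList is invented for the example.

```php
<?php
// Hypothetical helper: true when $data is two or more words (Unicode
// letters or ASCII digits) separated by commas with optional whitespace.
function isCommaSeparatedList(string $data): bool
{
    $word = '[\p{L}0-9]+';
    // First word, then one or more ", word" groups.
    $pattern = '/^' . $word . '(?:\s*,\s*' . $word . ')+$/u';
    return preg_match($pattern, $data) === 1;
}
```

Requiring at least one "comma plus word" group is what rejects a single word and a trailing comma.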
As per Regular-Expressions.info:
You can match a single character belonging to the "letter" category
with \p{L}.
and another of its pages, devoted to PHP regex:
You should specify /u for regular expressions that use \x{FFFF}, \X or
\p{L} to match Unicode characters, graphemes, properties or scripts.
PHP will interpret '/regex/u' as a UTF-8 string rather than as an
ASCII string.
Also, see one of the examples at preg_match function documentation page:
For those who search for a unicode regular expression example using
preg_match here it is:
Check for Persian digits preg_match( "/[^\x{06F0}-\x{06F9}\x]+/u" ,
'۱۲۳۴۵۶۷۸۹۰' );
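The quoted pattern uses a negated class, so it matches when something outside that digit range is present. For comparison, a positive check that a string consists only of the digits U+06F0 to U+06F9 could be sketched as follows; the function name isPersianDigits is an assumption made for this example.

```php
<?php
// Hypothetical helper: true when $value is non-empty and contains only
// code points in the range U+06F0..U+06F9.
function isPersianDigits(string $value): bool
{
    // Single quotes keep \x{...} intact for PCRE; the u modifier is
    // required for code points above 0xFF.
    return preg_match('/^[\x{06F0}-\x{06F9}]+\z/u', $value) === 1;
}
```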
I'm trying to formulate a regular expression that will allow me to find a string within a piece of text, if the string exists on its own i.e. not within another word (but surrounded by special characters is ok).
/\bword\b/i
The above regex works fine, and finds "word" in the text. The problem comes when the word I want to find is something like "c++". In this case it matches any occurrence of the "c" character on its own. I've tried escaping the "+" characters but it doesn't make any difference. I'm assuming that because "+" is a non-word character, I'm possibly going down the wrong route and word boundaries are not what I should be using.
So I guess the question is, how can I use a regular expression to find a string in a piece of text, on its own, regardless of whether the string is alphanumeric or contains special characters? In the following piece of text it should match the 3 occurrences of "c++":
c++
(c++)
perl/c++/assembly
But it should not match on the following:
maniac++
c++abc
This is intended so that my script can tell if a specific skill exists within a user's CV/resume. I'm using this with PHP's preg_match_all() function.
I've done a lot of searching but can't come up with a solution, hopefully someone with good regex knowledge can help.
Try this:
/(?<!\w)(c\+\+)(?!\w)/
The (?<!\w) is a negative lookbehind clause, meaning that a word character should not immediately precede your pattern. The (?!\w) part is negative lookahead, meaning that a word character should not immediately follow.
Hope this helps!
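To apply the same lookarounds to any skill name, not just "c++", the term can be escaped with preg_quote() before it is placed in the pattern. This is only a sketch: the function name countSkillMentions is made up, and the i modifier is carried over from the question's original pattern rather than from the answer.

```php
<?php
// Hypothetical helper: counts standalone, case-insensitive occurrences of
// $skill in $text, where "standalone" means no word character directly
// before or after the match.
function countSkillMentions(string $skill, string $text): int
{
    // preg_quote() escapes regex metacharacters such as "+"; the second
    // argument also escapes the "/" delimiter.
    $pattern = '/(?<!\w)' . preg_quote($skill, '/') . '(?!\w)/i';
    $count = preg_match_all($pattern, $text);
    return $count === false ? 0 : $count;
}
```

With the sample text from the question, the three standalone occurrences are counted and "maniac++" and "c++abc" are skipped.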
I would like to have a regular expression that matches:
Arabic letters.
English alphanumeric.
3 Spaces maximum.
4 Underscores maximum.
Any order.
I tried various solutions but couldn't solve it.
Here is what I have now:
preg_match('#^([^\W_]*\s){0,3}[^\W_]*$#', $username)
The above expression allows:
3 spaces maximum
English alphanumerics
No underscore allowed
You can check whether your regex flavour supports \p{Arabic} or \p{InArabic}.
Also experiment with the mb_ereg_match() function: http://si2.php.net/manual/en/function.mb-ereg-match.php
If that doesn't work, there is no other option than explicitly writing all Arabic characters into the expression. Messy, but it does the job.
Since you are using PHP, you can first list all Arabic characters in a string variable and then add that variable to the regex, for the sake of code manageability.
I don't know about Arabic characters, but the following regexp should match the others:
([a-zA-Z0-9]{1,})\s{0,3}_{0,4}
This will match
(Alphanumeric)(0-3 spaces)(0-4 underscores)
If there are more than 4 underscores, the last ones will be omitted
If there are more than 3 spaces then the part after the 3 spaces will be ignored.
EDIT:
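For Arabic letters: first declare a string containing all Arabic letters
so you'll have
$arabic='all_arabic_letters';
Then your regexp string will be (note the / delimiters and the u modifier, which preg_match() needs in order to treat the letters as UTF-8 characters)
$regex='/[' . $arabic . ']{1,}([a-zA-Z0-9]{1,})\s{0,3}_{0,4}/u';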
And match it as follows:
preg_match($regex, $username);
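The patterns above fix the order of the parts, whereas the question asks for "any order". One way to express the limits independently of order is a pair of negative lookaheads in front of a single character class. This is a sketch under assumptions: the function name isValidName is hypothetical, "space" is taken to mean the plain space character, and \p{Arabic} matches by script rather than by letter category, so it may admit more than letters.

```php
<?php
// Hypothetical helper: Arabic-script characters, ASCII alphanumerics,
// at most 3 spaces and at most 4 underscores, in any order.
function isValidName(string $username): bool
{
    $pattern = '/^'
        . '(?!(?:[^ ]* ){4})'   // fail if a 4th space can be reached
        . '(?!(?:[^_]*_){5})'   // fail if a 5th underscore can be reached
        . '[\p{Arabic}a-zA-Z0-9 _]+'
        . '\z/u';
    return preg_match($pattern, $username) === 1;
}
```

Each lookahead scans from the start of the string, so the counts apply to the whole username no matter where the spaces or underscores appear.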
I am trying to validate some user inputs, but my regex fails when it encounters diacritics. I am talking about characters like ăĂ and so on.
What should I add to the regex code so it should also validate diacritics from within inputs?
Thank you!
P.S.: If it matters, I am using PHP with CakePHP framework.
This is the piece of code I am currently using for validating user input: return preg_match('|^[0-9a-zA-Z_-\s]*$|', $value);
Assuming you want to match letters, then allowing Unicode letters should help:
Use /\p{L}+/u for example if you want to match a sequence of letters. Don't forget the /u (Unicode) modifier.
In your case:
return preg_match('|^[0-9\p{L}_\s-]*$|u', $value);
should work.
As an aside, it's probably not a good idea to use | as a regex delimiter. For the current regex / would do just fine; other alternatives are ~ or # because they seldom occur in text and don't have any special meaning in regexes.
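Putting the two suggestions together, here is a small sketch of the same check with ~ as the delimiter. The function name isValidInput is invented for the example. Keeping the hyphen last in the class, as the answer does, also avoids it being read as a range operator between _ and \s, which, as far as I know, newer PCRE versions reject as an invalid range.

```php
<?php
// Hypothetical helper: digits, Unicode letters, underscore, whitespace,
// and hyphen only. The * quantifier means an empty string also passes.
function isValidInput(string $value): bool
{
    return preg_match('~^[0-9\p{L}_\s-]*$~u', $value) === 1;
}
```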