PHP regex to accept Japanese and English languages - php

I am trying to create a regex that accepts only letters or digits from English and Japanese. This is what I have tried:
preg_match('/(?![\n\r])[\x00-\x1F\x80-\xFF][^\x4e00-\x9fa0)]/u', $value)
But I am not getting the desired result. What might I be doing wrong?

You should use Unicode character properties.
Also, you may want to have a look at this website, which contains some other regex examples: http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
Updated the character list based on @Álvaro González's note about the three alphabets.
This regex should do what you expect:
preg_match('/[\p{L}\p{N}\p{Katakana}\p{Hiragana}\p{Han}]+/u', $value)
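\p{L} matches any letter, \p{N} any number, \p{Katakana} any Katakana character, and so on.
The pattern is not anchored, so it succeeds as soon as $value contains one accepted character; wrap the class in ^ and $ if you want to validate the whole string. You may also need to add word delimiters (such as spaces) to the accepted characters if you are matching more than single words.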
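The pattern above is unanchored and \p{L} accepts letters from every script, so a stricter variant may be useful when the goal is validation. The following is only a sketch, assuming PHP 7 or later: the helper name isJapaneseOrEnglishAlnum is hypothetical, and the explicit \x{30FC} (the long-vowel mark ー) is an assumption about what should count as Japanese text.

```php
<?php
// Hypothetical helper, not part of the original answer.
function isJapaneseOrEnglishAlnum(string $value): bool
{
    // \A and \z anchor the match to the whole string.
    // ASCII letters and digits plus the three Japanese scripts are accepted;
    // \x{30FC} (long-vowel mark) is listed explicitly as an assumption.
    $pattern = '/\A[A-Za-z0-9\p{Hiragana}\p{Katakana}\p{Han}\x{30FC}]+\z/u';
    return preg_match($pattern, $value) === 1;
}

var_dump(isJapaneseOrEnglishAlnum('abc123'));      // bool(true)
var_dump(isJapaneseOrEnglishAlnum('こんにちは'));   // bool(true)
var_dump(isJapaneseOrEnglishAlnum('hello world')); // bool(false), space is not accepted
```

Add a space or other delimiters to the character class if multi-word input should pass.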

The following check is true when the line is not made up entirely of Japanese characters (hiragana, katakana, and common kanji):
if(!preg_match('/^[\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]+$/u', $line)){
// ...
}
You can find more in this document:
https://www.w3.org/International/questions/qa-forms-utf-8.html
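The anchored pattern above only distinguishes lines that are entirely Japanese from everything else. A minimal sketch of the two related checks, reusing the same code point ranges, might look like this; the function names are hypothetical and PHP 7 or later is assumed.

```php
<?php
// Same ranges as in the answer: hiragana, katakana, common kanji.
const JAPANESE_CLASS = '[\x{3041}-\x{3096}\x{30A1}-\x{30FC}\x{4E00}-\x{9FAF}]';

// Hypothetical helper: every character of $line is in the Japanese ranges.
function isEntirelyJapanese(string $line): bool
{
    return preg_match('/^' . JAPANESE_CLASS . '+$/u', $line) === 1;
}

// Hypothetical helper: at least one character of $line is in the Japanese ranges.
function containsJapanese(string $line): bool
{
    return preg_match('/' . JAPANESE_CLASS . '/u', $line) === 1;
}

var_dump(isEntirelyJapanese('こんにちは')); // bool(true)
var_dump(isEntirelyJapanese('hello 世界')); // bool(false)
var_dump(containsJapanese('hello 世界'));   // bool(true)
```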

Related

Regex - Match only unicode alphabet not numbers

I'm using PHP and trying to write a regular expression that matches any letter in any language, but not numbers.
I've tried /\p{L}+/, but it matches Unicode letters and numbers too. I'm checking against Arabic and English. English numbers don't pass, which is expected, but Arabic numbers do pass, which is not.
Is there another regular expression that matches only letters in any language?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that, you can use the u modifier, which has two functions:
it extends the classical shorthand character classes like \w and \d to Unicode characters (not only ASCII characters)
it forces the string to be read as a Unicode (UTF-8) string
So you can use: /\pL+/u
Note that in your particular case the first behavior is not needed; you can switch on only the second behavior with /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern).
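A short sketch of the u modifier in use: without it the pattern is applied byte by byte to the UTF-8 input, while with it each Arabic digit is read as one code point that is a number, not a letter. The helper name isLettersOnly and the added anchors are assumptions for illustration (PHP 7 or later).

```php
<?php
// Hypothetical helper: true when $text consists only of letters, in any script.
function isLettersOnly(string $text): bool
{
    // The u modifier makes PCRE read the pattern and subject as UTF-8.
    return preg_match('/^\pL+$/u', $text) === 1;
}

var_dump(isLettersOnly('hello'));  // bool(true)
var_dump(isLettersOnly('مرحبا'));  // bool(true), Arabic letters
var_dump(isLettersOnly('١٢٣'));    // bool(false), Arabic-Indic digits
var_dump(isLettersOnly('abc123')); // bool(false)
```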

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, %$#&*#, are already matched, but somehow the ones I mentioned before are not.
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apologies for asking two questions on the same subject. I hope this one is clear enough.
Use this:
[\W]+
It will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
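To match those kinds of special characters you can use their Unicode code points.
Example:
\u0000-\uFFFF
\x00-\xFF
The top form is the \uFFFF notation used by some other regex flavors. PCRE, which PHP's preg functions use, does not support it, so write \x{0000}-\x{FFFF} together with the u modifier instead. The bottom form covers only code points up to U+00FF.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.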
You can use the \p{S} character class (or \p{So}), which matches symbol characters (including characters such as ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the PCRE documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
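To see what \p{S} does and does not cover, here is a small sketch built around the pattern from the answer. The wrapper name isAllowedLogin is hypothetical and PHP 7 or later is assumed. Note that the bullet • (U+2022) is classified as punctuation, not as a symbol, so it would need \p{P} or an explicit entry in the class.

```php
<?php
// Hypothetical wrapper around the pattern from the answer:
// ASCII letters, digits, horizontal whitespace and Unicode symbols.
function isAllowedLogin(string $login): bool
{
    return preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login) === 1;
}

var_dump(isAllowedLogin('user ♥')); // bool(true), ♥ is a symbol (So)
var_dump(isAllowedLogin('user♦♠')); // bool(true)
var_dump(isAllowedLogin('user!'));  // bool(false), ! is punctuation
var_dump(isAllowedLogin('a•b'));    // bool(false), • is punctuation (Po)
```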

Match whole words in utf

I want to replace all occurrences of a with 5. Here is the code that works well:
$content=preg_replace("/\ba\b/","5", $content);
unless I have words like zapłać, where a is between non-ASCII letters, or zmarła, where a non-ASCII letter is followed by a at the end of the word. Is there an easy way to fix it?
The problem is that the predefined character class \w (and therefore \b) is ASCII-based by default, and depending on the PHP and PCRE version that may not change when the u modifier is used. (See regular-expressions.info; preg is PCRE in the columns.)
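You can use lookbehind and lookahead to do it (the u modifier makes PCRE read the subject as UTF-8 rather than byte by byte):
$content=preg_replace("/(?<!\p{L})a(?!\p{L})/u","5",$content);
This replaces "a" only if there is no letter before it and no letter after it.
\p{L}: any kind of letter from any language.
On PHP versions where the u modifier also makes \b Unicode-aware, adding it to the original pattern is enough:
$content=preg_replace("/\ba\b/u","5",$content);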
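The lookaround approach can be checked against the two problem words from the question. This is a minimal sketch, assuming PHP 7 or later; the function name replaceStandaloneA is hypothetical.

```php
<?php
// Hypothetical helper: replace "a" only when it is not adjacent to any letter.
function replaceStandaloneA(string $content): string
{
    // (?<!\p{L}) : no letter immediately before
    // (?!\p{L})  : no letter immediately after
    // u          : read pattern and subject as UTF-8
    return preg_replace('/(?<!\p{L})a(?!\p{L})/u', '5', $content);
}

echo replaceStandaloneA('a zapłać zmarła a'), "\n"; // 5 zapłać zmarła 5
echo replaceStandaloneA('a, b'), "\n";              // 5, b
```

Because the lookarounds only test for letters, an "a" next to a digit or underscore is still replaced; widen the class (for example to [\p{L}\p{N}_]) if that is not wanted.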

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L} on its own if you don't need a character set []. It matches any kind of letter from any language. With UTF-8 input, remember to add the u modifier.
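For the tokenizing use case, a rough sketch using [\pL_] with preg_match_all could look like the following. The function name tokenizeWords is hypothetical, PHP 7 or later is assumed, and treating the underscore as a word character simply mirrors \w.

```php
<?php
// Hypothetical tokenizer: returns runs of letters (any script) and underscores.
function tokenizeWords(string $text): array
{
    // With the u modifier, \pL matches a letter from any language.
    preg_match_all('/[\pL_]+/u', $text, $matches);
    return $matches[0];
}

print_r(tokenizeWords('שלום world, foo_bar!'));
// Tokens: שלום, world, foo_bar
```

Add \pN to the class if digits should be part of a token, and \pM if the text uses combining marks.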

Match Arabic/English Alphanumeric using Regex

I would like to have a regular expression that matches:
Arabic letters.
English alphanumeric.
3 Spaces maximum.
4 Underscores maximum.
Any order.
I tried various solutions but couldn't solve it.
Here is what I have now:
preg_match('#^([^\W_]*\s){0,3}[^\W_]*$#', $username)
The above expression allows:
3 spaces maximum
English alphanumerics
No underscore allowed
You can check whether your regex flavour supports \p{Arabic} or \p{InArabic}.
Also experiment with the mb_ereg_match() function: http://si2.php.net/manual/en/function.mb-ereg-match.php
If that doesn't work, there is no other option than explicitly writing all Arabic characters into the expression. Messy, but it does the job.
Since you are using PHP, you can first put all Arabic characters into a string variable and then add that variable to the regex, for the sake of manageability.
I don't know about Arabic characters, but the following regexp should match the others:
([a-zA-Z0-9]{1,})\s{0,3}_{0,4}
This will match
(Alphanumeric)(0-3 spaces)(0-4 underscores)
If there are more than 4 underscores, the extra ones will not be part of the match.
If there are more than 3 spaces, the part after the first 3 spaces will be ignored.
EDIT:
For Arabic letters: first declare a string containing all Arabic letters,
so you'll have
$arabic='all_arabic_letters';
Then your regexp string will be (note the / delimiters that preg_match requires, and the u modifier for UTF-8 input):
$regex='/[' . $arabic . ']{1,}([a-zA-Z0-9]{1,})\s{0,3}_{0,4}/u';
And match it as follows:
preg_match($regex, $username);
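An alternative sketch that covers all the stated requirements in one anchored pattern, assuming the PCRE build supports the \p{Arabic} script property and PHP 7 or later. The function name isValidUsername is hypothetical, and "space" is taken to mean the plain ASCII space. Negative lookaheads at the start cap the number of spaces and underscores, so the characters may appear in any order.

```php
<?php
// Hypothetical validator: Arabic letters, ASCII alphanumerics,
// at most 3 spaces and at most 4 underscores, in any order.
function isValidUsername(string $username): bool
{
    $pattern = '/\A'
        . '(?!(?:[^ ]* ){4})'      // reject if a 4th space exists
        . '(?!(?:[^_]*_){5})'      // reject if a 5th underscore exists
        . '(?:(?=\p{Arabic})\p{L}' // a letter that belongs to the Arabic script
        . '|[A-Za-z0-9 _])+'       // or an ASCII alphanumeric, space, underscore
        . '\z/u';
    return preg_match($pattern, $username) === 1;
}

var_dump(isValidUsername('محمد_ali 99')); // bool(true)
var_dump(isValidUsername('a b c d'));     // bool(true), 3 spaces
var_dump(isValidUsername('a b c d e'));   // bool(false), 4 spaces
var_dump(isValidUsername('a_b_c_d_e_f')); // bool(false), 5 underscores
```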
