PHP - regular expression to check if a string has Chinese characters - php

I have the string $str and I want to check whether its content has Chinese characters or not (true/false):
$str = "赕就可消垻,只有当所有方块都被消垻时才可以过关";
Can you please help me?
Thanks!
Adrian

You could use a Unicode character class (see http://www.regular-expressions.info/unicode.html):
preg_match("/\p{Han}+/u", $utf8_str);
This just checks for the presence of at least one Chinese character. You might want to expand on this if you want to match the complete string.
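The presence check above can be wrapped so that it returns a strict true/false, as the question asks. This is a minimal sketch, assuming PHP 7 or later and a UTF-8 encoded source file; the function name containsHan is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: true if $text contains at least one Han character.
function containsHan(string $text): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example when $text is not valid UTF-8), so compare against 1.
    return preg_match('/\p{Han}/u', $text) === 1;
}

var_dump(containsHan('只有当所有方块')); // bool(true)
var_dump(containsHan('hello world'));    // bool(false)
var_dump(containsHan('abc 张 def'));     // bool(true)
```

A single \p{Han} is enough here, because one matching character already answers the question; the + quantifier only matters if you want to capture the whole run.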

@mario's answer is right!
For Chinese characters you can also use this regex: /[\x{4e00}-\x{9fa5}]+/u
And don't forget the u modifier!
About the u modifier: reference
Thanks to mario.
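The explicit range and \p{Han} are not equivalent: the range \x{4e00}-\x{9fa5} covers only part of the CJK Unified Ideographs block, while \p{Han} follows the Unicode script property. A small sketch of the difference, assuming PHP 7 or later (for the "\u{...}" string escape) and a PCRE build with Unicode support; the variable names are illustrative only.

```php
<?php
// U+3400 is a CJK Extension A ideograph, outside the 4e00-9fa5 range.
$extensionA = "\u{3400}";
// U+5F20 (张) lies inside the range.
$common = "\u{5F20}";

$rangePattern  = '/[\x{4e00}-\x{9fa5}]/u';
$scriptPattern = '/\p{Han}/u';

var_dump(preg_match($rangePattern, $common));      // int(1)
var_dump(preg_match($scriptPattern, $common));     // int(1)
var_dump(preg_match($rangePattern, $extensionA));  // int(0)
var_dump(preg_match($scriptPattern, $extensionA)); // int(1)
```

Which one is appropriate depends on whether rarer ideographs should count as Chinese characters for your purposes.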

preg_match("/^\p{Han}{2,10}+$/u", $str);
Use the regex /^\p{Han}{2,10}+$/u, which:
allows Chinese characters only,
requires a minimum of 2 characters, and
allows a maximum of 10 characters.
You can change the minimum and maximum by changing {2,10} as needed.
\p and the u modifier are both essential; don't leave them out.
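The length-limited check above could be parameterized roughly as follows. This is a sketch, not the answerer's code: the function name isHanOnly and its parameters are hypothetical, and it uses \z instead of $ so that a trailing newline is not silently accepted.

```php
<?php
// Hypothetical helper: true if $text consists only of Han characters
// and its length (in characters, not bytes) is between $min and $max.
function isHanOnly(string $text, int $min = 2, int $max = 10): bool
{
    // Build the quantifier from the arguments, e.g. {2,10}.
    $pattern = sprintf('/^\p{Han}{%d,%d}\z/u', $min, $max);
    return preg_match($pattern, $text) === 1;
}

var_dump(isHanOnly('张伟'));                // bool(true)
var_dump(isHanOnly('张'));                  // bool(false), too short
var_dump(isHanOnly('张伟a'));               // bool(false), contains a Latin letter
var_dump(isHanOnly(str_repeat('张', 11)));  // bool(false), too long
```

Because of the u modifier, the quantifier counts characters rather than bytes, which is what makes {2,10} meaningful for multibyte text.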

This link to a previous question on identifying simplified or traditional Chinese might give you some ideas... you don't actually specify which you mean, and I don't know Chinese well enough to recognise the difference

Related

Why preg_match with option 'i' does not match? [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
If you want to replace an old pattern with a new one in Unicode text, you should write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5, as mentioned at php.net | Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me that key in "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but it didn't work on the remote server. I then found that php.net mentions PHP 4.3.5 for the u modifier, so I upgraded the PHP version and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
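The whole-word replacement above can be generalized to arbitrary search strings. A minimal sketch, assuming PHP 7 or later; the helper name replaceWholeWord is hypothetical, and preg_quote() is used so that characters in the search string are not interpreted as regex syntax.

```php
<?php
// Hypothetical helper: replace $word only where it appears as a whole word.
function replaceWholeWord(string $text, string $word, string $replacement): string
{
    // Escape regex metacharacters (and the / delimiter) in the search string.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/u';
    // Note: $replacement is still subject to backreference syntax ($1, \1).
    $result = preg_replace($pattern, $replacement, $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8 input).
    return $result ?? $text;
}

var_dump(replaceWholeWord('été étéx', 'été', 'NEW')); // string "NEW étéx"
var_dump(replaceWholeWord('a+b a+bc', 'a+b', 'X'));   // string "X a+bc"
```

With the u modifier, \b treats non-ASCII letters as word characters, which is why the accented word is recognized as a whole word here and why the same approach applies to Arabic text.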
First of all, your life would be a lot easier if you'd use single apostrophes instead of double quotes when writing these -- you need only one backslash. Second, combining marks \pM should also be included. If you find a character not matched please find out its Unicode code point and then you can use http://www.fileformat.info/info/unicode/ to figure out where it is. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when doing debugging with UTF-8 properties (don't forget to convert to hex before trying to look up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
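Putting the advice from these answers together (single quotes, the u modifier, and combining marks), a name validator might look like the sketch below. The function name isValidName is hypothetical, and the "\u{0301}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: letters (any script), combining marks,
// apostrophe, hyphen and space are allowed; nothing else.
function isValidName(string $name): bool
{
    return preg_match('/^[-\' \p{L}\p{M}]+$/u', $name) === 1;
}

var_dump(isValidName("Ăna-Maria O'Neil")); // bool(true)
var_dump(isValidName('张伟'));              // bool(true)
// "e" followed by U+0301 COMBINING ACUTE ACCENT: needs \p{M} to pass.
var_dump(isValidName("Jose\u{0301}"));     // bool(true)
var_dump(isValidName('R2D2'));             // bool(false), digits not allowed
```

Without \p{M}, the decomposed form of an accented letter would be rejected even though the precomposed form passes, which is easy to miss when testing only with typed input.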
For anyone else looking here and not getting this to work: please note that /u will not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Incosistent regex result for Thai characters across different PHP version
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

Regex blocking special characters

I'm using PHP version 5.3.27.
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, such as %$#&*#, are already matched, but somehow the ones I mentioned before are not.
Current regex:
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apologies for asking two questions on the same subject. I hope this one is clear enough for you.
Use this:
[\W]+
It will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
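To match those kinds of special characters you can use their Unicode code points.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top is a range of Unicode code points, which in PHP needs the \x{...} notation and the u modifier (the \uFFFF notation from other regex flavors is not supported by PCRE); the bottom covers single-byte values only.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.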
You can use the \p{S} character class (or \p{So}), which matches symbol characters (including characters such as ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
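Before relying on \p{S}, it can help to check which general category each character from the question actually falls into, since not every "special" character is a symbol. A small sketch, assuming a UTF-8 encoded source file; the helper name isSymbol is hypothetical.

```php
<?php
// Hypothetical helper: true if the string contains a character from the
// Unicode "Symbol" categories (Sm, Sc, Sk, So).
function isSymbol(string $char): bool
{
    return preg_match('/\p{S}/u', $char) === 1;
}

var_dump(isSymbol('♦')); // bool(true),  U+2666 is in category So
var_dump(isSymbol('☻')); // bool(true),  U+263B is in category So
var_dump(isSymbol('$')); // bool(true),  currency symbols are category Sc
var_dump(isSymbol('•')); // bool(false), U+2022 BULLET is punctuation (Po)
var_dump(isSymbol('%')); // bool(false), also punctuation (Po)
```

So a class built on \p{S} alone would treat • and % differently from ♦; adding \p{P} or explicit ranges is one way to cover punctuation as well.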

Regular expression that allows letters (like "ñ") from any language

I'm trying to let users use special characters from other languages such as Spanish or French. I originally had this:
"/[^A-Za-z0-9\.\_\- ]/i"
and then changed it to
"/[^\p{L}\p{N}\.\_\-\(\) ]/i"
but it still doesn't work. Letters such as "ñ" should be allowed. Thanks.
Revision:
I found that adding (*UTF8) at the beginning helps solve the problem. So I'm using the following code: "/(*UTF8)[^\p{L}A-Za-z0-9._- ]/i"
Revision:
After looking at the answers I decided to use: "/[^\p{Xwd}. -]/u". Thanks! (It works even with Chinese characters.)
For Latin-script languages you can use the \p{Latin} character class:
/[^\p{Latin}0-9._ -]/u
But if you want all other letters and digits:
/[^\p{Xwd}. -]/u
The "u" modifier indicates that the pattern and the subject string must be treated as UTF-8.
You could also look into specifying a Unicode range, i.e. [\w\x{00C0}-\x{024F}.-]+ with the u modifier (PCRE in PHP does not accept the \u00C0 notation), to include Latin extended letters. But it's hard to try and restrict characters to such a broad subset; what about Chinese, Vietnamese, etc.? I'm with Dagon on this one: best to allow anything.
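The negated class the asker settled on can be used as a "does this input contain anything disallowed" check. A minimal sketch, assuming a PCRE build that supports the \p{Xwd} property; the function name hasDisallowedChars is hypothetical.

```php
<?php
// Hypothetical helper: true if $input contains any character that is not
// a letter, digit, underscore, dot, space or hyphen.
function hasDisallowedChars(string $input): bool
{
    return preg_match('/[^\p{Xwd}. -]/u', $input) === 1;
}

var_dump(hasDisallowedChars('señor_ñ-1.txt')); // bool(false)
var_dump(hasDisallowedChars('张伟 2'));         // bool(false)
var_dump(hasDisallowedChars('a<b'));           // bool(true)
var_dump(hasDisallowedChars('user@example'));  // bool(true)
```

Keeping the hyphen at the end of the class avoids it being read as a range, which is the problem with the "._- " ordering in the earlier revision.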

Regex for description form input

A textarea is part of my form. The user has to write a short text and I want to validate it. For now I am using the following regex:
/^[0-9a-zA-ZäöüÄÖÜ_\-']+$/
Although I have included äöüÄÖÜ in the regex, it treats all words with äöü as invalid. Furthermore, it does not accept spaces.
Any ideas how to improve the regex?
Use a Unicode-aware regex:
/[\pL\pN_\-]+/u
The PCRE u modifier enables UTF-8 handling. You are also missing a space from the regex, and you can condense it a bit:
/^[0-9a-zäöü\- ]+$/ui
Though I'm not sure if 'i' will work with the capitals of the foreign characters.
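On the open question about 'i': when combined with the u modifier, case-insensitive matching also applies to non-ASCII letters, so the condensed pattern accepts ÄÖÜ as well. A small sketch to check this, assuming a UTF-8 encoded source file and a PCRE build with Unicode support; the function name isValidDescription is hypothetical.

```php
<?php
// Hypothetical helper using the condensed pattern from the answer above.
function isValidDescription(string $text): bool
{
    return preg_match('/^[0-9a-zäöü\- ]+$/ui', $text) === 1;
}

var_dump(isValidDescription('ÄÖÜ äöü 123'));   // bool(true)
var_dump(isValidDescription('Hallo Welt 42')); // bool(true)
var_dump(isValidDescription('foo_bar'));       // bool(false), underscore not in the class
```

Note that the condensed class drops the underscore and apostrophe from the original pattern, so they would need to be added back if they should remain valid.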
You may also want to include punctuation.
First, you might have an encoding issue, that's why äöüÄÖÜ are registered as invalid. I'm not a PHP user, so I can't answer your question directly, but taking a look at this page might help you. Also, using appropriate character classes could work better than explicitly writing all appropriate letters. Alas, this is also probably encoding configuration dependent.
Second, you need a space in your regex, so
/^[0-9a-z A-ZäöüÄÖÜ_\-']+$/ // note space after a-z
should work. Note what I wrote in last paragraph about using character classes. \w might be sufficient instead of a-zA-ZäöüÄÖÜ
You may just use \w to indicate all "word" characters (letters, digits, underscore). Note that \w only covers non-ASCII letters such as äöü when the u modifier is set, so the regex will be
/^[\w\-' ]+$/u
What text from the user are you considering to be "valid"?
