Regex for word characters in any language - php

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.

Try [\pL_] together with the u (UTF-8) modifier - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php

Try \p{L}. It matches any kind of letter from any language, and it also works on its own if you don't want to wrap it in a character set [].
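As a rough sketch of how the \p{L} suggestion could be used for tokenizing, the snippet below collects runs of letters, combining marks, numbers and underscores. The function name tokenize_words is hypothetical, the choice to include \p{M} and \p{N} is an assumption about what should count as a word character, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: split UTF-8 text into word tokens in any script.
function tokenize_words(string $text): array
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/[\p{L}\p{M}\p{N}_]+/u', $text, $matches);
    // Fall back to an empty list if matching failed (e.g. invalid UTF-8).
    return $matches[0] ?? [];
}

print_r(tokenize_words('Hello שלום мир_1'));
```

Dropping \p{N} or the underscore from the class narrows what counts as a token.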

Related

PHP regex to accept Japanese and english languages

I am trying to create a regex that accepts only letters or numbers from the English and Japanese languages. This is what I have tried:
preg_match('/(?![\n\r])[\x00-\x1F\x80-\xFF][^\x4e00-\x9fa0)]/u', $value)
But I am not getting the desired result. What might I be doing wrong?
You should use Unicode character properties.
You may also want to look at this website, which contains some other regex examples: http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
Updated the character list based on @Álvaro González's note about the three writing systems.
This regex should do what you expect:
preg_match('/[\p{L}\p{N}\p{Katakana}\p{Hiragana}\p{Han}]+/u', $value)
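\p{L} will match any letter, \p{N} any number, \p{Katakana} any Katakana character, and so on. Note that \p{L} already covers letters of every script, so this pattern also accepts text that is neither English nor Japanese.
You may need to add word delimiters (spaces, punctuation) to the accepted characters if you are not matching single words.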
The following check succeeds when the line is not made up entirely of Japanese characters (hiragana, katakana and common kanji):
if(!preg_match('/^[\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]+$/u', $line)){
// ...
}
You can find more in the document:
https://www.w3.org/International/questions/qa-forms-utf-8.html

Regex - Match only unicode alphabet not numbers

I'm using PHP and trying to write a regular expression that matches any letter in any language, but not numbers.
I've tried /\p{L}+/, but it matches Unicode letters and numbers too. I'm checking against Arabic and English. English digits don't pass, which is expected, but Arabic digits do pass, which is not.
Is there another regular expression that matches only letters in any language?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that you can use the u modifier, which has two functions:
it extends the classical shorthand character classes like \w and \d to Unicode characters (not only ASCII characters)
it forces the string to be read as a Unicode (UTF-8) string
So you can use: /\pL+/u
Note that in your particular case the first behavior is not needed, and you can switch on only the second behavior with: /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern)
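A small sketch of the /\pL+/u suggestion, anchored so that the whole string must consist of letters. The function name is_letters_only is hypothetical and the input is assumed to be valid UTF-8. Without the u modifier the bytes of a multi-byte character are examined one at a time, which is a plausible reason the Arabic digits slipped through in the question.

```php
<?php
// Hypothetical helper: true when the string consists solely of letters (any script).
function is_letters_only(string $s): bool
{
    // \pL is shorthand for \p{L}; \z anchors at the very end of the string.
    return preg_match('/^\pL+\z/u', $s) === 1;
}

foreach (['hello', 'مرحبا', '١٢٣', 'abc١٢٣'] as $sample) {
    echo $sample, ' => ', is_letters_only($sample) ? 'letters only' : 'rejected', PHP_EOL;
}
```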

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, %$#&*#, are already matched, but somehow the ones I mentioned before are not.
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.
Use this:
[\W]+
It will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters you can use their Unicode code points.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top form takes a full code point and needs the u modifier (PHP's PCRE does not accept the \uFFFF syntax); the bottom form only covers single-byte values.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.
You can use the \p{S} character class (or \p{So}), which matches symbol characters (including characters such as ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the PCRE documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
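As an illustration of the \p{S} idea, the sketch below detects symbols rather than allowing them. The function name contains_symbol_or_punct is hypothetical. \p{P} is added as an assumption because some of the characters in the question are not in the symbol category: the bullet • and characters such as % and # are classified as punctuation.

```php
<?php
// Hypothetical helper: true when the string contains any symbol or punctuation.
function contains_symbol_or_punct(string $s): bool
{
    // \p{S} = symbols (♦ ♠ ☻ $ ...), \p{P} = punctuation (• % # & * ...).
    return preg_match('/[\p{S}\p{P}]/u', $s) === 1;
}

var_dump(contains_symbol_or_punct('user♠'));
var_dump(contains_symbol_or_punct('user name'));
```

Note that the underscore counts as punctuation (category Pc), so a login such as user_1 would also be flagged by this check.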

Regular Expression Doesn't Work Properly With Turkish Characters

I wrote a regex that should extract the following patterns:
"çççoookkk gggüüüzzzeeelll" (it means vvveeerrryyy gggoooddd with the Turkish characters "ç" and "ü")
"ccccoookkk ggguuuzzzeeelll" (it means the same but with the English characters "c" and "u")
Here are the regular expressions I'm trying:
"\b[çc]+o+k+\sg+[üu]+z+e+l+\b" : this works with English characters but not with Turkish characters
"çok": finds "çok" but when I try "ç+o+k+" it doesn't work for "çççoookkk", it finds "çoookkk"
"güzel": finds "güzel" but when I try "g+ü+z+e+l+" it doesn't work for "gggüüüzzzeeelll"
"\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly
"[çc]ok\sg[uü]zel": I also tried this to get the "çok güzel" pattern, but it doesn't work either.
I think the problem might be using regex operators with Turkish characters. I don't know how I can solve this.
I am using http://www.myregextester.com to check if my regular expressions are correct.
I am using PHP to extract a specific pattern from tweets retrieved via the Twitter REST API.
Thanks,
You have not specified what programming language you are using, but in many of them the \b word-boundary assertion is ASCII-only by default.
Internally, \b is processed as a boundary between the \w and \W sets.
In turn, \w is by default equal to [a-zA-Z0-9_].
If you are not using any fancy space marks (you shouldn't), then consider using regular whitespace char classes (\s).
See this table (scroll down to Word Boundaries section) to check if your language supports Unicode for \b. If it says, "ascii", then it does not.
As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
See also: utf-8 word boundary regex in javascript
Further reading:
An excellent article about using Unicode characters in regular expressions
An article for word boundaries
List of Turkish Unicode code points
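Since the question mentions PHP, here is one possible sketch for that case. It assumes the source file and the tweets are UTF-8 and that ç and ü arrive as single precomposed characters. The function name find_cok_guzel is hypothetical. The u modifier makes PCRE treat ç and ü as single characters (without it, ç+ repeats only the last byte of ç), and lookarounds on \p{L} stand in for \b so the boundary test does not depend on how \w is defined.

```php
<?php
// Hypothetical helper: find stretched "çok güzel" / "cok guzel" in UTF-8 text.
function find_cok_guzel(string $text): array
{
    // (?<!\p{L}) and (?!\p{L}) require that no letter touches the match.
    $pattern = '/(?<!\p{L})[çc]+o+k+\s+g+[üu]+z+e+l+(?!\p{L})/u';
    preg_match_all($pattern, $text, $matches);
    return $matches[0] ?? [];
}

print_r(find_cok_guzel('bu çççoookkk gggüüüzzzeeelll ve ccccoookkk ggguuuzzzeeelll'));
```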

Match whole words in utf

I want to replace all occurrences of a with 5. Here is the code that works well:
$content=preg_replace("/\ba\b/","5", $content);
unless I have words like zapłać where a is between non standard characters, or zmarła where there is a Unicode (or non-ASCII) letter followed by a at the end of word. Is there any easy way to fix it?
The problem is that the predefined character class \w is ASCII based, and in older PHP versions that does not change when the u modifier is used. (See regular-expressions.info; preg is PCRE in the columns.)
You can use lookbehind and lookahead to do it:
$content=preg_replace("/(?<!\p{L})a(?!\p{L})/u","5",$content);
This will replace "a" only if there is no letter before it and no letter after it. The u modifier is needed so that multi-byte UTF-8 letters such as ł are seen as single characters.
\p{L}: any kind of letter from any language.
$content=preg_replace("/\ba\b/u","5",$content);
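A minimal sketch of the lookaround approach applied to the words from the question, assuming $content is valid UTF-8. The function name replace_standalone_a is hypothetical. On PHP builds where the u modifier also makes \b Unicode-aware, the shorter /\ba\b/u pattern should behave the same way on these inputs.

```php
<?php
// Hypothetical helper: replace "a" only when it stands alone as a word.
function replace_standalone_a(string $content): string
{
    // The lookarounds reject any "a" that has a letter directly before or after it.
    $result = preg_replace('/(?<!\p{L})a(?!\p{L})/u', '5', $content);
    // preg_replace returns null on error (e.g. invalid UTF-8); keep the input then.
    return $result ?? $content;
}

echo replace_standalone_a('a zapłać zmarła a'), PHP_EOL;
```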
