preg_match unicode does not work with some languages

I cannot validate text in the following languages with this regular expression:
/^[\p{L}\p{Nd}-_.]{1,20}$/u
Languages that do not work:
Bengali, Gujarati, Hindi, Marathi, Thai, Tamil, Telugu, Vietnamese
This happens when the pattern is used with PHP's preg_match.
What am I missing?

You're using the dash incorrectly. Inside a character class, a dash between two items is read as a range operator. If you want it to match a literal dash character, you need to either escape it (\-) or put it at the end of the character class.
Also, I'm not familiar with those languages, but I guess you need to account for combining marks as well, since \p{L} does not cover them:
/^[\p{L}\p{Nd}\p{M}_.-]{1,20}$/u
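To illustrate why \p{M} can matter here, the following is a minimal sketch rather than a definitive test. It assumes PHP 7 or later (for the "\u{...}" string escapes) and uses the Hindi word "हिंदी" as a sample input chosen for this example. The word consists of two letters and three combining marks, so a class without \p{M} rejects it.

```php
<?php
// "हिंदी" spelled with escapes: U+0939 and U+0926 are letters,
// U+093F, U+0902 and U+0940 are combining marks (category M).
$hindi = "\u{0939}\u{093F}\u{0902}\u{0926}\u{0940}";

// Same class as in the question, with the dash moved to the end.
$withoutMarks = '/^[\p{L}\p{Nd}_.-]{1,20}$/u';

// The pattern suggested in the answer, which also accepts marks.
$withMarks = '/^[\p{L}\p{Nd}\p{M}_.-]{1,20}$/u';

var_dump(preg_match($withoutMarks, $hindi)); // int(0)
var_dump(preg_match($withMarks, $hindi));    // int(1)
```

Note that {1,20} counts code points, so each combining mark uses up one of the 20 positions.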

The problem doesn't come from your regex (except that a literal - should be placed at the beginning or at the end of a character class). Note that your pattern can be shortened to:
/^[\w.-]{1,20}$/u
or
/^[\p{Xan}.-]{1,20}$/u
if you want to remove the underscore.

Related

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, %$#&*#, are already matched, but somehow the ones I mentioned first are not.
Current regex:
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apologies for asking two questions on the same subject. I hope this one is clear enough for you.

Use this:
[\W]+
It will match any non-word character.

Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters, you can use their Unicode code points.
Example:
\u0000-\uFFFF
\x00-\xFF
The top is the four-digit notation used by JavaScript and similar flavours; the bottom is the two-digit hexadecimal notation, which only reaches code point 255. PHP's PCRE does not accept \u; with the /u modifier, write \x{0000}-\x{FFFF} instead.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.

You can use the \p{S} character class (or \p{So}), which matches symbol characters (including characters such as ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities, check the PCRE documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
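As a rough sketch of how the \p{S} pattern above behaves, the snippet below wraps it in a helper. The function name isAllowedLogin is hypothetical, and the code assumes PHP 7 or later (type declarations and "\u{...}" escapes) and UTF-8 input. It also shows that % belongs to the punctuation category, not the symbol category, so \p{S} does not cover it.

```php
<?php
// Hypothetical helper: true when $login contains only ASCII letters, digits,
// horizontal whitespace (\h) or Unicode symbols (category S).
function isAllowedLogin(string $login): bool
{
    return preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login) === 1;
}

$heart = "\u{2665}"; // BLACK HEART SUIT, general category So

var_dump(isAllowedLogin('abc 123'));      // bool(true)
var_dump(isAllowedLogin('abc' . $heart)); // bool(true)
var_dump(isAllowedLogin('abc%'));         // bool(false): % is punctuation (Po)
```

To reject such symbols instead of accepting them, the same category can be tested on its own, for example with preg_match('/\p{S}/u', $login).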

Regular Expression Doesn't Work Properly With Turkish Characters

I wrote a regex that should extract the following patterns:
"çççoookkk gggüüüzzzeeelll" (it means "vvveeerrryyy gggoooddd", written with the Turkish characters "ç" and "ü")
"ccccoookkk ggguuuzzzeeelll" (it means the same, but written with the English characters "c" and "u")
Here are the regular expressions I'm trying:
"\b[çc]+o+k+\sg+[üu]+z+e+l+\b": this works with the English characters but not with the Turkish ones
"çok": finds "çok", but "ç+o+k+" doesn't work for "çççoookkk"; it finds "çoookkk"
"güzel": finds "güzel", but "g+ü+z+e+l+" doesn't work for "gggüüüzzzeeelll"
"\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly
"[çc]ok\sg[uü]zel": I also tried this to get the "çok güzel" pattern, but it doesn't work either.
I think the problem might be using regex operators with Turkish characters. I don't know how to solve this.
I am using http://www.myregextester.com to check whether my regular expressions are correct.
I am using PHP to get a specific pattern from tweets found via the Twitter REST API.
Thanks,

You have not specified what programming language you are using, but in many of them, the \b word-boundary assertion only works with plain ASCII.
Internally, \b is processed as a boundary between the \w and \W sets.
In turn, \w is by default equal to [a-zA-Z0-9_].
If you are not using any fancy space marks (you shouldn't), then consider using the regular whitespace character class (\s).
See this table (scroll down to the Word Boundaries section) to check whether your language supports Unicode for \b. If it says "ascii", then it does not.
As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
See also: utf-8 word boundary regex in javascript
Further reading:
An excellent article about using Unicode characters in regular expressions
An article for word boundaries
List of Turkish Unicode code points
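The "çoookkk" result reported in the question is consistent with PHP matching the pattern byte by byte: in UTF-8, "ç" is two bytes, so without the /u modifier the + applies only to the second byte. The sketch below, which assumes that both the source file and the subject string are UTF-8 encoded, compares byte mode with /u mode using the question's own patterns.

```php
<?php
// Assumes this file is saved as UTF-8.
$subject = 'çççoookkk gggüüüzzzeeelll';

// Byte mode: "ç" is the bytes C3 A7, and "+" repeats only the byte A7.
preg_match('/ç+o+k+/', $subject, $byteMode);

// UTF-8 mode: "ç" is a single character, so "+" repeats the whole character.
preg_match('/ç+o+k+/u', $subject, $utf8Mode);

// First pattern from the question with /u added, so that \b and \s
// also follow Unicode rules.
preg_match('/\b[çc]+o+k+\sg+[üu]+z+e+l+\b/u', $subject, $full);

echo $byteMode[0], "\n"; // çoookkk
echo $utf8Mode[0], "\n"; // çççoookkk
echo $full[0], "\n";     // çççoookkk gggüüüzzzeeelll
```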

JavaScript/PHP Regular Expression

I'm trying to match first and last names with something like this:
$pattern = '/[a-zA-Z\-]{3,30} +[a-zA-Z]+/';
This works great, except when I have a first name like Mélissa Smith.
My match becomes lissa Smith.
How do I match all special characters like é?

In JavaScript, you can add a Unicode character range to A-Za-z:
"Mélissa Smith".match( /[A-Za-z\u0080-\uffff-]{3,30} +[A-Za-z\u0080-\uffff]+/ )
equals: ["Mélissa Smith"]

Put the regex into Unicode mode with the /u modifier and use an appropriate Unicode character class instead of hardcoding just Latin letters:
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';
I also anchored the pattern between ^ and $ because otherwise it could end up matching things you didn't intend it to.
You have to keep in mind that when you do this, the input (as well as the pattern itself) must be encoded in UTF-8.
However, it has to be said that naively parsing names like this is not going to give you very good results. People's full names are way too involved for something this simple to work across the board.
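A short sketch of the /u pattern from this answer in use. The sample names are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later. The last case shows one of the limits mentioned above: a decomposed "é" (the letter e followed by a combining accent) is not matched by \pL alone.

```php
<?php
// Pattern from the answer; pattern and input are assumed to be UTF-8.
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';

$precomposed = "M\u{00E9}lissa Smith";  // é as the single code point U+00E9
$decomposed  = "Me\u{0301}lissa Smith"; // e followed by COMBINING ACUTE ACCENT

var_dump(preg_match($pattern, $precomposed));      // int(1)
var_dump(preg_match($pattern, 'Jean-Luc Picard')); // int(1)
var_dump(preg_match($pattern, 'Bob1 Smith'));      // int(0): digit in the first name
var_dump(preg_match($pattern, $decomposed));       // int(0): U+0301 is a mark, not a letter
```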

Try using the POSIX class [:alpha:] inside the brackets, as in [[:alpha:]-], instead of [a-zA-Z-] to catch the characters. [:alpha:] can also catch accented letters; in PHP with UTF-8 input, that needs the /u modifier.
http://www.regular-expressions.info/posixbrackets.html

Regular Expressions: How to Express \w Without Underscore

Is there a concise way to express:
\w but without _
That is, "all characters included in \w, except _"
I'm asking this because I'm looking for the most concise way to express domain name validation. A domain name may include lowercase and uppercase letters, numbers, period signs and dashes, but no underscores. \w includes all of the above, plus an underscore. So, is there any way to "remove" an underscore from \w via regex syntax?
Edited: I'm asking about regex as used in PHP.
Thanks in advance!

Use the following character class (in Perl):
[^\W_]
\W is the same as [^\w], so this class matches every \w character except the underscore.
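Since the question was edited to ask about PHP, here is a minimal sketch of the same [^\W_] class with preg_match. The sample strings and the domain-style pattern are assumptions made for this illustration, not a complete domain name validator.

```php
<?php
// [^\W_] reads as "not a non-word character and not an underscore".
$wordNoUnderscore = '/^[^\W_]+$/';

var_dump(preg_match($wordNoUnderscore, 'example123')); // int(1)
var_dump(preg_match($wordNoUnderscore, 'my_example')); // int(0)

// Hypothetical domain-style check: the same class, plus period and dash.
$domainChars = '/^(?:[^\W_]|[.-])+$/';

var_dump(preg_match($domainChars, 'sub-domain.example.com')); // int(1)
var_dump(preg_match($domainChars, 'sub_domain.example.com')); // int(0)
```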

You could use a negative lookahead: (?!_)\w
However, I think writing [a-zA-Z0-9.-] is more readable.

To be on the safe side, we usually use a character class:
[a-zA-Z0-9.-]
The regex "fragment" above matches letters of the English alphabet and digits, plus the period . and the dash -. It should work even with the most basic regex support.
Shorter may be better, but only if you know exactly what it represents.
I don't know what language you are using. In a lot of engines, \w is equivalent to [a-zA-Z0-9_] (some require "ASCII mode" for this). However, some engines have Unicode support for regex and may extend \w to match Unicode characters.

If my understanding is right, \w means [A-Za-z0-9_]; periods and dashes are not included.
Info:
http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes
So I guess what you want is [a-zA-Z0-9.-]

Some regex flavours have a negative lookbehind syntax you might use:
\w(?<!_)

I would start with [^_], and then think about which other characters I need to deny. If you need to filter keyboard input, it's quite simple to enumerate all the unwanted characters.

You can write something like this:
/([^\w]|_)/u
If you use preg_filter with this pattern and an empty replacement, every character outside \w, plus the underscore, is filtered out, so only the \w characters other than _ remain.

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.

Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php

Try \p{L}. It matches any kind of letter from any language. You can use it on its own if you don't want to put it inside a character set [].
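For the tokenizing use case in the question, \p{L} can be combined with preg_match_all. This is only a sketch: the sample text is invented, the Hebrew word is written with "\u{...}" escapes (PHP 7 or later), and scripts that rely on combining marks would likely also need \p{M} in the class.

```php
<?php
// Sample text mixing Hebrew and English; the escapes spell the word "שלום".
$shalom = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}";
$text = $shalom . ' world, 123!';

// Each run of letters becomes one token; /u makes PCRE read the input as UTF-8.
preg_match_all('/\p{L}+/u', $text, $matches);
$tokens = $matches[0];

print_r($tokens); // two tokens: the Hebrew word and "world"
```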

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
