Regex - Match only unicode alphabet not numbers - php

I'm using PHP, and trying to write a regular expression that matches any alphabet in any language but not numbers.
I've tried /\p{L}+/ But it matches unicode alphabets and numbers too. I'm checking against Arabic and English languages. English numbers doesn't pass which is normal, but Arabic numbers pass which is not normal.
Is there another regular expression that matches only alphabets in any language ?

The regex engine need to know that the target string is an unicode string (to avoid interpretation errors). To do that you can use the u modifier, that has two functions:
it expands classical shorthand character classes like \w \d to unicode characters (and not only ascii characters)
it forces the string to be seen as an unicode string
So you can use: /\pL+/u
Note that in your particular case, the first behavior is not needed, but you can only switch on the second behavior with: /(*UTF8)\pL+/ ((*UTF8) must be placed at the very begining of the pattern)

Related

PHP - regex to allow unicode charcaters

I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=##£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=##$£€!?:;%,.\\'\\\"()&+\\/-]/u
This however does work with the £ or € characters, nothing is returned, but I need to accept these characters, I have tried escaping them but that doesn't work.
Also I want to create an regex that is similar to just A-Za-z but will allow accented characters, how can I do that?
From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that first you have to make sure the input string is proper UTF-8 text.
Secondly, have you heard of unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example you could use \p{S} to match all currency symbols, or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/.
This will though match pretty much nothing, as it allows pretty much all characters to be used - ^ at the start of a regex character class (something between [ and ]) means "everything that is not what is in this class will be matched".
On top of that, your regex will only match input that has a length of exactly one - if you want to match everything, you should begin adding a + after your closing ] to keep matching characters until the pattern fails.
So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.

Regular Expression Doesn't Work Properly With Turkish Characters

I write a regex that should extracts following patterns;
"çççoookkk gggüüüzzzeeelll" (it means vvveeerrryyy gggoooddd with turkish characters "ç" and "ü")
"ccccoookkk ggguuuzzzeeelll" (it means the same but with english characters "c" and "u")
here is the regular expressions i'm trying;
"\b[çc]+o+k+\sg+[üu]+z+e+l+\b" : this works in english but not in turkish characters
"çok": finds "çok" but when i try "ç+o+k+" doesn't work for "çççoookkk", it finds "çoookkk"
"güzel": finds "güzel" but when i try "g+ü+z+e+l+" doesn't work for "gggüüüzzzeeelll"
"\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly
"[çc]ok\sg[uü]zel": I also tried this to get "çok güzel" pattern but doesn't work neither.
I thing the problem might be using regex operators with turkish characters. I don't know how can i solve this.
I am using http://www.myregextester.com to check if my regular expressions are correct.
I am using Php programming language to get a specific pattern from searched tweets via Twitter Rest Api.
Thanks,
You have not specified what programming language you are using, but in many of them, the \b character class can only be used with plain ASCII encoding.
Internally, \b is processed as a boundary between \w and \W sets.
In turn, \w is equal to [a-zA-Z0-9_].
If you are not using any fancy space marks (you shouldn't), then consider using regular whitespace char classes (\s).
See this table (scroll down to Word Boundaries section) to check if your language supports Unicode for \b. If it says, "ascii", then it does not.
As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
Se also: utf-8 word boundary regex in javascript
Further reading:
An excellent article about using Unicode characters in regular expressions
An article for word boundaries
List of Turkish Unicode code points

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L}. It matches any kind of letter from any language. If you don't want to use char set [].

Generating all Unicode characters not in ASCII scheme in PHP?

This regular expression is supposed to match all non-ASCII characters, 0-128 code points:
/[^x00-x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range maybe simple with range(0, 1114111). Then I should covert each decimal number to hexadecimal with dechex() function.
After that, how can i convert the hexadecimal number to the actual character? And how can exclude characters already in ASCII scheme?
It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier then first of all you must use UTF-8 encoding for both the regular expression and the subject and the regex engine will automatically interpret legal UTF-8 byte sequences as just one character. In this mode the regular expression [^x00-x7F] will match all characters outside the Latin-1 supplement block, including those with code points greater than 255. You will also need to generate the UTF-8 representations of each character (given its code point) manually.
If you do not use the /u modifier then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^x00-x7F] regex (because it's only going to be matching random bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII character codepoints, just begin from the x80 hex codepoint (skipping x00 to x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.

How to determine if a non-English string is in upper case?

I'm using the following code to check for a string where all the characters are upper-case letters:
if (preg_match('/^[\p{Lu}]+$/', $word)) {
This works great for English, but fails to detect letters with accents, Russian letters, etc. Is \p{Lu} supposed to work for all languages? Is there a better approach?
A special option is the /u which turns on the Unicode matching mode, instead of the default 8-bit matching mode. You should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or scripts. PHP will interpret '/regex/u' as a UTF-8 string rather than as an ASCII string.
http://www.regular-expressions.info/php.html --
using function u can do change in Uppercase of String ....
Function Available here :
string name="manish niitian";
console.Writeline("Your String in Uppercase is : "+name.UPPERCASE());

Categories