How to determine if a non-English string is in upper case? - php

I'm using the following code to check for a string where all the characters are upper-case letters:
if (preg_match('/^[\p{Lu}]+$/', $word)) {
This works great for English, but fails to detect letters with accents, Russian letters, etc. Is \p{Lu} supposed to work for all languages? Is there a better approach?

A special option is the /u which turns on the Unicode matching mode, instead of the default 8-bit matching mode. You should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or scripts. PHP will interpret '/regex/u' as a UTF-8 string rather than as an ASCII string.
http://www.regular-expressions.info/php.html
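Applied to the check from the question, a minimal sketch could look like this. The helper name isAllUppercase is made up for illustration, and the code assumes the input is valid UTF-8:

```php
<?php
// Hypothetical helper (not from the question): true when $word consists
// only of upper-case letters, in any script.
function isAllUppercase(string $word): bool
{
    // \p{Lu} is the Unicode category "Letter, uppercase". The u modifier
    // makes PCRE read both the pattern and the subject as UTF-8.
    return preg_match('/^\p{Lu}+$/u', $word) === 1;
}

var_dump(isAllUppercase('HELLO'));       // bool(true)
var_dump(isAllUppercase('ПРИВЕТ'));      // bool(true)
var_dump(isAllUppercase("\u{C9}COLE"));  // bool(true), \u{C9} is É
var_dump(isAllUppercase('Привет'));      // bool(false)
```

The only change against the original pattern is the trailing u; the square brackets around \p{Lu} were redundant and are dropped.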

You can use a function to convert a string to upper case, for example:
$name = "manish niitian";
echo "Your string in uppercase is: " . mb_strtoupper($name, 'UTF-8');

Related

When do I need u-modifier in PHP regex?

I know that PHP PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier to handle input and regex as UTF-8.
But do I really need this always? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or something similar.
For example
preg_match('/^[\da-f]{40}$/', $string); to check if string has format of a SHA1 hash
preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every char that is non-ASCII letter or number
preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting inner content of +(XYZ)
These regexes contain only single-byte ASCII symbols, so they should work on every input, regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but as I cut off some ASCII chars at the beginning and end of the string, this should work on UTF-8 also, correct?
Can anyone tell me if I'm overlooking something?
There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
The second expression may give you more spacers than you expect; for example:
echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).
This is more dangerous:
echo preg_replace('/^(.)/', "0", "💩");
// => 0???
Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
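To make the byte-versus-character difference concrete, here is a small sketch that runs the same patterns with and without the modifier. The variable names are illustrative only, and the "\u{...}" string escape assumes PHP 7 or later:

```php
<?php
$pile = "\u{1F4A9}"; // U+1F4A9, four bytes in UTF-8: F0 9F 92 A9

// Byte mode: the dot sees four separate one-byte "characters".
$bytes = preg_match_all('/./s', $pile);
// UTF-8 mode: the same subject is a single character.
$chars = preg_match_all('/./su', $pile);

// The replacement from the answer above, without and with the modifier.
$withoutU = preg_replace('/[^a-zA-Z0-9]/', '0', $pile);
$withU    = preg_replace('/[^a-zA-Z0-9]/u', '0', $pile);

echo "$bytes $chars $withoutU $withU\n"; // 4 1 0000 0
```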
The Unicode modifier u allows proper detection of accented characters, which are always multibyte in UTF-8.
preg_match('/([\w ]{2,})/', 'baz báz báž', $match);
// $match[0] = "baz b" ... wrong, accented/multibyte chars silently ignored
preg_match('/([\w ]{2,})/u', 'baz báz báž', $match);
// $match[0] = "baz báz báž" ... correct
Use it also for safe detection of whitespaces:
preg_replace('/\s+/u', ' ', $txt); // works reliably e.g. with EOLs (line endings)
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
You will need this when you have to match Unicode characters, such as Korean or Japanese text.
In other words, if your patterns and strings contain only ASCII characters (such as plain English), you don't need to use this flag.
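The manual's remark about invalid subjects is easy to miss in practice, because the call fails quietly. A small sketch, with a made-up sample string, of how that failure can be detected:

```php
<?php
// "café" encoded as ISO-8859-1: the lone E9 byte is not valid UTF-8.
$invalid = "caf\xE9";

// With /u the whole call fails instead of matching part of the string.
$withU = preg_match('/\w+/u', $invalid, $m);
$error = preg_last_error();

var_dump($withU);                          // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)

// Without /u the engine works on bytes and still finds a match.
$withoutU = preg_match('/\w+/', $invalid, $m);
var_dump($withoutU);                       // int(1)
```

Comparing the return value against false with === matters here, since 0 (no match) and false (error) are otherwise easy to confuse.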

Regex - Match only unicode alphabet not numbers

I'm using PHP and trying to write a regular expression that matches any alphabet in any language, but not numbers.
I've tried /\p{L}+/, but it matches Unicode letters and numbers too. I'm checking against Arabic and English. English numbers don't pass, which is expected, but Arabic numbers do pass, which is not.
Is there another regular expression that matches only letters in any language?
The regex engine needs to know that the target string is a Unicode string (to avoid interpretation errors). To do that you can use the u modifier, which has two functions:
it extends the classic shorthand character classes like \w and \d to Unicode characters (not only ASCII characters)
it forces the string to be read as a Unicode (UTF-8) string
So you can use: /\pL+/u
Note that in your particular case the first behavior is not needed; you can switch on only the second behavior with /(*UTF8)\pL+/ ((*UTF8) must be placed at the very beginning of the pattern).
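A short sketch of both behaviors, using a made-up sample string that mixes Latin letters, Arabic-Indic digits and Arabic letters (written as "\u{...}" escapes, which assumes PHP 7 or later):

```php
<?php
$digits = "\u{0661}\u{0662}\u{0663}";          // Arabic-Indic digits 1, 2, 3
$word   = "\u{0639}\u{0631}\u{0628}\u{064A}";  // an Arabic word, letters only
$text   = "abc $digits $word";

// Second behavior: the subject is read as UTF-8, so \pL sees whole letters
// and skips the digits (Unicode category Nd, not L).
preg_match_all('/\pL+/u', $text, $letters);
var_dump($letters[0] === ['abc', $word]);      // bool(true)

// First behavior: with u, \d also covers non-ASCII decimal digits.
$digitsWithU    = preg_match('/^\d+$/u', $digits);
$digitsWithoutU = preg_match('/^\d+$/', $digits);
var_dump($digitsWithU, $digitsWithoutU);       // int(1) int(0)
```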

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L}, which matches any kind of letter from any language, if you don't want to use a character class [].
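For the tokenizing use case, a minimal sketch built on the [\pL_] suggestion might look like this. The function name tokenizeWords is hypothetical, and the u modifier is added on the assumption that the input is UTF-8:

```php
<?php
// Hypothetical tokenizer: returns runs of letters (any script) and underscores.
function tokenizeWords(string $text): array
{
    // On failure (for example invalid UTF-8) $matches may have no entry 0,
    // so fall back to an empty list.
    preg_match_all('/[\pL_]+/u', $text, $matches);
    return $matches[0] ?? [];
}

$hebrew = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}"; // a Hebrew word, letters only
$tokens = tokenizeWords("$hebrew, world! Привет_мир 42");

var_dump($tokens === [$hebrew, 'world', 'Привет_мир']); // bool(true)
```

Digits are dropped by this class; adding \pN inside the brackets would keep them, if that is what the tokenizer needs.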

Generating all Unicode characters not in ASCII scheme in PHP?

This regular expression is supposed to match all non-ASCII characters, that is, everything outside the 0-127 code point range:
/[^\x00-\x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range may be as simple as range(0, 1114111). Then I should convert each decimal number to hexadecimal with the dechex() function.
After that, how can I convert the hexadecimal number to the actual character? And how can I exclude characters already in the ASCII scheme?
It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier, then first of all you must use UTF-8 encoding for both the regular expression and the subject, and the regex engine will automatically interpret legal UTF-8 byte sequences as single characters. In this mode the regular expression [^\x00-\x7F] will match all characters outside the Basic Latin (ASCII) block, including those with code points greater than 255. You will also need to generate the UTF-8 representation of each character (given its code point) manually.
If you do not use the /u modifier, then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^\x00-\x7F] regex (because it's only going to be matching individual bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII code points, just begin from the \x80 hex code point (skipping \x00 to \x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
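If you still want to run the experiment, here is a rough sketch for the /u case. It assumes the mbstring extension with mb_chr() (PHP 7.2 or later) instead of hand-written UTF-8 encoding, and the function name findMismatches is made up for illustration:

```php
<?php
// Hypothetical checker: returns the code points in [$from, $to] for which
// /[^\x00-\x7F]/u disagrees with the rule "code point above 0x7F".
function findMismatches(int $from, int $to): array
{
    $mismatches = [];
    for ($cp = $from; $cp <= $to; $cp++) {
        // mb_chr() builds the UTF-8 string for a code point; it is expected
        // to return false for values that are not valid characters
        // (for example surrogates), which are skipped here.
        $char = mb_chr($cp, 'UTF-8');
        if ($char === false) {
            continue;
        }
        $matched  = preg_match('/[^\x00-\x7F]/u', $char) === 1;
        $expected = $cp > 0x7F;
        if ($matched !== $expected) {
            $mismatches[] = $cp;
        }
    }
    return $mismatches;
}

// A small slice; pass 0x0000 and 0x10FFFF to cover the whole range.
var_dump(findMismatches(0x0000, 0x07FF) === []); // bool(true)
```

Looping over the code points directly avoids both the dechex() step and the large array that range(0, 1114111) would allocate.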

PHP regex question: how to match non-ASCII letters in latin1_swedish_ci charset?

I have this string : Verbesserungsvorschläge which I think is in German. Now I want to match it with a regex in php. To be more general, I want to match such characters like German which are not 100% in the ASCII set.
Thanks.
If you're working with an 8-bit character set, the regex [\x80-\xFF] matches any character that is not ASCII. In PHP that would be:
if (preg_match('/[\x80-\xFF]/', $subject)) {
# String has non-ASCII characters
} else {
# String is pure ASCII or empty
}
preg_match_all('~[^\x00-\x7F]~u', 'Verbesserungsvorschläge', $matches);
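It's a world of hurt, but you can try using the hex value, as in "/Verbesserungsvorschl\xe4ge/" for simple extended characters in a Latin-1 string (in UTF-8 the same character is the two bytes \xc3\xa4).
The hex values can be found in a table or determined on the fly with
echo dechex( ord( 'ä' ) ); // note: ord() only returns the first byte, so the result depends on the encoding of the source file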
For full unicode, you can use /u as a modifier. See http://www.php.net/manual/en/regexp.reference.unicode.php and other pages. My understanding is that unicode will work better in PHP version 6.
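Which of these approaches works depends on the encoding of the string at hand. The following sketch compares the two encodings side by side; it assumes the mbstring extension for the conversion and PHP 7 or later for the "\u{E4}" escape:

```php
<?php
$utf8   = "Verbesserungsvorschl\u{E4}ge";                     // ä stored as C3 A4
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');  // ä stored as E4

echo bin2hex("\u{E4}"), "\n";       // c3a4
echo dechex(ord("\u{E4}")), "\n";   // c3 (ord() looks at the first byte only)

// Byte-level patterns suit the 8-bit string.
$byteClass = preg_match('/[\x80-\xFF]/', $latin1);
$hexEscape = preg_match('/Verbesserungsvorschl\xe4ge/', $latin1);

// The /u pattern suits the UTF-8 string and rejects the Latin-1 one,
// because a lone E4 byte is not valid UTF-8.
$utf8Class      = preg_match('/[^\x00-\x7F]/u', $utf8);
$utf8OnLatin1   = preg_match('/[^\x00-\x7F]/u', $latin1);

var_dump($byteClass, $hexEscape, $utf8Class, $utf8OnLatin1);
// int(1) int(1) int(1) bool(false)
```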
