Hebrew regex match not working in php - php

This is my current regex code to validate English letters and numbers:
const CANONICAL_FMT = '[0-9a-z]{1,64}';
public static function isCanonical($str)
{
return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);
}
Pretty straightforward. Now I want to change it to accept only Hebrew letters, underscores,
and numbers, so I changed the code to:
public static function isCanonical($str)
{
return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);
}
But it doesn't work. I took the Hebrew Unicode ranges from Wikipedia.
What is wrong here?

I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:
return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);
Working example: http://ideone.com/gSlmh
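Note that with the u modifier, \w can also match non-Hebrew letters, so the pattern above accepts more than the question asks for. A minimal sketch of a stricter check, assuming the goal is exactly "Hebrew, underscore and ASCII digits" and that the 1 to 64 length limit from the original constant should be kept (the function name isHebrewCanonical is made up for illustration):

```php
<?php
// Hypothetical helper, not from the original post.
function isHebrewCanonical(string $str): bool
{
    // \p{Hebrew} matches any character in the Hebrew script (this also
    // includes Hebrew points and punctuation, not only the letters).
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // \z anchors at the very end of the subject, so a trailing newline is rejected.
    return preg_match('/^[\p{Hebrew}0-9_]{1,64}\z/u', $str) === 1;
}

var_dump(isHebrewCanonical("\u{05E9}\u{05DC}\u{05D5}\u{05DD}_123")); // bool(true)
var_dump(isHebrewCanonical('shalom'));                               // bool(false)
```

Comparing against 1 turns the int|false result of preg_match() into a real boolean.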

If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier (quoting the manual):
This modifier turns on additional functionality of PCRE that is
incompatible with Perl. Pattern strings are treated as UTF-8.
In your case, instead of using the following regex:
/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i
I suppose you'd be using:
/^(?:[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}]+|[\w]+)$/iu
(Note the additional u at the end. PCRE also does not support the \uXXXX escape, so the code points are written as \x{XXXX}.)

You need the /u modifier to add support for UTF-8.
Make sure you convert your Hebrew input to UTF-8 if it's in some other code page or character set.
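One way to tell whether the input really is UTF-8 before matching is to lean on the u modifier itself: preg_match() returns false (not 0) when the subject is not valid UTF-8. A small sketch, with a made-up helper name and sample byte strings chosen for illustration:

```php
<?php
// Hypothetical helper: true when $str is valid UTF-8.
function isValidUtf8(string $str): bool
{
    // An empty pattern matches any valid subject; with /u, invalid UTF-8
    // makes preg_match() return false instead of 1.
    return preg_match('//u', $str) === 1;
}

$utf8 = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D"; // the word "shalom" in Hebrew, UTF-8 bytes
$iso  = "\xF9\xEC\xE5\xED";                 // the same word as ISO-8859-8 bytes

var_dump(isValidUtf8($utf8)); // bool(true)
var_dump(isValidUtf8($iso));  // bool(false)
```

If the check fails, the input would have to be converted first, for example with mb_convert_encoding() or iconv(), provided the source encoding is known.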

Related

PHP regular expression any kind of letter from any language

I'm trying to create my own routing in PHP using regex.
My example returns true when the name is in Latin script, but returns false when the name is in Arabic:
preg_match('#^(en/users/(?<name>[\p{L}\p{Nd}\_\-\+]+))$#', 'en/users/علي+عثمان')
What am I doing wrong?
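\p{L} already covers Arabic letters; what is missing is the pattern modifier u, which enables UTF-8 support. Without it, the pattern is matched against individual bytes rather than characters. The \p{Ll} and \p{Arabic} classes in the pattern below are redundant next to \p{L}, but harmless.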
Like so:
preg_match('#^(en/users/([\p{L}\p{Ll}\p{Arabic}\p{Nd}\_\-\+]+))$#u', 'en/users/علي+عثمان')
Working example: https://ideone.com/Zwrnpg
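For reference, a sketch that keeps the named group from the question and only adds the u modifier (the variable names are illustrative):

```php
<?php
// Same route pattern as in the question, with the u modifier added.
// Inside a character class, _ and + need no escaping, and a trailing - is literal.
$pattern = '#^en/users/(?<name>[\p{L}\p{Nd}_+-]+)$#u';
$path    = "en/users/\u{0639}\u{0644}\u{064A}+\u{0639}\u{062B}\u{0645}\u{0627}\u{0646}";

$matched = preg_match($pattern, $path, $m);
var_dump($matched); // int(1)

// The captured user name is available under the group name.
$name = $m['name'];
```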

Why is ctype_alnum unhelpful in matching culture-agnostic alphanumerics?

Let's suppose that I have a text in a variable called $text and I want to validate it, so that it can contain spaces, underscores, dots, and any letters and digits from any language. Since I am a total noob with regular expressions, I thought I could work around learning them, like this:
if (!ctype_alnum(str_replace(".", "", str_replace(" ", "", str_replace("_", "", $text))))) {
//invalid
}
This correctly considers the following inputs as valid:
foobarloremipsum
foobarloremipsu1m
foobarloremi psu1m
foobar._remi psu1m
So far, so good. But if I enter my name, Lajos Árpád, which contains non-English letters, then it is considered to be invalid.
Returns TRUE if every character in text is either a letter or a digit,
FALSE otherwise.
Source.
I suppose that a setting needs to be changed to allow non-English letters, but how can I use ctype_alnum to return true if and only if $text contains only letters or digits in a culture-agnostic fashion?
Alternatively, I am aware that some spooky regular expression can be used to resolve the issue, including things like \p{L} which is nice, but I am interested to know whether it is possible using ctype_alnum.
You need to use setlocale with category set to LC_CTYPE and the appropriate locale for the ctype_* family of functions to work on non-English characters.
Note that the locale that you're using with setlocale needs to actually be installed on the system, otherwise it won't work. Also note that the ctype_* functions examine one byte at a time, so this only helps with single-byte encodings, not with UTF-8 input. The best way to remedy this situation is to use a portable solution, given in this answer to a similar question.
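A portable alternative along the lines the question hints at could look like the sketch below. It assumes the input is UTF-8 and that "digits" means decimal digits in any script (\p{Nd}); the function name isValidText is made up:

```php
<?php
// Hypothetical helper: letters and decimal digits from any script,
// plus space, underscore and dot. Independent of the installed locales.
function isValidText(string $text): bool
{
    return preg_match('/^[\p{L}\p{Nd} _.]+\z/u', $text) === 1;
}

var_dump(isValidText("Lajos \u{C1}rp\u{E1}d"));  // bool(true)
var_dump(isValidText('foobar._remi psu1m'));     // bool(true)
var_dump(isValidText('foo!bar'));                // bool(false)
```

Names typed with combining accents (a base letter followed by a separate accent character) would need \p{M} added to the class.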

PHP: Match strange dash with preg_match()

I'm having a lot of trouble matching this character: –
It's called an "en dash", U+2013 (according to http://www.fileformat.info/info/unicode/char/search.htm)
It matches - in my test environment (Windows and PHP 5.2.11) but fails on the production servers (Ubuntu and PHP 5.3.2). Even \x2013 fails there.
Any suggestions on how to match this strange character? Or how to configure PHP to make it work?
You can also try the "u" flag on the expression, which makes the expression compatible with UTF-8 (see: regex pattern modifiers),
so your expression would be "/[somepattern]/u"
if (preg_match ('~\xe2\x80\x93~', $string))
{
echo "En Dash found";
}
I believe you've got an UTF-8 encoding, don't you?
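Both suggestions can be put side by side. \x2013 fails because PCRE reads it as \x20 (a space) followed by the literal text "13"; the code point has to be written as \x{2013}, which requires the u modifier. A short sketch with an illustrative sample string:

```php
<?php
$string = "pages 10\u{2013}12"; // contains an en dash (U+2013), stored as UTF-8

// Code point form: needs the u modifier.
$byCodePoint = preg_match('/\x{2013}/u', $string);

// Byte form: the three UTF-8 bytes of U+2013, works without the u modifier.
$byBytes = preg_match('~\xe2\x80\x93~', $string);

// An ASCII hyphen is a different character and does not match the en dash.
$byHyphen = preg_match('/-/', $string);

var_dump($byCodePoint, $byBytes, $byHyphen); // int(1) int(1) int(0)
```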

PHP Regex Problem:

$string1 = preg_replace('/[^A-Za-z0-9äöü!&_=\+-]/', ' ', $string4);
This Regex shouldn't replace the chars äöü.
In Ruby it worked as expected.
But in PHP it replaces also the ä ö and ü.
Can someone give me a hint how to fix it?
Set the u pattern modifier (to tell php to treat the regex as a UTF-8 string).
'/[^A-Za-z0-9äöü!&_=\+-]/u'
I think this should work:
$string1 = preg_replace('/[^A-Za-z0-9\pL!&_=\+-]/u', ' ', $string4 );
Native Unicode support was one of the features promised for PHP 6 (which was ultimately never released).
In PHP 5, use the multibyte string functions such as mb_ereg, or add the u modifier:
with '/regex/u', preg_match and preg_replace treat the pattern and the subject as UTF-8.
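A quick way to see the effect, assuming the input string is UTF-8. The umlauts are written here as \x{..} code points so the example does not depend on the encoding of the source file; the sample input is made up:

```php
<?php
$string4 = "M\u{FC}ller & S\u{F6}hne?";

// \x{E4}, \x{F6}, \x{FC} are the code points of the three umlauts.
// With /u each umlaut is one character, so it is kept as a whole.
$string1 = preg_replace('/[^A-Za-z0-9\x{E4}\x{F6}\x{FC}!&_=+-]/u', ' ', $string4);

var_dump($string1 === "M\u{FC}ller & S\u{F6}hne "); // bool(true): only "?" was replaced
```

Without the u modifier, each umlaut is seen as two separate bytes, and those bytes are not in the character class, which is why they were being replaced.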

PHP Regular Expression. Check if String contains ONLY letters

In PHP, how do I check if a String contains only letters? I want to write an if statement that will return false if there is (white space, number, symbol) or anything else other than a-z and A-Z.
My string must contain ONLY letters.
I thought I could do it this way, but I'm doing it wrong:
if( ereg("[a-zA-Z]+", $myString))
return true;
else
return false;
How do I find out if myString contains only letters?
Yeah this works fine. Thanks
if(myString.matches("^[a-zA-Z]+$"))
Never heard of ereg, but I'd guess that it will match on substrings.
In that case, you want to include anchors on either end of your regexp so as to force a match on the whole string:
"^[a-zA-Z]+$"
Also, you could simplify your function to read
return ereg("^[a-zA-Z]+$", $myString);
because the if to return true or false from what's already a boolean is redundant.
Alternatively, you could match on any character that's not a letter, and return the complement of the result:
return !ereg("[^a-zA-Z]", $myString);
Note the ^ at the beginning of the character set, which inverts it. Also note that you no longer need the + after it, as a single "bad" character will cause a match.
Finally... this advice is for Java because you have a Java tag on your question. But the $ in $myString makes it look like you're dealing with, maybe Perl or PHP? Some clarification might help.
Your code looks like PHP. It would return true if the string has a letter anywhere in it. To make sure the string has only letters you need to use the start and end anchors (^ and $).
In Java you can make use of the matches method of the String class:
boolean hasOnlyLetters(String str) {
return str.matches("^[a-zA-Z]+$");
}
In PHP the function ereg is deprecated (since PHP 5.3) and was removed in PHP 7.0. You need to use preg_match as a replacement. The PHP equivalent of the above function is:
function hasOnlyLetters($str) {
return preg_match('/^[a-z]+$/i', $str) === 1; // preg_match returns 1, 0 or false, not a boolean
}
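One detail worth knowing about the $ anchor in preg_match: by default it also matches just before a trailing newline, so "abc\n" passes the check above. A sketch of a stricter variant using \z, with a made-up function name:

```php
<?php
// Hypothetical stricter variant: \z matches only at the very end of the string.
function hasOnlyLettersStrict(string $str): bool
{
    return preg_match('/^[a-z]+\z/i', $str) === 1;
}

var_dump(preg_match('/^[a-z]+$/i', "abc\n")); // int(1): $ tolerates the final newline
var_dump(hasOnlyLettersStrict("abc\n"));      // bool(false)
var_dump(hasOnlyLettersStrict('abc'));        // bool(true)
```

The D modifier (/^[a-z]+$/iD) has the same effect if you prefer to keep $.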
I'm going to be different and use Character.isLetter definition of what is a letter.
if (myString.matches("\\p{javaLetter}*"))
Note that this matches more than just [A-Za-z]*.
A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER
Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.
The \p{javaXXX} character classes are defined in the Pattern API.
Alternatively, try checking if it contains anything other than letters: [^A-Za-z]
The easiest way to do a "is ALL characters of a given type" is to check if ANY character is NOT of the type.
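So if \W denotes a non-word character, then just check for one of those. (Keep in mind that \w also covers digits and the underscore, so for letters only, [^A-Za-z] is the class to look for.)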
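In PHP, the "look for one bad character" approach might be sketched like this (the function name is made up). The one catch is the empty string, which contains no bad character and so has to be excluded explicitly if it should not count as valid:

```php
<?php
// Hypothetical helper: valid when no character outside A-Z/a-z is found.
function containsOnlyLetters(string $str): bool
{
    // preg_match returns 0 when no non-letter is found anywhere in the string.
    return $str !== '' && preg_match('/[^A-Za-z]/', $str) === 0;
}

var_dump(containsOnlyLetters('HelloWorld'));  // bool(true)
var_dump(containsOnlyLetters('Hello World')); // bool(false)
var_dump(containsOnlyLetters(''));            // bool(false)
```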
