Regex - Greek Characters in URL - php

I have a custom router that uses regex.
The problem is that I cannot parse Greek characters.
Here are some lines from index.php:
$router->get('/theatre/plays', 'TheatreController', 'showPlays');
$router->get('/theatre/interviews', 'TheatreController', 'showInterviews');
$router->get('/theatre/[-\w\d\!\.]+', 'TheatreController', 'single_post');
Here are some lines from Router.php:
$found = 0;
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); //get the url
////// Bla Bla Bla /////////
if ( $found = preg_match("#^$value$#", $path) )
{
//Do stuff
}
Now, when I try a url like http://kourtis.app/theatre/α (notice the last character is a Greek 'alpha') then it is somehow interpreted to http://kourtis.app/theatre/%CE%B1
I can see this when I var_dump($path) or when I copy-paste the url.
I guess it has something to do with encoding but everything (I can think of) is in utf-8 format.
Any ideas?
--------------------------------
UPDATE: After the suggestions in the comments, the following works, but only with some Greek characters:
/theatre/[α-ωΑ-Ω-\w\d\!\.]+
and use urldecode to decode the percent-encoding of the $path variable.
Some characters that produce an error are: κ π ρ χ.
The question now is ... why??
(BTW, this works for many chars /theatre/.+)

You can use
$router->get('/theatre/[^/]+', 'TheatreController', 'single_post');
as [^/]+ will match one or more characters other than / since [^...] is a negated character class that matches any char but the one(s) defined in the class.
Note you do not have to use \d if you used \w (\w already matches digits).
Also, you did not match diacritics with your regex. If you need to match diacritics, add \p{M} to the regex: '/theatre/[-\w\p{M}!.]+'.
Note that to allow \w to match Unicode letters/digits, you need to pass /u modifier to the regex: $found = preg_match("#^$value$#u", $path). This will both treat input strings as Unicode strings, and make shorthand patterns like \w Unicode aware.
Another thing: you need not escape . inside a character class.
Pattern details:
#...# - regex delimiters
^ - start of string
$value - the $value variable contents (since double quoted strings in PHP allow interpolation)
$ - end of string
#u - the u modifier (after the closing # delimiter), enabling the PCRE_UTF8 and PCRE_UCP options
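As a rough sketch of how the pieces above fit together: decode the percent-encoded path first, then match it with the u modifier. The helper name route_matches is hypothetical and not part of the asker's Router.php, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: decode the percent-encoded path, then match it
// against a route pattern with the u modifier so \w is Unicode-aware.
function route_matches(string $routePattern, string $requestUri): bool
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    $path = rawurldecode($path);   // "/theatre/%CE%B1" becomes "/theatre/α"
    return preg_match("#^{$routePattern}$#u", $path) === 1;
}

$pattern = '/theatre/[-\w\p{M}!.]+';

$alpha  = route_matches($pattern, '/theatre/%CE%B1');                   // α
$kappa  = route_matches($pattern, '/theatre/%CE%BA%CF%80%CF%81%CF%87'); // κπρχ
$nested = route_matches($pattern, '/theatre/a/b');                      // extra path segment
```

Here $alpha and $kappa come out true and $nested false, since / is not in the character class.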

Related

Alphanumeric regex not working with non-roman characters

I want to only have alphanumeric characters [a-z0-9] in a string. To achieve this, I have:
$text = preg_replace("/[^[:alnum:]]/u", '', $text);
Works fine in this case:
$text = 'hello?world'; // becomes 'helloworld'
The problem is that it doesn't seem to work for other languages, for example:
$text = '日本国'; // becomes '日本国'
That should be empty!
Ideone demo. What am I doing wrong here?
To be more clear, by default [:alnum:] contains [a-zA-Z0-9] (letters and digits from the ASCII range 0-127).
But if you use the u modifier, this class is extended to all UNICODE letters and digits.
The u modifier:
changes the way the subject string (and the pattern) is read (code point by code point instead of byte by byte)
extends several* character classes to UNICODE characters (*as a counter example, the \h character class doesn't change.)
It's possible to separate these two behaviors with commands at the start of the pattern:
(*UTF) at the start of the pattern informs that the subject and the pattern have to be read as utf (utf-8 in php) encoded strings (and not byte by byte).
(*UCP) extends the character classes.
So instead of the u modifier, you can write your pattern this way, with (*UTF) alone so that [:alnum:] stays limited to ASCII:
$str = preg_replace('~(*UTF)[^[:alnum:]]+~', '', $str);
You can also choose to not use the [:alnum:] class at all and to be more explicit:
$str = preg_replace('~[^a-z0-9]+~ui', '', $str);
Since there is no predefined character class in the pattern, the (*UCP) part of the u modifier doesn't change anything.
Obviously, as noted in comments, it's also possible to ignore the fact that your subject string may contain characters out of the ASCII range, and read this string byte by byte with:
$str = preg_replace('~[^[:alnum:]]+~', '', $str);
// or
$str = preg_replace('~[^a-z0-9]+~i', '', $str);
and it will work too, but IMO it's less rigorous.
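To make the difference concrete, here is a small comparison of the variants discussed above. It is only a sketch: the variable names are mine, and it assumes the source file is saved as UTF-8 and a PCRE version that understands (*UTF).

```php
<?php
$cjk = '日本国';

// u modifier: UTF-8 reading plus Unicode-aware classes, so the CJK
// letters count as [:alnum:] and are kept.
$withU = preg_replace('/[^[:alnum:]]+/u', '', $cjk);

// (*UTF) only: UTF-8 reading, but [:alnum:] stays ASCII, so they are removed.
$utfOnly = preg_replace('~(*UTF)[^[:alnum:]]+~', '', $cjk);

// Explicit ASCII ranges: the UCP half of the u modifier has nothing to extend.
$explicit = preg_replace('~[^a-z0-9]+~ui', '', 'Hello?World' . $cjk);
```

$withU keeps the original string, $utfOnly is empty, and $explicit is reduced to HelloWorld.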

php preg_match get word with cyrillic characters

I am trying to extract a word from a string, but the word may contain Cyrillic characters, and nothing I have tried works.
Please help.
My code:
$str= "Продавец:В KrossАдын рассказать друзьям var addthis_config = {'data_track_clickback':true};";
$pattern = '/\s(\w*|.*?)\s/';
preg_match($pattern, $str, $matches);
echo $matches[0];
I need to get KrossАдын.
Thanks!
You can change the meaning of \w by using the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:
$pattern = '/\s(\w*|.*?)\s/u';
Note that the alternation in the pattern is pointless:
the second branch can match everything the first one can (everything matched by \w* can also be matched by .*? because there is a whitespace character on the right, so both subpatterns match the characters between two whitespace characters).
Writing $pattern = '/\s(.*?)\s/u'; does exactly the same, or better:
$pattern = '/\s(\S*)\s/u';
which avoids a lazy quantifier.
If your goal is only to match ASCII and Cyrillic letters, a narrower character class is likely to be the most efficient choice (smaller classes tend to be faster):
$pattern = '~(*UTF8)[a-z\p{Cyrillic}]+~i';
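(*UTF8) tells the regex engine that the subject string must be read as a UTF-8 string. Newer PHP versions built on PCRE2 may only accept the generic spelling (*UTF); the u modifier works with either.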
\p{Cyrillic} is a character class that only contains cyrillic letters.
The issue is that your string contains non-ASCII (UTF-8 encoded) characters, which \w does not match by default. Check this answer on StackOverflow for a solution: UTF-8 in PHP regular expressions
Essentially, you'll want to add the u modifier at the end of your expression, and use \p{L} instead of \w.
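Putting the suggestions above together, a minimal sketch could look like this. The string is shortened from the question, the variable names are mine, and the file is assumed to be saved as UTF-8.

```php
<?php
$str = "Продавец:В KrossАдын рассказать друзьям";

// \S* grabs the run of non-whitespace between the first two whitespace characters.
preg_match('/\s(\S*)\s/u', $str, $matches);
$word = $matches[1];   // group 1 excludes the surrounding whitespace

// With the u modifier, \w also matches Cyrillic letters.
$wordCharsOnly = preg_match('/^\w+$/u', $word) === 1;
```

Note that $matches[0] (which the question echoes) includes the whitespace on both sides; $matches[1] holds just the word.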

How can I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).
For example, if I have the following string assignment:
$str = '\u304a\u306f\u3088\u3046';
I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.
As per other Stack Overflow posts I saw for similar issues, I first attempted the following:
$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);
However, whenever I attempt to do this, I get the following PHP error:
Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u
I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.
Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.
Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).
Second, you are missing the capturing group in your original expression. preg_replace() searches the given string for matches to the supplied pattern and returns the string with each match replaced by the replacement string; $1 in the replacement refers to the text captured by the first group, so without a group there is nothing for it to insert.
The updated regular expression with proper escaping and correct capturing groups would look like:
$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
Output:
&#x304a;&#x306f;&#x3088;&#x3046;
which a browser renders as:
おはよう
Expression: \\\\u([0-9a-f]+)
\\\\ - matches a literal backslash
u - matches the literal u character
( - beginning of the capturing group
[0-9a-f]+ - character class with a + quantifier -- matches a digit (0 - 9) or a letter (from a - f) one or more times
) - end of capturing group
i modifier - used for case-insensitive matching
Replacement: &#x$1;
& - literal ampersand character (&)
# - literal pound character (#)
x - literal character x
$1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
; - literal semicolon that terminates the entity
RegExr Demo.
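For reference, a runnable sketch of this approach, plus an optional second step for when actual characters are wanted rather than entities. The {4} quantifier and the html_entity_decode() step are my additions, not part of the answer above, and the sketch assumes every escape has exactly four hex digits.

```php
<?php
$str = '\u304a\u306f\u3088\u3046';   // single quotes: the backslashes are literal

// '\\\\' in PHP source is the regex \\ , which matches one literal backslash.
$entities = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $str);

// Optional second step: turn the entities into actual UTF-8 characters.
$chars = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

$entities holds the four numeric entities and $chars the decoded UTF-8 text.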
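This page, titled Escaping Unicode Characters to HTML Entities in PHP, seems to tackle it with this nice function:
function unicode_escape_sequences($str){
    // json_encode turns non-ASCII characters into \uXXXX escapes
    $working = json_encode($str);
    // rewrite each escape as a hexadecimal HTML entity
    $working = preg_replace('/\\\u([0-9a-f]{4})/', '&#x$1;', $working);
    return json_decode($working);
}
That works by using json_encode to turn the UTF-8 characters into \uXXXX escape sequences, rewriting those as HTML entities, and decoding the result with json_decode. Very nice technique. But for your example, where the string already contains the escape sequences, this would work: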
$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);
The output is:
&#x304a;&#x306f;&#x3088;&#x3046;
Which is:
おはよう
Which translates to:
Good morning
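If actual characters are acceptable instead of entities, json_decode can also resolve the escapes directly. This is only a sketch of that idea, not something from the page quoted above: it assumes the text contains no unescaped double quotes, control characters, or backslash sequences that are invalid in JSON (json_decode returns null in that case).

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Wrapping the text in double quotes makes it a JSON string literal,
// whose \uXXXX escapes json_decode() resolves to UTF-8 characters.
$decoded = json_decode('"' . $str . '"');
```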

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words of fewer than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters, I think it is easier to encode to UTF-8 first, apply the regex with the u modifier (Unicode), and then decode again if you need the original encoding. (utf8_encode() and utf8_decode() convert between ISO-8859-1 and UTF-8 and are deprecated as of PHP 8.2; mb_convert_encoding() is the usual replacement.) So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode back to your original encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
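The steps above can be strung together roughly like this. The sketch assumes the input is already valid UTF-8 (so the encode/decode steps are skipped) and that the mbstring extension is loaded; replacing each match with an empty string is my reading of the answer, which does not state the replacement.

```php
<?php
// Assumes $kw is already valid UTF-8 and the mbstring extension is loaded.
$kw = 'Test-Tes-Te-T-Schönheit-Test';
$kw = mb_strtolower($kw, 'UTF-8');

// Drop words of one or two letters/digits together with the separator after them.
$kw = preg_replace('/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u', '', $kw);

// Collapse repeated dashes and trim them from both ends.
$kw = trim(preg_replace('/-+/', '-', $kw), '-');
```

For the test string this leaves test-tes-schönheit-test.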
Your \b word boundaries trip over the ö, because PCRE does not treat it as an alphanumeric character. By default PCRE only knows ASCII letters.
Assuming the input string is UTF-8, use the /u Unicode modifier so that non-ASCII letters are treated as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
By the way, I would use preg_replace_callback (the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0) and search for [A-Z] for the replacing instead. And use strtr for the dashes, or just [-+]+ to collapse consecutive ones.
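A short sketch of the asker's original code with only the /u modifier added. It assumes UTF-8 input; mb_strtolower() replaces strtolower() here as my own substitution. Without /u, the two bytes of ö are non-word bytes, so \b matches on both sides of them and they get replaced by a dash.

```php
<?php
$slug = mb_strtolower('Test-Tes-Te-T-Schönheit-Test', 'UTF-8');

// With /u, ö is a word character, so there is no \b inside "schönheit".
$slug = preg_replace('/\b[^-]{1,2}\b/u', '-', $slug);

// Collapse the dashes left behind and trim the ends.
$slug = trim(preg_replace('/-+/', '-', $slug), '-');
```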

Regex to strip out everything but words and numbers (and latin chars)

I'm trying to clean a post string used in an ajax request (sanitize before db query) so that it allows only alphanumeric characters, spaces (one per word, not multiple), "-", and Latin characters like "ç" and "é", without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.
$regEx = '/^[^\w\p{L}-]+$/iu';
\w - matches alphanumerics
\p{L} - matches a single Unicode code point in the 'Letter' category.
- at the end of the character class matches a single hyphen.
^ in the character classes negates the character class, so that the regex will match the opposite of the character class (anything you do not specify).
+ outside of the character class says match 1 or more characters
^ and $ outside of the character class cause the engine to accept only matches that start at the beginning of the string and run to its end.
After the pattern, the i modifier says to ignore case, and the u modifier tells the pattern-matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it does not exist in PHP (global matching depends on which matching function is called).
$string = mb_strtolower(utf8_encode($_POST['q']));
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);
Why not just use mysql_real_escape_string?
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
should do the trick. Note that
the character class is negated by putting ^ inside the character class
you need the u flag when dealing with unicode strings either in the pattern or in the subject
it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
the hyphen is escaped (\- instead of -) at the end of the character class; a trailing hyphen is already literal there, so the escape is optional but makes the intent explicit
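As a variation on the answers above, here is a sketch that uses the \p{Latin} script property instead of listing accented letters by hand. The function name sanitize_query is hypothetical, the input is assumed to be valid UTF-8 already (so no utf8_encode), and mbstring is assumed to be loaded. This only restricts the character set; it is not a substitute for parameterized queries.

```php
<?php
// Hypothetical helper: keep Latin-script letters, digits, spaces and hyphens,
// then collapse runs of spaces. Expects valid UTF-8 input.
function sanitize_query(string $q): string
{
    $q = mb_strtolower($q, 'UTF-8');
    // preg_replace() returns null on invalid UTF-8 when u is set; fall back to ''.
    $q = preg_replace('/[^\p{Latin}0-9 \-]+/u', '', $q) ?? '';
    $q = preg_replace('/ +/', ' ', $q) ?? '';
    return trim($q);
}

$clean = sanitize_query("Caf\u{e9}  Fran\u{e7}ais-2!?");
$greek = sanitize_query("abc \u{3b1}");   // the Greek alpha is not Latin script
```

With this, accented Latin letters survive while punctuation and non-Latin scripts are dropped.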
