Match some strange characters in a `\w` match - php

preg_match("/\w+/", $s, $matches);
I have the PHP code above. I use it to match words in a string. It works great, except in one case.
Example:
'This is a word' should match {'This','is','a','word'}
'Bös Tüb' should match {'Bös','Tüb'}
The first example works, but the second does not: it returns {'B','s','T','b'}, because ö and ü are not treated as word characters.
Question
How can I match ö, ü and any other characters that commonly appear in names (they can be unusual; this is about German and Turkish names)? Should I add them all manually (/[a-zA-Z and all others as unicode]/)?
EDIT
As I of course forgot to mention, there are a lot of \n, \r and ' ' characters between the words. This is why I am using a regex.

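You can use the u modifier to deal with Unicode characters. If your output has to be ISO-8859-1 rather than UTF-8, decode the matches afterwards with utf8_decode() (deprecated since PHP 8.2); otherwise use the matches as they are.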
$s = 'Bös Tüb';
preg_match("/\w+/u", $s, $matches); // use the 'u' modifier
var_dump(utf8_decode($matches[0])); // outputs: Bös
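Since preg_match() stops at the first match, collecting every word needs preg_match_all(). A minimal sketch, assuming $s is valid UTF-8 (the sample text is made up for illustration):

```php
<?php
// Assumption: $s is valid UTF-8; the sample text is illustrative only.
$s = "Bös\r\n  Tüb\nThis is a word";

// preg_match() returns only the first match; preg_match_all() collects all.
// The u modifier makes \w cover Unicode letters and digits, not just ASCII.
preg_match_all('/\w+/u', $s, $matches);
$words = $matches[0];

print_r($words);
```

The full matches end up in $matches[0], one array entry per word, regardless of how many line breaks or spaces separate them.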

If you only need to split on spaces, you can use PHP's explode() function:
$some_string = 'test some words';
$words_arr = explode(' ', $some_string);
var_dump($words_arr);
This works no matter which characters the string contains.
EDIT:
You can try:
preg_match("/\w+/u", $s, $matches);
for unicode.
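Because the question mentions \n and \r between the words, explode(' ', ...) alone would leave those in the pieces. One possible alternative is to split on any run of whitespace with preg_split(). This is only a sketch, assuming UTF-8 input and a made-up sample string:

```php
<?php
// Assumption: the input is valid UTF-8; the sample string is illustrative.
$s = "Bös \r\n  Tüb\n";

// Split on any run of whitespace; PREG_SPLIT_NO_EMPTY drops the empty
// pieces produced by leading or trailing whitespace.
$words = preg_split('/\s+/u', $s, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```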

Related

Alphanumeric regex not working with non-roman characters

I want to only have alphanumeric characters [a-z0-9] in a string. To achieve this, I have:
$text = preg_replace("/[^[:alnum:]]/u", '', $text);
Works fine in this case:
$text = 'hello?world'; // becomes 'helloworld'
The problem is that it doesn't seem to work for other languages, for example:
$text = '日本国'; // becomes '日本国'
That should be empty!
Ideone demo. What am I doing wrong here?
To be more clear, by default [:alnum:] contains [a-zA-Z0-9] (letters and digits from the ASCII range 0-127).
But if you use the u modifier, this class is extended to all UNICODE letters and digits.
The u modifier:
changes the way the subject string (and the pattern) is read (code point by code point instead of byte by byte)
extends several* character classes to UNICODE characters (*as a counter example, the \h character class doesn't change.)
It's possible to separate these two behaviors with commands at the start of the pattern:
(*UTF) at the start of the pattern informs that the subject and the pattern have to be read as utf (utf-8 in php) encoded strings (and not byte by byte).
(*UCP) extends the character classes.
So instead of the u modifier, you can write your pattern this way:
$str = preg_replace('~(*UTF)[^[:alnum:]]+~', '', $str);
You can also choose to not use the [:alnum:] class at all and to be more explicit:
$str = preg_replace('~[^a-z0-9]+~ui', '', $str);
Since there is no predefined character class in the pattern, the (*UCP) part of the u modifier doesn't change anything.
Obviously, as noted in comments, it's also possible to ignore the fact that your subject string may contain characters out of the ASCII range, and read this string byte by byte with:
$str = preg_replace('~[^[:alnum:]]+~', '', $str);
// or
$str = preg_replace('~[^a-z0-9]+~i', '', $str);
and it will work too, but IMO it's less rigorous.
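The difference between the variants above can be checked side by side. The following is a sketch rather than a definitive test: it assumes UTF-8 input, a PCRE version that understands (*UTF), and a made-up sample string.

```php
<?php
// Assumption: the source file and the subject are UTF-8; sample text is made up.
$text = '日本国 hello?world';

// u = UTF mode plus Unicode character classes: CJK letters count as alnum.
$unicodeClasses = preg_replace('~[^[:alnum:]]+~u', '', $text);

// (*UTF) only: the subject is read as UTF-8, but [:alnum:] stays ASCII.
$utfOnly = preg_replace('~(*UTF)[^[:alnum:]]+~', '', $text);

// Explicit ASCII class: the u modifier cannot widen it.
$explicit = preg_replace('~[^a-z0-9]+~ui', '', $text);

var_dump($unicodeClasses, $utfOnly, $explicit);
```

Only the first call keeps the Japanese characters; the other two reduce the string to its ASCII letters and digits.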

How do I match with regex special chars that are not alphanumeric whilst ignoring emojis?

I'm currently having a problem: I don't know how to make a regex match special characters whilst ignoring emojis.
Example, i want to match the special chars that are not emojis in this string: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Currently my regex is:
[^\x00-\x7F]+
Current output: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Wanted output: 𝓉𝑒𝓈𝓉𝒾𝓃𝑔
How would I go about fixing this?
Maybe, this expression might work:
$re = '/[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]/u';
$str = '❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️';
$subst = '';
echo preg_replace($re, $subst, $str);
Output
𝓉𝑒𝓈𝓉𝒾𝓃𝑔️
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Reference:
javascript unicode emoji regular expressions
Use the following unicode regex:
[^\p{M}\p{S}]+
\p{M} matches characters intended to be combined with another character (here ️).
\p{S} matches symbols (❤ in this case).
Demo
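In PHP the property-based pattern needs the u modifier. A minimal sketch of two ways to apply it, assuming the subject is valid UTF-8 (the variable names are illustrative):

```php
<?php
// Assumption: the subject is valid UTF-8. The u modifier makes the
// \p{...} classes apply to whole code points rather than single bytes.
$str = '❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️';

// Option 1: grab the first run of characters that are neither marks nor symbols.
preg_match('/[^\p{M}\p{S}]+/u', $str, $m);
$kept = $m[0];

// Option 2: delete marks and symbols instead of matching around them.
$stripped = preg_replace('/[\p{M}\p{S}]+/u', '', $str);

var_dump($kept, $stripped);
```

Both approaches drop the heart (a symbol) and its variation selector (a mark) and keep the mathematical script letters, which Unicode classifies as letters.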
I think that your post's title does not match its body.
There is virtually no overlap between emoji and alphanumeric characters.
There are a couple of keycap emoji, but since their sequences beyond the first digit don't overlap the alphanumerics, it's enough to put a negative lookahead in front of the alphanumeric class.
'~(?![0-9]\x{FE0F}\x{20E3}|\x{2139})[\pL\pN]+~u'
https://regex101.com/r/1JcUqY/1
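As a rough illustration of the lookahead, the sketch below runs the pattern over a made-up string containing a keycap emoji. It assumes UTF-8 input and PHP 7 or later for the "\u{...}" string escapes; in PHP the u modifier is needed for \x{...} values above 0xFF.

```php
<?php
// Assumptions: UTF-8 input; "\u{...}" escapes need PHP 7 or later.
// "1" + U+FE0F + U+20E3 is the keycap emoji for the digit one.
$str = "abc 1\u{FE0F}\u{20E3} 42";

// The negative lookahead rejects a digit that starts a keycap sequence.
$re = '~(?![0-9]\x{FE0F}\x{20E3}|\x{2139})[\pL\pN]+~u';
preg_match_all($re, $str, $m);
$tokens = $m[0];

print_r($tokens);
```

The plain digits 42 are matched, while the digit inside the keycap sequence is skipped.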

php preg_match get word with cyrillic characters

I am trying to extract a word from a string, but the word may contain Cyrillic characters, and nothing I have tried works.
Please help me.
My code
$str= "Продавец:В KrossАдын рассказать друзьям var addthis_config = {'data_track_clickback':true};";
$pattern = '/\s(\w*|.*?)\s/';
preg_match($pattern, $str, $matches);
echo $matches[0];
I need to get KrossАдын.
Thanks!
You can change the meaning of \w by using the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:
$pattern = '/\s(\w*|.*?)\s/u';
Note that the alternation in the pattern is redundant:
the second branch can match everything the first one can (everything matched by \w* can also be matched by .*? because a whitespace character follows, so both subpatterns match the characters between two whitespace characters).
Writing $pattern = '/\s(.*?)\s/u'; does exactly the same, or better:
$pattern = '/\s(\S*)\s/u';
which avoids the lazy quantifier.
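A short sketch of the simplified pattern in use, assuming the subject is valid UTF-8 (the string is a shortened copy of the one in the question). The wanted word is in capture group 1; $matches[0] would also include the surrounding whitespace.

```php
<?php
// Assumption: the subject is valid UTF-8 (shortened from the question's string).
$str = 'Продавец:В KrossАдын рассказать друзьям';

// \S* captures the token between the first two whitespace characters.
preg_match('/\s(\S*)\s/u', $str, $matches);
$word = $matches[1];

echo $word, "\n";
```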
If your goal is only to match ASCII and Cyrillic letters, the most efficient pattern (a smaller character class is faster to test) is:
$pattern = '~(*UTF)[a-z\p{Cyrillic}]+~i';
(*UTF), spelled (*UTF8) in older PCRE versions, tells the regex engine that the pattern and the subject must be read as UTF-8 strings.
\p{Cyrillic} is a character class that only contains Cyrillic characters.
The issue is that your string contains non-ASCII UTF-8 characters, which \w does not match by default. Check this answer on StackOverflow for a solution: UTF-8 in PHP regular expressions
Essentially, you'll want to add the u modifier at the end of your expression, and use \p{L} instead of \w.

PHP - replace all non-alphanumeric chars for all languages supported

Hi, I'm trying to replace all the non-alphanumeric chars in a string like this:
mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string);
The first problem is that it doesn't replace chars like "." in the string.
Second, I would like to add multibyte support for all users' languages to this method.
How can I do that?
Any help appreciated, thanks a lot.
Try the following:
preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);
When the u flag is used on a regular expression, \p{L} matches any character in any of the Unicode letter categories. (PCRE expects the short form; Perl's long alias \p{Letter} is not accepted.)
It should replace . with -; you're probably mixing up your data in the first place.
As for the multi-byte support, add the u modifier and look into PCRE Unicode properties, namely \p{L} (the letter category):
$replaced = preg_replace('~[^0-9\p{L}]+~iu', '-', $string);
The shortest way is:
$result = preg_replace('~\P{Xan}++~u', '-', $string);
\p{Xan} contains numbers and letters in all languages, thus \P{Xan} contains all that is not a letter or a number.
This expression does replace dots. For multibyte use u modifier (UTF-8).
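The sketch below compares two of the suggested patterns on a made-up UTF-8 string. One possible reason the original call left dots alone is that mb_ereg_replace() patterns are written without delimiters, so the surrounding slashes and the trailing i are taken literally; preg_replace() does expect delimiters.

```php
<?php
// Assumption: $string is valid UTF-8; the sample value is made up.
$string = 'Bös Tüb. 123!';

// Keep letters, ASCII digits and whitespace; replace everything else.
$keepSpaces = preg_replace('/[^\p{L}0-9\s]+/u', '-', $string);

// \P{Xan} is anything that is not a Unicode letter or number, so
// whitespace is replaced as well.
$alnumOnly = preg_replace('~\P{Xan}++~u', '-', $string);

var_dump($keepSpaces, $alnumOnly);
```

Which one fits depends on whether whitespace should survive the replacement.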

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters, I think it's easier to first encode to UTF-8, apply the regex with the u modifier (Unicode), and then, if you need the original encoding, decode again (according to your locale). Note that utf8_encode() assumes ISO-8859-1 input and is deprecated since PHP 8.2. So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
The word must be preceded by the start of the string (^) or by a boundary character. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Then match at most two word characters followed by a boundary character or the end of the string:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode to your locale's encoding, if you like:
echo utf8_decode($kw);
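Put together, the steps above might look like the sketch below. It assumes $kw is already valid UTF-8, so the encode and decode steps are skipped, and it replaces each short word (together with its trailing boundary) with an empty string; the $slug name is illustrative.

```php
<?php
// Assumption: $kw is already valid UTF-8, so no utf8_encode() step is needed.
$kw = strtolower('Test-Tes-Te-T-Schönheit-Test'); // lowercases the ASCII letters

// A run of 1-2 letters/digits, preceded by a boundary or the start,
// and followed by a boundary (consumed) or the end.
$re = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';

// Remove the short words, then trim a dash left over at either end.
$slug = trim(preg_replace($re, '', $kw), '-');

echo $slug, "\n";
```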
Good luck! Robert
Your \b word boundaries trip over the ö, because it's not treated as an alphanumeric character: by default PCRE only knows ASCII letters.
To treat non-English letters as letters too, use the /u Unicode modifier. It requires the input string to be valid UTF-8, so convert it first if it is Latin-1:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
I would use preg_replace_callback (or the /e modifier, which was removed in PHP 7) btw, and search for [A-Z] to do the lowercasing. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
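For reference, here is the question's code with only the u modifier added. This is a sketch assuming $kw is valid UTF-8; preg_replace() returns null when the subject is not valid UTF-8.

```php
<?php
// Assumption: $kw is valid UTF-8.
$kw = strtolower('Test-Tes-Te-T-Schönheit-Test');

// With /u, \b treats ö as a word character, so "schönheit" stays whole.
$kw = preg_replace('/\b[^-]{1,2}\b/u', '-', $kw);
$kw = preg_replace('/-+/', '-', $kw);   // collapse consecutive dashes
$kw = trim($kw, '-');

echo $kw, "\n";
```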
