php preg_match get word with cyrillic characters - php

I am trying to extract a word from a string, but the word may contain Cyrillic characters. Nothing I have tried so far works.
Please help me.
My code:
$str= "Продавец:В KrossАдын рассказать друзьям var addthis_config = {'data_track_clickback':true};";
$pattern = '/\s(\w*|.*?)\s/';
preg_match($pattern, $str, $matches);
echo $matches[0];
I need to get KrossАдын.
Thanks!

You can change the meaning of \w by using the u modifier. With the u modifier, the string is read as a UTF-8 string, and the \w character class is no longer [a-zA-Z0-9_] but [\p{L}\p{N}_]:
$pattern = '/\s(\w*|.*?)\s/u';
Note that the alternation in the pattern is pointless:
the second branch can match everything the first one can (i.e. anything matched by \w* can also be matched by .*? because a whitespace character must follow on the right). Both subpatterns match the characters between two whitespace characters.
Writing $pattern = '/\s(.*?)\s/u'; does exactly the same, or better:
$pattern = '/\s(\S*)\s/u';
which avoids the lazy quantifier.
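A minimal sketch of the suggestion above, assuming the input is valid UTF-8. The variable $word and the shortened subject string are only for illustration. Note that the captured word is in $matches[1], while $matches[0] also includes the surrounding whitespace.

```php
<?php
// Illustrative input: the word we want sits between the first two spaces.
$word = 'KrossАдын';
$str  = "Продавец:В {$word} рассказать друзьям";

// u: treat pattern and subject as UTF-8; \S* grabs the run of non-whitespace.
$pattern = '/\s(\S*)\s/u';

if (preg_match($pattern, $str, $matches) === 1) {
    echo $matches[1], PHP_EOL; // group 1 holds the word without the spaces
}
```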
If your goal is only to match ASCII and Cyrillic letters, the most efficient pattern (for character classes, the smaller the faster) would be:
$pattern = '~(*UTF8)[a-z\p{Cyrillic}]+~i';
(*UTF8) tells the regex engine that the subject string must be read as a UTF-8 string.
\p{Cyrillic} is a character class that matches only characters from the Cyrillic script.
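To illustrate the script-based class, here is a hedged sketch that collects every run of ASCII or Cyrillic letters with preg_match_all. It uses the u modifier instead of the (*UTF8) verb, and the sample string is made up for the example.

```php
<?php
// Sample input (made up): separators and a number that should be skipped.
$str = 'Продавец:В KrossАдын 42';

// [a-z] with the i flag covers ASCII letters, \p{Cyrillic} covers the Cyrillic script.
preg_match_all('~[a-z\p{Cyrillic}]+~iu', $str, $matches);

$found = $matches[0]; // the runs before ':', after ':', and the mixed-script word
print_r($found);
```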

The issue is that your string contains multibyte UTF-8 characters, which \w does not match by default. Check this answer on StackOverflow for a solution: UTF-8 in PHP regular expressions
Essentially, you'll want to add the u modifier at the end of your expression, and use \p{L} instead of \w.

Related

How do I match special chars that are not alphanumeric with regex whilst ignoring emojis?

I'm currently having a problem: I don't know how to make a regex match special characters while ignoring emojis.
For example, I want to match the special characters that are not emojis in this string: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Currently my regex is:
[^\x00-\x7F]+
Current output: ❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️
Wanted output: 𝓉𝑒𝓈𝓉𝒾𝓃𝑔
How would I go about fixing this?
Maybe, this expression might work:
$re = '/[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]/u';
$str = '❤️𝓉𝑒𝓈𝓉𝒾𝓃𝑔❤️';
$subst = '';
echo preg_replace($re, $subst, $str);
Output
𝓉𝑒𝓈𝓉𝒾𝓃𝑔️
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Reference:
javascript unicode emoji regular expressions
Use the following Unicode regex (with the u modifier):
[^\p{M}\p{S}]+
\p{M} matches characters intended to be combined with another character (here ️).
\p{S} matches symbols (❤ in this case).
Demo
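As a small check of this class, the sketch below builds a comparable string from code point escapes (PHP 7+ syntax), so the invisible variation selector is explicit. The shortened sample text is an assumption, not the exact string from the question.

```php
<?php
// U+2764 HEAVY BLACK HEART (a symbol) + U+FE0F VARIATION SELECTOR-16 (a mark).
$heart = "\u{2764}\u{FE0F}";
// Two mathematical script letters (U+1D4C9, U+1D4C8); both are letters.
$fancy = "\u{1D4C9}\u{1D4C8}";
$str   = $heart . $fancy . $heart;

// Match the first run of characters that are neither marks nor symbols.
preg_match('/[^\p{M}\p{S}]+/u', $str, $matches);

echo $matches[0], PHP_EOL;
```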
I think that your post's title does not match its body.
There is virtually no overlap between emoji and alphanumeric characters. There are a couple of keycap emoji, but since their sequences do not overlap the alphanumerics beyond the first digit, it's enough to put a negative lookahead in front of the alphanumeric class (the u modifier is required for \x{...} values above 0xFF):
'~(?![0-9]\x{FE0F}\x{20E3}|\x{2139})[\pL\pN]+~u'
https://regex101.com/r/1JcUqY/1

Regex - Greek Characters in URL

I have a custom router that uses regex.
The problem is that I cannot parse Greek characters.
Here are some lines from index.php:
$router->get('/theatre/plays', 'TheatreController', 'showPlays');
$router->get('/theatre/interviews', 'TheatreController', 'showInterviews');
$router->get('/theatre/[-\w\d\!\.]+', 'TheatreController', 'single_post');
Here are some lines from Router.php:
$found = 0;
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); //get the url
////// Bla Bla Bla /////////
if ( $found = preg_match("#^$value$#", $path) )
{
//Do stuff
}
Now, when I try a URL like http://kourtis.app/theatre/α (notice the last character is a Greek 'alpha'), it is somehow turned into http://kourtis.app/theatre/%CE%B1
I can see this when I var_dump($path) or when I copy-paste the url.
I guess it has something to do with encoding but everything (I can think of) is in utf-8 format.
Any ideas?
--------------------------------
UPDATE: After the suggestions in the comments, the following works, but only with some Greek characters:
/theatre/[α-ωΑ-Ω-\w\d\!\.]+
together with urldecode to decode the percent-encoding of the $path variable.
Some characters that produce an error are: κ π ρ χ.
The question now is ... why??
(BTW, this works for many chars /theatre/.+)
You can use
$router->get('/theatre/[^/]+', 'TheatreController', 'single_post');
as [^/]+ will match one or more characters other than / since [^...] is a negated character class that matches any char but the one(s) defined in the class.
Note you do not have to use \d if you used \w (\w already matches digits).
Also, you did not match diacritics with your regex. If you need to match diacritics, add \p{M} to the regex: '/theatre/[-\w\p{M}!.]+'.
Note that to allow \w to match Unicode letters/digits, you need to pass the u modifier to the regex: $found = preg_match("#^$value$#u", $path). This will both treat input strings as Unicode strings, and make shorthand patterns like \w Unicode-aware.
Another thing: you need not escape . inside a character class.
Pattern details:
#...# - regex delimiters
^ - start of string
$value - the $value variable contents (since double quoted strings in PHP allow interpolation)
$ - end of string
#u - the closing delimiter followed by the u modifier, which enables the PCRE_UTF8 and PCRE_UCP options. See more info about them here
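Putting these points together, a rough sketch of the matching step might look like this. The helper name routeMatches is hypothetical and not part of the router from the question. It assumes the request path arrives percent-encoded and is valid UTF-8 once decoded.

```php
<?php
// Hypothetical helper: decode the path, then match it against a route pattern.
function routeMatches(string $route, string $requestUri): bool
{
    // parse_url() leaves %CE%B1 as-is, so decode it before matching.
    $path = urldecode(parse_url($requestUri, PHP_URL_PATH));

    // u makes \w Unicode-aware; the route pattern is trusted code, not user input.
    return preg_match("#^$route$#u", $path) === 1;
}

$route = '/theatre/[-\w\p{M}!.]+';
var_dump(routeMatches($route, '/theatre/%CE%B1'));       // path ends in alpha
var_dump(routeMatches($route, '/theatre/%CE%BA%CF%80')); // path ends in kappa, pi
```

Decoding before matching keeps the route patterns readable, since they can be written in terms of letters rather than percent-escapes.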

How to remove special characters and keep letters of any language in PHP?

I know this should remove any characters from string and keep only numbers and ENGLISH letters.
$txtafter = preg_replace("/[^a-zA-Z 0-9]+/","",$txtbefore);
but I wish to remove any special characters and keep any letter of any language like Arabic or Japanese.
Probably this will work for you:
$repl = preg_replace('/[^\w\s]+/u','' ,$txtbefore);
This will remove all non-word and non-space characters from your text. /u flag is there for unicode support.
You can use the \p{L} pattern to match any letter and \p{N} to match any numeric character. You should also use the u modifier, like this: /\p{L}+/u
Your final regex may look like: /[^\p{L}\p{N}]/u
Also be sure to check this question:
Regular expression \p{L} and \p{N}
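A short sketch of the difference between the two suggestions, using a made-up sample string: the \w-based pattern keeps whitespace and underscores, while the \p{L}\p{N} pattern strips everything except letters and digits.

```php
<?php
// Made-up sample mixing Latin, Cyrillic, and Japanese text with punctuation.
$txtbefore = 'Hello, мир_1! 日本語?';

// Keeps word characters (letters, digits, underscore) and whitespace.
$keepSpaces = preg_replace('/[^\w\s]+/u', '', $txtbefore);

// Keeps only letters and digits; spaces and underscores are removed too.
$lettersOnly = preg_replace('/[^\p{L}\p{N}]/u', '', $txtbefore);

echo $keepSpaces, PHP_EOL, $lettersOnly, PHP_EOL;
```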

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters in a regex, I think it is easier to first encode to UTF-8 and then, after applying the regex with the u modifier (Unicode), decode back to the original encoding if you need it (according to your locale). So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the Unicode features of the regex engine. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers to be word characters (up to you), your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode back to your local encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
Your \b word boundaries trip over the ö, because it is not treated as an alphanumeric character: by default PCRE only recognizes ASCII letters.
The input string is in UTF-8/Latin-1. To have non-English letters treated as letters, use the /u Unicode modifier:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
By the way, I would use preg_replace_callback or /e (note that /e was removed in PHP 7) and search for [A-Z] instead when replacing, and strtr for the dashes or just [-+]+ to collapse consecutive ones.
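For reference, here is the question's code with only the u modifier added, as a minimal sketch. It assumes the input is already valid UTF-8 and writes the ö as a code point escape (PHP 7+) so the encoding of the sample is unambiguous.

```php
<?php
$kw = "Test-Tes-Te-T-Sch\u{F6}nheit-Test"; // ö written as U+00F6

// strtolower() is enough for the ASCII letters in this sample;
// mb_strtolower() would be needed to lowercase non-ASCII capitals.
$kw = strtolower($kw);

// With u, ö counts as a word character, so \b no longer splits "schönheit".
$kw = preg_replace('/\b[^-]{1,2}\b/u', '-', $kw);
$kw = preg_replace('/-+/', '-', $kw); // collapse runs of dashes
$kw = trim($kw, '-');

echo $kw, PHP_EOL;
```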

preg_replace in PHP - regular expression for NOT condition

I am trying to write a function in PHP using preg_replace where it will replace all those characters which are NOT found in list. Normally we replace where they are found but this one is different.
For example if I have the string:
$mystring = "ab2c4d";
I can write the following function which will replace all numbers with *:
preg_replace("/(\d+)/","*",$mystring);
But I want to replace those characters which are neither digits nor letters from a to z. They could be anything like #$*();~!{}[]|\/.,<>?' etc.
So anything other than digits and letters should be replaced by something else. How do I do that?
Thanks
You can use a negated character class (using ^ at the beginning of the class):
/[^\da-z]+/i
Update: I mean, you have to use a negated character class and you can use the one I provided but there are others as well ;)
Try
preg_replace("/([^a-zA-Z0-9]+)/","*",$mystring);
You want to use a negated "character class". The syntax for them is [^...]. In your case just [^\w] I think.
\W matches a non-alpha, non-digit character. The underscore _ is included in the list of alphanumerics, so it also won't match here.
preg_replace("/\W/", "something else", $mystring);
should do if you can live with the underscore not being replaced. If you can't, use
preg_replace("/[\W_]/", "something else", $mystring);
The \d, \w and similar in regex all have negative versions, which are simply the upper-case version of the same letter.
So \w matches any word character (ie basically alpha-numerics), and therefore \W matches anything except a word character, so anything other than an alpha-numeric.
This sounds like what you're after.
For more info, I recommend regular-expressions.info.
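The variants suggested above behave differently on underscores and on runs of characters. A small comparison, using a made-up input string:

```php
<?php
$mystring = 'ab2c4d#$_x'; // made-up sample with two symbols and an underscore

// Negated class with +: the whole run "#$_" becomes a single "*".
$negated = preg_replace('/[^\da-z]+/i', '*', $mystring);

// \W without +: each non-word character is replaced; "_" is a word character.
$nonWord = preg_replace('/\W/', '*', $mystring);

// [\W_]: like \W, but the underscore is replaced as well.
$nonWordOrUnderscore = preg_replace('/[\W_]/', '*', $mystring);

echo $negated, PHP_EOL, $nonWord, PHP_EOL, $nonWordOrUnderscore, PHP_EOL;
```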
Since PHP 5.1.0 you can use \p{L} (Unicode letters) and \p{N} (Unicode numbers), which are the Unicode counterparts of the Latin-only \w and \d:
preg_replace("/[^\p{L}\p{N}]/iu", $replacement_string, $original_string);
/iu modifiers at the end of pattern:
i (PCRE_CASELESS)
u (PCRE_UTF8)
see more at: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
