Allow English characters, Chinese, Japanese - php

How can I replace only the symbols in a string with PHP, leaving digits (0-9) and English, Chinese, and Japanese characters untouched? Is there any way to do this in PHP?
I use preg_replace to allow English characters and numbers, but any Japanese, Chinese, or Russian characters that are found get deleted as well.
I tried this command too, but it still does not work:
$Data = preg_replace('/[^\p{L}\p{N}]/u', '-', $Data);

Maybe this code will help you.
<?php
$string = "年m月d日ASDFdfdfd4545$##$#$#";
$newString = preg_replace('/[^\\p{L} 0-9]/mu', "_", $string);
echo $newString;
Output:
年m月d日ASDFdfdfd4545_______
\p{L} matches any kind of letter from any language.
/u is the Unicode modifier; you need it if you want to handle Unicode characters.
Live demo: http://sandbox.onlinephpfunctions.com/code/a81db5a33e910799f995046104d38898c1203756
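The pattern from the question, /[^\p{L}\p{N}]/u, is a reasonable one for valid UTF-8 input, so one possible cause of the failure is input that is not valid UTF-8: with the u modifier, preg_replace returns null for such a subject. The following is a minimal sketch under that assumption; the helper name replaceSymbols is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper for illustration only.
// Replaces every character that is not a letter (\p{L}) or a number (\p{N}).
function replaceSymbols(string $data, string $replacement = '-'): string
{
    $result = preg_replace('/[^\p{L}\p{N}]/u', $replacement, $data);

    // With the u modifier, a subject that is not valid UTF-8 makes
    // preg_replace() return null instead of a string.
    if ($result === null) {
        throw new InvalidArgumentException(
            'Input could not be processed, preg error code ' . preg_last_error()
        );
    }

    return $result;
}

// Spaces are neither letters nor numbers, so they are replaced too.
echo replaceSymbols('年m月d日 ASDF 4545 $#'), "\n"; // 年m月d日-ASDF-4545---
```

If the exception is thrown, converting the input to UTF-8 before calling the helper is the likely fix.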

Related

Remove all special chars, but not non-Latin characters

I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. This regex, /[^a-z0-9_\s-]/, does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
// Lower case everything
$string = strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
You need the Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and the subject as UTF-8. You may also want the i flag for case-insensitivity, so that a-z also matches A-Z:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not ASCII characters, you have to use the u flag/modifier so that the regex recognizes Unicode characters.
Be sure to use mb_strtolower instead of strtolower, since you are working with Unicode characters.
Because you convert all characters to lowercase, you don't need the i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
// Lower case everything
$string = mb_strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
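Furthermore, please note that block properties such as \p{InCyrillic} and \p{InCyrillic_Supplementary}, which some other regex flavors offer, are not supported by PCRE. In PHP, use the script property \p{Cyrillic}, which matches characters of the Cyrillic script regardless of the block they are in.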
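If the URLs may contain letters from more than one script, the same function can be sketched with \p{L} and \p{N} instead of \p{Cyrillic}. This is an illustrative variant rather than the code from either answer: the name seoUrlAnyScript is hypothetical, the mbstring extension is assumed to be available, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical variant for illustration; assumes UTF-8 input and mbstring.
function seoUrlAnyScript(string $string): string
{
    // Unicode-aware lowercasing.
    $string = mb_strtolower($string, 'UTF-8');

    // Remove everything except letters, numbers, whitespace, underscore, dash.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $string = preg_replace('/[^\p{L}\p{N}\s_-]+/u', '', $string) ?? '';

    // Collapse runs of whitespace, underscores, and dashes into a single dash.
    $string = preg_replace('/[\s_-]+/u', '-', $string) ?? '';

    // Drop dashes left over at either end.
    return trim($string, '-');
}

echo seoUrlAnyScript('Привет, Мир! Hello_World'), "\n"; // привет-мир-hello-world
```

Unlike the original function, this sketch merges the last two replacement steps into one and trims leading and trailing dashes.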

How to remove repeating white-space characters from UTF8 string in PHP properly with regex?

I'm trying to remove repeating white-space characters from UTF8 string in PHP using regex.
This regex
$txt = preg_replace( '/\s+/i' , ' ', $txt );
usually works fine, but some of the strings contain the Cyrillic letter "Р", which is corrupted by the replacement.
After a little research I realized that the letter is encoded in UTF-8 as the two bytes \xD0\xA0, and since \xA0 is the non-breaking space in Latin-1 (ISO-8859-1), the regex replaces that byte with \x20 and the character is no longer valid.
Any ideas how to do this properly in PHP with regex?
Try the u modifier:
$txt="UTF 字符串 with 空格符號";
var_dump(preg_replace("/\\s+/iu","",$txt));
Outputs:
string(28) "UTF字符串with空格符號"
This is described at http://www.php.net/manual/en/function.preg-replace.php#106981
If you want to match characters from any language (European, Russian, Chinese, Japanese, Korean, or whatever), just:
use mb_internal_encoding('UTF-8');
use preg_replace('...u', '...', $string) with the u (unicode) modifier
For further information, the complete list of preg_* modifiers can be found at:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
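To tie this back to the question, here is a small sketch that collapses whitespace runs into a single space (rather than removing them, as the demo above does) while leaving the letter "Р" intact. The function name collapseWhitespace is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper for illustration; assumes UTF-8 input.
function collapseWhitespace(string $txt): string
{
    // The u modifier makes PCRE work on code points instead of bytes,
    // so the \xA0 byte inside "Р" (\xD0\xA0) is never seen on its own.
    $result = preg_replace('/\s+/u', ' ', $txt);

    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $txt;
}

echo collapseWhitespace("Р  тест \t\n строка"), "\n"; // Р тест строка
```

Returning the unchanged input on failure is a design choice for this sketch; throwing an exception would be equally reasonable.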

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
so, the German character ö is getting removed from the string
and German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters, I think it is easier to convert to UTF-8 first, apply the regex with the u (Unicode) modifier, and then, if you need the original encoding, convert back again (utf8_encode assumes ISO-8859-1 input). So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode back to your original encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
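Put together, the steps above could look roughly like the following. This is a sketch, not the answerer's exact code: it assumes the input is already UTF-8 (so the utf8_encode/utf8_decode round trip is skipped), uses mb_strtolower from the mbstring extension, and the name removeShortWords is hypothetical.

```php
<?php
// Hypothetical helper for illustration; assumes UTF-8 input and mbstring.
function removeShortWords(string $kw): string
{
    $kw = mb_strtolower($kw, 'UTF-8');

    // A word of one or two letters/numbers, preceded by a boundary or the
    // start, and followed by a boundary or the end, is removed together
    // with the boundary that follows it.
    $pattern = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';
    $kw = preg_replace($pattern, '', $kw) ?? $kw;

    // Collapse consecutive dashes and trim them from both ends.
    $kw = preg_replace('/-+/', '-', $kw);

    return trim($kw, '-');
}

echo removeShortWords('Test-Tes-Te-T-Schönheit-Test'), "\n"; // test-tes-schönheit-test
```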
Your \b word boundaries trip over the ö because, by default, PCRE does not treat it as an alphanumeric character; without the u modifier it only recognizes ASCII letters.
If the input string is UTF-8, use the /u Unicode modifier so that non-English letters such as ö are treated as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
By the way, I would use preg_replace_callback (the /e modifier was removed in PHP 7) and search for [A-Z] for the replacing instead. And use strtr for the dashes, or just [-+]+ for collapsing consecutive ones.

how can i use preg_match with alphanumeric and unicode acceptance?

I am going to build a multilingual website with PHP and need a preg_match pattern that accepts all Unicode letters and numbers.
That is, it should accept English, Spanish, and Italian letters, but not other characters such as ' " _ - and so on.
I want some thing like this :
$pattern='/^[unicode chars without \'-_;?]*$/u';
if(!preg_match($pattern, $url))
echo 'error';
The Unicode property for letters is \pL, so in preg_match:
preg_match('/^\pL+$/u', $string);
For a URL you could add numbers (\pN) and the dot:
preg_match('/^[\pL\pN.]+$/u', $string);
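As a small illustration of the letters-and-numbers check, the pattern can be wrapped in a validation function. The name isLettersAndDigits is hypothetical, and \A and \z are used here instead of ^ and $ so that a trailing newline is not accepted; that stricter anchoring is a choice made for this sketch, not something the answer requires.

```php
<?php
// Hypothetical helper for illustration; assumes UTF-8 input.
function isLettersAndDigits(string $value): bool
{
    // \A and \z anchor at the very start and very end of the string.
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/\A[\p{L}\p{N}]+\z/u', $value) === 1;
}

var_dump(isLettersAndDigits('España2024')); // bool(true)
var_dump(isLettersAndDigits("it's"));       // bool(false)
var_dump(isLettersAndDigits('a_b'));        // bool(false)
```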

how to replace all non alphanumeric characters with space in php?

$html=strip_tags($html);
$html=ereg_replace("[^A-Za-zäÄÜüÖö]"," ",$html);
$words = preg_split("/[\s,]+/", $html);
Doesn't this replace every character other than A-Z, a-z, and a, o, u with umlauts with a space?
I am losing words with umlauts, such as zugänglich.
Is there anything wrong with the regex?
Edit:
I replaced ereg_replace with preg_replace, but somehow special characters like : and ® are not being replaced by a space.
Whether your approach succeeds depends foremost on the encoding. If all umlauts got stripped, it's likely that your source text (or PHP script) was encoded as UTF-8.
In this case rather use:
$text = preg_replace('/[^\p{L}]/u', " ", $text);
This will match all letter characters, not just umlauts. And /u solves your likely charset problem.
Maybe your umlauts are still HTML entities (&auml; etc.), which contain non-alphanumeric characters that would be replaced.
BTW: alphanumeric isn't just A-Z and a-z; it includes digits as well.
the regex should be /[^A-Za-zäÄÜüÖö]+/
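Combining the suggestions from the answers above (decode entities, then use a Unicode-aware pattern), a word extractor might be sketched as follows. The function name extractWords is hypothetical, the input is assumed to be UTF-8, and numbers are kept alongside letters, which is an assumption about what the asker wants.

```php
<?php
// Hypothetical helper for illustration; assumes UTF-8 input.
function extractWords(string $html): array
{
    $text = strip_tags($html);

    // Turn entities such as &auml; into real characters first, otherwise
    // the & and ; would be replaced and the word would be split.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Replace every run of non-letters/non-numbers with a single space.
    $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text) ?? '';

    // Split on whitespace and drop empty pieces.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Yields the words: leicht, zugänglich, Marke
print_r(extractWords('<p>leicht zug&auml;nglich: Marke&reg;</p>'));
```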
