$html=strip_tags($html);
$html=ereg_replace("[^A-Za-zäÄÜüÖö]"," ",$html);
$words = preg_split("/[\s,]+/", $html);
Doesn't this replace every character other than A-Z, a-z, and a, o, u with umlauts with a space?
I am losing words with umlauts, such as zugänglich.
Is there anything wrong with the regex?
edit:
I replaced ereg_replace with preg_replace, but somehow special characters like : and ® are not getting replaced by a space...
Whether your approach succeeds depends foremost on the encoding. If all umlauts got stripped, it's likely that your source text (or PHP script) was encoded as UTF-8.
In this case rather use:
$text = preg_replace('/[^\p{L}]/u', " ", $text);
This keeps all letter characters, not just umlauts, and replaces everything else with a space. The /u modifier solves your likely charset problem.
Maybe your umlauts are still HTML entities (&auml; etc.), which contain non-alphanumeric characters that would be deleted...
BTW: alphanumeric isn't just a-z and A-Z, it includes digits as well...
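As a rough sketch of how this fits into the original word-splitting code, assuming both the script file and the input are UTF-8 (the sample HTML is made up for illustration):

```php
<?php
// Assumption: $html is UTF-8 encoded; the sample text is invented.
$html = '<p>Die Seite ist öffentlich zugänglich: Preis 5€, ® 2011</p>';

$text = strip_tags($html);
// Replace every run of non-letter characters with a single space.
$text = preg_replace('/[^\p{L}]+/u', ' ', $text);
// Split on whitespace and drop empty strings at the edges.
$words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

With this input the words containing umlauts survive intact, while the digits, €, ® and punctuation are dropped.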
the regex should be /[^A-Za-zäÄÜüÖö]+/
Related
I want to use preg_replace to remove all unicode characters including Persian characters from a string and keep English and all special characters. The way I know to do it is :
preg_replace('/[^<>()\/\* a-zA-Z0-9_.-]/u', '', $string);
But, I don't really want to include all special characters inside []. Is there any shorter way?!
To remove everything but characters falling in the basic ASCII range, you may use a pattern similar to this to match the range by HEX codes.
// Given a string with characters in and outside ASCII:
$s = "abcde啅cde衸xtzሴbb()*&bԴ";
// Match HEX 00-7F and remove characters outside that
// by inverting with ^
echo preg_replace('/[^\x00-\x7f]/', '', $s);
// Prints:
// abcdecdextzbb()*&b
Using HEX 00-7F will also include the start of the ASCII range, therefore covering things like NUL, terminal bell, backspace, etc. You may consider starting with ASCII 32 (hex 20) at SPACE if you don't want your output to include those special non-printable control characters.
echo preg_replace('/[^\x20-\x7f]/', '', $s);
I'm trying to remove repeating white-space characters from UTF8 string in PHP using regex.
This regex
$txt = preg_replace( '/\s+/i' , ' ', $txt );
usually works fine, but some of the strings have Cyrillic letter "Р", which is screwed after the replacement.
After a little research I realized that the letter is encoded in UTF-8 as the two bytes \xD0 \xA0, and since \xA0 is the non-breaking space in Latin-1, the regex replaces it with \x20 and the character is no longer valid.
Any ideas how to do this properly in PHP with regex?
Try the u modifier:
$txt="UTF 字符串 with 空格符號";
var_dump(preg_replace("/\\s+/iu","",$txt));
Outputs:
string(28) "UTF字符串with空格符號"
It is described at http://www.php.net/manual/en/function.preg-replace.php#106981
If you want to match characters from any script, whether European, Russian, Chinese, Japanese, Korean, or whatever, just:
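Applied to the original goal of collapsing whitespace to a single space, a minimal sketch might look like this. It assumes the script file and the input are UTF-8; the sample string is illustrative and contains the letter "Р" from the question.

```php
<?php
// Assumption: UTF-8 input. "Р" (U+0420) is stored as the bytes \xD0 \xA0.
$txt = "Россия \t\n  Ра   да";

// With /u the pattern works on code points, so the \xA0 byte inside "Р" is left alone.
$clean = preg_replace('/\s+/u', ' ', $txt);

var_dump($clean);
// An empty pattern with /u matches only if the subject is valid UTF-8.
var_dump(preg_match('//u', $clean) === 1);
```

The second var_dump is just a cheap way to confirm that the result is still valid UTF-8.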
use mb_internal_encoding('UTF-8');
use preg_replace('...u', '...', $string) with the u (unicode) modifier
For further information, the complete list of preg_* modifiers can be found at:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove words shorter than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
So the German character ö is being removed from the string,
and the German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts or other non-ASCII characters and regex, I think it's easier to first encode to UTF-8, then apply the regex with the u modifier (Unicode), and afterwards, if you need the original encoding, decode again (according to your locale). So you would start with:
$kw = utf8_encode(strtolower($kw));
Now you can use the regex-unicode functionalities. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers as word-characters (up to you) your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
You want the word, if there is a start ^ or boundary before. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Replace max 2 "word-characters" followed by a boundary or the end:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode back to your local encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
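Putting the pieces of the answer above together, a sketch could look like the following. It assumes $kw is already UTF-8, so the utf8_encode()/utf8_decode() steps are skipped, and it uses an empty replacement string, which the answer does not specify.

```php
<?php
// Assumption: $kw is already UTF-8 (no utf8_encode()/utf8_decode() needed).
$kw = 'Test-Tes-Te-T-Schönheit-Test';

// strtolower() only touches ASCII letters in PHP 8; that is enough for this sample.
$kw = strtolower($kw);
// Drop words of one or two letters/digits together with the separator after them.
$kw = preg_replace('/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u', '', $kw);
// Collapse repeated dashes and trim them from both ends.
$kw = trim(preg_replace('/-+/', '-', $kw), '-');

echo $kw, "\n"; // test-tes-schönheit-test
```

For input with uppercase non-ASCII letters, mb_strtolower() from the mbstring extension would be the safer choice.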
Your \b word boundaries trip over the ö, because without the Unicode modifier it is not treated as an alphanumeric character. By default PCRE works on ASCII letters.
The input string is in UTF-8/Latin-1. To treat non-English letters as word characters, use the /u Unicode modifier (the input must then be valid UTF-8):
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
I would use preg_replace_callback or /e btw (note that /e was deprecated in PHP 5.5 and removed in PHP 7.0), and instead search for [A-Z] for replacing. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
I have a set of characters like
., !, ?, ;, (space)
and a string, which may or may not be UTF-8 (any language).
Is there an easy way to find out if the string contains one of the characters above?
For example:
这是一个在中国的字符串。
which translates to
This is a string in Chinese.
The dot character looks different in the first string. Is that a totally different character, or the UTF-8 equivalent of the dot?
Or maybe there's a list somewhere with Unicode punctuation character codes?
In Unicode there are character properties (see the PHP docs), for example Symbols, Letters and the like. You can search for any character of a specific class with preg_match and the u modifier.
echo preg_match('/\pP$/u', $str);
However, your string needs to be UTF-8 to do that.
You can test this on your own, I created a little script that tests for all properties via preg_match:
Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).
Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
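The script itself is not reproduced here. A minimal reconstruction that yields output of this shape might look like the sketch below; the helper name lastCharProperties and the selection of properties are assumptions for illustration, not the original author's code.

```php
<?php
// Hypothetical helper: reports which of a few Unicode properties the last character has.
// The property list is an arbitrary subset chosen for illustration.
function lastCharProperties(string $str): array
{
    $properties = [
        'L'  => 'Letter',
        'N'  => 'Number',
        'P'  => 'Punctuation',
        'Po' => 'Other punctuation',
        'Ps' => 'Open punctuation',
        'S'  => 'Symbol',
        'Z'  => 'Separator',
    ];
    $found = [];
    foreach ($properties as $code => $name) {
        // \z anchors the property test to the very last character of the string.
        if (preg_match('/\p{' . $code . '}\z/u', $str) === 1) {
            $found[$code] = $name;
        }
    }
    return $found;
}

foreach (['Test.', '这是一个在中国的字符串。'] as $s) {
    echo "Looking for properties of last character in \"$s\":\n";
    foreach (lastCharProperties($s) as $code => $name) {
        echo "Found $name ($code).\n";
    }
}
```

Both strings must be valid UTF-8, otherwise preg_match returns false under the u modifier and nothing is reported.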
Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.
Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character than . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:
preg_match('/[.!?;。]/u', $str, $match)
This will return either 0 or 1 and $match will contain the matched character. With this it’s important that your string in $str is properly encoded in UTF-8.
If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:
/\p{P}/u
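Both variants can be tried side by side. This is only a sketch: firstPunctuation is a hypothetical helper name, and the space from the question's character list has been added to the explicit class.

```php
<?php
// Hypothetical helper: returns the first listed punctuation character, or null.
function firstPunctuation(string $str): ?string
{
    // Explicit list: ASCII . ! ? ; space, plus the ideographic full stop U+3002.
    if (preg_match('/[.!?; 。]/u', $str, $match) === 1) {
        return $match[0];
    }
    return null;
}

var_dump(firstPunctuation('这是一个在中国的字符串。')); // string(3) "。"
var_dump(firstPunctuation('abc'));                      // NULL

// Any Unicode punctuation character, without listing them one by one.
var_dump(preg_match('/\p{P}/u', 'NoPunctuationHere'));  // int(0)
```

The explicit list is stricter, while \p{P} also accepts characters such as quotes, brackets and dashes.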
You are not trying to transliterate, you are trying to translate!
UTF-8 is not a language; it is an encoding of Unicode, a character set that supports (virtually) all languages known in the world.
Transliteration is something like this:
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "à è ò ù");
That does not work with your Chinese example.
Is there a simple regex that will catch all non-English characters? It would need to allow common punctuation and symbols, but no special characters such as Russian, Japanese, etc.
Looking for something to work in PHP.
Since in your comment you're referring to addresses, they might contain digits too. So:
preg_replace('/[^[:alpha:][:punct:][:digit:]]/u', '', utf8_encode($input));
This should remove your unwanted characters. The [:alpha:] class will only work if your locale is set up correctly, though. If, for example, it's set to de_DE, not only "a" through "z" are regarded as letters, but also "exotics" like "ä", "ö", "è", and the like.
Also, since you don't want "Russian, Japanese, etc.", note the u modifier. The input has to be UTF-8 encoded in order to not break it and give you wrong results.
Such as this one [^A-Za-z0-9\,\.\-]?
This q/a seemed to handle it: PHP Validate string characters are UK or US Keyboard characters
Use hex codes. For example, this cleans out all non-ASCII characters as well as line endings and replaces them with spaces. Space (\x20) is deliberately left out of the range so that consecutive runs of spaces and/or special characters are replaced with a single space.
$clean = trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));
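A quick check of what this does to mixed input (the sample string is made up; it contains a CRLF, a tab, and one non-ASCII letter):

```php
<?php
// Assumption: the input is UTF-8, so "ö" is two bytes, both outside \x21-\x7E.
$input = "Hello,\r\n  wörld!\tOK ";

$clean = trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));

var_dump($clean); // string(16) "Hello, w rld! OK"
```

Note that non-ASCII letters are not transliterated: the "ö" simply becomes a space, splitting the word.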
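// utf8_decode() shrinks each multi-byte character to one byte, so equal lengths mean pure ASCII.
if (strlen($str) == strlen(utf8_decode($str))) {
    // $str contains only single-byte (ASCII) characters
}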