I'm using this function to clean strings for elastic search:
function cleanString($string) {
    $string = mb_convert_encoding($string, "UTF-8");
    $string = str_ireplace(array('<', '>'), array(' <', '> '), $string);
    $string = strip_tags($string);
    $string = filter_var($string, FILTER_SANITIZE_STRING);
    $string = str_ireplace(array("\t", "\n", "\r", " ", " ", ":"), ' ', $string);
    $string = str_ireplace(array("", "«", "»", "£"), '', $string);
    return trim($string, ",;.:-_*+~#'\"´`!§$%&/()=?«»");
}
It does all sorts of stuff, but the problem I am facing has to do with the trim call at the very end. It is supposed to trim away whitespace and special characters, and it worked fine until recently, when I added two more special characters to trim from the string: « and ». This caused problems with an entirely different character:
When I pass the word België into the function, the ë gets corrupted and Elasticsearch throws an error.
Why does trim corrupt a completely different character? How can I fix that, so that I strip out « and » but preserve ë?
trim is not encoding-aware and just looks at individual bytes. If you tell it to trim '«»', and that string is encoded in UTF-8, it will look for the bytes C2 AB C2 BB (the second C2 is redundant, so AB, BB and C2 are the actual bytes being stripped). "ë" in UTF-8 is C3 AB, so half of it gets removed and the character is thereby broken.
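You can see that byte-level behaviour directly (a minimal sketch, assuming a UTF-8 source file):
var_dump(trim("België", "«»"));
// string(6) – the trailing AB byte of ë (C3 AB) is stripped, leaving an orphaned C3 byte after "Belgi"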
You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:
preg_replace('/^[«»]+|[«»]+$/u', '', $str)
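If you want to keep the full character list from the original trim call, one option (a sketch, assuming the source file itself is UTF-8) is to build that list into a Unicode-aware pattern with preg_quote. \s is added here because the question mentions trimming whitespace, which the custom trim list above does not actually cover:
$chars = preg_quote(',;.:-_*+~#\'"´`!§$%&/()=?«»', '/');
$string = preg_replace('/^[' . $chars . '\s]+|[' . $chars . '\s]+$/u', '', $string);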
So I'm trying to generate slugs to store in my DB. My locales include English, some European languages and Japanese.
I allow \d and \w; European characters are transliterated, and Japanese characters are left untouched. Period, plus and dash (-) are kept. Leading/trailing whitespace is removed, while the whitespace in between is replaced by a dash.
Here is some code (please feel free to improve it given my conditions above, as my regex-fu is currently white-belt tier):
function ToSlug($string, $separator = '-') {
    $url = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $url = preg_replace('/[^\d\w一-龠ぁ-ゔァ-ヴー々〆〤.+ -]/', '', $url);
    $url = strtolower($url);
    $url = preg_replace('/[ ' . $separator . ']+/', $separator, $url);
    return $url;
}
I'm testing this function, however my Japanese characters are not getting through; they are simply replaced by ''. While I suspect it's the //IGNORE that's taking them out, I need it there or else my German and French transliterations will not work. Any ideas on how I can fix this?
EDIT: I'm not sure if Japanese kanji covers all of Simplified Chinese, but I'm going to need that and Korean as well. If anyone knows the regex off the bat, please let me know; it will save me some time searching. Thanks.
Note: I am not familiar with the Japanese writing system.
Looking at the function, the iconv call appears to remove all the Japanese characters. Instead of using iconv to transliterate, it may be easier to just create a function that does it:
function _toSlugTransliterate($string) {
// Lowercase equivalents found at:
// https://github.com/kohana/core/blob/3.3/master/utf8/transliterate_to_ascii.php
$lower = [
'à'=>'a','ô'=>'o','ď'=>'d','ḟ'=>'f','ë'=>'e','š'=>'s','ơ'=>'o',
'ß'=>'ss','ă'=>'a','ř'=>'r','ț'=>'t','ň'=>'n','ā'=>'a','ķ'=>'k',
'ŝ'=>'s','ỳ'=>'y','ņ'=>'n','ĺ'=>'l','ħ'=>'h','ṗ'=>'p','ó'=>'o',
'ú'=>'u','ě'=>'e','é'=>'e','ç'=>'c','ẁ'=>'w','ċ'=>'c','õ'=>'o',
'ṡ'=>'s','ø'=>'o','ģ'=>'g','ŧ'=>'t','ș'=>'s','ė'=>'e','ĉ'=>'c',
'ś'=>'s','î'=>'i','ű'=>'u','ć'=>'c','ę'=>'e','ŵ'=>'w','ṫ'=>'t',
'ū'=>'u','č'=>'c','ö'=>'o','è'=>'e','ŷ'=>'y','ą'=>'a','ł'=>'l',
'ų'=>'u','ů'=>'u','ş'=>'s','ğ'=>'g','ļ'=>'l','ƒ'=>'f','ž'=>'z',
'ẃ'=>'w','ḃ'=>'b','å'=>'a','ì'=>'i','ï'=>'i','ḋ'=>'d','ť'=>'t',
'ŗ'=>'r','ä'=>'a','í'=>'i','ŕ'=>'r','ê'=>'e','ü'=>'u','ò'=>'o',
'ē'=>'e','ñ'=>'n','ń'=>'n','ĥ'=>'h','ĝ'=>'g','đ'=>'d','ĵ'=>'j',
'ÿ'=>'y','ũ'=>'u','ŭ'=>'u','ư'=>'u','ţ'=>'t','ý'=>'y','ő'=>'o',
'â'=>'a','ľ'=>'l','ẅ'=>'w','ż'=>'z','ī'=>'i','ã'=>'a','ġ'=>'g',
'ṁ'=>'m','ō'=>'o','ĩ'=>'i','ù'=>'u','į'=>'i','ź'=>'z','á'=>'a',
'û'=>'u','þ'=>'th','ð'=>'dh','æ'=>'ae','µ'=>'u','ĕ'=>'e','ı'=>'i',
];
return str_replace(array_keys($lower), array_values($lower), $string);
}
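For example, a quick hypothetical check of the mapping:
echo _toSlugTransliterate('Französische Straße');   // Franzosische Strasse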
So, with some modifications, it could look something like this:
function toSlug($string, $separator = '-') {
    // Work around this...
    #$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $string = _toSlugTransliterate($string);
    // Remove unwanted chars + trim excess whitespace
    // I got the character ranges from the following URL:
    // https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters#10508813
    $regex = '/[^一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤.+ -]|^\s+|\s+$/u';
    $string = preg_replace($regex, '', $string);
    // Using the mb_* version seems safer for some reason
    $string = mb_strtolower($string);
    // Same as before
    $string = preg_replace("/[ {$separator}]+/", $separator, $string);
    return $string;
}
$x = ' æøå!this.ís-a test-ゔヴ ーァ ';
echo toSlug($x);
In regex you can use Unicode "scripts" to match letters of various languages. There is no "Japanese" one, but there are Hiragana, Katakana and Han. As I have no idea how Japanese is written, or how one could combine these, I am not even going to try.
Using these scripts, however, would look something like this:
'/[\p{Hiragana}\p{Katakana}\p{Han}]+/'
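Plugging those script properties into the slug regex from the answer above could look something like this (a hedged sketch; it assumes PCRE was built with Unicode property support and keeps the rest of that answer's character list):
$regex = '/[^\p{Hiragana}\p{Katakana}\p{Han}a-zA-Z0-9々〆〤.+ -]|^\s+|\s+$/u';
$string = preg_replace($regex, '', $string);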
This question already has answers here:
Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux
I am trying to create a regular expression for any given string.
Goal: remove ALL characters which are not "Latin", "lowercase Greek" or "numbers".
What I have done so far: [^a-z0-9]
This works perfectly for Latin characters.
When I try this: [^a-z0-9α-ω], no luck. It works, BUT it leaves behind other symbols like !!#$%#%#$#,`
My knowledge is limited when it comes to regexp. Any help would be much appreciated!
EDIT:
Posted below is the function that matches the specified characters and creates a slug out of the string, with a dash as the separator character:
$q_separator = preg_quote('-');
$trans = array(
    '&.+?;'                   => '',
    '[^a-z0-9 -]'             => '',
    '\s+'                     => $separator,
    '(' . $q_separator . ')+' => $separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val) {
    $str = preg_replace("#" . $key . "#i", $val, $str);
}
if ($lowercase === TRUE) {
    $str = strtolower($str);
}
return trim($str, '-');
So if the string is: OnCE upon a tIME !#% #$$ in MEXIco
Using the function the output will be: once-upon-a-time-in-mexico
This works fine, but I want the pattern also to exclude Greek characters.
Ok, can this replace your function?
$subject = 'OnCEΨΩ é-+#àupon</span> aαθ tIME !#%#$ in MEXIco in the year 1874 <or 1875';
function format($str, $excludeRE = '/[^a-z0-9]+/u', $separator = '-') {
    $str = strip_tags($str);
    $str = strtolower($str);
    $str = preg_replace($excludeRE, $separator, $str);
    $str = trim($str, $separator);
    return $str;
}
echo format($subject);
Note that you will lose all characters after a < (because of strip_tags) until you meet a >.
// Old answer, from when I thought you wanted to preserve Greek characters
It's possible to build a character range such as α-ω or any strange characters you want! The reason your pattern doesn't work is that you don't inform the regex engine that you are dealing with a Unicode string. To do that, you must add the u modifier at the end of the pattern, like this:
/[^a-z0-9α-ω]+/u
You can use the characters' hexadecimal codes too:
/[^a-z0-9\x{3B1}-\x{3C9}]+/u
Note that if you are sure there are no uppercase Greek characters in your string, or if you want to preserve them as well, you can use the character class \p{Greek} like this:
/[^a-z0-9\p{Greek}]+/u
(It's a little longer but more explicit)
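To illustrate (a hypothetical example, not part of the original answer; it assumes mbstring's internal encoding is UTF-8), running the question's sample sentence plus a couple of Greek letters through that pattern:
$str = mb_strtolower('OnCE upon a tIME ΨΩ !#% in MEXIco');
echo trim(preg_replace('/[^a-z0-9\p{Greek}]+/u', '-', $str), '-');
// once-upon-a-time-ψω-in-mexico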
There's already an answered question about this:
Remove Non English Characters PHP
You can't specify a range such as α-ω; you need to use their codes instead, e.g. \00-\255.
I've been reading up on a few solutions but have not managed to get anything to work as yet.
I have a JSON string that I read in from an API call and it contains Unicode characters - \u00c2\u00a3 for example is the £ symbol.
I'd like to use PHP to convert these into either the HTML entity (&pound;) or the literal £ character.
I've been looking into the problem and found the following code (using my pound symbol to test), but it didn't seem to work:
$title = preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", '\u00c2\u00a3');
The output is Â£.
Am I correct in thinking that this is UTF-16 encoded? How would I convert these to output as HTML?
UPDATE
It seems that the JSON string from the API has two- and three-byte escaped Unicode sequences, e.g.:
That\u00e2\u0080\u0099s (right single quotation)
\u00c2\u00a3 (pound symbol)
It is not UTF-16 encoding. It rather looks like bogus encoding, because the \uXXXX escape form is independent of whichever UTF or UCS encoding is used for Unicode. \u00c2\u00a3 really maps to the string Â£.
What you should have is \u00a3, which is the Unicode code point for £.
{0xC2, 0xA3} is the 2-byte UTF-8 encoding of this code point.
If, as I think, the software that encoded the original UTF-8 string to JSON was oblivious to the fact that it was UTF-8 and blindly encoded each byte as an escaped Unicode code point, then you need to convert each pair of escaped code points back into a UTF-8 encoded character, and then decode it to the native PHP encoding to make it printable.
function fixBadUnicode($str) {
    return utf8_decode(preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str));
}
Example here: http://phpfiddle.org/main/code/6sq-rkn
Edit:
If you want to fix the string in order to obtain a valid JSON string, you need to use the following function:
function fixBadUnicodeForJson($str) {
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3")).chr(hexdec("$4"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2")).chr(hexdec("$3"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1")).chr(hexdec("$2"))', $str);
    $str = preg_replace("/\\\\u00([0-9a-f]{2})/e", 'chr(hexdec("$1"))', $str);
    return $str;
}
Edit 2: Fixed the previous function to transform any wrongly Unicode-escaped UTF-8 byte sequence into the equivalent UTF-8 character.
Be careful: some of these characters, which probably come from an editor such as Word, are not translatable to ISO-8859-1 and will therefore appear as '?' after utf8_decode.
The output is correct.
\u00c2 == Â
\u00a3 == £
So nothing is wrong here. And converting to HTML entities is easy:
htmlentities($title);
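For comparison, this is how a correctly escaped sequence behaves once it reaches PHP (a small illustrative sketch, not part of the original answer):
$pound = json_decode('"\u00a3"');                // £ as UTF-8
echo htmlentities($pound, ENT_QUOTES, 'UTF-8');  // &pound;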
Here is an updated version of the function using preg_replace_callback instead of preg_replace.
function fixBadUnicodeForJson($str) {
    // Handle 4-byte sequences first, then 3-, 2- and 1-byte ones
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])).chr(hexdec($matches[4])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])).chr(hexdec($matches[3])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])).chr(hexdec($matches[2])); },
        $str
    );
    $str = preg_replace_callback(
        '/\\\\u00([0-9a-f]{2})/',
        function($matches) { return chr(hexdec($matches[1])); },
        $str
    );
    return $str;
}
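A hypothetical usage check (input adapted from the question; the output is raw UTF-8):
echo fixBadUnicodeForJson('That\u00e2\u0080\u0099s \u00c2\u00a3100');
// That’s £100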
OK I have read many threads and have found some options that work but now I am just more curious than anything...
I am trying to remove characters like Â and é, as Google does not like them in the XML product feed.
Why does this work:
$string = ereg_replace("[^[:print:]]", ' ', $string);
But neither of these two do?
$string = preg_replace("/[^[:print:]]+/", ' ', $string);
$string = preg_replace("/[^[:print:]]/", ' ', $string);
To put it all in context, here is the full function:
    // Remove all unprintable characters
    $string = ereg_replace("[^[:print:]]", ' ', $string);
    // Convert into HTML entities after the unprintable characters are removed
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    // Decode back
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // Strip slashes and tags
    $string = strip_tags(stripslashes($string));
    // Return the UTF-8 encoded string
    return utf8_encode($string);
}
The reason that code doesn't work is that it only removes characters that are not in the POSIX [:print:] character class, which consists of printable characters. á, É, etc. are all printable.
You can find more about POSIX character classes here.
Also, removing accented characters might not always be the best option... Check out this question for alternatives.
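One such alternative (a hedged sketch, not taken from the linked question) is to transliterate the accented characters instead of stripping them; note that the exact output depends on the iconv implementation and locale:
setlocale(LC_CTYPE, 'en_US.UTF-8');
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Âccénted téxt');
// Typically prints "Accented text" with glibc; other iconv builds may differ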