substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but I get funky results. My guess is that this is because I'm using a UTF-8 string and str_replace() can't properly handle multibyte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

According to the PHP documentation, str_replace() is binary-safe, which means it can handle UTF-8 encoded text without any data loss.

It looks like the string was not replaced correctly because the encoding of your input does not match the encoding of the script file that contains the search array.
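A minimal sketch of both points, written with explicit byte escapes so the result does not depend on how the script file is saved. The single byte 0x98 in the second call is an assumption chosen for illustration (it stands for a search entry coming from a script saved in some single-byte encoding); it is not confirmed as the actual cause in the question.

```php
<?php
// "Ørjan Nilsen" as UTF-8 bytes: Ø is the two-byte sequence C3 98.
$str = "\xC3\x98rjan Nilsen";

// Search string in the same encoding (UTF-8): the whole sequence matches.
$ok = str_replace("\xC3\x98", 'O', $str);
echo $ok, PHP_EOL; // Orjan Nilsen

// Hypothetical mismatch: the search entry is a single byte. It matches only
// the second byte of Ø and leaves the lead byte C3 behind, which a browser
// or terminal then shows as a replacement character.
$broken = str_replace("\x98", 'o', $str);
echo bin2hex(substr($broken, 0, 2)), PHP_EOL; // c36f
```

When search, replacement and subject are all valid UTF-8, str_replace() can only match whole characters, because no UTF-8 character's byte sequence starts or ends in the middle of another one.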

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
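The pattern above also deletes digits and punctuation, and the output shows that ø, Ø and œ survive because they have no decomposition. A possible variant, as a sketch only: strip just the combining marks (\p{Mn}) and handle the non-decomposable letters with a small table. The function name remove_diacritics and the fallback table are hypothetical and incomplete; the intl extension is assumed to be available.

```php
<?php
// Hypothetical helper: strips combining marks after NFD, then maps a few
// letters that NFD leaves untouched. Requires the intl extension.
function remove_diacritics(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // input was not valid UTF-8; leave it alone
    }
    // \p{Mn} matches nonspacing marks such as COMBINING DIAERESIS.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    // Illustrative, incomplete table for letters without a decomposition.
    return strtr($stripped, ['Ø' => 'O', 'ø' => 'o', 'œ' => 'oe', 'Œ' => 'OE']);
}

echo remove_diacritics('Ørjan Nilsen, née 1982'), PHP_EOL; // Orjan Nilsen, nee 1982
```

Digits, commas and spaces are kept here because only marks are removed, which may or may not be what a given application wants.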

Try this function definition (create_function() was removed in PHP 8, so closures are used here):
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote each whole search string (not just its first byte) for the pattern.
        $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, (array)$search);
        $pattern = '/(?:'.implode('|', $quoted).')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            // Look up the replacement for whichever search string matched.
            $f = function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            };
            return preg_replace_callback($pattern, $f, $subject);
        }
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

Related

PHP Convert Unicode to text

I am receiving from a form the following urlencoded string %F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E
If I decode it I get the following formatted text: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Is there any way with PHP to get the plain "Alejandra" text from the encoded or decoded string?
I have tried without success several ways to do it with
mb_convert_encoding($string, "UTF-16",mb_detect_encoding($string))
iconv('utf-16', 'utf-8', rawurldecode($string))
and every other solution I could find on Stack Overflow.
Edit:
I tried the proposed solution $strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str); but it deletes the special characters such as áéíóúñç which we need to stay.
Expected result
input: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
output: Alejandra
input: Álejandra
output: Álejandra
Thank you in advance.

urldecode or rawurldecode is sufficient.
$string = "%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E";
$str = urldecode($string);
var_dump($str);
//string(36) "𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎"
Demo: https://3v4l.org/OMQ35
A special debugger gives me: string(36) UTF-8mb4. This means that the string contains UTF-8 characters that require 4 bytes each. The first character is not the ASCII letter A but the Unicode character “𝐴” (U+1D434, MATHEMATICAL ITALIC CAPITAL A).
Note:
If the special UTF-8 characters cause problems, you can try to display the strings as ASCII characters with iconv.
$strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str);
//string(9) "Alejandra"

What you are getting is called a "pseudo-alphabet"; you can see a list of them here: https://qaz.wtf/u/convert.cgi. The one that you appear to be getting can be seen here: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
Basically, what you need to do is take the string, split it into characters and use a lookup table to convert them back to regular characters. This implementation is terribly inefficient, but that's because I grabbed the alphabets from the above Wikipedia page and was too lazy to reorganise them.
function math_symbols_to_plain_text($input, $alphabet)
{
$alphabets = [
['a','𝐚','𝑎','𝒂','𝖺','𝗮','𝘢','𝙖','𝒶','𝓪','𝔞','𝖆','𝚊','𝕒'],
['b','𝐛','𝑏','𝒃','𝖻','𝗯','𝘣','𝙗','𝒷','𝓫','𝔟','𝖇','𝚋','𝕓'],
['c','𝐜','𝑐','𝒄','𝖼','𝗰','𝘤','𝙘','𝒸','𝓬','𝔠','𝖈','𝚌','𝕔'],
['d','𝐝','𝑑','𝒅','𝖽','𝗱','𝘥','𝙙','𝒹','𝓭','𝔡','𝖉','𝚍','𝕕'],
['e','𝐞','𝑒','𝒆','𝖾','𝗲','𝘦','𝙚','ℯ','𝓮','𝔢','𝖊','𝚎','𝕖'],
['f','𝐟','𝑓','𝒇','𝖿','𝗳','𝘧','𝙛','𝒻','𝓯','𝔣','𝖋','𝚏','𝕗'],
['g','𝐠','𝑔','𝒈','𝗀','𝗴','𝘨','𝙜','ℊ','𝓰','𝔤','𝖌','𝚐','𝕘'],
['h','𝐡','ℎ','𝒉','𝗁','𝗵','𝘩','𝙝','𝒽','𝓱','𝔥','𝖍','𝚑','𝕙'],
['i','𝐢','𝑖','𝒊','𝗂','𝗶','𝘪','𝙞','𝒾','𝓲','𝔦','𝖎','𝚒','𝕚'],
['j','𝐣','𝑗','𝒋','𝗃','𝗷','𝘫','𝙟','𝒿','𝓳','𝔧','𝖏','𝚓','𝕛'],
['k','𝐤','𝑘','𝒌','𝗄','𝗸','𝘬','𝙠','𝓀','𝓴','𝔨','𝖐','𝚔','𝕜'],
['l','𝐥','𝑙','𝒍','𝗅','𝗹','𝘭','𝙡','𝓁','𝓵','𝔩','𝖑','𝚕','𝕝'],
['m','𝐦','𝑚','𝒎','𝗆','𝗺','𝘮','𝙢','𝓂','𝓶','𝔪','𝖒','𝚖','𝕞'],
['n','𝐧','𝑛','𝒏','𝗇','𝗻','𝘯','𝙣','𝓃','𝓷','𝔫','𝖓','𝚗','𝕟'],
['o','𝐨','𝑜','𝒐','𝗈','𝗼','𝘰','𝙤','ℴ','𝓸','𝔬','𝖔','𝚘','𝕠'],
['p','𝐩','𝑝','𝒑','𝗉','𝗽','𝘱','𝙥','𝓅','𝓹','𝔭','𝖕','𝚙','𝕡'],
['q','𝐪','𝑞','𝒒','𝗊','𝗾','𝘲','𝙦','𝓆','𝓺','𝔮','𝖖','𝚚','𝕢'],
['r','𝐫','𝑟','𝒓','𝗋','𝗿','𝘳','𝙧','𝓇','𝓻','𝔯','𝖗','𝚛','𝕣'],
['s','𝐬','𝑠','𝒔','𝗌','𝘀','𝘴','𝙨','𝓈','𝓼','𝔰','𝖘','𝚜','𝕤'],
['t','𝐭','𝑡','𝒕','𝗍','𝘁','𝘵','𝙩','𝓉','𝓽','𝔱','𝖙','𝚝','𝕥'],
['u','𝐮','𝑢','𝒖','𝗎','𝘂','𝘶','𝙪','𝓊','𝓾','𝔲','𝖚','𝚞','𝕦'],
['v','𝐯','𝑣','𝒗','𝗏','𝘃','𝘷','𝙫','𝓋','𝓿','𝔳','𝖛','𝚟','𝕧'],
['w','𝐰','𝑤','𝒘','𝗐','𝘄','𝘸','𝙬','𝓌','𝔀','𝔴','𝖜','𝚠','𝕨'],
['x','𝐱','𝑥','𝒙','𝗑','𝘅','𝘹','𝙭','𝓍','𝔁','𝔵','𝖝','𝚡','𝕩'],
['y','𝐲','𝑦','𝒚','𝗒','𝘆','𝘺','𝙮','𝓎','𝔂','𝔶','𝖞','𝚢','𝕪'],
['z','𝐳','𝑧','𝒛','𝗓','𝘇','𝘻','𝙯','𝓏','𝔃','𝔷','𝖟','𝚣','𝕫'],
['A','𝐀','𝐴','𝑨','𝖠','𝗔','𝘈','𝘼','𝒜','𝓐','𝔄','𝕬','𝙰','𝔸'],
['B','𝐁','𝐵','𝑩','𝖡','𝗕','𝘉','𝘽','ℬ','𝓑','𝔅','𝕭','𝙱','𝔹'],
['C','𝐂','𝐶','𝑪','𝖢','𝗖','𝘊','𝘾','𝒞','𝓒','ℭ','𝕮','𝙲','ℂ'],
['D','𝐃','𝐷','𝑫','𝖣','𝗗','𝘋','𝘿','𝒟','𝓓','𝔇','𝕯','𝙳','𝔻'],
['E','𝐄','𝐸','𝑬','𝖤','𝗘','𝘌','𝙀','ℰ','𝓔','𝔈','𝕰','𝙴','𝔼'],
['F','𝐅','𝐹','𝑭','𝖥','𝗙','𝘍','𝙁','ℱ','𝓕','𝔉','𝕱','𝙵','𝔽'],
['G','𝐆','𝐺','𝑮','𝖦','𝗚','𝘎','𝙂','𝒢','𝓖','𝔊','𝕲','𝙶','𝔾'],
['H','𝐇','𝐻','𝑯','𝖧','𝗛','𝘏','𝙃','ℋ','𝓗','ℌ','𝕳','𝙷','ℍ'],
['I','𝐈','𝐼','𝑰','𝖨','𝗜','𝘐','𝙄','ℐ','𝓘','ℑ','𝕴','𝙸','𝕀'],
['J','𝐉','𝐽','𝑱','𝖩','𝗝','𝘑','𝙅','𝒥','𝓙','𝔍','𝕵','𝙹','𝕁'],
['K','𝐊','𝐾','𝑲','𝖪','𝗞','𝘒','𝙆','𝒦','𝓚','𝔎','𝕶','𝙺','𝕂'],
['L','𝐋','𝐿','𝑳','𝖫','𝗟','𝘓','𝙇','ℒ','𝓛','𝔏','𝕷','𝙻','𝕃'],
['M','𝐌','𝑀','𝑴','𝖬','𝗠','𝘔','𝙈','ℳ','𝓜','𝔐','𝕸','𝙼','𝕄'],
['N','𝐍','𝑁','𝑵','𝖭','𝗡','𝘕','𝙉','𝒩','𝓝','𝔑','𝕹','𝙽','ℕ'],
['O','𝐎','𝑂','𝑶','𝖮','𝗢','𝘖','𝙊','𝒪','𝓞','𝔒','𝕺','𝙾','𝕆'],
['P','𝐏','𝑃','𝑷','𝖯','𝗣','𝘗','𝙋','𝒫','𝓟','𝔓','𝕻','𝙿','ℙ'],
['Q','𝐐','𝑄','𝑸','𝖰','𝗤','𝘘','𝙌','𝒬','𝓠','𝔔','𝕼','𝚀','ℚ'],
['R','𝐑','𝑅','𝑹','𝖱','𝗥','𝘙','𝙍','ℛ','𝓡','ℜ','𝕽','𝚁','ℝ'],
['S','𝐒','𝑆','𝑺','𝖲','𝗦','𝘚','𝙎','𝒮','𝓢','𝔖','𝕾','𝚂','𝕊'],
['T','𝐓','𝑇','𝑻','𝖳','𝗧','𝘛','𝙏','𝒯','𝓣','𝔗','𝕿','𝚃','𝕋'],
['U','𝐔','𝑈','𝑼','𝖴','𝗨','𝘜','𝙐','𝒰','𝓤','𝔘','𝖀','𝚄','𝕌'],
['V','𝐕','𝑉','𝑽','𝖵','𝗩','𝘝','𝙑','𝒱','𝓥','𝔙','𝖁','𝚅','𝕍'],
['W','𝐖','𝑊','𝑾','𝖶','𝗪','𝘞','𝙒','𝒲','𝓦','𝔚','𝖂','𝚆','𝕎'],
['X','𝐗','𝑋','𝑿','𝖷','𝗫','𝘟','𝙓','𝒳','𝓧','𝔛','𝖃','𝚇','𝕏'],
['Y','𝐘','𝑌','𝒀','𝖸','𝗬','𝘠','𝙔','𝒴','𝓨','𝔜','𝖄','𝚈','𝕐'],
['Z','𝐙','𝑍','𝒁','𝖹','𝗭','𝘡','𝙕','𝒵','𝓩','ℨ','𝖅','𝚉','ℤ']
];
$replace = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'];
$lookup = [
'serif-normal',
'serif-bold',
'serif-italic',
'serif-bolditalic',
'sans-normal',
'sans-bold',
'sans-italic',
'sans-bolditalic',
'script-normal',
'script-bold',
'fraktur-normal',
'fraktur-bold',
'monospace',
'doublestruck'
];
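$map_index = array_search($alphabet, $lookup, true);
if ($map_index === false) {
return $input; // unknown alphabet name: leave the input untouched
}
$output = '';
foreach (mb_str_split($input) as $char) {
$plain = $char; // keep characters that are not in the table (spaces, Á, ...)
foreach ($alphabets as $i => $letter) {
if ($letter[$map_index] === $char) {
$plain = $replace[$i];
break;
}
}
$output .= $plain;
}
return $output;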
}
$input = '𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎';
$output = math_symbols_to_plain_text($input, 'serif-italic');
echo $input . PHP_EOL . $output . PHP_EOL;
Yields:
𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Alejandra
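An alternative that is not taken from the answers above: Unicode compatibility normalization (NFKC) folds the mathematical alphanumeric symbols to their base letters while leaving precomposed accented letters such as Á alone, which matches the expected results in the question. This is a sketch that assumes the intl extension is installed. NFKC also folds other compatibility characters (ligatures, superscripts, full-width forms), so it may change more than intended.

```php
<?php
// Sketch: NFKC maps "font variant" code points such as U+1D434 to their
// base letters. Requires the intl extension (Normalizer class).
$encoded = '%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E'
         . '%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E';
$styled = urldecode($encoded);

$plain = Normalizer::normalize($styled, Normalizer::FORM_KC);
echo $plain, PHP_EOL; // Alejandra

// A precomposed Á (bytes C3 81) is already in NFKC form and stays as it is.
$accented = Normalizer::normalize("\xC3\x81lejandra", Normalizer::FORM_KC);
echo $accented, PHP_EOL; // Álejandra
```

Unlike the lookup-table function, this needs no knowledge of which pseudo-alphabet was used.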

If I am not mistaken, you are trying to decode a URL-encoded string, so why not use urldecode()?
See the PHP documentation for it.

PHP str_replace unintentionally removing Chinese characters

I have a PHP script that removes special characters, but unfortunately some Chinese characters are also removed.
<?php
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split('#/\\:*?\"<>|[]\'_+(),{}’! &'), "", $inputString);
return $inputString;
}
$test = '赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
?>
Oddly, the output is 赵然 赵然: the character 景 is removed.
In addition, 陈 一 is also removed. What might be the cause?

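The string you're using as the list of things to replace mixes ASCII with a multibyte character (’), and str_split() splits it into single bytes. In UTF-8, ’ is the three bytes E2 80 99 and 景 is E6 99 AF, so removing the byte 99 destroys 景. What I've done is convert this string to UTF-16 and then split it.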
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
Which gives...
赵景然赵景然
BTW, str_replace() is multibyte safe, sort of; see this comment in the manual (I recognised the poster): http://php.net/manual/en/ref.mbstring.php#109937
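Another way to avoid the byte-wise split, sketched here under the assumption that the script file is saved as UTF-8: split the list into whole characters with preg_split() and the u modifier, then hand those to str_replace(). The function name remove_special_characters is only illustrative. On PHP 7.4 and later, mb_str_split() could do the splitting instead.

```php
<?php
// Sketch: split the list into whole UTF-8 characters, not bytes.
// Assumes this source file is saved as UTF-8.
function remove_special_characters(string $input): string
{
    $list = '#/\\:*?"<>|[]\'_+(),{}’! &';
    // With the u modifier the empty pattern splits between code points.
    $chars = preg_split('//u', $list, -1, PREG_SPLIT_NO_EMPTY);
    return str_replace($chars, '', $input);
}

echo remove_special_characters('#赵景然 赵景然'), PHP_EOL; // 赵景然赵景然
echo remove_special_characters('陈 一'), PHP_EOL;          // 陈一
```

Because ’ is now removed as one three-byte unit, the bytes 80 and 99 inside other characters are no longer touched.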

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which returns the first letter of each word of a string.
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word)
$retturns .= ($word[0]);
return $retturns;
}
Everything works fine. The only problem is that when a word begins with a special character, the result gets messy.
For example, "test økonomi" becomes "t�" instead of "tø".
How can I correct this?

That happens because $word[0] takes the first byte of a string, whereas you are using a multibyte encoding, so a character may consist of multiple bytes. In the case of ø, the character consists of 2 bytes: 0xC3 0xB8
That is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87
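To make the byte-versus-character difference visible, here is a small sketch. The name initials_utf8 is hypothetical, the mbstring extension is assumed, and splitting on runs of whitespace is an addition that the original function does not have.

```php
<?php
// Sketch of a UTF-8 aware initials(); requires the mbstring extension.
function initials_utf8(string $text): string
{
    $result = '';
    // Split on runs of whitespace and skip empty pieces, so that double
    // spaces do not produce empty words. preg_split() returns false for
    // invalid UTF-8, in which case no initials are produced.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($words as $word) {
        $result .= mb_substr($word, 0, 1, 'UTF-8');
    }
    return $result;
}

$word = "\xC3\xB8konomi";                     // "økonomi" as UTF-8 bytes
echo bin2hex($word[0]), PHP_EOL;              // c3 (only the lead byte of ø)
echo initials_utf8('test  økonomi'), PHP_EOL; // tø
```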

You should use mb_substr with mb_internal_encoding as in example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word) {
$retturns .= mb_substr($word,0,1);
}
return $retturns;
}

Complementing the answers above, you could convert each UTF-8 encoded character (to be precise, each character assumed to be UTF-8) to its ISO 8859-1 counterpart.
This requires no multibyte support, which is not enabled by default in many PHP configurations.
Use utf8_decode() to do so (note that it is deprecated as of PHP 8.2):
<?php
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
$retturns .= ($word[0]);
return $retturns;
}
echo initials("test økonomi");
// prints "tø" as ISO 8859-1 bytes; pass the result through utf8_encode() for UTF-8 output
?>
Edit: This approach breaks if the characters being converted are not defined in the ISO 8859-1 charset (e.g. non-Latin symbols). To reiterate: if PHP multibyte support is turned on, the mb_substr() solution is certainly the most appropriate, as it can properly process the string in UTF-8 encoding.

str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do?
$string = str_replace('"', '\\"', $string);
In particular, if the input was in a character set that might have a valid character like 0xbf5c, so an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unquoted double quote (").
Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place?
(In my case, the string is going into the value attribute of an HTML input tag: echo '<input type="text" value="' . $string . '">';)
EDIT: For that matter, what about a function like preg_quote()? There's no charset argument for it, so it seems totally useless in this scenario. When you DON'T have the option of limiting charset to UTF-8 (yes, that'd be nice), it seems like you are really handicapped. What replace and quoting functions are available in that case?

No, you’re right: using a single-byte string function on a multibyte string can cause unexpected results. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:
$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));
Edit: Here’s an mb_replace implementation using the split-join variant:
function mb_replace($search, $replace, $subject, &$count=0) {
    $count = 0; // reset so recursive calls do not accumulate stale counts
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
        unset($string);
    } elseif (is_array($search)) {
        // like str_replace: a search term without a matching replacement is removed
        $replace = is_array($replace) ? array_values($replace) : $replace;
        foreach (array_values($search) as $i => $term) {
            if (is_array($replace)) {
                $with = isset($replace[$i]) ? $replace[$i] : '';
            } else {
                $with = $replace;
            }
            $subject = mb_replace($term, $with, $subject, $c);
            $count += $c;
        }
    } else {
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts)-1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}
As regards the combination of parameters, this function should behave like the singlebyte str_replace.

The code is perfectly safe with sane multibyte-encodings like UTF-8 and EUC-TW, but dangerous with broken ones like Shift_JIS, GB*, etc. Rather than going through all the headache and overhead to be safe with these legacy encodings, I would recommend just supporting only UTF-8.
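The byte sequence from the question can be traced with a few lines. This is only a sketch on raw bytes; the helper name escape_quotes is hypothetical and simply wraps the str_replace() call from the question.

```php
<?php
// The escaping from the question, applied to raw bytes.
function escape_quotes(string $s): string
{
    return str_replace('"', '\\"', $s);
}

// Attacker input from the question: 0xBF followed by a double quote (0x22).
$escaped = escape_quotes("\xBF\x22");
echo bin2hex($escaped), PHP_EOL; // bf5c22
// In an encoding where BF 5C is one character, the inserted backslash is
// swallowed and the trailing 0x22 is an unescaped quote.

// In UTF-8 every byte of a multibyte character is >= 0x80, so an ASCII
// quote or backslash can never be part of another character.
$utf8 = escape_quotes("\xC3\xB8\"");
echo bin2hex($utf8), PHP_EOL; // c3b85c22
```

The second result decodes unambiguously as ø, backslash, quote, which is why the same code is safe when everything is UTF-8.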

You could use mb_ereg_replace(), specifying the charset first with mb_regex_encoding(). Alternatively, if you use UTF-8, you can use preg_replace() with the u modifier.

How can I make the first character of a string lowercase in PHP?

I cannot use strtolower as it affects all characters. Should I use some sort of regular expression?
I'm getting a string which is a product code. I want to use this product code as a search key in a different place with the first letter made lowercase.

Try
lcfirst — Make a string's first character lowercase
and for PHP < 5.3 add this into the global scope:
if (!function_exists('lcfirst')) {
function lcfirst($str)
{
$str = is_string($str) ? $str : '';
if(mb_strlen($str) > 0) {
$str[0] = mb_strtolower($str[0]);
}
return $str;
}
}
The advantage of the above over just calling strtolower() where needed is that your PHP code will simply switch to the native function once you upgrade to PHP 5.3.
The function checks whether there actually is a first character in the string. Note that it is not truly multibyte aware: $str[0] addresses the first byte, so it only works when the first character is a single byte.

Just do:
$str = "STACK overflow";
$str[0] = strtolower($str[0]); // $str is now "sTACK overflow"
And if you are using 5.3 or later, you can do:
$str = lcfirst($str);

Use lcfirst():
<?php
$foo = 'HelloWorld';
$foo = lcfirst($foo); // helloWorld
$bar = 'HELLO WORLD!';
$bar = lcfirst($bar); // hELLO WORLD!
$bar = lcfirst(strtoupper($bar)); // hELLO WORLD!
?>

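For a multibyte first letter of a string, none of the previous examples will work.
In that case, you should use the following (PHP 8.4 and later ship a native mb_lcfirst(), hence the guard):
if (!function_exists('mb_lcfirst')) {
    function mb_lcfirst($string)
    {
        return mb_strtolower(mb_substr($string, 0, 1)) . mb_substr($string, 1);
    }
}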
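A short usage sketch of the same idea, with the encoding passed explicitly instead of relying on mb_internal_encoding(). The name lcfirst_utf8 is hypothetical, chosen so it cannot clash with a built-in function, and the mbstring extension is assumed.

```php
<?php
// Sketch; lcfirst_utf8 is a hypothetical name. Requires mbstring.
function lcfirst_utf8(string $s): string
{
    // Lowercase the first character, then append the rest unchanged.
    return mb_strtolower(mb_substr($s, 0, 1, 'UTF-8'), 'UTF-8')
         . mb_substr($s, 1, null, 'UTF-8');
}

echo lcfirst_utf8('Økonomi Kode'), PHP_EOL; // økonomi Kode
echo lcfirst_utf8('HelloWorld'), PHP_EOL;   // helloWorld
```

An empty string is returned unchanged, since both mb_substr() calls yield an empty string in that case.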

The ucfirst() function converts the first character of a string to uppercase.
Related functions:
lcfirst() - converts the first character of a string to lowercase
ucwords() - converts the first character of each word in a string to uppercase
strtoupper() - converts a string to uppercase
strtolower() - converts a string to lowercase
PHP version: 4 and later
