I am receiving the following URL-encoded string from a form: %F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E
If I decode it I get the following formatted text: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Is there any way with PHP to get the plain "Alejandra" text from the encoded or decoded string?
I have tried several approaches without success, including
mb_convert_encoding($string, "UTF-16", mb_detect_encoding($string))
iconv('utf-16', 'utf-8', rawurldecode($string))
and every other solution I could find on Stack Overflow.
Edit:
I tried the proposed solution $strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str); but it deletes special characters such as áéíóúñç, which we need to keep.
Expected result
input: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
output: Alejandra
input: Álejandra
output: Álejandra
Thank you in advance.
urldecode or rawurldecode is sufficient.
$string = "%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E";
$str = urldecode($string);
var_dump($str);
//string(36) "𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎"
Demo: https://3v4l.org/OMQ35
A special debugger gives me: string(36) UTF-8mb4. This means the string contains UTF-8 characters that need 4 bytes each (9 characters × 4 bytes = 36 bytes). The first character is not the letter A but the Unicode character “𝐴” (U+1D434, MATHEMATICAL ITALIC CAPITAL A).
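To see the byte count versus the character count for yourself, here is a minimal sketch. It assumes the mbstring extension is loaded and PHP 7.2+ (for mb_ord); the variable names are only illustrative.

```php
<?php
// Sketch only: assumes the mbstring extension (mb_strlen, mb_substr, mb_ord).
$encoded = '%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E'
         . '%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E';
$str = urldecode($encoded);

$bytes = strlen($str);               // counts bytes
$chars = mb_strlen($str, 'UTF-8');   // counts code points

// Code point of the first character, formatted like U+XXXX.
$first     = mb_substr($str, 0, 1, 'UTF-8');
$codePoint = sprintf('U+%X', mb_ord($first, 'UTF-8'));

printf("%d bytes, %d characters, first is %s\n", $bytes, $chars, $codePoint);
// prints "36 bytes, 9 characters, first is U+1D434"
```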
Note:
If the special UTF-8 characters cause problems, you can try to display the strings as ASCII characters with iconv.
$strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str);
//string(9) "Alejandra"
What you are getting is called a "pseudo-alphabet"; you can see a list of them here: https://qaz.wtf/u/convert.cgi. The one that you appear to be getting can be seen here: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
Basically, what you need to do is take the string, split it into characters and use a lookup table to convert each one back to a regular character. This implementation is terribly inefficient, but that's because I grabbed the alphabets from the above Wikipedia page and was too lazy to reorganise them.
function math_symbols_to_plain_text($input, $alphabet)
{
    // One row per letter: the plain letter first, then its 13 styled variants,
    // in the same column order as $lookup below.
    $alphabets = [
        ['a','𝐚','𝑎','𝒂','𝖺','𝗮','𝘢','𝙖','𝒶','𝓪','𝔞','𝖆','𝚊','𝕒'],
        ['b','𝐛','𝑏','𝒃','𝖻','𝗯','𝘣','𝙗','𝒷','𝓫','𝔟','𝖇','𝚋','𝕓'],
        ['c','𝐜','𝑐','𝒄','𝖼','𝗰','𝘤','𝙘','𝒸','𝓬','𝔠','𝖈','𝚌','𝕔'],
        ['d','𝐝','𝑑','𝒅','𝖽','𝗱','𝘥','𝙙','𝒹','𝓭','𝔡','𝖉','𝚍','𝕕'],
        ['e','𝐞','𝑒','𝒆','𝖾','𝗲','𝘦','𝙚','ℯ','𝓮','𝔢','𝖊','𝚎','𝕖'],
        ['f','𝐟','𝑓','𝒇','𝖿','𝗳','𝘧','𝙛','𝒻','𝓯','𝔣','𝖋','𝚏','𝕗'],
        ['g','𝐠','𝑔','𝒈','𝗀','𝗴','𝘨','𝙜','ℊ','𝓰','𝔤','𝖌','𝚐','𝕘'],
        ['h','𝐡','ℎ','𝒉','𝗁','𝗵','𝘩','𝙝','𝒽','𝓱','𝔥','𝖍','𝚑','𝕙'],
        ['i','𝐢','𝑖','𝒊','𝗂','𝗶','𝘪','𝙞','𝒾','𝓲','𝔦','𝖎','𝚒','𝕚'],
        ['j','𝐣','𝑗','𝒋','𝗃','𝗷','𝘫','𝙟','𝒿','𝓳','𝔧','𝖏','𝚓','𝕛'],
        ['k','𝐤','𝑘','𝒌','𝗄','𝗸','𝘬','𝙠','𝓀','𝓴','𝔨','𝖐','𝚔','𝕜'],
        ['l','𝐥','𝑙','𝒍','𝗅','𝗹','𝘭','𝙡','𝓁','𝓵','𝔩','𝖑','𝚕','𝕝'],
        ['m','𝐦','𝑚','𝒎','𝗆','𝗺','𝘮','𝙢','𝓂','𝓶','𝔪','𝖒','𝚖','𝕞'],
        ['n','𝐧','𝑛','𝒏','𝗇','𝗻','𝘯','𝙣','𝓃','𝓷','𝔫','𝖓','𝚗','𝕟'],
        ['o','𝐨','𝑜','𝒐','𝗈','𝗼','𝘰','𝙤','ℴ','𝓸','𝔬','𝖔','𝚘','𝕠'],
        ['p','𝐩','𝑝','𝒑','𝗉','𝗽','𝘱','𝙥','𝓅','𝓹','𝔭','𝖕','𝚙','𝕡'],
        ['q','𝐪','𝑞','𝒒','𝗊','𝗾','𝘲','𝙦','𝓆','𝓺','𝔮','𝖖','𝚚','𝕢'],
        ['r','𝐫','𝑟','𝒓','𝗋','𝗿','𝘳','𝙧','𝓇','𝓻','𝔯','𝖗','𝚛','𝕣'],
        ['s','𝐬','𝑠','𝒔','𝗌','𝘀','𝘴','𝙨','𝓈','𝓼','𝔰','𝖘','𝚜','𝕤'],
        ['t','𝐭','𝑡','𝒕','𝗍','𝘁','𝘵','𝙩','𝓉','𝓽','𝔱','𝖙','𝚝','𝕥'],
        ['u','𝐮','𝑢','𝒖','𝗎','𝘂','𝘶','𝙪','𝓊','𝓾','𝔲','𝖚','𝚞','𝕦'],
        ['v','𝐯','𝑣','𝒗','𝗏','𝘃','𝘷','𝙫','𝓋','𝓿','𝔳','𝖛','𝚟','𝕧'],
        ['w','𝐰','𝑤','𝒘','𝗐','𝘄','𝘸','𝙬','𝓌','𝔀','𝔴','𝖜','𝚠','𝕨'],
        ['x','𝐱','𝑥','𝒙','𝗑','𝘅','𝘹','𝙭','𝓍','𝔁','𝔵','𝖝','𝚡','𝕩'],
        ['y','𝐲','𝑦','𝒚','𝗒','𝘆','𝘺','𝙮','𝓎','𝔂','𝔶','𝖞','𝚢','𝕪'],
        ['z','𝐳','𝑧','𝒛','𝗓','𝘇','𝘻','𝙯','𝓏','𝔃','𝔷','𝖟','𝚣','𝕫'],
        ['A','𝐀','𝐴','𝑨','𝖠','𝗔','𝘈','𝘼','𝒜','𝓐','𝔄','𝕬','𝙰','𝔸'],
        ['B','𝐁','𝐵','𝑩','𝖡','𝗕','𝘉','𝘽','ℬ','𝓑','𝔅','𝕭','𝙱','𝔹'],
        ['C','𝐂','𝐶','𝑪','𝖢','𝗖','𝘊','𝘾','𝒞','𝓒','ℭ','𝕮','𝙲','ℂ'],
        ['D','𝐃','𝐷','𝑫','𝖣','𝗗','𝘋','𝘿','𝒟','𝓓','𝔇','𝕯','𝙳','𝔻'],
        ['E','𝐄','𝐸','𝑬','𝖤','𝗘','𝘌','𝙀','ℰ','𝓔','𝔈','𝕰','𝙴','𝔼'],
        ['F','𝐅','𝐹','𝑭','𝖥','𝗙','𝘍','𝙁','ℱ','𝓕','𝔉','𝕱','𝙵','𝔽'],
        ['G','𝐆','𝐺','𝑮','𝖦','𝗚','𝘎','𝙂','𝒢','𝓖','𝔊','𝕲','𝙶','𝔾'],
        ['H','𝐇','𝐻','𝑯','𝖧','𝗛','𝘏','𝙃','ℋ','𝓗','ℌ','𝕳','𝙷','ℍ'],
        ['I','𝐈','𝐼','𝑰','𝖨','𝗜','𝘐','𝙄','ℐ','𝓘','ℑ','𝕴','𝙸','𝕀'],
        ['J','𝐉','𝐽','𝑱','𝖩','𝗝','𝘑','𝙅','𝒥','𝓙','𝔍','𝕵','𝙹','𝕁'],
        ['K','𝐊','𝐾','𝑲','𝖪','𝗞','𝘒','𝙆','𝒦','𝓚','𝔎','𝕶','𝙺','𝕂'],
        ['L','𝐋','𝐿','𝑳','𝖫','𝗟','𝘓','𝙇','ℒ','𝓛','𝔏','𝕷','𝙻','𝕃'],
        ['M','𝐌','𝑀','𝑴','𝖬','𝗠','𝘔','𝙈','ℳ','𝓜','𝔐','𝕸','𝙼','𝕄'],
        ['N','𝐍','𝑁','𝑵','𝖭','𝗡','𝘕','𝙉','𝒩','𝓝','𝔑','𝕹','𝙽','ℕ'],
        ['O','𝐎','𝑂','𝑶','𝖮','𝗢','𝘖','𝙊','𝒪','𝓞','𝔒','𝕺','𝙾','𝕆'],
        ['P','𝐏','𝑃','𝑷','𝖯','𝗣','𝘗','𝙋','𝒫','𝓟','𝔓','𝕻','𝙿','ℙ'],
        ['Q','𝐐','𝑄','𝑸','𝖰','𝗤','𝘘','𝙌','𝒬','𝓠','𝔔','𝕼','𝚀','ℚ'],
        ['R','𝐑','𝑅','𝑹','𝖱','𝗥','𝘙','𝙍','ℛ','𝓡','ℜ','𝕽','𝚁','ℝ'],
        ['S','𝐒','𝑆','𝑺','𝖲','𝗦','𝘚','𝙎','𝒮','𝓢','𝔖','𝕾','𝚂','𝕊'],
        ['T','𝐓','𝑇','𝑻','𝖳','𝗧','𝘛','𝙏','𝒯','𝓣','𝔗','𝕿','𝚃','𝕋'],
        ['U','𝐔','𝑈','𝑼','𝖴','𝗨','𝘜','𝙐','𝒰','𝓤','𝔘','𝖀','𝚄','𝕌'],
        ['V','𝐕','𝑉','𝑽','𝖵','𝗩','𝘝','𝙑','𝒱','𝓥','𝔙','𝖁','𝚅','𝕍'],
        ['W','𝐖','𝑊','𝑾','𝖶','𝗪','𝘞','𝙒','𝒲','𝓦','𝔚','𝖂','𝚆','𝕎'],
        ['X','𝐗','𝑋','𝑿','𝖷','𝗫','𝘟','𝙓','𝒳','𝓧','𝔛','𝖃','𝚇','𝕏'],
        ['Y','𝐘','𝑌','𝒀','𝖸','𝗬','𝘠','𝙔','𝒴','𝓨','𝔜','𝖄','𝚈','𝕐'],
        ['Z','𝐙','𝑍','𝒁','𝖹','𝗭','𝘡','𝙕','𝒵','𝓩','ℨ','𝖅','𝚉','ℤ']
    ];
    // Column names; the position in this list is the column index in each row above.
    $lookup = [
        'serif-normal',
        'serif-bold',
        'serif-italic',
        'serif-bolditalic',
        'sans-normal',
        'sans-bold',
        'sans-italic',
        'sans-bolditalic',
        'script-normal',
        'script-bold',
        'fraktur-normal',
        'fraktur-bold',
        'monospace',
        'doublestruck'
    ];
    $map_index = array_search($alphabet, $lookup, true);
    if ($map_index === false) {
        throw new InvalidArgumentException("Unknown alphabet: $alphabet");
    }
    $output = '';
    foreach (mb_str_split($input) as $char) {
        // Characters that are not in the chosen alphabet (spaces, accented
        // letters, ...) are passed through unchanged instead of being dropped.
        $plain = $char;
        foreach ($alphabets as $letter) {
            if ($letter[$map_index] === $char) {
                $plain = $letter[0]; // column 0 holds the plain letter
                break;
            }
        }
        $output .= $plain;
    }
    return $output;
}
$input = '𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎';
$output = math_symbols_to_plain_text($input, 'serif-italic');
echo $input . PHP_EOL . $output . PHP_EOL;
Yields:
𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Alejandra
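If the intl extension is available, a shorter route than a hand-built table may be Unicode compatibility normalization (NFKC): the mathematical letters carry a compatibility mapping to the plain letters, while precomposed accented letters such as Á survive it. A minimal sketch under that assumption; fold_compat_letters is a made-up helper name, not an existing PHP function.

```php
<?php
// Sketch only: requires the intl extension (Normalizer class).
// fold_compat_letters() is a hypothetical helper name used for illustration.
function fold_compat_letters(string $text): string
{
    // NFKC replaces compatibility characters (such as the mathematical
    // alphanumerics) with their plain equivalents, then recomposes accents.
    $normalized = Normalizer::normalize($text, Normalizer::FORM_KC);

    // normalize() returns false for invalid input; keep the text as-is then.
    return $normalized === false ? $text : $normalized;
}

$styled = urldecode(
    '%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E'
    . '%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E'
);
$accented = "\u{00C1}lejandra"; // "Álejandra", written with an escape

echo fold_compat_letters($styled), PHP_EOL;   // Alejandra
echo fold_compat_letters($accented), PHP_EOL; // Álejandra
```

Be aware that NFKC is not limited to these alphabets: it also folds other compatibility characters (ligatures, full-width forms, superscripts and so on), so check that this is acceptable for your form data.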
If I'm not mistaken, you are trying to decode a URL-encoded string, so why not use urldecode()?
See the PHP documentation for urldecode().
I'm trying to do accented character replacement in PHP but get funky results. My guess is that it's because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
รrjan Nilsen -> ๏ฟฝorjan Nilsen
Expected Result:
รrjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?
According to the PHP documentation, str_replace() is binary-safe, which means it can handle UTF-8 encoded text without any data loss.
It looks like the string was not replaced because your input encoding and the encoding of the source file (which holds the search strings) do not match.
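To illustrate the mismatch, here is a small sketch. It writes the search characters as "\u{...}" escapes (PHP 7.0+) so they are UTF-8 no matter how the script file is saved, and it assumes the mbstring extension for the conversion step; the sample input values are made up.

```php
<?php
// Sketch only: mb_check_encoding/mb_convert_encoding need the mbstring extension.
$search  = ["\u{00D8}", "\u{00F8}"]; // Ø and ø as UTF-8 byte sequences
$replace = ['O', 'o'];

$utf8Input   = "\u{00D8}rjan Nilsen"; // UTF-8: Ø is the two bytes C3 98
$latin1Input = "\xD8rjan Nilsen";     // ISO-8859-1: Ø is the single byte D8

// Same encoding on both sides: the bytes match and are replaced.
$fromUtf8 = str_replace($search, $replace, $utf8Input);

// Different encodings: the search bytes never occur, nothing is replaced.
$fromLatin1 = str_replace($search, $replace, $latin1Input);

// Detect the problem and convert the input before replacing.
$isUtf8    = mb_check_encoding($latin1Input, 'UTF-8');
$converted = mb_convert_encoding($latin1Input, 'UTF-8', 'ISO-8859-1');
$fixed     = str_replace($search, $replace, $converted);

echo $fromUtf8, PHP_EOL; // Orjan Nilsen
echo $fixed, PHP_EOL;    // Orjan Nilsen
```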
It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
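Note that the pattern above also removes digits and punctuation, since they are neither letters nor spaces. If you only want to drop the diacritics, one possible variation is to remove just the combining marks (\p{Mn}) after NFD. A minimal sketch, assuming the intl extension; strip_diacritics is a made-up helper name.

```php
<?php
// Sketch only: requires the intl extension (Normalizer class).
// strip_diacritics() is a hypothetical helper name used for illustration.
function strip_diacritics(string $text): string
{
    // Split precomposed letters into base letter + combining mark(s).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // invalid input: leave it untouched
    }

    // \p{Mn} matches nonspacing marks only, so digits, punctuation and
    // spaces are kept.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    return $stripped ?? $text; // preg_replace returns null on error
}

// "Ñandú, café!" written with escapes so the file encoding does not matter.
echo strip_diacritics("\u{00D1}and\u{00FA}, caf\u{00E9}!"), PHP_EOL; // Nandu, cafe!

// Ø has no decomposition, so it is left as it is (as in the output above).
echo strip_diacritics("\u{00D8}rjan"), PHP_EOL; // Ørjan
```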
The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
Try this function definition:
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        // Apply the replacement to every element of an array subject.
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Build one alternation of all quoted search strings. The /u modifier
        // makes PCRE treat pattern and subject as UTF-8.
        // (Closures replace create_function(), which was removed in PHP 8.)
        $quoted = array_map(function ($s) {
            return preg_quote($s, '/');
        }, (array)$search);
        $pattern = '/(?:' . implode('|', $quoted) . ')/u';
        if (is_array($search) && is_array($replace)) {
            // Pair each search string with its replacement.
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        // Single replacement string: use a callback so that "$1" or "\1" in
        // the replacement is inserted literally.
        $replacement = (string)$replace;
        return preg_replace_callback($pattern, function () use ($replacement) {
            return $replacement;
        }, $subject);
    }
}
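For a plain one-to-one table like this, strtr() with an array may also be worth a try: it makes a single pass, tries the longest keys first and never rescans text it has already replaced. It works on bytes, but a complete UTF-8 sequence cannot match in the middle of another character, so it is safe as long as the keys and the input are both valid UTF-8. A short sketch with a made-up, deliberately tiny map:

```php
<?php
// Sketch only: the map below is a small illustrative subset, not a full table.
// The keys use "\u{...}" escapes (PHP 7.0+) so they are UTF-8 regardless of
// how this file is saved.
$map = [
    "\u{00D8}" => 'O',  // Ø
    "\u{00F8}" => 'o',  // ø
    "\u{0153}" => 'oe', // œ
    "\u{00F1}" => 'n',  // ñ
];

$result = strtr("\u{00D8}rjan Nilsen", $map);
echo $result, PHP_EOL; // Orjan Nilsen
```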