PHP Convert Unicode to text


I am receiving from a form the following urlencoded string %F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E
If I decode it I get the following formatted text: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Is there any way with PHP to get the plain "Alejandra" text from the encoded or decoded string?
I have tried several approaches without success, including
mb_convert_encoding($string, "UTF-16", mb_detect_encoding($string))
iconv('utf-16', 'utf-8', rawurldecode($string))
and every other solution I could find on Stack Overflow.
Edit:
I tried the proposed solution $strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str); but it deletes the special characters such as áéíóúñç which we need to stay.
Expected result
input: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
output: Alejandra
input: Álejandra
output: Álejandra
Thank you in advance.

urldecode or rawurldecode is sufficient.
$string = "%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E";
$str = urldecode($string);
var_dump($str);
//string(36) "𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎"
Demo: https://3v4l.org/OMQ35
A special debugger gives me: string(36) UTF-8mb4. This means the string contains UTF-8 characters that take 4 bytes each. The first character is not the ASCII letter A but the Unicode character “𝐴” (U+1D434, MATHEMATICAL ITALIC CAPITAL A).
Note:
If the special UTF-8 characters cause problems, you can try to display the strings as ASCII characters with iconv.
$strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str);
//string(9) "Alejandra"

What you are getting is called a "pseudo-alphabet"; you can see a list of them here: https://qaz.wtf/u/convert.cgi. The one that you appear to be getting can be seen here: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
Basically what you need to do is take the string, split it into characters and use a lookup table to convert each one back to a regular character. This implementation is terribly inefficient, but that's because I grabbed the alphabets from the above Wikipedia page and was too lazy to reorganise them.
function math_symbols_to_plain_text($input, $alphabet)
{
$alphabets = [
['a','𝐚','𝑎','𝒂','𝖺','𝗮','𝘢','𝙖','𝒶','𝓪','𝔞','𝖆','𝚊','𝕒'],
['b','𝐛','𝑏','𝒃','𝖻','𝗯','𝘣','𝙗','𝒷','𝓫','𝔟','𝖇','𝚋','𝕓'],
['c','𝐜','𝑐','𝒄','𝖼','𝗰','𝘤','𝙘','𝒸','𝓬','𝔠','𝖈','𝚌','𝕔'],
['d','𝐝','𝑑','𝒅','𝖽','𝗱','𝘥','𝙙','𝒹','𝓭','𝔡','𝖉','𝚍','𝕕'],
['e','𝐞','𝑒','𝒆','𝖾','𝗲','𝘦','𝙚','ℯ','𝓮','𝔢','𝖊','𝚎','𝕖'],
['f','𝐟','𝑓','𝒇','𝖿','𝗳','𝘧','𝙛','𝒻','𝓯','𝔣','𝖋','𝚏','𝕗'],
['g','𝐠','𝑔','𝒈','𝗀','𝗴','𝘨','𝙜','ℊ','𝓰','𝔤','𝖌','𝚐','𝕘'],
['h','𝐡','ℎ','𝒉','𝗁','𝗵','𝘩','𝙝','𝒽','𝓱','𝔥','𝖍','𝚑','𝕙'],
['i','𝐢','𝑖','𝒊','𝗂','𝗶','𝘪','𝙞','𝒾','𝓲','𝔦','𝖎','𝚒','𝕚'],
['j','𝐣','𝑗','𝒋','𝗃','𝗷','𝘫','𝙟','𝒿','𝓳','𝔧','𝖏','𝚓','𝕛'],
['k','𝐤','𝑘','𝒌','𝗄','𝗸','𝘬','𝙠','𝓀','𝓴','𝔨','𝖐','𝚔','𝕜'],
['l','𝐥','𝑙','𝒍','𝗅','𝗹','𝘭','𝙡','𝓁','𝓵','𝔩','𝖑','𝚕','𝕝'],
['m','𝐦','𝑚','𝒎','𝗆','𝗺','𝘮','𝙢','𝓂','𝓶','𝔪','𝖒','𝚖','𝕞'],
['n','𝐧','𝑛','𝒏','𝗇','𝗻','𝘯','𝙣','𝓃','𝓷','𝔫','𝖓','𝚗','𝕟'],
['o','𝐨','𝑜','𝒐','𝗈','𝗼','𝘰','𝙤','ℴ','𝓸','𝔬','𝖔','𝚘','𝕠'],
['p','𝐩','𝑝','𝒑','𝗉','𝗽','𝘱','𝙥','𝓅','𝓹','𝔭','𝖕','𝚙','𝕡'],
['q','𝐪','𝑞','𝒒','𝗊','𝗾','𝘲','𝙦','𝓆','𝓺','𝔮','𝖖','𝚚','𝕢'],
['r','𝐫','𝑟','𝒓','𝗋','𝗿','𝘳','𝙧','𝓇','𝓻','𝔯','𝖗','𝚛','𝕣'],
['s','𝐬','𝑠','𝒔','𝗌','𝘀','𝘴','𝙨','𝓈','𝓼','𝔰','𝖘','𝚜','𝕤'],
['t','𝐭','𝑡','𝒕','𝗍','𝘁','𝘵','𝙩','𝓉','𝓽','𝔱','𝖙','𝚝','𝕥'],
['u','𝐮','𝑢','𝒖','𝗎','𝘂','𝘶','𝙪','𝓊','𝓾','𝔲','𝖚','𝚞','𝕦'],
['v','𝐯','𝑣','𝒗','𝗏','𝘃','𝘷','𝙫','𝓋','𝓿','𝔳','𝖛','𝚟','𝕧'],
['w','𝐰','𝑤','𝒘','𝗐','𝘄','𝘸','𝙬','𝓌','𝔀','𝔴','𝖜','𝚠','𝕨'],
['x','𝐱','𝑥','𝒙','𝗑','𝘅','𝘹','𝙭','𝓍','𝔁','𝔵','𝖝','𝚡','𝕩'],
['y','𝐲','𝑦','𝒚','𝗒','𝘆','𝘺','𝙮','𝓎','𝔂','𝔶','𝖞','𝚢','𝕪'],
['z','𝐳','𝑧','𝒛','𝗓','𝘇','𝘻','𝙯','𝓏','𝔃','𝔷','𝖟','𝚣','𝕫'],
['A','𝐀','𝐴','𝑨','𝖠','𝗔','𝘈','𝘼','𝒜','𝓐','𝔄','𝕬','𝙰','𝔸'],
['B','𝐁','𝐵','𝑩','𝖡','𝗕','𝘉','𝘽','ℬ','𝓑','𝔅','𝕭','𝙱','𝔹'],
['C','𝐂','𝐶','𝑪','𝖢','𝗖','𝘊','𝘾','𝒞','𝓒','ℭ','𝕮','𝙲','ℂ'],
['D','𝐃','𝐷','𝑫','𝖣','𝗗','𝘋','𝘿','𝒟','𝓓','𝔇','𝕯','𝙳','𝔻'],
['E','𝐄','𝐸','𝑬','𝖤','𝗘','𝘌','𝙀','ℰ','𝓔','𝔈','𝕰','𝙴','𝔼'],
['F','𝐅','𝐹','𝑭','𝖥','𝗙','𝘍','𝙁','ℱ','𝓕','𝔉','𝕱','𝙵','𝔽'],
['G','𝐆','𝐺','𝑮','𝖦','𝗚','𝘎','𝙂','𝒢','𝓖','𝔊','𝕲','𝙶','𝔾'],
['H','𝐇','𝐻','𝑯','𝖧','𝗛','𝘏','𝙃','ℋ','𝓗','ℌ','𝕳','𝙷','ℍ'],
['I','𝐈','𝐼','𝑰','𝖨','𝗜','𝘐','𝙄','ℐ','𝓘','ℑ','𝕴','𝙸','𝕀'],
['J','𝐉','𝐽','𝑱','𝖩','𝗝','𝘑','𝙅','𝒥','𝓙','𝔍','𝕵','𝙹','𝕁'],
['K','𝐊','𝐾','𝑲','𝖪','𝗞','𝘒','𝙆','𝒦','𝓚','𝔎','𝕶','𝙺','𝕂'],
['L','𝐋','𝐿','𝑳','𝖫','𝗟','𝘓','𝙇','ℒ','𝓛','𝔏','𝕷','𝙻','𝕃'],
['M','𝐌','𝑀','𝑴','𝖬','𝗠','𝘔','𝙈','ℳ','𝓜','𝔐','𝕸','𝙼','𝕄'],
['N','𝐍','𝑁','𝑵','𝖭','𝗡','𝘕','𝙉','𝒩','𝓝','𝔑','𝕹','𝙽','ℕ'],
['O','𝐎','𝑂','𝑶','𝖮','𝗢','𝘖','𝙊','𝒪','𝓞','𝔒','𝕺','𝙾','𝕆'],
['P','𝐏','𝑃','𝑷','𝖯','𝗣','𝘗','𝙋','𝒫','𝓟','𝔓','𝕻','𝙿','ℙ'],
['Q','𝐐','𝑄','𝑸','𝖰','𝗤','𝘘','𝙌','𝒬','𝓠','𝔔','𝕼','𝚀','ℚ'],
['R','𝐑','𝑅','𝑹','𝖱','𝗥','𝘙','𝙍','ℛ','𝓡','ℜ','𝕽','𝚁','ℝ'],
['S','𝐒','𝑆','𝑺','𝖲','𝗦','𝘚','𝙎','𝒮','𝓢','𝔖','𝕾','𝚂','𝕊'],
['T','𝐓','𝑇','𝑻','𝖳','𝗧','𝘛','𝙏','𝒯','𝓣','𝔗','𝕿','𝚃','𝕋'],
['U','𝐔','𝑈','𝑼','𝖴','𝗨','𝘜','𝙐','𝒰','𝓤','𝔘','𝖀','𝚄','𝕌'],
['V','𝐕','𝑉','𝑽','𝖵','𝗩','𝘝','𝙑','𝒱','𝓥','𝔙','𝖁','𝚅','𝕍'],
['W','𝐖','𝑊','𝑾','𝖶','𝗪','𝘞','𝙒','𝒲','𝓦','𝔚','𝖂','𝚆','𝕎'],
['X','𝐗','𝑋','𝑿','𝖷','𝗫','𝘟','𝙓','𝒳','𝓧','𝔛','𝖃','𝚇','𝕏'],
['Y','𝐘','𝑌','𝒀','𝖸','𝗬','𝘠','𝙔','𝒴','𝓨','𝔜','𝖄','𝚈','𝕐'],
['Z','𝐙','𝑍','𝒁','𝖹','𝗭','𝘡','𝙕','𝒵','𝓩','ℨ','𝖅','𝚉','ℤ']
];
$replace = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'];
$lookup = [
'serif-normal',
'serif-bold',
'serif-italic',
'serif-bolditalic',
'sans-normal',
'sans-bold',
'sans-italic',
'sans-bolditalic',
'script-normal',
'script-bold',
'fraktur-normal',
'fraktur-bold',
'monospace',
'doublestruck'
];
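    $map_index = array_search($alphabet, $lookup, true);
    if ($map_index === false) {
        return $input; // unknown alphabet name: leave the input untouched
    }
    $split = mb_str_split($input); // mb_str_split() requires PHP 7.4+
    $output = '';
    foreach ($split as $char) {
        // Keep characters that are not in the table (spaces, accented letters, ...)
        $plain = $char;
        foreach ($alphabets as $i => $letter) {
            if ($letter[$map_index] === $char) {
                $plain = $replace[$i];
                break;
            }
        }
        $output .= $plain;
    }
    return $output;
}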
$input = '𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎';
$output = math_symbols_to_plain_text($input, 'serif-italic');
echo $input . PHP_EOL . $output . PHP_EOL;
Yields:
𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Alejandra
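As an alternative to a hand-built lookup table, Unicode compatibility normalization (NFKC) may be worth trying. The following is only a sketch: it assumes the intl extension is installed, and the helper name styled_to_plain is made up for illustration. NFKC folds the Mathematical Alphanumeric Symbols back to their base letters and keeps precomposed accented letters such as Á, but it also folds other compatibility characters (ligatures, full-width forms, superscripts), which may or may not be what you want.

```php
<?php
// Hypothetical helper, assuming ext-intl (the Normalizer class) is available.
function styled_to_plain(string $text): string
{
    // NFKC = compatibility decomposition followed by canonical composition.
    $normalized = Normalizer::normalize($text, Normalizer::FORM_KC);

    // normalize() returns false for invalid UTF-8; fall back to the input.
    return $normalized === false ? $text : $normalized;
}

$encoded = '%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E'
         . '%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E';

echo styled_to_plain(urldecode($encoded)), PHP_EOL; // Alejandra
echo styled_to_plain("\u{C1}lejandra"), PHP_EOL;    // Álejandra
```

Unlike iconv with //TRANSLIT, this leaves áéíóúñç untouched, which matches the expected results listed in the question.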

If I am not mistaken, you are trying to decode a URL-encoded string, so why not use urldecode()?
See the PHP documentation for details.

Related

substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but get funky results. My guess is that this happens because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

According to php documentation str_replace function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.

Looks like the string was not replaced because your input encoding and the file encoding mismatch.
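A minimal sketch of what an encoding mismatch does to str_replace(), assuming the subject is UTF-8 while the search string comes from a source file saved as ISO-8859-1. The variable names are made up, and the characters are written as explicit byte escapes so the example does not depend on how the file itself is saved.

```php
<?php
$utf8Subject  = "\xC3\x98rjan Nilsen"; // "Ørjan Nilsen" encoded as UTF-8
$utf8Needle   = "\xC3\x98";            // "Ø" encoded as UTF-8
$latin1Needle = "\xD8";                // "Ø" encoded as ISO-8859-1

// Same encoding on both sides: the byte sequences match and are replaced.
$matched = str_replace($utf8Needle, 'O', $utf8Subject);

// Different encodings: the byte sequences differ, so nothing is replaced.
$unmatched = str_replace($latin1Needle, 'O', $utf8Subject);

echo $matched, PHP_EOL;                                              // Orjan Nilsen
echo $unmatched === $utf8Subject ? 'unchanged' : 'changed', PHP_EOL; // unchanged
```

str_replace() only compares bytes, so it works on UTF-8 as long as the search strings, the replacements and the subject all use the same encoding.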

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the intl extension (originally a PECL package, bundled with PHP since 5.3). (The algorithm itself isn't very complicated but needs to load a lot of character mappings, as far as I know. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
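The pattern /[^\pL ]/u also removes digits and punctuation. If only the accents should go, a variation is to strip just the combining marks after NFD. This is a sketch under the assumption that the intl extension is available; strip_diacritics is a hypothetical helper name.

```php
<?php
// Hypothetical helper, assuming ext-intl (the Normalizer class) is available.
function strip_diacritics(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // invalid UTF-8: leave the input untouched
    }

    // \p{Mn} matches nonspacing marks such as COMBINING ACUTE ACCENT.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_diacritics("\u{C1}lejandra, 2 caf\u{E9}s!"), PHP_EOL; // Alejandra, 2 cafes!
```

Letters without a canonical decomposition, such as Ø or œ, are left as they are, which is the same behaviour visible in the output above.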

Try this function definition:
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote every search term so that it is matched literally.
        $quoted = array_map(function ($term) {
            return preg_quote($term, '/');
        }, (array)$search);
        $pattern = '/(?:' . implode('|', $quoted) . ')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            // Closures are used because create_function() was removed in PHP 8.0.
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

What is this encoding and how can I encode a string to it in PHP?

I have an input like this:
$input = 'GFL/R&D/50/67289';
I am trying to get to this:
GFL$2fR$26D$2f50$2f67289
So far, the closest I have come is this:
echo filter_var($input, FILTER_SANITIZE_ENCODED, FILTER_FLAG_ENCODE_LOW)
which produces:
GFL%2FR%26D%2F50%2F67289
How can I get from the given input to the desired output and what sort of encoding is the result in?
By the way, please note the case sensitivity going on there. $2f is required rather than $2F.

This will do the trick: url-encode, then lower-case the encoded sequences and swap % for $ with a preg callback (PHP's PCRE doesn't support case-shifting modifiers):
$input = 'GFL/R&D/50/67289';
echo preg_replace_callback('/(%)([0-9A-F]{2})/', function ($m) {
return '$' . strtolower($m[2]);
}, urlencode($input));
output:
GFL$2fR$26D$2f50$2f67289

php url query nested array with no index

I'm working with a third party API that receives several parameters which must be encoded like this:
text[]=Hello%20World&text[]=How%20are%20you?&html[]=<p>Just%20fine,%20thank%20you</p>
As you can see this API can accept multiple parameters for text, and also for HTML (not in the sample call).
I have used http_build_query to correctly build a query string for other APIs
$params['text'][] = 'Hello World';
$params['text'][] = 'How are you?';
$params['html'][] = '<p>Just fine, thank you</p>';
$http_query = http_build_query($params);
The problem with this approach is that it will build a query string with the numeric index:
text[0]=Hello%20World&text[1]=How%20are%20you?&html[0]=<p>Just%20fine,%20thank%20you</p>
unfortunately the API I'm working with doesn't like the numeric index and fails.
Is there any php function/class-method that can help me build a query like this quickly?
Thank you

I don't know a standard way to do it (I think there is no such way), but here's an ugly solution:
Since [] is encoded by http_build_query, you may generate string with indices and then replace them.
preg_replace('/(%5B)\d+(%5D=)/i', '$1$2', http_build_query($params));

I very much agree with the answer by RiaD, but you might run into some problems with this code (sorry I can't just make this a comment due to lack of rep).
First off, as far as I know http_build_query returns an urlencode()'d string, which means you won't have [ and ] but instead you'll have %5B and %5D.
Second, PHP's PCRE engine recognizes the '[' character as the beginning of a character class and not just as a simple '[' (PCRE Meta Characters). This may end up replacing ALL digits from your request with '[]'.
You'll more likely want something like this:
preg_replace('/\%5B\d+\%5D/', '%5B%5D', http_build_query($params));
In this case, escaping the % characters is optional: % has no special meaning in PCRE, so the backslashes are harmless. If you have a string with the actual brackets instead of the escapes, try this:
preg_replace('/\[\d+\]/', '[]', $http_query);
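Putting the pieces together, the approach could look like the sketch below. The helper name build_query_without_indices is hypothetical; http_build_query(), PHP_QUERY_RFC3986 and preg_replace() are standard PHP. It assumes a single level of nesting and that the API accepts the brackets in their percent-encoded form (%5B%5D), which is worth checking against the API in question.

```php
<?php
// Hypothetical helper built on standard PHP functions.
function build_query_without_indices(array $params): string
{
    // PHP_QUERY_RFC3986 encodes spaces as %20 instead of "+".
    $query = http_build_query($params, '', '&', PHP_QUERY_RFC3986);

    // Drop the numeric index: "text%5B0%5D=" becomes "text%5B%5D=".
    // A literal "=" inside a value is encoded as %3D, so values are not touched.
    return preg_replace('/%5B\d+%5D=/', '%5B%5D=', $query);
}

$params = [
    'text' => ['Hello World', 'How are you?'],
    'html' => ['<p>Just fine, thank you</p>'],
];

echo build_query_without_indices($params), PHP_EOL;
// text%5B%5D=Hello%20World&text%5B%5D=How%20are%20you%3F&html%5B%5D=%3Cp%3EJust%20fine%2C%20thank%20you%3C%2Fp%3E
```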

There doesn't seem to be a way to do this with http_build_query. Sorry. On the docs page though, someone has this:
function cr_post($a,$b=0,$c=0){
if (!is_array($a)) return false;
foreach ((array)$a as $k=>$v){
if ($c) $k=$b."[]"; elseif (is_int($k)) $k=$b.$k;
if (is_array($v)||is_object($v)) {
$r[]=cr_post($v,$k,1);continue;
}
$r[]=urlencode($k)."=" .urlencode($v);
}
return implode("&",$r);
}
$params['text'][] = 'Hello World';
$params['text'][] = 'How are you?';
$params['html'][] = '<p>Just fine, thank you</p>';
$str = cr_post($params);
echo $str;
I haven't tested it. If it doesn't work then you're going to have to roll your own. Maybe you can publish a github gist so other people can use it!

Try this:
$params['text'][] = 'Hello World';
$params['text'][] = 'How are you?';
$params['html'][] = '<p>Just fine, thank you</p>';
$pairs = array();
foreach ($params as $key => $values) {
    foreach ($values as $value) {
        // rawurlencode() encodes spaces as %20 and also escapes &, <, > and ?
        $pairs[] = $key . '[]=' . rawurlencode($value);
    }
}
$http_query = implode('&', $pairs); // joining avoids a trailing '&'
echo $http_query;

Replace characters with word in PHP?

Want to replace specific letters in a string to a full word.
I'm using:
function spec2hex($instr) {
for ($i=0; $i<strlen($instr); $i++) {
$char = substr($instr, $i,1);
if ($char == "a"){
$char = "hello";
}
$convString .= "&#".ord($char).";";
}
return $convString;
}
$myString = "adam";
$convertedString = spec2hex($myString);
echo $convertedString;
but that's returning:
hdhm
How do I do this? By the way, this is to replace punctuation with hex characters.
Thanks all.

Use http://php.net/substr_replace
substr_replace($instr, $word, $i,1);

ord() expects only a SINGLE character. You're passing in hello, so ord is doing its thing only on the h:
php > echo ord('hello');
104
php > echo ord('h');
104
So in effect your output is actually
&#104;&#100;&#104;&#109;
which a browser renders as hdhm.

If you want to keep your code as it is, just change $convString .= "&#".ord($char).";";
to $convString .= $char;

If you just want to replace the occurrence of a with hello within the string you pass to the function, why not use PHP's str_replace()?
function spec2hex($instr) {
return str_replace("a","hello",$instr);
}

I assume that you want HTML entities rather than hex characters in place of the punctuation. Be aware that str_replace(), when called with arrays, runs over the string multiple times, and so also replaces the ";" in "&#123;"!
Your posted code is not useful for replacing punctuation.
Use strtr() with an array instead; it doesn't have this drawback of str_replace().
$aReplacements = array(',' => '&#44;', '.' => '&#46;'); //todo: complete the array
$sText = strtr($sText, $aReplacements);
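The difference between the two functions can be seen in a small sketch. The sample text and the two-entry map are made up for illustration; a real map would list every punctuation character to be converted.

```php
<?php
$text = 'a;b&c';
$map  = ['&' => '&#38;', ';' => '&#59;'];

// str_replace() applies the pairs one after another, so the ";" that was
// just inserted as part of "&#38;" is replaced again by the second pair.
$viaStrReplace = str_replace(array_keys($map), array_values($map), $text);

// strtr() scans the string once and never re-examines replaced text.
$viaStrtr = strtr($text, $map);

echo $viaStrReplace, PHP_EOL; // a&#59;b&#38&#59;c
echo $viaStrtr, PHP_EOL;      // a&#59;b&#38;c
```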

php file_put_contents asian character filename encoding

I'm trying to scrape images off of Wikipedia. What good is free licensed media if you can't get it? Original script is here.
If you put this
http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png
in firefox, it will immediately be transformed into
http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png
so that when you save the image, it's saved as 的-bw.png
Simple enough eh? Now how to get php to do that? Just guessing, I tried utf8_decode($fileName) .. but getting the wrong Chinese characters.
$src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$pngData = file_get_contents($src);
$fileName = basename($src);
file_put_contents($fileName, $pngData);
Any help appreciated, as I really have no idea where to go from here.

Have you tried urldecode()?
<?php
$url = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
$parts = explode('/', $url);
$title = $parts[count($parts)-1]; //get last section
$title = urldecode($title);
?>
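Applied to the original snippet, the idea might look like this sketch. The helper name decoded_file_name is hypothetical. It assumes the local filesystem accepts UTF-8 file names, and it uses rawurldecode() rather than urldecode() so that a "+" in the path is not turned into a space.

```php
<?php
// Hypothetical helper: derive a local file name from a percent-encoded URL.
function decoded_file_name(string $url): string
{
    // Take the last path segment while it is still plain ASCII, then decode it.
    $path = (string) parse_url($url, PHP_URL_PATH);
    $name = rawurldecode(basename($path));

    // A decoded %2F or %5C would act as a path separator, so neutralise those.
    return str_replace(['/', '\\'], '_', $name);
}

$src = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
echo decoded_file_name($src), PHP_EOL; // 的-bw.png

// Saving the image would then be (not executed here, it needs network access):
// file_put_contents(decoded_file_name($src), file_get_contents($src));
```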

Squirrelmail contains a nice function in the sources to convert unicode to entities:
<?php
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* (preg_match replaces ereg(), which was removed in PHP 7) */
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // decode three byte unicode characters
    // (a callback replaces the /e modifier, which was removed in PHP 7)
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // decode two byte unicode characters
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
?>
