php file_put_contents asian character filename encoding

I'm trying to get this script to scrape images off Wikipedia. What good is freely licensed media if you can't get it? Original script is here.
If you put this
http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png
into Firefox, it will immediately be transformed into
http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png
so that when you save the image, it's saved as 的-bw.png
Simple enough, eh? Now how do I get PHP to do that? Just guessing, I tried utf8_decode($fileName), but I'm getting the wrong Chinese characters.
$src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$pngData = file_get_contents($src);
$fileName = basename($src);
file_put_contents($fileName, $pngData);
Any help appreciated, as I really have no idea where to go from here.

Have you tried urldecode()?
<?php
$url = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
$parts = explode('/', $url);
$title = $parts[count($parts)-1]; //get last section
$title = urldecode($title);
?>
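Putting the question's code and the urldecode() suggestion together, a minimal sketch could look like this. The helper name local_name_from_url is made up for illustration, and the sketch assumes the URL path is percent-encoded UTF-8 and that the target filesystem accepts UTF-8 filenames. The download itself is left commented out so the snippet runs without network access.

```php
<?php
// Hypothetical helper: derive a local filename from a percent-encoded URL.
function local_name_from_url(string $url): string
{
    // Look only at the path so a query string never ends up in the filename.
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Take the last path segment while it is still pure ASCII.
    $parts = explode('/', $path);
    $last = end($parts);
    // rawurldecode() turns %E7%9A%84 back into the UTF-8 bytes of the character
    // (unlike urldecode(), it leaves a literal "+" alone).
    $name = rawurldecode($last);
    // An encoded slash (%2F) must not turn into a directory separator.
    return str_replace(['/', '\\'], '_', $name);
}

$src = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
$fileName = local_name_from_url($src);
echo $fileName, PHP_EOL;

// Network part, commented out on purpose:
// $pngData = file_get_contents($src);
// if ($pngData !== false) {
//     file_put_contents($fileName, $pngData);
// }
```

No utf8_decode() is needed here: the decoded bytes are already UTF-8, and utf8_decode() would convert them to ISO-8859-1, which cannot represent Chinese characters.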

Squirrelmail contains a nice function in the sources to convert unicode to entities:
<?php
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* preg_* is used here because ereg() and the /e modifier were removed in PHP 7 */
    if (!preg_match('/[\x80-\xFF]/', $string))
        return $string;
    // decode three byte unicode characters
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // decode two byte unicode characters
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
?>

Related

PHP Convert Unicode to text

I am receiving from a form the following urlencoded string %F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E
If I decode it I get the following formatted text: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Is there any way with PHP to get the plain "Alejandra" text from the encoded or decoded string?
I have tried several approaches without success, including
mb_convert_encoding($string, "UTF-16",mb_detect_encoding($string))
iconv('utf-16', 'utf-8', rawurldecode($string))
and every other solution I could find on Stack Overflow.
Edit:
I tried the proposed solution $strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str); but it removes special characters such as áéíóúñç, which we need to keep.
Expected result
input: 𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
output: Alejandra
input: Álejandra
output: Álejandra
Thank you in advance.

urldecode or rawurldecode is sufficient.
$string = "%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E";
$str = urldecode($string);
var_dump($str);
//string(36) "𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎"
Demo: https://3v4l.org/OMQ35
A special debugger gives me: string(36) UTF-8mb4. This means that the string also contains UTF-8 characters that require 4 bytes. The first character is not the ASCII letter A but the Unicode character “𝐴” (U+1D434, MATHEMATICAL ITALIC CAPITAL A).
Note:
If the special UTF-8 characters cause problems, you can try to display the strings as ASCII characters with iconv.
$strAscii = iconv('UTF-8','ASCII//TRANSLIT',$str);
//string(9) "Alejandra"

What you are getting is called a "pseudo-alphabet"; you can see a list of them here: https://qaz.wtf/u/convert.cgi. The one that you appear to be getting can be seen here: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
Basically what you need to do is take the string, split it and use a lookup table to convert it back to regular characters. This implementation is terribly inefficient, but that's because I grabbed the alphabets from the above Wikipedia page and was too lazy to reorganise them.
function math_symbols_to_plain_text($input, $alphabet)
{
$alphabets = [
['a','𝐚','𝑎','𝒂','𝖺','𝗮','𝘢','𝙖','𝒶','𝓪','𝔞','𝖆','𝚊','𝕒'],
['b','𝐛','𝑏','𝒃','𝖻','𝗯','𝘣','𝙗','𝒷','𝓫','𝔟','𝖇','𝚋','𝕓'],
['c','𝐜','𝑐','𝒄','𝖼','𝗰','𝘤','𝙘','𝒸','𝓬','𝔠','𝖈','𝚌','𝕔'],
['d','𝐝','𝑑','𝒅','𝖽','𝗱','𝘥','𝙙','𝒹','𝓭','𝔡','𝖉','𝚍','𝕕'],
['e','𝐞','𝑒','𝒆','𝖾','𝗲','𝘦','𝙚','ℯ','𝓮','𝔢','𝖊','𝚎','𝕖'],
['f','𝐟','𝑓','𝒇','𝖿','𝗳','𝘧','𝙛','𝒻','𝓯','𝔣','𝖋','𝚏','𝕗'],
['g','𝐠','𝑔','𝒈','𝗀','𝗴','𝘨','𝙜','ℊ','𝓰','𝔤','𝖌','𝚐','𝕘'],
['h','𝐡','ℎ','𝒉','𝗁','𝗵','𝘩','𝙝','𝒽','𝓱','𝔥','𝖍','𝚑','𝕙'],
['i','𝐢','𝑖','𝒊','𝗂','𝗶','𝘪','𝙞','𝒾','𝓲','𝔦','𝖎','𝚒','𝕚'],
['j','𝐣','𝑗','𝒋','𝗃','𝗷','𝘫','𝙟','𝒿','𝓳','𝔧','𝖏','𝚓','𝕛'],
['k','𝐤','𝑘','𝒌','𝗄','𝗸','𝘬','𝙠','𝓀','𝓴','𝔨','𝖐','𝚔','𝕜'],
['l','𝐥','𝑙','𝒍','𝗅','𝗹','𝘭','𝙡','𝓁','𝓵','𝔩','𝖑','𝚕','𝕝'],
['m','𝐦','𝑚','𝒎','𝗆','𝗺','𝘮','𝙢','𝓂','𝓶','𝔪','𝖒','𝚖','𝕞'],
['n','𝐧','𝑛','𝒏','𝗇','𝗻','𝘯','𝙣','𝓃','𝓷','𝔫','𝖓','𝚗','𝕟'],
['o','𝐨','𝑜','𝒐','𝗈','𝗼','𝘰','𝙤','ℴ','𝓸','𝔬','𝖔','𝚘','𝕠'],
['p','𝐩','𝑝','𝒑','𝗉','𝗽','𝘱','𝙥','𝓅','𝓹','𝔭','𝖕','𝚙','𝕡'],
['q','𝐪','𝑞','𝒒','𝗊','𝗾','𝘲','𝙦','𝓆','𝓺','𝔮','𝖖','𝚚','𝕢'],
['r','𝐫','𝑟','𝒓','𝗋','𝗿','𝘳','𝙧','𝓇','𝓻','𝔯','𝖗','𝚛','𝕣'],
['s','𝐬','𝑠','𝒔','𝗌','𝘀','𝘴','𝙨','𝓈','𝓼','𝔰','𝖘','𝚜','𝕤'],
['t','𝐭','𝑡','𝒕','𝗍','𝘁','𝘵','𝙩','𝓉','𝓽','𝔱','𝖙','𝚝','𝕥'],
['u','𝐮','𝑢','𝒖','𝗎','𝘂','𝘶','𝙪','𝓊','𝓾','𝔲','𝖚','𝚞','𝕦'],
['v','𝐯','𝑣','𝒗','𝗏','𝘃','𝘷','𝙫','𝓋','𝓿','𝔳','𝖛','𝚟','𝕧'],
['w','𝐰','𝑤','𝒘','𝗐','𝘄','𝘸','𝙬','𝓌','𝔀','𝔴','𝖜','𝚠','𝕨'],
['x','𝐱','𝑥','𝒙','𝗑','𝘅','𝘹','𝙭','𝓍','𝔁','𝔵','𝖝','𝚡','𝕩'],
['y','𝐲','𝑦','𝒚','𝗒','𝘆','𝘺','𝙮','𝓎','𝔂','𝔶','𝖞','𝚢','𝕪'],
['z','𝐳','𝑧','𝒛','𝗓','𝘇','𝘻','𝙯','𝓏','𝔃','𝔷','𝖟','𝚣','𝕫'],
['A','𝐀','𝐴','𝑨','𝖠','𝗔','𝘈','𝘼','𝒜','𝓐','𝔄','𝕬','𝙰','𝔸'],
['B','𝐁','𝐵','𝑩','𝖡','𝗕','𝘉','𝘽','ℬ','𝓑','𝔅','𝕭','𝙱','𝔹'],
['C','𝐂','𝐶','𝑪','𝖢','𝗖','𝘊','𝘾','𝒞','𝓒','ℭ','𝕮','𝙲','ℂ'],
['D','𝐃','𝐷','𝑫','𝖣','𝗗','𝘋','𝘿','𝒟','𝓓','𝔇','𝕯','𝙳','𝔻'],
['E','𝐄','𝐸','𝑬','𝖤','𝗘','𝘌','𝙀','ℰ','𝓔','𝔈','𝕰','𝙴','𝔼'],
['F','𝐅','𝐹','𝑭','𝖥','𝗙','𝘍','𝙁','ℱ','𝓕','𝔉','𝕱','𝙵','𝔽'],
['G','𝐆','𝐺','𝑮','𝖦','𝗚','𝘎','𝙂','𝒢','𝓖','𝔊','𝕲','𝙶','𝔾'],
['H','𝐇','𝐻','𝑯','𝖧','𝗛','𝘏','𝙃','ℋ','𝓗','ℌ','𝕳','𝙷','ℍ'],
['I','𝐈','𝐼','𝑰','𝖨','𝗜','𝘐','𝙄','ℐ','𝓘','ℑ','𝕴','𝙸','𝕀'],
['J','𝐉','𝐽','𝑱','𝖩','𝗝','𝘑','𝙅','𝒥','𝓙','𝔍','𝕵','𝙹','𝕁'],
['K','𝐊','𝐾','𝑲','𝖪','𝗞','𝘒','𝙆','𝒦','𝓚','𝔎','𝕶','𝙺','𝕂'],
['L','𝐋','𝐿','𝑳','𝖫','𝗟','𝘓','𝙇','ℒ','𝓛','𝔏','𝕷','𝙻','𝕃'],
['M','𝐌','𝑀','𝑴','𝖬','𝗠','𝘔','𝙈','ℳ','𝓜','𝔐','𝕸','𝙼','𝕄'],
['N','𝐍','𝑁','𝑵','𝖭','𝗡','𝘕','𝙉','𝒩','𝓝','𝔑','𝕹','𝙽','ℕ'],
['O','𝐎','𝑂','𝑶','𝖮','𝗢','𝘖','𝙊','𝒪','𝓞','𝔒','𝕺','𝙾','𝕆'],
['P','𝐏','𝑃','𝑷','𝖯','𝗣','𝘗','𝙋','𝒫','𝓟','𝔓','𝕻','𝙿','ℙ'],
['Q','𝐐','𝑄','𝑸','𝖰','𝗤','𝘘','𝙌','𝒬','𝓠','𝔔','𝕼','𝚀','ℚ'],
['R','𝐑','𝑅','𝑹','𝖱','𝗥','𝘙','𝙍','ℛ','𝓡','ℜ','𝕽','𝚁','ℝ'],
['S','𝐒','𝑆','𝑺','𝖲','𝗦','𝘚','𝙎','𝒮','𝓢','𝔖','𝕾','𝚂','𝕊'],
['T','𝐓','𝑇','𝑻','𝖳','𝗧','𝘛','𝙏','𝒯','𝓣','𝔗','𝕿','𝚃','𝕋'],
['U','𝐔','𝑈','𝑼','𝖴','𝗨','𝘜','𝙐','𝒰','𝓤','𝔘','𝖀','𝚄','𝕌'],
['V','𝐕','𝑉','𝑽','𝖵','𝗩','𝘝','𝙑','𝒱','𝓥','𝔙','𝖁','𝚅','𝕍'],
['W','𝐖','𝑊','𝑾','𝖶','𝗪','𝘞','𝙒','𝒲','𝓦','𝔚','𝖂','𝚆','𝕎'],
['X','𝐗','𝑋','𝑿','𝖷','𝗫','𝘟','𝙓','𝒳','𝓧','𝔛','𝖃','𝚇','𝕏'],
['Y','𝐘','𝑌','𝒀','𝖸','𝗬','𝘠','𝙔','𝒴','𝓨','𝔜','𝖄','𝚈','𝕐'],
['Z','𝐙','𝑍','𝒁','𝖹','𝗭','𝘡','𝙕','𝒵','𝓩','ℨ','𝖅','𝚉','ℤ']
];
$replace = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'];
$lookup = [
'serif-normal',
'serif-bold',
'serif-italic',
'serif-bolditalic',
'sans-normal',
'sans-bold',
'sans-italic',
'sans-bolditalic',
'script-normal',
'script-bold',
'fraktur-normal',
'fraktur-bold',
'monospace',
'doublestruck'
];
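$map_index = array_search($alphabet, $lookup);
if ($map_index === false) {
return $input; // unknown alphabet name: leave the input untouched
}
$split = mb_str_split($input);
$output = '';
foreach ($split as $char) {
$plain = $char; // characters outside the chosen alphabet (spaces, á, ñ, ...) are kept as-is
foreach ($alphabets as $i => $letter) {
if ($letter[$map_index] === $char) {
$plain = $replace[$i];
break;
}
}
$output .= $plain;
}
return $output;
}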
$input = '𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎';
$output = math_symbols_to_plain_text($input, 'serif-italic');
echo $input . PHP_EOL . $output . PHP_EOL;
Yields:
𝐴𝑙𝑒𝑗𝑎𝑛𝑑𝑟𝑎
Alejandra
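As an alternative to a hand-built lookup table, Unicode compatibility normalization (NFKC) may be worth a try: the Mathematical Alphanumeric Symbols carry compatibility mappings to the plain letters, while precomposed accented letters such as Á are left alone. The sketch below assumes the intl extension is installed; the function name fold_compat_chars is made up for illustration, and it returns the input unchanged when intl is missing.

```php
<?php
// Hypothetical helper: fold "styled" Unicode letters back to plain ones via NFKC.
function fold_compat_chars(string $input): string
{
    if (!class_exists('Normalizer')) {
        return $input; // intl extension not available: nothing we can do here
    }
    $out = Normalizer::normalize($input, Normalizer::FORM_KC);
    // normalize() returns false for invalid UTF-8; fall back to the input.
    return $out === false ? $input : $out;
}

$decoded = urldecode('%F0%9D%90%B4%F0%9D%91%99%F0%9D%91%92%F0%9D%91%97%F0%9D%91%8E%F0%9D%91%9B%F0%9D%91%91%F0%9D%91%9F%F0%9D%91%8E');
echo fold_compat_chars($decoded), PHP_EOL;              // plain letters when intl is loaded
echo fold_compat_chars("\xC3\x81lejandra"), PHP_EOL;    // the accented capital A stays accented
```

Keep in mind that NFKC folds other compatibility characters too (ligatures, superscript digits, full-width forms), which may or may not be what you want for form input.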

If I'm not mistaken, you are trying to decode a URL, so why not use urldecode()?
See the PHP documentation.

PHP str_replace unintentionally removing Chinese characters

I have a PHP script that removes special characters, but unfortunately some Chinese characters are also removed.
<?php
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split('#/\\:*?\"<>|[]\'_+(),{}’! &'), "", $inputString);
return $inputString;
}
$test = '赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
?>
Oddly, the output is 赵然 赵然. The character 景 is removed.
In addition, 陈 and 一 are also removed. What might be the possible cause?

The string you're using as the list of things to replace doesn't work well with the mixed encoding. What I've done is convert this string to UTF-16 and then split it.
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
Which gives...
赵景然赵景然
BTW, str_replace is multibyte-safe. I sort of recognised the poster of this note: http://php.net/manual/en/ref.mbstring.php#109937
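For what it's worth, the root cause seems to be that str_split() splits on bytes, not characters. The curly apostrophe ’ is three bytes in UTF-8 (E2 80 99), so the search list ends up containing the single bytes E2, 80 and 99. 景 is E6 99 AF, 陈 is E9 99 88 and 一 is E4 B8 80, so each of them loses a byte and becomes invalid. A minimal sketch that splits the list into whole characters instead (the helper name utf8_chars is made up for illustration; mb_str_split() would do the same job on PHP 7.4 and later):

```php
<?php
// Hypothetical helper: split a UTF-8 string into whole characters, not bytes.
function utf8_chars(string $s): array
{
    // The /u modifier makes the empty pattern match between code points.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

function removeSpecialCharactersFromString(string $inputString): string
{
    $special = utf8_chars('#/\\:*?"<>|[]\'_+(),{}’! &');
    return str_replace($special, '', $inputString);
}

// Show the overlapping byte 0x99 that caused the damage.
echo bin2hex('’'), ' ', bin2hex('景'), PHP_EOL;   // e28099 e699af
echo removeSpecialCharactersFromString('#赵景然 赵景然'), PHP_EOL;
```

The space is removed as well because it is part of the list of special characters.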

substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but get funky results, my guess being that it's because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

According to the PHP documentation, the str_replace function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss, as long as the search strings and the subject use the same encoding.

Looks like the string was not replaced because your input encoding and the file encoding mismatch.
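To illustrate the encoding mismatch, here is a small sketch that spells out the bytes explicitly so it does not depend on how the source file is saved. It assumes the two encodings involved are UTF-8 and ISO-8859-1; the variable names are only for illustration.

```php
<?php
// The letter O with stroke is the two bytes C3 98 in UTF-8,
// but the single byte D8 in ISO-8859-1.
$searchUtf8  = "\xC3\x98";            // what a UTF-8 source file contains for that letter
$utf8Input   = "\xC3\x98rjan Nilsen"; // subject in UTF-8
$latin1Input = "\xD8rjan Nilsen";     // same text in ISO-8859-1

// Same encoding on both sides: the byte sequences match.
$ok = str_replace($searchUtf8, 'O', $utf8Input);

// Different encodings: nothing matches, the subject is returned unchanged.
$bad = str_replace($searchUtf8, 'O', $latin1Input);

// Converting the subject to UTF-8 first makes the replacement work again.
$fixed = str_replace($searchUtf8, 'O', iconv('ISO-8859-1', 'UTF-8', $latin1Input));

echo $ok, PHP_EOL;              // Orjan Nilsen
echo bin2hex($bad), PHP_EOL;    // still starts with d8
echo $fixed, PHP_EOL;           // Orjan Nilsen
```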

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the intl extension (originally a PECL package, bundled with PHP since 5.3). (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
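The pattern above also removes digits and punctuation, since they are not letters. A variant that strips only the combining marks could look like the sketch below. It assumes the intl extension is available (and returns the input unchanged otherwise); the function name strip_diacritics is made up for illustration.

```php
<?php
// Hypothetical helper: remove diacritics but keep digits, punctuation and spaces.
function strip_diacritics(string $text): string
{
    if (!class_exists('Normalizer')) {
        return $text; // intl extension not available
    }
    // NFD splits an accented letter into base letter + combining mark(s).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // invalid UTF-8
    }
    // \p{Mn} matches non-spacing marks such as COMBINING DIAERESIS.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    // Recompose whatever is left so the result is in NFC again.
    return (string) Normalizer::normalize((string) $stripped, Normalizer::FORM_C);
}

echo strip_diacritics("Sch\xC3\xB6n 123, caf\xC3\xA9!"), PHP_EOL;
```

As the output above shows for ª, œ, º, ø and Ø, letters without a canonical decomposition are not affected by this technique, so they still need a small replacement table.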

Try this function definition:
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Closures replace create_function(), which was removed in PHP 8.
        // Quote each whole search string and join them into one UTF-8 aware alternation.
        $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, (array)$search);
        $pattern = '/(?:' . implode('|', $quoted) . ')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

convert arabic string to utf8 encoded url

Let's assume that I have the following string:
إصلاح إصلاح
and I want to convert it to an SEO-friendly URL, removing slashes and special characters, with the following function calls:
$title = trim(strtolower($str));
$title = preg_replace('#[^a-z0-9\s-]#',null, $title);
$title = preg_replace('#[\s-]+#','-', $title);
In English it works fine and gives correct results, but in Arabic it gives the following result:
15731589160415751581-15731589160415751581
Thanks in advance

I'd suggest urlencode() with unique post id, like
/blog/12345-<?= urlencode('إصلاح إصلاح') ?>

This is still an unsolved problem. What you basically have to do is transliterate any given character (regardless of whether it is Arabic, Chinese, Japanese or anything else) into a Latin transcription and then apply the URI generation methods to it.
There is some basic(!) support in iconv for this, have a look at http://ch.php.net/manual/de/function.iconv.php; you have to use iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $text), but as I said, support is limited.
If I were you I would just remove spaces and such and then call urlencode() on it:
$url = urlencode(mb_ereg_replace('\s+', '-', $url));
I'm using mb_ereg_replace() because it is Unicode-aware and thus replaces Unicode whitespace as well.

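The Unicode script property for Arabic letters is \p{Arabic}. Change the second preg_replace to the following (the u modifier is needed so that the pattern and subject are treated as UTF-8):
$title = preg_replace('#[^\p{Arabic}\s-]#u', '', $title);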
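A side note on the result in the question: 1573, 1589, 1604, 1575 and 1581 are the decimal code points of the five letters of إصلاح, which suggests (this is a guess, not something the question confirms) that the string arrived as numeric HTML entities like &#1573; and the regex then stripped everything except the digits. A minimal sketch that decodes entities first and keeps Arabic as well as Latin letters; the function name arabic_slug is made up for illustration and the mbstring extension is assumed:

```php
<?php
// Hypothetical helper: slug that keeps Arabic letters, a-z and digits.
function arabic_slug(string $str): string
{
    // Assumption: the input may contain numeric entities such as &#1573;
    $title = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // mb_strtolower() is safe for UTF-8; Arabic has no letter case anyway.
    $title = mb_strtolower(trim($title), 'UTF-8');
    // Drop everything that is not Arabic script, a-z, 0-9, whitespace or a dash.
    $title = preg_replace('#[^\p{Arabic}a-z0-9\s-]#u', '', $title);
    // Collapse runs of whitespace and dashes into a single dash.
    $title = preg_replace('#[\s-]+#u', '-', $title);
    return trim($title, '-');
}

echo arabic_slug('&#1573;&#1589;&#1604;&#1575;&#1581; &#1573;&#1589;&#1604;&#1575;&#1581;'), PHP_EOL;
echo arabic_slug('Hello, World!'), PHP_EOL;   // hello-world
```

When the slug is placed in a link, run it through rawurlencode() so the non-ASCII bytes are percent-encoded.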

Try This function. I always use it and it works perfectly!
function SafeUrl3($str) {
$friendlyURL = htmlentities($str, ENT_COMPAT, "UTF-8", false) ;
$friendlyURL = preg_replace ( "/[^أ-يa-zA-Z0-9_.-]/u", "-", $friendlyURL ) ;
$friendlyURL = html_entity_decode($friendlyURL,ENT_COMPAT, "UTF-8") ;
$friendlyURL = trim($friendlyURL, '-') ;
return $friendlyURL ;
}

How to convert a nice page title into a valid URL string?

Imagine a page title string in any given language (English, Arabic, Japanese, etc.) containing several words in UTF-8. Example:
$stringRAW = "Blues & μπλουζ Bliss's ブルース Schön";
Now this needs to be converted into something that is a valid portion of that page's URL:
$stringURL = "blues-μπλουζ-bliss-ブルース-schön";
Just check out this link
This works on my server too!
Q1. Which characters are allowed in a valid URL these days? I remember having seen whole Arabic strings sitting in the browser's address bar, and I tested it on my Apache 2 and it all worked fine.
I guess it must become: $stringURL = "blues-blows-bliss-black"
Q2. Which existing PHP functions do you know that encode/convert these UTF-8 strings correctly for a URL, stripping them of any invalid characters?
I guess that at least:
1. spaces should be converted into dashes (-)
2. invalid characters should be deleted. Which are they? # and '&'?
3. all letters should be converted to lower case (or are capital letters valid in URLs?)
Thanks, your suggestions are much appreciated!

This is the solution I use:
$text = 'Nevalidní Český text';
$text = preg_replace('/[^\\pL0-9]+/u', '-', $text);
$text = trim($text, "-");
$text = iconv("utf-8", "us-ascii//TRANSLIT", $text);
$text = preg_replace('/[^-a-z0-9]+/i', '', $text);
Capitals in URLs are not a problem, but if you want the text to be lowercase then simply add $text = strtolower($text); at the end :-).
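If the non-Latin words should stay in the slug, as in the question's expected result, the transliteration step can be left out. A minimal sketch, assuming the mbstring extension is available; the function name unicode_slug is made up for illustration:

```php
<?php
// Hypothetical helper: slug that keeps letters and digits of any script.
function unicode_slug(string $raw): string
{
    $slug = mb_strtolower($raw, 'UTF-8');
    // Any run of characters that are not letters, combining marks or digits
    // becomes a single dash.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug);
    return trim($slug, '-');
}

$stringRAW = "Blues & μπλουζ Bliss's ブルース Schön";
$slug = unicode_slug($stringRAW);
echo $slug, PHP_EOL;                // blues-μπλουζ-bliss-s-ブルース-schön
echo rawurlencode($slug), PHP_EOL;  // percent-encoded form for use in a link
```

Note that the possessive in "Bliss's" comes out as "bliss-s" here, not "bliss"; handling that would need an extra rule of its own.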

I would use:
$stringURL = str_replace(' ', '-', $stringURL); // Converts spaces to dashes
$stringURL = urlencode($stringURL);

$stringURL = preg_replace('~[^a-z-]~', '', str_replace(' ', '-', strtolower($stringRAW))); // keep the dashes that replaced the spaces
Check this method: http://www.whatstyle.net/articles/52/generate_unique_slugs_in_cakephp

Pick the title of your web page:
$title = "mytitle#$3%#$5345";
Simply urlencode it:
$url = urlencode($title);
You don't need to worry about small details, but to identify the request it is best to put a unique id prefix in the URL, such as /389894/sdojfsodjf; during routing you can use the id 389894 to look up the topic sdojfsodjf.

Here is a short & handy one that does the trick for me
$title = trim(strtolower($title)); // lower string, removes white spaces and linebreaks at the start/end
$title = preg_replace('#[^a-z0-9\s-]#',null, $title); // remove all unwanted chars
$title = preg_replace('#[\s-]+#','-', $title); // replace white spaces and - with - (otherwise you end up with ---)
And of course you need to handle umlauts, currency signs and so forth, depending on the possible input.
