I am trying to create directories with Unicode names on Windows. The names display correctly in the browser, but when the directory is created the name turns into garbage text.
I have tried encoding conversions and removing special characters.
$myfile = fopen("unicode.csv", "r") or die("Unable to open file!");
$lines = file("unicode.csv", FILE_IGNORE_NEW_LINES);
echo '<table border="1">';
foreach ($lines as $k => $v) {
    // Columns are tab-separated; the first column holds the directory name.
    $parts = preg_split('/[\t]/', $v);
    echo '<tr>';
    foreach ($parts as $key => $val) {
        if ($key == 0) {
            // Strip path separators from the name.
            $dir = str_replace("/", "", $val);
            $dir = str_replace("\\", "", $dir);
            $encode = mb_detect_encoding($dir, mb_detect_order(), false);
            $dir = mb_convert_encoding($dir, 'UTF-8', 'UTF-8');
            echo '<td>'.$dir.'</td><td>'.$encode.'</td>';
            // The mode must be an octal integer, not the string "0777".
            $result = mkdir($dir, 0777);
        }
        echo '<td>'.$val.'</td>';
    }
    echo '</tr>';
}
echo '</table>';
I expected the directory name to be readable UTF-8 text.
Instead it comes out as garbage text.
Thanks to @eryksun:
Based on your results, it looks like PHP mkdir does not transcode from UTF-8 to native Windows UTF-16LE in order to call [W]ide-character CreateDirectoryW. It probably just calls C mkdir. This naively passes bytes to CreateDirectoryA, which decodes the UTF-8 name using the system [A]NSI encoding (e.g. codepage 1252). Starting with Windows 10, we can set [A]NSI to UTF-8 in the system locale configuration. This change requires a reboot.
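A minimal sketch of the two options this explanation suggests. It assumes PHP 7.1 or later, where, as far as I know, UTF-8 paths are handed to the wide-character Windows APIs, so a plain mkdir() with a UTF-8 name should work. On older versions the name would have to be converted to the system ANSI codepage first. The helper names cleanDirName and toAnsiName, and the default codepage Windows-1252, are assumptions for illustration only.

```php
<?php
// Hypothetical helper: strip path separators and reject names
// that are empty or not valid UTF-8.
function cleanDirName(string $raw): ?string
{
    $name = trim(str_replace(['/', '\\'], '', $raw));
    if ($name === '' || !mb_check_encoding($name, 'UTF-8')) {
        return null;
    }
    return $name;
}

// Hypothetical helper for older PHP on Windows: transcode to the ANSI
// codepage (assumed here to be Windows-1252). Characters that do not
// exist in that codepage cannot be represented and are lost.
function toAnsiName(string $utf8Name, string $codepage = 'Windows-1252'): string
{
    return mb_convert_encoding($utf8Name, $codepage, 'UTF-8');
}

// Usage (the mode is an octal integer; it is ignored on Windows):
// $name = cleanDirName($val);
// if ($name !== null && !is_dir($name)) { mkdir($name, 0777); }
```

Names containing characters outside the ANSI codepage cannot be created through the conversion route at all, which is why switching the system ANSI codepage to UTF-8 or upgrading PHP is the more general fix.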
I have some JSON data that I need to email to my users as a .csv file.
The following code works as expected for Hällowörld, however, once I put a space in there, Hällo wörld, the .csv file reads ÔªøH√§llo w√∂rld when opened in Excel (for Mac).
$temp = fopen('php://temp/maxmemory:10485760', 'w');
$rows = json_decode('[["Hällowörld"]]'); // -> Hällowörld
//$rows = json_decode('[["Hällo wörld"]]'); // -> ÔªøH√§llo w√∂rld
foreach ($rows as $row) {
    $row = array_map(function ($cell) {
        // Prepend the UTF-8 byte order mark (EF BB BF) to every cell.
        return chr(239).chr(187).chr(191).$cell;
    }, $row);
    fputcsv($temp, $row, ';');
}
rewind($temp);
$csv = stream_get_contents($temp);
fclose($temp);
$csv = base64_encode($csv);
// -> post $csv to my email provider's API
A few notes:
My code is in UTF-8.
If I open the file with Apple's Numbers or TextEdit, the content is displayed as expected.
If I don't do the mapping with chr(239).chr(187).chr(191).$cell, I get Hällowörld.
If instead, I use mb_convert_encoding($cell, 'UTF-16LE', 'UTF-8') or mb_convert_encoding($cell, 'Windows-1252', 'UTF-8'), as is often suggested, I get H‰llowˆrld.
The final base64_encode() is necessary because my email provider needs the attachment to be base64-encoded.
I found the solution! :) Replace the above foreach loop with the following:
// Add a first row that contains only the UTF-8 byte order mark.
array_unshift($rows, [chr(239).chr(187).chr(191)]);
foreach ($rows as $row) {
    fputcsv($temp, $row, ',');
}
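A variation on the same idea, as a sketch only: write the byte order mark once with fwrite() before any rows, so it is never passed through fputcsv() and never ends up inside a quoted cell. The function name rowsToCsvWithBom is hypothetical, and the enclosure and escape arguments are spelled out explicitly as an assumption about what suits recent PHP versions.

```php
<?php
// Hypothetical helper: build a CSV string that starts with exactly one UTF-8 BOM.
function rowsToCsvWithBom(array $rows, string $delimiter = ';'): string
{
    $handle = fopen('php://temp', 'w+');
    // Write the BOM (EF BB BF) once, at the very start of the stream.
    fwrite($handle, "\xEF\xBB\xBF");
    foreach ($rows as $row) {
        fputcsv($handle, $row, $delimiter, '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

// Usage:
// $csv = base64_encode(rowsToCsvWithBom(json_decode('[["Hällo wörld"]]')));
```

Because the BOM is written before the first field, it makes no difference whether fputcsv() decides to quote a cell that contains a space.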
I am using PHP in the macOS terminal to read a file that was generated on Windows.
I confirmed that the file is UTF-16LE encoded:
$file --mime myfile.ini
myfile.ini: text/plain; charset=utf-16le
Now I convert it to UTF-8 with this script.
while ($line = fgets($handle)) {
    $line = rtrim($line);
    $line = mb_convert_encoding($line, "UTF-8", "UTF-16LE");
    var_dump($line);
}
But the output is corrupted, like this:
string(63) "䘀爀漀洀䐀愀琀攀㴀㈀ ⸀ ⸀ ഀ"
How can I get the correct encoding?
When I don't use mb_convert_encoding:
while ($line = fgets($handle)) {
    $line = rtrim($line);
    var_dump($line);
    if (preg_match('/Optimization/', $line)) { print "hit"; }
}
var_dump shows a strange result. Why 28?
string(28) "Optimization=0"
And preg_match doesn't hit either.
You could try doing this:
while ($line = fgets($handle)) {
    $line = rtrim($line);
    // Guess the source encoding, then convert to UTF-8.
    $line = iconv(mb_detect_encoding($line, mb_detect_order(), true), "UTF-8", $line);
    var_dump($line);
}
fgets() cannot reliably detect line endings if the stream isn't in an ASCII-compatible encoding. Similarly, when rtrim() looks for e.g. \n ('LINE FEED (LF)' (U+000A)), it expects a single 0x0A byte, but in UTF-16LE that character is encoded as the two bytes 0x0A 0x00. Bad things can happen.
I suggest you read the file in chunks that are a multiple of 4 bytes, so you won't split individual characters, and forget about line endings until you've successfully re-encoded the file. Use fread() for this, because fgets() still stops at every 0x0A byte:
$output = '';
while (!feof($handle)) {
    $chunk = fread($handle, 4 * 4096);
    if ($chunk === false || $chunk === '') {
        break;
    }
    $output .= mb_convert_encoding($chunk, "UTF-8", "UTF-16LE");
}
var_dump(bin2hex($output));
Ideally, save output to a file so you can use a text editor or hexadecimal editor to inspect the result.
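The advice above can also be followed by converting the whole content first and only then splitting it into lines. A minimal sketch, assuming the file is small enough to fit in memory; the function name utf16leToLines is hypothetical.

```php
<?php
// Hypothetical helper: decode UTF-16LE bytes and return the lines as UTF-8 strings.
function utf16leToLines(string $bytes): array
{
    // Drop a leading UTF-16LE byte order mark (FF FE), if present.
    if (substr($bytes, 0, 2) === "\xFF\xFE") {
        $bytes = substr($bytes, 2);
    }
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
    // \R matches any line break, including \r\n as one unit.
    // PREG_SPLIT_NO_EMPTY drops empty lines.
    return preg_split('/\R/u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
}

// Usage (file name taken from the question):
// foreach (utf16leToLines(file_get_contents('myfile.ini')) as $line) {
//     if (preg_match('/Optimization/', $line)) { print "hit"; }
// }
```

Once the text is UTF-8, line splitting, rtrim() and preg_match() all see the bytes they expect.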
In the end I used UTF-16BE instead of UTF-16LE, and it shows the correct strings.
My problem is solved.
$line = mb_convert_encoding($line,"UTF-8","UTF-16BE");
However, I don't know why it works.
The file command still says this file is UTF-16LE:
$file --mime myfile.ini
myfile.ini: text/plain; charset=utf-16le
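One plausible explanation, sketched here under the assumption that the file uses \r\n line endings: fgets() cuts right after the 0x0A byte, which is only the first half of the two-byte UTF-16LE line feed (0A 00). The leftover 00 byte becomes the start of the next line and shifts it by one byte, so after rtrim() the remaining bytes happen to be valid UTF-16BE. The sample content is made up for illustration.

```php
<?php
// Two lines of sample text, encoded as UTF-16LE (no byte order mark).
$utf16 = mb_convert_encoding("A=1\r\nOptimization=0\r\n", 'UTF-16LE', 'UTF-8');

// Emulate the first fgets() call: it returns everything up to and
// including the first 0x0A byte.
$rest = substr($utf16, strpos($utf16, "\n") + 1);

// Emulate the second fgets() call followed by rtrim().
// rtrim() also strips NUL bytes, so the trailing 00 0D 00 0A disappears.
$second = rtrim(substr($rest, 0, strpos($rest, "\n") + 1));

echo strlen($second), "\n";                                     // 28
echo bin2hex(substr($second, 0, 4)), "\n";                      // 004f0070
echo mb_convert_encoding($second, 'UTF-8', 'UTF-16BE'), "\n";   // Optimization=0
```

This also accounts for the string(28) seen earlier: 14 characters at 2 bytes each. The first line of the file is not shifted, so it would still decode incorrectly as UTF-16BE, which makes this a coincidence rather than a real fix.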
Hi everyone, when I run this code to write to a file:
$fileTXT = 'prodotti.txt';
// H_i_s = hours_minutes_seconds ("m" would be the month, not the minutes).
$newfileTXT = 'prodotti_2'.date("d-m-Y_H_i_s").'.txt';
if (!copy($fileTXT, $newfileTXT)) {
    echo "Cannot continue: unable to create the TXT file.";
    exit;
}
$towriteinfile = "";
$fp = fopen($path . $filename, "r") or die("Couldn't open $filename");
$fpTXT = fopen($newfileTXT, 'w') or die("Couldn't open $newfileTXT");
while (!feof($fp)) {
    $line = fgets($fp, 1024);
    $arr = explode("\t", $line);
    // Turn the image name in column 7 into an <img> tag.
    $arr[7] = '<img src="http://link/imgHigh/' . $arr[7] . '.jpg" />;';
    echo "Product: ".$arr[4]."<br>";
    foreach ($arr as $fields) {
        fwrite($fpTXT, $fields.";");
    }
    fwrite($fpTXT, "\n");
}
fclose($fpTXT);
fclose($fp);
I get this result in the txt file:
175;13563;desc;01;category;..............c etc etc.....
mercato.㰻浩牳㵣栢瑴㩰⼯睷獯畣慬楴挮浯椯⽴慣⽴浩䡧杩⽨ ⸀ ⸀砀砀 漀欀ഀ樮杰•㸯㬻
The HTML code for the image is written as Chinese characters. Why?
Do you want to append content from $filename to the end of $newfileTXT?
If so, you should change:
$fpTXT = fopen($newfileTXT, 'w') or die("Couldn't open $newfileTXT");
to
$fpTXT = fopen($newfileTXT, 'a') or die("Couldn't open $newfileTXT");
The file is probably being interpreted as Unicode (probably UTF-8). In Unicode encodings, a character can consist of multiple bytes. When you read the file, you read at most 1024 bytes at a time, which can leave half a character at the end of one chunk and the other half at the start of the next. When you then insert new characters in between, you get different byte sequences, and the text turns into a complete mess.
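To illustrate how a fixed byte count can split a multi-byte character, here is a small sketch that assumes UTF-8 input. It contrasts substr(), which cuts at a raw byte offset, with mb_strcut(), which backs up to a character boundary. The sample string is made up.

```php
<?php
// 'a' is 1 byte and 'é' is 2 bytes (C3 A9) in UTF-8.
$text = "a\xC3\xA9";

// Cutting after 2 bytes ends in the middle of 'é'.
$byteCut = substr($text, 0, 2);

// mb_strcut() also works with byte lengths, but never splits a character.
$safeCut = mb_strcut($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump($safeCut);                             // string(1) "a"
```

The same concern applies to any fixed-size read: the chunk boundary has to fall between characters, or the pieces have to be joined again before anything is inserted.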
I have resolved the problem by passing every line through this function:
function cleanString($string) {
    // Remove control characters (0x00-0x1F) and all non-ASCII bytes (0x80-0xFF).
    $string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
    return $string;
}
My old string contained binary characters; I cleaned the string and now everything works.
I am using a script to send a "$filename" variable from Flash to PHP in order to create an XML file. The problem is that when I type Greek characters as the filename, the file on the server gets a name like this, for example: (δσωδσαωςεωςεβ.qxml)
I do not have any problem a) when writing English characters, or b) when writing Greek character data into the XML file.
I am using the file_put_contents function.
If, instead of using the POST variable as the filename, I set my own filename such as "Ελληνικά.qxml", it works without a problem.
Thanks a lot in advance.
$string = $_POST['xmldata'];
$filename = $_POST['filename'];
$path = "test/";
//$dir_handle = @opendir($path) or mkdir("{$path}", 0777, true);
file_put_contents($path."/".$filename."", $string);
This problem was solved, but another arose. When I try to open the file from Flash, it no longer recognises it because the name is in Greek.
The problem is that Flash is sending the data in a different encoding. From the comments in the PHP manual for mb_convert_encoding, I can see that you should use the following to get it to work (tested on Danish characters, not Greek):
<?php
$string = isset($_POST['xmldata'])?$_POST['xmldata']:"";
$filename = isset($_POST['filename'])?$_POST['filename']:"";
//tested on danish chars
/*
$string = mb_convert_encoding($string, "ISO-8859-1", "UTF-8");
$filename = mb_convert_encoding($filename, "ISO-8859-1", "UTF-8");
*/
//tested on greek chars
$string = mb_convert_encoding($string, "ISO-8859-7", "UTF-8");
$filename = mb_convert_encoding($filename, "ISO-8859-7", "UTF-8");
$path = "test/";
//$dir_handle = @opendir($path) or mkdir("{$path}", 0777, true);
file_put_contents($path."/".$filename."", $string);
?>
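A hedged sketch of the same conversion wrapped in a helper. The function name toFilesystemName is hypothetical, ISO-8859-7 is only the encoding this answer assumes the server's filesystem uses, and stripping path separators from a client-supplied name is an extra precaution added for illustration, not something the answer above does.

```php
<?php
// Hypothetical helper: convert a UTF-8 name sent by the client to the
// encoding the server's filesystem is assumed to use.
function toFilesystemName(string $utf8Name, string $fsEncoding = 'ISO-8859-7'): string
{
    // Remove path separators and NUL bytes so the name stays a single
    // path component.
    $name = str_replace(['/', '\\', "\0"], '', $utf8Name);
    return mb_convert_encoding($name, $fsEncoding, 'UTF-8');
}

// Usage:
// $filename = isset($_POST['filename']) ? $_POST['filename'] : "";
// file_put_contents('test/' . toFilesystemName($filename), $string);
```

Whatever reads the file later has to use the same encoding for the name, which is presumably why Flash stopped recognising the file after the change.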
Using PHP, I'm writing content to a .htaccess file with fwrite. This all works correctly, but when I view the .htaccess in Vim afterwards, it displays ^M at the end of each line that has been added. This doesn't seem to cause any issues, but I'm unsure what is causing it and whether it can be prevented.
This is the PHP:
$replaceWith = "#SO redirect_301\n".trim($_POST['redirect_301'])."\n#EO redirect_301";
$filename = SITE_ROOT.'/public_html/.htaccess';
$handle = fopen($filename, 'r');
$contents = fread($handle, filesize($filename));
fclose($handle);
// Find the existing block between the #SO and #EO markers.
if (preg_match('/#SO redirect_301(.*?)#EO redirect_301/si', $contents, $regs)) {
    $result = $regs[0];
}
$newcontents = str_replace($result, $replaceWith, $contents);
$handle = fopen($filename, 'w');
if (fwrite($handle, $newcontents) === FALSE) {
    // Write failed (not handled here).
}
fclose($handle);
When I check in Vim afterwards, I see something like this:
#SO redirect_301
Redirect 301 /from1 http://www.domain.com/to1^M
Redirect 301 /from2 http://www.domain.com/to2^M
Redirect 301 /from3 http://www.domain.com/to3
#EO redirect_301
The server is running CentOS and I'm working locally on a Mac.
Your newlines are coming in as \r\n, not as \n.
Before writing to the file, you should normalize the line endings in the input:
$input = trim($_POST['redirect_301']);
$input = preg_replace('/\r\n/', "\n", $input); // DOS style newlines
$input = preg_replace('/\r/', "\n", $input); // Mac newlines for nostalgia
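The same normalization can be done without regular expressions. A minimal sketch; the function name normalizeNewlines is hypothetical.

```php
<?php
// Hypothetical helper: convert \r\n and lone \r line endings to \n.
function normalizeNewlines(string $input): string
{
    // Order matters: replace \r\n first, otherwise each \r\n would
    // turn into two newlines.
    return str_replace(["\r\n", "\r"], "\n", $input);
}

echo bin2hex(normalizeNewlines("a\r\nb\rc")), "\n"; // 610a620a63

// Usage:
// $input = normalizeNewlines(trim($_POST['redirect_301']));
```

Browsers submit textarea content with \r\n line endings, so the ^M characters appear regardless of which operating system the form was filled in on.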