utf-16le to UTF-8 - php

I am using PHP in the OS X terminal to open a file generated on Windows.
I confirmed that the file is UTF-16LE encoded:
$ file --mime myfile.ini
myfile.ini: text/plain; charset=utf-16le
Now I convert it to UTF-8 with this script:
while ($line = fgets($handle)) {
$line = rtrim($line);
$line = mb_convert_encoding($line,"UTF-8","UTF-16LE");
var_dump($line);
}
Somehow the output is corrupted, like this:
string(63) "䘀爀漀洀䐀愀琀攀㴀㈀ ㄀㄀⸀ ㄀⸀ ㄀ഀ਀"
How can I get the correct encoding?
When I don't use mb_convert_encoding:
while ($line = fgets($handle)) {
$line = rtrim($line);
var_dump($line);
if (preg_match('/Optimization/',$line)){print "hit";}
}
var_dump shows a strange result. Why is the length 28?
string(28) "Optimization=0"
And preg_match also doesn't hit.

You could try doing this:
while ($line = fgets($handle)) {
$line = rtrim($line);
$line = iconv(mb_detect_encoding($line, mb_detect_order(), true), "UTF-8", $line);
var_dump($line);
}

fgets() cannot detect line endings reliably if the stream isn't in an ASCII-compatible encoding. Similarly, when rtrim() looks for e.g. \n ('LINE FEED (LF)' (U+000A)) it expects a single 0x0A byte, but in UTF-16LE the character is encoded as the two bytes 0x0A 0x00. Bad things can happen.
I suggest you read the file in chunks that are a multiple of 4 bytes, so you won't split individual characters, and forget about line endings until you've successfully re-encoded the file:
$output = '';
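while (!feof($handle)) {
// fread() returns fixed-size chunks; fgets() would still stop at every 0x0A byte
$chunk = fread($handle, 4 * 4096);
$output .= mb_convert_encoding($chunk, "UTF-8", "UTF-16LE");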
}
var_dump(bin2hex($output));
Ideally, save output to a file so you can use a text editor or hexadecimal editor to inspect the result.
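One way to follow this advice is to decode the whole file first and split it into lines only afterwards. This is a minimal sketch, assuming the file fits in memory; the helper name readUtf16leLines is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: decode a whole UTF-16LE file, then split it into lines.
function readUtf16leLines(string $path): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Drop the UTF-16LE byte order mark (FF FE) if present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
    // Only after conversion are line endings plain \r\n, \n or \r bytes.
    return preg_split('/\r\n|\n|\r/', rtrim($utf8, "\r\n"));
}
```

Looping over readUtf16leLines('myfile.ini') and running the preg_match check on each element should then work on ordinary UTF-8 strings.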

In the end I used UTF-16BE instead of UTF-16LE, and it shows the correct strings.
My problem was solved.
$line = mb_convert_encoding($line,"UTF-8","UTF-16BE");
However, I don't know why it works,
even though the file command says this file is utf-16le:
$ file --mime myfile.ini
myfile.ini: text/plain; charset=utf-16le
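A likely explanation, assuming the file uses \r\n line endings and is read with fgets() as in the question: fgets() stops right after the 0x0A byte, which is only the first half of the UTF-16LE line feed (0x0A 0x00). The leftover 0x00 becomes the first byte of the next line, so every following line is shifted by one byte and happens to decode as UTF-16BE. The sketch below reproduces this with sample data in php://memory instead of the real file.

```php
<?php
// Two lines of UTF-16LE text with \r\n endings (sample data, no BOM for brevity).
$bytes = mb_convert_encoding("FromDate=1\r\nOptimization=0\r\n", 'UTF-16LE', 'UTF-8');

$handle = fopen('php://memory', 'r+');
fwrite($handle, $bytes);
rewind($handle);

$first  = fgets($handle); // ends at 0x0A and leaves the line feed's 0x00 behind
$second = fgets($handle); // therefore starts with that stray 0x00
fclose($handle);

echo bin2hex(substr($second, 0, 4)), "\n"; // 004f0070, i.e. "Op" in big-endian order

$second = rtrim($second); // strips the trailing 0x00 0x0D 0x00 0x0A bytes
echo strlen($second), "\n";                                    // 28
echo mb_convert_encoding($second, 'UTF-8', 'UTF-16BE'), "\n";  // Optimization=0
```

This also accounts for the string(28) result in the question: 14 characters at two bytes each. The first line of the file is not shifted, so it would still need UTF-16LE.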

Related

Unable to create directory in Windows using PHP and UTF-8

I am trying to create some directories with Unicode names in Windows. The names display correctly in the browser, but when the directory is created the name is converted into garbage text.
I have tried encoding conversions and removing special characters.
$myfile = fopen("unicode.csv", "r") or die("Unable to open file!");
$lines = file("unicode.csv", FILE_IGNORE_NEW_LINES);
echo '<table border="1">';
foreach($lines as $k=>$v){
$parts = preg_split('/[\t]/', $v);
echo '<tr>';
foreach($parts as $key=>$val){
if($key==0){
$dir = str_replace("/", "", $val);
$dir = str_replace("\\", "", $dir);
$encode = mb_detect_encoding($dir, mb_detect_order(), false);
$dir = mb_convert_encoding($dir , 'UTF-8' , 'UTF-8');
echo '<td>'.$dir.'</td><td>'.$encode.'</td>';
$result = mkdir($dir, 0777); // the mode must be an octal integer, not the string "0777"
}
echo '<td>'.$val.'</td>';
}
echo '</tr>';
}
The expected result is that the directory name is readable in UTF-8.
Instead it turns out as garbage text.
Thanks to @eryksun:
Based on your results, it looks like PHP mkdir does not transcode from UTF-8 to native Windows UTF-16LE in order to call [W]ide-character CreateDirectoryW. It probably just calls C mkdir. This naively passes bytes to CreateDirectoryA, which decodes the UTF-8 name using the system [A]NSI encoding (e.g. codepage 1252). Starting with Windows 10, we can set [A]NSI to UTF-8 in the system locale configuration. This change requires a reboot.

PHP - echo and fgets weird characters

I'm trying to display the content of a text file on my website using PHP's fgets, but when I echo the lines in combination with something else (<br>, \n, ...) I get pretty weird characters.
Here's my code :
<?php
header('Content-Type: text/plain;charset=utf-8');
$handle = @fopen("test.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
echo $buffer."<br>";
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}
?>
Here is the content of test.txt :
1
2
3
4
5
... (6 - 18)
19
20
And here's what I get :
Result with <br>
If I use \n instead of <br>, I don't even get Chinese characters :
Result with \n
I think the issue comes from fgets(), because when I print only one line (without the loop) I get the same issue, but if I replace $buffer with "1" (echo "1"."<br>";) I get the expected result.
EDIT
As suggested I modified the code to add header('Content-Type: text/plain;charset=utf-8'); at the beginning of the php file, and modified the output as well.
I found that the issue must be somewhere in the text file : I created a new one and the issue was gone.
I don't know the original encoding of the file because a friend gave it to me.
I'll update this answer if I find out exactly what was going on.
EDIT
I made a copy via TextEdit, and when saving it the default encoding was UTF-16. I guess that was the problem.
Working DEMO: http://phpfiddle.org/main/code/xrsk-a0uv
Text file: http://m.uploadedit.com/ba3s/1500405331493.txt
Problem: when the text file was created, UTF-16 was selected as its encoding. Save it as UTF-8 instead, which is the default for Notepad, Notepad++, Sublime Text, etc.
<?php
header('Content-Type: text/plain;charset=utf-8');
$handle = @fopen("http://m.uploadedit.com/ba3s/1500405331493.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
echo $buffer."</br>";
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}
?>
NOTE: Add a header that sets the charset to UTF-8:
header('Content-Type: text/plain;charset=utf-8');
OUTPUT Using With "\n"
OUTPUT Using With "</br>"
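If re-saving the file is not an option, the content can be converted before it is echoed. The following is only a sketch: the helper name toUtf8 is invented here, it relies on a byte order mark being present, and it treats files without one as already being UTF-8.

```php
<?php
// Hypothetical helper: return text as UTF-8, using the BOM (if any) to
// recognise UTF-16. Input without a BOM is assumed to be UTF-8 already.
function toUtf8(string $raw): string
{
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return substr($raw, 3); // UTF-8 with BOM: just drop the marker
    }
    return $raw;
}
```

For example, echo toUtf8(file_get_contents("test.txt")); would print the numbers as UTF-8 text, after which the lines can be split and joined with any separator.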

PHP writes strange characters to a txt file

Hi everyone, when I execute this code to write to a file:
$fileTXT = 'prodotti.txt';
$newfileTXT = 'prodotti_2'.date("d-m-Y_h_i_s").'.txt'; // "i" is minutes; "m" would repeat the month
if (!copy($fileTXT, $newfileTXT)) {
echo "Impossibile continuare, impossibile creare file TXT.";
exit;
}
$towriteinfile = "";
$fp = fopen($path . $filename, "r") or die("Couldn't open $filename");
$fpTXT = fopen($newfileTXT, 'w') or die("Couldn't open $newfileTXT");
while (!feof($fp)) {
$line = fgets($fp, 1024);
$arr = explode("\t", $line);
$arr[7] = '<img src="http://link/imgHigh/' . $arr[7] . '.jpg" />;';
echo "Prodotto: ".$arr[4]."<br>";
foreach ($arr as $fields) {
fwrite($fpTXT, $fields.";");
}
fwrite($fpTXT, "\n");
}
fclose($fpTXT);
fclose($fp);
I get this result in the txt file:
175;13563;desc;01;category;..............c etc etc.....
mercato.㰻浩⁧牳㵣栢瑴㩰⼯睷⹷獯畣慬楴挮浯椯⽴慣⽴浩䡧杩⽨ ㄀⸀㄀  ⸀砀砀 漀欀ഀ਀樮杰•㸯㬻
The HTML code for the image is written as Chinese characters. Why?
Do you want to add content from $filename to the end of $newfileTXT?
If so, you should change:
$fpTXT = fopen($newfileTXT, 'w') or die("Couldn't open $newfileTXT");
to
$fpTXT = fopen($newfileTXT, 'a') or die("Couldn't open $newfileTXT");
The file is probably interpreted as Unicode (probably UTF-8). In Unicode encodings, a character can consist of multiple bytes. When you read the file, you just read 1024 bytes, which can leave half a character at the end of the part that you read and the other half at the start of the next part. When you start adding new characters in between, you get other byte sequences instead, causing the text to be a complete mess.
I have resolved the problem by passing every line through this function:
function cleanString($string){
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
return $string;
}
My old string contained binary characters. I have cleaned the string and now everything is OK.
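Note that stripping bytes this way also removes every non-ASCII character and every tab. If the input is in fact UTF-16LE (an assumption: the garbled output above is consistent with it, but the encoding is never confirmed), converting it first keeps those characters. A small comparison with made-up sample data:

```php
<?php
function cleanString($string) {
    return preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
}

// "Caffè", a tab and a code, encoded as UTF-16LE (assumed input encoding).
$utf16 = mb_convert_encoding("Caff\u{E8}\t01", 'UTF-16LE', 'UTF-8');

// Byte stripping loses the accented letter and the tab separator.
var_dump(cleanString($utf16)); // string(6) "Caff01"

// Converting keeps both, so explode("\t", ...) still finds the fields.
$utf8 = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
var_dump(explode("\t", $utf8) === ["Caff\u{E8}", "01"]); // bool(true)
```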

fputcsv and newline codes

I'm using fputcsv in PHP to output a comma-delimited file of a database query. When opening the file in gedit in Ubuntu, it looks correct: each record has a line break (no visible line break characters, but you can tell each record is separated), and opening it in OpenOffice spreadsheet allows me to view the file correctly.
However, we're sending these files on to a client on Windows, and on their systems, the file comes in as one big, long line. Opening it in Excel, it doesn't recognize multiple lines at all.
I've read several questions on here that are pretty similar, including this one, which includes a link to the really informative Great Newline Schism explanation.
Unfortunately, we can't just tell our clients to open the files in a "smarter" editor. They need to be able to open them in Excel. Is there any programmatic way to ensure that the correct newline characters are added so the file can be opened in a spreadsheet program on any OS?
I'm already using a custom function to force quotes around all values, since fputcsv is selective about it. I've tried doing something like this:
function my_fputcsv($handle, $fieldsarray, $delimiter = "~", $enclosure ='"'){
$glue = $enclosure . $delimiter . $enclosure;
return fwrite($handle, $enclosure . implode($glue,$fieldsarray) . $enclosure."\r\n");
}
But when the file is opened in a Windows text editor, it still shows up as a single long line.
// Writes an array to an open CSV file with a custom end of line.
//
// $fp: a seekable file pointer. Most file pointers are seekable,
// but some are not. example: fopen('php://output', 'w') is not seekable.
// $eol: probably one of "\r\n", "\n", or for super old macs: "\r"
function fputcsv_eol($fp, $array, $eol) {
fputcsv($fp, $array);
if("\n" != $eol && 0 === fseek($fp, -1, SEEK_CUR)) {
fwrite($fp, $eol);
}
}
This is an improved version of @John Douthat's great answer, preserving the possibility of using custom delimiters and enclosures and returning fputcsv's original output:
function fputcsv_eol($handle, $array, $delimiter = ',', $enclosure = '"', $eol = "\n") {
$return = fputcsv($handle, $array, $delimiter, $enclosure);
if($return !== FALSE && "\n" != $eol && 0 === fseek($handle, -1, SEEK_CUR)) {
fwrite($handle, $eol);
}
return $return;
}
Before PHP 8.1, fputcsv writes only \n and the line ending cannot be customized. This makes the function awkward in a Microsoft environment, although some packages will also detect the Linux newline.
Still, the benefits of fputcsv kept me digging for a way to replace the newline character just before sending to the file. This can be done by writing the fputcsv output to PHP's built-in temp stream first, then changing the newline character(s) to whatever you want, and then saving to file. Like this:
function getcsvline($list, $seperator, $enclosure, $newline = "" ){
$fp = fopen('php://temp', 'r+');
fputcsv($fp, $list, $seperator, $enclosure );
rewind($fp);
$line = fgets($fp);
if( $newline and $newline != "\n" ) {
if( $line[strlen($line)-2] != "\r" and $line[strlen($line)-1] == "\n") {
$line = substr_replace($line,"",-1) . $newline;
} else {
// return the line as is (literal string)
//die( 'original csv line is already \r\n style' );
}
}
return $line;
}
/* to call the function with the array $row and save to file with filehandle $fp */
$line = getcsvline( $row, ",", "\"", "\r\n" );
fwrite( $fp, $line);
As webbiedave pointed out (thx!) probably the cleanest way is to use a stream filter.
It is a bit more complex than other solutions, but even works on streams that are not editable after writing to them (like a download using $handle = fopen('php://output', 'w'); )
Here is my approach:
class StreamFilterNewlines extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ( $bucket = stream_bucket_make_writeable($in) ) {
$bucket->data = preg_replace('/(?<!\r)\n/', "\r\n", $bucket->data); // lookbehind also catches a leading \n and consecutive \n\n
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
}
stream_filter_register("newlines", "StreamFilterNewlines");
stream_filter_append($handle, "newlines");
fputcsv($handle, $list, $seperator, $enclosure);
...
Alternatively, you can output in native Unix format (\n only) and then run unix2dos on the resulting file to convert to \r\n in the appropriate places. Just be careful that your data contains no \n characters. Also, I see you are using a default separator of ~. Try a default separator of \t.
I've been dealing with a similar situation. Here's a solution I've found that outputs CSV files with Windows-friendly line endings.
http://www.php.net/manual/en/function.fputcsv.php#90883
I wasn't able to use the fseek-based approach, since I'm trying to stream a file to the client and can't use fseek.
Windows needs \r\n, the carriage return/line feed combo, in order to show separate lines.
I did eventually get an answer over at experts-exchange; here's what worked:
function my_fputcsv($handle, $fieldsarray, $delimiter = "~", $enclosure ='"'){
$glue = $enclosure . $delimiter . $enclosure;
return fwrite($handle, $enclosure . implode($glue,$fieldsarray) . $enclosure.PHP_EOL);
}
to be used in place of standard fputcsv.
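Keep in mind that PHP_EOL is the line ending of the machine running the script, so on a Linux server it is still \n. On PHP 8.1 and later, fputcsv() itself accepts an eol argument, which may remove the need for a custom writer. A minimal sketch, assuming PHP 8.1+; the sample rows are invented and php://memory is used only so the result can be inspected:

```php
<?php
// Assumes PHP 8.1 or newer, where fputcsv() has a sixth parameter, $eol.
$fp = fopen('php://memory', 'r+');
foreach ([['id', 'name'], ['1', 'Ada Lovelace']] as $row) {
    // Separator, enclosure and escape are passed explicitly; eol is "\r\n".
    fputcsv($fp, $row, ',', '"', '\\', "\r\n");
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

var_dump(bin2hex(substr($csv, -2))); // string(4) "0d0a"
```

Unlike the implode-based function, this keeps fputcsv's own quoting and escaping of field values.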

Some characters in CSV file are not read during PHP fgetcsv()

I am reading a CSV file with php. Many of the rows have a "check mark" which is really the square root symbol: √ and the php code is just skipping over this character every time it is encountered.
Here is my code (printing to the browser window in "CSV style" format so I can check that the lines break at the right place):
$file = fopen($uploadfile, 'r');
while (($line = fgetcsv($file)) !== FALSE) {
foreach ($line as $key => $value) {
if ($value) {
echo $value.",";
}
}
echo "<br />";
}
fclose($file);
As an interim solution, I am just finding and replacing the checkmarks with 1's manually, in Excel. Obviously I'd like a more efficient solution :) Thanks for the help!
fgetcsv() only works on standard ASCII characters; so it's probably "correct" in skipping your square root symbols. However, rather than replacing the checkmarks manually, you could read the file into a string, do a str_replace() on those characters, and then parse it using fgetcsv(). You can turn a string into a file pointer (for fgetcsv) thusly:
$fp = fopen('php://memory', 'r+');
fwrite($fp, (string)$string);
rewind($fp);
while (($line = fgetcsv($fp)) !== FALSE)
...
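The steps above can be put together as follows. This is a sketch with invented sample data standing in for the uploaded file; it assumes the data is UTF-8, where "\u{221A}" is the √ character, so the byte sequence to replace will differ for other encodings.

```php
<?php
// Sample data stands in for the uploaded file; "\u{221A}" is the √ character.
$string = "task,done\nwash,\u{221A}\ncook,\n";

// Replace the check marks before parsing, so the parser only sees ASCII.
$string = str_replace("\u{221A}", '1', $string);

// Turn the string into a file pointer that fgetcsv() can read from.
$fp = fopen('php://memory', 'r+');
fwrite($fp, $string);
rewind($fp);

$rows = [];
while (($line = fgetcsv($fp, 0, ',', '"', '\\')) !== false) {
    $rows[] = $line;
}
fclose($fp);

print_r($rows);
```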
I had a similar problem with accented first characters of strings. I eventually gave up on fgetcsv and did the following, using fgets() and explode() instead (I'm guessing your CSV is comma separated):
$file = fopen($uploadfile, 'r');
while (($the_line = fgets($file)) !== FALSE) // <-- fgets
{
$line = explode(',', $the_line); // <-- explode
foreach ($line as $key => $value)
{
if ($value)
{
echo $value.",";
}
}
echo "<br />";
}
fclose($file);
You should call setlocale(), as written in the documentation:
Note:
Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.
Before calling fgetcsv, add setlocale(LC_ALL, 'en_US.UTF-8'). In my case it was 'lt_LT.UTF-8'.
This behaviour is reported as a PHP bug.
