Encoding issue with PHP while writing in a .csv file

I'm working with a PHP array that contains values parsed in an earlier scraping step (using Simple HTML DOM Parser). I can print / echo the values of this array, which contain special characters (é, à, è, etc.), without any trouble. BUT the problem is the following:
When I use fwrite to save the values in a .csv file, some characters are not saved correctly. For example, Székesfehérvár is displayed correctly in my PHP view in HTML, but it is saved as Sz&#233;kesfeh&#233;rv&#225;r in the .csv file that I generate with the PHP script below.
I've already checked several things in the PHP script:
The page I'm scraping seems to be UTF-8 encoded
My PHP script is also declared as UTF-8 in the header
I've tried many iconv and mb_* encoding functions in different places in the code
NOTE that when I console.log my PHP array in JS, using json_encode, the characters are also broken. Maybe this is linked to the original encoding of the page I'm scraping?
Here is the part of the script that writes the values to the .csv file:
<?php
$data = array(
    array("item1", "item2"),
    array("item1", "item2"),
    array("item1", "item2"),
    array("item1", "item2")
    // ...
);
// filename
$filename = 'myFileName.csv';
// holds the whole content of the .csv as a string
$string_txt = "";
foreach ($data as $line) {
    // builds one line of the .csv
    $line_txt = "";
    foreach ($line as $item) {
        // each line of the .csv holds the values of the PHP subarray, tab separated
        $line_txt .= $item . "\t";
    }
    // PHP end-of-line constant, marks the end of this line of the .csv
    $line_txt .= PHP_EOL;
    // append the line to the string holding the whole content of the .csv
    $string_txt .= $line_txt;
}
// write the string to the .csv file
$file = fopen($filename, 'w+');
fwrite($file, $string_txt);
fclose($file);
I am currently stuck because I can't save values with accented characters correctly.

Put this line in your code
header('Content-Type: text/html; charset=UTF-8');
Hope this helps you!

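Try this:
$file = fopen('myFileName.csv', 'w');
foreach ($data as $row) {
    // utf8_decode converts UTF-8 to ISO-8859-1; it is deprecated as of PHP 8.2
    fputcsv($file, array_map('utf8_decode', $row));
}
fclose($file);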

Excel has problems displaying UTF-8 encoded CSV files; I have seen this before. You can try adding a UTF-8 BOM, which worked for me. This simply means adding these three bytes once, at the start of your UTF-8 string:
$string_txt = chr(239) . chr(187) . chr(191) . $string_txt;
For more info:
Encoding a string as UTF-8 with BOM in PHP
Alternatively, you can use the file import feature in Excel and make sure the file origin says 65001 : Unicode(UTF8). It should display your text properly and you will need to save it as an Excel file to preserve the format.
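The BOM approach above can be sketched as follows. This is only an illustration under assumptions: the helper name writeCsvWithBom and the sample row are made up, and fputcsv is swapped in for the manual string building so that quoting is handled by PHP. The tab delimiter is kept from the question.

```php
<?php
// Hypothetical helper: writes rows as a tab-separated file with a UTF-8 BOM.
function writeCsvWithBom(string $filename, array $rows): void
{
    $file = fopen($filename, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $filename for writing");
    }
    // Write the BOM exactly once, before any data.
    fwrite($file, "\xEF\xBB\xBF");
    foreach ($rows as $row) {
        // "\t" keeps the tab separator from the question; the empty escape
        // string (PHP 7.4+) turns off fputcsv's non-standard escape character.
        fputcsv($file, $row, "\t", '"', '');
    }
    fclose($file);
}

// Assumed sample data, already in UTF-8.
writeCsvWithBom('myFileName.csv', [
    ['Székesfehérvár', 'Sóstói Stadion'],
]);
```

The BOM only helps if the data really is UTF-8 at this point; it does not repair strings that still contain HTML entities.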

The solution (provided by @misorude):
When scraping HTML content from web pages, there is a difference between what is displayed in your debug output and what the script has really scraped: the browser renders an entity such as &#233; as é, so echoed output looks correct even though the string still contains the entity. I had to use html_entity_decode so that PHP works with the true value of the HTML I scraped, and not the browser's interpretation of it.
To check that the values are retrieved correctly before storing them somewhere, you can console.log them in JS:
PHP
//decoding the numeric HTML entities that represent "Sóstói Stadion"
$b = html_entity_decode("S&#243;st&#243;i Stadion");
JavaScript (to test):
<script>
var b = <?php echo json_encode($b) ;?>;
//prints "Sóstói Stadion" correctly
console.log(b);
</script>
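Applied to a whole scraped row before writing it to the .csv, the same idea might look like the sketch below. The helper name decodeRow, the sample entities, and the explicit flags are assumptions for illustration, not part of the original solution.

```php
<?php
// Hypothetical helper: decodes HTML entities in every cell of a scraped row.
function decodeRow(array $row): array
{
    return array_map(
        // ENT_QUOTES | ENT_HTML5 also decodes quotes and HTML5 named entities;
        // 'UTF-8' is the assumed target encoding.
        fn (string $cell): string => html_entity_decode($cell, ENT_QUOTES | ENT_HTML5, 'UTF-8'),
        $row
    );
}

// Assumed sample input: numeric entities as they might appear in scraped markup.
$decoded = decodeRow(['Sz&#233;kesfeh&#233;rv&#225;r', 'S&#243;st&#243;i Stadion']);
echo implode("\t", $decoded), PHP_EOL;
```

Passing the encoding explicitly avoids depending on the default_charset setting of the server.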

Related

PHP: converting a UCS-2 LE BOM string to UTF-8 stops working once I write the string to a file

I am currently having a hard time trying to do the simplest thing :
I have a UCS-2 LE BOM encoded file that I am converting to UTF-8.
Here is what Notepad++ says about the encoding :
My converting routine is simple :
I am opening the input file and creating an output file.
I am parsing the input file and converting every line on the fly to UTF-8
Once the converting is done, I remove the input file
Once the input file is removed, I rename my output file to the name of the input file
Here is the code that does it :
public function convertCsvToUtf8(string $absolutePathToFile) : string {
    $dotPosition = strrpos($absolutePathToFile, ".");
    $absolutePathToNewFile = substr($absolutePathToFile, 0, $dotPosition)."-utf8.csv";
    $res_input_file = fopen($absolutePathToFile, "r");
    $res_output_file = fopen($absolutePathToNewFile, "w+");
    while($input_string = fgets($res_input_file)){
        $inputEncoding = mb_detect_encoding($input_string, mb_list_encodings(), true);
        $output_string = iconv($inputEncoding, 'UTF-8', $input_string);
        fputs($res_output_file, ($output_string));
    }
    fclose($res_input_file);
    fclose($res_output_file);
    unlink($absolutePathToFile);
    rename($absolutePathToNewFile, $absolutePathToFile);
    return $absolutePathToFile;
}
Here you can see an example of an execution :
So... everything seems to be okay at first glance (except for the fact that the "°" is replaced by a weird character); but when I open the output file with Notepad++, here is a sample of what I see:
I have no idea what is going on here.
Any help would be awesome !
Feel free to ask for more details !
Thanks in advance,
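
One possible explanation, offered as an assumption since the screenshots are not available: fgets splits on the single byte 0x0A, while a line break in UTF-16LE is the two bytes 0x0A 0x00, so every chunk after the first starts in the middle of a 2-byte code unit, and mb_detect_encoding then has to guess the encoding of each misaligned fragment. A minimal sketch that instead converts the whole content in one pass, assuming the input is always UTF-16LE with a BOM as Notepad++ reports (the function name is hypothetical):

```php
<?php
// Hypothetical helper: converts UTF-16LE bytes (optionally with a BOM) to UTF-8.
function utf16LeToUtf8(string $bytes): string
{
    // Drop the UTF-16LE byte order mark (FF FE) if present.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    // Convert the whole buffer at once so no 2-byte code unit is split.
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

// Sample input built in memory: BOM followed by "20°", a line break, and "ok".
$sample = "\xFF\xFE" . mb_convert_encoding("20°\nok", 'UTF-16LE', 'UTF-8');
echo utf16LeToUtf8($sample), PHP_EOL;
```

For a real file that fits in memory, the same function could be fed the result of file_get_contents and its return value written back with file_put_contents.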

PHP convert encoding with Shift_JIS

I have a text file. It contains the "砡" character and is encoded in Shift-JIS.
I am using file_get_contents() in PHP (Laravel) to read this file, and then return its content to the client as JSON.
$file = file_get_contents("/path/to/file/text");
$file = iconv("SJIS", "UTF-8//IGNORE", $file);
return response()->json(['content' => $file]);
However, the character "砡" isn't displayed correctly; it shows up as "x".
How do I fix it?

Try "SJIS-win" instead of "SJIS".
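"SJIS-win" is the mbstring name for the Windows variant of Shift-JIS (CP932), which adds vendor extension characters to the standard set; if "砡" is one of those, that would explain why plain "SJIS" drops it. A minimal sketch using mb_convert_encoding rather than iconv, since whether a given iconv build accepts the name "SJIS-win" is an assumption. The helper name is hypothetical, and the sample bytes are built in memory instead of being read from the file.

```php
<?php
// Hypothetical helper: decodes Windows-flavoured Shift-JIS (CP932) bytes to UTF-8.
function sjisWinToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
}

// Build sample CP932 bytes from a UTF-8 literal instead of reading a real file.
$sjisBytes = mb_convert_encoding('砡', 'SJIS-win', 'UTF-8');
echo sjisWinToUtf8($sjisBytes), PHP_EOL;
```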

Reading from text file containing '<<EOF' via php

I am trying to read a bash script from a text file and print it to the screen via PHP.
I tried
$code = @file_get_contents( $myFileName );
as well as
$code = "";
$myFile = fopen($myFileName, "r");
while ($line = fgets($myFile)) {
$code .= $line;
}
However, the string I get from reading in the file doesn't contain all of the file's contents. The problem is that the text file contains the string
<<EOF
After that the string abruptly stops.
How come? It seems weird to me that PHP isn't able to deal with those few characters and misinterprets them as the actual EOF.
Is there a way I can read in the whole file?
Thanks in advance!

When I try it, I don't experience that problem. Presumably, therefore, you are outputting the text into an HTML document and testing your code by looking at how a browser renders that document (as opposed to looking at the raw output of the script, as it would appear in View > Source).
In HTML, < indicates the start of a tag. You need to escape your output with htmlspecialchars() for < to be treated as data instead of markup.
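A minimal sketch of that escaping step, assuming a made-up three-line bash snippet in place of the real file content:

```php
<?php
// Assumed sample: a bash snippet containing a here-document marker.
$code = "cat <<EOF\nhello\nEOF\n";

// Escape <, >, &, and quotes so the browser shows them as text.
$escaped = htmlspecialchars($code, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// <pre> keeps the script's line breaks visible in the rendered page.
echo '<pre>', $escaped, '</pre>', PHP_EOL;
```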

PHP fputcsv with UTF-8 Problem

I'm trying to let my clients view some of the MySQL data in Excel. I have used PHP's fputcsv() function, like this:
public function generate() {
    setlocale(LC_ALL, 'ko_KR.UTF8');
    $this->filename = date("YmdHis");
    $create = $this->directory."Report".$this->filename.".csv";
    $f = fopen("$create","w") or die("can't open file");
    fwrite($f, "\xEF\xBB\xBF");
    $i = 1;
    $length = count($this->inputarray[0]);
    fwrite($f, $this->headers."\n");
    // print column titles
    foreach($this->inputarray[0] as $key=>$value) {
        $delimiter = ($i == $length) ? "\n\n" : ",";
        fwrite($f, $key.$delimiter);
        $i++;
    }
    // print actual rows
    foreach($this->inputarray as $row) {
        fputcsv($f, $row);
    }
    fclose($f);
}
My clients are Korean, and a good chunk of the MySQL database contains values in utf8_unicode_ci. By using the above function, I successfully generated a CSV file with correctly encoded data that opens fine in Excel on my machine (Win7 in English), but when I opened the file in Excel on the client computer (Win7 in Korean), the characters were broken again. I tried taking the header (\xEF\xBB\xBF) out, and commenting out the setlocale, to no avail.
Can you help me figure this out?

If, as you say, your CSV file has "correctly encoded data" (i.e. it contains a valid UTF-8 byte stream), and assuming that the byte stream of the file at your client's site is the same (e.g. it has not been corrupted in transit by a file transfer problem), then it sounds like the issue is Excel on the client's machine not interpreting the UTF-8 correctly. This might be because UTF-8 is not supported there, or because some option needs to be selected when importing to indicate the encoding. As such, you might try producing your file in a different encoding (using mb_convert_encoding or iconv).
If you get your client to export a CSV containing Korean characters then you'll be able to take a look at that file and determine the encoding that is being produced. You should then try using that encoding.

Try encoding the data as UTF-16LE, and ensure that the file has the appropriate BOM.
Alternatively, send your clients an Excel file rather than a CSV; then the encoding shouldn't be a problem.
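The UTF-16LE suggestion could be sketched as below. Everything here is an assumption for illustration: the helper name, the file name report.csv, the sample rows, and the choice of tab as separator (a common pairing with UTF-16 for Excel, but not something the answer states).

```php
<?php
// Hypothetical helper: renders rows as tab-separated UTF-16LE text with a BOM.
function rowsToUtf16LeTsv(array $rows): string
{
    $utf8 = '';
    foreach ($rows as $row) {
        // Assumes cells contain no tabs or line breaks, so no quoting is done.
        $utf8 .= implode("\t", $row) . "\r\n";
    }
    // FF FE is the UTF-16LE byte order mark; the data follows in the same encoding.
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

// Assumed sample rows, already in UTF-8.
$bytes = rowsToUtf16LeTsv([['이름', '도시'], ['김철수', '서울']]);
file_put_contents('report.csv', $bytes);
```

Whether the client's Excel opens this correctly still has to be checked on the client's machine.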

Try wrapping the text in each fwrite call with utf8_encode.
Then use what is suggested here: http://www.php.net/manual/en/function.fwrite.php#69566

PHP File Opening Encoding Problem?

When I try to open a .log file created by a game in PHP I get a bunch of this.
ÿþ*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*�*� �
K�2� �E�n�g�i�n�e� �s�t�a�r�t� �u�p�.�.�.� �
[�2�0�0�9�/�2�2�/�0�9�]� �
[�1�6�:�0�7�:�3�3�]� �
[�0�.�1�.�4�6�.�0�]� �
[�0�]� �
I have no idea why. My code is:
$file = trim($_GET['id']);
$handle = @fopen($file, "a+");
if ($handle) {
    print "<table>";
    while (!feof($handle)) {
        $buffer = stream_get_line($handle, 10000, "\n");
        echo "<tr><td width=10>" . __LINE__ . "</td><td>" . $buffer . "</td></tr>";
    }
    print "</table>";
    fclose($handle);
}
I'm using stream_get_line because it is apparently better for large files?

PHP doesn't really know much about encodings. In particular, it knows nothing about the encoding of your file.
The data looks like UTF-16LE, so you'll need to convert it into something you can handle. Or, since you're just printing, you can make the entire script output its HTML as UTF-16LE as well.
I would probably prefer converting to UTF-8 and using that as the page encoding, so you're sure no characters are lost. Take a look at iconv, assuming it's available (a PHP extension is required on Windows, I believe).
Note that regardless of what you do, you should strip the first two bytes of the first line, assuming the encoding is always the same. In the data you're showing, these bytes (displayed as ÿþ) are the byte order mark, which tells us the file's encoding (UTF-16LE, as I mentioned earlier).
However, seeing as how it appears to be plain text, and all you're doing is printing the data, consider just opening it in a plain old text editor (that supports Unicode). Not knowing your operating system, I'm hesitant to suggest a specific one, but if you're on Windows and the file is relatively small, Notepad can do it.
As a side note, __LINE__ will not give you the line number of the file you're reading, it will print the line number of the currently executing script line.
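Putting these points together, a rough sketch might look like the following. It assumes the log is always UTF-16LE with a BOM and small enough to convert in memory; the helper name and the sample lines are made up for illustration.

```php
<?php
// Hypothetical helper: turns UTF-16LE log bytes into HTML table rows.
function logToTableRows(string $bytes): string
{
    // Strip the byte order mark (FF FE) if present, then convert in one pass.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    $text = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

    $html = '';
    // Split on \r\n or \n and count lines ourselves instead of using __LINE__.
    foreach (preg_split('/\r?\n/', rtrim($text, "\r\n")) as $index => $line) {
        $html .= '<tr><td width=10>' . ($index + 1) . '</td><td>'
            . htmlspecialchars($line, ENT_QUOTES, 'UTF-8') . "</td></tr>\n";
    }
    return $html;
}

// Assumed sample input built in memory; a real script would read the log file.
$sample = "\xFF\xFE" . mb_convert_encoding("Engine start up...\r\n[0.1.46.0]\r\n", 'UTF-16LE', 'UTF-8');
echo '<table>', "\n", logToTableRows($sample), '</table>', PHP_EOL;
```

The page itself would then be served as UTF-8, as suggested above.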

You might be running into a UTF-8 Byte Order Mark: http://en.wikipedia.org/wiki/Byte-order_mark
Try reading it like so:
<?php
// Reads past the UTF-8 BOM if it is there.
function fopen_utf8($filename, $mode) {
    $file = @fopen($filename, $mode);
    $bom = fread($file, 3);
    if ($bom != b"\xEF\xBB\xBF")
        rewind($file); // no BOM: go back to the start of the file
    else
        echo "bom found!\n";
    return $file;
}
?>
From: http://us3.php.net/manual/en/function.fopen.php#78308
