I'm trying to read a CSV file and echo its content, but the characters are displayed incorrectly.
Mäx Müstermänn -> Mäx Müstermänn
Encoding of the CSV file is UTF-8 without BOM (checked with Notepad++).
This is the content of the CSV file:
"Mäx";"Müstermänn"
My PHP script:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
$handle = fopen("specialchars.csv", "r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv($handle, 1000, ";")) {
    $num = count($data);
    for ($c = 0; $c < $num; $c++) {
        // output data
        echo "<td>$data[$c]</td>";
    }
    echo "</tr><tr>";
}
?>
</body>
</html>
I tried to use setlocale(LC_ALL, 'de_DE.utf8'); as suggested here, without success. The content is still displayed incorrectly.
What am I missing?
Edit:
An echo mb_detect_encoding($data[$c],'UTF-8'); gives me UTF-8 UTF-8.
echo file_get_contents("specialchars.csv"); gives me "Mäx";"Müstermänn".
And
print_r(str_getcsv(reset(explode("\n", file_get_contents("specialchars.csv"))), ';'))
gives me
Array ( [0] => Mäx [1] => Müstermänn )
What does it mean?
Try this:
<?php
$handle = fopen("specialchars.csv", "r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv($handle, 1000, ";")) {
    $data = array_map("utf8_encode", $data); // added
    $num = count($data);
    for ($c = 0; $c < $num; $c++) {
        // output data
        echo "<td>$data[$c]</td>";
    }
    echo "</tr><tr>";
}
?>
I encountered a similar problem: parsing a CSV file with special characters like é, è, ö, etc.
The following worked fine for me:
To represent the characters correctly on the HTML page, this header was needed:
header('Content-Type: text/html; charset=UTF-8');
In order to parse every character correctly, I used:
utf8_encode(fgets($file));
Don't forget to use the multibyte string functions (mb_*) for all subsequent string operations, like:
mb_strtolower($value, 'UTF-8');
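Putting these pieces together, a minimal sketch could look like the following (the file name is hypothetical, and it assumes the source file is ISO-8859-1, since that is the encoding utf8_encode() converts from):
// send the charset header before any output
header('Content-Type: text/html; charset=UTF-8');
$file = fopen('data.csv', 'r');                   // hypothetical file name
while (($line = fgets($file)) !== false) {
    $line   = utf8_encode($line);                 // assumes the source line is ISO-8859-1
    $fields = str_getcsv($line, ';');
    // any further string handling should use the mb_* functions
    echo mb_strtolower($fields[0], 'UTF-8'), "\n";
}
fclose($file);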
In my case the source file had Windows-1250 encoding, and iconv printed tons of notices about illegal characters in the input string...
So this solution helped me a lot:
/**
 * getting a CSV array with UTF-8 encoding
 *
 * @param resource &$handle
 * @param integer  $length
 * @param string   $separator
 *
 * @return array|false
 */
private function fgetcsvUTF8(&$handle, $length, $separator = ';')
{
    if (($buffer = fgets($handle, $length)) !== false)
    {
        $buffer = $this->autoUTF($buffer);
        return str_getcsv($buffer, $separator);
    }
    return false;
}

/**
 * automatic conversion of windows-1250 and iso-8859-2 into a utf-8 string
 *
 * @param string $s
 *
 * @return string
 */
private function autoUTF($s)
{
    // detect UTF-8
    if (preg_match('#[\x80-\x{1FF}\x{2000}-\x{3FFF}]#u', $s))
        return $s;

    // detect WINDOWS-1250
    if (preg_match('#[\x7F-\x9F\xBC]#', $s))
        return iconv('WINDOWS-1250', 'UTF-8', $s);

    // assume ISO-8859-2
    return iconv('ISO-8859-2', 'UTF-8', $s);
}
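For reference, a minimal calling loop for these helpers might look like this (a sketch only: the file name is hypothetical, and the calls assume this runs inside the same class as the two private methods above):
$handle = fopen('data.csv', 'r');               // hypothetical file name
while (($row = $this->fgetcsvUTF8($handle, 1000)) !== false) {
    // $row is now an array of UTF-8 encoded fields
    print_r($row);
}
fclose($handle);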
In response to @manvel's answer: use str_getcsv instead of explode, because of cases like this:
some;nice;value;"and;here;comes;combinated;value";and;some;others
explode will split the string into these parts:
some
nice
value
"and
here
comes
combinated
value"
and
some
others
but str_getcsv will split the string into these parts:
some
nice
value
and;here;comes;combinated;value
and
some
others
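You can verify this quickly with a small test script (the sample line is taken from above):
$line = 'some;nice;value;"and;here;comes;combinated;value";and;some;others';
print_r(explode(';', $line));      // breaks the quoted field into separate pieces
print_r(str_getcsv($line, ';'));   // keeps the quoted field as a single value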
Try putting this into the top of your file (before any other output):
<?php
header('Content-Type: text/html; charset=UTF-8');
?>
The problem is that fgetcsv returns data which mb_detect_encoding reports as UTF-8, but it does not actually convert anything; the bytes are simply treated as if they were UTF-8. Therefore it is necessary to convert from the file's original encoding (Windows-1251 / CP1251) to UTF-8 using iconv. But since fgetcsv returns an array, I suggest writing a custom function:
function customfgetcsv(&$handle, $length, $separator = ';')
{
    if (($buffer = fgets($handle, $length)) !== false) {
        return explode($separator, iconv("CP1251", "UTF-8", $buffer));
    }
    return false;
}
Now I got it working (after removing the header command). I think the problem was that the encoding of the PHP file itself was ISO-8859-1. I set it to UTF-8 without BOM. I thought I had already done that, but perhaps I undid it afterwards.
Furthermore, I used SET NAMES 'utf8' for the database. Now it is also correct in the database.
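For anyone wondering how to apply SET NAMES 'utf8' from PHP, here is a short sketch (the connection details are hypothetical):
// PDO: request UTF-8 for the connection directly in the DSN
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pass');
// mysqli: the equivalent of SET NAMES 'utf8'
$mysqli = new mysqli('localhost', 'user', 'pass', 'test');
$mysqli->set_charset('utf8');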
Related
I know a lot of similar questions have already been asked in this community, but unfortunately none of the answers work for me.
I have a CSV sheet that I need to import into our system. The sheet imports without any issue on Linux (creating it with LibreOffice), even with Irish characters.
But the main problem starts in Windows and iOS environments with Excel (MS Excel), where the character encoding gets changed, and a few of the Irish characters like Ž, Ŕ and many others get changed to different symbols.
P.S.: The CSV works fine if we create it through Numbers on iOS.
Below is the PHP method I'm using to read the CSV sheet.
$path = CUploadedFile::getInstance($model, 'absence_data_file'); // Get the instance of the selected file
$target = ['First Name', 'Last Name', 'Class', 'Year', 'From'];  // Valid header

public static function readCSV($path, $target) {
    $updated_header = array();
    $data = array();
    if ($path->type == 'text/csv' || $path->type == 'application/vnd.ms-excel' || $path->type == 'text/plain' || $path->type == 'text/tsv') {
        $fp = fopen($path->tempName, 'r');
        $encoding_type = mb_detect_encoding(file_get_contents($path->tempName));
        if ($fp !== FALSE) {
            $header = fgetcsv($fp);
            foreach ($header as $h) {
                $updated_header[] = $h;
            }
            $updated_header = array_map('trim', array_values($updated_header));
            if (array_diff($target, $updated_header)) {
                $errormessage = 'Invalid header format.';
                return $errormessage;
            } else {
                while ($ar = fgetcsv($fp)) {
                    $data[] = array_combine($updated_header, $ar);
                }
                $data['file_encoding'] = $encoding_type;
                return $data;
            }
        }
    } else {
        $errormessage = "Invalid File type, You can import CSV files only";
        return $errormessage;
    }
}
The sheet I'm importing and the printed data for the first record are shown as screenshots in the original post.
I'm not sure about the Irish code page, but if it is Western European as you mentioned in your comment, I'm guessing your code page would be ISO-8859-1 or ISO-8859-14, and your line of code should be:
$encoding_type = mb_detect_encoding(file_get_contents($path->tempName), 'ISO-8859-1', true);
or simply the following, since you are sure its encoding is ISO-8859-1:
$encoding_type = 'ISO-8859-1';
The second and third parameters of mb_detect_encoding tell the function to strictly try to detect ISO-8859-1. If you want to try other code pages at the same time, you can pass a comma-separated list of code pages as the second parameter, e.g. 'UTF-8, ISO-8859-1'.
Note that you will need to call mb_convert_encoding to actually get the content into your desired encoding, so the following code will convert from ISO-8859-1 to UTF-8:
$UTF8_text = mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1');
If you insist on using fgetcsv, have a look at mb_internal_encoding (https://www.php.net/manual/en/function.mb-internal-encoding.php); it sets the default encoding for the mbstring functions.
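As a rough sketch of how the detection and conversion could be wired into the question's readCSV method (assuming the non-UTF-8 case really is ISO-8859-1, which is an assumption rather than a given):
$content  = file_get_contents($path->tempName);
$encoding = mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true);
if ($encoding !== 'UTF-8') {
    // convert everything to UTF-8 before parsing
    $content = mb_convert_encoding($content, 'UTF-8', $encoding ?: 'ISO-8859-1');
}
// parse the converted content line by line instead of running fgetcsv on the raw file
$rows = array_map('str_getcsv', preg_split('/\r\n|\r|\n/', trim($content)));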
I use PHP.
The function below loads part of a big multibyte, newline-separated CSV file and returns a pointer (the end position) and the content in an array. With the pointer I can later do another run. It works:
function part($path, $offset, $rows) {
    $buffer = array();
    $buffer['content'] = '';
    $buffer['pointer'] = array();
    $handle = fopen($path, "r");
    fseek($handle, $offset);
    if ($handle) {
        for ($i = 0; $i < $rows; $i++) {
            $buffer['content'] .= fgets($handle);
            $buffer['pointer'] = mb_strlen($buffer['content']);
        }
    }
    fclose($handle);
    return($buffer);
}
// Buffer first part
$buffer = part($path_to_file, 0, 100);
// Buffer second part
$buffer = part($path_to_file, $buffer['pointer'], 100);
print_r($buffer);
If I change the $buffer['pointer'] line to:
$buffer['pointer'] = mb_strlen($buffer['content'], "UTF-8");
...it does not work anymore... I understand that it uses a different encoding when I use UTF-8 instead of the default, but why doesn't it work with UTF-8?
Shouldn't UTF-8 be compatible with foreign characters?
Because the function above works when I use it without "UTF-8", I guess I could just leave it out.
But I'm still worried that in some cases it could give the wrong pointer.
Is there a safer way to get the correct pointer?
Encoding test
When I do this I get UTF-8:
echo mb_detect_encoding($buffer['content']);
This has little to do with UTF-8. Filesystem functions (like fseek(), fread(), etc.) operate on individual bytes. They don't care about the encoding at all. (You could be writing / reading binary data).
If you want to store a pointer to fseek() to at a later time, use ftell() to find out the current position:
$buffer['pointer'] = ftell($handle);
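Applied to the function from the question, a sketch of the ftell()-based version could look like this:
function part($path, $offset, $rows) {
    $buffer = array('content' => '', 'pointer' => 0);
    $handle = fopen($path, 'r');
    if ($handle) {
        fseek($handle, $offset);
        for ($i = 0; $i < $rows; $i++) {
            $line = fgets($handle);
            if ($line === false) {
                break;                            // stop at end of file
            }
            $buffer['content'] .= $line;
        }
        // byte position in the file; safe to pass back to fseek() later
        $buffer['pointer'] = ftell($handle);
        fclose($handle);
    }
    return $buffer;
}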
I'm trying to figure out how to process a CSV file with UTF-8 encoding. I've tried multiple approaches, like adding utf8_encode() and putting this in the header:
header('Content-Type: text/html; charset=UTF-8');
But nothing seems to work.
The code is:
<?php
include 'head.php';
$csv = array_map("str_getcsv", file("translations/dk.csv"));
foreach ($csv as $line) {
    $translate["dk"][$line[0]] = $line[1];
}
if ($line[1] != NULL) {
    $line[0] = $line[1];
}
echo $line[0];
fclose($csv);
?>
How do I echo the output with UTF-8 encoding?
When you display it in a browser, you should use valid HTML and set the meta charset to UTF-8 too:
<?php
include 'head.php';
?>
<!DOCTYPE html>
<html lang="dk">
<head>
<meta charset="utf-8"/>
</head>
<body>
<?php
$csv = array_map("str_getcsv", file("translations/dk.csv"));
foreach ($csv as $line) {
    $translate["dk"][$line[0]] = $line[1];
}
if ($line[1] != NULL) {
    $line[0] = $line[1];
}
echo $line[0];
fclose($csv);
?>
</body>
</html>
Or using text/plain instead of text/html can help:
header('Content-Type: text/plain; charset=UTF-8');
Hope that helps.
Based on what you described, it looks like the file isn't in UTF-8 format; it's probably in ISO-8859-1, but you are trying to display it as if it were UTF-8, hence the strange blocky symbols.
You have two options: you can convert the file entries to UTF-8 with:
foreach ($csv as $line)
    $translate["dk"][$line[0]] = utf8_encode($line[1]);
Or declare the file's real encoding to the browser so it will display correctly:
header('Content-Type: text/html; charset=ISO-8859-1');
Since the W3C recommends UTF-8 as the default encoding for the web, the first option should be preferred.
Alternatively, you can convert the entire file to UTF-8 in your favorite text editor and save it that way, so you don't have to convert it every time.
Using the PHP 5.3 fgetcsv function, I am experiencing some problems due to encoding. Note that the file has Spanish "special" Latin characters, i.e. accented vowels like á, é, í, ï, etc.
I get the CSV file by exporting some structured data I have in an Excel 2008 for Mac file.
If I open it with the Mac OS X TextEdit application, everything looks perfect.
But when I get down to my PHP program and try to read the CSV using the fgetcsv PHP function, the charset is not read properly.
/**
 * @Route("/cvsLoad", name="_csv_load")
 * @Template()
 */
public function cvsLoadAction()
{
    //setlocale(LC_ALL, 'es_ES.UTF-8');
    $reader = new Reader($this->get('kernel')->getRootDir().'/../web/uploads/documents/question_images/2/41/masiva.csv');
    $i = 1;
    $r = array("hhh" => $reader->getAll());
    return new Response(json_encode($r, 200));
}
As you can see, I have also tried to use setlocale with es_ES.UTF-8, but nothing gets it working.
The reading part is here:
public function getRow()
{
    if (($row = fgetcsv($this->_handle, 10000, $this->_delimiter)) !== false) {
        $this->_line++;
        return $this->_headers ? array_combine($this->_headers, $row) : $row;
    } else {
        return false;
    }
}
Here is what I get in the $row variable after reading each row (the output is shown as a screenshot in the original post):
Those ? characters are supposed to be accented vowels.
Any clue? Would it work if I used MS Excel for Windows? How can I know the exact encoding of the file at runtime and set it before reading it?
(For the Spanish speakers: don't be frightened by the awful medical content in those texts ;)).
Try this:
function convert($str) {
    return iconv("Windows-1252", "UTF-8", $str);
}

public function getRow()
{
    if (($row = fgetcsv($this->_handle, 10000, $this->_delimiter)) !== false) {
        $row = array_map("convert", $row);
        $this->_line++;
        return $this->_headers ? array_combine($this->_headers, $row) : $row;
    } else {
        return false;
    }
}
This is likely due to the way Excel encodes the file when saving.
Try uploading the .xls file to Google Docs and downloading it as a .csv.
I'm working on a CSV import script in PHP. It works fine, except for foreign characters at the beginning of a field.
The code looks like this:
if (($handle = fopen($filename, "r")) !== FALSE)
{
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE)
        $teljing[] = $data;
    fclose($handle);
}
Here is a data example showing my issue
føroyskir stavir, "Kr. 201,50"
óvirkin ting, "Kr. 100,00"
This will result in the following
array
(
    [0] => array
        (
            [0] => 'føroyskir stavir',
            [1] => 'Kr. 201,50'
        )
    [1] => array
        (
            [0] => 'virkin ting', <--- Should be 'óvirkin ting'
            [1] => 'Kr. 100,00'
        )
)
I have seen this behavior documented in some comments on php.net, and I have tried ini_set('auto_detect_line_endings', TRUE); to detect line endings. No success.
Anyone familiar with this issue?
Edit:
Thank you AJ, this issue is now solved.
setlocale(LC_ALL, 'en_US.UTF-8');
was the solution.
From the PHP manual for fgetcsv():
"Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function."
Copied from the PHP.net/fgetcsv comments:
kent at marketruler dot com, 04-Feb-2010:
Note that fgetcsv, at least in PHP 5.3 or previous, will NOT work with UTF-16 encoded files. Your options are to convert the entire file to ISO-8859-1 (or latin1), or to convert line by line, converting each line into ISO-8859-1 encoding, and then use str_getcsv (or a compatible backwards-compatible implementation). If you need to read non-Latin alphabets, it's probably best to convert to UTF-8.
See str_getcsv for a backwards-compatible version of it for PHP < 5.3, and see utf8_decode for a function written by Rasmus Andersson which provides utf16_decode.
The modification I added is that the BOM appears at the top of the file, but not on subsequent lines, so you need to store the endianness and pass it back in when decoding each subsequent line. This modified version returns the endianness if it is not supplied:
<?php
/**
 * Decode UTF-16 encoded strings.
 *
 * Can handle both BOM'ed data and un-BOM'ed data.
 * Assumes Big-Endian byte order if no BOM is available.
 * From: http://php.net/manual/en/function.utf8-decode.php
 *
 * @param string $str UTF-16 encoded data to decode.
 * @return string UTF-8 / ISO encoded data.
 * @access public
 * @version 0.1 / 2005-01-19
 * @author Rasmus Andersson {@link http://rasmusandersson.se/}
 * @package Groupies
 */
function utf16_decode($str, &$be = null) {
    if (strlen($str) < 2) {
        return $str;
    }
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);
    $start = 0;
    if ($c0 == 0xFE && $c1 == 0xFF) {
        // big-endian BOM
        $be = true;
        $start = 2;
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        // little-endian BOM
        $start = 2;
        $be = false;
    }
    if ($be === null) {
        // no BOM: assume big-endian
        $be = true;
    }
    $len = strlen($str);
    $newstr = '';
    for ($i = $start; $i < $len; $i += 2) {
        // combine the two bytes of each UTF-16 code unit
        if ($be) {
            $val = ord($str[$i]) << 8;
            $val += ord($str[$i + 1]);
        } else {
            $val = ord($str[$i + 1]) << 8;
            $val += ord($str[$i]);
        }
        $newstr .= ($val == 0x0A) ? "\n" : chr($val);
    }
    return $newstr;
}
?>
Trying the "setlocale" trick did not work for me, e.g.
<?php
setlocale(LC_CTYPE, "en.UTF16");
$line = fgetcsv($file, ...)
?>
But that's perhaps because my platform didn't support it. However, fgetcsv only supports single characters for the delimiter, etc., and complains if you pass in a UTF-16 version of said character, so I gave up on that rather quickly.
Hope this is helpful to someone out there.
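For completeness, a rough usage sketch of the approach described above, decoding the UTF-16 data first and then parsing it with str_getcsv (this is not from the original comment; the file name is hypothetical, and it assumes the decoded characters fit into single-byte output):
$raw  = file_get_contents('utf16.csv');   // hypothetical UTF-16 encoded file
$be   = null;                             // endianness is detected from the BOM
$text = utf16_decode($raw, $be);
$rows = array();
foreach (preg_split('/\r\n|\r|\n/', trim($text)) as $line) {
    $rows[] = str_getcsv($line, ';');
}
print_r($rows);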