php fgetcsv - charset encoding problems

Using the PHP 5.3 fgetcsv function, I am experiencing some problems due to encoding issues. Note that the file has Spanish "special" Latin characters with accents, like á, é, í, ï, etc.
I get the CSV file by exporting some structured data I have in an Excel 2008 for Mac file.
If I open it with the Mac OS X TextEdit application, everything looks perfect.
But when I get down to my PHP program and try to read the CSV using fgetcsv, it does not read the charset properly.
/**
 * @Route("/cvsLoad", name="_csv_load")
 * @Template()
 */
public function cvsLoadAction(){
    //setlocale(LC_ALL, 'es_ES.UTF-8');
    $reader = new Reader($this->get('kernel')->getRootDir().'/../web/uploads/documents/question_images/2/41/masiva.csv');
    $i = 1;
    $r = array("hhh" => $reader->getAll());
    return new Response(json_encode($r), 200);
}
As you can see, I have also tried using setlocale with es_ES.UTF-8, but nothing got it working.
The reading part is here:
public function getRow()
{
    if (($row = fgetcsv($this->_handle, 10000, $this->_delimiter)) !== false) {
        $this->_line++;
        return $this->_headers ? array_combine($this->_headers, $row) : $row;
    } else {
        return false;
    }
}
This is what I get in the $row variable after each row is read (screenshot omitted): the ? characters are supposed to be accented vowels.
Any clue? Would it work if I used MS Excel for Windows? How can I know at run time the exact encoding of the file and set it before reading it?
(For the Spanish speakers: don't be frightened by the awful medical terminology in those texts ;).)

Try this:
function convert( $str ) {
    return iconv( "Windows-1252", "UTF-8", $str );
}
public function getRow()
{
    if (($row = fgetcsv($this->_handle, 10000, $this->_delimiter)) !== false) {
        $row = array_map( "convert", $row );
        $this->_line++;
        return $this->_headers ? array_combine($this->_headers, $row) : $row;
    } else {
        return false;
    }
}
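As for detecting the encoding at run time (the last part of the question): reliable auto-detection of single-byte encodings such as Windows-1252, Mac Roman, or ISO-8859-1 is not really possible, because almost any byte sequence is valid in all of them. What you can do cheaply is check whether the file is already valid UTF-8 and otherwise convert from an assumed single-byte encoding. A minimal sketch; the function name, the Windows-1252 fallback, the ';' delimiter, and the file path are all assumptions, not something taken from the question:

function loadCsvAsUtf8($path, $fallbackEncoding = 'Windows-1252') {
    $raw = file_get_contents($path);
    if (!mb_check_encoding($raw, 'UTF-8')) {
        // not valid UTF-8: assume the single-byte fallback and convert it
        $raw = iconv($fallbackEncoding, 'UTF-8//TRANSLIT', $raw);
    }
    return $raw;
}

// Usage sketch: convert first, then parse line by line with str_getcsv().
$content = loadCsvAsUtf8('/path/to/masiva.csv');
foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
    if ($line === '') { continue; }
    print_r(str_getcsv($line, ';'));
}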

This is likely to do with the way Excel encodes the file when saving.
Try uploading the .xls file to Google Docs and downloading it as a .csv.

Related

Irish accents get changed to weird characters while processing the CSV file in PHP (Yii 1.1 framework)

I know a lot of similar questions have already been asked in this community, but unfortunately none of them worked for me.
I have a CSV sheet which I need to import into our system. The sheet imports without any issue on Linux (creating the sheet with LibreOffice), even with Irish characters.
But the main problem starts in Windows and iOS environments with Excel (MS Excel), where the character encoding gets changed, and a few of the Irish characters like Ž, Ŕ and many others are turned into different symbols.
P.S.: The CSV works fine if we create it with Numbers on iOS.
Below is the PHP method with which I'm reading the CSV sheet.
$path = CUploadedFile::getInstance($model, 'absence_data_file'); // Get the instance of selected file
$target = ['First Name', 'Last Name', 'Class', 'Year', 'From']; // Valid Header

public static function readCSV($path, $target) {
    $updated_header = array();
    $data = array();
    if ($path->type == 'text/csv' || $path->type == 'application/vnd.ms-excel' || $path->type == 'text/plain' || $path->type == 'text/tsv') {
        $fp = fopen($path->tempName, 'r');
        $encoding_type = mb_detect_encoding(file_get_contents($path->tempName));
        if ($fp !== FALSE) {
            $header = fgetcsv($fp);
            foreach ($header as $h) {
                $updated_header[] = $h;
            }
            $updated_header = array_map('trim', array_values($updated_header));
            if (array_diff($target, $updated_header)) {
                $errormessage = 'Invalid header format.';
                return $errormessage;
            } else {
                while ($ar = fgetcsv($fp)) {
                    $data[] = array_combine($updated_header, $ar);
                }
                $data['file_encoding'] = $encoding_type;
                return $data;
            }
        }
    } else {
        $errormessage = "Invalid File type, You can import CSV files only";
        return $errormessage;
    }
}
Sheet which I'm importing (screenshot not included):
Printing the data (first record) - screenshot not included.
I'm not sure about the Irish code page, but if it is Western European as you mentioned in your comment, I'm guessing your code page would be ISO-8859-1 or ISO-8859-14, and your line of code should be:
$encoding_type = mb_detect_encoding(file_get_contents($path->tempName), 'ISO-8859-1', true);
or simply the following, since you are sure its encoding is ISO-8859-1:
$encoding_type = 'ISO-8859-1';
The second and third parameters of mb_detect_encoding tell the function to strictly try to detect ISO-8859-1. If you want to try other code pages at the same time, you can pass a comma-separated list of code pages as the second parameter, e.g. 'UTF-8, ISO-8859-1'.
Note that you will need to call mb_convert_encoding to actually get the file into your desired encoding, so the following code will strictly try to convert from ISO-8859-1 to UTF-8:
$UTF8_text = mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1');
If you insist on using fgetcsv, have a look at mb_internal_encoding (https://www.php.net/manual/en/function.mb-internal-encoding.php); it sets the default encoding.
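A hedged sketch of how the detection and the conversion could be combined inside readCSV(): detect the encoding once against a short candidate list, then convert each parsed row to UTF-8. The candidate list and the anonymous-function conversion are assumptions, not part of the original code:

$content  = file_get_contents($path->tempName);
// strict detection against a short candidate list (adjust it to your environment)
$encoding = mb_detect_encoding($content, 'UTF-8, ISO-8859-1, Windows-1252', true);

$data = array();
$fp = fopen($path->tempName, 'r');
while (($row = fgetcsv($fp)) !== false) {
    if ($encoding !== false && $encoding !== 'UTF-8') {
        // convert every cell of the row to UTF-8 before using it
        $row = array_map(function ($cell) use ($encoding) {
            return mb_convert_encoding($cell, 'UTF-8', $encoding);
        }, $row);
    }
    $data[] = $row;
}
fclose($fp);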

Hebrew letters ignored selectively by fgetcsv

I'm trying to read a CSV file in Hebrew in order to insert multiple posts into WordPress.
I've saved the Excel sheet as CSV (comma delimited).
After some encoding manipulation in Sublime Text, I see the Hebrew content normally in any text editor.
However, when I try to read the contents of the file using fgetcsv, the Hebrew letters are ignored selectively, i.e. the letters in a field which are preceded by a number or a Latin letter ARE shown correctly; Hebrew letters before the number/Latin letter are ignored and omitted from the output.
If I use file_get_contents and var_dump it, I get the entire content correctly, so it stands to reason that the problem lies with fgetcsv.
Code in functions.php:
function csv_to_array($filename='', $delimiter=',')
{
    if(!file_exists($filename) || !is_readable($filename)) {
        return FALSE;
    }
    $header = NULL;
    $data = array();
    if (($handle = fopen($filename, 'r')) !== FALSE)
    {
        while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
        {
            if(!$header):
                $header = $row;
            else:
                $data[] = $row;
            endif;
        }
        fclose($handle);
    }
    return $data;
}
Used as follows:
if (isset($_FILES['events'])) {
    extract($_FILES['events']);
    $events = csv_to_array($tmp_name);
It's not very likely that the language that gave the world T_PAAMAYIM_NEKUDOTAYIM now has problems with Hebrew letters ;-).
Checking the encoding of the strings (var_dump might not be enough!) and Manvel's solution to this question might be of help to you:
The problem is that the function returns UTF-8 (you can check using mb_detect_encoding), but does not convert, and these characters are taken as UTF-8. Therefore, it's necessary to do the reverse conversion to the initial encoding (Windows-1251 or CP1251) using iconv. But since fgetcsv returns an array, I suggest writing a custom function:
function customfgetcsv(&$handle, $length, $separator = ';'){
    if(($buffer = fgets($handle, $length)) !== false) {
        return explode( $separator, iconv( "CP1251", "UTF-8", $buffer ) );
    }
    return false;
}
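Note that the quoted helper converts from CP1251, which is the Cyrillic code page; for Hebrew text saved by Excel on Windows, the corresponding single-byte code page is normally CP1255 (Windows-1255). Also, fgetcsv is locale-sensitive, which can by itself explain the "swallowed leading Hebrew letters" symptom. A hedged adaptation, assuming the file really is CP1255 and comma-delimited:

// if the file is actually UTF-8, setting a UTF-8 locale (when it is installed) is
// often enough on its own, because fgetcsv honours the locale when splitting lines
setlocale(LC_ALL, 'he_IL.UTF-8');

function fgetcsv_hebrew(&$handle, $length, $separator = ',') {
    if (($buffer = fgets($handle, $length)) !== false) {
        // str_getcsv (not explode) so quoted fields containing commas survive
        return str_getcsv(iconv('CP1255', 'UTF-8', $buffer), $separator);
    }
    return false;
}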

PHP strpos() finds just single characters, not a whole string

I have a strange problem...
I would like to search in a log file.
$lines = file($file);
$sampleName = "T3173sGas";
foreach ($lines as &$line) {
    if (strpos($line, $sampleName) !== false) {
        echo "yes";
    }
}
This code is not working, even though $sampleName is definitely (100%) in the log file. The search only works for single characters, for example "T" or "3", but not for "T3".
Do you have an idea why it's not working? Is the encoding of the log file wrong?
Thanks a lot for your help!
If you can only find single characters, I would assume that your log file is in some multi-byte character set like UTF-16. Since you already suspect something similar, the next step is to consult the documentation/specification of the log file you're working with regarding its character encoding.
You can then use encoding-aware string functions; the extension is called mbstring (http://php.net/mbstring).
$encoding = ... ; // encoding of the log file
if (mb_strpos($line, $sampleName, 0, $encoding) !== false) {
    echo "yes";
}
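If the log file does turn out to be UTF-16 (common for Windows-produced logs), another option is to convert the whole file to UTF-8 once and then keep using the ordinary string functions. A minimal sketch, assuming UTF-16LE with a BOM; adjust the source encoding to what you actually find:

$raw = file_get_contents($file);
// strip a UTF-16LE BOM if present (the 'UTF-16LE' source encoding is an assumption)
if (substr($raw, 0, 2) === "\xFF\xFE") {
    $raw = substr($raw, 2);
}
$text = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

foreach (explode("\n", $text) as $line) {
    if (strpos($line, $sampleName) !== false) {
        echo "yes";
    }
}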
This may work; it searches the file contents for the entire string:
<?php
$filename = 'test.php';
$file = file_get_contents($filename);
$sampleName = "T3173sGas";
if (strlen(strstr($file, $sampleName)) > 0)
{
    echo "yes";
}
?>

UTF-8 problems while reading CSV file with fgetcsv

I'm trying to read a CSV file and echo its content, but the special characters are displayed incorrectly:
Mäx Müstermänn -> the umlauts come out as other characters.
The encoding of the CSV file is UTF-8 without BOM (checked with Notepad++).
This is the content of the CSV file:
"Mäx";"Müstermänn"
My PHP script:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
$handle = fopen("specialchars.csv", "r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv($handle, 1000, ";")) {
    $num = count($data);
    for ($c = 0; $c < $num; $c++) {
        // output data
        echo "<td>$data[$c]</td>";
    }
    echo "</tr><tr>";
}
?>
</body>
</html>
I tried to use setlocale(LC_ALL, 'de_DE.utf8'); as suggested here, without success. The content is still displayed incorrectly.
What am I missing?
Edit:
echo mb_detect_encoding($data[$c], 'UTF-8'); gives me UTF-8 UTF-8.
echo file_get_contents("specialchars.csv"); gives me "Mäx";"Müstermänn".
And
print_r(str_getcsv(reset(explode("\n", file_get_contents("specialchars.csv"))), ';'))
gives me
Array ( [0] => Mäx [1] => Müstermänn )
What does that mean?
Try this:
<?php
$handle = fopen("specialchars.csv", "r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv($handle, 1000, ";")) {
    $data = array_map("utf8_encode", $data); // added
    $num = count($data);
    for ($c = 0; $c < $num; $c++) {
        // output data
        echo "<td>$data[$c]</td>";
    }
    echo "</tr><tr>";
}
?>
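Note that utf8_encode() blindly assumes the input is ISO-8859-1 (and it is deprecated as of PHP 8.2), so this only helps if the file really is Latin-1 rather than UTF-8. A sketch of an equivalent conversion with the source encoding spelled out; replace 'ISO-8859-1' with whatever the file actually uses:

$data = array_map(function ($cell) {
    // explicit source encoding instead of the implicit ISO-8859-1 of utf8_encode()
    return mb_convert_encoding($cell, 'UTF-8', 'ISO-8859-1');
}, $data);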
I encountered a similar problem: parsing a CSV file with special characters like é, è, ö, etc.
The following worked fine for me:
To represent the characters correctly on the HTML page, this header was needed:
header('Content-Type: text/html; charset=UTF-8');
In order to parse every character correctly, I used:
utf8_encode(fgets($file));
Don't forget to use the multibyte string functions in all subsequent string operations, like:
mb_strtolower($value, 'UTF-8');
In my case the source file has Windows-1250 encoding, and iconv prints tons of notices about illegal characters in the input string...
So this solution helped me a lot:
/**
 * Getting a CSV row as an array with UTF-8 encoding.
 *
 * @param resource &$handle
 * @param integer  $length
 * @param string   $separator
 *
 * @return array|false
 */
private function fgetcsvUTF8(&$handle, $length, $separator = ';')
{
    if (($buffer = fgets($handle, $length)) !== false)
    {
        $buffer = $this->autoUTF($buffer);
        return str_getcsv($buffer, $separator);
    }
    return false;
}

/**
 * Automatic conversion of Windows-1250 and ISO-8859-2 input into a UTF-8 string.
 *
 * @param string $s
 *
 * @return string
 */
private function autoUTF($s)
{
    // detect UTF-8
    if (preg_match('#[\x80-\x{1FF}\x{2000}-\x{3FFF}]#u', $s))
        return $s;
    // detect WINDOWS-1250
    if (preg_match('#[\x7F-\x9F\xBC]#', $s))
        return iconv('WINDOWS-1250', 'UTF-8', $s);
    // assume ISO-8859-2
    return iconv('ISO-8859-2', 'UTF-8', $s);
}
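Usage would look roughly like this (a sketch: the $handle, the length of 10000, and the ';' delimiter are assumptions carried over from the snippets above):

$rows = array();
while (($row = $this->fgetcsvUTF8($handle, 10000, ';')) !== false) {
    // each row is now UTF-8, whether the line was Windows-1250 or ISO-8859-2
    $rows[] = $row;
}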
In response to @manvel's answer: use str_getcsv instead of explode, because of cases like this:
some;nice;value;"and;here;comes;combinated;value";and;some;others
explode will split the string into these parts:
some
nice
value
"and
here
comes
combinated
value"
and
some
others
but str_getcsv will split the string into these parts:
some
nice
value
and;here;comes;combinated;value
and
some
others
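For reference, the difference comes down to a single call, since str_getcsv understands the CSV quoting rules and explode does not:

$line = 'some;nice;value;"and;here;comes;combinated;value";and;some;others';
print_r(explode(';', $line));    // 11 parts: the quoted field is torn apart
print_r(str_getcsv($line, ';')); // 7 parts: the quoted field stays intact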
Try putting this at the top of your file (before any other output):
<?php
header('Content-Type: text/html; charset=UTF-8');
?>
The problem is that the function returns UTF-8 (you can check using mb_detect_encoding), but does not convert, and these characters are taken as UTF-8. Therefore, it's necessary to do the reverse conversion to the initial encoding (Windows-1251 or CP1251) using iconv. But since fgetcsv returns an array, I suggest writing a custom function:
[Sorry for my English]
function customfgetcsv(&$handle, $length, $separator = ';'){
    if (($buffer = fgets($handle, $length)) !== false) {
        return explode($separator, iconv("CP1251", "UTF-8", $buffer));
    }
    return false;
}
Now I got it working (after removing the header command). I think the problem was that the encoding of the PHP file itself was ISO-8859-1. I set it to UTF-8 without BOM. I thought I had already done that, but perhaps I made an additional undo.
Furthermore, I used SET NAMES 'utf8' for the database. Now it is also correct in the database.
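Since the database connection charset matters as much as the file encoding, the SET NAMES step can also be expressed directly in the PDO DSN, which avoids forgetting it on some connections. A sketch with placeholder credentials (host, database, user, and password are not from the question):

$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4', // utf8mb4 covers all of UTF-8
    'user',
    'password',
    array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
);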

parse remote csv-file with PHP on GAE

I seem to be in a catch-22 with a small app I'm developing in PHP on Google App Engine using Quercus:
I have a remote CSV file which I can download and store in a string.
To parse that string I'd ideally use str_getcsv, but Quercus doesn't have that function yet.
Quercus does seem to know fgetcsv, but that function expects a file handle, which I don't have (and I can't make a new one, as GAE doesn't allow files to be created).
Has anyone got an idea of how to solve this without having to dismiss the built-in PHP CSV parser functions and write my own parser instead?
I think the simplest solution really is to write your own parser. It's a piece of cake anyway and will get you to learn more regex; it makes no sense that there is no CSV string-to-array parser in PHP, so it's totally justified to write your own. Just make sure it's not too slow ;)
You might be able to create a new stream wrapper using stream_wrapper_register.
Here's an example from the manual which reads global variables: http://www.php.net/manual/en/stream.streamwrapper.example-1.php
You could then use it like a normal file handle:
$csvStr = '...';
$fp = fopen('var://csvStr', 'r+');
while ($row = fgetcsv($fp)) {
    // ...
}
fclose($fp);
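For completeness, a minimal sketch of such a wrapper, modeled on the manual's global-variable example; the class name, the var:// scheme, and the read-only behaviour are illustrative choices, not the only way to do it:

class StringStream {
    public $context;     // populated by PHP when a stream context is used
    private $data = '';
    private $pos = 0;

    public function stream_open($path, $mode, $options, &$opened_path) {
        // the "host" part of var://name selects a global variable holding the string
        $name = parse_url($path, PHP_URL_HOST);
        if (!isset($GLOBALS[$name])) {
            return false;
        }
        $this->data = (string) $GLOBALS[$name];
        return true;
    }

    public function stream_read($count) {
        $chunk = (string) substr($this->data, $this->pos, $count);
        $this->pos += strlen($chunk);
        return $chunk;
    }

    public function stream_eof() {
        return $this->pos >= strlen($this->data);
    }

    public function stream_stat() {
        return array();
    }
}

stream_wrapper_register('var', 'StringStream');

$csvStr = "a;b;c\n1;2;3\n"; // example data, not the real remote file
$fp = fopen('var://csvStr', 'r');
while (($row = fgetcsv($fp, 0, ';')) !== false) {
    print_r($row);
}
fclose($fp);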
This shows a simple manual parser I wrote, with example input using the qualified, non-qualified, and escape features. It can be used for the header and data rows, and I included an assoc-array function to turn your data into a key/value-style array.
//example data
$fields = strparser('"first","second","third","fourth","fifth","sixth","seventh"');
print_r(makeAssocArray($fields, strparser('"asdf","bla\"1","bl,ah2","bl,ah\"3",123,34.234,"k;jsdfj ;alsjf;"')));

//do something like this
$fields = strparser(<csvfirstline>);
foreach ($lines as $line)
    $data = makeAssocArray($fields, strparser($line));

function strparser($string, $div = ",", $qual = "\"", $esc = "\\") {
    $buff = "";
    $data = array();
    $isQual = false; //the current value came from a qualifier
    $inQual = false; //currently parsing inside a qualifier
    //iterate through the string byte by byte
    for ($i = 0; $i < strlen($string); $i++) {
        switch ($string[$i]) {
            case $esc:
                //add the next byte to the buffer and skip it
                $buff .= $string[$i+1];
                $i++;
                break;
            case $qual:
                //opening or closing qualifier
                if (!$inQual) {
                    $isQual = true;
                    $inQual = true;
                    break;
                } else {
                    $inQual = false; //done parsing the qualifier
                    break;
                }
            case $div:
                if (!$inQual) {
                    $data[] = $buff; //add value to data
                    $buff = "";      //reset buffer
                    break;
                }
                //inside a qualifier the divider is literal text: fall through
            default:
                $buff .= $string[$i];
        }
    }
    //get the last item, as it has no trailing divider
    $data[] = $buff;
    return $data;
}

function makeAssocArray($fields, $data) {
    $array = array();
    foreach ($fields as $key => $field)
        $array[$field] = $data[$key];
    return $array;
}
If it can be dirty and quick, I would just use exec (http://php.net/manual/en/function.exec.php) to pass the file in and use sed and awk (http://shop.oreilly.com/product/9781565922259.do) to parse it. I know you wanted to use the PHP parser; I've tried it before and failed, simply because it's not vocal about its errors.
Hope this helps.
Good luck.
You might be able to use fopen with php://temp or php://memory (php.net) to get it to work. What you would do is open either php://temp or php://memory, write to it, then rewind it (php.net), and then pass it to fgetcsv. I didn't test this, but it might work.
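A minimal sketch of that idea (untested on Quercus/GAE, as the answer says; the CSV string is just example data): write the downloaded string to php://temp, rewind, and hand the handle to fgetcsv. If temp files are off-limits, php://memory stays entirely in RAM:

$csvStr = "col1;col2\nval1;val2\n"; // placeholder for the downloaded content

$fp = fopen('php://temp', 'r+'); // memory-backed; spills to a temp file only when large
fwrite($fp, $csvStr);
rewind($fp);                     // move the pointer back to the start before reading

while (($row = fgetcsv($fp, 0, ';')) !== false) {
    print_r($row);
}
fclose($fp);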
