fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix? - php

Similar questions:
Some characters in CSV file are not read during PHP fgetcsv(),
fgetcsv() ignores special characters when they are at the beginning of line
My application has a form where users can upload a CSV file (its 5 internal users have always uploaded a valid file: comma-delimited, quoted, records ending with LF), and the file is then imported into a database using PHP:
$fhandle = fopen($uploaded_file, 'r');
while ($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}
For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.
The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped by fgetcsv(). Example data:
"15","Ústav"
"420","Špičák"
"7","Tmaň"
becomes
Array (
    0 => 15
    1 => "stav"
)
Array (
    0 => 420
    1 => "pičák"
)
Array (
    0 => 7
    1 => "Tma"
)
(Note that č is kept, but Ú is dropped)
The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken and should I look for a different way to parse CSV?

It turns out that I didn't read the documentation well enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII < 127, but the documentation also says:
Note:
Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.
In other words, fgetcsv() tries to be binary-safe, but it isn't: it also applies charset handling based on the current locale, so it can mangle the data it reads. Worse, this setting is not configured in php.ini but inherited from the $LANG environment variable.
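In principle one could also make the locale match the file instead of the other way around. This is a sketch only: the locale names below are pure assumptions, and whether a Windows-1250 locale exists at all depends on the server.

setlocale(LC_CTYPE, 'cs_CZ.CP1250', 'czech'); // assumed names; they vary per system
$fhandle = fopen($uploaded_file, 'r');
while ($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row); // diacritics should survive if the locale actually took effect
}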
I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from the comment in the docs to parse them into an array:
$fhandle = fopen($uploaded_file, 'r');
while ($raw_row = fgets($fhandle)) { // fgets is actually binary safe
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}
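For reference, here is a minimal sketch of what such a helper can look like - an assumption-laden stand-in for the docs-comment version, not a copy of it. The point is that it works on raw bytes, so a single-byte encoding like Windows-1250 passes through untouched:

// Minimal byte-level CSV line parser (sketch). Handles quoted fields and
// doubled enclosures; works on bytes, never on "characters".
function csvstring_to_array($line, $delimiter = ',', $enclosure = '"', $eol = "\n") {
    $line = rtrim($line, $eol . "\r");
    $fields = array();
    $field = '';
    $in_quotes = false;
    for ($i = 0, $len = strlen($line); $i < $len; $i++) {
        $byte = $line[$i];
        if ($byte === $enclosure) {
            // a doubled enclosure inside quotes is a literal quote character
            if ($in_quotes && $i + 1 < $len && $line[$i + 1] === $enclosure) {
                $field .= $enclosure;
                $i++;
            } else {
                $in_quotes = !$in_quotes;
            }
        } elseif ($byte === $delimiter && !$in_quotes) {
            $fields[] = $field;
            $field = '';
        } else {
            $field .= $byte;
        }
    }
    $fields[] = $field;
    return $fields;
}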

Related

Wrong encoding of Word files uploaded to my server and processed by PHP script

I have a simple function (not mine, actually) that opens and reads Word 97-2003 .doc files:
function docToText($arrayOfFileProperties)
{
    $pathToFile = $arrayOfFileProperties[1];
    $response = '';
    if (file_exists($pathToFile))
    {
        if (($fh = fopen($pathToFile, 'r')) !== false)
        {
            // the text length is stored as a 32-bit little-endian value
            // starting at offset 0x21C of the header block read here
            $headers = fread($fh, 0xA00);
            $n1 = ord($headers[0x21C]) - 1;
            $n2 = (ord($headers[0x21D]) - 8) * 256;
            $n3 = ord($headers[0x21E]) * 256 * 256;
            $n4 = ord($headers[0x21F]) * 256 * 256 * 256;
            $textLength = $n1 + $n2 + $n3 + $n4;
            $extracted_plaintext = fread($fh, $textLength);
            return mb_convert_encoding($extracted_plaintext, 'ISO-8859-5');
        }
    }
    return $response;
}
$pathToFile is the property holding the path to an uploaded .doc file. The file is opened with fopen and then read with fread.
When people upload files containing Latin script, the resulting string is absolutely OK (UTF-8). But I have users who occasionally upload Russian .doc files, and in that case I get a string of gibberish; various Russian decoding portals tell me it is UTF-8-encoded Cyrillic text. I understand this is because PHP does not know the encoding of the Russian files being uploaded and just uses UTF-8 to give me at least something. And, as you can see, I am trying to encode the output into ISO-8859-5 (why this particular encoding? see below).
The thing is that you can use https://www.artlebedev.ru/decoder/, for example, to try to detect the encoding of your Cyrillic text. It successfully converts my UTF-8-encoded Russian text into absolutely readable Russian ISO-8859-5, so, naturally, I tried to use this conversion info to decode the output of my function so that I could see the Cyrillic script properly. To no avail.
Neither mb_convert_encoding nor iconv is able to convert my gibberish to readable Russian text, even when I give 'ISO-8859-5' as the target encoding. Moreover, as far as I can see, it does not even matter what target encoding I pass to mb_convert_encoding or iconv - the text does not change at all. I still get the UTF-8 gibberish of my Cyrillic text.
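One detail that may matter here: with only two arguments, mb_convert_encoding() converts from PHP's internal encoding, not from whatever the bytes actually are - which would explain why changing the target encoding never changes the output. A sketch with an explicit source encoding; 'CP1251' is purely an assumption (Russian .doc text is often stored in that ANSI code page, or in UTF-16LE for Unicode documents):

// The third argument tells the converter what the raw bytes actually are.
// 'CP1251' is an assumed source encoding, not confirmed by the question.
$readable = mb_convert_encoding($extracted_plaintext, 'UTF-8', 'CP1251');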
Is there any way to manipulate encodings in my case? There are only some 3 or 4 Cyrillic encodings in play, with one or two covering almost all cases.
Am I missing something? Thank you!

binary safe write on file with php to create a DBF file

I need to split a big DBF file using PHP functions; this means that if I have, for example, 1000 records, I have to create 2 files with 500 records each.
I do not have any dbase extension available, nor can I install one, so I have to work with basic PHP functions. Using the basic fread function I'm able to correctly read and parse the file, but when I try to write a new DBF I have some problems.
As I have understood it, the DBF file is structured as a 2-line file: the first line contains file info and header info and is binary. The second line contains the data and is plain text. So I thought to simply write a new binary file replicating the first line, then manually adding the first half of the records to the first file and the remaining records to the other file.
That's the code I use to parse the file, and it works nicely:
$fdbf = fopen($_FILES['userfile']['tmp_name'], 'r');
$fields = array();
$buf = fread($fdbf, 32);
$header = unpack("VRecordCount/vFirstRecord/vRecordLength", substr($buf, 4, 8));
$goon = true;
$unpackString = '';
while ($goon && !feof($fdbf)) { // read fields:
    $buf = fread($fdbf, 32);
    if (substr($buf, 0, 1) == chr(13)) { $goon = false; } // end of field list
    else {
        $field = unpack("a11fieldname/A1fieldtype/Voffset/Cfieldlen/Cfielddec", substr($buf, 0, 18));
        $unpackString .= "A$field[fieldlen]$field[fieldname]/";
        array_push($fields, $field);
    }
}
fseek($fdbf, 0);
$first_line = fread($fdbf, $header['FirstRecord'] + 1);
fseek($fdbf, $header['FirstRecord'] + 1); // move back to the start of the first record (after the field definitions)
$first_line is the variable that contains the header data, but when I try to write it to a new file something goes wrong and the row isn't written exactly as it was read. This is the code I use for writing:
$handle_log = fopen($new_filename, "wb");
fwrite($handle_log, $first_line, strlen($first_line));
fwrite($handle_log, $string);
fclose($handle_log);
I've tried adding the b value to the fopen mode parameter, as suggested, to open the file in binary mode; I've also followed a suggestion to pass exactly the length of the string to avoid some characters being stripped, but without success, since none of the written files are in correct DBF format. What can I do to achieve my goal?
As I have understood it, the DBF file is structured as a 2-line file: the first line contains file info and header info and is binary. The second line contains the data and is plain text.
Well, it's a bit more complicated than that.
See here for a full description of the dbf file format.
So it would be best if you could use a library to read and write the dbf files.
If you really need to do this yourself, here are the most important parts:
DBF is a binary file format, so you have to read and write it as binary. For example, the number of records is stored in a 32-bit integer, which can contain zero bytes.
Don't process that binary data as text. (PHP's own string functions such as strlen() are binary-safe and won't stop at a null byte, but anything that reinterprets the bytes as text - encoding conversion, or the line-ending translation you get without the b flag on Windows - will corrupt the data.)
If you split the file (the records), you'll have to adjust the record count in the header (see the sketch after this list).
When splitting the records, keep in mind that each record is preceded by an extra byte: a space (0x20) if the record is not deleted, an asterisk (0x2A) if it is deleted. For example, if you have 4 fields of 10 bytes, the length of each record will be 41. That value is also available in the header: bytes 10-11 hold a 16-bit number (least significant byte first) giving the number of bytes per record.
The file could end with the end-of-file marker 0x1A, so you'll have to check for that as well.
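Putting those points together, a minimal sketch of writing one split file. It reuses $fdbf, $first_line, $header and $new_filename from the question's code; the record count of 500 is just the example number, and note that the question's code reads the header with one extra byte, which is trimmed off here:

// Trim the header to its exact size (the question's fread used FirstRecord+1)
$raw_header = substr($first_line, 0, $header['FirstRecord']);
// Patch the 32-bit little-endian record count stored at header offset 4
$patched_header = substr_replace($raw_header, pack('V', 500), 4, 4);

$out = fopen($new_filename, 'wb'); // b flag: no line-ending translation
fwrite($out, $patched_header);
fseek($fdbf, $header['FirstRecord']); // start of the first record (incl. its deletion flag)
for ($i = 0; $i < 500; $i++) {
    // copy each record verbatim; RecordLength includes the deletion-flag byte
    fwrite($out, fread($fdbf, $header['RecordLength']));
}
fwrite($out, chr(0x1A)); // optional end-of-file marker
fclose($out);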

PHP string comparison not working, file reading

I am using PHP to read in a tab-delimited CSV file and a pipe-delimited TXT file. Unfortunately, I cannot get a string comparison to work even though the characters appear to be exactly the same. I used trim to clean up hidden characters and I even tried type-casting to string.
Var dump shows they are clearly different but I am not sure how to make them the same?
// read in CSV file
$fh = fopen($mapping_date, 'r');
$mapping_data = fread($fh, filesize($mapping_date));
...
// use str_getcsv to put each line into an array
// get values out that I want to compare
$this_strategy = (string)trim($strategy_name);
$row_strategy = (string)trim($row3['_Strategy_Name']);
if ($this_strategy == $row_strategy) { /* do something */ }
var_dump($this_strategy);
Vardump: string(16) "Low Spend ($0.2)"
var_dump($row_strategy);
Vardump: string(31) "Low Spend ($0.2)"
Can't figure out for the life of me how to make this work.
Looks like you have the database encoded in UCS2 (assuming it's MySQL). http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-ucs2.html
You can possibly use iconv to convert the format - there's also an example in the comments on that page (which doesn't use iconv: http://php.net/manual/en/function.iconv.php#49171). I've not tested it.
Alternatively, change the database field encoding to utf8_generic or ASCII or whatever the file is encoded as?
Edit: Found the actual PHP function you want: mb_convert_encoding - UCS-2 is one of the supported encodings, so enable mbstring in php.ini and you're good to go.
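A sketch of what that could look like. The string(31) vs string(16) length mismatch - roughly two bytes per character - is what points at a two-byte encoding like UCS-2 in the first place, but the source encoding below is an assumption:

// Normalize the suspect value to UTF-8 before comparing.
// 'UCS-2LE' is an assumed source encoding, based on the byte-length mismatch.
$row_strategy = mb_convert_encoding(trim($row3['_Strategy_Name']), 'UTF-8', 'UCS-2LE');
$this_strategy = trim($strategy_name);
if ($this_strategy === $row_strategy) {
    // the strings now compare equal byte-for-byte
}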

Excel csv export into a php file with fgetcsv

I'm using Excel 2010 Professional Plus to create an Excel file.
Later on I'm trying to export it as a UTF-8 .csv file. I do this by saving it as CSV (delimiter-separated - sorry, I don't know the exact English wording, as I don't have the English version and I fear it isn't translated 1:1). There I click on Tools -> Web Options and select Unicode (UTF-8) as the encoding.
The example .csv is as follows:
ID;englishName;germanName
1;Austria;Österreich
So far so good, but if I now open the file with my PHP code:
header('Content-Type: text/html; charset=UTF-8');
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
setlocale(LC_ALL, 'de_DE.utf8');
$fp = fopen($filePathName, 'r');
while (($dataRow = fgetcsv($fp, 0, ";", '"')) !== FALSE)
{
    print_r($dataRow);
}
I get �sterreich as a result on the screen (that is the "error", so I cut all other parts of the result).
If I open the file with Notepad++ and look at the encoding, I see "ANSI" instead of UTF-8.
If I change the encoding in Notepad++ to UTF-8, the ö, ä, ... are replaced by special chars, which I have to correct manually.
If I go another route and create a new UTF-8 file with Notepad++, putting in the same data as in the Excel file, "Österreich" is shown on screen when I open it with the PHP file.
Now the question is: why does it not work with the Excel file - am I doing something wrong, or am I overlooking something?
Edit:
As the program will in the end be installed on Windows servers provided by customers, a solution is needed that does not require installing additional tools (PHP libraries etc. are OK, but having to install a VM or Cygwin is not). Also, there won't be an Excel (or Office) installation on the server, as the customer will upload the .csv file via a file upload dialog (the dialog itself is not part of the problem - I know how to handle those; I stumbled over the problem itself when I created an Excel file and converted it to .csv on a test machine where Excel was installed locally).
Thanks
From the PHP doc:
Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.
You can try
header('Content-Type: text/html; charset=UTF-8');
$fp = fopen("log.txt", "r");
echo "<pre>";
while (($dataRow = fgetcsv($fp, 1000, ";")) !== FALSE) {
    $dataRow = array_map("utf8_encode", $dataRow);
    print_r($dataRow);
}
Output
Array
(
    [0] => ID
    [1] => englishName
    [2] => germanName
)
Array
(
    [0] => 1
    [1] => Austria
    [2] => Österreich
)
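One caveat: utf8_encode() assumes the input is ISO-8859-1, while Windows "ANSI" is usually Windows-1252; the two differ in the 0x80-0x9F range (the Euro sign, curly quotes, etc.). If such characters come out wrong, an explicit conversion is safer - a sketch, with 'Windows-1252' as an assumed input encoding:

// Convert each field explicitly instead of relying on utf8_encode()'s
// ISO-8859-1 assumption; 'Windows-1252' is a guess at the real input.
$dataRow = array_map(function ($field) {
    return mb_convert_encoding($field, 'UTF-8', 'Windows-1252');
}, $dataRow);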
I don't know why Excel is generating an ANSI file instead of UTF-8 (as you can see in Notepad++), but if this is the case, you can convert the file using iconv:
iconv --from-code=ISO-8859-1 --to-code=UTF-8 my_csv_file.csv > my_csv_file_utf8.csv
And for the people from the Czech Republic:
function convert($str) {
    return iconv("CP1250", "UTF-8", $str);
}
...
while (($data = fgetcsv($this->fhandle, 1000, ";")) !== FALSE) {
    $data = array_map("convert", $data);
    ...
}
The problem must be your file encoding - it looks like it's not UTF-8.
When I tried your example and double-checked that the file is indeed UTF-8, it worked for me; I get:
Array ( [0] => 1 [1] => Austria [2] => Österreich )
Use LibreOffice (OpenOffice); it's more reliable for this sort of thing.
From what you say, I suspect Excel writes a UTF-8 file without a BOM, which makes guessing that the encoding is UTF-8 slightly trickier. You can confirm this diagnosis if the characters appear correctly in Notepad++ when selecting Format -> Encode in UTF-8 (without BOM) (rather than Format -> Convert to UTF-8 (without BOM)).
And are you sure every user is going to use UTF-8? It sounds to me like you need something that does a little smart guessing of what your real input encoding is. By "smart", I mean guessing that recognizes BOM-less UTF-8.
To cut to the chase, I'd do something like this:
$f = fopen('file.csv', 'r');
while (($row = fgets($f)) !== false) { // fgets returns false at EOF, so test with !==
    if (mb_detect_encoding($row, 'UTF-8', true) !== false) {
        var_dump(str_getcsv($row, ';'));
    } else {
        var_dump(str_getcsv(utf8_encode($row), ';'));
    }
}
fclose($f);
This works because you inspect the actual bytes to guess the encoding, rather than lazily trusting the first 3 bytes, so UTF-8 without a BOM is still recognized as UTF-8. Of course, if your CSV file is not too big, you could do the encoding detection on the whole file contents: something like mb_detect_encoding(file_get_contents(...), ...).
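A sketch of that whole-file variant, assuming the file fits comfortably in memory and that non-UTF-8 input is Windows-1252 (both assumptions):

// Detect once on the full contents, convert once, then parse line by line.
$contents = file_get_contents('file.csv');
if (mb_detect_encoding($contents, 'UTF-8', true) === false) {
    $contents = mb_convert_encoding($contents, 'UTF-8', 'Windows-1252');
}
foreach (explode("\n", $contents) as $line) {
    if (trim($line) !== '') {
        var_dump(str_getcsv($line, ';'));
    }
}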

Determine unknown data format of binary data in PHP

I have binary data with a mix of uint32 and null terminated strings. I know the size of an individual data set ( each set of data shares the same format ), but not the actual format.
I've been using unpack to read the data with the following functions:
function read_uint32($fh) {
    $return_value = fread($fh, 4);
    $return_value = unpack('L', $return_value); // 'L': unsigned 32-bit, machine byte order
    return $return_value[1];
}

function read_string($fh) {
    $return_string = '';
    do {
        $char = fread($fh, 1);
        $return_string .= $char;
    } while (ord($char) != 0);
    return substr($return_string, 0, -1); // drop the trailing null byte
}
and then I basically try both functions and see if the data makes sense as a string; if not, it's probably an int. Is there an easier way to go about this?
Thanks.
Well, I think your approach is okay.
If you only get ASCII strings it's quite easy, as the highest bit will (almost) always be 0: analyzing some bytes from the file and looking at the distribution will probably tell you whether it's ASCII or something binary.
If you have a different encoding like UTF-8 or something, it's really a pain.
You could also look for recurring CR/LF chars, or filter out the range 0-31 to only let tab, CR, LF, FF slip through: analyze the first X bytes and compare the ratio of those text-like chars to the others (see the sketch below). This will work for any encoding, since the ASCII range is shared by all of them.
To determine the actual file type, it's probably best to leave this to the OS layer and simply call file from the shell, or use the PHP functions to get the MIME type.
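A rough sketch of that ratio heuristic; the 0.9 threshold is a guess, not a calibrated value:

// Returns true if the sampled bytes look like text: mostly printable
// ASCII plus tab/LF/FF/CR.
function looks_like_text($bytes) {
    $len = strlen($bytes);
    if ($len === 0) {
        return false;
    }
    $textish = 0;
    for ($i = 0; $i < $len; $i++) {
        $c = ord($bytes[$i]);
        if (($c >= 32 && $c < 127) || in_array($c, array(9, 10, 12, 13))) {
            $textish++;
        }
    }
    return ($textish / $len) > 0.9; // arbitrary threshold
}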
