We have some internal dashboards with a PHP backend used for uploading CSV files. Recently we found that some CSVs would fail to parse: the fgetcsv function returns false, which is unhelpful because it gives no indication of the actual problem in the CSV (e.g. at which line number it runs into trouble, or which characters it cannot digest).
We narrowed the problem down to character-set encoding: CSVs generated on Windows machines were failing. Linux's iconv command was able to fix the CSVs for us:
iconv -c --from-code=UTF-8 --to-code=ASCII path/to/uncleaned.csv > path/to/cleaned.csv
while its PHP equivalent didn't work (we tried both the //IGNORE and //TRANSLIT options).
$uncleaned_csv_text = file_get_contents($source_data_csv_filename);
$cleaned_csv_text = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $uncleaned_csv_text);
file_put_contents($source_data_csv_filename, $cleaned_csv_text);
// ...
$handle = fopen($source_data_csv_filename, 'r');
$headers = fgetcsv($handle);
While we could use PHP's exec function to run the shell command, that is less than ideal: the practice is forbidden in our organisation for security reasons (Travis doesn't let it pass through).
Is there any alternative way to achieve this CSV 'cleaning'?
UPDATE-1
We explored several other options, none of which worked for us:
regex based cleaning
forceutf8 package
mb_convert_encoding (as suggested by discussions)
UPDATE-2
Upon echoing the sha1 digest of the CSV text before and after passing it through PHP's iconv function, we found that iconv makes no change at all.
Also, in our case mb_check_encoding on the original CSV text returns true regardless of the encoding queried: windows-1252, ascii, or utf-8.
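This last observation is actually expected: any pure-ASCII byte string is simultaneously valid ASCII, valid UTF-8, and valid Windows-1252, so mb_check_encoding alone cannot distinguish these encodings. A minimal sketch (the sample string is illustrative, not from the real CSVs):

```php
<?php
// Any pure-ASCII byte string is valid in ASCII, UTF-8 and Windows-1252
// alike, so mb_check_encoding() returns true for all three queries.
$text = "First Name,Last Name\n";

var_dump(mb_check_encoding($text, 'ASCII'));        // bool(true)
var_dump(mb_check_encoding($text, 'UTF-8'));        // bool(true)
var_dump(mb_check_encoding($text, 'Windows-1252')); // bool(true)
```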
I've been working on a WordPress plugin that handles CSV files for a while now (with a 5/5 star rating), and I've been using mb_convert_encoding() with no issues. I know I have users on both Windows and Linux.
Basically:
(TO UTF-8, FROM: Windows-1252)
$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8', 'Windows-1252');
If you don't know the original's encoding (maybe better in your case), you can omit the third argument; note that mb_convert_encoding() then assumes the internal encoding (usually UTF-8) rather than detecting it:
(TO UTF-8)
$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8');
UPDATE:
Here is a more complete answer which I hope you will find useful. I've used file() together with str_getcsv():
<?php
$file = "csvfiles/pricelist.csv"; // This file is Windows-1252 encoded

// Load the CSV file into an array of rows; keep the raw lines so we can
// detect each line's original end-of-line sequence later
$delimiter = ';';
$lines = file($file);
$content_arr = array_map(function ($line) use ($delimiter) {
    return str_getcsv(rtrim($line, "\r\n"), $delimiter);
}, $lines);

// Re-encode row by row, re-appending each line's original end of line
$csv = array_map(function ($row, $line) use ($delimiter) {
    return mb_convert_encoding(implode($delimiter, $row), 'UTF-8', 'Windows-1252')
         . detect_eol($line);
}, $content_arr, $lines);

// Save the modified file in UTF-8
file_put_contents('csvfiles/pricelist_modified.csv', $csv);
//Detects the end-of-line sequence of a string.
//
//Adapted from
//https://stackoverflow.com/questions/11066857/detect-eol-type-using-php/11066858#11066858
function detect_eol( $str )
{
    static $eols = array(
        "\r\n",     // CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\n\r",     // LF+CR: BBC Acorn, RISC OS spooled text output
        "\n",       // LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\r",       // CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <= v9, OS-9
        "\x0B",     // VT: Vertical Tab, U+000B
        "\x0C",     // FF: Form Feed, U+000C
        "\x1E",     // RS: QNX (pre-POSIX)
        "\x15",     // EBCDIC NEL: OS/390, OS/400
        "\u{0085}", // NEL: Next Line, U+0085
        "\u{2028}", // LS: Line Separator, U+2028
        "\u{2029}", // PS: Paragraph Separator, U+2029
    );
    $cur_cnt = 0;
    $cur_eol = "\r\n"; // default
    // Pick whichever EOL sequence from the list above occurs most often
    foreach ($eols as $eol) {
        $char_cnt = mb_substr_count($str, $eol);
        if ($char_cnt > $cur_cnt) {
            $cur_cnt = $char_cnt;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}
This can't be deemed a solution (we never even determined the root cause of the problem), but rather a workaround.
We asked people using Windows machines to follow these steps when uploading CSVs:
Upload CSV to Google Sheet (File > Import)
Download the created Sheet back as a CSV (File > Download > Comma-separated values)
Then use that downloaded CSV for uploading on dashboard
Credit to @RohitChandana for suggesting this workaround.
Related
I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some encoding issues. Any non-ANSI characters do not encode properly and make the output DOCX corrupt; MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in Notepad++, I can see the problematic characters. By going to the Encoding menu and selecting "Encode in ANSI", I can see the characters normally: they are pound (£) symbols. When N++ is set to "Encode in UTF-8", they appear as a hexadecimal value.
By selecting the N++ option to "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - The whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";
$zip = new ZipArchive();
if ($zip->open($target, ZipArchive::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";
$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);
// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable)
{
    $find[$x] = '/' . $matches[0][$x] . '/';
    $replace[$x] = ${$matches[1][$x]};
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);
$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents = utf8_encode($file_contents);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?
Don't use any conversion functions; simply use UTF-8 everywhere.
Let's check that you really have UTF-8: in PHP, apply the bin2hex() function to the string that supposedly contains £. You should see c2a3, which is the UTF-8 hex encoding of £.
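A quick sketch of that check (the sample strings here are illustrative): the pound sign is the byte pair C2 A3 in UTF-8, but the single byte A3 in Windows-1252/ISO-8859-1.

```php
<?php
$utf8   = "\u{00A3}"; // '£' encoded as UTF-8 (PHP 7+ escape syntax)
$cp1252 = "\xA3";     // '£' encoded as Windows-1252 / ISO-8859-1

var_dump(bin2hex($utf8));   // string(4) "c2a3"
var_dump(bin2hex($cp1252)); // string(2) "a3"
```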
I am trying to import a CSV file into my PHP application built with Drupal. I have encountered a strange situation when importing CSV files exported from Mozilla Thunderbird (I am exporting the address book of contacts). If I export using the Windows version of Thunderbird, multibyte characters are not rendered and appear as missing characters when dumping the extracted contents to screen. However, this problem does not exist with an identical file created using the Linux version of Thunderbird; in that case everything works perfectly.
To test this I installed the same version of Thunderbird on Linux and on Windows 7. I then created the same single user (surname: 张, given name: 利) in each address book, then exported the address book as a CSV file. As mentioned above, the Linux CSV file imports successfully but the Windows one doesn't.
If I examine both files in Linux using file --mime myfilename.csv I get the following output:
LinuxTB14.csv: text/plain; charset=utf-8
WinTB14.csv: text/plain; charset=iso-8859-1
So the Windows file, even though it contains Chinese characters, is encoded as iso-8859-1. After discovering this, I assumed it was an encoding issue and that I just needed to tell PHP to encode the offending content as UTF-8.
The problem is that PHP appears to detect the encoding in a way I can't understand.
// Set correct locale to avoid any issues with multibyte characters.
$original_local_value = setlocale(LC_CTYPE, 0);
if ($original_local_value !== 'en_US.UTF-8') {
setlocale(LC_CTYPE, 'en_US.UTF-8');
}
$handle = fopen($file->uri, "r");
$cardinfo = array();
while (($data = fgetcsv($handle, 5000, ",")) !== FALSE) {
$cardinfo[] = $data;
// dsm() is a drupal function which prints the content of the argument to screen.
dsm(mb_detect_encoding($data[0]));
dsm($data[0]);
}
If I include the above code, which prints the encoding and content of the first value in each line of the CSV file, I get the following rendered to the screen:
For the CSV created by Thunderbird on Windows:
ASCII
First Name
UTF-8
For the CSV created by Thunderbird on Linux:
ASCII
First Name
UTF-8
利
As you can see, PHP reports the same encoding for both files, even though the Chinese character from the Windows file is not printed to the screen.
Anyone have any ideas what might be going on here?
EDIT
If I open the Windows CSV file in Notepad and use Save As... with UTF-8 format, the file imports correctly, so it is obviously an encoding issue. I have added the following code to convert the file encoding if it is not already UTF-8:
$file_contents = file_get_contents($file->uri);
$file_encoding = mb_detect_encoding($file_contents, 'UTF-8, ISO-8859-1, WINDOWS-1252');
if ($file_encoding !== 'UTF-8') {
$file_contents = iconv($file_encoding, 'UTF-8', $file_contents);
$handle = fopen($file->uri, 'w');
fwrite($handle, $file_contents);
fclose($handle);
}
This partially fixes the problem: the characters now appear, but they are garbled (e.g. 张 appears as ÕÅ). I checked the page encoding of my browser and the page headers, and both are set to UTF-8, so it is not a browser issue.
Any ideas?
The only solution I have come up with for this issue is not to try to detect and convert the encoding of the uploaded file in the first place. After much research it appears that reliable encoding detection does not really exist; there is simply too much room for error.
The safest option is to require that the uploaded file is encoded in UTF-8, since valid UTF-8 can be reliably detected. The following code is how I am doing the UTF-8 detection:
$file_content = file_get_contents($file->uri);
// Create regex pattern which detects UTF-8 encoding.
$regex = '%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs';
if (!preg_match($regex, $file_content)) {
// Not valid UTF-8 encoding so flag an error.
}
I need to convert a CSV file exported from Mac Excel 2011 to an importable format recognized by a CMS (the solution should not be CMS-specific; for reference, the target import format is for the Drupal Feeds module).
In order to do this currently I need to perform the following operations in Vim:
:%s/\r/\r/g
:w ++enc=utf8
Which basically means:
Convert the carriage returns to a universal format. As Excel exports them, the carriage return character is represented by ^M; the Vim command :%s/\r/\r/g converts them all to something the CMS recognizes as a line break (in a Vim pattern \r matches a carriage return, while in the replacement \r inserts a proper newline).
Convert the character encoding to UTF-8. As exported, the character set is extended ASCII or something similar.
Ideally this process will be triggered upon uploading the file as part of the import, which means PHP will drive it. However, at this point I feel more comfortable handling the solution as a shell script or something similar, though PHP solutions are welcome if I can figure out how to hook one into Drupal 7 Feeds.
Some untested code:
#!/usr/bin/env php
<?php
$replacements = array(
    // Adjust the destination line ending to your liking
    "\r\n" => "\n",
    "\r"   => "\n",
);
// Note: fixed-size chunks can split a "\r\n" pair across two reads,
// which would yield a doubled newline; harmless for most imports.
while ($chunk = fread(STDIN, 10240)) {
    // Normalize line endings
    $chunk = strtr($chunk, $replacements);
    // Convert to UTF-8 (adjust the source encoding to your needs)
    $chunk = iconv('CP1252', 'UTF-8', $chunk);
    fwrite(STDOUT, $chunk);
}
Usage:
./fix-csv < input.csv > output.csv
I use this code to get the number of columns from a CSV file:
$this->dummy_file_handler = fopen($this->config['file'], 'r');
if ($dataset = fgetcsv($this->dummy_file_handler)) {
    $this->number_of_columns = count($dataset);
}
It works fine unless the file was exported from Excel for Mac 2011, since the new line character is then classic Mac (CR), which fgetcsv doesn't recognize.
If I manually change the newlines from classic Mac (CR) to Unix (LF), it works, but I need this to be automated.
How can I make fgetcsv recognize the Classic Mac (CR) new line character?
From the manual:
Note: If PHP is not properly recognizing the line endings when reading files either on or created by a Macintosh computer, enabling the auto_detect_line_endings run-time configuration option may help resolve the problem.
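Applied to the snippet in the question, that would look something like this (the filename is a placeholder; note that auto_detect_line_endings was deprecated in PHP 8.1):

```php
<?php
// Enable detection of classic Mac (CR) line endings.
// Must be set before the file is opened.
ini_set('auto_detect_line_endings', '1');

$handle = fopen('excel_mac_export.csv', 'r'); // placeholder filename
if ($dataset = fgetcsv($handle)) {
    $number_of_columns = count($dataset);
}
fclose($handle);
```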
If Saul's answer doesn't work, I'd write a simple script to read the file in all at once, str_replace all \r with \n, dump the result into a new file, and then fgetcsv that new file.
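A minimal sketch of that read-replace-reread approach (filenames are placeholders):

```php
<?php
// Read the whole file, normalize classic Mac CR line endings to LF,
// write a new file, then let fgetcsv() read the new file.
$contents = file_get_contents('mac_export.csv');  // placeholder
$contents = str_replace("\r\n", "\n", $contents); // protect CRLF first
$contents = str_replace("\r", "\n", $contents);   // lone CR -> LF
file_put_contents('mac_export_unix.csv', $contents);

$handle = fopen('mac_export_unix.csv', 'r');
while (($row = fgetcsv($handle)) !== FALSE) {
    // process each row
}
fclose($handle);
```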
I find it amusing that these terms come from the days of using typewriters:
\n = Line Feed (LF) - advances the paper one line.
\r = Carriage Return (CR) - returns the carriage to the left side of the typewriter.
I am trying to search and replace special characters in strings that I am parsing from a CSV file. When I open the text file with Vim, it shows the character as <95>. I can't for the life of me figure out what character this is in order to use preg_replace on it. Any help would be appreciated.
Thanks,
Chris Edwards
0x95 is probably meant to be the character U+2022 Bullet (•), encoded in Windows code page 1252. You can remove it from a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it into rows yourself, which is not trivial given that quoted CSV values may contain newlines. By the time you've written code that handles that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, make your pre-processing changes, write the result back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the bytes. The default encoding outside of Windows is usually UTF-8, which won't read a lone 0x95 byte, as that would be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice and won't play nicely with any other apps you've got running that depend on the system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.
Following Bobince's suggestion, the following worked for me. analyse_file() comes from a user comment on the fgetcsv() manual page:
http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

if ($_FILES['file']['error'] != UPLOAD_ERR_NO_FILE) { // error 4 = no file uploaded
    foreach ($_FILES as $file) {
        $n = $file['name'];
        $s = $file['size'];
        $filename = $file['tmp_name'];
        ini_set('auto_detect_line_endings', TRUE); // in case of a Mac CSV

        // Dealing with fgetcsv() special chars:
        // read the file into a string, do the pre-processing changes,
        // write it back out to a temporary file, and have fgetcsv() read that
        $file = file_get_contents_utf8($filename);
        $tempFile = tempnam(sys_get_temp_dir(), '');
        $handle = fopen($tempFile, "w+");
        fwrite($handle, $file);
        fseek($handle, 0);
        $filename = $tempFile;
        // END -- dealing with fgetcsv() special chars

        $Array = analyse_file($filename, 10);
        $csvDelim = $Array['delimiter']['value'];

        while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
            // process the csv file
        }
    } // end foreach
}