Perform character operations with shell commands - php

I need to convert a CSV file exported from Mac Excel 2011 to an importable format recognized by a CMS (the solution should not be CMS-specific, although the target import format happens to be for the Drupal Feeds module).
In order to do this currently I need to perform the following operations in Vim:
:%s/\r/\r/g
:w ++enc=utf8
Which basically means:
Convert carriage returns to some sort of universal format.
Initially, as Excel exports them, the carriage return character is represented by ^M;
the Vim command :%s/\r/\r/g converts them all to a format the CMS recognizes as a carriage return.
Convert the character encoding to UTF-8.
As exported initially, the character set is ASCII Extended or something similar.
Ideally this process will be triggered upon uploading the file as part of the import, which means PHP will trigger it, if that has any bearing on the approach. However, at this point I feel more comfortable handling the solution as a shell script or something similar, though PHP solutions are welcome if I can figure out how to hook them into Drupal 7 Feeds.

Some untested code:
#!/usr/bin/env php
<?php
$replacements = array(
    // Adjust the destination char to your liking
    "\r\n" => "\n", // Windows (must come first so CR+LF isn't converted twice)
    "\r"   => "\n", // classic Mac, as exported by Mac Excel
    "\n"   => "\n", // Unix, already fine
);
// No risk of splitting multi-byte chars: the input is single-byte.
// (A CR+LF pair could still straddle a chunk boundary; read the whole
// file at once if that edge case matters.)
while ($chunk = fread(STDIN, 10240)) {
    // Normalize line feeds
    $chunk = strtr($chunk, $replacements);
    // Convert to UTF-8 (adjust the source encoding to your needs)
    $chunk = iconv('CP1252', 'UTF-8', $chunk);
    fwrite(STDOUT, $chunk);
}
Usage:
./fix-csv < input.csv > output.csv
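If the conversion should instead be triggered from PHP at upload time, here is a minimal sketch along the same lines ('csv' is a hypothetical form field name and the paths are placeholders; reading the whole file at once also sidesteps the chunk-boundary caveat noted above):
<?php
// Hedged sketch: same normalization as the filter above, applied to an upload
$raw = file_get_contents($_FILES['csv']['tmp_name']);
// Normalize CR/LF variants to "\n" (strtr matches the longest key first)
$raw = strtr($raw, array("\r\n" => "\n", "\r" => "\n"));
// Convert from the assumed source encoding to UTF-8
$utf8 = iconv('CP1252', 'UTF-8', $raw);
file_put_contents('/tmp/import-ready.csv', $utf8);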

How to convert UTF-8 to ANSI csv file

I'm actually generating CSV files in PHP; it works great, but I have to use these CSV files in Microsoft Dynamics AX, and here's the problem.
The CSV files I generate get "NUL" spaces in some columns, and I have to strip those out to get clean CSV files to use in Dynamics AX.
I saw when opening them in Notepad++ that the CSV files are in UTF-8 BOM, and I want to convert them to ANSI; when I do the conversion to ANSI in Notepad++, all the NUL spaces disappear.
I tried different things I saw on Stack Overflow, and the iconv method gave me the best result, but it is far from perfect and not what I expect.
Here's the actual code:
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
for ($a = 0; $a < count($tableau); $a++) {
    foreach ($tableau[$a] as $data) {
        fputcsv($fp, $data, ";", chr(0));
    }
}
$fp = iconv("UTF-8", "Windows-1252//TRANSLIT//IGNORE", $fp);
fclose($fp);
echo json_encode(responseAjax(true));
}
and in the result I obtain (screenshot omitted), I don't understand why it only applies to one cell instead of working on every cell which contains "NUL" spaces.
I tried the mb_convert_encoding method with no great result.
Any other idea, method or advice will be welcome.
Thanks
"NUL" is the name generally given to a binary value of 00000000, which has the same meaning in all ASCII-compatible encodings, which includes UTF-8, Windows-1252, and most things that could be given the vague label "ANSI". So character encoding is not relevant here.
You appear to be explicitly adding it with chr(0) - specifically as the "enclosure" argument to fputcsv, so it's being used in place of quote marks around strings with spaces or punctuation. The string "Avion" doesn't need enclosing, which is why it doesn't have any NULs around it.
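To see the effect in isolation, a tiny demonstration (made-up data, not from the question):
$out = fopen('php://output', 'w');
// NUL enclosure: fields containing spaces get wrapped in chr(0)
fputcsv($out, array('Avion', 'Main Street 1'), ';', chr(0));
// prints: Avion;\0Main Street 1\0
// Default enclosure: the same field gets normal quotes instead
fputcsv($out, array('Avion', 'Main Street 1'), ';');
// prints: Avion;"Main Street 1"
fclose($out);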
Let's add some comments to the code you've apparently copied without understanding:
// Output the three bytes known as the "UTF-8 BOM"
// - an invisible character used to help software guess that a file should be read as UTF-8
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
// Loop over the data
for ($a = 0; $a < count($tableau); $a++) {
    foreach ($tableau[$a] as $data) {
        // Output a row of CSV data with the built-in function fputcsv
        // $data is the array of data you want to output on this row
        // Use ';' as the separator between columns
        // Use chr(0) - a NUL byte - to "enclose" fields with spaces or punctuation
        // The default would be to use comma (',') and quotes ('"')
        // See the PHP manual at https://php.net/fputcsv for more details
        fputcsv($fp, $data, ";", chr(0));
    }
}
The character you use for the "enclosure" is entirely up to you; most systems will probably expect the default ", so you could use this:
fputcsv($fp, $data, ";");
Which is the same as this:
fputcsv($fp, $data, ";", '"');
The function doesn't support disabling the enclosure completely, but without it, CSV is fundamentally a very simple format - just combine the fields separated by some character and append a line break, e.g. using implode:
fwrite($fp, implode(";", $data) . "\n");
Character encoding is a completely separate issue. For that, you need to know two things:
What encoding is your data in
What encoding does the remote system want it in
If these don't match, you can use a function like iconv or mb_convert_encoding.
If your output is not UTF-8, you should also remove the line at the beginning of your code that outputs the UTF-8 BOM.
If your data is stored in UTF-8, and the remote system accepts data in UTF-8, you don't need to do anything here.
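Putting that advice together, here is a sketch of what the writing loop might look like, assuming the data in $tableau is UTF-8 and Dynamics AX really wants Windows-1252 (no BOM, default enclosure, per-field conversion):
$fp = fopen('export.csv', 'w');
foreach ($tableau as $rows) {
    foreach ($rows as $row) {
        // Convert each field from UTF-8 to the target single-byte encoding
        $converted = array_map(function ($field) {
            return iconv('UTF-8', 'Windows-1252//TRANSLIT', $field);
        }, $row);
        fputcsv($fp, $converted, ';'); // default '"' enclosure
    }
}
fclose($fp);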

Character-set conversion: PHP's iconv won't work

We have some internal dashboards with a PHP backend used for uploading CSV files. Recently we found some CSVs would fail to parse: the fgetcsv function returns false, which is super nasty since we couldn't determine the actual problem in the CSV (e.g. at which line number it experiences issues, which characters it is unable to digest, etc.).
We narrowed down the problem to character-set encoding: CSVs generated from Windows machines were failing. Linux's iconv command was able to fix the CSVs for us:
iconv -c --from-code=UTF-8 --to-code=ASCII path/to/uncleaned.csv > path/to/cleaned.csv
while its PHP equivalent didn't work (we tried both the //IGNORE and //TRANSLIT options).
$uncleaned_csv_text = file_get_contents($source_data_csv_filename);
$cleaned_csv_text = iconv('UTF-8', 'ASCII/IGNORE//TRANSLIT', $uncleaned_csv_text);
file_put_contents($source_data_csv_filename, $cleaned_csv_text);
..
$headers = fgetcsv($source_data_csv_filename)
While we can use PHP's exec function to run the shell command, it is less than ideal: the practice is forbidden in our organisation from a security viewpoint (Travis doesn't let it pass through).
Is there any alternative way to achieve this CSV 'cleaning'?
UPDATE-1
We explored several other options, none of which worked for us:
regex-based cleaning
the forceutf8 package
mb_convert_encoding (as suggested by discussions)
UPDATE-2
Upon echoing the sha1 digest of the CSV's text before and after subjecting it to PHP's iconv function, we found that iconv is not making any change.
Also, in my case, mb_check_encoding on the original CSV's text outputs true regardless of the encoding queried: windows-1252, ascii, utf-8.
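(For what it's worth, a quick diagnostic sketch, not from the thread, that reports where the first non-ASCII byte sits - the kind of information fgetcsv never surfaces:)
$text = file_get_contents($source_data_csv_filename);
for ($i = 0, $n = strlen($text); $i < $n; $i++) {
    if (ord($text[$i]) >= 0x80) {
        // Count newlines before the offending byte to get a line number
        $lineNo = substr_count(substr($text, 0, $i), "\n") + 1;
        printf("First non-ASCII byte 0x%02X at offset %d (line %d)\n",
            ord($text[$i]), $i, $lineNo);
        break;
    }
}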
I've been working on a plugin (in WordPress) that handles CSV files for a while now (with a 5/5 star rating), and I've been using mb_convert_encoding() with no issues. I know I have users on both Windows and Linux.
Basically:
(TO UTF-8, FROM: Windows-1252)
$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8', 'Windows-1252');
If you don't know the original's encoding (maybe better in your case):
(TO UTF-8)
$cleaned_csv_text = mb_convert_encoding($uncleaned_csv_text, 'UTF-8');
UPDATE:
Here is a more complete answer which I hope you will find useful. I've used file() together with str_getcsv(), etc.:
<?php
$file = "csvfiles/pricelist.csv"; // This is Windows-1252 encoded
$delimiter = ';';
// Detect the end-of-line sequence once, from the raw file
$eol = detect_eol(file_get_contents($file));
// Load the csv file into an array $content_arr of parsed rows.
// FILE_IGNORE_NEW_LINES keeps the EOL out of the last field.
$content_arr = array_map(function ($v) use ($delimiter) {
    return str_getcsv($v, $delimiter);
}, file($file, FILE_IGNORE_NEW_LINES));
// Do the encoding row by row, re-joining the fields and appending
// the detected end of line
// (note: fields containing the delimiter would need re-quoting)
$csv = array_map(function ($v) use ($delimiter, $eol) {
    return mb_convert_encoding(implode($delimiter, $v), 'UTF-8', 'Windows-1252') . $eol;
}, $content_arr);
// Save the modified file in UTF-8
file_put_contents('csvfiles/pricelist_modified.csv', $csv);
// Detects the end-of-line character of a string.
//
// Adapted from
// https://stackoverflow.com/questions/11066857/detect-eol-type-using-php/11066858#11066858
function detect_eol($str)
{
    static $eols = array(
        "\r\n",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\n\r",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output
        "\n",       // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\r",       // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <= v9, OS-9
        "\x0B",     // [ASCII] VT: Vertical Tab, U+000B
        "\x0C",     // [ASCII] FF: Form Feed, U+000C
        "\x1E",     // [ASCII] RS: QNX (pre-POSIX)
        "\x15",     // [EBCDIC] NEL: OS/390, OS/400
        "\u{0085}", // [UNICODE] NEL: Next Line, U+0085
        "\u{2028}", // [UNICODE] LS: Line Separator, U+2028
        "\u{2029}", // [UNICODE] PS: Paragraph Separator, U+2029
    );
    $cur_cnt = 0;
    $cur_eol = "\r\n"; // default
    // Pick the candidate that occurs most often in the string.
    // CR+LF is listed first so it wins the tie against lone CR or LF.
    foreach ($eols as $eol) {
        $char_cnt = mb_substr_count($str, $eol);
        if ($char_cnt > $cur_cnt) {
            $cur_cnt = $char_cnt;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}
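A quick sanity check of the helper with made-up input:
// "\r\n" occurs most often, so it wins
var_dump(bin2hex(detect_eol("100;Widget\r\n200;Gadget\r\n"))); // string(4) "0d0a"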
This can't be deemed a solution (since we never determined the root cause of the problem), but rather a hack.
We asked people using Windows machines to do this before uploading CSVs:
Upload the CSV to a Google Sheet (File > Import)
Download the created Sheet back as a CSV (File > Download > Comma-separated values)
Then use that downloaded CSV for uploading on the dashboard
Credits to @RohitChandana for suggesting this workaround.

php merging txt files, issue with encoding

I found this code on Stack Overflow, from user @Attgun:
link: merge all files in directory to one text file
<?php
// Name of the directory containing all files to merge
$Dir = "directory";
// Name of the output file
$OutputFile = "filename.txt";
// Scan the files in the directory into an array
$Files = scandir($Dir);
// Create a stream to the output file.
// Use "w" to start a new output file from zero.
// If you want to append to an existing file, use "a".
$Open = fopen($OutputFile, "w");
// Loop through the files, read their content into a string variable,
// write it to the file stream, then clean the variable.
foreach ($Files as $k => $v) {
    if ($v != "." AND $v != "..") {
        $Data = file_get_contents($Dir . "/" . $v);
        fwrite($Open, $Data);
    }
    unset($Data);
}
// Close the file stream
fclose($Open);
?>
The code works, but when merging, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use another encoding than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.
@AlexHowansky motivated me to search for another way.
The solution that seems to work without messing with file encoding is this:
bat file:
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files it reads.
My compiler no longer shows an error message like before!
Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because the Unicode standard defines two possible orders for the bytes that make up a multi-byte character (this ordering has the fancy name of endianness), and the chosen order is signalled by the byte order mark (BOM), a special character placed at the start of the file whose byte sequence reveals the endianness.
Such a prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format: all that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and no longer present in most Unicode documentation) but, since you have several samples, you should be able to see it yourself. Assuming it's the same as in little-endian UTF-16 (FF FE), you'd just need to omit two bytes, e.g.:
$Data = file_get_contents($Dir . "/" . $v);
fwrite($Open, substr($Data, 2));
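A slightly safer variant of the same idea (a sketch: only strip the two leading bytes when they really are the little-endian BOM, and never from the first file written):
$Data = file_get_contents($Dir . "/" . $v);
// ftell() > 0 means we've already written a file, so drop the BOM;
// strncmp() guards against files that carry no BOM at all
if (ftell($Open) > 0 && strncmp($Data, "\xFF\xFE", 2) === 0) {
    $Data = substr($Data, 2);
}
fwrite($Open, $Data);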
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2, so I've used UTF-16 LE. The BOM is FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
    if (pathinfo($file, PATHINFO_EXTENSION) === 'txt' && $file !== 'all.txt') {
        $data = file_get_contents($file);
        fwrite($output, $first ? $data : substr($data, 2));
        $first = false;
    }
}
fclose($output);
var_dump(
    bin2hex(file_get_contents('a.txt')),
    bin2hex(file_get_contents('b.txt')),
    bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding, and that the encoding is exactly the one you think.

DOCX Encoding issues

I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some encoding issues. Any non-ANSI characters do not encode properly and make the output DOCX corrupt. MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in Notepad++, I can see the problematic characters. By going to the Encoding menu and selecting "Encode in ANSI", I can see the characters normally: they are pound (£) symbols. When N++ is set to "Encode in UTF-8", they appear as a hexadecimal value.
By selecting the N++ option "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - the whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";
$zip = new ZipArchive();
if ($zip->open($target, ZipArchive::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";
$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);
// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable) {
    $find[$x] = '/' . $matches[0][$x] . '/';
    $replace[$x] = ${$matches[1][$x]};
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);
$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents_utf8 = utf8_encode($file_contents);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?
Don't use any conversion functions; simply use UTF-8 everywhere.
Let's check that you really have UTF-8 -- in PHP, apply the bin2hex() function to the string that supposedly contains £; you should see c2a3, which is the UTF-8 hex for £.
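A minimal check along those lines (assuming $file_contents holds the text read from word/document.xml):
if (strpos($file_contents, "\xC2\xA3") !== false) {
    echo "Found C2 A3 - the pound sign is already valid UTF-8.\n";
} elseif (strpos($file_contents, "\xA3") !== false) {
    // A lone A3 byte is the Windows-1252 pound sign
    echo "Found a bare A3 - convert with iconv('Windows-1252', 'UTF-8', ...) before saving.\n";
}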

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards
0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, e.g. with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately, this also means you have to read the file and split it row by row yourself, and this is not trivial to do, given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.
Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

if (!($_FILES['file']['error'] == 4)) { // 4 = UPLOAD_ERR_NO_FILE
    foreach ($_FILES as $file) {
        $n = $file['name'];
        $s = $file['size'];
        $filename = $file['tmp_name'];
        ini_set('auto_detect_line_endings', TRUE); // in case of a Mac csv
        // dealing with fgetcsv() special chars:
        // read the file into a string, do the pre-processing changes,
        // write it back out to a temporary file, and have fgetcsv() read that
        $file = file_get_contents_utf8($filename);
        $tempFile = tempnam(sys_get_temp_dir(), '');
        $handle = fopen($tempFile, "w+");
        fwrite($handle, $file);
        fseek($handle, 0);
        $filename = $tempFile;
        // END -- dealing with fgetcsv() special chars
        $Array = analyse_file($filename, 10);
        $csvDelim = $Array['delimiter']['value'];
        while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
            // process the csv file
        }
    } // end foreach
}
