How to convert UTF-8 to ANSI csv file

How to convert UTF-8 to ANSI csv file - php

I'm actually generating csv files in php, works great but I have to use these csv files to use into Microsoft Dynamics AX and here's the problem.
Csv file that I generated gets "NUL" space on some columns and I have to pull off those spaces to get clean csv files and use it in Dynamics AX.
I saw when opening them into Notepad ++ that csv files are in UTF-8 BOM and I want to convert them to ANSI, when I make the conversion to ANSI in Notepad++, all NUL spaces disappear.
I tried different things saw on StackOverflow and it is with the iconv method that I obtained the better result but it is far from perfect and what I expect.
Here's the actual code :
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
foreach ( $tableau[$a] as $data ) {
fputcsv($fp, $data, ";", chr(0));
}
}
$fp=iconv("UTF-8", "Windows-1252//TRANSLIT//IGNORE", $fp);
fclose($fp);
echo json_encode(responseAjax(true));
}
and I obtain these result :
I don't understand why it's only apply in one cell instead on working on every cells which contain "NUL" spaces.
I tried the mb_converting_encoding method with no great result.
Any other idea, method or advice will be welcome,
thanks

"NUL" is the name generally given to a binary value of 00000000, which has the same meaning in all ASCII-compatible encodings, which includes UTF-8, Windows-1252, and most things that could be given the vague label "ANSI". So character encoding is not relevant here.
You appear to be explicitly adding it with chr(0) - specifically as the "enclosure" argument to fputcsv, so it's being used in place of quote marks around strings with spaces or punctuation. The string "Avion" doesn't need enclosing, which is why it doesn't have any NULs around.
Let's add some comments to the code you've apparently copied without understanding:
// Output the three bytes known as the "UTF-8 BOM"
// - an invisible character used to help software guess that a file should be read as UTF-8
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
// Loop over the data
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
foreach ( $tableau[$a] as $data ) {
// Output a row of CSV data with the built-in function fputcsv
// $data is the array of data you want to output on this row
// Use ';' as the separator between columns
// Use chr(0) - a NUL byte - to "enclose" fields with spaces or punctuation
// The default would be to use comma (',') and quotes ('"')
// See the PHP manual at https://php.net/fputcsv for more details
fputcsv($fp, $data, ";", chr(0));
}
}
The character you use for the "enclosure" is entirely up to you; most systems will probably expect the default ", so you could use this:
fputcsv($fp, $data, ";");
Which is the same as this:
fputcsv($fp, $data, ";", '"');
The function doesn't support disabling the enclosure completely, but without it, CSV is fundamentally a very simple format - just combine the fields separated by some character, e.g. using implode:
fwrite($fp, implode(";", $data));
Character encoding is a completely separate issue. For that, you need to know two things:
What encoding is your data in
What encoding does the remote system want it in
If these don't match, you can use a function like iconv or mb_convert_encoding.
If your output is not UTF-8, you should also remove the line at the beginning of your code that outputs the UTF-8 BOM.
If your data is stored in UTF-8, and the remote system accepts data in UTF-8, you don't need to do anything here.

Related

php merging txt files, issue with encoding

I found this code on stackoverflow, from user #Attgun:
link: merge all files in directory to one text file
<?php
//Name of the directory containing all files to merge
$Dir = "directory";
//Name of the output file
$OutputFile = "filename.txt";
//Scan the files in the directory into an array
$Files = scandir ($Dir);
//Create a stream to the output file
$Open = fopen ($OutputFile, "w"); //Use "w" to start a new output file from
zero. If you want to increment an existing file, use "a".
//Loop through the files, read their content into a string variable and
write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
if ($v != "." AND $v != "..") {
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, $Data);
}
unset ($Data);
}
//Close the file stream
fclose ($Open);
?>
The code works right but when it is merging, php inserts a character in the beginning of every file copied. The file encoding i am using is UCS-2 LE.
I can view that character when i change the encoding to ANSI.
My problem is that i can't use another encoding than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't wan't to change the file encoding. I want keep the same encoding without PHP add another character.

#AlexHowansky motivated me to search for an other way.
The solution that it seems to work without messing with file encoding is this :
bat file :
#echo on
copy *.txt all.txt
#pause
Now the final file keeps the encoding from the files that reads.
My compiler doesn't show any error message like before!

Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because the Unicode standard defines two possible directions to print the individual bytes that conform a multi-byte character (that has the fancy name of endianness), and such direction is determined by the presence of the byte order mark character, followed by a variable number of bytes that depends on the encoding and determine the endianness of the file.
Such prefix is what prevents raw file concatenation from working. However, it's a still a pretty simple format. All that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's a obsolete encoding and it's no longer present in most Unicode documentation) but since you have several samples you should be able to see it yourself. Making the assumption that it's the same as in UTF-16 (FF FE) you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2 so I've used UTF-16 LE. The BOM is 0xFFFF (you can inspect your BOM with an hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
if (pathinfo($file, PATHINFO_EXTENSION)==='txt' && $file!=='all.txt') {
$data = file_get_contents($file);
fwrite($output, $first ? $data : substr($data, 2));
$first = false;
}
}
fclose($output);
var_dump(
bin2hex(file_get_contents('a.txt')),
bin2hex(file_get_contents('b.txt')),
bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding the encoding is exactly the one you think.

UTF-8 Hebrew encoding and big question marks

I have read a lot of articles but still i dont get it
Im importing text from file using
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = fgets($fp, 2048);
$delimiter = "\t";
$data = str_getcsv($line, $delimiter);
print_r($data);
}
For displaying numbers and english charachters correctly i had to use
str_replace("\x00", '', $data[7])
But now trying to display hebrew charachters ends up looking like
�
I have tried converting with iconv/mb_convert_encoding/utf8_decode/encode
Nothing helps..
Any assistance will be great

UCS-2 is an older version of UTF-16 so you should probably try both (auto-detect text encoding is not a bullet-proof job).
We have the source encoding. We can speculate the target encoding is UTF-8 (because it's the sensible choice in 2016 and your question is actually tagged as UTF-8). So we have all we need.
We should first remove non-standard raw byte manipulations (e.g. remove str_replace("\x00", '', $data[7]) and similar code). We can then do a proper conversion. If you use mb_convert_encoding(), an initial approach could be:
$delimiter = "\t";
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = mb_convert_encoding(fgets($fp, 2048), 'UTF-8', 'UCS-2LE');
$data = str_getcsv($line, $delimiter);
print_r($data);
}
You can check the list of supported encodings.
But we have a potential problem here: there's no way to tell str_getcsv() about the file encoding so it's unlikely that it will recognise your UCS-2 line endings.
You can try different solutions depending of the size of the CSV file. If it's small, I'll simply convert it at once. Otherwise, I'll have a look at stream_get_line():
This function is nearly identical to fgets() except in that it allows end of line delimiters other than the standard \n, \r, and \r\n, and does not return the delimiter itself.
It'd be something like this:
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');
$line = mb_convert_encoding(stream_get_line($fp, 2048, $ending), 'UTF-8', 'UCS-2LE');
This should work with both Unix line endings (\n) and Windows ones (\r\n).

PHP fputcsv encoding

I create csv file with fputcsv. I want csv file to be in windows1251 ecnding. But can't find the solution. How can I do that?
Cheers

Default encoding for an excel file is machine specific ANSI, mainly windows1252. But since you are creating that file and maybe inserting UTF-8 characters, that file is not going to be handled ok.
You could use iconv() when creating the file. Eg:
function encodeCSV(&$value, $key){
$value = iconv('UTF-8', 'Windows-1252', $value);
}
array_walk($values, 'encodeCSV');

It's Working: Enjoy
use this before fputcsv:
$line = array_map("utf8_decode", $line);

The file will be in whatever encoding your strings are in. PHP strings are raw byte arrays, their encodings depends on wherever the bytes came from. If you just read them from your source code files, they're in whatever you saved your source code as. If they're from a database, they're in whatever encoding the database connection was set to.
If you need to convert from one encoding to another, use iconv. If you need more in-depth information, see here:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Handling Unicode Front To Back In A Web App

fputs( $fp, $bom = chr(0xEF) . chr(0xBB) . chr(0xBF) );
Try this it worked for me!!!

Try the iconv function:
http://php.net/manual/en/function.iconv.php
To transform the members of the array you're passing to the fputcsv function to the encoding you want.

$string = iconv(mb_detect_encoding($string), 'Windows-1252//TRANSLIT', $string);

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards

0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle properly that you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.

Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
if( !($_FILES['file']['error'] == 4) ) {
foreach($_FILES as $file) {
$n = $file['name'];
$s = $file['size'];
$filename = $file['tmp_name'];
ini_set('auto_detect_line_endings',TRUE); // in case Mac csv
// dealing with fgetcsv() special chars
// read the file into a string, do your pre-processing changes
// write it back out to a temporary file, and have fgetcsv() read that.
$file = file_get_contents_utf8($filename);
$tempFile = tempnam(sys_get_temp_dir(), '');
$handle = fopen($tempFile, "w+");
fwrite($handle,$file);
fseek($handle, 0);
$filename = $tempFile;
// END -- dealing with fgetcsv() special chars
$Array = analyse_file($filename, 10);
$csvDelim = $Array['delimiter']['value'];
while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
// process the csv file
}
} // end foreach
}

Working with files and utf8 in PHP

Lets say I have a file called foo.txt encoded in utf8:
aoeu
qjkx
ñpyf
And I want to get an array that contains all the lines in that file (one line per index) that have the letters aoeuñpyf, and only the lines with these letters.
I wrote the following code (also encoded as utf8):
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
foreach(preg_split("//",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
if(!in_array($letter,$allowed_letters)){
$line="";
}
}
if($line!=""){
$lines[]=$line;
}
}
fclose($f);
However, after that, the $lines array just has the aoeu line in it.
This seems to be because somehow, the "ñ" in $allowed_letters is not the same as the "ñ" in foo.txt.
Also if I print a "ñ" of the file, a question mark appears, but if I print it like this print "ñ";, it works.
How can I make it work?

If you are running Windows, the OS does not save files in UTF-8, but in cp1251 (or something...) by default you need to save the file in that format explicitly or run each line in utf8_encode() before performing your check. I.e.:
$line=utf8_encode(fgets($f));
If you are sure that the file is UTF-8 encoded, is your PHP file also UTF-8 encoded?
If everything is UTF-8, then this is what you need :
foreach(preg_split("//u",$line,-1,PREG_SPLIT_NO_EMPTY) as $letter){
// ...
}
(append u for unicode chars)
However, let me suggest a yet faster way to perform your check :
$allowed_letters=array("a","o","e","u","ñ","p","y","f");
$lines=array();
$f=fopen("foo.txt","r");
while(!feof($f)){
$line=fgets($f);
$line = str_split(rtrim($line));
if (count(array_intersect($line, $allowed_letters)) == count($line)) {
$lines[] = $line;
}
}
fclose($f);
(add space chars to allow space characters as well, and remove the rtrim($line))

In UTF-8, ñ is encoded as two bytes. Normally in PHP all string operations are byte-based, so when you preg_split the input it splits up the first byte and the second byte into separate array items. Neither the first byte on its own nor the second byte on its own will match both bytes together as found in $allowed_letters, so it'll never match ñ.
As Yanick posted, the solution is to add the u modifier. This makes PHP's regex engine treat both the pattern and the input line as Unicode characters instead of bytes. It's lucky that PHP has special Unicode support here; elsewhere PHP's Unicode support is extremely spotty.
A simpler and quicker way than splitting would be to compare each line against a character-group regex. Again, this must be a u regex.
if(preg_match('/^[aoeuñpyf]+$/u', $line))
$lines[]= $line;

It sounds like you've already got your answer, but it is important to recognize that unicode characters can be stored in multiple ways. Unicode normalization* is a process which can help ensure comparisons work as expected.
http://en.wikipedia.org/wiki/Unicode_equivalence

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.