php merging txt files, issue with encoding

I found this code on Stack Overflow, from user @Attgun:
Link: merge all files in directory to one text file
<?php
//Name of the directory containing all files to merge
$Dir = "directory";
//Name of the output file
$OutputFile = "filename.txt";
//Scan the files in the directory into an array
$Files = scandir ($Dir);
//Create a stream to the output file
//Use "w" to start a new output file from zero. If you want to increment an existing file, use "a".
$Open = fopen ($OutputFile, "w");
//Loop through the files, read their content into a string variable and write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
    if ($v != "." AND $v != "..") {
        $Data = file_get_contents ($Dir."/".$v);
        fwrite ($Open, $Data);
    }
    unset ($Data);
}
//Close the file stream
fclose ($Open);
?>
The code works, but when it merges, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use any encoding other than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.

@AlexHowansky motivated me to search for another way.
The solution that seems to work without touching the file encoding is this batch file:
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files it reads.
My compiler doesn't show any error message like before!

Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because the Unicode standard defines two possible orders for the individual bytes that make up a multi-byte character (this has the fancy name of endianness). The order is signalled by a byte order mark (BOM): a prefix at the very start of the file whose length depends on the encoding and whose byte sequence reveals the endianness.
This prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format. All that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and it's no longer present in most Unicode documentation) but since you have several samples you should be able to see it yourself. Making the assumption that it's the same as in UTF-16 LE (FF FE) you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2 so I've used UTF-16 LE. The BOM is FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
    if (pathinfo($file, PATHINFO_EXTENSION)==='txt' && $file!=='all.txt') {
        $data = file_get_contents($file);
        fwrite($output, $first ? $data : substr($data, 2));
        $first = false;
    }
}
fclose($output);
var_dump(
    bin2hex(file_get_contents('a.txt')),
    bin2hex(file_get_contents('b.txt')),
    bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.
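If you are not sure that every file really starts with a BOM, a slightly more defensive variant is to strip the two bytes only when they are actually FF FE. The following is a minimal sketch under that assumption; the function name merge_utf16le_files() and its parameters are made up for illustration, and it still assumes all inputs are UTF-16 LE (or UCS-2 LE with the same BOM):

```php
<?php
// Hypothetical helper: concatenates UTF-16 LE files, keeping only the first BOM.
function merge_utf16le_files(array $paths, string $target): void
{
    $bom = "\xFF\xFE"; // UTF-16 LE byte order mark (assumed to match UCS-2 LE)
    $output = fopen($target, 'wb');
    if ($output === false) {
        throw new RuntimeException("Cannot open $target for writing");
    }
    $first = true;
    foreach ($paths as $path) {
        $data = file_get_contents($path);
        if ($data === false) {
            fclose($output);
            throw new RuntimeException("Cannot read $path");
        }
        // Drop the BOM only when it is really there, and never from the first file.
        if (!$first && strncmp($data, $bom, 2) === 0) {
            $data = substr($data, 2);
        }
        fwrite($output, $data);
        $first = false;
    }
    fclose($output);
}
```

A call such as merge_utf16le_files(glob('*.txt'), 'all.out') would merge every .txt file in the current directory; writing to a name that does not match the pattern avoids reading the output back in as an input.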

Related

How to convert UTF-8 to ANSI csv file

I'm generating CSV files in PHP. That works well, but I have to import these CSV files into Microsoft Dynamics AX, and that's where the problem is.
The CSV files I generate contain "NUL" spaces in some columns, and I have to strip those spaces to get clean CSV files for Dynamics AX.
Opening them in Notepad++ shows that the CSV files are UTF-8 with BOM. I want to convert them to ANSI: when I do the conversion to ANSI in Notepad++, all the NUL spaces disappear.
I tried different things I saw on Stack Overflow. The iconv method gave me the best result, but it is far from perfect and not what I expect.
Here's the current code:
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
    foreach ( $tableau[$a] as $data ) {
        fputcsv($fp, $data, ";", chr(0));
    }
}
$fp=iconv("UTF-8", "Windows-1252//TRANSLIT//IGNORE", $fp);
fclose($fp);
echo json_encode(responseAjax(true));
}
and I get this result:
I don't understand why it only applies to one cell instead of working on every cell that contains "NUL" spaces.
I tried the mb_convert_encoding method with no great result.
Any other idea, method or advice is welcome.
Thanks

"NUL" is the name generally given to a binary value of 00000000, which has the same meaning in all ASCII-compatible encodings, which includes UTF-8, Windows-1252, and most things that could be given the vague label "ANSI". So character encoding is not relevant here.
You appear to be explicitly adding it with chr(0) - specifically as the "enclosure" argument to fputcsv, so it's being used in place of quote marks around strings with spaces or punctuation. The string "Avion" doesn't need enclosing, which is why it doesn't have any NULs around.
Let's add some comments to the code you've apparently copied without understanding:
// Output the three bytes known as the "UTF-8 BOM"
// - an invisible character used to help software guess that a file should be read as UTF-8
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
// Loop over the data
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
    foreach ( $tableau[$a] as $data ) {
        // Output a row of CSV data with the built-in function fputcsv
        // $data is the array of data you want to output on this row
        // Use ';' as the separator between columns
        // Use chr(0) - a NUL byte - to "enclose" fields with spaces or punctuation
        // The default would be to use comma (',') and quotes ('"')
        // See the PHP manual at https://php.net/fputcsv for more details
        fputcsv($fp, $data, ";", chr(0));
    }
}
The character you use for the "enclosure" is entirely up to you; most systems will probably expect the default ", so you could use this:
fputcsv($fp, $data, ";");
Which is the same as this:
fputcsv($fp, $data, ";", '"');
The function doesn't support disabling the enclosure completely, but without it, CSV is fundamentally a very simple format - just combine the fields separated by some character and end the line, e.g. using implode:
fwrite($fp, implode(";", $data) . "\n");
Character encoding is a completely separate issue. For that, you need to know two things:
1. What encoding is your data in?
2. What encoding does the remote system want it in?
If these don't match, you can use a function like iconv or mb_convert_encoding.
If your output is not UTF-8, you should also remove the line at the beginning of your code that outputs the UTF-8 BOM.
If your data is stored in UTF-8, and the remote system accepts data in UTF-8, you don't need to do anything here.
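As a rough illustration of the case where the encodings do not match, the sketch below assumes the data is UTF-8 and the remote system wants Windows-1252. The helper name convert_row() and the sample row are invented for the example, and it writes to php://memory rather than a real file:

```php
<?php
// Hypothetical helper: converts every field of a row from UTF-8 to the target encoding.
function convert_row(array $row, string $to = 'Windows-1252'): array
{
    return array_map(
        static function ($field) use ($to) {
            return iconv('UTF-8', $to, (string) $field);
        },
        $row
    );
}

// Sample data, written with byte escapes so the example does not depend
// on the encoding of this source file.
$rows = [
    ['Avion', "Prix en \xC2\xA3", "Tr\xC3\xA8s rapide"], // "£" and "è" in UTF-8
];

$fp = fopen('php://memory', 'w+b');
foreach ($rows as $row) {
    // No UTF-8 BOM is written, because the output is not UTF-8.
    // Enclosure and escape are passed explicitly with their traditional defaults.
    fputcsv($fp, convert_row($row), ';', '"', '\\');
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo bin2hex($csv), "\n";
```

Converting each field before calling fputcsv keeps the CSV structure (separators, quotes, line endings) in the hands of fputcsv, so only the field contents change encoding.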

fgets a UTF-8 txt file returns rubbish letters and true when file is blank

I assume that this is due to the UTF-8 txt file format. The txt file is totally empty, and when I try fgets($file_handle), I get these rubbish letters:
How do I fix this? I want to check whether the file is empty by using:
if ( !$file_data = fgets($file_handle) )
// This code runs if file is empty
EDIT
This is a new file using encoding UTF-8:

This has to do with the BOM (Byte Order Mark) added by Notepad to detect the encoding:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
From this article you can also see that:
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF
We should therefore be able to write a PHP function to account for this:
function is_utf8_file_empty($filename)
{
    $file = @fopen($filename, "r");
    // Read the whole file; a file that holds only the BOM counts as empty
    $bom = fread($file, filesize($filename));
    fclose($file);
    if ($bom == b"\xEF\xBB\xBF") {
        return true;
    }
    return false;
}
Do be aware that this is specific to files created in the manner you described and this is just example code - you should definitely test this and possibly modify it to better handle large files, files that are completely empty, etc.
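One possible way to cover large files and completely empty files is to read only the first few bytes instead of the whole file. This is just a sketch; the function name is_effectively_empty() is made up, and it treats "empty" as "zero bytes, or nothing but a UTF-8 BOM":

```php
<?php
// Hypothetical helper: true when the file has no content apart from an optional UTF-8 BOM.
function is_effectively_empty(string $filename): bool
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    // Four bytes are enough: the BOM is three bytes, so a fourth byte means real content.
    $head = fread($handle, 4);
    fclose($handle);

    return $head === '' || $head === "\xEF\xBB\xBF";
}
```

Because at most four bytes are read, the cost is the same for a 10-byte file and a 10-gigabyte one.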

DOCX Encoding issues

I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some encoding issues. Any non-ANSI characters do not encode properly and make the output DOCX corrupt. MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in Notepad++, I can see the problematic characters. By going to the encoding menu and selecting "Encode in ANSI", I can see the characters normally: they are pound (£) symbols. When N++ is set to "Encode in UTF-8" they appear as a hexadecimal value.
By selecting the N++ option to "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - The whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";
$zip = new ZipArchive();
if ($zip->open($target, ZIPARCHIVE::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";
$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);
// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable)
{
    $find[$x] = '/' . $matches[0][$x] . '/';
    // Variable variable: braces keep the lookup unambiguous on PHP 7+
    $replace[$x] = ${$matches[1][$x]};
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);
$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents_utf8 = utf8_encode($file_contents_utf8);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?

Don't use any conversion functions; simply use UTF-8 everywhere.
Let's check that you really have UTF-8: in PHP, apply the bin2hex() function to the string that supposedly contains £. You should see c2a3, which is the UTF-8 encoding of £ in hex.
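To make that check concrete, here is a small sketch comparing the two byte sequences you are likely to meet. The variable names and the looks_like_utf8() helper are invented for the example; the helper relies on preg_match() with the u modifier failing on input that is not valid UTF-8:

```php
<?php
$utf8Pound   = "\xC2\xA3"; // "£" encoded as UTF-8
$cp1252Pound = "\xA3";     // "£" encoded as Windows-1252 ("ANSI")

$hexUtf8   = bin2hex($utf8Pound);   // c2a3
$hexCp1252 = bin2hex($cp1252Pound); // a3

// Hypothetical helper: the /u modifier makes preg_match() reject invalid UTF-8.
function looks_like_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

var_dump($hexUtf8, $hexCp1252, looks_like_utf8($utf8Pound), looks_like_utf8($cp1252Pound));
```

If bin2hex() shows a3 on its own rather than c2a3, the string reaching the template is not UTF-8 yet, and the fix belongs wherever that string comes from (for example the database connection charset).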

PHP fread() Function Returning Extra Characters at the Front on UTF-8 Text Files

When I use fread() on a normal text file (for example, an ANSI file saved normally with Notepad), the returned content string is correct, as expected.
But when I read a UTF-8 text file, the returned content string contains invisible characters at the front. I say invisible because the extra characters can't be seen in normal output (e.g. when echoing the string just to read it). But when the content string is used for processing (for example, building a link with an href value), a problem arises.
$filename = "blabla.txt";
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);
echo '<a href="'.$contents.'">'.$contents.'</a>';
I put only http://www.google.com in the UTF-8 encoded text file. When you run the PHP file, you will see an output link http://www.google.com
.. but you will never reach Google.
That's because the href address actually looks like this:
%EF%BB%BFhttp://www.google.com
It means fread added the weird characters %EF%BB%BF at the front.
This is extra annoying. Why is it happening?
Added:
Some are pointing out that this is the BOM. So, BOM or whatever, it is changing my original values. Now it causes problems in other steps, function calls, etc., and I have to use substr($string,3) for all outputs. Changing the original values like this is total nonsense.

This is called the UTF-8 BOM. Please refer to http://en.wikipedia.org/wiki/Byte_order_mark
It is something that is optionally added to the beginning of UTF-8 files, meaning it is in the file, and not something fread adds. Most text editors won't display the BOM, but some will -- mostly those that don't understand it. Not all editors will add it to UTF-8 files, but yet again, some will...
For UTF-8 the use of a BOM is not recommended, as byte order has no meaning in UTF-8 and many programs do not understand the mark.

It is the UTF-8 BOM. If you look at the docs for fread, someone has discussed a solution for it.
The solution given over there is the following:
// Reads past the UTF-8 bom if it is there.
function fopen_utf8 ($filename, $mode) {
    $file = @fopen($filename, $mode);
    $bom = fread($file, 3);
    if ($bom != b"\xEF\xBB\xBF")
        rewind($file); // no BOM: go back to the start of the file
    else
        echo "bom found!\n";
    return $file;
}
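If the content has already been read into a string, the same idea can be applied to the string itself instead of the file handle. A minimal sketch, with a made-up function name, that removes the three BOM bytes only when they are present:

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM from a string, if there is one.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

$contents = strip_utf8_bom("\xEF\xBB\xBFhttp://www.google.com");
echo $contents, "\n"; // http://www.google.com
```

Unlike an unconditional substr($string, 3), this leaves files saved without a BOM untouched.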

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards

0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
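For the iconv route mentioned above, a short sketch could look like this. It assumes the data really is Windows-1252 (code page 1252), which is a guess based on the 0x95 byte; the sample line is invented:

```php
<?php
$line = "Item \x95 one"; // 0x95 is the bullet in Windows-1252

// Convert the whole line to UTF-8 instead of deleting the character.
$utf8 = iconv('Windows-1252', 'UTF-8', $line);

echo bin2hex($utf8), "\n"; // the bullet becomes the three bytes e2 80 a2
```

After the conversion the bullet is U+2022 encoded as UTF-8, so it can be matched or replaced as an ordinary string such as "\xE2\x80\xA2".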
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.

Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
if( !($_FILES['file']['error'] == 4) ) {
    foreach($_FILES as $file) {
        $n = $file['name'];
        $s = $file['size'];
        $filename = $file['tmp_name'];
        ini_set('auto_detect_line_endings',TRUE); // in case Mac csv
        // dealing with fgetcsv() special chars
        // read the file into a string, do your pre-processing changes
        // write it back out to a temporary file, and have fgetcsv() read that.
        $file = file_get_contents_utf8($filename);
        $tempFile = tempnam(sys_get_temp_dir(), '');
        $handle = fopen($tempFile, "w+");
        fwrite($handle,$file);
        fseek($handle, 0);
        $filename = $tempFile;
        // END -- dealing with fgetcsv() special chars
        $Array = analyse_file($filename, 10);
        $csvDelim = $Array['delimiter']['value'];
        while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
            // process the csv file
        }
    } // end foreach
}
