DOCX Encoding issues - php

I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some encoding issues. Any non-ANSI characters do not encode properly and make the output DOCX corrupt. MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in Notepad++, I can see the problematic characters. By going to the Encoding menu and selecting "Encode in ANSI", I can see the characters normally: they are pound (£) symbols. When N++ is set to "Encode in UTF-8", they appear as a hexadecimal value.
By selecting the N++ option "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - the whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";

$zip = new ZipArchive();
if ($zip->open($target, ZipArchive::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";

$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);

// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable)
{
    $find[$x] = '/' . $matches[0][$x] . '/';
    $replace[$x] = ${$matches[1][$x]}; // value of the variable named in the placeholder
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);

$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents_utf8 = utf8_encode($file_contents_utf8);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?

Don't use any conversion functions; simply use utf8 everywhere.
Let's check that you really have UTF-8: in PHP, apply the bin2hex() function to the string that supposedly contains £; you should see c2a3, which is the UTF-8 hex for £.
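A minimal diagnostic sketch along those lines ($file_contents comes from the question's code; the Windows-1252 fallback and the set_charset() hint are assumptions about where the single-byte data comes from, not part of this answer):

// Dump the raw bytes around a £: "c2a3" means the data is already UTF-8,
// a lone "a3" byte means it is Windows-1252/Latin-1.
$pos = strpos($file_contents, "\xA3");
if ($pos !== false) {
    echo bin2hex(substr($file_contents, max(0, $pos - 1), 2)), "\n";
}

// Only if the content really turns out not to be UTF-8, convert it once
// before writing it back into the archive:
if (!mb_check_encoding($file_contents, 'UTF-8')) {
    $file_contents = iconv('Windows-1252', 'UTF-8', $file_contents);
}

// If the Latin-1 bytes come from the database connection itself, setting the
// connection charset (e.g. $mysqli->set_charset('utf8mb4')) is usually the
// cleaner "use UTF-8 everywhere" fix.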

Related

PHP .rtf encoding problem with polish characters

I have a problem replacing Polish characters in an RTF file through PHP.
I want to find tagwords in the RTF file content and replace them with the relevant content.
So this is what I'm doing:
// Getting rtf file content
$content = file_get_contents('<link_to_file_here>');
// encoding to utf-8
$content = mb_convert_encoding($content, 'UTF-8');
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Częstochowa', $content);
// save rtf file with replaced content
file_put_contents('uploads/test.rtf', $content);
echo $content;
When I check what happened to the RTF file content after this code executed, I notice that Częstochowa was replaced with Cz\u0119stochowa.
Then I open the newly created RTF file in MS Word and see this: Częstochowa.
After this I decided to write Częstochowa manually in the RTF file and check what happens. I got the file content the same way (via file_get_contents) and noticed that MS Word had replaced my manually written Częstochowa with Cz\\'eastochowa. So I decided to do this:
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Cz\\\'eastochowa', $content);
And after this I open the file in MS Word and see this: Czêstochowa.
I googled a bit and found that ê is a character from the Unicode block “Latin-1 Supplement” (U+0080 to U+00FF) with code U+00EA, but Polish characters are in the Unicode block “Latin Extended-A” (U+0100 to U+017F), so I need to somehow encode the RTF file content accordingly.
I tried a lot of things but still haven't solved the problem.
I hope you can help. Thanks for your attention.
Found a solution:
// Convert to HTML numeric entities (Cz&#281;stochowa), then turn "&#" into the RTF "\u" escape;
// the trailing ";" becomes the one-character ANSI fallback that RTF readers skip after \uN.
$string = str_replace('&#', "\\u", mb_convert_encoding('Częstochowa', 'HTML-ENTITIES'));
$content = str_replace('[company_address]', $string, $content);
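A more general variant of the same idea, as a sketch only (rtf_escape is a made-up helper name; it assumes UTF-8 input and only handles code points up to U+7FFF):

// Convert a UTF-8 string into RTF \uN escapes so it survives any ANSI code page.
function rtf_escape($text) {
    $out = '';
    // UTF-16BE gives one 16-bit code unit per character for BMP characters.
    foreach (unpack('n*', mb_convert_encoding($text, 'UTF-16BE', 'UTF-8')) as $unit) {
        // Keep plain ASCII as-is (escaping RTF's own special characters),
        // emit everything else as \uN followed by "?" as the ANSI fallback.
        $out .= $unit < 128 ? addcslashes(chr($unit), "\\{}") : '\\u' . $unit . '?';
    }
    return $out;
}

$content = str_replace('[company_address]', rtf_escape('Częstochowa'), $content);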

fgets a UTF-8 txt file returns rubbish letters and true when file is blank

I assume that this is due to the UTF-8 txt file format. The txt file is totally empty, and when I tried fgets($file_handle), I got rubbish letters back instead of an empty result.
How do I fix this? I want to check if the file is empty by using:
if ( !$file_data = fgets($file_handle) )
// This code runs if file is empty
EDIT
This is a new file created with UTF-8 encoding.
This has to do with the BOM (Byte Order Mark) added by Notepad to detect the encoding:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
From this article you can also see that:
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF
We should therefore be able to write a PHP function to account for this:
function is_utf8_file_empty($filename)
{
    // A file saved as UTF-8 by Notepad still contains the 3-byte BOM even when it is "empty".
    if (filesize($filename) === 0) {
        return true; // genuinely empty: no BOM at all
    }
    $file = @fopen($filename, "rb");
    $contents = fread($file, filesize($filename));
    fclose($file);
    return $contents === "\xEF\xBB\xBF"; // true if the file holds nothing but the BOM
}
Do be aware that this is specific to files created in the manner you described and this is just example code - you should definitely test it and possibly modify it so it better handles large files, files without a BOM, etc.
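Alternatively, you could simply skip a leading BOM before reading, so your original fgets() check works unchanged (just a sketch; $filename stands in for your file):

$file_handle = fopen($filename, "rb");
// Skip a leading UTF-8 BOM, if present, so fgets() never returns it as rubbish letters.
if (fread($file_handle, 3) !== "\xEF\xBB\xBF") {
    rewind($file_handle); // no BOM found: start again from the very first byte
}
if (!($file_data = fgets($file_handle))) {
    // This code runs if the file is empty (or contained only a BOM)
}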

php merging txt files, issue with encoding

I found this code on Stack Overflow, from user @Attgun:
link: merge all files in directory to one text file
<?php
// Name of the directory containing all files to merge
$Dir = "directory";
// Name of the output file
$OutputFile = "filename.txt";
// Scan the files in the directory into an array
$Files = scandir($Dir);
// Create a stream to the output file. Use "w" to start a new output file
// from zero. If you want to append to an existing file, use "a".
$Open = fopen($OutputFile, "w");
// Loop through the files, read their content into a string variable and
// write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
    if ($v != "." AND $v != "..") {
        $Data = file_get_contents($Dir."/".$v);
        fwrite($Open, $Data);
    }
    unset($Data);
}
// Close the file stream
fclose($Open);
?>
The code works right, but when it is merging, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use another encoding than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.
@AlexHowansky motivated me to search for another way.
The solution that seems to work without messing with the file encoding is this:
bat file:
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files that it reads.
My compiler doesn't show any error message like before!
Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() mode in order to be sure that line feeds are not mangled, but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case, because the Unicode standard defines two possible orders for the bytes that make up a multi-byte character (this is known as endianness), and that order is signalled by the byte order mark (BOM): a special character written at the very start of the file whose byte sequence tells the reader which endianness the file uses.
That prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format: all that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding that's no longer covered in most Unicode documentation), but since you have several samples you should be able to see it for yourself. Assuming it's the same as in UTF-16 LE (FF FE), you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that can handle UCS-2, so I've used UTF-16 LE. The BOM is the byte sequence FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
    if (pathinfo($file, PATHINFO_EXTENSION) === 'txt' && $file !== 'all.txt') {
        $data = file_get_contents($file);
        fwrite($output, $first ? $data : substr($data, 2));
        $first = false;
    }
}
fclose($output);
var_dump(
    bin2hex(file_get_contents('a.txt')),
    bin2hex(file_get_contents('b.txt')),
    bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.
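If the files could carry different BOMs (UTF-8, UTF-16/UCS-2 LE or BE), a slightly more defensive variant of the loop could strip whichever known BOM each file starts with. This is only a sketch building on the example above, not part of the original answer:

// Return $data with any leading BOM removed (UTF-8, UTF-16/UCS-2 LE or BE).
function strip_bom($data) {
    foreach (array("\xEF\xBB\xBF", "\xFF\xFE", "\xFE\xFF") as $bom) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return substr($data, strlen($bom));
        }
    }
    return $data;
}

// Inside the loop: keep the first file as-is, strip the BOM from the rest.
fwrite($output, $first ? $data : strip_bom($data));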

write file with special characters in php

SOLUTION:
$output = '–– € ––';
// Written like this, PHP 5 does not understand it because it treats the string as single-byte chars,
// so I found the function below to write a multi-byte char into a string.
// Unicode version of PHP's chr():
function uchr($codes) {
    if (is_scalar($codes)) $codes = func_get_args();
    $str = '';
    foreach ($codes as $code) $str .= html_entity_decode('&#'.$code.';', ENT_NOQUOTES, 'UTF-8');
    return $str;
}
// Decimal code points: – 8211, – 8211, [space] 32, € 8364, [space] 32, – 8211, – 8211
$output = uchr(8211,8211,32,8364,32,8211,8211);
// or
$output = uchr(8211,8211).' '.uchr(8364).' '.uchr(8211,8211);
echo $output;
QUESTION:
How can I write these special chars to a simple file?
$file = "./upload/myfile.txt";
$output = "–– € ––".PHP_EOL; // the "–" is not an underscore _ or - but –
file_put_contents($file, $output);
If I access this file from the browser at http://mydomain.com/upload/myfile.txt I only get "�" characters.
However, if I save "–– € ––" with Zend Developer or my local text editor (on OSX) and upload it, everything is perfectly fine; the browser shows it correctly.
How can I achieve this with PHP? It seems PHP writes the file differently than my MacBook does, though I thought PHP's standard was UTF-8, and I also saved the file as UTF-8 in my local text editor.
EXTRA INFO: in the .htaccess file that's in the upload folder I wrote:
AddDefaultCharset utf-8
AddCharset utf-8 .txt
otherwise the Firebug addon for Firefox gave a message that the charset was not specified.
Any ideas?
It has to do with saving the file, because my uploaded file shows correctly.
I tried different options while saving the file, like:
$output = mb_convert_encoding($output, 'UTF-8', 'OLD-ENCODING');
and the iconv() function of PHP, but I can't find the solution.
Any help is greatly appreciated.
EDIT: if I get the content from my uploaded file and echo it, the following happens:
$output = file_get_contents('./upload/myuploadedfile.txt', FILE_USE_INCLUDE_PATH);
// it shows correctly: –– € ––
$output = $output[1]; // it shows a �
$output = $output[3]; // it shows a �
echo $output;
PHP will write the contents of the file exactly as they are in your source code. It takes bytes exactly as they are encoded in your .php file and puts them in a file. From then it depends on how the file is interpreted. Assuming your source code is actually UTF-8 encoded, so will the file be. Try opening it with a text editor that can understand UTF-8. Change the encoding the browser interprets it with to UTF-8 (View menu > Encoding). Check if the web server actually sets the correct charset header when you open it in the browser (Firebug Network tab, headers of the response).
It's correct that $output[0] shows a broken UTF-8 character, since PHP only gives you the first byte of the multi-byte character "–".
For more in-depth information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
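For example, to pull out whole characters instead of single bytes (a small sketch, assuming the file content really is UTF-8):

$output = file_get_contents('./upload/myfile.txt');

echo $output[0];                                   // one raw byte: the first byte of "–", shows up as �
echo mb_substr($output, 0, 1, 'UTF-8');            // one full character: "–"
echo bin2hex(mb_substr($output, 0, 1, 'UTF-8'));   // "e28093", the three UTF-8 bytes of U+2013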

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a CSV file. When I open the text file with vim it shows me the character is <95>. I can't for the life of me figure out what character this is, in order to use preg_replace on it. Any help would be appreciated.
Thanks,
Chris Edwards
0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.
Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

if (!($_FILES['file']['error'] == 4)) {
    foreach ($_FILES as $file) {
        $n = $file['name'];
        $s = $file['size'];
        $filename = $file['tmp_name'];
        ini_set('auto_detect_line_endings', TRUE); // in case of a Mac csv

        // dealing with fgetcsv() special chars:
        // read the file into a string, do your pre-processing changes,
        // write it back out to a temporary file, and have fgetcsv() read that.
        $file = file_get_contents_utf8($filename);
        $tempFile = tempnam(sys_get_temp_dir(), '');
        $handle = fopen($tempFile, "w+");
        fwrite($handle, $file);
        fseek($handle, 0);
        $filename = $tempFile;
        // END -- dealing with fgetcsv() special chars

        $Array = analyse_file($filename, 10);
        $csvDelim = $Array['delimiter']['value'];

        while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
            // process the csv file
        }
    } // end foreach
}
