PHP .rtf encoding problem with Polish characters - php

I have a problem replacing Polish characters in an RTF file with PHP.
I want to find tag words in the RTF file's content and replace them with the relevant content.
Here is what I'm doing:
// Getting rtf file content
$content = file_get_contents('<link_to_file_here>');
// encoding to utf-8
$content = mb_convert_encoding($content, 'UTF-8');
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Częstochowa', $content);
// save rtf file with replaced content
file_put_contents('uploads/test.rtf', $content);
echo $content;
When I checked the RTF file content after running this code, I noticed that Częstochowa had been replaced with Cz\u0119stochowa.
Then I opened the newly created RTF file in MS Word and saw this: Częstochowa.
After this I decided to type Częstochowa manually in the RTF file and check what happens. I read the file content the same way (via file_get_contents) and noticed that MS Word had stored my manually typed Częstochowa as Cz\'eastochowa. So I decided to do this:
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Cz\\\'eastochowa', $content);
After this I opened the file in MS Word and saw this: Czêstochowa
I googled a bit and found that ê is a character from the Unicode block “Latin-1 Supplement” (U+0080 to U+00FF) with code U+00EA, but Polish characters are in the Unicode block “Latin Extended-A” (U+0100 to U+017F), so I need to encode the RTF file content accordingly somehow.
I tried a lot of things but still haven't solved the problem.
I hope you can help. Thanks for your attention.

Found a solution:
$string = str_replace('&#', "\\u", mb_convert_encoding('Częstochowa', 'html'));
$content = str_replace('[company_address]', $string, $content);
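The solution above works because RTF accepts \uN escapes followed by one fallback character (here the trailing ";" left over from the HTML entity). Note that the 'html' (HTML-ENTITIES) encoding of mbstring is deprecated as of PHP 8.2. A minimal sketch of an alternative that builds the escapes directly is shown below; the helper name rtf_escape_utf8 is hypothetical, the input is assumed to be valid UTF-8, and the document is assumed to use the default \uc1 setting (one fallback character per escape). It does not handle line breaks.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for use as plain text inside an RTF document.
function rtf_escape_utf8(string $text): string
{
    // Split into characters; the "u" flag makes the pattern UTF-8 aware.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            // ASCII: only backslash and braces need escaping in RTF text.
            $out .= in_array($char, ['\\', '{', '}'], true) ? '\\' . $char : $char;
            continue;
        }
        // Non-ASCII: emit one \uN? escape per UTF-16 code unit.
        // RTF expects N as a signed 16-bit decimal, so values above 32767 wrap negative.
        foreach (unpack('n*', mb_convert_encoding($char, 'UTF-16BE', 'UTF-8')) as $unit) {
            $out .= '\\u' . ($unit > 32767 ? $unit - 65536 : $unit) . '?';
        }
    }
    return $out;
}

// Assumes this source file is saved as UTF-8.
echo rtf_escape_utf8('Częstochowa'), "\n"; // Cz\u281?stochowa
```

The result can be passed to str_replace() in place of $string in the snippet above.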

Related

file_get_contents returns bizarre characters from raw text file

This is very bizarre. I have a .txt file on my Windows server. I'm using file_get_contents to retrieve it, but the first several characters show up as a diamond with a question mark inside them. I've tried recreating the file from scratch and it's the same result. What's really bizarre is other files don't have this issue.
Also, if I put a * at the start of the file it seems to fix it, but if I try to open the file and do it with PHP it's still messed up.
The start of the file in question begins with: Trinity Cannon - that's a direct copy and paste from the text file. I've tried re-typing it and the first few characters are always that diamond with a question mark.
$myfile='C:\\inetpub\\wwwroot\\fastpitchscores\\data\\2020.txt';
$fh = file_get_contents($myfile);
echo $fh; // Trinity Cannon
echo $fh[0]; // �
It sounds like whatever editor you used to originally create the file added a UTF byte order mark (BOM) at the beginning of the file.
You typically can't edit the BOM from within an editor. If your editor has encoding conversion functionality, try converting to ASCII. For example, in Notepad++ use Encoding->Encode in ANSI.
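If changing the file in an editor is not an option, the BOM can also be dropped in PHP after reading. This is a minimal sketch that assumes the mark is the UTF-8 BOM (bytes EF BB BF); the helper name strip_utf8_bom is hypothetical.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) if one is present.
function strip_utf8_bom(string $text): string
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Example with an in-memory string standing in for the file contents.
$fh = "\xEF\xBB\xBFTrinity Cannon";
$fh = strip_utf8_bom($fh);
echo $fh[0], "\n"; // T
```

In the question's code the helper would be applied to the return value of file_get_contents($myfile).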

Replacing smart quote with regular quote causes entire string to be erased

I have some files that were originally RTF files. They were opened with Microsoft Word 2016 and saved as .txt files. No other changes were made to the files. They were transferred to a Linux system.
When using the command:
file myfile.txt on the Linux they are showing as Non-ISO extended-ASCII text, with CRLF line terminators.
I am reading the files into PHP and processing them line by line. I am trying to replace any right smart-quotes with regular single quotes, but my entire string is being erased.
My code looks like this:
$text = "I can’t go for supper";
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;
The apostrophe here is a right smart quote which shows up in Vim as <92>. Upon researching on the web, I have discovered this is actually unicode character 2019.
However, when I try to display the new value of $text nothing is displayed.
What is wrong with my code and why is it wiping out the entire string of text?
Upon further research, I have determined that character code <92> is specific to the Windows-1252 character encoding. I first needed to convert the text to UTF-8 before I was able to manipulate the string. With the /u modifier, preg_replace() returns NULL when the subject is not valid UTF-8, which is why the whole string appeared to be erased.
The following code works correctly:
$text = "I can’t go for supper";
$text = iconv("Windows-1252", "UTF-8", $text);
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;
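The failure and the fix can be reproduced without the original file. This sketch writes the Windows-1252 byte explicitly as "\x92" so it does not depend on how the script itself is saved; the variable names are illustrative only.

```php
<?php
// "\x92" is the Windows-1252 byte for the right single quotation mark (U+2019).
$raw = "I can\x92t go for supper";

// A lone 0x92 byte is not valid UTF-8, so with /u preg_replace() fails and returns null.
$broken = preg_replace('/\x{2019}/u', "'", $raw);
$error  = preg_last_error(); // PREG_BAD_UTF8_ERROR

// Convert to UTF-8 first, then the same pattern matches.
$utf8  = iconv('Windows-1252', 'UTF-8', $raw);
$fixed = preg_replace('/\x{2019}/u', "'", $utf8);

var_dump($broken); // NULL
echo $fixed, "\n"; // I can't go for supper
```

Checking preg_last_error() after a null result is a quick way to tell an encoding problem apart from a pattern that simply did not match.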

DOCX Encoding issues

I have a PHP script that reads information in from a MySQL Database and puts it into a DOCX file, using a template. In the template, there are placeholders called <<<variable_name>>> where variable_name is the name of the MySQL field.
DOCX files are Zip archives, so my PHP script uses the ZipArchive library to open up the DOCX and edit the document.xml file, replacing the placeholders with the correct data.
This worked fine until today, when I ran into some coding issues. Any non-ANSI characters do not encode properly and make the output DOCX corrupt. MS Word gives the error message "Illegal XML character."
When I unzip the document and open document.xml in notepad++, I can see the problematic characters. By going to the encoding menu, and selecting "Encode in ANSI", I can see the characters normally: They are Pound (£) symbols. When N++ is set to "Encode in UTF-8" they appear as a hexadecimal value.
By selecting the N++ option to "Convert to UTF-8", the characters appear OK in UTF-8 and MS Word opens the document perfectly. But I don't want to manually unzip my DOCX archive every time I create something - The whole point of the script is to make generating the document quick and easy.
Obviously I need the PHP script to output the file in UTF-8 to make the '£' characters appear properly.
My code (Partially copied from another question on SO):
if (!copy($source, $target)) // make a duplicate so we don't overwrite the template
    print "Could not duplicate template.\n";
$zip = new ZipArchive();
if ($zip->open($target, ZipArchive::CHECKCONS) !== TRUE)
    print "Source is not a docx.\n";
$content_file = substr($source, -4) == '.odt' ? 'content.xml' : 'word/document.xml';
$file_contents = $zip->getFromName($content_file);
// Code here to process the file, get list of substitutions to make
foreach ($matches[0] as $x => $variable)
{
    $find[$x] = '/' . $matches[0][$x] . '/';
    $replace[$x] = ${$matches[1][$x]}; // variable variable named after the placeholder
}
$file_contents = preg_replace($find, $replace, $file_contents, -1, $count);
$zip->deleteName($content_file);
$zip->addFromString($content_file, $file_contents);
$zip->close();
chmod($target, 0777);
I have tried:
$file_contents = iconv("Windows-1252", "UTF-8", $file_contents);
And:
$file_contents_utf8 = utf8_encode($file_contents_utf8);
To try to get the PHP script to encode the file in UTF-8.
How can I make the PHP script encode the file into UTF-8 when saving, using the ZipArchive library?
Don't use any conversion functions; simply use utf8 everywhere.
To check that you really have UTF-8, apply PHP's bin2hex() function to the string that supposedly contains £. You should see c2a3, which is the hex form of the UTF-8 encoding of £.
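As a small illustration of that check, the sketch below compares a UTF-8 pound sign with the single-byte form used by Windows-1252 and Latin-1. The variable names are illustrative, and the "\u{...}" escape assumes PHP 7 or later.

```php
<?php
$utf8Pound = "\u{00A3}"; // UTF-8 encoded pound sign (two bytes)
$ansiPound = "\xA3";     // Windows-1252 / Latin-1 pound sign (one byte)

$utf8Hex = bin2hex($utf8Pound); // "c2a3"
$ansiHex = bin2hex($ansiPound); // "a3"

// mb_check_encoding() reports whether the bytes form valid UTF-8.
$utf8Valid = mb_check_encoding($utf8Pound, 'UTF-8'); // true
$ansiValid = mb_check_encoding($ansiPound, 'UTF-8'); // false

echo $utf8Hex, ' ', $ansiHex, "\n"; // c2a3 a3
```

If the value coming from the database prints as a3, the data is reaching PHP in a single-byte encoding and the source of that (for example the database connection charset) is the place to look.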

write file with special characters in php

SOLUTION:
$output = '–– € ––';
// Written like this, PHP 5 does not understand it, because it interprets the string as single-byte chars.
// So I found the function below to write a multi-byte char in a string.
// Unicode version of PHP's chr()
function uchr ($codes) {
    if (is_scalar($codes)) $codes = func_get_args();
    $str = '';
    foreach ($codes as $code) $str .= html_entity_decode('&#'.$code.';', ENT_NOQUOTES, 'UTF-8');
    return $str;
}
// decimal values of the Unicode chars: – 8211, – 8211, [space] 32, € 8364, [space] 32, – 8211, – 8211
$output = uchr(8211,8211,32,8364,32,8211,8211);
// or
$output = uchr(8211,8211).' '.uchr(8364).' '.uchr(8211,8211);
echo $output;
QUESTION:
How can I write these special chars to a simple file?
$file = "./upload/myfile.txt";
$output = "–– € ––".PHP_EOL; // the "–" is not an underscore _ or - but –
file_put_contents($file, $output);
If I access this file from the browser at http://mydomain.com/upload/myfile.txt I only get "�" characters.
However, if I save "–– € ––" with Zend Developer or my local text editor (on OSX) and upload it, everything is perfectly fine. The browser shows it correctly.
How can I achieve this with PHP? It seems PHP uses a different way of writing the file than my MacBook, though I thought PHP's standard was UTF-8, and I also saved the file as UTF-8 in my local text editor.
EXTRA INFO: in the .htaccess file that's in the upload folder i wrote:
AddDefaultCharset utf-8
AddCharset utf-8 .txt
Otherwise the Firebug add-on for Firefox gave a message that the charset was not specified.
Any ideas?
It has to do with saving the file, because my uploaded file shows correctly.
I tried different options while saving the file, like:
$output = mb_convert_encoding($output, 'UTF-8', 'OLD-ENCODING');
and PHP's iconv function, but I can't find the solution.
Any help is greatly appreciated.
EDIT: if I get the content from my uploaded file and echo it, the following happens:
$output = file_get_contents('./upload/myuploadedfile.txt',FILE_USE_INCLUDE_PATH);
//it show correctly –– € ––
$output = $output[1]; //it shows a �
$output = $output[3]; //it shows a �
echo $output;
PHP will write the contents of the file exactly as they are in your source code. It takes bytes exactly as they are encoded in your .php file and puts them in a file. From then it depends on how the file is interpreted. Assuming your source code is actually UTF-8 encoded, so will the file be. Try opening it with a text editor that can understand UTF-8. Change the encoding the browser interprets it with to UTF-8 (View menu > Encoding). Check if the web server actually sets the correct charset header when you open it in the browser (Firebug Network tab, headers of the response).
It's correct that $output[0] shows a broken UTF-8 character, since PHP only gives you the first byte of the multi-byte character "–".
For more in-depth information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
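The two points made in this answer (bytes are written unchanged, and string offsets address bytes rather than characters) can be seen in a short sketch. It builds the string from "\u{...}" escapes (PHP 7 or later) so it does not depend on the encoding of the script file, and it writes to a temporary file rather than the ./upload path from the question.

```php
<?php
// "–– € ––" built from code points, so the bytes are UTF-8 regardless of how this file is saved.
$output = "\u{2013}\u{2013} \u{20AC} \u{2013}\u{2013}";

$byteLength = strlen($output);             // 17 bytes
$charLength = mb_strlen($output, 'UTF-8'); // 7 characters

$firstByte = $output[0];                        // "\xE2": only the first byte of "–"
$firstChar = mb_substr($output, 0, 1, 'UTF-8'); // the complete "–"

// file_put_contents() writes the bytes as they are; reading them back gives the same bytes.
$file = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($file, $output);
$roundTrip = file_get_contents($file);
unlink($file);

echo bin2hex($firstByte), ' ', bin2hex($firstChar), "\n"; // e2 e28093
```

Since the round trip preserves the bytes, a wrong display in the browser points at the charset the server announces or at the encoding of the PHP source file, not at file_put_contents().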

PHP fread() Function Returning Extra Characters at the Front on UTF-8 Text Files

When I use fread() on a normal text file (for example, an ANSI file saved with Notepad), the returned content string is correct, as expected.
But when I read a UTF-8 text file, the returned content string contains invisible characters at the front. I say invisible because the extra characters can't be seen in normal output (e.g. a plain echo). But when the content string is used for processing (for example, building a link with an href value), a problem arises.
$filename = "blabla.txt";
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);
echo '<a href="'.$contents.'">'.$contents.'</a>';
I put only http://www.google.com in the UTF-8 encoded text file. When you run the PHP file, you will see an output link http://www.google.com
.. but it will never take you to Google.
That is because the href address ends up like this:
%EF%BB%BFhttp://www.google.com
It means fread added the weird characters %EF%BB%BF at the front.
This is extra annoying. Why is it happening?
Added:
Some point out that it is a BOM. BOM or not, it is changing my original values. So now it causes problems in other steps, function calls, etc. Now I have to use substr($string,3) on all outputs. Changing the original values like this makes no sense.
This is called the UTF-8 BOM. Please refer to http://en.wikipedia.org/wiki/Byte_order_mark
It is something that is optionally added to the beginning of UTF-8 files, meaning it is in the file, and not something fread adds. Most text editors won't display the BOM, but some will, mostly those that don't understand it. Not all editors will add it to UTF-8 files, but yet again, some will...
For UTF-8 the use of a BOM is not recommended, as it has no meaning there and many programs do not understand it.
It is the UTF-8 BOM. If you look at the docs for fread, someone has discussed a solution for it.
The solution given there is the following:
// Reads past the UTF-8 BOM if it is there.
function fopen_utf8 ($filename, $mode) {
    $file = @fopen($filename, $mode);
    if ($file === false)
        return false; // could not open the file
    $bom = fread($file, 3);
    if ($bom != b"\xEF\xBB\xBF")
        rewind($file); // no BOM: go back to the start of the file
    else
        echo "bom found!\n";
    return $file;
}
