Encoding problems in odtphp segments

I'm writing text from a database into an ODT document table with odtphp, following this example: http://www.odtphp.com/index.php?i=tutorials&p=tutorial6. In the generated ODT, some international characters are encoded incorrectly (or not encoded at all?). I had a similar problem with other values outside segments, set with the setVars() function, but solved it using
$odf->setVars($k, $v, true, 'UTF-8');
There seems to be no equivalent setting for segment values.

It looks like all text in segments was encoded to UTF-8 a second time, even though it was already UTF-8.
For now I have solved the issue by replacing line 203 of odtphp's Segment.php with the following code:
return $this->setVars($meth, $args[0], false, 'UTF-8');
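
The double encoding described above can be reproduced outside odtphp. The following is a minimal sketch, assuming the mbstring extension is available and that the second pass treats its input as ISO-8859-1; the variable names are illustrative and not part of odtphp.

```php
<?php
// "é" as it arrives from a UTF-8 database connection.
$utf8 = "\xC3\xA9";

// Encoding it to UTF-8 a second time (treating the bytes as ISO-8859-1)
// turns each byte into its own character: the result displays as "Ã©".
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2a9

// Reversing exactly one pass restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo bin2hex($restored), "\n"; // c3a9
```

Passing false as the third argument in the fix above appears to skip that second pass, which matches this behaviour.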

Related

Artifacts in text file

I have a test project that allows uploading various text files (subtitles), whose content is then displayed. The problem occurs with non-ASCII characters, such as Cyrillic or letters with diacritics like ŠŽČĆ. The characters in the text file are fine before upload, but when I open the uploaded file on the server, all ŠŽČĆĐ characters are replaced by a rectangle placeholder.
I use this line, which works great on localhost but fails on shared hosting:
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
where $tmp is the string to be converted.
Is this a hosting issue? Could I do something to prevent it?
PS: If I don't use utf8_encode() on the $tmp variable, the server throws an error.
Edit1:
The first image shows how the file looks when opened on the shared hosting.
When I copy and paste the rectangle character into this post, it does not get rendered on SO, so I posted an image of it instead of typed characters. It is the character from the uploaded file, copied and pasted while writing this post.
Edit2:
I sort of figured out what the problem is. The file is correctly saved as UTF-8 and contains the letters mentioned above.
When the file gets uploaded, these letters are changed to rectangles. So when I open the file on the server, instead of ŠŽČĆĐ I get rectangles. How can I prevent the server from changing anything, so the file is uploaded as is?
So it's not a formatting issue, although setting the encoding to UTF-8 seems to help at least display it, and if I don't set the encoding to UTF-8, it throws an error.
I'm using Laravel as the backend.
Edit3:
If I test a specific character after reading it from the file with this:
mb_detect_encoding(file($path)[8][9]); // It should be the š character
it reports UTF-8, but if it really were UTF-8 the character would be displayed correctly.
If I try this line:
mb_convert_encoding(file($path)[8][9], "UTF-8", "ISO-8859-1")
then it shows the same rectangle as in the file on the server.
If I detect the encoding with additional parameters, like:
mb_detect_encoding(file($path)[8], "UTF-8", TRUE);
to determine whether it is actually UTF-8, it returns false.
And if I paste the rectangle into Google Translate, it shows an "š", which is the correct letter.
If I use bin2hex() to see the hex code, for example with the š letter as the argument, I get hex code 9a.
If anybody has any idea how to write a function that distinguishes between these rectangles and shows the correct hex code or the character itself, or how to upload to shared hosting without letting it change the encoding of letters in the text file, or how to approach the whole problem, I would be much obliged.

Do not use utf8_encode(). It only converts from ISO-8859-1 and does not work with Windows-1252.
https://www.php.net/manual/en/function.utf8-encode.php
The second problem is that your code performs double encoding. I have marked the two functions that each convert a string to UTF-8.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
/*      ^^^^^                                                             ^^^^^^^^^^^ */
If the code below does not work, I would debug the output of mb_detect_encoding($tmp, mb_detect_order(), true). The default values of mb_detect_order() may be far from optimal for your situation.
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", $tmp);
You can use mb_convert_encoding() in place of iconv().
https://www.php.net/manual/en/function.mb-convert-encoding.php
For your problem I would write this code:
/* If there are no Asian languages, UTF-8 is the only encoding that mb_detect_encoding() can recognize. Strict mode is needed so that invalid UTF-8 returns false. */
if (mb_detect_encoding($tmp, 'UTF-8', true)) {
    $temp = $tmp;
} else {
    /* It is not UTF-8. Assume WINDOWS-1252. */
    $temp = mb_convert_encoding($tmp, 'UTF-8', 'WINDOWS-1252');
}
It is very hard to reliably detect a particular single-byte encoding. I am not aware of any built-in PHP function for this.
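
The approach above can be wrapped in a small function and checked against the 9a byte reported in the question. This is a sketch under assumptions: toUtf8() is a hypothetical helper name, the mbstring extension is assumed, and Windows-1252 is only the assumed fallback, since the real source encoding cannot be detected reliably.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as is, otherwise convert from an
// assumed single-byte encoding (a guess, passed in by the caller).
function toUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$fromFile = "\x9A";                    // the byte bin2hex() reported for š
echo bin2hex(toUtf8($fromFile)), "\n"; // c5a1, the UTF-8 bytes of š

$alreadyUtf8 = "\xC5\xA1";                // š stored as UTF-8
echo bin2hex(toUtf8($alreadyUtf8)), "\n"; // c5a1, left unchanged
```

A lone 9a byte is not valid UTF-8, which is consistent with the strict mb_detect_encoding() call in the question returning false.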

Problems with php xml characters

Hello friends, I have a problem with some characters when reading an XML file from PHP. I am using this source code:
$file = 'test.xml';
$xml_1 = simplexml_load_file($file);
echo ($xml_1->content);
It works fine, but when the content contains a special character like ñ or ó, it shows strange characters like Ã±. I tried declaring the UTF-8 charset in the HTML head, but the result is the same.

SimpleXML emits UTF-8 output by design. If your application does not support UTF-8, you'll have to convert with the usual tools (e.g. mb_convert_encoding()), but you need to take this into account:
You need to know for sure which encoding your app is using.
UTF-8 can hold the complete Unicode catalogue, so some characters may not have an equivalent in your target encoding.
In any case, in 2016 there's no reason to use anything other than UTF-8 unless you're maintaining legacy code.
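
The conversion mentioned above can be sketched as follows, assuming the mbstring extension and assuming the page really is served as ISO-8859-1; the XML string is a made-up stand-in for test.xml.

```php
<?php
// Stand-in for test.xml: "niño" with ñ stored as UTF-8 (c3 b1).
$source = '<?xml version="1.0" encoding="UTF-8"?>'
    . "<root><content>ni\xC3\xB1o</content></root>";

$xml  = simplexml_load_string($source);
$utf8 = (string) $xml->content;   // SimpleXML always hands back UTF-8

// Only convert when the output page is NOT UTF-8 (assumed ISO-8859-1 here).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 6e69c3b16f
echo bin2hex($latin1), "\n"; // 6e69f16f
```

The two bytes c3 b1 shown as ISO-8859-1 are exactly the "Ã±" from the question, which suggests the page was being interpreted as ISO-8859-1 despite the charset in the head.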

I finally found the solution: I had to use the utf8_decode() PHP function to convert the characters. Putting the UTF-8 charset in the page head is not enough; you have to convert with PHP first.

XML file isn't UTF-8 encoded when created in PHP

I'm trying to output an XML file using PHP, and everything works except that the created file isn't UTF-8 encoded, it's ANSI. (I see that when I open the file and choose Save as...)
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I found that non-English characters don't appear in the output.
I searched for a solution and first tried adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the PHP script, but the browser says:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tried some other suggestions, like not passing 'UTF-8' when creating the document but setting it separately:
$doc->encoding = 'UTF-8';
but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tried changing it to
$doc->saveXML();
but the non-English characters still didn't appear.
Any ideas?

ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
In any case, using "Save as" in an undisclosed piece of software is not a reliable way to determine the file encoding.
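
Feeding the DOM functions proper UTF-8 can be sketched like this, assuming the dom and mbstring extensions. The ISO-8859-1 input string is a made-up stand-in for data from an external source, and the element names are illustrative.

```php
<?php
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('items'));

// Stand-in for external data that is NOT UTF-8: "été" in ISO-8859-1.
$legacy = "\xE9t\xE9";

// Convert to UTF-8 before handing the text to any DOM method.
$text = mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-1');

$item = $root->appendChild($doc->createElement('item'));
$item->appendChild($doc->createTextNode($text));

$xmlString = $doc->saveXML();

// Check the bytes directly instead of relying on an editor's "Save as".
var_dump(mb_check_encoding($xmlString, 'UTF-8')); // bool(true)
```

Checking the output with mb_check_encoding() (or a hex dump) looks at the bytes PHP actually produced, rather than at whatever encoding an editor guesses.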

Regex not matching? Encoding the issue?

Weird problem...
I have this document, when I copy the text and place it inside my script (as a string variable), the regex matches successfully. However, when I use file_get_contents to get to the document (from the internet), it does not.
Does this have something to do with encoding? The document is ISO-8859-1, but it is converted to UTF-8 via utf8_encode().
Note that the string variable is created from this utf8 encoded output.
It's a simple regex too:
if (preg_match_all('/<h3 align=center><A NAME="([^"]*)"><\/A>(.*)<\/h3>(.*)::break::/isUu', $contents, $matches, PREG_SET_ORDER)) {
Any ideas what could be wrong?

This was not due to encoding, but due to the backtrack_limit being reached.
Overriding the setting with the following:
ini_set('pcre.backtrack_limit', '1000000');
(up from 100,000) fixes the issue. PHP 5.3.? also has this value so it's not just some really large number.
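
When a PCRE limit is hit, preg_match_all() returns false, which an if statement treats the same as "no matches". A minimal sketch of telling the two apart with preg_last_error(); matchAllOrFail() is a hypothetical helper name, and the sample pattern and input are simplified stand-ins for the ones in the question.

```php
<?php
// Raise the limit as described above (value taken from the answer).
ini_set('pcre.backtrack_limit', '1000000');

// Hypothetical helper: fail loudly on a PCRE error instead of silently
// behaving as if nothing matched.
function matchAllOrFail(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    if ($count === false) {
        // e.g. PREG_BACKTRACK_LIMIT_ERROR when the limit is exceeded
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches;
}

$html    = '<h3>First</h3>text::break::<h3>Second</h3>more::break::';
$matches = matchAllOrFail('/<h3>(.*)<\/h3>(.*)::break::/isU', $html);
echo count($matches), "\n"; // 2
```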

XML character encoding issues with accents

I have had this problem a few times now while working on projects, and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB. However, when I output them to the screen I get these characters:
"moved to dusseldorf.â��"
OR
tambiÃ©n
and if I have Russian characters, I get lots of ugly boxes in their place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities, but I would really appreciate a solution that works across the board on projects.
Thanks in advance.

Replace this line:
$data = htmlentities($data);
With this:
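$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() interprets the input as UTF-8 and converts each multi-byte character to a single correct entity, instead of treating every byte as a separate ISO-8859-1 character. For more information see the documentation for htmlentities().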

You need to change your connection's encoding to UTF-8 (it's usually ISO-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
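
The difference between the calls discussed in these answers can be seen on the "también" example from the question. A minimal sketch; the sample string is illustrative, and the charset is passed explicitly so the result does not depend on the PHP version's default.

```php
<?php
$tweet = "tambi\xC3\xA9n <b>"; // "también <b>" as UTF-8 bytes

// Bytes misread as ISO-8859-1: each byte of é becomes its own entity.
$wrong = htmlentities($tweet, ENT_QUOTES, 'ISO-8859-1');
echo $wrong, "\n"; // tambi&Atilde;&copy;n &lt;b&gt;

// Correct charset: é becomes one entity.
$entities = htmlentities($tweet, ENT_QUOTES, 'UTF-8');
echo $entities, "\n"; // tambi&eacute;n &lt;b&gt;

// htmlspecialchars() escapes only the markup characters and keeps é as is,
// which is enough when the page itself is served as UTF-8.
$escaped = htmlspecialchars($tweet, ENT_QUOTES, 'UTF-8');
echo bin2hex($escaped) === bin2hex("tambi\xC3\xA9n &lt;b&gt;") ? "same\n" : "differs\n"; // same
```

Escaping at output time, rather than before storing, keeps the database contents as plain UTF-8 text.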

Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding(), and that you call htmlentities() with the encoding information, as EdoDodo said. Also make sure that your database stores data with UTF-8 encoding, though you say that's already the case.

You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &gt;, &lt;, &amp;, &apos; and &quot;. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.
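
As an alternative to the user-contributed function named above, htmlspecialchars() accepts the ENT_XML1 flag (PHP 5.4 and later) and then emits only the entities XML predefines. A minimal sketch; xmlEscape() is a hypothetical wrapper name, not a built-in.

```php
<?php
// Hypothetical wrapper: escape text for use inside an XML document.
function xmlEscape(string $text): string
{
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$escaped = xmlEscape("Caf\xC3\xA9 & <bar>");
echo $escaped, "\n"; // Café &amp; &lt;bar&gt;

// The result is safe to embed in XML; the accented letter stays as UTF-8.
$xml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><t>' . $escaped . '</t>');
echo bin2hex((string) $xml) === bin2hex("Caf\xC3\xA9 & <bar>") ? "round-trip ok\n" : "mismatch\n";
```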
