Character Encoding/decoding becomes a mess

Character Encoding/decoding becomes a mess - php

In a webapp I place a <div id="xxx" contentEditable=true > for editing purpose. The encodeURIComponent(xxx.innerHTML) will be send via Ajax POST type to a server, where a PHP script creates a simple txt file from it which in turn can be downloaded from the user to store it locally or print it on screen. It works perfect so far, but … Yes, but, character encoding is a mess. All special characters like the german Ä are interpretated wrong. In this case as ÃÂ¤
I google for some days and I study PHP methods like iconv() and I know how to set up a browsers character encoding and also set a text editor for a correct correspondending decoding. But nothing helps, its still a messs, or becoming even weired.
So my question is : Where in this encoding/decoding roundtrip from the browser to a server and back to the browser I have to do what, to ensure that an Ä will still be an Ä ?

I answer my question, because it turns out to be another problem as stated above. The contenteditable is actually part of a section of html code. On the serverside with PHP I need to filter out the contenteditable text which I do via a DOMDocument like this:
$doc = new DOMDocument();
$doc->loadHTML($_POST["data"]);
then I access the elements and their textual content as usual.
Finally I save the text with
file_put_contents($txtFile, $plainText, LOCK_EX);
The saved text then was a mess as written above. Now it turns out that you need to tell the DOMDocument the character set wich loadHTML() has to interpretate. In this case UTF-8.
First I did it as recommended in PHP this way :
$doc = new DOMDocument('1.0', 'UTF-8');
But that doesn't help (I wonder). Then I found this answer in SO. And the final solution is this :
$doc->loadHTML('<?xml encoding="UTF-8">' . $_POST["data"]);
Though it works it is a trick. Finally the question is left over, how to do it the right way ? If somebedoy has the definite answer, he is very welcome.

You need to make sure that the content is encoded consistently throughout its roundtrip from user input to server-side storage and back to the browser again.
I would recommend using UTF-8. Check that your HTML document (which includes the contenteditable zone) is UTF-8 encoded, and that the XMLHttpRequest/Ajax request does not specify a different encoding when it sends the content to the server.
Check that your server-side application encodes the text file as UTF-8 also. And check that the HTTP response headers declare the file's encoding as UTF-8 when the file is requested and downloaded in the browser.
Somewhere along this path, the encoding differs, and that is what is causing the error. iconv converts between different encodings, which should not be necessary if everything is consistent.
Good luck!

Related

Force DOMXpath - php - to return utf-8 results

First off, I know this problem was signaled before, but the solutions do not apply to my case
Here is the url
http://www.astagiudiziaria.com/beni/porzione_di_rustico_e_terreni_agricoli/index.html
The page says its charset is ISO-8859-1, but it cannot be since it has the EURO sign on it. Chrome browser identifies it as windows-1252
I used
$file = str_replace('charset=iso-8859-1', 'charset=utf-8', $file);
$file = iconv('windows-1252', 'UTF-8', $file);
and save it and my text editor says it is UTF-8 encoded
Then I use
$doc2->loadHTML($file);
$doc2->saveHTMLFile('ggg.html');
and also my text editor says it is UTF-8 encoded
But http://i-tools.org/charset says this file, ggg.html is actually ASCII !
Nonetheless, inside it things look as expected, even though they are using html encodings , like Pré or proprietà
The xpath queries return garbage data, like
instead of Pré is PrÃ©
instead of € is â‚¬Â
I have tried the solutions suggested around here without any success
I think it's about how php is dealing with libxml, since in ruby it works flawlessly - also using libxml through curb gem - problem being that my client wants a php script

I took a quick glance, and the way I see it the site outputs mixed encoding.
It is iso-8859-1 with a mixed-in windows-1252 € sign (I think).
Thats why the browser gets confused (but somehow handles it).
No idea how you would proceed here, apart from asking them to fix their site or alternativly do some bit-fiddling.
the Pré is PrÃ© breaks because you attemt to windows-1252->utf8 transcode what actually is iso-8859-1 stuff (I suppose).

XML file isn't UTF-8 encoded when created in PHP

I'm trying to output XML file using PHP, and everything is right except that the file that is created isn't UTF-8 encoded, it's ANSI. (I see that when I open the file an do the Save as...).
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I figured out that non-english characters don't appear on the output.
I was searching for solution and I tryed first adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the php script but it say's:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tryed some other suggestions like not to include 'UTF-8' when creating the document but to write it separately:
$doc->encoding = 'UTF-8'; , but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tryed to change it to
$doc->saveXML();
but the non-english characters didn't appear.
Any ideas?

ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
Whatever, using "Save as" in an undisclosed piece of software is not a reliable way to determine the file encoding.

Converting from DOMNodeList to string in PHP extra characters

I have converted results from a web scrape from DOMNodeLists to strings:
$node = $the_sentence->item(0);
$the_sentence = "{$node->nodeName} - {$node->nodeValue}";
However now when I print out the result it includes whatever tag the text had in the page as well as the &nbsp character:
Before:
"This is the sentence"
Now:
"h2 - This is the Âsentence Â"
Any ideas how I can get rid of these characters? Thanks for any help.

This looks like a character set problem.
Have a look at the source page and see what character set it is encoded in. This might be in a Content-Type HTTP header, or it might be in a <meta> tag at the start of the document. Then, when you handle the data, make sure that everything you do handles it in the same format.
You probably want to store the data in UTF-8. Thus, if you capture in another format, in general it is a good idea to convert it from that charset to UTF-8; this will mean you can capture from a wide range of sources and store it in the same database. Look at iconv in the PHP manual if you wish to learn more about charset conversion.
Are you printing the output to console or a browser? If the former, note that some consoles (old versions of Windows in particular) do not handle UTF-8 well at all. If you are echoing to a browser, make sure your character set is set to "UTF-8" in your own HTML.

XML charactor encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.â��"
OR
tambiÃ©n
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is the correct native accents to show under one encoding. I thought was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.

Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, null, "UTF-8");
That way, htmlentities() will leave valid UTF-8 characters alone. For more information see the documentation for htmlentities().

You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data a in HTML context.

Make sure that you set your php internal encoding ot UTF8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that you're database stores with UTF8-encoding, though you say that's already the case.

You can't use htmlentities() in it's default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entitity codes that are available by default to XML are >, < and &. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
When trying to process an XML response using simplexml_load_string from a 3rd party source. The raw XML response does declare the content type:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The langauge of the XML content is Spanish and contain words like Dublín in the XML.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work it around, which you should only use if you cannot load the XML normally. One of them would be to use utf8_encode(). The downside is that if that XML contains both valid UTF-8 and some ISO-8859-1 then the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring, and hope they'll fix it for you. (they won't, but you can at least ignore the invalid characters so you can load your XML)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fix their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
return utf8_encode($m[0]);
}

I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);

If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);

We recently ran into a similar issue and was unable to find anything obvious as the cause. There turned out to be a control character in our string but when we outputted that string to the browser that character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);

Instead of using javascript, you can simply put this line of code after your mysql_connect sentence:
mysql_set_charset('utf8',$connection);
Cheers.

Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().

If you download XML file and open it for example in Notepad++ you'll see that encoding is set to something else than UTF8 - I'v had the same problem with xml made myself, and it was just te encoding in the editor :)
String <?xml version="1.0" encoding="UTF-8"?> don't set up the encoding of the document, it's only info for validator or another resource.

I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.

After several tries i found htmlentities function works.
$value = htmlentities($value)

What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is some peace of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you try to create an XML out of it, be sure to pass it the final product a browser would see, meaning, having your field wrapped with CDATA

When generating mapping files using doctrine I ran into same issue. I fixed it by removing all comments that some fields had in the database.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Character Encoding/decoding becomes a mess - php

Related

Force DOMXpath - php - to return utf-8 results

XML file isn't UTF-8 encoded when created in PHP

Converting from DOMNodeList to string in PHP extra characters

XML charactor encoding issues with accents

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

Categories

Resources