Parse XML with special characters (UTF-8) - php

I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
</xml>
But after I've parsed it with simplexml_load_string the special character (the i) becomes: ì which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine, when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, its mangled.

This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.

It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in a mySQL, you need to have UTF8 selected as the encoding all the way through: As the connection's encoding, in the table, and in the column(s) you insert the data into.

I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using uft8_encode or utf8_decode.

XML is strict when it comes to entities, like & should be &amp; and ì should &igrave;
So you will need a translation table.
function xml_entity_decode($_string) {
// Set up XML translation table
$_xml=array();
$_xl8=get_html_translation_table(HTML_ENTITIES,ENT_COMPAT);
while (list($_key,)=each($_xl8))
$_xml['&#'.ord($_key).';']=$_key;
return strtr($_string,$_xml);
}

Late to the party... But I've faced this and solved like below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());

Related

Problems with php xml characters

Hello friends i have a problem with some characters reading a xml file from php i am using this source code:
$file = 'test.xml';
$xml_1 = simplexml_load_file($file);
echo ($xml_1->content);
its work ok but when the content is a special character like ñ ó it show a rarer character like this ñ i tried to include in html head utf8 charset but its the same
SimpleXML emits UTF-8 output by design. If you application does not support UTF-8 you'll have to convert with the usual tools (e.g. mb_convert_encoding()) but you need to take this into account:
You need to know for sure the encoding your app is using.
UTF-8 can hold the complete Unicode catalogue thus some characters may not have an equivalent in your target encoding.
Whatever, in 2016 there's no reason to use anything else than UTF-8 unless your maintaining legacy code.
Finally i find the solution i must to use utf8_decode php function to convert the characters it is not enought with put utf8 charset in the head page you must to convert using php before

Converting odd character encoding back to utf-8

I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
Mehrflächiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and ä should be, as I understand, an Ü and à when in proper UTF-8.
Is there a standard function to revert these multiple characters back to there proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case was just wondering how well this issue is know and if there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.
The data shown as example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is to read the data as UTF-8 and display it that way, instead of converting it.
If the database and output is utf-8 it could be because your not using utf-8 as the client character set.
If your using mysqli you can use set_charset or run SET NAMES utf8 as a query before fetching data.

simplexml_load_string strange characters

I am having difficulty with non-standard characters using simplexml_load_string.
I have loaded an newspaper xml feed using file_get_contents. If I print to screen the contents I get a title for one of the articles as :
<title>‘If Legault were running in Alberta, he’d be more popular’: How right-wing is the CAQ?</title>
If I then do this:
$feed = #simplexml_load_string($xml);
And print the results of $feed, the title has changed to:
[title] => �If Legault were running in Alberta, he�d be more popular�: How right-wing is the CAQ?
Any advice on how to stop these characters being displayed like this?
This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.
This is a charset issue. it needs to be utf8, you can run utf8_decode on the content, but its better to fix this issue by matching charsets from your input (feed) to your output (html page i presume).

encoding from any language to utf in php

I insert from csv characters from different languages..
I apply this to every set of characters:
private function process_elements($element){
utf8_encode($element);
return $element;
}
The problem is when they go into the database, they go like this:
???????? ?? ???????????? ????? ??????? ??? ???????...
When I retrieve them from the databse, I also get this.
This happens with greek. However, when I retrieve greek pages (through scrapping), who are on a utf encoded page. The characters look like this:
Δες webcam δωμάτια | Gr.ImLive.com
which is okay, because when i use the utf8_encode function, they look normal on the screen..
But when the data is taken from the csv and be put into the database, i get those question marks..
Is there a way to encode form any language to utf.. why retrieving data from csv and a utf8 encoded webpage makes such a difference.. they look the same.. how do I address that problem?
please take a look at this
it will help you
Handling Unicode Front To Back In A Web App
It's not about "languages", it's about encodings. Text is encoded as bits and bytes. Any one byte is equal to any other byte. If you only have a blob of bytes, you cannot know what encoding it represents. You can guess, but that's not accurate. You have to know what encoding some text is in by reading the accompanying meta data. That may be documentation, a <meta> tag or an HTTP header. Then you need to treat the text in that encoding.
utf8_encode actually converts text from ISO-8859-1 to UTF-8. It does not simply encode anything to UTF-8, because it does not have the means to determine what something is encoded in either. If your text is already UTF-8 encoded or was not ISO-8859-1 encoded to begin with, you're just garbling the text (as you are).

PHP 5, XSL and The Character Ú

Im having dificulty getting the letter
Ú
to render through PHP 5.3 and XSL. Its part of a string in a database and that is loaded into an XML node within a tags. However it causes the XSL/XML transformation to not render. Removing the character from the string fixes the problem instantly.
Any ideas?
What character encoding are you using? From the sounds of it you have some sort of character encoding mismatch.
If your XSL is using ISO-8559-1 (or ASCII equivalent) and you are trying to output to a page that is UTF-8 encoded then the character output will be off. It also works vice-versa.
Actually I don't know right answer but I have a solution like below :
"&".htmlentities("Ú");
Your XSL transformation engine probably interprets your document as non-well-formed XML because of encoding issues. If that text containing Ú is stored using some 8-bit encoding (like ISO-8859 variants), then this character will not produce a valid UTF-8 octet if it is used as such without any character conversion. Invalid characters in an XML document will mean it is not well formed XML and processing it as XML is forbidden.
There are many points where that encoding error might happen:
it could be stored in the database incorrectly
it could be read from the database incorrectly
you might produce your XML by concatenating strings that use different encodings
you might manipulate the text with a tool or method that can't handle your encoding or is not aware of it
your XSLT engine might not be aware of the correct encoding of the input stream resulting a rejected file even though it has no encoding error
My random guesses for the probable causes of that are points 3 and 5.

Categories