simplexml_load_string() doesn't like foreign languages - php

I am receiving an XML response which comes through perfectly.
Words such as "português" and "españa" are correctly formatted.
However, once I have parsed the XML with the PHP function simplexml_load_string(), the words are transformed as follows: "portugu�s" and "espa�a".

SimpleXML always treats text internally as UTF-8, converting to and from this character set where necessary. To solve your issue, either make sure that all output from your app is UTF-8 encoded, or convert the text to another character set before output (possibly using utf8_decode()).
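A minimal sketch of that behaviour, assuming the incoming XML declares ISO-8859-1 and that the mbstring extension is available. The sample document and variable names are made up for illustration, and mb_convert_encoding() is used in place of utf8_decode(), which is deprecated as of PHP 8.2:

```php
<?php
// Hypothetical ISO-8859-1 input: "\xEA" is the single Latin-1 byte for "ê".
$latin1Xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           . "<lang>portugu\xEAs</lang>";

$xml  = simplexml_load_string($latin1Xml);
$utf8 = (string) $xml;   // SimpleXML hands text back as UTF-8

// Only needed if the page or terminal really expects ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // "ê" is now two bytes (c3 aa)
echo bin2hex($latin1), "\n";  // "ê" is back to one byte (ea)
```

If the page itself is served as UTF-8, the conversion step can be dropped and $utf8 echoed directly.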

Related

Why is php converting certain characters to '?'

Everything in my code is running. My database (PostgreSQL) uses UTF-8 encoding, and I've checked that the encoding in php.ini is UTF-8 as well. I tried debugging to see whether any of the functions I use were causing this, but everything runs as expected. However, after my frontend sends a POST request through curl to the backend server, for some text to be inserted in the database, some characters like 'da' are converted to '?' in PostgreSQL and in memcached. I think PHP is converting them to Latin-1 again after the request reaches the other side for some reason, because I use utf8_encode before the request and utf8_decode on the other side.
This is the code that sends the request:
$pre_opp->Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
This is how the backend system receives it:
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));
Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf8_encode and utf8_decode anyway, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data = utf8_decode(urldecode($_POST["Data"]));
You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what utf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.
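As a rough illustration of leaving the text alone and letting the HTTP layer do the escaping, here is a sketch that assumes the text is already UTF-8 and is sent as a form-encoded POST body. The sample string and variable names are hypothetical, and parse_str() only stands in for the decoding PHP performs when it fills $_POST:

```php
<?php
// Hypothetical bio text, already UTF-8, containing spaces, accents and a literal "%".
$bio = "Olá, 100% português";

// Sender: build the POST body; http_build_query() percent-encodes each value.
$body = http_build_query(['Data' => $bio]);

// Receiver: PHP fills $_POST by doing this decoding itself;
// parse_str() stands in for that step here.
parse_str($body, $received);

var_dump($received['Data'] === $bio);
```

No utf8_encode, utf8_decode or manual space replacement is involved, so the bytes that arrive are the bytes that were sent.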

decoding ISO characters

I have Chinese characters encoded in ISO-8859-1, for example &#20860; = 兼
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the Handlebars template engine to set the data on the page.
When I look at the AJAX page the characters are displayed correctly, although the source is still encoded.
But the final result displays the encoded characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1. I can change them to UTF-8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
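The difference between the two decoding functions can be sketched as follows, assuming the stored text contains the decimal numeric reference &#20860; (the reference for 兼, U+517C). The sample string is made up for illustration:

```php
<?php
// "&#20860;" is the decimal numeric character reference for 兼 (U+517C).
$stored = 'Kanji: &#20860; &amp; more';

// Only decodes the handful of HTML-special characters such as &amp; and &lt;.
$partial = htmlspecialchars_decode($stored);

// Decodes every entity, producing UTF-8 bytes for the Chinese character.
$full = html_entity_decode($stored, ENT_COMPAT, 'UTF-8');

echo $partial, "\n";
echo $full, "\n";
```

The second result only displays correctly if the page is declared as UTF-8.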
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}

PHP 5, XSL and The Character Ú

I'm having difficulty getting the letter
Ú
to render through PHP 5.3 and XSL. It's part of a string in a database that is loaded into an XML node within a tags. However, it causes the XSL/XML transformation to fail to render. Removing the character from the string fixes the problem instantly.
Any ideas?
What character encoding are you using? From the sounds of it you have some sort of character encoding mismatch.
If your XSL is using ISO-8859-1 (or the ASCII equivalent) and you are trying to output to a page that is UTF-8 encoded, then the character output will be off. The same applies the other way round.
I don't actually know the right answer, but I have a workaround like the one below:
"&".htmlentities("Ú");
Your XSL transformation engine probably interprets your document as non-well-formed XML because of encoding issues. If that text containing Ú is stored using some 8-bit encoding (like ISO-8859 variants), then this character will not produce a valid UTF-8 octet if it is used as such without any character conversion. Invalid characters in an XML document will mean it is not well formed XML and processing it as XML is forbidden.
There are many points where that encoding error might happen:
1. it could be stored in the database incorrectly
2. it could be read from the database incorrectly
3. you might produce your XML by concatenating strings that use different encodings
4. you might manipulate the text with a tool or method that can't handle your encoding or is not aware of it
5. your XSLT engine might not be aware of the correct encoding of the input stream, resulting in a rejected file even though it has no encoding error
My random guesses for the probable causes of that are points 3 and 5.
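One way to rule out point 3 is to normalise every string to UTF-8 before it goes into the document, and to let DOMDocument build the XML instead of concatenating strings. This is only a sketch: the element name and sample value are hypothetical, and it assumes that anything which is not valid UTF-8 is ISO-8859-1:

```php
<?php
// Hypothetical value as it might come from a Latin-1 database column:
// "\xDA" is the single ISO-8859-1 byte for "Ú".
$fromDb = "\xDAltimo";

// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
$text = mb_check_encoding($fromDb, 'UTF-8')
    ? $fromDb
    : mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');

$doc  = new DOMDocument('1.0', 'UTF-8');
$node = $doc->createElement('title');
$node->appendChild($doc->createTextNode($text)); // text node escapes & and <
$doc->appendChild($node);

echo $doc->saveXML();
```

The resulting string can then be handed to the XSLT processor as well-formed UTF-8 XML.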

base64 decode French characters

We are getting base64-encoded (XML) data from a third party. If the XML data is in English, everything works fine: I am able to do the base64 decode and parse the XML. If the XML is all lower-case French characters, everything works fine too. But if the XML data contains upper-case French characters (like &Agrave), and I do the base64 decode and try to parse it, the parser fails. Any suggestions on how to fix this problem?
Thanks.
Base64 is a method to encode 8-bit binary data using 7-bit US-ASCII characters. After the Base64 decode you should have a standard XML file.
Probably this XML file contains illegal characters, or does not correctly specify the character encoding it uses.
You mention &Agrave;, an HTML-specific (not XML) representation of À. If the XML contains the HTML-encoded string &Agrave;, there should also be a reference in the XML to an entity table specifying how to decode that string.
Alternatively, if your XML contains the À character directly, encoded using (for example) the ISO-8859-1 character set, either your XML should specify this encoding (<?xml version="1.0" encoding="ISO-8859-1"?>), or you should specify it yourself when decoding it.
Failing that, the parser may assume (e.g) UTF-8 encoding is used, and will fail when trying to decode the À.
The exact error message should tell you what the problem is.
[update: À directly]:
Sounds like the XML is invalid then; that they say UTF-8 but are actually using a different encoding. Check the XML bytes (after the base 64 decode) for this; if the À is encoded as one byte, it is definitely not UTF-8.
[update: how to fix?] If they incorrectly specify it in the XML header, they should really replace the false header (<?xml version="1.0" encoding="UTF-8"?>) with the correct one (<?xml version="1.0" encoding="windows-1252"?>).
If they don't specify anything, it looks like the iconv function may be your best bet. I haven't really needed it, so I'm not 100 % sure about this, but looks like you could use: $data = iconv("ISO-8859-1", "UTF-8", $data) after the base64_decode and before the simplexml_load_string. I don't know of a way to specify the encoding directly while decoding the XML.
I'm not really experienced with the PHP specifics of character encoding, so I'm not giving any guarantees...
What's the XML character encoding? Maybe it's not UTF-8 and your parser is trying to parse the XML string as UTF-8.
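The iconv suggestion above could look roughly like this. It is a sketch under assumptions: the payload is made up, the XML carries no encoding declaration, and any data that is not valid UTF-8 is taken to be ISO-8859-1:

```php
<?php
// Hypothetical payload: XML with no encoding declaration whose bytes are
// ISO-8859-1 ("\xC0" is "À"), base64-encoded the way a third party might send it.
$payload = base64_encode("<?xml version=\"1.0\"?><ville>\xC0 Paris</ville>");

$raw = base64_decode($payload, true);   // strict mode: false on invalid input

// Assumption: data that is not valid UTF-8 is ISO-8859-1.
if (!mb_check_encoding($raw, 'UTF-8')) {
    $raw = iconv('ISO-8859-1', 'UTF-8', $raw);
}

$xml = simplexml_load_string($raw);
echo (string) $xml, "\n";
```

If the XML does carry an encoding declaration, it has to agree with the bytes after conversion, otherwise the parser will still misread them.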

Parse XML with special characters (UTF-8)

I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
But after I've parsed it with simplexml_load_string the special character (the ì) becomes: Ã¬ which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine: when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, it's mangled.
It looks like SimpleXML is creating a UTF-8 string, which is then rendered as ISO-8859-1 (Latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
Your database column is not using a UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, the characters of the "Latin-1 Supplement" block are encoded as two bytes with a lead byte of 0xC2 or 0xC3. 0xC2 covers the first half of the block (currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space), and 0xC3 covers the second half, which holds the accented letters.
However, in ISO-8859-1 the bytes 0xC2 and 0xC3 represent Â and Ã. So when your UTF-8 string is misinterpreted as ISO-8859-1 or CP-1252, you get Â or Ã followed by some other nonsense character.
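The effect can be reproduced in a few lines. This is only an illustration, assuming the mbstring extension is available. It takes the UTF-8 bytes of ì, deliberately treats them as ISO-8859-1, and re-encodes the misreading so that it becomes visible:

```php
<?php
$utf8 = "\xC3\xAC";            // "ì" (U+00EC) encoded as UTF-8: two bytes

// Misreading those two bytes as ISO-8859-1 treats each byte as its own
// character; re-encoding that misreading as UTF-8 makes the damage visible.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";
echo $mojibake, "\n";
```

On a UTF-8 terminal the second line shows the two-character sequence Ã¬ where a single ì was intended.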
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in MySQL, you need to have UTF-8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.
I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using utf8_encode or utf8_decode.
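XML is strict when it comes to entities: & must be written as &amp;, and named HTML entities such as &igrave; are not predefined in XML, so ì has to appear either literally or as a numeric reference such as &#236;.
So you will need a translation table.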
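function xml_entity_decode($_string) {
    // Map numeric character references (e.g. &#236;) to their literal characters
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
    // each() was removed in PHP 8, so iterate over the keys with foreach
    foreach (array_keys($_xl8) as $_key) {
        // mb_ord() returns the full code point; ord() only sees the first byte
        $_xml['&#' . mb_ord($_key, 'UTF-8') . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}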
Late to the party, but I've faced this and solved it as below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());
