PHP export of accent characters to XML fails [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 3 years ago.
I am working with exporting accented characters from a mySQL database to XML, but I am getting really wonky results.
For the basics - the mySQL table is set up as latin-1 encoding. Not ideal. However, all input is run through HTML entities, which seems to be working great; I can read data back all day long, and it looks correct on the screen.
Here is a sample item.
On the screen, it looks like this:
me hace reír
Note the accented "i" character (with acute accent).
In the database, it is stored like this:
me hace re&iacute;r
The "i" with the acute is properly replaced with the HTML entity, which allows for proper display on screen. If I wrap that inside of a textarea, it still reads correctly - no acute HTML entity, just the correct accented "i" character.
My XML file has a proper UTF-8 header on it:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
But when I read the data from the DB and export it to the XML...
$xml.="<dedicatedBecause>".($dedicatedbecause)."</dedicatedBecause>"."\n";
With "$dedicatedbecause" holding a totally unprocessed piece of data from the DB, I get the following in my XML file:
me hace reÃ-r
In other words, a DIFFERENT accent character plus a dash. In other cases, I get other nonsense characters (copyright symbol, various other accents, etc, etc).
I have a huge function for massaging data to UTF-8, but it doesn't seem to matter. If I turn it off, I get the same result.
What gives? What am I missing here?
Thanks for your help!

&iacute; is a named (X)HTML entity. Named entities like this are not known/valid in basic, well-formed XML. Converting it to UTF-8 is the right way. But it looks like at some point you treat the UTF-8 string with the decoded entity as Latin-1. The Ã is a typical symptom.
Here is a demo provoking the behavior:
$data = 'me hace re&iacute;r';
$decoded = html_entity_decode($data, ENT_COMPAT, "UTF-8");
$treatedAsLatin1 = utf8_encode($decoded);
var_dump(
$decoded, $treatedAsLatin1
);
Output:
string(13) "me hace reír"
string(15) "me hace reÃr"
utf8_encode() is an old PHP function that converts a Latin-1 string to UTF-8. However this can happen in the browser as well (depending on your HTTP headers).
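A minimal sketch of the "convert it to UTF-8" route for the export, assuming the database value is ASCII text with named HTML entities as described in the question. The sample string is made up; only $dedicatedbecause and the element name come from the question. The entity is decoded to UTF-8 once, and the result is then escaped for XML only:

```php
<?php
// Hypothetical value as it might come from the database: ASCII text
// containing a named HTML entity (the sample string is an assumption).
$dedicatedbecause = 'me hace re&iacute;r';

// Decode every HTML entity into real UTF-8 characters.
$utf8 = html_entity_decode($dedicatedbecause, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Escape only the characters that are special in XML (&, <, >, quotes).
$escaped = htmlspecialchars($utf8, ENT_QUOTES | ENT_XML1, 'UTF-8');

$xml  = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n";
$xml .= '<dedicatedBecause>' . $escaped . '</dedicatedBecause>' . "\n";

echo $xml;
```

No utf8_encode() call is involved here; running one on the decoded string would reproduce the Ã symptom.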

Related

is it utf-8 problems? php mysql

I am stuck on this. Previously I was using PHP 5, and now I have moved to PHP 7.
The problem is that when I echo a value from the database it returns weird special characters, whereas on my previous site it returned them normally. Is it a UTF-8 problem? I tried the UTF-8 meta tag and changed the SQL collation to utf8_unicode_ci, but somehow it doesn't help at all...
It returns this:
â€Šâ€”â€Š
What I want it to return:
 —

What you get from the database is a UTF8 encoded string.
The characters you see are a UTF8 string interpreted with the Western (Windows Latin 1) encoding.
If you include that string in a web page whose character set is Latin 1 then you'll see the string you posted; if the character set is UTF-8 then you should see the correct characters (without need to convert them into HTML entities).
As the latter is not your case, you can proceed as follows.
Suppose the characters you see are stored in the variable $string: you can get HTML entities with mb_convert_encoding:
$html = mb_convert_encoding( $string, 'HTML-ENTITIES', 'UTF-8' );
This will result in:
&#8202;&mdash;&#8202;
As after conversion you get characters in the ASCII range then the resulting string is suitable for any destination character encoding.
Note that, according to the above, even the dash — is converted (into &mdash;)
This is just a quick solution to the problem you faced.
I think the comment from Machavity:
"Take a minute and read stackoverflow.com/questions/279170/utf-8-all-the-way-through"
is good advice.
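As a side note, the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. A rough equivalent, sketched here with mb_encode_numericentity, turns every non-ASCII character into a numeric character reference. The sample input (hair space, em dash, hair space) and the conversion map are assumptions for illustration, and unlike the call above this produces numeric references only, never named ones:

```php
<?php
// Assumed input: hair space, em dash, hair space, as a UTF-8 string.
$string = "\u{200A}\u{2014}\u{200A}";

// Map entry: [first code point, last code point, offset, mask].
// This map covers every code point outside ASCII.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Replace each matching character with a decimal reference such as &#8212;.
$html = mb_encode_numericentity($string, $map, 'UTF-8');

echo $html, "\n";
```

The result is pure ASCII, so it survives any destination character encoding, at the cost of storing markup instead of text.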

php - utf8 encoding the text twice. Will it have any negative effect? [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 9 years ago.
MySQL Database returns utf8 encoded text. Basically, I used PDO attribute MYSQL_ATTR_INIT_COMMAND and passed:
SET CHARACTER SET utf8
It returns utf8 encoded text. But some text in the database is plain utf8; something like &auml; is returned as is.
So I need to call utf8_encode again in PHP to get the actual utf8 char. It's working fine.
I would like to know, if it will have any negative effect encoding the text twice or it does not affect anything other than encoding the non-encoded text like above?
Thanks!
Edit:
I am using the following code to get the right characters:
$val = utf8_encode(addslashes(html_entity_decode(strip_tags($val))));
So what it does is convert the following text from:
<font color=\"#222222\" face=\"arial, sans-serif\" size=\"2\"> Test Event </font><span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; font-size: 13px;\">Pers&ouml;nlichkeit Universit&auml;t&quot;</span>
(This text is coming from the database, after calling the SET CHARACTER SET utf8)
to:
Test Event Persönlichkeit Universität\"

&auml; is an HTML entity that probably shouldn't have made it into your database in the first place. It has nothing to do with UTF-8.
If you call utf8_encode on "&auml;" nothing will happen, as the encoding of those ASCII characters is the same in ISO-8859-1 and UTF-8. You will see the character it represents in the browser because it is interpreted as HTML.
You should never, as a normal web app developer, call utf8_encode. You don't actually need ISO-8859-1 to UTF-8 conversion, firstly because browsers and MySQL do not support it. They alias Latin1 and ISO-8859-1 to Windows-1252 for compatibility. Secondly, you can cause browsers and database to send their data in UTF-8 so it is already UTF-8 and no conversion is necessary.
You shouldn't convert to html entities either - it is unnecessary because UTF-8 can represent all characters.
The data in the database should not be concerned with HTML; it should be the canonical, authoritative, as-is representation of the data. Right now it is unclear whether the data is literally meant to be &auml; or ä, which causes problems like this:
Image from TheDailyWTF
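To make the effect of a second utf8_encode call concrete, here is a small sketch. It uses mb_convert_encoding from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode (which is deprecated as of PHP 8.2); the helper name latin1_to_utf8 and the sample strings are assumptions for illustration:

```php
<?php
// Stand-in for utf8_encode(): treat the input as ISO-8859-1 and convert it to UTF-8.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$entity = 'Universit&auml;t';    // pure ASCII, entity still encoded
$utf8   = "Universit\u{00E4}t";  // already UTF-8

$entityAfter = latin1_to_utf8($entity); // ASCII bytes are left unchanged
$doubled     = latin1_to_utf8($utf8);   // the two bytes of ä become two characters

var_dump($entityAfter === $entity);
var_dump(bin2hex($utf8), bin2hex($doubled));
```

So the extra call is harmless only for text that is pure ASCII; text that is already UTF-8 gets mangled by it.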

Converting odd character encoding back to utf-8

I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
MehrflÃ¤chiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and Ã¤ should be, as I understand, an Ü and Ã when in proper UTF-8.
Is there a standard function to revert these multiple characters back to their proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case I was just wondering how well this issue is known and whether there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.

The data shown as example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is to read the data as UTF-8 and display it that way, instead of converting it.
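If it turns out that the stored bytes themselves are double-encoded (an assumption; the question does not confirm it), a one-off repair could reverse the wrong step, sketched below. The sample string is written with escapes so the script does not depend on the file's own encoding, and the mb_check_encoding guard is an added safety check, not part of any established fix:

```php
<?php
// "Übungshaut" after its UTF-8 bytes were read as Windows-1252
// and then saved again as UTF-8 (the mojibake from the question).
$mojibake = "Design Tattoo \u{00C3}\u{0153}bungshaut";

// Reverse the wrong step: map each character back to its Windows-1252 byte.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

// Keep the original if the result is not valid UTF-8,
// for example because the string was never double-encoded.
if (!mb_check_encoding($repaired, 'UTF-8')) {
    $repaired = $mojibake;
}

echo $repaired, "\n";
```

This should only be run once over affected data; applying it to clean UTF-8 text would damage it.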

If the database and output are UTF-8, it could be because you're not using UTF-8 as the client character set.
If you're using mysqli you can use set_charset or run SET NAMES utf8 as a query before fetching data.

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database any text with an accent is encoded like in HTML (i.e. &eacute;). When I compare the database text to my string it doesn't match because my string just shows é. When I use the PHP function htmlentities to encode the string first, the é turns into &Atilde;&copy;. Weird? Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare &eacute; to é as well as all the other accented characters?

You need to pass the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1 (in PHP versions before 5.4; later versions default to UTF-8). Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));
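A short sketch of the difference the charset argument makes, plus the reverse comparison (decoding the stored value rather than encoding the input). The sample word and variable names are assumptions for illustration:

```php
<?php
$text   = "caf\u{00E9}";  // UTF-8 input string (assumed sample)
$stored = 'caf&eacute;';  // value as kept in the database

// Correct: tell htmlentities the input is UTF-8.
$right = htmlentities($text, ENT_COMPAT, 'UTF-8');

// Wrong charset: each byte of é is treated as a separate Latin-1 character.
$wrong = htmlentities($text, ENT_COMPAT, 'ISO-8859-1');

// Alternative: decode the stored value and compare plain UTF-8 strings.
$matches = html_entity_decode($stored, ENT_QUOTES, 'UTF-8') === $text;

var_dump($right, $wrong, $matches);
```

Comparing decoded text avoids depending on which characters htmlentities chooses to encode.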

I'm working on french site, and I also had same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does: utf8_decode converts your string from UTF-8 to ISO-8859-1, then htmlentities converts everything to HTML entities, even tags. Since we want the tags left intact, htmlspecialchars_decode converts them back. In the end you get a string with converted accents and untouched tags.
You can pass your email content through this function before sending it to the recipient.
Another issue you might face is that, sometimes with this function the content from database converts to ? . In this case you should do this before running your query:
mysql_query("SET NAMES `utf8`");
Whether you need to do this depends on the encoding of your table. I hope it helps.

The comparison depends on the charset and the collation you selected when you created the database or the tables. If you are saving strings with a lot of accents, as in Spanish, I suggest you use the utf8 charset and whichever collation is most accurate for the language (English, French or whatever) you're using.
The best thing of using the correct charset in the database is that you can save the string in natural way e.g: my name I can store it as is "Mario Juárez" and I have no need of doing some weird conversions.

Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!

Parse XML with special characters (UTF-8)

I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
But after I've parsed it with simplexml_load_string the special character (the i) becomes: Ã¬ which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine, when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, its mangled.

It looks like SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
Your database column is not using a UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, the prefix bytes 0xC2 and 0xC3 are used to access the "Latin-1 Supplement" block (U+0080 to U+00FF). 0xC2 leads into the lower half, which includes currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space; 0xC3 leads into the upper half, which holds the accented letters.
However in ISO-8859-1, the byte 0xC2 represents an Â and the byte 0xC3 an Ã. So when your UTF-8 string is misinterpreted as ISO-8859-1 or CP-1252, you get Â or Ã followed by some other nonsense character.
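The byte-level picture can be checked with a small sketch. It assumes the simplexml and mbstring extensions are available, reuses the sample XML from the question (minus the stray closing tag), and uses mb_convert_encoding rather than iconv only to avoid depending on the platform's iconv build:

```php
<?php
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<alldata><data name=\"Forset\u{00EC}\" /></alldata>";

$doc  = simplexml_load_string($xml);
$name = (string) $doc->data['name'];

// SimpleXML hands back UTF-8: the ì is the two bytes C3 AC.
echo bin2hex(substr($name, -2)), "\n";

// For a page or column that really is ISO-8859-1, convert explicitly.
$latin1 = mb_convert_encoding($name, 'ISO-8859-1', 'UTF-8');

// In ISO-8859-1 the same character is the single byte EC.
echo bin2hex(substr($latin1, -1)), "\n";
```

Seeing C3 AC here means the parser did its job and the mangling happens later, at the point where the bytes are stored or displayed under a different encoding.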

It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in MySQL, you need to have UTF8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.

I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using utf8_encode or utf8_decode.

XML is strict when it comes to entities: & should be &amp;, and ì should be a numeric reference such as &#236; rather than the HTML-only &igrave;.
So you will need a translation table.
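function xml_entity_decode($_string) {
    // Build a table that maps numeric character references (e.g. &#236;)
    // to the UTF-8 characters they stand for.
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8');
    // each() was removed in PHP 8, and ord() only sees the first byte of a
    // multibyte key, so use foreach and mb_ord() (PHP 7.2+) instead.
    foreach (array_keys($_xl8) as $_key) {
        $_xml['&#' . mb_ord($_key, 'UTF-8') . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}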

Late to the party... But I've faced this and solved like below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());
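A minimal sketch of the DOMDocument route, assuming the dom extension is available and reusing the sample data from the question in well-formed form. It reads the attribute and serializes the document again to show that the declared UTF-8 encoding is carried through:

```php
<?php
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . "<alldata><data name=\"Forset\u{00EC}\" /></alldata>";

$dom = new DOMDocument();
$dom->loadXML($xml);

// DOM strings are UTF-8 regardless of the source document's encoding.
$name = $dom->getElementsByTagName('data')->item(0)->getAttribute('name');

// The serialized output keeps the encoding declared in the XML declaration.
$out = $dom->saveXML();

echo $name, "\n", $out;
```

If the character still looks wrong after this, the likely cause is the encoding of the page, terminal, or database that displays the output, not the parsing step.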
