I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this into â€™. What can I do to stop this?
Just to answer some of the questions that have come up: I'm pulling the RSS feed using cURL. If I output this directly to the browser, the &#8217; displays as ’, which is what's expected. When I create a new SimpleXMLElement from it (e.g. $xml = new SimpleXMLElement($raw_feed);) and dump the $xml variable, every instance of &#8217; is replaced with â€™.
It appears that SimpleXML is having trouble with entity-encoded UTF-8 characters. (The XML declaration specifies UTF-8.)
I do have control over the feed after cURL has retrieved it, before it's used to construct the SimpleXMLElement.
&#8217; represents the Unicode character ’ (U+2019), which is encoded as the byte sequence 0xE2 0x80 0x99 in UTF-8. When that byte sequence is interpreted as Windows-1252, it represents the three characters â (0xE2), € (0x80), and ™ (0x99), i.e. â€™.
That means SimpleXML handles the input as UTF-8 encoded, but you interpret its output as Windows-1252. And unless you really want to use Windows-1252, you are probably just failing to specify the character encoding of your output properly.
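You can reproduce the symptom with a minimal sketch (the variable name is just for illustration):
$utf8 = "\xE2\x80\x99";                                   // U+2019 (’) encoded in UTF-8
echo mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252'); // re-reads those bytes as Windows-1252: prints â€™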
It came down to having to set the default encoding to UTF-8 in several places:
The default locale at the head of the file: setlocale(LC_ALL, 'en_US.UTF8');
Encoding the string that comes out of cURL: utf8_encode($string);
Setting the MySQL connection to use UTF-8 by default: mysqli_set_charset($database_insert_connection, 'utf8');
Setting the appropriate collation in the MySQL database to utf8_general_ci
If outputting to the browser, setting the appropriate header, e.g. header('Content-Type: text/html; charset=utf-8');
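Put together, a minimal sketch of those settings (the connection credentials are placeholders, and the utf8_encode call is only needed if cURL actually returned single-byte data):
setlocale(LC_ALL, 'en_US.UTF8');                        // default locale
header('Content-Type: text/html; charset=utf-8');       // declare the output encoding before any output
$raw_feed = utf8_encode($raw_feed);                     // re-encode the cURL result if it is ISO-8859-1
$database_insert_connection = mysqli_connect('localhost', 'user', 'pass', 'feeds'); // placeholder credentials
mysqli_set_charset($database_insert_connection, 'utf8'); // UTF-8 on the MySQL connection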
Hope this helps someone in the future!
Related
I am having difficulty with non-standard characters using simplexml_load_string.
I have loaded a newspaper XML feed using file_get_contents. If I print the contents to the screen, I get the title of one of the articles as:
<title>‘If Legault were running in Alberta, he’d be more popular’: How right-wing is the CAQ?</title>
If I then do this:
$feed = @simplexml_load_string($xml);
And print the results of $feed, the title has changed to:
[title] => �If Legault were running in Alberta, he�d be more popular�: How right-wing is the CAQ?
Any advice on how to stop these characters being displayed like this?
This is a charset issue: it needs to be UTF-8 throughout. You can run utf8_decode on the content, but it's better to fix the issue by matching the charset of your input (the feed) to that of your output (the HTML page, I presume).
One of my projects pulls a document from the web and reads it. This document is provided by a third party and will not change (the content will, but formatting and other stuff will not).
The problem is that this document includes content copied and pasted from Word, which is UTF-8; however, the document is encoded in ISO-8859-1, so these characters get saved to the database as '?'.
If I paste over the text and re-encode it in UTF-8, instead of getting the smart quotes and em dashes I just get two garbage characters.
How can I convert this ISO-8859-1 document with UTF-8 characters in it back into UTF-8, so it can be displayed as it was originally created?
$fixed = mb_convert_encoding($broken, "UTF-8", "ISO-8859-1");
I don't know if it'd properly handle UTF-8 embedded in ISO-8859-1, but that's the "normal" way of doing it. See the manual page for mb_convert_encoding. Give it a whirl and see if things get cleaner or more mangled.
I found the solution here: PHP: Problems converting "’" character from ISO-8859-1 to UTF-8
The server claims it's serving up ISO-8859-1, but it's really Windows-1252, which converts to UTF-8 without a problem.
Luckily, ISO 8859-1 is 8bit-transparent. Therefore, you can just decode the content with iconv, mb_convert_encoding or utf8_encode.
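For completeness, a sketch of the functions mentioned above, assuming $broken holds the mislabelled bytes; the Windows-1252 source label (per the note above) also covers the 0x80–0x9F smart-quote range:
$fixed = utf8_encode($broken);                                  // treats the input as ISO-8859-1
$fixed = mb_convert_encoding($broken, 'UTF-8', 'Windows-1252'); // also maps the 0x80–0x9F smart quotes
$fixed = iconv('Windows-1252', 'UTF-8', $broken);               // same conversion via iconv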
I'm not sure what "paste over the text" means, but if this is really UTF-8 designated as ISO 8859-1, try eliminating all intermediate text manipulation. If that still fails, please provide a (short) example input document. Chances are it's not actually UTF-8 designated as ISO 8859-1.
I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
But after I've parsed it with simplexml_load_string, the special character (the ì) becomes ì, which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save the values to a text file, or to the database, they're mangled.
This looks like SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (Latin-1) or something close to it like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, e.g. ISO-8859-1 (Latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
Your database column is not using a UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xC2 or 0xC3 lead byte is used to access the "Latin-1 Supplement" block, which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark signs, and the non-breaking space.
However, in ISO-8859-1 the single byte 0xC2 represents an Â (and 0xC3 an Ã). So when your UTF-8 string is misinterpreted as one of those single-byte encodings, you get Â or Ã followed by some other nonsense character.
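Applied to the question's document, a minimal sketch (assuming the PHP source file is itself saved as UTF-8 and the page really is served as ISO-8859-1):
$raw = '<?xml version="1.0" encoding="UTF-8"?><alldata><data name="Forsetì" /></alldata>';
$xml = simplexml_load_string($raw);
// SimpleXML always hands back UTF-8, so convert only because the page is not UTF-8
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', (string) $xml->data['name']); // Forsetì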
It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on an HTML page: make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in MySQL, you need to have UTF-8 selected as the encoding all the way through: as the connection's encoding, in the table, and in the column(s) you insert the data into.
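A sketch of what "all the way through" looks like (the credentials, table, and column names are placeholders):
$db = mysqli_connect('localhost', 'user', 'pass', 'feeds');  // placeholder connection
mysqli_set_charset($db, 'utf8');                             // connection encoding
$db->query("CREATE TABLE IF NOT EXISTS articles (
    title VARCHAR(255)
) DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci");            // table and column encoding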
I've also had some problems with this, and it came down to the PHP script's own file encoding. Make sure it's saved as UTF-8.
If it's still not right, try printing the variable through utf8_encode or utf8_decode.
XML is strict when it comes to entities: & should be &amp; and ì should be &#236;.
So you will need a translation table.
function xml_entity_decode($_string) {
    // Build a translation table mapping numeric character references
    // (e.g. &#236;) back to their ISO-8859-1 characters
    $_xml = array();
    $_xl8 = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');
    foreach ($_xl8 as $_key => $_entity) {
        $_xml['&#' . ord($_key) . ';'] = $_key;
    }
    return strtr($_string, $_xml);
}
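For example (a hypothetical input string; the decoded character comes back as an ISO-8859-1 byte):
echo xml_entity_decode('Forset&#236;'); // Forsetì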
Late to the party... but I've faced this and solved it as shown below.
The encoding is declared in the XML, so if you load the XML file using DOMDocument it won't cause any issue.
But in case it happens in another use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());
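A slightly fuller sketch of that approach ($raw is assumed to hold the XML string, and $xml here is a DOMDocument rather than a SimpleXMLElement):
$xml = new DOMDocument();
$xml->loadXML($raw);                                              // honours the declared encoding
$output = html_entity_decode($xml->saveXML(), ENT_QUOTES, 'UTF-8');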
I have written an XML file which uses the ISO-8859-15 encoding, and most of the data within the feed is run through htmlspecialchars().
I am then using simplexml_load_string() to retrieve the contents of the XML file for use in my script. However, if I have any special characters (i.e. é á ó), they come out as "Ã© Ã¡ Ã³".
How can I get my script to display the proper special accented characters?
You're probably using a different character encoding for your output than the one the XML data is actually encoded in.
According to your description, your XML data is encoded with UTF-8 but your output is using ISO 8859-15. UTF-8 encodes the character é (U+00E9) as the bytes 0xC3 0xA9, and those bytes represent the two characters Ã and © respectively in ISO 8859-15.
So either use UTF-8 for your output as well, or convert the data from UTF-8 to ISO 8859-15 using mb_convert_encoding.
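A sketch of the second option (the element name $xml->title is just for illustration):
header('Content-Type: text/html; charset=ISO-8859-15');
echo mb_convert_encoding((string) $xml->title, 'ISO-8859-15', 'UTF-8');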
I'm writing some RSS feeds in PHP and struggling with character-encoding issues. Should I utf8_encode() before or after htmlentities() encoding? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:
$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));
And why?
It's important to pass the character set to the htmlentities function; the default was ISO-8859-1 in PHP versions before 5.4 (it is UTF-8 now):
utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));
You should apply htmlentities first, so as to allow utf8_encode to encode the entities properly.
(EDIT: based on the comments, I changed my earlier opinion that the order didn't matter. This code is tested and works well.)
First: The utf8_encode function converts from ISO 8859-1 to UTF-8, so you only need it if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?
Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters with character references. htmlentities would replace “too many” characters that could be encoded directly in UTF-8. It is important that you use the ENT_QUOTES quote style to replace the single quotes as well.
So my proposal:
// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)
// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
Don't use htmlentities()!
Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
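A minimal sketch of both options (the header call has to come before any output):
header('Content-Type: application/xml; charset=UTF-8');  // preferred: declare the encoding in the HTTP headers
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";     // fallback: declare it in the feed itself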
It might be easier to forget htmlentities and use a CDATA section. It works for the title section, which doesn't seem to support encoded HTML characters in Firefox's RSS viewer:
<title><![CDATA[News & Updates " > » ☂ ☺ ☹ ☃ Test!]]></title>
You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF-8 first, and then have the ampersands (and possibly some of the UTF-8 characters as well) turned into HTML entities. If you do the entities first, then some of the international characters may not be handled properly.
If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.
After much trial and error, I finally found a way to properly display a string from a UTF-8 encoded database value, through an XML file, to an HTML page:
$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';
I hope this helps someone.