Mojibake issue in XML retrieved over cURL - php

I'm retrieving this XML feed over PHP cURL and outputting it in a textarea on my page. The problem is, it's coming back full of mojibake characters. The feed itself is fine; it's only when output on my page that the chars appear.
Pound signs (£) are coming back as Â£, for example.
I've tried throwing UTF-8 at the issue, as suggested in the answer to this question.
ini_set('default_charset', 'UTF-8');
header("Content-Type:text/html; charset=UTF-8");
And in the HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
and even by outputting the cURL response via utf8_encode(), yet still they persist.
$ch = curl_init($feed_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
echo '<textarea>'.utf8_encode($xml).'</textarea>';
I even tried swapping these chars out, but that didn't cut it.
$xml = strtr($xml, array('Â' => ''));
Am I powerless here, or is there something I can do?

Use htmlentities (http://php.net/manual/en/function.htmlentities.php) before displaying the XML content in an HTML page, also change $ch to $xml in that call, so:
echo '<textarea>'.htmlentities($xml).'</textarea>';

utf8_encode() will treat the input as latin-1 and convert it to utf-8. If the input is utf-8 this will be a double encoding - that's what you're seeing.
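The double encoding can be made visible at the byte level. This is a minimal sketch, assuming the mbstring extension is available; it uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode(), and the variable names are only illustrative.

```php
<?php
// "£" as it arrives in a UTF-8 feed: two bytes, C2 A3.
$pound = "\xC2\xA3";

// Same conversion utf8_encode() performs: every input byte is read as one
// Latin-1 character and re-encoded as UTF-8.
$double = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');

echo bin2hex($pound), "\n";  // c2a3
echo bin2hex($double), "\n"; // c382c2a3, which a UTF-8 page shows as "Â£"
```

If the bytes from cURL already look like the first line, the string is UTF-8 and should be output without utf8_encode().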
Check the XML string you're fetching from the URL. The encoding of an XML file is usually declared in the XML declaration on its first line:
<?xml version="1.0" encoding="utf-8"?>
<document-element/>
Loaded into DOM, XMLReader or SimpleXML it will always be converted to UTF-8. Any value you read using the APIs will be UTF-8.
If you want to output the UTF-8 XML in the textarea of your HTML page, you need to escape the special characters:
echo '<textarea>'.htmlspecialchars($xml).'</textarea>';
This will escape characters like < and >, but this is needed. Imagine the XML containing the string </textarea>. This would break your HTML page. The browser will decode &lt; and the other entities before displaying them.
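A small sketch of why the escaping matters, using a made-up feed string (the `<note>` element and its content are hypothetical). The flags and charset are passed explicitly so the behaviour does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical UTF-8 feed content that happens to contain "</textarea>".
$xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
     . "<note>Price: \xC2\xA35 </textarea></note>";

// Escape markup characters; valid UTF-8 bytes such as the pound sign pass
// through untouched.
$safe = htmlspecialchars($xml, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo '<textarea>' . $safe . '</textarea>';
```

The escaped string no longer contains a literal closing tag, so the textarea cannot be terminated early by the feed content.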

Related

SimpleXML and french characters

I work for an international company, so we have loads of languages to cater for.
I'm having a problem with some special characters.
I created a standalone test PHP page to eliminate any other issues that could be introduced by my system.
From various pages I read through, I found that SimpleXML processes XML as UTF-8.
Eg : PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes
So I did just that at the top of the page:
header("Content-type:text/html; charset=UTF-8");
Then I did this to check:
print mb_internal_encoding();
Not sure if this is the right function, but it gave me ISO-8859-1 in FF and Chrome.
XML looks like this:
$xml = '<?xml version="1.0" encoding="ISO-8859-15"?>
<Tracking>
<File>
<FileNumber>çúé$`~ € Š š Ž ž Œ œ Ÿ</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>';
This prints out all funny, but for the page I need, I'm not too concerned about how it prints in the browser, as the actual page will run from a cron job to import the XML into a MySQL DB, so display is not too important. It displays in FF like this though:
print $xml;
���$`~ � � � � � � � � � 124
Then i create the SimpleXML object :
$parser = new SimpleXMLElement($xml);
print_r($parser);
This prints out :
[File] => SimpleXMLElement Object
(
[FileNumber] => çúé$`~
[OrigBranch] => 124
[Login] => SimpleXMLElement Object
(
)
)
I'm not too worried about the funny characters in the print $xml;, but I do need to fix the characters in the SimpleXMLElement object that is being inserted into the DB.
Why is the SimpleXMLElement object losing the characters after the '~'? I tried changing the charset to ISO-8859-15 in the header() call, but this only led to the print $xml; looking slightly better, while still missing the characters after '~', and SimpleXMLElement gave a fatal error:
'String could not be parsed as XML
I tried before parsing XML :
$xml = mb_convert_encoding($xml, "ISO-8859-15");
$xml = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xml);
But these did not help either.
Any suggestions?
I created a specific file in latin1(ISO-8859-1) named latin1.xml with this content (you can add encoding="UTF-8" in the xml tag, it's the same):
<?xml version="1.0"?>
<Tracking>
<File>
<FileNumber>çùé$ °à §çòò àù§</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>
Then I loaded the content in the PHP file and converted it from ISO-8859-1 to UTF-8 before parsing it with SimpleXMLElement.
I echoed the content of the XML before and after the conversion:
<?php
$xml = file_get_contents('latin1.xml');
echo '<pre>'.$xml.'</pre>'."<br>";
$xml2 = iconv("ISO-8859-1","UTF-8",$xml);
echo '<pre>'.$xml2.'</pre>'."<br>";
$parser = new SimpleXMLElement($xml2);
echo '<pre>'.print_r($parser, true).'</pre>'."<br>";
Now when you load the script, if your browser is set to UTF-8 encoding, the first echo will (as expected) display incorrectly, but the second echo and the print_r($parser) will be fine. If the browser is set to ISO-8859-1 instead, the first echo will look right but the second echo and the print_r will not.
You can adjust to fit your needs.
UPDATE
ISO/IEC 8859-1 is missing some characters for French and Finnish text, as well as the euro sign.
If I understand your comments correctly, you can have the source file (XML) in ISO-8859-15, so that you can use the euro sign correctly.
I made a new file, named iso8859-15.xml, and put your new test characters there (including the euro sign). In the PHP file I changed the first instruction:
//$xml = file_get_contents('latin1.xml');
$xml = file_get_contents('iso8859-15.xml');
and, later, the conversion in:
$xml2 = iconv("ISO-8859-15","UTF-8",$xml);
Now when you load the script, if your browser is set to UTF-8 encoding, the first echo will (as expected) display incorrectly, but the second echo and the print_r($parser), the output of SimpleXML, will be fine.
So, now that your XML is parsed correctly (in UTF-8), you can convert it before writing to the DB (which is in ISO-8859-15 encoding, if I understood correctly).
To be more clear you can add this line, at the end, to the php script above:
echo '<pre> File number in ISO-8859-15 for db: '.iconv("UTF-8","ISO-8859-15",$parser->File->FileNumber).'</pre>'."<br>";
As you can see, I converted the UTF-8 data from SimpleXML to ISO-8859-15, as you should do when you write to the DB.
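The whole round trip can be condensed into a self-contained sketch that does not depend on files or on the editor's encoding. It assumes the iconv and SimpleXML extensions are available; the ISO-8859-15 bytes are written as escapes (é = 0xE9, € = 0xA4), and the XML has no encoding declaration, as in latin1.xml above, so the parser reads the converted string as UTF-8.

```php
<?php
// "é €" as ISO-8859-15 bytes inside a declaration-free XML document.
$latin9 = "<?xml version=\"1.0\"?>\n"
        . "<Tracking><File><FileNumber>\xE9 \xA4</FileNumber></File></Tracking>";

// 1. Convert the raw document to UTF-8 before parsing.
$utf8Xml = iconv('ISO-8859-15', 'UTF-8', $latin9);

// 2. Parse; every value read through SimpleXML is UTF-8.
$parser = new SimpleXMLElement($utf8Xml);
$value  = (string) $parser->File->FileNumber;

// 3. Convert back only at the boundary that needs ISO-8859-15 (here: the DB).
$forDb = iconv('UTF-8', 'ISO-8859-15', $value);

echo bin2hex($value), "\n"; // c3a920e282ac
echo bin2hex($forDb), "\n"; // e920a4
```

If the document does carry an encoding declaration, it should agree with the bytes actually handed to the parser, otherwise the parser will decode them with the wrong charset.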
That worked for me.
Hope it helps
If you build the XML yourself, try to base64-encode all strings, and then decode them back on the client side where you read the XML.
Try $xml = '<?xml version="1.0" encoding="UTF-8"?>...

SimpleXML & html entities = strange characters

I am getting a feed as such..
$posts = new SimpleXMLElement(WP_ROOT_URL . 'feed/', 0, true);
In this feed one of the items I am getting contains an HTML entity, which is the entity for the "hyphen character", which is &#8211;
However, when this is returned from SimpleXML all I get is a â€“. I have read other similar questions on SO and some mention making sure your page is set to UTF-8, though I'm not sure how this will stop SimpleXML from returning the strange characters.
Any which way I do have this on the page the data is output on:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
What can I do here to get the correct entity?
In PHP strings don't have unified or managed encoding, therefore you cannot think of them as containing characters but bytes. The result always contains the bytes 0xE28093, only the interpretation changes. You can see this by calling bin2hex() on the result.
The bytes interpreted in Windows-1252 come out as â€“; interpreted in UTF-8, they come out as –.
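The two interpretations can be reproduced directly. This is a rough sketch, assuming the mbstring extension is loaded; html_entity_decode() stands in for what the XML parser does with the character reference, and re-reading the bytes as Windows-1252 simulates what a wrongly configured page does.

```php
<?php
// The parser turns the reference &#8211; into the UTF-8 bytes for U+2013.
$dash = html_entity_decode('&#8211;', ENT_QUOTES, 'UTF-8');
echo bin2hex($dash), "\n"; // e28093

// Simulate a viewer that reads those three bytes as Windows-1252:
// each byte becomes its own character (â, €, “).
$misread = mb_convert_encoding($dash, 'UTF-8', 'Windows-1252');
echo bin2hex($misread), "\n"; // c3a2e282ace2809c
echo mb_strlen($misread, 'UTF-8'), "\n"; // 3
```

The data itself is unchanged in both cases; only the charset the viewer assumes differs.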
If you are echoing this on a web page, then you can make browser interpret your output in UTF-8 by doing:
<?php
header("Content-Type: text/html; charset=UTF-8"); //Put this before any output
echo "stuff";

DOMDocument breaks encoding?

I run the following code:
$page = '<p>Ä</p>';
$DOM = new DOMDocument;
$DOM->loadHTML($page);
echo 'source:'.$page;
echo 'dom: '.$DOM->getElementsByTagName('p')->item (0)->textContent;
and it outputs the following:
source: Ä
dom: Ã
so, I don't understand why when the text comes through DOMDocument its encoding becomes broken?
Here's a workaround that adds the proper encoding via meta header:
$DOM->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page);
I'm not sure if that's the actual character set you're trying to use, but adjust where necessary
See also: domdocument character set issue
DOMDocument appears to be treating the UTF-8 input as ISO-8859-1, which is what loadHTML() assumes when no charset is declared, and converting it to UTF-8. In this conversion, Ä (bytes 0xC3 0x84) becomes Ã„. Here's the catch: the second byte, 0x84, is not a printable character in ISO-8859-1, but is „ in Windows-1252. This is why you are seeing no second character in your output.
You can fix this by calling utf8_decode on the output of textContent, or by declaring UTF-8 in the markup you pass to loadHTML().
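A minimal sketch of the charset-hint workaround, assuming the DOM extension is available. The input is written with byte escapes so the example does not depend on how the script file itself is saved; libxml warnings are collected internally rather than printed.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$page = "<p>\xC3\x84</p>"; // "Ä" encoded as UTF-8

$dom = new DOMDocument();
// Prepending a meta tag tells the HTML parser which charset the bytes are in.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' . $page
);

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n"; // c384 when the charset hint is honoured
```

Comparing bin2hex() of the source and of textContent is a quick way to confirm whether the parser changed the bytes.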

PHP: htmlentities and htmlspecialchars not converting some characters

I'm trying to convert all special chars into HTML-safe entities on their way into my database, but I can't seem to get PHP to handle certain characters. For example, if my string contains any of the following: ¡£¢∞§¶ it gets turned into an empty string.
So for example, the following string:
Hello£
gets turned into an empty string after it's POSTed and processed by the following code:
$workDetails["copy"] = htmlentities($workDetails["copy"], ENT_QUOTES, "UTF-8");
I presume I'm doing something wrong? :(
Maybe it will just be enough if you change the Encoding of your website to UTF-8 via the header() command:
header("Content-Type: text/html; charset=utf-8"); in PHP
or
<?xml version="1.0" encoding="utf-8" ?>; at the top of your HTML template if you use one.
but if you definitely need to convert those chars to its specific html code, you should create your own function to replace the symbols which are not covered by htmlspecialchars() as well.
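One documented cause of an empty result is worth checking first: htmlentities() returns an empty string when the input is not valid in the charset passed to it, unless ENT_IGNORE or ENT_SUBSTITUTE is set. The sketch below assumes the POSTed data arrived as ISO-8859-1, which is only a guess about this setup, and that mbstring is available.

```php
<?php
// "Hello£" as ISO-8859-1 bytes; 0xA3 on its own is not valid UTF-8.
$latin1 = "Hello\xA3";

// Invalid UTF-8 input with no ENT_SUBSTITUTE/ENT_IGNORE flag: empty string.
$broken = htmlentities($latin1, ENT_QUOTES, 'UTF-8');
var_dump($broken); // string(0) ""

// Convert to UTF-8 first (the source charset here is an assumption).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo htmlentities($utf8, ENT_QUOTES, 'UTF-8'), "\n"; // Hello&pound;
```

If the form page is served as UTF-8, browsers normally submit UTF-8 and the conversion step becomes unnecessary.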

Saving special characters to DB then display using PHP

I have a script which caches a number of RSS feeds, however I have noticed that I've started getting strange characters appearing in the page where I output the cached contents (Stored in DB).
For instance, the RSS feed contains the characters: Introducing&#8230;: ...
Which should read: Introducing...: ...
However, my page displays it as: Introducingâ€¦: ...
It seems that these strange chars are actually being stored in the database like this.
Can anyone suggest where I might be going wrong?
Do I need to encode on the way into the database, then decode on the way out?
You need to make sure that the encoding of the RSS feed is the same as in your DB. Otherwise you first need to convert the content.
The encoding of the feed should be in the XML header:
<?xml version="1.0" encoding="UTF-8"?>
You can use this function to convert it to the encoding you use in the DB (preferably UTF-8):
http://php.net/manual/function.mb-convert-encoding.php
When you use UTF-8, make sure you also set the database connection to UTF-8, e.g. in MySQL (note that MySQL spells the charset name without a hyphen):
SET NAMES 'utf8';
Then set the correct output content-type like described by Anthony Williams. At best you do both: set the META Content-Type and send the Content-Type HTTP-Header.
Since your application seems to decode the htmlentities of that cached RSS feed before writing them to the DB, you may also output them like you got them in the first place
<?php echo htmlentities($string, ENT_QUOTES, 'UTF-8'); ?>
The fact that there are 3 bad characters in the output suggests that the RSS feed is being interpreted so that the HTML character reference is converted to UTF-8.
Try setting the text encoding of your display page to UTF-8 by adding the following to the output HTML in the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Alternatively, since this is PHP you can set the HTTP header directly:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
However, a better solution might be to avoid converting the entity in the first place. Have you got a call to html_entity_decode() in the code that retrieves the RSS feed? If so, then it might be wise to remove it.
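Both points can be checked with a short sketch: decoding the reference yields three UTF-8 bytes, which is where the three bad characters come from, and re-encoding on output brings the entity back. The sample string is taken from the question; everything else is illustrative.

```php
<?php
$feedText = 'Introducing&#8230;: ...';

// What html_entity_decode() stores: the ellipsis as three UTF-8 bytes.
$decoded = html_entity_decode($feedText, ENT_QUOTES, 'UTF-8');
echo bin2hex(substr($decoded, 11, 3)), "\n"; // e280a6

// Re-encoding on output gives an entity again, independent of page charset.
echo htmlentities($decoded, ENT_QUOTES, 'UTF-8'), "\n"; // Introducing&hellip;: ...
```

Storing the decoded UTF-8 is fine as long as the database connection and the page both use UTF-8; the entity form only sidesteps a mismatch.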
