Outputting UTF-8 with PHP SimpleXML - php

I'm trying to parse an XML file generated by WordPress's export function. I've grabbed the text from the block, but when I echo it the text comes out garbled, converted to ASCII I think.
<?php
header("Content-Type: text/plain; charset: UTF-8;");
$source = file_get_contents("blog.wordpress.2013-10-31.xml");
$xml = simplexml_load_string($source);
$items = $xml->channel->item;
foreach ($items as $item) {
    $namespaces = $item->getNamespaces(true);
    $content = $item->children($namespaces['content']);
    if ($content != '') {
        echo '#' . $item->title . "#\n";
        echo $content->encoded;
        echo "\n\n\n";
    }
}
So As the BBC’s would become As the BBCâ€™s. Is there any way I can stop this?
Edit: I've added echo '“Test”'; just after the header and I'm seeing â€œTestâ€ in my browser, so this doesn't appear to be a SimpleXML issue.

The UTF-8 encoding of ’ is the three bytes 0xE2 0x80 0x99, and those same bytes read as Windows-1252 are â€™. That is exactly what you describe, so it seems your UTF-8 encoded strings are being interpreted as Windows-1252 somewhere along the way.
SimpleXML always returns element and attribute values as UTF-8, so I see no problem with that part of your code.
It is therefore more likely that the XML file declares the wrong encoding. Fix that and you should be fine. As you have not shown the file, it's hard to say exactly what needs to change or why the encoding got mixed up in the first place (perhaps a transfer issue).
You may need to re-encode the XML file before you pass it to the parser. If so, XMLRecoder might be helpful.
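The byte-level claim in this answer can be checked with a short sketch. It assumes PHP 7 or later with the mbstring extension loaded; the variable names are only illustrative.

```php
<?php
// The right single quotation mark U+2019, as UTF-8 bytes.
$quote = "\u{2019}";
$hex = bin2hex($quote);
echo $hex, "\n"; // e28099

// Treat those three bytes as Windows-1252 and re-encode them as UTF-8,
// which is effectively what a viewer using the wrong charset does.
$misread = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // â€™
```

If the garbled text in your output matches $misread, the data itself is fine and only the charset announced to the viewer is wrong.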

Your header uses a colon where an equals sign is required: charset: UTF-8
The correct form is
header('Content-Type: text/html; charset=utf-8');

Check your XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
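One way to check the declaration from PHP rather than by eye is to read it back through DOM. This is only a sketch: $source here is a hypothetical stand-in for the contents of the export file.

```php
<?php
// Hypothetical stand-in for the contents of the export file.
$source = '<?xml version="1.0" encoding="UTF-8"?><rss><channel><item/></channel></rss>';

$dom = new DOMDocument();
$dom->loadXML($source);

// xmlEncoding holds the encoding named in the XML declaration.
$declared = $dom->xmlEncoding;
echo $declared, "\n"; // UTF-8
```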

Related

PHP htmlentities or htmlspecialchars

I'm using fwrite to create an XML file, but I'm losing the special characters.
Example:
$message0 = htmlspecialchars('<?xml version="1.0" encoding="UTF-8"?>');
$file = fopen("test.xml","w");
echo fwrite($file,"$message0");
fclose($file);
The above code gives me the following output:
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;JobTemplates&gt;
I need the special characters in order for the XML file to work. If I echo the variables, the special characters appear on the page.
I don't understand why you're encoding HTML characters for this. It's a trusted string, so just put it in single quotes and write it. If any characters are giving you trouble, escape them instead of encoding them.
If there's a reason you must do it this way, then decode inline. But it all seems a bit messy to me.
Here is a tested example; you should not use htmlspecialchars:
$message0 = '<?xml version="1.0" encoding="UTF-8"?><contact><name>foo</name><phone>123456</phone></contact>';
$file = fopen("test.xml","w");
fwrite($file,$message0);
fclose($file);
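Escaping does have a place, but only for text values placed inside the markup, never for the markup itself. A minimal sketch, assuming a hypothetical $name value that contains XML special characters:

```php
<?php
// Hypothetical value containing characters with special meaning in XML.
$name = 'Fish & Chips <Ltd>';

// The markup is written literally; only the text value is escaped.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<contact><name>'
     . htmlspecialchars($name, ENT_XML1 | ENT_QUOTES, 'UTF-8')
     . '</name></contact>';
echo $xml, "\n";

// Parsing the result gives back the original value.
$parsed = simplexml_load_string($xml);
$roundTrip = (string) $parsed->name;
echo $roundTrip, "\n"; // Fish & Chips <Ltd>
```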

Accents and ñ xml problems

I'm trying to create an XML document and I have a problem with some characters. I need to replace accented letters and the letter ñ.
The output of the following code:
header('Content-type: text/html; charset=utf-8');
var_dump($this->xml_entities_s("Relucí"));
It shows:
string 'Reluc&iacute;'
When I try to create the XML:
header('Content-type: text/xml; charset=utf-8');
$output = '<?xml version="1.0" encoding="UTF-8"?>';
$output .= $this->xml_entities_s("Relucí");
echo $output;
It shows:
string 'Relucí'
And I want this to show:
string 'Reluc&iacute;'
I need the output above because another site fetches data from my site, and they asked for the XML to contain &iacute; so that it can be parsed correctly.
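private function xml_entities_s($string) {
    return str_replace(
        array("<", ">", '"', "'", "&", "á", "Á", "é", "É", "í", "Í", "ó", "Ó", "ú", "Ú", "ñ", "Ñ"),
        array("&lt;", "&gt;", "&quot;", "&apos;", "&amp;", "&aacute;", "&Aacute;", "&eacute;", "&Eacute;", "&iacute;", "&Iacute;", "&oacute;", "&Oacute;", "&uacute;", "&Uacute;", "&ntilde;", "&Ntilde;"),
        $string
    );
}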
Could you help with this? Thanks in advance.
You don't really need to encode those characters; UTF-8 supports them. Only characters with a special meaning in XML (like <) need to be escaped. If you use DOM to generate the XML, it will take care of that.
If you want to generate ASCII-only XML, you can specify that in the constructor:
$dom = new DOMDocument('1.0', 'ASCII');
$dom
->appendChild($dom->createElement('div'))
->appendChild($dom->createTextNode('Relucí'));
echo $dom->saveXml();
Output:
<?xml version="1.0" encoding="ASCII"?>
<div>Reluc&#237;</div>
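For comparison, here is a sketch of the same document created with UTF-8 as its encoding. The text value is made up to include both accented letters and XML special characters.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom
    ->appendChild($dom->createElement('div'))
    ->appendChild($dom->createTextNode('Relucí & ñ < 1'));

// Accented letters are written as-is; only & and < are escaped.
$out = $dom->saveXML();
echo $out;
```

The accented letters stay as literal UTF-8 characters, and DOM escapes the ampersand and the less-than sign on its own.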

Character encoding while using DOMDocument for parsing a xml-file

I have problems with wrong character encoding while reading an XML file.
While this one shows the complete content of the file correctly...
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
echo $reader->saveXML();
...this one gives me strange output (German umlauts, em dashes, µ and similar characters aren't shown correctly):
$reader = new DOMDocument();
$reader->preserveWhiteSpace = false;
$reader->load('zip://content.odt#content.xml');
$elements = $reader->getElementsByTagName('text');
$content = '';
foreach ($elements as $node) {
    foreach ($node->childNodes as $child) {
        $content .= $child->nodeValue;
    }
}
echo $content;
I don't know why this is the case. Hope someone can explain it to me.
DOMDocument::saveXML()
This method returns the whole XML document as string. As with any XML document, the encoding is given in the XML declaration or it has the default encoding which is UTF-8.
DOMNode::$nodeValue
Contains the value of a node, most often text. All text strings returned by the DOMDocument library, of which DOMNode is a part, are UTF-8 encoded regardless of the encoding of the XML document.
You write that when you display the first:
echo $reader->saveXML();
all umlauts are preserved. So most likely the XML itself ships with an encoding other than UTF-8, because the latter
$content .= $child->nodeValue;
...
echo $content;
does not preserve them.
As you don't share how and with which application you're displaying and reading the output, not much more can be said.
You most likely need to hint the character encoding to the displaying application in the latter case. For example, if you display the text in a browser, you should send the appropriate Content-Type header at the very beginning:
header("Content-Type: text/plain; charset=utf-8");
Compare with How to set UTF-8 encoding for a PHP file.
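The difference between the two calls can be seen in a small sketch. It uses a hypothetical ISO-8859-1 document held in a string instead of the ODT file from the question.

```php
<?php
// Hypothetical ISO-8859-1 document: \xFC is the single Latin-1 byte for ü.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><text>\xFC</text>";

$reader = new DOMDocument();
$reader->loadXML($latin1);

// nodeValue is UTF-8, so ü becomes the two bytes C3 BC.
$value = $reader->documentElement->nodeValue;
echo bin2hex($value), "\n"; // c3bc

// saveXML() serializes in the declared encoding, so the byte FC comes back.
$serialized = $reader->saveXML();
var_dump(strpos($serialized, "\xFC") !== false); // bool(true)
```

A viewer that assumes the document's own encoding will therefore show the saveXML() output correctly and the nodeValue output incorrectly, unless it is told that the latter is UTF-8.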

Parsing xml with PHP what to do with characters like these

I'm parsing an XML document using PHP.
When I view the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use str_replace to swap every odd character for the correct one, but sadly the pattern above happens only sometimes, and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed.
I load the XML as a string with simplexml_load_string (it comes from the database like that).
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
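SimpleXML uses UTF-8 to encode stored strings. You can use an XML file declared as iso-8859-1, but if you want to print XML values in that encoding, you have to pass them through utf8_decode first (deprecated since PHP 8.2; mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8') does the same job).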
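// Strip control characters and every non-ASCII byte (this also removes accented letters)
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// Load the XML from a file; the third argument marks the first as a path rather than XML data
$xml = new SimpleXMLElement('new.xml', 0, true);
// Display the XML in textual form
echo $xml->asXML();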
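The point about SimpleXML storing UTF-8 can be sketched as follows. The document string and element name are hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical document declared as iso-8859-1; \xF1 is the Latin-1 byte for ñ.
$source = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><word>ni\xF1o</word>";

$xml = simplexml_load_string($source);

// SimpleXML hands the value back as UTF-8 (ñ is C3 B1).
$utf8 = (string) $xml;
echo bin2hex($utf8), "\n"; // 6e69c3b16f

// Only if the page really must be sent as ISO-8859-1: convert on output.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // 6e69f16f
```

Whichever form is echoed, the charset in the Content-Type header has to name the same encoding, otherwise the browser shows the two-character sequences from the question.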

simplexml_load_file and encoding problem

SimpleXML will convert all text into UTF-8, if the source XML declaration has another encoding. So, all the text in the resulting SimpleXMLElement will be in UTF-8 automatically.
In my case the source has the following XML decl:
<?xml version="1.0" encoding="windows-1251" ?>
What should I do to get normal output? Because, as you can imagine, for now I get strange symbols.
Thanks.
Maybe a stupid answer, but just don't use SimpleXML. Just use DOM.
Try using iconv to convert the encoding.
With the iconv() function you can convert from one encoding to another; the TRANSLIT option might help. If you convert the whole document this way, the encoding named in its XML declaration has to be changed to match.
<?php
// $xml = {STRING CONTAINING YOUR XML FILE DATA};
// convert string from utf-8 to iso8859-1
//$xml = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $xml);
$xml = iconv("YOUR_ENCODING", "UTF-8//TRANSLIT", $xml);
?>
My advice is to use UTF-8 as the encoding of your .php source files and (if possible) as the output encoding too. With gzip compression, the size difference between windows-1251 and UTF-8 responses (even for mostly Cyrillic text) is minimal, and UTF-8 is better in many ways.
As you said, simplexml will convert windows-1251 to UTF-8 on xml import and then you don't have to worry about any encodings.
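A minimal sketch of that import step, using a hypothetical windows-1251 document held in a string. It assumes the libxml build behind PHP can decode windows-1251, which normally means it was compiled with iconv support.

```php
<?php
// Hypothetical windows-1251 document; the bytes spell "Привет" in that encoding.
$cp1251 = "<?xml version=\"1.0\" encoding=\"windows-1251\"?>"
        . "<greeting>\xCF\xF0\xE8\xE2\xE5\xF2</greeting>";

$xml = simplexml_load_string($cp1251);

// The value comes back as UTF-8 with no manual conversion.
$text = (string) $xml;
echo $text, "\n"; // Привет
```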
If you have to use windows-1251 for output then use something like:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "windows-1251");
ob_start("ob_iconv_handler");
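One catch with UTF-8 in PHP source files is character classes in regular expressions: /[ю]/ won't work as you might expect, while /(ю)/ will. Without the u modifier, PCRE treats the pattern as bytes, so the class matches either byte of ю on its own.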
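The character class pitfall can be reproduced with a short sketch. It assumes the source file is saved as UTF-8, where ю is the bytes D1 8E and я is D1 8F; the variable names are illustrative.

```php
<?php
// Without /u the class is a set of bytes, so the shared byte D1 in я matches.
$byteMode = preg_match('/[ю]/', 'я');
var_dump($byteMode); // int(1)

// With /u the class holds one code point, and я no longer matches.
$unicodeMiss = preg_match('/[ю]/u', 'я');
var_dump($unicodeMiss); // int(0)

// The intended letter still matches in Unicode mode.
$unicodeHit = preg_match('/[ю]/u', 'ю');
var_dump($unicodeHit); // int(1)
```

Adding the u modifier is usually enough to make character classes behave per character when both the pattern and the subject are valid UTF-8.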
