Accents and ñ xml problems - php

I'm trying to create an XML document and I have a problem with some characters. I need to replace accented letters and the letter ñ.
The output of the following code:
header('Content-type: text/html; charset=utf-8');
var_dump($this->xml_entities_s("Relucí"));
It shows:
string 'Reluc&#237;'
When I try to create the XML:
header('Content-type: text/xml; charset=utf-8');
$output = '<?xml version="1.0" encoding="UTF-8"?>';
$output .= $this->xml_entities_s("Relucí");
echo $output;
It shows:
string 'Relucí'
And I want this to show:
string 'Reluc&#237;'
I need to show the above because another site pulls data from my site, and they asked to receive it in XML with &#237; so that it can be parsed correctly.
private function xml_entities_s($string) {
return str_replace(array("<",">",'"',"'","&","á","Á","é","É","í","Í","ó","Ó","ú","Ú","ñ","Ñ"),
array("&lt;","&gt;","&quot;","&apos;","&amp;","&#225;","&#193;","&#233;","&#201;","&#237;","&#205;","&#243;","&#211;","&#250;","&#218;","&#241;","&#209;"),
$string);
}
Could you help with this? Thanks in advance.

You don't really need to encode characters. UTF-8 supports them. Only characters with a special meaning (like <) need to be encoded. If you're using DOM to generate the XML it will take care of it.
If you want to generate an ASCII XML you can define that in the constructor:
$dom = new DOMDocument('1.0', 'ASCII');
$dom
->appendChild($dom->createElement('div'))
->appendChild($dom->createTextNode('Relucí'));
echo $dom->saveXml();
Output:
<?xml version="1.0" encoding="ASCII"?>
<div>Reluc&#237;</div>
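For comparison, here is a minimal sketch of the same DOM code with a UTF-8 target encoding. The sample text containing & and < is made up for illustration, and the script file itself is assumed to be saved as UTF-8.

```php
<?php
// Same structure as above, but the document declares UTF-8.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom
    ->appendChild($dom->createElement('div'))
    ->appendChild($dom->createTextNode('Relucí & <más>'));

// The accented letter is written as-is; only & and < are escaped by DOM.
$xml = $dom->saveXML();
echo $xml;
```

The accented character needs no entity here; only the characters with special meaning in XML are escaped in the serialized text.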

Related

How to dump an XML document's element as a string that has the same encoding as the document?

Take, for example, an ISO-8859-1 encoded XML document that contains characters outside that encoding's character set, such as the € (euro) symbol. This is possible in XML if the symbol is written as a numeric character reference, in this case the string &#8364;:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
I need to obtain the bar element as a string in the same encoding as the document, which means encoded in ISO-8859-1 (and therefore preserving the character references for characters that are not part of this encoding), i.e. the ISO-8859-1 string <bar>&#8364;</bar>.
I couldn't achieve this by using the saveXML method of the DOMDocument class, since it dumps elements always in UTF-8 (whilst whole documents always in the encoding of their XML declaration):
$DD = new DOMDocument;
$DD -> load('foo.xml');
$dump = $DD -> saveXML($DD -> getElementsByTagName('bar') -> item(0));
The $dump variable resulted in the UTF-8 string <bar>€</bar>.
Notice that the element is dumped with its character references translated into actual UTF-8 characters.
So, how would I get the ISO-8859-1 string <bar>&#8364;</bar>? Are XML parsers meant to handle this sort of task, or should I just use regular expressions or something else?
Yes, they will decode entities and if you only save a part of a document it will be UTF-8 because it has no way to specify the encoding - it defaults back to UTF-8.
Here is a demo:
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
XML;
$source = new DOMDocument();
$source->loadXML($xml);
echo "Document Part:\n";
echo $source->saveXML($source->getElementsByTagName('bar')->item(0));
echo "\n\n";
echo "Whole Document:\n";
echo $source->saveXML();
echo "\n\n";
Output:
Document Part:
<bar>€</bar>
Whole Document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<foo>
<bar>&#8364;</bar>
</foo>
You could copy the node into a new document. However the output will include the XML declaration with the encoding:
$target = new DOMDocument('1.0', 'ASCII');
$target->appendChild($target->importNode($source->getElementsByTagName('bar')->item(0), true));
echo "Separated Node:\n";
echo $target->saveXML();
Output:
Separated Node:
<?xml version="1.0" encoding="ASCII"?>
<bar>&#8364;</bar>
It looks like the encoding is not used when saveXML() is used with a node argument. When you set the $encoding property on the DOMDocument class it will be used in the saveXML() function, but only when saving the whole document. By checking the source code of the saveXML() function you will see there is even a comment mentioning the encoding property:
if (nodep != NULL) {
[...]
} else {
[...]
/* Encoding is handled from the encoding property set on the document */
xmlDocDumpFormatMemory(docp, &mem, &size, format);
}
According to the Document Object Model (DOM) Level 3 Load and Save Specification a lot of defined types support setting the encoding (and the PHP implementation has it at least on the DOMDocument class). So I'm not sure if it is a bug in the implementation of DOM in PHP. However, the documentation also states that it uses UTF-8 encoding:
Note:
The DOM extension uses UTF-8 encoding. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or iconv for other encodings.
So, the solution would be to use such functions to convert it to the correct result or only save the whole XML document with saveXML() without any arguments given.
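One way to do that conversion by hand is sketched below. It assumes the mbstring extension is available; the sample document (with an extra é) and the variable names are made up for illustration, and this is not something the DOM extension does for you.

```php
<?php
$source = new DOMDocument();
$source->loadXML(
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo><bar>&#8364; caf&#233;</bar></foo>'
);

// saveXML() with a node argument returns UTF-8 with the references decoded.
$utf8 = $source->saveXML($source->getElementsByTagName('bar')->item(0));

// Turn every character above U+00FF (outside ISO-8859-1) back into a
// decimal character reference. Map format: [start, end, offset, mask].
$convmap = [0x100, 0x10FFFF, 0, 0x1FFFFF];
$escaped = mb_encode_numericentity($utf8, $convmap, 'UTF-8');

// Everything that is left now fits into ISO-8859-1.
$latin1 = mb_convert_encoding($escaped, 'ISO-8859-1', 'UTF-8');

// Print the bytes as hex so the result is visible regardless of terminal encoding.
echo bin2hex($latin1), "\n";
```

The euro sign ends up as the reference &#8364;, while é becomes the single ISO-8859-1 byte 0xE9. For another target encoding, the range in the conversion map would have to be adjusted to whatever that encoding cannot represent.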

PHP htmlentities or htmlspecialchars

I'm using fwrite to create an XML file, but I'm losing the special characters.
Example:
$message0 = htmlspecialchars('<?xml version="1.0" encoding="UTF-8"?>');
$file = fopen("test.xml","w");
echo fwrite($file,"$message0");
fclose($file);
The above code gives me the following output
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;JobTemplates&gt;
I need the special characters in order for the xml file to work. If i echo the variables, the special characters appear on the page.
I don't understand why you're encoding HTML characters here. It's a trusted string, so just put it in single quotes and write it. If any characters give you trouble, escape them instead of encoding them.
If there's a reason you must do it this way, then decode inline. But it all seems a bit messy to me.
Here is a tested example; you should not use htmlspecialchars:
$message0 = '<?xml version="1.0" encoding="UTF-8"?><contact><name>foo</name><phone>123456</phone></contact>';
$file = fopen("test.xml","w");
fwrite($file,$message0);
fclose($file);
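If some of the values inside the XML are not trusted, the escaping belongs on those values only, never on the markup. A minimal sketch follows; the sample name and the temporary file path are made up for illustration.

```php
<?php
// Hypothetical value; imagine it comes from user input or a database.
$name = 'Foo & Sons <Ltd>';

// Escape the data only; the markup around it is written literally.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<contact><name>'
     . htmlspecialchars($name, ENT_XML1 | ENT_QUOTES, 'UTF-8')
     . '</name></contact>';

$path = sys_get_temp_dir() . '/test.xml';
file_put_contents($path, $xml);

// The file parses, and the original value comes back unchanged.
$contact = simplexml_load_file($path);
echo $contact->name, "\n";
```

The XML declaration and the tags keep their angle brackets and quotes, so the file stays well-formed, while the & and < inside the value are stored as &amp; and &lt;.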

Euro Currency Symbol breaks XML document

I am adding content to an XML document using PHP's file_put_contents and then opening that document in Microsoft Word. The problem is that if I add the Euro currency symbol (€), the document breaks and I get the following error:
&euro; is not a valid XML entity.
Trying to solve encoding issues with entities is a bad practice. Instead, make sure all your strings are properly UTF-8.
First make sure that your strings are UTF-8 actually. The methods and functions in PHP will expect it as UTF-8 independent from the output. It is possible to work with other character sets/encodings but this is really complex.
If you create the XML using an XML API like DOM or XMLWriter, it will take care of the encoding as needed. In a UTF-8 XML document the € does not need to be encoded.
$document = new DOMDocument('1.0', 'UTF-8');
$document
->appendChild($document->createElement('price'))
->appendChild($document->createTextNode('€ 42.00'));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<price>€ 42.00</price>
However, in an ASCII XML document the special character needs to be encoded as a numeric entity. Named entities like &euro; will not work. They are specific to (X)HTML and not XML.
$document = new DOMDocument('1.0', 'ASCII');
$document
->appendChild($document->createElement('price'))
->appendChild($document->createTextNode('€ 42.00'));
echo $document->saveXml();
Output:
<?xml version="1.0" encoding="ASCII"?>
<price>&#8364; 42.00</price>
The same is possible with XMLWriter:
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'ASCII');
$writer->writeElement("price", '€ 42.00');
$writer->endDocument();
echo $writer->outputMemory();
If you generate the XML as text (usually not the best choice), you will have to take care of the encoding yourself:
echo '<?xml version="1.0" encoding="UTF-8"?>', "\n";
printf('<price>%s</price>', htmlentities('€ 42.00', ENT_XML1 | ENT_COMPAT, "UTF-8"));
Output:
<?xml version="1.0" encoding="UTF-8"?>
<price>€ 42.00</price>
Have you tried using '&#8364;'? Also make sure you clean up your string using the snippet below:
$currentString = preg_replace('/[^!-~ ]/', '', $currentString);

Outputting UTF-8 with PHP SimpleXML

I'm trying to parse an XML file generated from Wordpress' export function. I've grabbed the text from the block but when I echo the text it gets malformed, into ASCII I think.
<?php
header("Content-Type: text/plain; charset: UTF-8;");
$source = file_get_contents("blog.wordpress.2013-10-31.xml");
$xml = simplexml_load_string($source);
$items = $xml->channel->item;
foreach($items as $item) {
$namepsaces = $item->getNameSpaces(true);
$content = $item->children($namepsaces['content']);
if($content != '') {
echo '#' . $item->title . "#\n";
echo $content->encoded;
echo "\n\n\n";
}
}
So As the BBC’s would become As the BBCâ€™s. Any way I can stop this?
Edit: I've appended echo '“Test”'; just after the header and I'm seeing â€œTestâ€ in my browser, so this doesn't appear to be a SimpleXML issue.
As UTF-8 ’ (0xE2 0x80 0x99) is WINDOWS-1252 â € ™ and that is exactly what you describe, it seems that you load UTF-8 encoded strings as WINDOWS-1252.
The output of SimpleXML when you read from elements or attributes is always UTF-8 encoded, therefore about that part I see no problem with your code.
So it's more likely that the XML file has the wrong encoding hinted. Fix that and you should be fine (as you have not shown that file, it's hard to say what exactly needs to be changed and why the encoding got mixed-up in the first place, perhaps some transfer issue).
You perhaps need to re-encode the XML file before you send it to the parser. If so, XMLRecoder might be helpful.
You are using a colon here: charset: UTF-8
The correct code is
header('Content-Type: text/html; charset=utf-8');
Check your XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
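The mix-up described above can be reproduced in a few lines. This is only a sketch to show where the odd characters come from, assuming the mbstring extension is available; the variable names are made up.

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, as used in "BBC’s".
$quote = "\u{2019}";
echo bin2hex($quote), "\n";   // the three UTF-8 bytes: e28099

// Pretend those three bytes are Windows-1252 text and convert them to UTF-8.
// This is what a browser effectively does when the charset is wrong.
$mojibake = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n";
```

Each of the three bytes is read as its own Windows-1252 character, which is why one quotation mark turns into three characters on screen.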

Parsing xml with PHP what to do with characters like these

I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use a str_replace and replace every odd character for the good ones, but sadly the pattern before happens only sometimes and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed ..
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
SimpleXML uses UTF-8 to encode stored strings. You can use an XML-File with iso-8859-1, but if you want to print XML values with this encoding, you have to use utf8_decode before.
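// Strip control characters and all non-ASCII bytes (this also removes accented letters)
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// Load the XML from a file: the third argument marks the first one as a path, not XML data
$xml = new SimpleXMLElement('new.xml', 0, true);
// Display the XML in textual form
echo $xml->asXML();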
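To see what SimpleXML does with an ISO-8859-1 document, here is a minimal sketch. The sample document and variable names are made up, and the conversion step assumes the mbstring extension.

```php
<?php
// Sample document in ISO-8859-1: \xF1 is the single byte for ñ in that encoding.
$source = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<word>ma\xF1ana</word>";

$xml  = simplexml_load_string($source);
$utf8 = (string) $xml;   // SimpleXML returns UTF-8: ñ is now the bytes C3 B1

// Option 1 (web context): announce UTF-8 and print the value as-is.
// header('Content-Type: text/html; charset=utf-8');

// Option 2: convert back to ISO-8859-1 if the page must stay in that encoding.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Hex output makes the byte difference visible.
echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Whichever option is used, the charset announced to the browser has to match the bytes that are actually sent; the two-character sequences in the question appear when UTF-8 bytes are displayed as ISO-8859-1.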
