SimpleXML and french characters - php

I work for an international company, so we have loads of languages to cater for.
I'm having a problem with some special characters.
I created a standalone test PHP page to eliminate any other issues that could be introduced by my system.
From various pages I read through, I found that SimpleXML processes XML as UTF-8.
E.g.: PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes
So I did just that at the top of the page:
header("Content-type:text/html; charset=UTF-8");
Then I did this to check:
print mb_internal_encoding();
Not sure if this is the right function, but it gave me ISO-8859-1 in FF and Chrome.
XML looks like this:
$xml = '<?xml version="1.0" encoding="ISO-8859-15"?>
<Tracking>
<File>
<FileNumber>çúé$`~ € Š š Ž ž Œ œ Ÿ</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>';
This prints out all funny, but for the page I need, I'm not too concerned about how it prints in the browser: the actual page will run from a cron job to import the XML into a MySQL DB, so display is not too important. It displays in FF like this, though:
print $xml;
���$`~ � � � � � � � � � 124
Then I create the SimpleXML object:
$parser = new SimpleXMLElement($xml);
print_r($parser);
This prints out :
[File] => SimpleXMLElement Object
(
[FileNumber] => çúé$`~
[OrigBranch] => 124
[Login] => SimpleXMLElement Object
(
)
)
I'm not too worried about the funny characters in the print $xml; output; what I need to fix is the characters in the SimpleXMLElement object that is being inserted into the DB.
Why is the SimpleXMLElement object losing the characters after the '~'? I tried changing the charset to ISO-8859-15 in the header() call, but this only made the print $xml; output look slightly better; it was still missing the characters after '~', and SimpleXMLElement gave a fatal error:
'String could not be parsed as XML'
I tried these before parsing the XML:
$xml = mb_convert_encoding($xml, "ISO-8859-15");
$xml = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xml);
But these did not help either.
Any suggestions?

I created a file encoded in Latin-1 (ISO-8859-1) named latin1.xml with this content (you can add encoding="UTF-8" to the XML declaration; the result is the same):
<?xml version="1.0"?>
<Tracking>
<File>
<FileNumber>çùé$ °à §çòò àù§</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>
Then I loaded the content in the PHP file, converted it from ISO-8859-1 to UTF-8, and only then parsed it with SimpleXMLElement.
I echoed the XML content before and after the conversion:
<?php
$xml = file_get_contents('latin1.xml');
echo '<pre>'.$xml.'</pre>'."<br>";
$xml2 = iconv("ISO-8859-1","UTF-8",$xml);
echo '<pre>'.$xml2.'</pre>'."<br>";
$parser = new SimpleXMLElement($xml2);
echo '<pre>'.print_r($parser, true).'</pre>'."<br>";
Now, when you load the script with your browser set to UTF-8 encoding, the first echo will (correctly) look garbled, but the second echo and the print_r($parser) output will be fine. If the browser is set to ISO-8859-1 instead, the first echo will look right but the second echo and the print_r output will not.
You can adjust this to fit your needs.
UPDATE
ISO/IEC 8859-1 is missing some characters for French and Finnish text, as well as the euro sign.
If I understand your comments correctly, you can have the source file (XML) in ISO-8859-15; that way the euro sign works correctly.
I made a new file named iso8859-15.xml and put your new test characters in it (including the euro sign). In the PHP file I changed the first instruction:
//$xml = file_get_contents('latin1.xml');
$xml = file_get_contents('iso8859-15.xml');
and, later, the conversion to:
$xml2 = iconv("ISO-8859-15","UTF-8",$xml);
Now, when you load the script with your browser set to UTF-8 encoding, the first echo will (correctly) look garbled, but the second echo and the print_r($parser) output from SimpleXML will be fine.
So, now that your XML is parsed correctly (as UTF-8), you can convert the values before writing them to the DB (which uses ISO-8859-15, if I understood correctly).
To make this clearer, you can add this line at the end of the PHP script above:
echo '<pre> File number in ISO-8859-15 for db: '.iconv("UTF-8","ISO-8859-15",$parser->File->FileNumber).'</pre>'."<br>";
As you can see, I converted the UTF-8 data from SimpleXML to ISO-8859-15, as you should do when you write to the DB.
That worked for me.
Hope it helps
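The whole round trip can also be sketched without any files. This is only a minimal sketch, not the exact script above: the sample text is made up, the "\u{...}" escapes need PHP 7 or later, and it assumes your iconv build knows ISO-8859-15.

```php
<?php
// Hypothetical sample: "çúé € Š œ" written with \u{...} escapes, so the
// literal is UTF-8 no matter how this source file itself is saved.
$sample = "\u{E7}\u{FA}\u{E9} \u{20AC} \u{160} \u{153}";

// Simulate the incoming feed: ISO-8859-15 bytes, no encoding declaration.
$incoming = iconv(
    'UTF-8',
    'ISO-8859-15',
    '<?xml version="1.0"?><Tracking><File><FileNumber>' . $sample
    . '</FileNumber></File></Tracking>'
);

// 1. Convert the raw bytes to UTF-8 before parsing.
$xmlUtf8 = iconv('ISO-8859-15', 'UTF-8', $incoming);

// 2. Parse; every string SimpleXML hands back is UTF-8.
$parser = new SimpleXMLElement($xmlUtf8);
$fileNumber = (string) $parser->File->FileNumber;

// 3. Convert back only if the database column really is ISO-8859-15.
$forDb = iconv('UTF-8', 'ISO-8859-15', $fileNumber);

// Hex dump of the value for the DB: one byte per character (a4 is the euro sign).
echo bin2hex($forDb), "\n";
```

If the DB connection is UTF-8 instead, step 3 can simply be dropped and $fileNumber stored as it is.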

If you build the XML yourself, try base64-encoding all strings, and then decode them on the client side where you read the XML.
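A rough sketch of that idea, assuming you control both the producer and the consumer (the element names and the sample value are only placeholders):

```php
<?php
// Hypothetical payload with characters that often get mangled ("çúé €").
$value = "\u{E7}\u{FA}\u{E9} \u{20AC}";

// Producer: put only ASCII into the document.
$doc = new SimpleXMLElement('<Tracking><File/></Tracking>');
$doc->File->FileNumber = base64_encode($value);
$wire = $doc->asXML();

// Consumer: parse, then decode back to the original bytes.
$parsed  = new SimpleXMLElement($wire);
$decoded = base64_decode((string) $parsed->File->FileNumber, true);

var_dump($decoded === $value);
```

This sidesteps the XML encoding question rather than answering it: the values are no longer readable in the raw XML, and both sides still have to agree on which encoding the decoded bytes are in.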

Try $xml = '<?xml version="1.0" encoding="UTF-8"?>...

Related

PHP: simplexml_load_file gets strange characters from an XML file with UTF-8 encoding

The simplexml_load_file() function doesn't parse the accent characters well. The file is UTF-8 encoded, the xml tag has encoding="UTF-8".
I'm importing an XML file encoded in UTF-8 with simplexml_load_file() function. This file has some accent characters, and when I do a print_r() or var_dump() the accent characters are converted to strange characters.
First line in XML file is
<?xml version="1.0" encoding="UTF-8"?>
In code I'm running the basic
$xFile = simplexml_load_file($xmlFile);
I'm looping through the SimpleXML Object and fetching the word with accent characters like so
$text = (string)$p->i;
Now
var_dump($text);
shows Ge├»rriteerd instead of Geïrriteerd
I've tried file_get_contents() followed by simplexml_load_string(), and
I've also tried loading the XML file with DOMDocument, but the same 'wild' characters are displayed.
Any thoughts on what else could I do?
Note: I'm working on PHP5.4, that's the PROD version and I can't change it.
The issue was the Windows console's default encoding.
I changed the encoding to UTF-8 by running chcp 65001.
@Phil's comment was helpful.
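A quick way to tell a display problem like this from a real data problem is to look at the bytes instead of the rendered text. The snippet below is just a sketch with a made-up one-element document standing in for the real file:

```php
<?php
// Hypothetical document; \u{EF} is ï, written as an escape so the literal
// is UTF-8 regardless of how this source file is saved (PHP 7+).
$xFile = simplexml_load_string(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p><i>Ge\u{EF}rriteerd</i></p>"
);
$text = (string) $xFile->i;

// Hex dump of the string: c3af is the UTF-8 encoding of ï.
echo bin2hex($text), "\n";

// With the /u modifier, preg_match() fails on malformed UTF-8,
// so a result of 1 means the string is valid UTF-8.
var_dump(preg_match('//u', $text) === 1);
```

If the hex dump contains c3af and the check passes, the string is fine and only the terminal or browser is decoding it with the wrong code page.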

Encode ’ to be XML safe

I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode the string using htmlspecialchars, but I have found that the XML request still fails, whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and your $str is UTF-8 encoded (as the byte sequences in your question show), you do not need to encode it.
This is by the book. Given the technical details of the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you are doing there, the problem lies in the parts you've left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the byte sequence of the string is written into the XML unchanged, because the document is UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, it depends on the document encoding. What this second example shows is a fallback in SimpleXML to make the output more robust, but it wouldn't actually be necessary, as UTF-8 is the default encoding.
In any case, you should not need to worry about the encoding yourself if you use a library that specializes in creating XML documents. PHP has a few for exactly that. Pick one of them.
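To illustrate what the library takes care of, here is a small sketch with a made-up value that also contains real XML markup characters. Assigning through a property, as in the examples above, lets SimpleXML escape & and <, while the quotation mark stays as plain UTF-8:

```php
<?php
// Hypothetical value: a right single quotation mark (U+2019) plus characters
// that really do need escaping in XML.
$str = "David\u{2019}s Spade & <Bucket>";

$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;   // property assignment escapes & and < for us
$out = $xml->asXML();

echo $out;

// Parsing the output again gives back the original string.
$back = new SimpleXMLElement($out);
var_dump((string) $back->element === $str);
```

So no htmlentities() or str_replace() pass is needed before handing the string to the library; doing so would only add a second layer of escaping.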

How to convert this UTF-8 escaped string from an Amazon MWS response to proper UTF-8?

In part of an Amazon MWS ListOrders XML response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The character í is code point U+00ED (\xc3\xad as UTF-8 bytes; see this chart for reference).
However, PHP's SimpleXML mangles this string, transforming it into
RamÃ­rez Jones
(you can see the same thing here because I simply pasted the raw string into the editor box; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-­rez Jones
For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as Ramírez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃ­rez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off the í).
REVISED FINAL SOLUTION:
It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because "&#xC3;&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;
It does not work because it is encoded twice. The character í has the code point U+00ED, so it should be encoded in XML as &#xED;.
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both ways suggested above fix the encoding (they actually convert the string from UTF-8 to ISO-8859-1, which incidentally fixes the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
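One possible way to approach that detection is a heuristic along the following lines. This is only a sketch: the function name is made up, it handles a single layer of "UTF-8 read as ISO-8859-1", and it can still guess wrong on unusual input.

```php
<?php
/**
 * Heuristic (an assumption of this sketch, not an official API): undo one
 * layer of "UTF-8 bytes read as ISO-8859-1 and encoded to UTF-8 again".
 */
function fix_double_encoded_utf8(string $s): string
{
    // Pure ASCII, or not valid UTF-8 at all: leave untouched.
    if (preg_match('/[^\x00-\x7F]/', $s) !== 1 || preg_match('//u', $s) !== 1) {
        return $s;
    }
    // Characters above U+00FF cannot come from single bytes read as Latin-1.
    if (preg_match('/[^\x{0}-\x{FF}]/u', $s) === 1) {
        return $s;
    }
    // Map each character back to one byte, then check whether those bytes
    // form valid UTF-8. If they do, the input was very likely double-encoded.
    $bytes = iconv('UTF-8', 'ISO-8859-1', $s);
    return ($bytes !== false && preg_match('//u', $bytes) === 1) ? $bytes : $s;
}

$mangled = "Ram\u{C3}\u{AD}rez Jones"; // what SimpleXML returns for the bad feed
$correct = "Ram\u{ED}rez Jones";       // what a fixed feed would contain

var_dump(fix_double_encoded_utf8($mangled) === $correct); // bool(true)
var_dump(fix_double_encoded_utf8($correct) === $correct); // bool(true)
```

A correctly encoded name passes through unchanged because its Latin-1 bytes are not valid UTF-8, so the same code keeps working if the feed gets fixed. Mojibake that went through Windows-1252 instead of ISO-8859-1 is not covered by this check.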
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.
SimpleXML does not decode the hex entities and then interpret the result as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.
function decode_hexentities($xml) {
return
preg_replace_callback(
'~&#x([0-9a-fA-F]+);~i',
function ($matches) { return chr(hexdec($matches[1])); },
$xml
);
}
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones

Parsing xml with PHP what to do with characters like these

I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use a str_replace and replace every odd character for the good ones, but sadly the pattern before happens only sometimes and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed ..
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
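SimpleXML uses UTF-8 for the strings it hands back. You can load an XML file declared as iso-8859-1, but if you want to print XML values in that encoding, you have to run them through utf8_decode first (deprecated since PHP 8.2; mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8') does the same job).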
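A small sketch of that behaviour, using a made-up one-element document whose bytes really are ISO-8859-1, and iconv() in place of utf8_decode():

```php
<?php
// ISO-8859-1 bytes: \xF1 is ñ in that encoding (hypothetical sample).
$latin1Xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><t>Espa\xF1a</t>";

$doc  = simplexml_load_string($latin1Xml);
$utf8 = (string) $doc;   // UTF-8 now: ñ has become the two bytes c3 b1

// Either declare UTF-8 to the browser before any output, e.g.
//   header('Content-Type: text/html; charset=UTF-8');
// or convert back if the page has to stay ISO-8859-1:
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo bin2hex($utf8), "\n";   // contains c3b1
echo bin2hex($latin1), "\n"; // contains f1
```

The Ã± style output in the question is what those two UTF-8 bytes look like when the browser decodes them as ISO-8859-1.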
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// new xml
$xml = new SimpleXMLElement('new.xml', 0, true); // third argument: treat the first one as a path, not as XML data
// Displaying XML in textual form
echo $xml->asXML();

simplexml encoding issue

I'm not really sure if this is an encoding problem or what, but I have a problem using simple xml with some of the characters in the text
$xml = <<<HOHOHO
<?xml version="1.0" encoding="iso-8859-2" standalone="yes"?>
<videos>
<video>
<ContentProvider>bl abla</ContentProvider>
<ArtistName>T-Boz</ArtistName>
<CopyrightLine>(C)2009 SME España, S.</CopyrightLine>
</video>
</videos>
HOHOHO;
$a = simplexml_load_string ($xml);
foreach ( $a->video as $new )
die($new->CopyrightLine);
The thing is that the ñ character gets all messed up and becomes something like Ăą, when it should be a ñ.
I find it strange simplexml changes this to a character anyway instead of just keeping it as it is...
I know that this has to do something with hex codes but I haven't found a solution yet
Things I've tried so far:
converting the string to iso-8859-2 with mb_convert_encoding,
converting the string to utf-8 with mb_convert_encoding,
converting with html_entity_decode,
converting with htmlspecialchars.
All of the above attempts either failed to parse the XML or just didn't fix the character.
Help would be very much appreciated!
The problem you have is not the input string, but the output string. SimpleXML uses UTF-8 internally, and if you request a string from the SimpleXMLElement, you will get the string encoded as UTF-8.
$output = (string) $new->CopyrightLine; # will always be UTF-8 encoded
So you need to do the re-encoding on the output, not the input.
Compare with this code example and its output, which is displayed as UTF-8 while the input is your input.
There is no way around this btw, because SimpleXML will always give you UTF-8 encoded strings.
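To see what "always UTF-8" means in practice, the sketch below parses the same bytes under two different declarations. The assumptions here are mine: the bytes are UTF-8 (as they would be if the PHP file holding the heredoc is saved as UTF-8), and your libxml build can decode ISO-8859-2.

```php
<?php
// UTF-8 bytes for "España" (\u{F1} is ñ); a hypothetical stand-in for the heredoc.
$body = "<videos><video><CopyrightLine>Espa\u{F1}a</CopyrightLine></video></videos>";

// Declared ISO-8859-2: the two UTF-8 bytes of ñ are read as two Latin-2 letters.
$a = simplexml_load_string('<?xml version="1.0" encoding="iso-8859-2"?>' . $body);
$mismatch = (string) $a->video->CopyrightLine;

// Declared UTF-8: declaration and bytes agree.
$b = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?>' . $body);
$match = (string) $b->video->CopyrightLine;

// Both results are UTF-8 strings; only the first one was decoded wrongly.
echo $mismatch, "\n", $match, "\n";
```

Under these assumptions the first line comes out with Ăą in place of ñ, the same symptom as in the question, so it is worth checking that the declared encoding matches the actual bytes before re-encoding anything on output.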
