I'm parsing an XML document using PHP.
When I see the result in my browser I get the following characters:
ñ instead of spanish ñ
à instead of í
á instead of á
ó instead of ó
é instead of é
I was going to use str_replace and replace every odd character with the right one, but sadly the pattern above happens only sometimes, and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8, it simply won't be printed at all.
I load the XML as a string with simplexml_load_string (it comes from the database like that).
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
SimpleXML uses UTF-8 to encode stored strings. You can use an XML file encoded as ISO-8859-1, but if you want to print XML values in that encoding, you have to pass them through utf8_decode first.
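A minimal sketch of what the answers above describe, assuming the ISO-8859-1 document is held in a string and the mbstring extension is available; the sample document and variable names are made up for illustration. utf8_decode() is deprecated as of PHP 8.2, so the sketch uses mb_convert_encoding() for the same conversion.

```php
<?php
// Hypothetical ISO-8859-1 input: "\xF1" is the single byte for ñ in that encoding.
$source = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n"
        . "<doc><name>Espa\xF1a</name></doc>";

$xml = simplexml_load_string($source);

// SimpleXML hands values back as UTF-8, whatever encoding the document declared.
$utf8 = (string) $xml->name;

// Convert back only if the page itself is served as ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // ñ shows up as the two bytes c3 b1
echo bin2hex($latin1), "\n"; // ñ shows up as the single byte f1
```

Whichever variant you print, the charset in the Content-Type header has to match it, otherwise the browser shows the odd characters from the question.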
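// strip control characters and all non-ASCII bytes (lossy: accented letters are removed too)
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// new xml, loaded from a file: the third argument tells SimpleXMLElement that the first one is a path, not XML data
$xml = new SimpleXMLElement('new.xml', 0, true);
// Displaying XML in textual form
echo $xml->asXML();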
I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode the string using htmlspecialchars, but I have found that the XML request still fails, whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and your $str is UTF-8 encoded (as the byte sequences in your question show), you do not need to encode it.
This is by the book. Given the technical information about the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you are doing there, the problem lies in the parts you have left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the byte sequence of the string was not changed when the XML was written, because the document is UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, the result depends on the document encoding. The second example shows Simplexml falling back to a more conservative output to make it more robust; strictly speaking that would not be necessary, as UTF-8 is the default encoding.
In any case, you should not need to be too concerned about the encoding yourself if you use a library that specializes in creating XML documents. PHP has a few for exactly that. Take one of them.
In part of an XML Amazon MWS ListOrders response we got an escaped UTF-8 character in one element:
<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>
The name is supposed to be Ramírez. The diacritic character í is UTF-8 character U+00ED (\xc3\xad in literal; see this chart for reference).
However, PHP's SimpleXML mangles this string, transforming it into
RamÃ­rez Jones
(which you can see because I simply pasted the string into the editor box here; evidently Stack Overflow's ASP.NET underpinnings do the same thing as PHP).
Now when this mangled string gets saved into, then pulled out of MongoDB, it then becomes
RamÃ-rez Jones
For some reason a hyphen is inserted there, although believe it or not, if you select the above bold text, then paste it back into a StackOverflow editor window, it will simply appear as RamÃrez (the hyphen mysteriously vanishes, at least on OS X 10.8.5)!
Here is some example code to show this problem:
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $elem->Name->__toString());
Here is the output from the above sample code, as run on onlinephpfunction.com's sandbox:
UTF-8
RamÃrez Jones
RamA-rez Jones
How can we avoid this problem? It's really screwing things up.
EDIT:
Let me add that while the name in the XML is supposed to be Ramírez Jones, I need to transliterate it to Ramirez Jones (strip the diacritic mark off of the í).
REVISED FINAL SOLUTION:
It's different than the correct answer below but this was the most elegant solution that I found. Just replace the last line of the example with this:
echo iconv('UTF-8','ASCII//TRANSLIT', html_entity_decode($xml));
This works because "&#xC3;&#xAD;" are HTML entities.
ALTERNATE SOLUTION
Strangely, this also works:
$xml = '<?xml version="1.0"?><Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>';
$xml= str_replace('<?xml version="1.0"?>', '<?xml version="1.0" encoding="ISO-8859-1"?>' , $xml);
$domdoc = new DOMDocument();
$domdoc->loadXML($xml);
$xml = iconv('UTF-8','ASCII//TRANSLIT',$domdoc->saveXML());
$elem = new SimpleXMLElement($xml);
echo $elem->Name;
It does not work because it is encoded twice. The character í has the code U+00ED and it should be encoded in XML as &#xED;.
You can fix its encoding using either:
$name = iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $elem->Name->__toString());
or
$name = mb_convert_encoding($elem->Name->__toString(), 'ISO-8859-1', 'UTF-8');
UPDATE:
Both ways suggested above work to fix the encoding (they actually convert the encoding of the string from UTF-8 to ISO-8859-1, which incidentally fixes the issue at hand).
The solution provided by @Hazzit also works.
The real challenge for both solutions (and for your code) is to detect if the received data is encoded in a wrong way and apply these fixes only in that situation, letting the code work correctly when Amazon fixes the encoding issue. I hope they will do it.
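One way to approach that detection, sketched under assumptions: fix_double_utf8() is a hypothetical helper, not something from the question or from PHP itself, and it relies on the mbstring extension. It undoes one UTF-8 to ISO-8859-1 step only when the result is still valid UTF-8, which is what a doubly encoded string looks like. Treat it as a heuristic: a correctly encoded string that really contains a sequence such as "Ã©" would be changed too.

```php
<?php
// Hypothetical helper: undo one layer of UTF-8 double encoding when it is detectable.
function fix_double_utf8(string $s): string
{
    // Pure ASCII cannot be double encoded; nothing to do.
    if (!preg_match('/[\x80-\xFF]/', $s)) {
        return $s;
    }
    // Input that is not valid UTF-8 is left for the caller to deal with.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // A doubly encoded string only contains code points up to U+00FF.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s)) {
        return $s;
    }
    // Undo one layer and keep the result only if it is valid UTF-8 again.
    $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}

$double  = "Ram\xC3\x83\xC2\xADrez Jones"; // "Ramírez" encoded as UTF-8 twice
$correct = "Ram\xC3\xADrez Jones";         // "Ramírez" encoded as UTF-8 once

var_dump(fix_double_utf8($double) === $correct);  // the extra layer is removed
var_dump(fix_double_utf8($correct) === $correct); // correct input is left alone
```

Applied to $elem->Name->__toString(), this leaves the value alone once the feed starts sending correctly encoded names.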
Stripping the accents with minimum loss of information
After you fix the encoding, in order to replace the accented letters with similar letters from the ASCII subset you must use iconv() (because only iconv() can help), as you already did in the sample code.
$nameAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $name);
An explanation of the second parameter can be found in the documentation page of iconv(); please also read the user comments.
SimpleXML does not decode the hex entities and then interpret the result as UTF-8, because that's not how XML or UTF-8 actually works. Nevertheless, if Amazon produces such nonsense, you need to correct that error before parsing it as XML.
function decode_hexentities($xml) {
    return preg_replace_callback(
        '~&#x([0-9a-fA-F]+);~i',
        function ($matches) { return chr(hexdec($matches[1])); },
        $xml
    );
}
$xml = "<Address><Name>Ram&#xC3;&#xAD;rez Jones</Name></Address>";
$xml = decode_hexentities($xml);
$elem = new SimpleXMLElement($xml);
$bad_string = $elem->Name;
echo mb_detect_encoding($bad_string)."\n";
echo $elem->Name->__toString()."\n";
echo iconv('UTF-8', 'ASCII//TRANSLIT', $elem->Name->__toString());
results in:
UTF-8
Ramírez Jones
Ramirez Jones
I work for an international company and thus we have loads of languages to cater for.
I'm having a problem with some special characters.
I created a standalone test PHP page to eliminate any other issues that could be introduced by my system.
From various pages I read through, I found that SimpleXML processes XML as UTF-8.
E.g.: PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes
So I did just that at the top of the page:
header("Content-type:text/html; charset=UTF-8");
Then I did this to check:
print mb_internal_encoding();
Not sure if this is the right function, but it gave me ISO-8859-1 in FF and Chrome.
XML looks like this:
$xml = '<?xml version="1.0" encoding="ISO-8859-15"?>
<Tracking>
<File>
<FileNumber>çúé$`~ € Š š Ž ž Œ œ Ÿ</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>';
This prints out all funny, but for the page I need, I'm not too concerned about how it prints in the browser: the actual page will run from a cron job to import the XML into a MySQL DB, so display is not too important. It displays in FF like this, though:
print $xml;
���$`~ � � � � � � � � � 124
Then I create the SimpleXML object:
$parser = new SimpleXMLElement($xml);
print_r($parser);
This prints out :
[File] => SimpleXMLElement Object
(
[FileNumber] => çúé$`~
[OrigBranch] => 124
[Login] => SimpleXMLElement Object
(
)
)
I'm not too worried about the funny characters in the print $xml; output; what I need to fix are the characters in the SimpleXMLElement object that is being inserted into the DB.
Why is the SimpleXMLElement object losing the characters after the '~'? I tried changing the charset to ISO-8859-15 in the header() call, but this only led to the print $xml; output looking slightly better (still missing the characters after '~'), and SimpleXMLElement then gives a fatal error:
String could not be parsed as XML
I tried before parsing XML :
$xml = mb_convert_encoding($xml, "ISO-8859-15");
$xml = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xml);
But these did not help either.
Any suggestions?
I created a specific file in latin1(ISO-8859-1) named latin1.xml with this content (you can add encoding="UTF-8" in the xml tag, it's the same):
<?xml version="1.0"?>
<Tracking>
<File>
<FileNumber>çùé$ °à §çòò àù§</FileNumber>
<OrigBranch>124</OrigBranch>
<Login></Login>
</File>
</Tracking>
Then I loaded the content in the PHP file and converted it from ISO-8859-1 to UTF-8 before parsing it with SimpleXMLElement.
I echoed the content of the XML before and after the conversion:
<?php
$xml = file_get_contents('latin1.xml');
echo '<pre>'.$xml.'</pre>'."<br>";
$xml2 = iconv("ISO-8859-1","UTF-8",$xml);
echo '<pre>'.$xml2.'</pre>'."<br>";
$parser = new SimpleXMLElement($xml2);
echo '<pre>'.print_r($parser, true).'</pre>'."<br>"; // true makes print_r return the text instead of printing it
Now, when you load the script with the browser set to UTF-8 encoding, the first echo is (as expected) displayed incorrectly, while the second echo and the print_r($parser) output are fine. If the browser is set to ISO-8859-1 instead, the first echo looks right but the second echo and the print_r output do not.
You can adjust to fit your needs.
UPDATE
ISO/IEC 8859-1 is missing some characters for French and Finnish text, as well as the euro sign.
If I understand your comments correctly, you can have the source file (XML) in ISO-8859-15; that way the euro sign can be used correctly.
I made a new file, named iso8859-15.xml, and put your new test characters there (with the euro sign too). In the PHP file I changed the first instruction:
//$xml = file_get_contents('latin1.xml');
$xml = file_get_contents('iso8859-15.xml');
and, later, the conversion in:
$xml2 = iconv("ISO-8859-15","UTF-8",$xml);
Now, when you load the script with the browser set to UTF-8 encoding, the first echo is (as expected) displayed incorrectly, while the second echo and the print_r($parser) output of SimpleXML are fine.
So, now that your XML is parsed correctly (in UTF-8), you can convert the values before writing them to the DB (which is in ISO-8859-15 encoding, if I understood correctly).
To be more clear you can add this line, at the end, to the php script above:
echo '<pre> File number in ISO-8859-15 for db: '.iconv("UTF-8","ISO-8859-15",$parser->File->FileNumber).'</pre>'."<br>";
As you can see, I converted the UTF-8 data from SimpleXML to ISO-8859-15, as you should do when you write to the DB.
That worked for me.
Hope it helps
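The round trip described above can be sketched in one piece. The sample document is made up, and the preg_replace step is an addition that the answer does not include: it assumes the document still carries its original encoding="ISO-8859-15" declaration (as in the question), which would no longer match the bytes after the conversion.

```php
<?php
// Hypothetical ISO-8859-15 input: "\xA4" is the euro sign in that encoding.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>\n"
     . "<Tracking><File><FileNumber>\xE7\xFA\xE9 \xA4</FileNumber></File></Tracking>";

// 1. Convert the bytes to UTF-8.
$utf8 = iconv('ISO-8859-15', 'UTF-8', $raw);

// 2. Make the XML declaration agree with the converted bytes.
$utf8 = preg_replace('/^(<\?xml[^>]*encoding=)(["\'])[^"\']*\2/', '$1$2UTF-8$2', $utf8, 1);

// 3. Parse; values read from SimpleXML are UTF-8.
$parser = new SimpleXMLElement($utf8);
$fileNumber = (string) $parser->File->FileNumber;

// 4. Convert back only at the point where the value is written to an ISO-8859-15 database.
$forDb = iconv('UTF-8', 'ISO-8859-15', $fileNumber);

echo bin2hex($fileNumber), "\n"; // UTF-8 bytes
echo bin2hex($forDb), "\n";      // ISO-8859-15 bytes
```

If the database connection is UTF-8 as well, step 4 can simply be dropped.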
If you build the XML yourself, try base64-encoding all strings, and then decode them again on the client side where you read the XML.
Try $xml = '<?xml version="1.0" encoding="UTF-8"?>...
I'm not really sure if this is an encoding problem or what, but I have a problem using simple xml with some of the characters in the text
$xml = <<<HOHOHO
<?xml version="1.0" encoding="iso-8859-2" standalone="yes"?>
<videos>
<video>
<ContentProvider>bl abla</ContentProvider>
<ArtistName>T-Boz</ArtistName>
<CopyrightLine>(C)2009 SME España, S.</CopyrightLine>
</video>
</videos>
HOHOHO;
$a = simplexml_load_string ($xml);
foreach ( $a->video as $new )
die($new->CopyrightLine);
The thing is that the ñ character gets all messed up and becomes something like Ăą, when it should be a ñ.
I find it strange simplexml changes this to a character anyway instead of just keeping it as it is...
I know that this has to do something with hex codes but I haven't found a solution yet
Things I've tried so far:
converting the string to iso-8859-2 with mb_convert_encoding,
converting the string to utf-8 with mb_convert_encoding,
converting with html_entity_decode,
converting with htmlspecialchars
All of the above attempts either failed to parse the XML or just didn't fix the character.
Help would be very much appreciated!
The problem you have is not the input string, but the output string. SimpleXML uses UTF-8 internally, and if you request a string from the SimpleXMLElement, you will get the string encoded as UTF-8.
$output = (string) $new->CopyrightLine; # will always be UTF-8 encoded
So you need to do the re-encoding on the output, not the input.
Compare with this code example and output, that is displayed as UTF-8 while the input is your input.
There is no way around this btw, because SimpleXML will always give you UTF-8 encoded strings.
SimpleXML will convert all text into UTF-8, if the source XML declaration has another encoding. So, all the text in the resulting SimpleXMLElement will be in UTF-8 automatically.
In my case the source has the following XML decl:
<?xml version="1.0" encoding="windows-1251" ?>
What should I do to get normal output? Because, as you can imagine, for now I get strange symbols.
Thanks.
Maybe a stupid answer, but just don't use SimpleXML. Just use DOM.
Try using iconv to convert the encoding.
Using the iconv() function you can convert from one encoding to another; the TRANSLIT option might work.
<?php
// $xml is the string containing your XML file data
// convert string from utf-8 to iso8859-1
//$xml = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $xml);
// replace YOUR_ENCODING with the encoding the data actually arrives in
$xml = iconv("YOUR_ENCODING", "UTF-8//TRANSLIT", $xml);
?>
My advice is to use UTF-8 as the encoding for the source .php files and (if possible) as the output encoding too. With gzip compression the difference in size between windows-1251 and UTF-8 responses (even for mostly Cyrillic text) is minimal, and UTF-8 is better in many ways.
As you said, SimpleXML will convert windows-1251 to UTF-8 on XML import, and then you don't have to worry about any encodings.
If you have to use windows-1251 for output then use something like:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "windows-1251");
ob_start("ob_iconv_handler");
One catch with UTF-8 in PHP source files is character classes in regexps: /[ю]/ won't work as you might expect, while /(ю)/ will.
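A small sketch of why the character class misbehaves, with the UTF-8 bytes written out explicitly so it does not depend on how the file is saved. Without the u modifier, PCRE treats the pattern as single bytes, so a class containing "ю" is really a class of its two bytes; adding u is an alternative to the /(ю)/ workaround.

```php
<?php
$yu = "\xD1\x8E"; // "ю" (U+044E) as UTF-8
$ya = "\xD1\x8F"; // "я" (U+044F) as UTF-8, same first byte as "ю"

// Without /u the class holds the bytes D1 and 8E, so the D1 in "я" matches.
$byteMatch = preg_match("/[$yu]/", $ya);

// With /u the class holds the code point U+044E, so "я" does not match.
$charMatch = preg_match("/[$yu]/u", $ya);

var_dump($byteMatch); // int(1): false positive
var_dump($charMatch); // int(0)
```

/(ю)/ works without the modifier because the pattern is then just the two bytes in sequence, which only the real character produces.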