I'm not really sure if this is an encoding problem or what, but I have a problem using SimpleXML with some of the characters in the text:
$xml = <<<HOHOHO
<?xml version="1.0" encoding="iso-8859-2" standalone="yes"?>
<videos>
<video>
<ContentProvider>bl abla</ContentProvider>
<ArtistName>T-Boz</ArtistName>
<CopyrightLine>(C)2009 SME España, S.</CopyrightLine>
</video>
</videos>
HOHOHO;
$a = simplexml_load_string ($xml);
foreach ( $a->video as $new )
die($new->CopyrightLine);
The thing is that the ñ character gets all messed up and becomes something like Ăą, when it should be a ñ.
I find it strange simplexml changes this to a character anyway instead of just keeping it as it is...
I know that this has to do something with hex codes but I haven't found a solution yet
Things I've tried so far:
converting the string to iso-8859-2 with mb_convert_encoding,
converting the string to utf-8 with mb_convert_encoding,
converting with html_entity_decode,
converting with htmlspecialchars
all of above attempts either failed to parse xml or just didn't fix the character
Help would be very appreciated!
The problem you have is not the input string, but the output string. SimpleXML uses UTF-8 internally, and if you request a string from the SimpleXMLElement, you will get the string encoded as UTF-8.
$output = (string) $new->CopyrightLine; # will always be UTF-8 encoded
So you need to do the re-encoding on the output, not the input.
Compare with this code example and output, that is displayed as UTF-8 while the input is your input.
There is no way around this btw, because SimpleXML will always give you UTF-8 encoded strings.
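As a minimal sketch of re-encoding on output (assuming the page is served as ISO-8859-2 and PHP's iconv extension is available; the sample value "Café" and the variable names are made up for illustration, not taken from the question):

```php
<?php
// Sample input in ISO-8859-2: the byte "\xE9" is "é" in that charset.
// (Illustrative data, not the document from the question.)
$xml = "<?xml version=\"1.0\" encoding=\"iso-8859-2\"?>"
     . "<videos><video><CopyrightLine>Caf\xE9</CopyrightLine></video></videos>";

$doc = simplexml_load_string($xml);

// SimpleXML hands the value back as UTF-8, whatever the declaration said.
$utf8 = (string) $doc->video->CopyrightLine;

// Re-encode only at the output boundary, here for an ISO-8859-2 page.
$latin2 = iconv('UTF-8', 'ISO-8859-2', $utf8);

echo bin2hex($utf8), "\n";   // 436166c3a9
echo bin2hex($latin2), "\n"; // 436166e9
```

iconv() returns false when a character has no equivalent in the target charset, so it is worth checking the return value before printing.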
The simplexml_load_file() function doesn't parse the accent characters well. The file is UTF-8 encoded, the xml tag has encoding="UTF-8".
I'm importing an XML file encoded in UTF-8 with simplexml_load_file() function. This file has some accent characters, and when I do a print_r() or var_dump() the accent characters are converted to strange characters.
First line in XML file is
<?xml version="1.0" encoding="UTF-8"?>
In code I'm running the basic
$xFile = simplexml_load_file($xmlFile)
I'm looping through the SimpleXML Object and fetching the word with accent characters like so
$text = (string)$p->i
Now
var_dump($text);
shows Ge├»rriteerd instead of Geïrriteerd
I've tried file_get_contents() followed by simplexml_load_string(), and I've also tried loading the XML file with DOMDocument, but the same 'wild' characters are displayed.
Any thoughts on what else could I do?
Note: I'm working on PHP5.4, that's the PROD version and I can't change it.
The issue was a windows console default encoding.
I've changed the encoding to UTF-8 by running chcp 65001.
@Phil's comment was helpful.
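One way to tell a display problem (such as the console code page) from a data problem is to look at the bytes rather than the rendered characters. A small sketch, assuming nothing beyond core PHP; the variable names are illustrative:

```php
<?php
// "Geïrriteerd" written as explicit UTF-8 bytes, so the encoding of this file does not matter.
$text = "Ge\xC3\xAFrriteerd";

// preg_match() with the /u modifier fails on invalid UTF-8, so it doubles as a validity check.
$isValidUtf8 = preg_match('//u', $text) === 1;

echo $isValidUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
echo bin2hex($text), "\n"; // 4765c3af7272697465657264
```

If the hex dump shows c3af for the ï, the string is correct UTF-8 and only the terminal or browser is decoding it with the wrong charset.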
I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode string using htmlspecialchars, but I have found that XML request still fails whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
I am sending the string via XML and need to encode it.
No, you don't. If the XML is UTF-8 encoded (it is by default) and as your $str is UTF-8 encoded (as you show by the binary sequences in your question), you do not need to encode it.
This is by the book. Given the technical details of the data you are working with, this is clear and fine.
You then write that some things work and others don't. Whatever you do there, the problem lies within the things you've left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string, for example to use it with an XML library like Simplexml to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the byte sequence of the string is written into the XML unchanged here, because the document is UTF-8.
Let's take some ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, it depends on the document encoding then. This second example is a fall-back of Simplexml to make the output more robust, but actually this wouldn't be necessary as UTF-8 would be the default encoding.
In any case, you should not need to worry much about the encoding yourself if you use a library that specializes in creating XML documents. PHP has a few for exactly that. Take one of them.
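For comparison, here is a rough sketch of the same idea with DOMDocument, another of PHP's XML libraries. The "& Co" suffix is added here only to show that markup characters are escaped by the library; it is not part of the original string.

```php
<?php
// UTF-8 bytes for "David’s Spade", plus "& Co" (illustrative) to show escaping.
$str = "David\xE2\x80\x99s Spade & Co";

$dom     = new DOMDocument('1.0', 'UTF-8');
$doc     = $dom->appendChild($dom->createElement('doc'));
$element = $doc->appendChild($dom->createElement('element'));

// A text node is escaped by the serializer; no htmlspecialchars()/htmlentities() needed.
$element->appendChild($dom->createTextNode($str));

$out = $dom->saveXML();
echo $out;

// Round trip: parsing the result gives back the original string.
$back = (string) simplexml_load_string($out)->element;
```

The right single quotation mark stays as its UTF-8 bytes because the document is declared as UTF-8, and only the ampersand is turned into an XML entity.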
I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use a str_replace and replace every odd character for the good ones, but sadly the pattern before happens only sometimes and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8 it simply won't be printed ..
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
SimpleXML uses UTF-8 to encode stored strings. You can use an XML-File with iso-8859-1, but if you want to print XML values with this encoding, you have to use utf8_decode before.
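A sketch of both directions, assuming an ISO-8859-1 input document like the one in the question (the sample data and variable names are illustrative). iconv() is used for the conversion because utf8_decode() is deprecated as of PHP 8.2.

```php
<?php
// Illustrative ISO-8859-1 input: the byte "\xF1" is "ñ" in that charset.
$xml  = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><doc><name>Espa\xF1a</name></doc>";
$doc  = simplexml_load_string($xml);
$name = (string) $doc->name; // UTF-8 now: "Espa\xC3\xB1a"

// Option 1: declare the page as UTF-8 and print the value as SimpleXML returned it.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
$html = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// Option 2: keep an ISO-8859-1 page and convert the value back before printing.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);

echo $html, "\n";
```

Whichever option is used, the charset announced to the browser has to match the bytes that are actually printed.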
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// load the XML from a file (third argument true: the first argument is a path, not XML data)
$xml = new SimpleXMLElement('new.xml', 0, true);
// Displaying XML in textual form
echo $xml->asXML();
We are getting base64-encoded XML data from a third party. If the XML data is in English, everything works fine: I am able to base64-decode it and parse the XML. If the XML is all lower-case French characters, everything works fine too. But if the XML data contains upper-case French characters (like À), the parser fails after the base64 decode. Any suggestions on how to fix this problem?
Thanks.
Base64 is a method to encode 8-bit binary data using 7-bit US-ASCII characters. After the Base64 decode you should have a standard XML file.
Probably this XML file contains illegal characters, or does not correctly specify the character encoding it uses.
You mention &Agrave;, an HTML-specific (not XML) representation of À. If the XML contains the HTML-encoded string &Agrave;, there should also be a reference in the XML to an entity table specifying how to decode that string.
Alternatively, if your XML contains the À character directly, encoded using (for example) the ISO-8859-1 character set, either your XML should specify this encoding (<?xml version="1.0" encoding="ISO-8859-1"?>), or you should specify it yourself when decoding it.
Failing that, the parser may assume (e.g.) UTF-8 encoding is used, and will fail when trying to decode the À.
The exact error message should tell you what the problem is.
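To illustrate the entity case, here is a small sketch (the sample strings are made up) that parses an HTML named entity and the equivalent numeric character reference, and prints whatever message the parser reports:

```php
<?php
libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings

// &Agrave; is an HTML named entity; XML does not know it unless a DTD declares it.
$named = simplexml_load_string('<doc>&Agrave; bient&ocirc;t</doc>');
var_dump($named); // bool(false)

foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n"; // the parser's own explanation
}
libxml_clear_errors();

// A numeric character reference is plain XML and parses without a DTD.
$numeric = simplexml_load_string('<doc>&#xC0; bient&#xF4;t</doc>');
echo bin2hex((string) $numeric), "\n"; // c380206269656e74c3b474
```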
[update: À directly]:
It sounds like the XML is invalid then: they declare UTF-8 but are actually using a different encoding. Check the XML bytes (after the base64 decode) for this; if the À is encoded as one byte, it is definitely not UTF-8.
[update: how to fix?] If they incorrectly specify it in the XML header, they should really replace the false header (<?xml version="1.0" encoding="UTF-8"?>) with the correct one (<?xml version="1.0" encoding="windows-1252"?>).
If they don't specify anything, it looks like the iconv function may be your best bet. I haven't really needed it, so I'm not 100 % sure about this, but looks like you could use: $data = iconv("ISO-8859-1", "UTF-8", $data) after the base64_decode and before the simplexml_load_string. I don't know of a way to specify the encoding directly while decoding the XML.
I'm not really experienced with the PHP specifics of character encoding, so I'm not giving any guarantees...
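The steps described above could look roughly like this. It is only a sketch under stated assumptions: the payload is built inline so the example is self-contained, it has no encoding declaration, and the sender is assumed to have used ISO-8859-1 (the real encoding would have to be confirmed with the third party).

```php
<?php
// Hypothetical payload: XML without an encoding declaration whose text is ISO-8859-1
// ("\xC0" is "À", "\xF4" is "ô"). Built here only to keep the sketch self-contained.
$payload = base64_encode("<?xml version=\"1.0\"?><doc>\xC0 bient\xF4t</doc>");

$data = base64_decode($payload, true);
if ($data === false) {
    throw new RuntimeException('Payload is not valid base64');
}

// preg_match() with /u fails on byte sequences that are not valid UTF-8.
if (preg_match('//u', $data) !== 1) {
    // Assumption: the sender really used ISO-8859-1.
    $data = iconv('ISO-8859-1', 'UTF-8', $data);
}

$doc = simplexml_load_string($data);
echo bin2hex((string) $doc), "\n"; // c380206269656e74c3b474
```

A single byte 0xC0 followed by a space cannot occur in valid UTF-8, which is why the validity check catches this payload before the parser does.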
What's the XML character encoding? Maybe it's not UTF-8 and your parser is trying to parse the XML string as UTF-8.
SimpleXML will convert all text into UTF-8, if the source XML declaration has another encoding. So, all the text in the resulting SimpleXMLElement will be in UTF-8 automatically.
In my case the source has the following XML decl:
<?xml version="1.0" encoding="windows-1251" ?>
What should I do to get normal output? Because, as you can imagine, for now I get strange symbols.
Thanks.
Maybe a stupid answer, but just don't use SimpleXML. Just use DOM.
Try using the iconv to convert the encoding.
Using the iconv() function you can convert from one encoding to another; the TRANSLIT option might work.
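<?php
$xml = '{STRING CONTAINING YOUR XML FILE DATA}'; // placeholder: put your XML data here
// convert string from utf-8 to iso8859-1
//$xml = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $xml);
// convert from the source encoding (replace YOUR_ENCODING) to UTF-8
$xml = iconv("YOUR_ENCODING", "UTF-8//TRANSLIT", $xml);
?>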
My advice is to use UTF-8 as the encoding of your .php source files and (if possible) as the output encoding too. With gzip compression, the size difference between windows-1251 and UTF-8 responses (even for mostly Cyrillic text) is minimal, and UTF-8 is better in many ways.
As you said, simplexml will convert windows-1251 to UTF-8 on xml import and then you don't have to worry about any encodings.
If you have to use windows-1251 for output then use something like:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "windows-1251");
ob_start("ob_iconv_handler");
One catch with UTF-8 in PHP source files is character classes in regexps: /[ю]/ won't work as you might expect, while /(ю)/ will.
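The character class misbehaves because, by default, PCRE reads the pattern byte by byte, so [ю] becomes a set of two separate bytes. The /u modifier (not mentioned above, added here as a suggestion) makes PCRE treat pattern and subject as UTF-8. A short sketch with the characters written as explicit bytes:

```php
<?php
$yu = "\xD1\x8E"; // "ю" as UTF-8 bytes
$ya = "\xD1\x8F"; // "я" as UTF-8 bytes; shares its first byte with "ю"

// Without /u the class [ю] is a set of two single bytes, so "я" matches by accident.
$bytewise = preg_match('/[' . $yu . ']/', $ya);

// With /u the pattern and subject are treated as UTF-8 and [ю] is one character.
$unicode = preg_match('/[' . $yu . ']/u', $ya);

var_dump($bytewise); // int(1)
var_dump($unicode);  // int(0)
var_dump(preg_match('/[' . $yu . ']/u', 'a' . $yu)); // int(1)
```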