XML Validation error : EntityRef: expecting ';' - php

I am using PHP's SimpleXML to process an XML file, and get this error:
Message: simplexml_load_string(): Entity: line 9: parser error : EntityRef: expecting ';'
A quick Google search reveals that this is generally caused by an un-escaped & - there's a dozen questions with that answer here on Stack Overflow. However, here's line 9 of the file:
<p>In-kingdom commentary on the following items can be found on the November LoP. https://oscar.sca.org/kingdom/kingloi.php?kingdom=9&amp;loi=4191</p>
As you can see, the & is escaped. A text search on the file reveals no other instances of &.
What am I missing?
Please note: I have no ability to edit the XML file - I must take it as it comes and only fix things in my code. I currently open the XML with the following code:
$rawstring = file_get_contents($filename);
$safestring = html_entity_decode($rawstring, 0, 'ISO-8859-1');
$xmlstring = simplexml_load_string($safestring);
(the html_entity_decode is necessary as the file uses Latin-1 encoding and simplexml expects UTF-8)
Help appreciated.

html_entity_decode() is not intended for what you appear to think it is, and it is exactly what is causing your problem. As the name suggests, it decodes HTML entities such as &amp; into the characters they represent; in this case, &amp; becomes a bare &, which is no longer valid XML.
If you want to convert the character encoding of the original $rawstring to ISO-8859-1 or UTF-8 you should use something like iconv() or mb_convert_encoding().
Here's an example that should work:
$rawstring = file_get_contents($filename);
$safestring = mb_convert_encoding($rawstring, 'ISO-8859-1' /*, $optionalOriginalEncoding */);
$xmlstring = simplexml_load_string($safestring);
See the list of supported encodings, as well.
However, since the original $rawstring is Latin-1, conversion to ISO-8859-1 is pointless, since Latin-1 is ISO-8859-1. You may need to convert to UTF-8, but I'm fairly certain that that's not even necessary either.
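To make the failure concrete, here is a minimal sketch built on a hypothetical two-line Latin-1 document (the sample text and the example.org URL are assumptions, not the asker's file). It shows that html_entity_decode() is what breaks the escaped ampersand, and that the parser can read the raw bytes when the document declares its encoding:

```php
<?php
// Hypothetical Latin-1 input; \xE9 is "é" in ISO-8859-1.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
     . "<p>caf\xE9 https://example.org/?kingdom=9&amp;loi=4191</p>";

libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings

// html_entity_decode() turns &amp; back into a bare &, so the XML is no longer well-formed.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'ISO-8859-1');
$broken  = simplexml_load_string($decoded);
var_dump($broken); // bool(false)
libxml_clear_errors();

// Passing the raw bytes works: the parser honours the declared encoding
// and hands the text back as UTF-8.
$xml = simplexml_load_string($raw);
echo (string) $xml, "\n";
```

If the file carries no encoding declaration, converting it first with iconv('ISO-8859-1', 'UTF-8', $raw) is probably the simplest option.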

Related

Removing invisible characters from UTF-8 XML data

I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it, to only allow visible characters, trim, chop, utf-8 encode/decode, nothing is touching it. It's like it is laughing in my face when I echo out a value and see this:
string(17) "72"
Opened the data in Notepad++ with show all characters on, and it simply shows it as spaces. I am at a loss of where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74
I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
Try running the data through utf8_encode(). It might seem like a hack, but it looks like the originating data isn't properly set up.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
URL-encoding after the UTF-8 encoding turned the spaces into +'s. I simply removed all instances of ++ and decoded the result back. Works great.
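A less invasive alternative might look like the sketch below. It assumes the feed is really ISO-8859-1 (the bytes 0xB0 0x43 in the error read as "°C" in that encoding) and that the extra characters are ordinary padding spaces; the element names are made up for illustration:

```php
<?php
// Hypothetical feed fragment: a value padded with spaces and a Latin-1 degree sign (\xB0).
$raw = "<?xml version=\"1.0\"?>\n"
     . "<row><temp>72               </temp><unit>\xB0C</unit></row>";

// Convert the whole document to UTF-8 before parsing.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
$xml  = simplexml_load_string($utf8);

// Trim each value when reading it, instead of rewriting the whole document.
$temp = trim((string) $xml->temp);
$unit = (string) $xml->unit;

var_dump($temp);
```

Unlike stripping every "++" from the URL-encoded text, this leaves double spaces inside real text untouched.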

SimpleXML parse external php feed

I want to parse an external php feed.
The address: http://www.hittadjur.se/feed.php?count=1
The output:
<?xml version="1.0"?>
<annons>
<rubrik>Wilja</rubrik>
<datum>2013-03-22</datum>
<ras>Chihuahua långhår</ras>
<ort>Göteborg</ort><bildurl>http://www.hittadjur.se/images/uploaded/thumbs/1363984467.jpg</bildurl><addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
</annons>
My PHP code that doesn't work:
$content = utf8_encode(file_get_contents('http://www.hittadjur.se/feed.php?count=1'));
$xml = simplexml_load_file($content);
echo $xml->annons->rubrik;
The reason I use the utf8_encode is that I receive this message if I don't:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE5 0x6E 0x67 0x68
The error now is:
Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity
Any ideas?
Thanks!
If you are trying to load XML files held on your own server, try passing the full directory path:
simplexml_load_file($_SERVER['DOCUMENT_ROOT'].'/example.xml')
If you want to access the XML over HTTP, allow_url_fopen must be On. Set it in php.ini; calling ini_set() at runtime will not work, because allow_url_fopen is a system-level (PHP_INI_SYSTEM) setting:
allow_url_fopen = On
Alternatively, fetch the document yourself and parse the resulting string:
$temp = file_get_contents($url);
$XmlObj = simplexml_load_string($temp);
As Alvaro Vicario wrote, the problem is the parameter separators (&) in your URLs. In XML, an ampersand is an entity marker (the start of a named symbol or of a numerical representation of a character code point) and must be escaped.
Either replace & with &amp; in your URLs or mark the URLs as literal text (a CDATA section in XML speak): <![CDATA[http://...]]>.
E.g.: <addurl><![CDATA[http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558]]></addurl>
If you are uncomfortable with the explicit UTF-8 conversion in your code and you know the character encoding of your data source, you may extend the XML prologue instead (ISO-8859-1 contains the offending å/0xE5 of your XML):
<?xml version="1.0" encoding="iso-8859-1"?>
I'm afraid that the feed provides malformed XML. Apart from the encoding mess:
<addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
^
\_ Data not properly escaped
I may be wrong but I don't think you can parse it using regular XML functions because they're designed for valid XML (that's the whole purpose of using XML in the first place).
Perhaps you can try with DOMDocument. It's designed for HTML so it can deal with invalid input but it can also do XML.
Edit: Here's a trick to fix invalid XML but, honestly, I'm not sure it's worth the effort.
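One possible pre-processing step, sketched under assumptions: escapeBareAmpersands() is a hypothetical helper (not part of PHP), the sample is a shortened copy of the feed, and the feed is assumed to be ISO-8859-1. The helper escapes only ampersands that do not already start an entity or character reference, so it will miss odd cases such as a URL parameter that happens to end in a semicolon.

```php
<?php
// Hypothetical helper: escape every & that is not already the start of
// a named entity (&name;) or a numeric character reference (&#123; / &#x7B;).
function escapeBareAmpersands(string $xml): string
{
    return preg_replace('/&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)/', '&amp;', $xml);
}

// Shortened sample of the feed; \xE5 is "å" in ISO-8859-1.
$raw = "<?xml version=\"1.0\"?>\n<annons>\n<rubrik>Wilja</rubrik>\n"
     . "<ras>Chihuahua l\xE5ngh\xE5r</ras>\n"
     . "<addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32</addurl>\n"
     . "</annons>";

$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);                 // fix the encoding first
$xml  = simplexml_load_string(escapeBareAmpersands($utf8)); // a string, so not simplexml_load_file()

// <annons> is the root element, so its children are read directly from $xml.
echo $xml->rubrik, "\n";
echo $xml->addurl, "\n";
```

Note that the question's code also passed the downloaded content to simplexml_load_file(), which expects a filename or URL; that mismatch is what produces the "failed to load external entity" warning.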

Zend_Config_XML encoding issue

I am creating a XML navigation for my website. This line below is causing a simpleXML issue:
<label>Osnabr&Atilde;&frac14;ck</label>
My PHP code, using htmlentities(), has changed Osnabrück into Osnabr&Atilde;&frac14;ck. However, when trying to parse my XML with this line in it, I get this error:
/application/configs/navigation.xml:318: parser error : Entity 'Atilde' not defined simplexml_load_file()
Should I not be using htmlentities()? Or is there some kind of setting I'm missing?
Kind Regards
Steve
You should not be using HTML Entities in XML. Using normal UTF-8 characters should be fine.
The occurrence of Osnabr&Atilde;&frac14;ck means that at some point, most likely, the city name is processed as ISO-8859-1 instead of UTF-8. It is not htmlentities()'s fault. You need to find that point and fix it.
You can use the iconv() function to convert to UTF-8 dynamically.
iconv("ISO-8859-1", "UTF-8", $text);
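The sketch below reproduces the symptom and shows one way to avoid it. It assumes the label is already UTF-8 when it reaches the code; the variable names are illustrative. Escaping with htmlspecialchars() and ENT_XML1 touches only the characters XML itself reserves, so no HTML-only entity names end up in the file:

```php
<?php
$city = "Osnabr\xC3\xBCck"; // "Osnabrück" as UTF-8 bytes

// Treating UTF-8 bytes as ISO-8859-1 yields the entities from the error message.
$wrong = htmlentities($city, ENT_QUOTES, 'ISO-8859-1');
echo $wrong, "\n"; // Osnabr&Atilde;&frac14;ck

// For XML, escape only &, <, > and quotes, and leave the UTF-8 characters alone.
$right = htmlspecialchars($city, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$xml   = simplexml_load_string("<label>$right</label>");
echo (string) $xml, "\n";
```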

Error: "Input is not proper UTF-8, indicate encoding !" using PHP's simplexml_load_string

I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
When trying to process an XML response using simplexml_load_string from a 3rd party source. The raw XML response does declare the content type:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them is utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they'll fix it for you. (They won't, but you can at least drop the invalid characters so that the XML loads.)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
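To avoid double-encoding input that is already correct, a guard along these lines may help. toUtf8() is a hypothetical helper, and the ISO-8859-1 fallback is an assumption based on the bytes in the question; it treats the document as either entirely valid UTF-8 or entirely Latin-1, so it does not solve the mixed case described above.

```php
<?php
// Hypothetical helper: convert only when the input is not already valid UTF-8.
function toUtf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    // preg_match() with the /u modifier fails on byte sequences that are not valid UTF-8.
    if (preg_match('//u', $data) === 1) {
        return $data;
    }
    return iconv($fallback, 'UTF-8', $data);
}

// Declared as UTF-8 but actually Latin-1: \xED is "í" in ISO-8859-1.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn, Irlanda</city>";

$xml = simplexml_load_string(toUtf8($raw));
echo (string) $xml, "\n";
```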
Here's a partial fix. It will definitely not fix everything, but will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
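function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    // Re-encode any high byte that is not followed by at least two UTF-8 continuation bytes
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
    // utf8_encode() assumes its input is ISO-8859-1
    return utf8_encode($m[0]);
}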
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find an obvious cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
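Note that the pattern above also removes tabs, line feeds and carriage returns, which XML 1.0 allows. If those should survive, a narrower character class could be used instead; stripXmlControlChars() below is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: drop the C0 control characters that XML 1.0 forbids,
// but keep tab (\x09), line feed (\x0A) and carriage return (\x0D).
function stripXmlControlChars(string $input): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $input);
}

$dirty = "<note>line one\x07\n\tline two\x00</note>";
$clean = stripXmlControlChars($dirty);

$xml = simplexml_load_string($clean);
echo (string) $xml, "\n";
```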
Instead of using javascript, you can simply put this line of code after your mysql_connect sentence:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it only declares it for the validator or other consumers.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities() function works.
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you try to create an XML out of it, be sure to pass it the final product a browser would see, meaning, having your field wrapped with CDATA
When generating mapping files using doctrine I ran into same issue. I fixed it by removing all comments that some fields had in the database.

Problem with simpleXML and entity not being defined

I'm trying to parse a XML file, but when loading it simpleXML prints the following warning:
Warning: simplexml_load_file() [function.simplexml-load-file]: gpr_545.xml:55: parser error : Entity 'Oslash' not defined in import.php on line 35
This is that line:
<forenames>B&Oslash;IE</forenames><x> </x>
As it is a warning, I might ignore it, but I'd like to understand what is happening.
HTML entities like &Oslash; are not the same as XML entities. Here's a table for replacing HTML entities with XML entities.
As far as I can tell from one of your comments on another post, you're having trouble with the entity &sol;. I don't know if this is even a valid HTML entity; my Firefox won't show the character and only outputs the entity name. But I found another table for most entities and their character reference numbers. Try adding them to your replacement table and you should be safe. The reference number for &sol; is &#47;, by the way.
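Instead of maintaining a replacement table by hand, PHP's own entity table can do the lookup. The following is only a sketch: htmlEntitiesToXml() is a hypothetical helper, and it assumes the document is otherwise UTF-8 (or plain ASCII). It keeps XML's five predefined entities, decodes every other named entity it recognises, and re-escapes the result so that a decoded "<" or "&" cannot break the markup.

```php
<?php
// Hypothetical helper: replace HTML-only named entities with the characters they stand for.
function htmlEntitiesToXml(string $xml): string
{
    $predefined = ['amp', 'lt', 'gt', 'quot', 'apos']; // the only names XML defines itself

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($predefined): string {
            if (in_array($m[1], $predefined, true)) {
                return $m[0]; // already valid XML
            }
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown name: leave it for the parser to report
            }
            return htmlspecialchars($char, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        $xml
    );
}

$xml = simplexml_load_string(htmlEntitiesToXml('<forenames>B&Oslash;IE &amp; &sol;</forenames>'));
echo (string) $xml, "\n";
```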
HTML encoding of Latin-1 characters (like &Oslash;, the entity for the character Ø) is what has broken the XML parser. If you're in control of the data, you need to escape it using XML-style numeric character references (&Oslash; just happens to be &#216;).
I think this is an encoding problem. PHP, or SimpleXML in this particular case, does not like the Danish Ø you've got in that forenames tag. You could try to encode the whole file in UTF-8 and replace the escaped version in the tag with the actual character. Afterwards you can read a file free of escaped characters into SimpleXML.
K
Just had a very similar problem and solved it in the following way. The main idea was to load the file into a string, replace the & of every entity with a placeholder (giving something like "[[entity]]Oslash;"), and reverse the replacement before displaying an XML node.
function readXML($filename){
$xml_string = implode("", file($filename));
$xml_string = str_replace("&", "[[entity]]", $xml_string);
return simplexml_load_string($xml_string);
}
function xml2str($xml){
$str = str_replace("[[entity]]", "&", (string)$xml);
$str = iconv("UTF-8", "WINDOWS-1251", $str);
return $str;
}
$xml = readXML($filename);
echo xml2str($xml->forenames);
I call iconv("UTF-8", "WINDOWS-1251", $str) because my page uses the WINDOWS-1251 encoding.
Try to use this line:
<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>
and read this about CDATA
