I want to parse an external php feed.
The address: http://www.hittadjur.se/feed.php?count=1
The output:
<?xml version="1.0"?>
<annons>
<rubrik>Wilja</rubrik>
<datum>2013-03-22</datum>
<ras>Chihuahua långhår</ras>
<ort>Göteborg</ort><bildurl>http://www.hittadjur.se/images/uploaded/thumbs/1363984467.jpg</bildurl><addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
</annons>
My PHP code that doesn't work:
$content = utf8_encode(file_get_contents('http://www.hittadjur.se/feed.php?count=1'));
$xml = simplexml_load_file($content);
echo $xml->annons->rubrik;
The reason I use the utf8_encode is that I receive this message if I don't:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE5 0x6E 0x67 0x68
The error now is:
Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity
Any ideas?
Thanks!
If you are loading an XML file stored on your own server, pass the full path:
simplexml_load_file($_SERVER['DOCUMENT_ROOT'].'/example.xml');
If you want to fetch the XML over HTTP, allow_url_fopen must be set to On in php.ini. Note that calling
ini_set('allow_url_fopen', 'On');
in your code has no effect, because this setting can only be changed in php.ini or the server configuration. Alternatively, fetch the content yourself and parse it as a string:
$temp = file_get_contents($url);
$XmlObj = simplexml_load_string($temp);
As Alvaro Vicario wrote, the problem is the parameter separators (&) in your URLs. In XML, an ampersand is an entity marker (the start of a named entity or of a numeric character reference) and must be escaped.
Either replace & with &amp; in your URLs or mark the URLs as literal text (a CDATA section in XML terms): <![CDATA[http://...]]>.
For example: <addurl><![CDATA[http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558]]></addurl>
If you are uncomfortable with the explicit UTF-8 conversion in your code and you know the character encoding of your data source, you can declare it in the XML prolog instead (ISO-8859-1 contains the offending å, byte 0xE5, of your XML):
<?xml version="1.0" encoding="iso-8859-1"?>
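When the feed itself cannot be changed, both repairs can be applied on the consuming side before parsing. The following is only a minimal sketch: the function name repair_and_parse_feed is made up for this example, the sample string stands in for the real feed response, ISO-8859-1 is an assumption based on the 0xE5 byte, and the mbstring extension is assumed to be available. Note also that <annons> is the root element, so its children are read directly from the returned object rather than through ->annons.

```php
<?php
// Hypothetical helper: converts an assumed ISO-8859-1 feed to UTF-8 and
// escapes bare ampersands before handing the string to SimpleXML.
function repair_and_parse_feed($raw)
{
    // Assumption: the bytes are ISO-8859-1 (0xE5 is "å" in that encoding).
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

    // Escape every "&" that does not already begin an entity or character reference.
    $escaped = preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $utf8
    );

    return simplexml_load_string($escaped);
}

// Stand-in for the string returned by file_get_contents() on the feed URL.
$raw = "<?xml version=\"1.0\"?>\n<annons>\n<rubrik>Wilja</rubrik>\n"
     . "<ras>Chihuahua l\xE5ngh\xE5r</ras>\n"
     . "<addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32</addurl>\n"
     . "</annons>";

$xml = repair_and_parse_feed($raw);
echo $xml->rubrik, "\n";  // <annons> is the root, so there is no ->annons step
echo $xml->addurl, "\n";
```

The ampersand pattern is a heuristic: a query string that happens to contain something shaped like an entity (for example "&copy;") would be left alone, so it is worth checking against real feed data.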
I'm afraid that the feed provides malformed XML. Apart from the encoding mess:
<addurl>http://www.hittadjur.se/index.php?page=case&type=&county=32&subpage=show&case=1363984558</addurl>
^
\_ Data not properly escaped
I may be wrong but I don't think you can parse it using regular XML functions because they're designed for valid XML (that's the whole purpose of using XML in the first place).
Perhaps you can try with DOMDocument. It's designed for HTML so it can deal with invalid input but it can also do XML.
Edit: Here's a trick to fix invalid XML but, honestly, I'm not sure it's worth the effort.
I am using PHP's SimpleXML to process an XML file, and get this error:
Message: simplexml_load_string(): Entity: line 9: parser error : EntityRef: expecting ';'
A quick Google search reveals that this is generally caused by an un-escaped & - there's a dozen questions with that answer here on Stack Overflow. However, here's line 9 of the file:
<p>In-kingdom commentary on the following items can be found on the November LoP. https://oscar.sca.org/kingdom/kingloi.php?kingdom=9&amp;loi=4191</p>
As you can see, the & is escaped. A text search on the file reveals no other instances of &.
What am I missing?
Please note: I have no ability to edit the XML file - I must take it as it comes and only fix things in my code. I currently open the XML with the following code:
$rawstring = file_get_contents($filename);
$safestring = html_entity_decode($rawstring, 0, 'ISO-8859-1');
$xmlstring = simplexml_load_string($safestring);
(the html_entity_decode is necessary as the file uses Latin-1 encoding and simplexml expects UTF-8)
Help appreciated.
html_entity_decode() is not intended for what you appear to think it is intended for, and it is exactly what is causing your problem. As the name suggests, it decodes HTML entities, like &amp;, into their actual representation; in this case &amp; => &. That leaves a bare & in the string, which is what the parser complains about.
If you want to convert the character encoding of the original $rawstring to ISO-8859-1 or UTF-8 you should use something like iconv() or mb_convert_encoding().
Here's an example that should work:
$rawstring = file_get_contents($filename);
$safestring = mb_convert_encoding($rawstring, 'ISO-8859-1' /*, $optionalOriginalEncoding */);
$xmlstring = simplexml_load_string($safestring);
See the list of supported encodings, as well.
However, since the original $rawstring is Latin-1, conversion to ISO-8859-1 is pointless, since Latin-1 is ISO-8859-1. You may need to convert to UTF-8, but I'm fairly certain that that's not even necessary either.
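To make the difference concrete, here is a small sketch that compares the two calls on a sample line. The sample string is invented for the illustration (it is not the real file), and the mbstring extension is assumed to be available.

```php
<?php
// Sample data in ISO-8859-1 with a correctly escaped ampersand ("\xE9" is "é").
$raw = "<?xml version=\"1.0\"?>\n<p>Caf\xE9 kingloi.php?kingdom=9&amp;loi=4191</p>";

// html_entity_decode() turns "&amp;" back into a bare "&", which breaks the XML.
$decoded = html_entity_decode($raw, 0, 'ISO-8859-1');
echo strpos($decoded, '&amp;') === false ? "entity lost\n" : "entity kept\n";

// Converting the character encoding leaves the markup untouched.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
$xml  = simplexml_load_string($utf8);
echo $xml, "\n";
```

Only the encoding conversion is needed here; the entity in the file was already correct.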
How can I create an XML file which uses special characters like À,Æ,Ç,È?
Using SimpleXML, it creates the following error
Warning: SimpleXMLElement::__construct(): Entity: line 24: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE5 0x6C 0x3A 0x20 in C:\xampp\htdocs\protech\admin\xml and rss\xml_create2.php on line 84
TRY This...
<?xml version='1.0' encoding='UTF-8'?>
utf8_encode($variable)
Most likely, utf8_encode() should be enough to fix your problem. It creates a UTF-8 encoded string from ISO-8859-1 input, as the function name suggests. So when creating your element, use something like
new SimpleXMLElement(utf8_encode($xml));
You can use DOMDocument to create the XML document and add elements, text, or whatever else you want.
See here for the reference
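A minimal sketch of the DOMDocument approach follows. The element names are made up for the example, and the text is assumed to be UTF-8 already; data in another encoding would need converting first.

```php
<?php
// Build the document in UTF-8; DOMDocument expects UTF-8 strings as input.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$root = $doc->createElement('items');   // element names are invented for this example
$doc->appendChild($root);

$item = $doc->createElement('item');
// createTextNode() escapes markup characters such as "&" for us.
$item->appendChild($doc->createTextNode("\u{C0} \u{C6} \u{C7} \u{C8} & co"));
$root->appendChild($item);

$xml = $doc->saveXML();
echo $xml;
```

The "\u{...}" escapes (PHP 7 and later) are used only so the example does not depend on how the source file itself is saved.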
I am integrating with Quickbooks using QBXML. I am running a customer query and the XML that Quickbooks returns appears to contain an invalid character (!).
Looking at the source XML that QuickBooks returns, I can see the invalid character (actual name changed for privacy reasons, but I left in the character in question):
<Contact>Ongél Davabond</Contact>
When I try to parse the XML (with the PHP XML parser, starting with xml_parser_create() ), I get an invalid character message.
I noticed that the XML header is just:
<?xml version="1.0" ?>
I tried preg_replacing that with
<?xml version="1.0" encoding="utf-8" ?>
but that didn't make any difference.
Given that I can't change how I receive the XML, how do I best deal with it on my end? Is there a way to have the PHP XML parser accept such characters? Does PHP have a way to convert any invalid characters into their &#nnn; equivalents, without affecting the XML structure, or do I need to go through the whole of the XML character by character looking for invalid characters and replacing them manually? I have no idea what other invalid characters might come up in the future, so I am after a way to deal with all the possibilities in one go, rather than just fixing this one 'é' character.
Although I was expecting UTF-8, the XML returned was ISO-8859-1. Forcing ISO-8859-1 encoding solved the issue.
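One possible way to force ISO-8859-1 is to rewrite the XML declaration before parsing, so the parser is told which encoding the bytes are really in. This is a sketch under assumptions: the sample response is invented, and the real data is assumed to start with an XML declaration that names no encoding.

```php
<?php
// Sample response; "\xE9" is "é" in ISO-8859-1 (stand-in for the real QuickBooks data).
$raw = "<?xml version=\"1.0\" ?>\n<Contact>Ong\xE9l Davabond</Contact>";

// Rewrite the XML declaration so it names the encoding the bytes are really in.
$declared = preg_replace(
    '/^<\?xml[^>]*\?>/',
    '<?xml version="1.0" encoding="ISO-8859-1"?>',
    $raw,
    1
);

$parser = xml_parser_create();
$ok = xml_parse_into_struct($parser, $declared, $values);

// The parser hands character data back in UTF-8 by default.
echo $ok ? $values[0]['value'] : xml_error_string(xml_get_error_code($parser)), "\n";
```

Declaring utf-8, as tried in the question, cannot help because the bytes themselves are not UTF-8; the declaration has to match the data.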
I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x11 0x72 0x20 0x41 in C:\file.php on line 166
Looking at the XML file in notepad++ it's clear what's causing this: there is a control character DC1 contained in the problematic line.
The XML file is provided by a 3rd party who I cannot reliably get to fix this/ensure it doesn't happen in the future. Could someone recommend a good way of dealing with this? I'd like to just do away with the control character -- in this particular case just deleting it from the XML file is fine -- but am concerned that always doing this could lead to unforeseen problems down the road. Thanks.
Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.
Having said that, why not just remove the character before you parse it using str_replace?
You can use str_replace() provided that the string is valid UTF-8. Note that str_replace() will then work with byte offsets, so you are no longer dealing with PHP strings but with byte strings.
And there is the rub: if your 3rd party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they eventually break UTF-8. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
Maybe you could take a shortcut: load it into a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors. Something like:
$doc = new DOMDocument();
if (@$doc->loadXML($raw_string)) {
    // document is loaded. time to normalize() it.
} else {
    throw new Exception("This data is junk");
}
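As an alternative to silencing the parser with the error-suppression operator, libxml can be asked to collect its errors so they can be inspected or logged. The helper name load_xml_or_fail below is hypothetical, and the sample input is invented to contain the DC1 control character.

```php
<?php
// Hypothetical helper: loads XML and reports parser errors instead of hiding them.
function load_xml_or_fail($rawString)
{
    $previous = libxml_use_internal_errors(true); // collect errors, emit no warnings
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($rawString);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    if (!$loaded) {
        $first = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new Exception("This data is junk: " . $first);
    }
    return $doc;
}

try {
    load_xml_or_fail("<root>bad \x11 byte</root>"); // contains the DC1 control character
} catch (Exception $e) {
    echo $e->getMessage(), "\n";
}
```

This keeps the "reject junk" behaviour while preserving the parser's own explanation of what went wrong.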
Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.
I'm getting the error:
parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20
When trying to process an XML response using simplexml_load_string from a 3rd party source. The raw XML response does declare the content type:
<?xml version="1.0" encoding="UTF-8"?>
Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.
I'm unable to get the 3rd party to sort out their XML.
How can I pre-process the XML and fix the encoding incompatibilities?
Is there a way to detect the correct encoding for a XML file?
Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
Now there are a few ways to work around it, which you should only use if you cannot load the XML normally. One of them is to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Or you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so you can load your XML.)
Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.
Either way, notify your data provider that they're sending invalid data so that they can fix it.
Here's a partial fix. It will definitely not fix everything, but it will fix some of it. Hopefully enough for you to get by until your provider fixes their stuff.
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}
function utf8_encode_callback($m)
{
return utf8_encode($m[0]);
}
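For the simpler case where a whole response is either valid UTF-8 or entirely ISO-8859-1, a validity check avoids double-encoding text that was fine to begin with. This is a sketch, not a general solution: ensure_utf8 is a made-up name, ISO-8859-1 as the fallback is an assumption, and unlike the regex above it does not handle both encodings mixed in one string.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function ensure_utf8($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "<city>Dubl\xEDn, Ireland</city>"; // 0xED is "í" in ISO-8859-1
$xml = simplexml_load_string(ensure_utf8($latin1));
echo $xml, "\n";
```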
I solved this using
$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
If you are sure that your xml is encoded in UTF-8 but contains bad characters, you can use this function to correct them :
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
We recently ran into a similar issue and were unable to find anything obvious as the cause. There turned out to be a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.
We managed to solve our problem thanks to this post and this:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
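Note that the pattern above also removes tab, line feed and carriage return, which are legal in XML. A slightly narrower variant, sketched here with a made-up function name and sample input, removes only the control characters that XML 1.0 forbids. It leaves 0x7F in place, since XML 1.0 permits it.

```php
<?php
// Hypothetical helper: strips control characters that XML 1.0 forbids,
// but keeps tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_xml_control_chars($input)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $input);
}

$dirty = "<note>line one\nline\x11 two</note>"; // contains the DC1 control character
$xml = simplexml_load_string(strip_xml_control_chars($dirty));
echo $xml, "\n";
```

Because these byte values never occur inside a multi-byte UTF-8 sequence, removing them byte by byte does not damage other characters in a UTF-8 string.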
Instead of using javascript, you can simply put this line of code after your mysql_connect sentence:
mysql_set_charset('utf8',$connection);
Cheers.
Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.
If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).
If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)
The string <?xml version="1.0" encoding="UTF-8"?> doesn't set the encoding of the document; it only tells the parser or validator what encoding to expect.
I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.
I used Sublime to change the file encoding to utf-8, and lxml imported it no issues.
After several tries I found that the htmlentities function works.
$value = htmlentities($value);
What I was facing was solved by what Erik proposed
https://stackoverflow.com/a/4575802/14934277
and it IS, actually, the only way to know if your data is okay to be printed.
And here is a piece of code that could be useful to anyone out there:
$product_desc = ..;
//Filter your $product_desc here. Remove tags, strip, do all you would do to print XML
try{(new SimpleXMLElement('<sth><![CDATA['.$product_desc.']]></sth>'))->asXML();}
catch(Exception $exc) {$product_desc = '';}; //Don't print trash
Note that part.
<![CDATA[]]>
When you build the test XML from it, be sure to pass in the final text a browser would see, that is, with your field wrapped in CDATA.
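One caveat with wrapping arbitrary text in CDATA is that the text may itself contain "]]>", which would end the section early. A common workaround is to split that sequence across two sections. The helper name cdata_wrap and the sample text below are made up for the illustration.

```php
<?php
// Hypothetical helper: wraps text in CDATA, splitting any "]]>" so it cannot
// terminate the section early.
function cdata_wrap($text)
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$productDesc = 'Fish & chips <b>now</b> with ]]> inside';
$doc = new DOMDocument();
$doc->loadXML('<sth>' . cdata_wrap($productDesc) . '</sth>');
echo $doc->documentElement->textContent, "\n";
```

CDATA only protects markup characters such as & and <; invalid byte sequences or forbidden control characters still make the document unparseable, which is what the try/catch above detects.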
When generating mapping files using doctrine I ran into same issue. I fixed it by removing all comments that some fields had in the database.