process unconform xml in php without simplexml [duplicate] - php

I'm having some trouble parsing malformed XML in PHP. In particular, I'm querying a third-party web service that returns data in an XML format without encoding the XML entities in the actual data. For example, one of the elements contains an ASCII heart, '<3', without the quotes, which the XML parser sees as an opening tag. It should be '&lt;3'.
Right now I'm simply passing the XML string into a SimpleXMLElement which, predictably, fails on these instances. I've done some looking around and it seems like the PHP Tidy package might be able to help me, but the amount of configuration you can do is overwhelming :(
Thus, I'm just wondering if anyone else has had a problem like this and, if so, how they were able to solve it.
Thanks!

Try tidy.repairString:
php > $tidy = new tidy();
php > $repaired = $tidy->repairString("<foo>I <3 Philadelphia</foo>", array("input-xml"=>1));
php > print($repaired);
<foo>I &lt;3 Philadelphia</foo>
php > $el = new SimpleXMLElement($repaired);

Read the content as a string.
htmlspecialchars(preg_replace('/[\x00-\x08\x0b-\x0c\x0e-\x1f]/','',$string))
Load the transformed string in SimpleXMLElement
It worked for me so far.
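
For the '<3' case from the question, a minimal sketch of a pre-processing step is shown here. It assumes that any '<' not followed by a letter, '_', '/', '!' or '?' cannot start a tag and is therefore literal text; that is a heuristic, not a general repair, and the variable names are made up for illustration.

```php
<?php
// Sample input taken from the question: the '<' in '<3' is not escaped.
$raw = '<foo>I <3 Philadelphia</foo>';

// Heuristic (assumption): a '<' that is not followed by a character that can
// begin a tag, closing tag, comment or processing instruction is literal text.
$escaped = preg_replace('/<(?![A-Za-z_\/!?])/', '&lt;', $raw);

// The escaped string is now well-formed and can be parsed normally.
$el = new SimpleXMLElement($escaped);
$text = (string) $el;

echo $text, "\n";
```

Text such as 'a <b' would still be mistaken for a tag, so this only helps when the stray '<' is followed by digits, spaces or punctuation.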

Related

Create comma separated string via xml values

I'm working on some system for a few hours now and this little thing is too much for me to think logically about at the moment.
Normally I would wait a few hours but this is a last minute job and I need to finish this.
Here's my problem:
I have an XML file that gets posted to my PHP file, the PHP file inserts certain data into a DB, but some XML nodes have the same name:
<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>
Now I want to get a var $acclist which contains all values separated by a comma:
value1,value2,value3,
I bet the solution to this is very easy but I'm at the known point where even the easiest piece of code becomes a hassle. And googling only comes up with nodes that in some way have their own identifiers.
Could someone help me out please?
You can try simplexml_load_string to parse the XML, then call implode on the node after casting it to an array.
NOTE: This code was tested in PHP 5.4.6 and behaves as expected.
<?php
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';
$dat = simplexml_load_string($xml);
echo implode(",",(array)$dat->accessoire);
For 5.3.x I had to change to
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';
$dat = simplexml_load_string($xml);
$dat = (array)$dat;
echo implode(",",$dat["accessoire"]);
You do this by taking a library that is able to parse and process XML, for example with SimpleXML:
implode(',', iterator_to_array($accessoires->accessoire, FALSE));
The key part here is to use iterator_to_array() as SimpleXML offers the same-named child-elements here as an iterator. Otherwise $accessoires->accessoire gives you auto-magically only the first element (if any).
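
A self-contained sketch of the iterator_to_array() approach, using the XML from the question. The explicit string cast via array_map is an addition for clarity, and $acclist is the variable name the question asked for.

```php
<?php
$xml = '<accessoires>
<accessoire>value1</accessoire>
<accessoire>value2</accessoire>
<accessoire>value3</accessoire>
</accessoires>';

$accessoires = simplexml_load_string($xml);

// FALSE discards the keys; all children share the key "accessoire",
// so keeping the keys would collapse them into a single entry.
$nodes = iterator_to_array($accessoires->accessoire, false);

// Cast each SimpleXMLElement to a plain string before joining.
$acclist = implode(',', array_map('strval', $nodes));

echo $acclist, "\n";
```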

Load an invalid XML in PHP DOM

I have an input XML file that is not well-formed (i.e., it has '&' instead of '&amp;').
When I try to load this XML using PHP DOM with $doc->load("file.xml"), it throws an error and stops parsing.
Is there any way to load this malformed XML? And no, I can't edit the source XML file.
I did try using $doc->loadHTML() but it throws errors all over the place.
I wanted to know if there is a proper way to do this (like load file contents and change it using regex or something similar)
Try setting $doc->validateOnParse = false; before loading your XML via $doc->loadHTML(...).
First, check that it's the & that's causing the error and not something else.
One way or another, you'll have to modify the XML to get it parsed. The HTML in loadHTML is loaded from a string, so can't you just replace the invalid characters with the correct ones?
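
To check which error the parser actually reports, something like the following sketch could be used. It relies on libxml's internal error buffer; the sample XML string is made up for illustration.

```php
<?php
// Collect parse errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Sample input (assumption): a bare '&' that should have been '&amp;'.
$ok = $doc->loadXML('<root>Fish & chips</root>');

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError with line, column and message fields.
    $messages[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
}
libxml_clear_errors();

print_r($messages);
```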
If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) you could try to clean it up with that, though in my experience it's far from foolproof.
If you are sure that's the only thing making it not validate, then you could try loading the file into a string with the file_get_contents() function, then search and replace through the string to change the &'s into &amp;'s, then pass that string to SimpleXML like $xml = simplexml_load_string($cleaned_string);
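
A rough sketch of that search and replace, assuming the only problem is bare ampersands. The regex leaves anything that already looks like an entity reference untouched; an inline sample string stands in for the file_get_contents() result.

```php
<?php
// Stand-in for file_get_contents('file.xml') (sample data, assumption).
$raw = '<root><name>Fish & chips</name><note>already &amp; escaped &#38; fine</note></root>';

// Escape every '&' that does not start a named, decimal or hex reference.
$cleaned_string = preg_replace(
    '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
    '&amp;',
    $raw
);

$xml = simplexml_load_string($cleaned_string);
$name = (string) $xml->name;

echo $name, "\n";
```

Named HTML entities such as &nbsp; pass through unchanged and are still undefined in plain XML, so they would need separate handling.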

Why do I see `á` instead of a space when writing to screen (encoding problem)?

I am completely lost with encoding issues, I have no idea what's going on, what the problem is exactly and how to fix it.
Basically I'm just trying to read an HTML file from a Zip file, parse it then output pieces to XML. Now something funky is happening with the text I get out of the parser.
When parsing the HTML, instead of a space I get á only if I write to the screen. If I keep it in a variable and write to a file it looks fine in the file. However even though it looks right in the XML something is wrong with it, my PHP parser can't parse that XML nor does IE seem to like it.
I had to first mb_convert_encoding($xmlcontent, "ASCII"); so I could get that XML to parse in PHP.
Any idea what my problem is?
# extract HTML from a .tar.gz file using Perl
my $tar = Archive::Tar->new;
$tar->read("myfile.tar.gz");
$tar->extract_file('index.html', 'output.html');
# load HTML, this is where it starts to get funky, I get output like Numberáofásourceálines
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('output.html') or die $!;
$tree->elementify;
# write to XML
my $output = new IO::File(">output.xml");
my $writer = new XML::Writer(OUTPUT => $output, DATA_MODE => 1,DATA_INDENT => 2);
If it looks correct when you write it to a file and wrong when you write it to the terminal, it sounds like your terminal is expecting the wrong encoding. Check your terminal settings.
Also, see Jon Rockway's answer to "Why does modern Perl avoid UTF-8 by default?". With encodings, you have to convert your input to the correct encoding and convert your output to the correct encoding. Everything that looks at the data needs to know which encoding you're using.
I think I just fixed it by processing this on the html before parsing it, thanks for all the great pointers!
s/\&nbsp\;/ /g;

Dealing with XML in PHP

I'm currently working on a project that has me working with XML a lot. I have to take an XML response, decrypt each text node, and then do various tasks with the data. The problem I'm having is taking the response and processing each text node. Originally I was using the XMLToArray library, and that worked fine: I would change the XML into an array and then loop through the array and decrypt the values. However, some of the XML responses I'm dealing with have repeated tags, and the XMLToArray library will only return the last values.
Is there a good way that I can take an XML response, process all the text nodes, and easily put the values into an array that has a similar structure to the response?
Thanks in advance.
I would use SimpleXML.
Here's a small example of using it. It loads and parses XML from http://www.w3schools.com/xml/plant_catalog.xml and then outputs values of "COMMON" and "PRICE" tags of each "PLANT" tag.
$xml = simplexml_load_file('http://www.w3schools.com/xml/plant_catalog.xml');
foreach ( $xml->PLANT as $plantNode ) {
echo $plantNode->COMMON, ' - ', $plantNode->PRICE, "\n";
}
If you have any problems with adapting it to your needs, just give an example of your XML so that we can help with it.
All those XML-to-array libraries are a remnant of the times when PHP 4 would force you to write your own XML parser almost from scratch. In recent PHP versions you have a good set of XML libraries that do the hard job. I particularly recommend SimpleXML (for small files) and XMLReader (for large files). If you still find them complicated, you can try phpQuery.
You might want to give SimpleXML a try. Plus, it comes with PHP by default, so you don't need to install anything.
Check out SimpleXML, it may offer a bit more for what you are looking for.
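
As a sketch of how SimpleXML could cover the repeated-tag case from the question: the helper below (xmlToArray is a made-up name) walks the tree, keeps every repeated child in a list, and applies a callback to each text value. strtoupper stands in for the decrypt step, which the question does not show.

```php
<?php
// Hypothetical helper: convert a SimpleXMLElement into a nested array,
// applying $transform to the text of every leaf element.
function xmlToArray(SimpleXMLElement $node, callable $transform)
{
    // A leaf has no child elements, so transform its text content.
    if ($node->count() === 0) {
        return $transform((string) $node);
    }

    $result = [];
    foreach ($node->children() as $name => $child) {
        // Always append to a list so repeated tags are never overwritten.
        $result[$name][] = xmlToArray($child, $transform);
    }
    return $result;
}

// Sample response (assumption) with a repeated <item> tag.
$xml = simplexml_load_string('<order><item>a</item><item>b</item><note>c</note></order>');

// strtoupper is only a placeholder for the real decrypt function.
$data = xmlToArray($xml, 'strtoupper');

print_r($data);
```

Every child ends up in a list, even when it occurs once, so the resulting structure stays predictable. Attributes and mixed content are ignored in this sketch.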

XML parsing _ dealing with an exception ... basic PHP problem

I have written a perfectly working XML parser using PHP...
using,
$xml = simplexml_load_file($newfile);
Now, The newfile was a pointer to an xml file that has over 20000 lines.
The problem was, the system ( an android) started generating tags called
< none >
... when no value existed...
BUT, there is NO < /none > tag ... There are multiple < none > values !!
It seems like
a) Either instruct Android not to do this ! - TRIED, but it can't be controlled .. OS deals with it.
b) or, Create a PHP loophole to prevent this error !
I need help. How do I achieve this ?
Thanks
If there is no end tag, the XML is not valid. If the only issue is the none tag, you can try to ignore it in your parser, or remove the none tags with PHP find-and-replace functions such as preg_replace before passing the XML to the parser.
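
A minimal sketch of the preg_replace route, assuming the stray tag appears exactly as an unclosed <none> (optionally with spaces inside the brackets) and never as a legitimate element. The sample XML is made up for illustration.

```php
<?php
// Stand-in for file_get_contents($newfile) (sample data, assumption).
$raw = '<data><a>1</a><b><none></b><c>< none ></c></data>';

// Strip every unclosed <none> marker, allowing optional whitespace inside.
$cleaned = preg_replace('/<\s*none\s*>/i', '', $raw);

// Parse from the cleaned string instead of simplexml_load_file().
$xml = simplexml_load_string($cleaned);

$a = (string) $xml->a;
$b = (string) $xml->b;

echo $a, "\n";
```

Elements that only contained the marker end up as empty strings, which matches the "no value" meaning described in the question.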
