I have a project where I need to parse an XML page and pick out some data. The DOMDocument class seems perfect, and I tried a few basic tests to see if it would do what I wanted.
Here is my code for the moment:
$dom = new domDocument;
$html = file_get_contents('http://wadmag.com/feed.xml');
$previous_value = libxml_use_internal_errors(TRUE);
$dom->loadHTML("$html");
libxml_clear_errors(); //This here is to clear the errors caused by the page not
libxml_use_internal_errors($previous_value); // being proper html
$links = $dom->getElementsByTagName('link');
echo "Found : ".$links->length. " items";
foreach ($links as $link) {
echo $link->nodeValue."<br>";
}
Now the problem is that when I load the page, I get the message "Found: 21 items", meaning that the getElementsByTagName returned a list, but when I try to display the contents of the list, nothing is displayed, as if the nodeValue was empty.
Even weirder, if I replace "link" in getElementsByTagName with "title" or "description", it displays everything as it should. I can't understand why; the only difference I can see is that <title> and <description> might be proper HTML whereas <link> is not.
If you are parsing XML, use $dom->loadXML($response) instead of $dom->loadHTML($response).
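To illustrate the difference, here is a minimal sketch using a made-up feed fragment (the markup and URLs are assumptions, not the real feed.xml). With loadXML, <link> is parsed as an ordinary element, so its nodeValue holds the URL. The HTML parser most likely treats <link> as an empty element, which would explain the blank output in the question.

```php
<?php
// Hypothetical feed fragment standing in for the real feed.xml.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml); // XML parser: <link> is a normal element with text content

// Collect the text of every <link> element.
$urls = [];
foreach ($dom->getElementsByTagName('link') as $link) {
    $urls[] = $link->nodeValue;
}

echo "Found: " . count($urls) . " links\n";
echo implode("\n", $urls), "\n";
```

Because the input is well-formed XML, the libxml_use_internal_errors() workaround from the question should no longer be needed here.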
I know there are many questions on parsing HTML in PHP, but I can't seem to find the specific problem I'm experiencing. My code works on other elements in the page, and it also iterates over the inputs, returning the tag name. However, their value is empty, even though two of them definitely have a value. Here is my code:
$html = file_get_contents('http://...sample website...html');
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/input[@type='hidden']");
if(!is_null($elements)){
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
echo $element->nodeValue. "\n";
}
}
$xpath->query("//*/input[@type='hidden']/@value");
instead of
$xpath->query("//*/input[@type='hidden']");
also works well.
Same question, same answers
I figured it out myself. If anyone else has a similar problem: nodeValue returns the text content of an element (roughly its "innerHTML"), not its attributes. To read an attribute, use $element->getAttribute("value") (here for the "value" attribute).
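As a small sketch of that fix, the snippet below reads the attributes of hidden inputs with getAttribute(). The form markup and the field names are invented for illustration, since the real page is not shown.

```php
<?php
// Hypothetical form markup; the real page is not shown in the question.
$html = '<form><input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="q" value="hello"></form>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query("//input[@type='hidden']") as $input) {
    // nodeValue is the text inside the element; an <input> has none.
    // Attributes are read with getAttribute() instead.
    $values[$input->getAttribute('name')] = $input->getAttribute('value');
}

print_r($values);
```

Only the hidden input is matched, so the text field is skipped.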
I have a seemingly unique situation in which I want to use DOMDocument to find a node on a page, store its value in a variable (working), then remove it from the output. I can't figure out how to save the node's value first and still remove the node from the DOMDocument output.
I am able to either remove the node completely first, which means nothing is stored in the variable, or I receive a 'Not Found Error' when trying to remove the node.
There is only one node (<h6>) on the page that needs to be removed. The code I have so far (with not found error) is below.
// Strip Everything Before and After Header Tags
$domdoc = new DOMDocument;
$docnew = new DOMDocument;
// Disable errors for <article> tag
libxml_use_internal_errors(true);
$domdoc->loadHTML(file_get_contents($file));
libxml_clear_errors();
$body = $domdoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$docnew->appendChild($docnew->importNode($child, true));
}
// Get the Page Title
$ppretitle = $docnew->getElementsByTagName('h6')->item(0);
$pagetitle = $ppretitle->nodeValue;
// Remove Same Element From Output
$trunctitl = $docnew->removeChild($ppretitle);
// Save Cleaned Output In Var
$pagecontent = $docnew->saveHTML();
The h6 element might not be a direct child node of the body element. Try $ppretitle->parentNode->removeChild($ppretitle); instead of $trunctitl = $docnew->removeChild($ppretitle);
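A minimal sketch of that suggestion, assuming the <h6> is nested inside a wrapper element (the markup below is invented, with a plain <div> as the wrapper): read the value first, then remove the node through its own parentNode.

```php
<?php
// Hypothetical page fragment: the <h6> is nested inside a wrapper element.
$html = '<div><h6>Page Title</h6><p>Body text</p></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$h6 = $doc->getElementsByTagName('h6')->item(0);
$pagetitle = $h6->nodeValue;          // read the value first
$h6->parentNode->removeChild($h6);    // then detach it from its actual parent

// Serialize only the wrapper, not the whole document.
$wrapper = $doc->getElementsByTagName('div')->item(0);
$pagecontent = $doc->saveHTML($wrapper);

echo $pagetitle, "\n", $pagecontent, "\n";
```

Calling removeChild() on the node's real parent avoids the 'Not Found Error', which is raised when the node is not a direct child of the object the method is called on.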
I have a page in php where I have to parse an xml.
I have done this for example:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
foreach($hotelNodes as $hotel){
$supplementsNodes2 = $hotel->getElementsByTagName('BoardBase');
foreach($supplementsNodes2 as $suppl2) {
echo '<p>HERE</p>'; // execution never gets here
}
}
In this code I access each hotel in my XML, and for each hotel I would like to search for the BoardBase tag, but execution never enters the inner loop.
This is my XML (with many parts cut out):
<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
</occupancy></occupancies>
</hotel>
Many nodes don't have a BoardBase, but even when one is present the loop is not entered.
Is it possible that this node isn't accessible?
This XML is received from a server with a SoapClient.
If I inspect the XML printed in firebug I can see the node with opacity like this:
I have also tried this:
$supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
but without success
2 issues I can see from the get-go: XML names are case-sensitive, hence:
$hotelNodes = $xml_data->getElementsByTagName('Hotel');
Can't work, because your xml node looks like:
<hotel desc="DESC" name="Hotel">
hotel => lower-case!
As you can see here:
[...] names for such things as elements, while XML is explicitly case sensitive.
The official specs specify tag names as case-sensitive, so getElementsByTagName('FOO') won't return the same elements as getElementsByTagName('foo')...
Secondly, you seem to have some tag-soup going on:
</occupancy></occupancies>
<!-- tag names don't match, both are closing tags -->
This is just plain invalid markup, it should read either:
<occupancy></occupancy>
or
<occupancies></occupancies>
That would be the first 2 ports of call.
I've set up a quick codepad using this code, which you can see here:
$xml = '<hotel desc="DESC" name="Hotel">
<selctedsupplements>
<boardbases>
<boardbase bbpublishprice="0" bbprice="0" bbname="Colazione Continentale" bbid="1"></boardbase>
</boardbases>
</selctedsupplements>
<occupancy></occupancy>
</hotel>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$badList = $dom->getElementsByTagName('Hotel');
$correctList = $dom->getElementsByTagName('hotel');
echo sprintf("%d",$badList->length),
' compared to ',
$correctList->length, PHP_EOL;
The output was "0 compared to 1", meaning that using a lower-case selector returned 1 element, the one with the upper-case H returned an empty list.
To get to the boardbase tags for each hotel tag, you just have to write this:
$hotels = $dom->getElementsByTagName('hotel');
foreach($hotels as $hotel)
{
$supplementsNodes2 = $hotel->getElementsByTagName('boardbase');
foreach($supplementsNodes2 as $node)
{
var_dump($node);//you _will_ get here now
}
}
As you can see on this updated codepad.
Alessandro, your XML is a mess (=un casino), you really need to get that straight. Elias' answer pointed out some very basic stuff to consider.
I built on the codepad Elias set up, and it works perfectly for me:
$dom = new DOMDocument;
$dom->loadXML($xml);
$hotels = $dom->getElementsByTagName('hotel');
foreach ($hotels as $hotel) {
$bbs = $hotel->getElementsByTagName('boardbase');
foreach ($bbs as $bb) echo $bb->getAttribute('bbname');
}
see http://codepad.org/I6oxkEOC
I'm using DOMDocument to retrieve a particular div from an HTML page.
I just want to retrieve the content of this div, without the div tag itself.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Here, I get this result:
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And I just want to have:
//SOME THINGS IN MY DIV
Any ideas? Thanks!
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns a single DOMElement, which extends DOMNode, which has the public string property $nodeValue. Since you don't specify whether you expect anything but text within that div, I'm going to assume that you want whatever is stored in there as plain text. For that, we remove $dom->saveHTML(); and replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
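For comparison, here is a short sketch of the second option that serializes each child with DOMDocument::saveHTML($node), which accepts a node argument as of PHP 5.3.6, instead of saveXML(). The sample markup is an assumption, made slightly richer than the original so the nested tag is visible in the result.

```php
<?php
// Hypothetical content standing in for $webtext['content'].
$html = '<div id="inter"><p>SOME <b>THINGS</b> IN MY DIV</p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$main = $dom->getElementById('inter');

// Concatenate the serialized children; the wrapping <div> itself is skipped.
$divString = '';
foreach ($main->childNodes as $c) {
    $divString .= $dom->saveHTML($c);
}

echo $divString, "\n";
```

Using saveHTML() rather than saveXML() keeps HTML-style serialization, for example for void elements such as <br>.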
You can use my custom function to strip the wrapping div from the content:
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
Your code will then look like this:
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents;
and your output will be
SOME THINGS IN MY DIV
You can use XPath:
$xpath = new DOMXPath($dom);
// node() also matches text nodes; * would match child elements only
foreach($xpath->query('//div[@id="inter"]/node()') as $node)
{
echo $node->nodeValue;
}
Or you can simply edit your code:
$main = $dom->getElementById('inter');
echo $main->nodeValue;
I am playing around with XPath, but I have no idea how to, for example, get the title of a website with it. Here is my code, but I don't know what to do next...
$dom = new DOMDocument();
$dom->loadHTMLFile("http://www.cool.de");
$x=new DOMXPath($dom);
$result = $x->query("//TITLE");
//...???
and print_r($result) shows me only "Object", is there a function like print_r to see what is inside an object so I don't have to guess?
$result is a DOMNodeList
echo $result->item(0)->textContent;
Edit: XPath is case-sensitive, and the DOM node names are lower case:
echo $x->query('//title')->item(0)->textContent;
This now works
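On the question of seeing what is inside the result: print_r() is not very informative for a DOMNodeList, so one option is to loop over it and print each node's name and text. This is only a sketch, using an inline HTML string (an assumption) in place of the remote page.

```php
<?php
// Hypothetical page standing in for the remote site.
$html = '<html><head><title>Cool Page</title></head><body><p>Hi</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$x = new DOMXPath($dom);
$result = $x->query('//title');

// Build a readable summary of every node in the list.
$summary = [];
foreach ($result as $node) {
    $summary[] = $node->nodeName . ': ' . $node->textContent;
}

echo $result->length, " node(s)\n";
print_r($summary);
```

To see the markup of a node rather than just its text, $dom->saveHTML($node) can be used in the loop instead.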