I know there are many questions on parsing HTML in PHP, but I can't seem to find the specific problem I'm experiencing. My code works on other elements in the page, and also iterates over the inputs returning the tag name. At the same time their value property is empty, when 2 of them have a value for sure. Here is my code
$html = file_get_contents('http://...sample website...html');
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/input[#type='hidden']");
if(!is_null($elements)){
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
echo $element->nodeValue. "\n";
}
}
$xpath->query("//*/input[#type='hidden']/#value");
instead of
$xpath->query("//*/input[#type='hidden']");
also works well.
Same question, same answers
I got it myself, if anyone else has a similar problem it is just that nodeValue returns the "innerHTML" of an element, to get its properties use $element -> getAttribute("value") (for the "value" attribute)
Related
I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
#$dom->loadHTML($html); // the # is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[#id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But For some reason I think something is wrong with either the code or the html (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[#id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '#' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[#id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}
I have a project where I need to parse a xml page and pick out some data. The domDocument class seems perfect and I tried a few basic tests to see if it would do what I wanted.
Here is my code for the moment:
$dom = new domDocument;
$html = file_get_contents('http://wadmag.com/feed.xml');
$previous_value = libxml_use_internal_errors(TRUE);
$dom->loadHTML("$html");
libxml_clear_errors(); //This here is to clear the errors caused by the page not
libxml_use_internal_errors($previous_value); // being proper html
$links = $dom->getElementsByTagName('item');
echo "Found : ".$links->length. " items";
foreach ($links as $link) {
echo $link->nodeValue."<br>";
}
Now the problem is that when I load the page, I get the message "Found: 21 items", meaning that the getElementsByTagName returned a list, but when I try to display the contents of the list, nothing is displayed, as if the nodeValue was empty.
The even weirder thing is that if I replace "link" in the getElementsByTagName by title or description, it displays everything as it should. Can't seem to understand why, the only difference I can see is that and might be proper html whereas is not.
If you parse XML, use $dom->loadXML($response) instead of $dom->loadHtml($response)
I'm using DOMDocument to retrieve on a HTML page a special div.
I just want to retrive the content of this div, without the div tag.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML()
Here, i have the result :
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And i just want to have :
//SOME THINGS IN MY DIV
Ideas ? Thanks !
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement which extends DOMNode which has the public stringnodeValue. Since you don't specify if you are expecting anything but text within that div, I'm going to assume that you want anything that may be stored in there as plain text. For that, we are going to remove $dom->saveHTML();, and instead replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
you can use my custom function to remove extra div from content
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
your code will like
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents
and your output will be
SOME THINGS IN MY DIV
you can use xpath
$xpath = new DOMXPath($xml);
foreach($xpath->query('//div[#id="inter"]/*') as $node)
{
$node->nodeValue
}
or simplu you can edit your code. see here
$main = $dom->getElementById('inter');
echo $main->nodeValue
Is it possible to delete element from loaded DOM without creating a new one? For example something like this:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $href)
if($href->nodeValue == 'First')
//delete
You remove the node by telling the parent node to remove the child:
$href->parentNode->removeChild($href);
See DOMNode::$parentNodeDocs and DOMNode::removeChild()Docs.
See as well:
How to remove attributes using PHP DOMDocument?
How to remove an HTML element using the DOMDocument class
This took me a while to figure out, so here's some clarification:
If you're deleting elements from within a loop (as in the OP's example), you need to loop backwards
$elements = $completePage->getElementsByTagName('a');
for ($i = $elements->length; --$i >= 0; ) {
$href = $elements->item($i);
$href->parentNode->removeChild($href);
}
DOMNodeList documentation: You can modify, and even delete, nodes from a DOMNodeList if you iterate backwards
Easily:
$href->parentNode->removeChild($href);
I know this has already been answered but I wanted to add to it.
In case someone faces the same problem I have faced.
Looping through the domnode list and removing items directly can cause issues.
I just read this and based on that I created a method in my own code base which works:https://www.php.net/manual/en/domnode.removechild.php
Here is what I would do:
$links = $dom->getElementsByTagName('a');
$links_to_remove = [];
foreach($links as $link){
$links_to_remove[] = $link;
}
foreach($links_to_remove as $link){
$link->parentNode->removeChild($link);
}
$dom->saveHTML();
for remove tag or somthing.
removeChild($element->id());
full example:
$dom = new Dom;
$dom->loadFromUrl('URL');
$html = $dom->find('main')[0];
$html2 = $html->find('p')[0];
$span = $html2->find('span')[0];
$html2->removeChild($span->id());
echo $html2;
I have just learned about XPath and I am wanting to read data from only certain columns in a table.
My current code looks like this:
<?php
$file_contents = file_get_contents('test.html');
$dom_document = new DOMDocument();
$dom_document->loadHTML($file_contents);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
$elements = $dom_xpath->query("//tr[#class='rowstyle']");
if (!is_null($elements)) {
foreach ($elements as $element)
{
echo $element->nodeValue . '<br />';
}
}
else
{
echo 'none';
}
?>
Also a variation in the query because through my research I have seen lots of issues with nest table elements but it produces the same result:
$elements = $dom_xpath->query("//table[#class='tablestyle']/tbody/tr[#class='rowstyle']");
It does grab the row of data but it makes into a single string, combining all of the cells into one string and making the tags disappear.
What I really want to do is separate those cells and grab the certain row number.
I am also curious on how to find out which version of XPath I have... My PHP version is 5.3.5
Its not combining those cells... youre outputting the nodeValue which in this case is behaving like innerHTML. IF you want to work on the cells themselves the either use childNodes or a xpah query using the row as the context, then loop over the cells.
Example:
$dom_xpath = new DOMXpath($dom_document);
$elements = $dom_xpath->query("//tr[#class='rowstyle']");
foreach ($elements as $element)
{
foreach($element->childNodes as $cell) {
echo $cell->nodeValue . '<br />';
}
}