I am loading HTML into DOM and then querying it using XPath in PHP. My current problem is how do I find out how many matches have been made, and once that is ascertained, how do I access them?
I currently have this dirty solution:
$i = 0;
foreach($nodes as $node) {
echo $dom->savexml($nodes->item($i));
$i++;
}
Is there a cleaner solution to find the number of nodes, I have tried count(), but that does not work.
You haven't posted any code related to $nodes so I assume you are using DOMXPath and query(), or at the very least, you have a DOMNodeList.
DOMXPath::query() returns a DOMNodeList, which has a length member. You can access it via (given your code):
$nodes->length
If you just want to know the count, you can also use DOMXPath::evaluate.
Example from PHP Manual:
$doc = new DOMDocument;
$doc->load('book.xml');
$xpath = new DOMXPath($doc);
$tbody = $doc->getElementsByTagName('tbody')->item(0);
// our query is relative to the tbody node
$query = 'count(row/entry[. = "en"])';
$entries = $xpath->evaluate($query, $tbody);
echo "There are $entries english books\n";
Related
Having a real bugger of an Xpath issue. I am trying to match the nodes with a certain value.
Here is an example XML fragment.
http://pastie.org/private/xrjb2ncya8rdm8rckrjqg
I am trying to match a given MatchNumber node value to see if there are two or more. Assuming that this is stored in a variable called $data I am using the below expression. Its been a while since ive done much XPath as most thing seem to be JSON these days so please excuse any rookie oversights.
$doc = new DOMDocument;
$doc->load($data);
$xpath = new DOMXPath($doc);
$result = $xpath->query("/CupRoundSpot/MatchNumber[.='1']");
I need to basically match any node that has a Match Number value of 1 and then determine if the result length is greater than 1 ( i.e. 2 or more have been found ).
Many thanks in advance for any help.
Your XML document has a default namespace: xmlns="http://www.fixtureslive.com/".
You have to register this namespace on the xpath element and use the (registered) prefix in your query.
$xpath->registerNamespace ('fl' , 'http://www.fixtureslive.com/');
$result = $xpath->query("/fl:ArrayOfCupRoundSpot/fl:CupRoundSpot/fl:MatchNumber[.='1']");
foreach( $result as $e ) {
echo '.';
}
The following XPath:
/CupRoundSpot[MatchNumber = 1]
Returns all the CupRoundSpot nodes where MatchNumber equals 1. You could use these nodes futher in your PHP to do stuff with it.
Executing:
count(/CupRoundSpot[MatchNumber = 1])
Returns you the total CupRoundSpot nodes found where MatchNumber equals 1.
You have to register the namespace. After that you can use the Xpath count() function. An expression like that will only work with evaluate(), not with query(). query() can only return node lists, not scalar values.
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('fl', 'http://www.fixtureslive.com/');
var_dump(
$xpath->evaluate(
'count(/fl:ArrayOfCupRoundSpot/fl:CupRoundSpot[number(fl:MatchNumber) = 1])'
)
);
Output:
float(2)
DEMO: https://eval.in/130366
To iterate the CupRoundSpot nodes, just use foreach:
$nodes = $xpath->evaluate(
'/fl:ArrayOfCupRoundSpot/fl:CupRoundSpot[number(fl:MatchNumber) = 1]'
);
foreach ($nodes as $node) {
//...
}
I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
#$dom->loadHTML($html); // the # is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[#id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But For some reason I think something is wrong with either the code or the html (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[#id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '#' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[#id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}
Is it possible to delete element from loaded DOM without creating a new one? For example something like this:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $href)
if($href->nodeValue == 'First')
//delete
You remove the node by telling the parent node to remove the child:
$href->parentNode->removeChild($href);
See DOMNode::$parentNodeDocs and DOMNode::removeChild()Docs.
See as well:
How to remove attributes using PHP DOMDocument?
How to remove an HTML element using the DOMDocument class
This took me a while to figure out, so here's some clarification:
If you're deleting elements from within a loop (as in the OP's example), you need to loop backwards
$elements = $completePage->getElementsByTagName('a');
for ($i = $elements->length; --$i >= 0; ) {
$href = $elements->item($i);
$href->parentNode->removeChild($href);
}
DOMNodeList documentation: You can modify, and even delete, nodes from a DOMNodeList if you iterate backwards
Easily:
$href->parentNode->removeChild($href);
I know this has already been answered but I wanted to add to it.
In case someone faces the same problem I have faced.
Looping through the domnode list and removing items directly can cause issues.
I just read this and based on that I created a method in my own code base which works:https://www.php.net/manual/en/domnode.removechild.php
Here is what I would do:
$links = $dom->getElementsByTagName('a');
$links_to_remove = [];
foreach($links as $link){
$links_to_remove[] = $link;
}
foreach($links_to_remove as $link){
$link->parentNode->removeChild($link);
}
$dom->saveHTML();
for remove tag or somthing.
removeChild($element->id());
full example:
$dom = new Dom;
$dom->loadFromUrl('URL');
$html = $dom->find('main')[0];
$html2 = $html->find('p')[0];
$span = $html2->find('span')[0];
$html2->removeChild($span->id());
echo $html2;
I am parsing an HTML page with DOM and XPath in PHP.
I have to fetch a nested <Table...></table> from the HTML.
I have defined a query using FirePath in the browser which is pointing to
html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table
When I run the code it says DOMNodeList is fetched having length 0. My objective is to spout out the queried <Table> as a string. This is an HTML scraping script in PHP.
Below is the function. Please help me how can I extract the required <table>
$pageUrl = "http://www.boc.cn/sourcedb/whpj/enindex.html";
getExchangeRateTable($pageUrl);
function getExchangeRateTable($url){
$htmlTable = "";
$xPathTable = nulll;
$xPathQuery1 = "html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table";
if(strlen($url)==0){die('Argument exception: method call [getExchangeRateTable] expects a string of URL!');}
// initialize objects
$page = tidyit($url);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
// $elements is sppearing as DOMNodeList
$elements = $xpath->query($xPathQuery1);
// print_r($elements);
foreach($elements as $e){
$e->firstChild->nodeValue;
}
}
have you try like this
$dom = new domDocument;
$dom->loadHTML($tes);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName("table");
$rows = $tables->item(0)->getElementsByTagName("tr");
print_r($rows);
Remove the tbody's from your XPath query - they are in most cases inserted by your browser, as is with the page you are trying to scrape.
/html/body/table[2]/tr/td[2]/table[2]/tr/td/table
This will most likely work.
However, its probaly more safe to use a different XPath. Following XPath will select the first th based on it's textual content, then select the tr's parent - a tbody or table:
//th[contains(text(),'Currency Name')]/parent::tr/parent::*
The xpath query should be with a leading / like :-
/html/...
I'm just getting started with using php DOMDocument and am having a little trouble.
How would I select all link nodes under a specific node lets say
in jquery i could simply do.. $('h5 > a')
and this would give me all the links under h5.
how would i do this in php using DOMDocument methods?
I tried using phpquery but for some reason it can't read the html page i'm trying to parse.
As far as I know, jQuery rewrites the selector queries to XPath. Any node jQuery can select, XPath also can.
h5 > a means select any a node for which the direct parent node is h5. This can easily be translated to a XPath query: //h5/a.
So, using DOMDocument:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//h5/a');
foreach ($nodes as $node) {
// do stuff
}
Retrieve the DOMElement whose children you are interested in and call DOMElement::getElementsByTagName on it.
Get all h5 tags from it, and loop through each one, checking if it's parent is an a tag.
// ...
$h5s = $document->getElementsByTagName('h5');
$correct_tags = array();
foreach ($h5s as $h5) {
if ($h5->parentNode->tagName == 'a') {
$correct_tags[] = $h5;
}
}
// do something with $correct_tags