DOMNodeList, xPath and PHP - php

I am parsing an HTML page with DOM and XPath in PHP.
I have to fetch a nested <table>...</table> from the HTML.
I built the query with FirePath in the browser; it points to
html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table
When I run the code, the DOMNodeList that comes back has length 0. My objective is to output the queried <table> as a string. This is an HTML scraping script in PHP.
The function is below. How can I extract the required <table>?
$pageUrl = "http://www.boc.cn/sourcedb/whpj/enindex.html";
getExchangeRateTable($pageUrl);
function getExchangeRateTable($url){
    $htmlTable = "";
    $xPathTable = null;
    $xPathQuery1 = "html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table";
    if(strlen($url)==0){die('Argument exception: method call [getExchangeRateTable] expects a string of URL!');}
    // initialize objects
    $page = tidyit($url);
    $dom = new DOMDocument();
    $dom->loadHTML($page);
    $xpath = new DOMXPath($dom);
    // $elements is appearing as DOMNodeList
    $elements = $xpath->query($xPathQuery1);
    // print_r($elements);
    foreach($elements as $e){
        $e->firstChild->nodeValue;
    }
}

Have you tried something like this?
$dom = new domDocument;
$dom->loadHTML($tes);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName("table");
$rows = $tables->item(0)->getElementsByTagName("tr");
print_r($rows);

Remove the tbody steps from your XPath query. In most cases they are inserted by the browser and are not in the page source, as is the case with the page you are trying to scrape.
/html/body/table[2]/tr/td[2]/table[2]/tr/td/table
This will most likely work.
However, it's probably safer to use a different XPath. The following XPath selects the th by its text content, then selects the tr's parent, which is either a tbody or a table:
//th[contains(text(),'Currency Name')]/parent::tr/parent::*
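To illustrate both points above, here is a minimal sketch. The markup is a made-up stand-in for the scraped page, not the real one, and the ancestor::table[1] step is my own variation on the XPath above so that the whole table is returned. It shows that a path containing tbody matches nothing when the source has no tbody, and that saveHTML() with a node argument gives the table as a string:

```php
<?php
// Hypothetical markup standing in for the scraped page; the source has no <tbody>.
$html = '<html><body><table><tr><th>Currency Name</th><th>Rate</th></tr>'
      . '<tr><td>USD</td><td>1.00</td></tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings from messy markup out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Browser-style path with tbody: the parser does not add tbody, so nothing matches.
$withTbody = $xpath->query('/html/body/table/tbody/tr');

// Anchor on the header text and walk up to the nearest enclosing table.
$tables = $xpath->query("//th[contains(text(),'Currency Name')]/ancestor::table[1]");

// saveHTML($node) serializes only that node, which gives the table as a string.
$tableHtml = $tables->length > 0 ? $dom->saveHTML($tables->item(0)) : '';

echo "Rows found with tbody in the path: ", $withTbody->length, "\n";
echo $tableHtml, "\n";
```

Anchoring on text content is less brittle than counting table positions, but it still breaks if the header text changes.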

The XPath query should start with a leading /, like this:
/html/...

Related

How to Extract HTML using XPath like YQL using php?

I am using YQL (https://developer.yahoo.com/yql/), but it has rate limits. Per application (identified by your Access Key): 100,000 calls per day. Per IP: /v1/public/: 2,000 calls per hour; /v1/yql/: 20,000 calls per hour.
I need unlimited queries. How can I extract HTML with XPath, as YQL does, using PHP?
$homepage = file_get_contents('https://google.com');
$dom = new DOMDocument();
$dom->loadHTML($homepage);
$xpath = new DOMXPath($dom);
$result = '';
foreach($xpath->evaluate('div') as $childNode) {
    $result .= $dom->saveHtml($childNode);
}
var_dump($result);
I found this example on the web, but it is not working.
Edit
$homepage = file_get_contents('https://google.com');
$dom = new DOMDocument();
$dom->loadHTML($homepage);
$xpath = new DOMXPath($dom);
$result = '';
foreach($xpath->query('//a[@class="touch"]') as $childNode) {
    // If the output is <a class="touch" href="url"><span alt="demo1" title="title2">Content</span> some</a>,
    // how do I get the href/url and the alt/title attributes of the child span?
    $result .= $dom->saveHtml($childNode);
}
var_dump($result);
If possible, how can I extract the full HTML to JSON/XML, as YQL does, using PHP?

There are several ways to do further processing; one is to run another query. To get the span node, you can use this query:
$span = $xpath->query('./span', $childNode); // all spans
$span->item(0)->attributes->getNamedItem("alt")->nodeValue; // first span
What you are doing is searching under the given node.
P.S. Don't use the attributes property as an array (attributes["attributeName"]), because it doesn't work in some versions of PHP.
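As a rough sketch of the relative-query approach above, the following collects the href of each link and the alt/title attributes of its first span, then prints them as JSON. The markup and the example.com URL are placeholders based on the shape described in the question, and getAttribute() is used here in place of attributes->getNamedItem():

```php
<?php
// Hypothetical markup matching the shape described in the question.
$html = '<a class="touch" href="http://example.com/page">'
      . '<span alt="demo1" title="title2">Content</span> some</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];

foreach ($xpath->query('//a[@class="touch"]') as $a) {
    // The leading ./ keeps the search under the current <a> element.
    $span = $xpath->query('./span', $a)->item(0);
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'alt'   => $span ? $span->getAttribute('alt') : null,
        'title' => $span ? $span->getAttribute('title') : null,
    ];
}

echo json_encode($links, JSON_UNESCAPED_SLASHES), "\n";
```

Building a plain array first and encoding it at the end keeps the extraction step separate from the output format.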

Get DOMNodeList of elements with only the given class

I am parsing a 3rd party HTML page using PHP DOMDocument and DomXPath.
I use the following code:
$dom = new DOMDocument();
$html = file_get_contents($url);
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
$dom->encoding = "UTF-8";
$finder = new DomXPath($dom);
Now there are several elements using the same class, but I want to target the one that uses only the given class, for example:
<table class="tbl"></table>
<table class="tbl red"></table>
<table class="tbl large blue"></table>
I use the following selector:
$classname = "tbl";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
Which, of course, fetches all three tables given above. Is there a simple way to get only the first one?
Thanks

Yes, there is a way.
Note that with your XPath query you can already reach the desired node this way:
$nodes->item(0);
To select only the first node, modify your pattern like this:
$nodes = $finder->query("(//*[contains(@class, '$classname')])[1]");
Even then, you still need this syntax to access the node:
$nodes->item(0);
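Taking the first match only works while the plain "tbl" table happens to come first in the document. If the goal is the element whose class attribute is exactly the given class, one possible alternative (my suggestion, not part of the answer above) is to compare the whole attribute rather than use contains(). A sketch using the three tables from the question:

```php
<?php
// The three tables from the question.
$html = '<table class="tbl"></table>'
      . '<table class="tbl red"></table>'
      . '<table class="tbl large blue"></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$classname = 'tbl';

// Matches only when the whole class attribute, ignoring outer whitespace, equals the name.
$exact = $finder->query("//*[normalize-space(@class) = '$classname']");

// For comparison: a token-aware match on any element that has the class among others.
$token = $finder->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
);

echo $exact->length, " exact, ", $token->length, " token matches\n";
```

The token-aware form is also safer than a bare contains(@class, 'tbl'), which would match a class such as "tblwide" too.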

retrieving certain attributes using DOMDocument

I'm trying to figure out how to parse an HTML page to get a form's action value, the labels within the form tag, and the input field names. I took a look at DOMDocument on php.net, and it tells me to get a child node, but all that does is give me errors saying it doesn't exist. I also tried print_r on the variable holding the HTML content, and all it shows me is length=1. Can someone show me a few samples that I can use? php.net is confusing to follow.
<?php
$content = "some-html-source";
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($content);
$form = $dom->getElementsByTagName('form');
print_r($form);

I suggest using DOMXPath instead of getElementsByTagName because it lets you select attribute values directly, and it returns a DOMNodeList object just like getElementsByTagName does. The @ in @action indicates that we're selecting an attribute.
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$action = $xpath->query('//form/@action')->item(0);
var_dump($action);
Similarly, to get the first input:
$input = $xpath->query('//form/input')->item(0);
To get all input fields, run the query once and iterate over the result:
$inputs = $xpath->query('//form/input');
foreach ($inputs as $input) {
    var_dump($input);
}
If you're not familiar with XPath, I recommend viewing these examples.
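Putting the pieces together, a minimal sketch for the three things asked about (the form action, the label texts, and the input names) might look like the following. The form markup is invented for illustration, and //form//label and //form//input are used so that nested fields are found as well:

```php
<?php
// Hypothetical form used only for illustration.
$content = '<form action="/login" method="post">'
         . '<label for="user">User name</label><input type="text" name="user" id="user">'
         . '<label for="pass">Password</label><input type="password" name="pass" id="pass">'
         . '</form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() with string() returns a plain PHP string instead of a node list.
$action = $xpath->evaluate('string(//form/@action)');

$labels = [];
foreach ($xpath->query('//form//label') as $label) {
    $labels[] = trim($label->textContent);
}

$names = [];
foreach ($xpath->query('//form//input[@name]') as $input) {
    $names[] = $input->getAttribute('name');
}

var_dump($action, $labels, $names);
```

print_r() on a DOMNodeList shows little more than its length, which is why iterating and reading textContent or getAttribute() is more informative.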

simple HTML DOM parser return wrong elements tree

I am having a problem with the Simple HTML DOM parser. This is what I used:
$url = 'http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/';
$page = file_get_html($url);
$ret = $page->find("div.posttext",0);
This is supposed to give me count($ret->children()) = 10. However, it only returns 3; all the elements after the third are merged into it as a single element.
Can anyone tell me whether there is something wrong with my code, or whether this is a Simple HTML DOM parser bug?

As Álvaro G. Vicario pointed out, your target HTML is malformed in some way. I tried your code, and it shows three children and 6 other nodes.
Another approach, which might be useful, is to use DOMDocument and DOMXPath like this:
$url = 'http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/';
$html = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom_xpath = new DOMXpath($dom);
// XPATH to return the first DIV with class "posttext"
$elements = $dom_xpath->query("(//div[@class='posttext'])[1]");
Then you can iterate through child nodes and read the values or whatever you want.
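A small sketch of that iteration step, using invented markup in place of the real page: it takes the first matching div and collects the tag names of its element children, skipping text nodes.

```php
<?php
// Hypothetical markup; the real page is not reproduced here.
$html = '<div class="posttext"><p>First</p><p>Second</p><ul><li>Item</li></ul></div>'
      . '<div class="posttext"><p>Other post</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // malformed pages produce many parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$dom_xpath = new DOMXPath($dom);
$first = $dom_xpath->query("(//div[@class='posttext'])[1]")->item(0);

$childTags = [];
if ($first !== null) {
    foreach ($first->childNodes as $child) {
        // childNodes also contains text and comment nodes; keep elements only.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $childTags[] = $child->nodeName;
        }
    }
}

echo count($childTags), ' element children: ', implode(', ', $childTags), "\n";
```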

phpQuery uses DOM, so it is a more reliable parser for bad HTML:
$html = file_get_contents('http://topmmanews.com/2013/04/06/ufc-on-fuel-tv-9-results/');
$dom = phpQuery::newDocumentHTML($html);
$ret = $dom->find("div.posttext")->eq(0);
echo count($ret->children());
#=> 10

Finding number of nodes in PHP, DOM, XPath

I am loading HTML into DOM and then querying it using XPath in PHP. My current problem is how do I find out how many matches have been made, and once that is ascertained, how do I access them?
I currently have this dirty solution:
$i = 0;
foreach($nodes as $node) {
    echo $dom->savexml($nodes->item($i));
    $i++;
}
Is there a cleaner way to find the number of nodes? I have tried count(), but that does not work.

You haven't posted any code related to $nodes, so I assume you are using DOMXPath and query(), or at the very least, you have a DOMNodeList.
DOMXPath::query() returns a DOMNodeList, which has a length member. You can access it via (given your code):
$nodes->length
If you just want to know the count, you can also use DOMXPath::evaluate.
Example from PHP Manual:
$doc = new DOMDocument;
$doc->load('book.xml');
$xpath = new DOMXPath($doc);
$tbody = $doc->getElementsByTagName('tbody')->item(0);
// our query is relative to the tbody node
$query = 'count(row/entry[. = "en"])';
$entries = $xpath->evaluate($query, $tbody);
echo "There are $entries english books\n";
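For the original question, a minimal sketch that combines both ideas: read the length property for the count, and use the loop variable directly instead of a manual index. The XML string is a made-up example. Note also that count($nodes) should work on PHP 7.2 and later, where DOMNodeList implements Countable; the code below does not rely on it.

```php
<?php
// Hypothetical document used only to have something to count.
$dom = new DOMDocument();
$dom->loadXML('<list><item>a</item><item>b</item><item>c</item></list>');
$xpath = new DOMXPath($dom);

$nodes = $xpath->query('//item');

// Number of matches, straight from the node list.
$total = $nodes->length;

// Same count computed inside XPath; count() yields a float, hence the cast.
$viaXPath = (int) $xpath->evaluate('count(//item)');

// foreach already yields each node, so no index or item() call is needed.
$serialized = [];
foreach ($nodes as $node) {
    $serialized[] = $dom->saveXML($node);
}

echo "$total nodes\n", implode("\n", $serialized), "\n";
```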
