How to query a DOMNode using XPath in PHP? - php

I'm trying to get the Bing search results with XPath. Here is my code:
$html = file_get_contents("http://www.bing.com/search?q=bacon&first=11");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHtml($html);
$x = new DOMXpath($doc);
$output = array();
// just grab the urls for now
foreach ($x->query("//li[@class='b_algo']") as $node)
{
//$output[] = $node->getAttribute("href");
$tmpDom = new DOMDocument();
$tmpDom->loadHTML($node);
$tmpDP = new DOMXPath($tmpDom);
echo $tmpDP->query("//div[@class='b_title']//h2//a//href");
}
return $output;
This foreach iterates over all results. All I want to do is extract the link and text from $node inside the loop, but because $node is itself an object, I can't create a DOMDocument from it. How can I query it?

First of all, your XPath expression tries to match non-existent href subelements. Query @href to get the attribute instead.
You don't need to create any new DOMDocuments; just pass $node as the context item:
foreach ($x->query("//li[@class='b_algo']") as $node)
{
var_dump( $x->query("./div[@class='b_title']//h2//a//@href", $node)->item(0) );
}
If you're just interested in the URLs, you could also query them directly:
foreach ($x->query("//li[@class='b_algo']/div[@class='b_title']/h2/a/@href") as $node)
{
var_dump($node);
}
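To collect both the link and its text, as the question asks, the context-node approach can be sketched like this. The markup below is a hypothetical stand-in for the structure described in the question, not Bing's actual response, and the array keys `url` and `text` are just illustrative names.

```php
<?php
// Hypothetical markup imitating the structure described in the question.
$html = <<<HTML
<ul>
  <li class="b_algo"><div class="b_title"><h2><a href="http://example.com/a">First</a></h2></div></li>
  <li class="b_algo"><div class="b_title"><h2><a href="http://example.com/b">Second</a></h2></div></li>
</ul>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$output = [];
foreach ($xpath->query("//li[@class='b_algo']") as $node) {
    // The leading "." makes the expression relative to the context node $node.
    $link = $xpath->query(".//h2/a", $node)->item(0);
    if (!$link instanceof DOMElement) {
        continue; // this result has no link
    }
    $output[] = [
        'url'  => $link->getAttribute('href'),
        'text' => trim($link->textContent),
    ];
}

print_r($output);
```

Selecting the `a` element rather than the attribute node keeps both the `href` and the link text reachable from one query.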

Related

Extract node information from external URL using query

I want to be able to extract information from specific nodes from an external XML file. I currently have been trying
$contents = file_get_contents('https://experiencehermann.com/post-sitemap.xml');
$dom = new DOMDocument;
$dom -> loadXML($contents);
$finder = new DOMXPath($dom);
$nodes = $finder->query('//loc');
foreach ($nodes as $node) {
echo $node->nodeValue ."<br />";
}
I'm able to use this same technique when I have the XML in the PHP directly but not when pulling from an external source.
Thanks in advance!
As your query is quite simple, you don't even need XPath, you can simply use the getElementsByTagName method on the DOMDocument object:
$dom = new DOMDocument;
$dom->loadXML($contents);
$nodes = $dom->getElementsByTagName('loc');
foreach ($nodes as $node) {
if ($node->nodeName === 'image:loc')
continue;
echo $node->nodeValue ."<br />\n";
}
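If you would rather stay with XPath, one likely reason `//loc` matches nothing is that sitemap files normally declare a default namespace on the root element, and XPath only matches namespaced elements through a registered prefix. A minimal sketch, assuming the file uses the standard sitemaps.org namespace; the inline XML and the `sm` prefix are made up for illustration:

```php
<?php
// Hypothetical sitemap fragment; a real file would come from file_get_contents().
$contents = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/post-1</loc>
    <image:image><image:loc>https://example.com/img-1.jpg</image:loc></image:image>
  </url>
  <url>
    <loc>https://example.com/post-2</loc>
  </url>
</urlset>
XML;

$dom = new DOMDocument();
$dom->loadXML($contents);
$finder = new DOMXPath($dom);

// "sm" is an arbitrary alias for the sitemap namespace URI.
$finder->registerNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$locs = [];
foreach ($finder->query('//sm:url/sm:loc') as $node) {
    // image:loc lives in a different namespace, so it is not matched here.
    $locs[] = $node->nodeValue;
}

print_r($locs);
```

Because the query is namespace-aware, no extra check for `image:loc` is needed.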

How to extract specific type of links from website using php?

I am trying to extract a specific type of link from a webpage using PHP.
The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links like in the above format.
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but I can't get the filter above to work. How can I achieve this?
Any suggestions ?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a PHP function with DOMXPath::registerPhpFunctions, then call it from within an XPath query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}
You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
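Note that starts-with() compares against the attribute value exactly as written, so for absolute URLs the prefix has to include the scheme. A small self-contained sketch, with made-up links and a hypothetical `$prefix`:

```php
<?php
// Hypothetical links; only the first one has the wanted prefix.
$html = <<<HTML
<a href="http://www.example.com/pages/12345667/some-texts-available-here">Wanted</a>
<a href="http://www.example.com/about">Other page</a>
<a href="http://other.example.org/pages/1/x">Other host</a>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The prefix includes scheme and host because @href holds the full URL.
$prefix = 'http://www.example.com/pages/';
$links = $xpath->query("//a[starts-with(@href, '$prefix')]");

$matches = [];
foreach ($links as $link) {
    $matches[] = $link->getAttribute('href');
}

print_r($matches);
```

This only checks the prefix. It does not verify that the next path segment is numeric, so a regular expression check is still needed if that matters.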

Trouble extracting data from an XML document using XPath

I'm trying to extract all of the "name" and "form13FFileNumber" values from xpath "//otherManagers2Info/otherManager2/otherManager" in this document:
https://www.sec.gov/Archives/edgar/data/1067983/000095012314002615/primary_doc.xml
Here is my code. Any idea what I am doing wrong here?
$xml = file_get_contents($url);
$dom = new DOMDocument();
$dom->loadXML($xml);
$x = new DOMXpath($dom);
$other_managers = array();
$nodes = $x->query('//otherManagers2Info/otherManager2/otherManager');
if (!empty($nodes)) {
$i = 0;
foreach ($nodes as $n) {
$i++;
$other_managers[$i]['form13FFileNumber'] = $x->evaluate('form13FFileNumber', $n)->item(0)->nodeValue;
$other_managers[$i]['name'] = $x->evaluate('name', $n)->item(0)->nodeValue;
}
}
As you posted in the comment, you can register the namespace with your own prefix for XPath. Namespace prefixes are just aliases. There is no default namespace in XPath, so you always have to register and use a prefix.
Location path expressions always return a traversable node list, which you can iterate with foreach. Both query() and evaluate() take a context node as the second argument, and the expression is evaluated relative to that context. Finally, evaluate() can return scalar values directly. This happens if you cast the node list into a scalar type inside the XPath expression (like a string) or use a function like count().
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXpath($dom);
$xpath->registerNamespace('e13', 'http://www.sec.gov/edgar/thirteenffiler');
$xpath->registerNamespace('ecom', 'http://www.sec.gov/edgar/common');
$result = [];
$nodes = $xpath->evaluate('//e13:otherManagers2Info/e13:otherManager2/e13:otherManager');
foreach ($nodes as $node) {
$result[] = [
'form13FFileNumber' => $xpath->evaluate('string(e13:form13FFileNumber)', $node),
'name' => $xpath->evaluate('string(e13:name)', $node),
];
}
var_dump($result);
Demo: https://eval.in/125200
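The different return types of evaluate() can be seen in a small sketch. The XML, the namespace URI `urn:example:managers`, and the prefix `m` are invented for illustration and are not taken from the SEC document:

```php
<?php
// Hypothetical document with a default namespace.
$xml = <<<XML
<managers xmlns="urn:example:managers">
  <manager><name>Alpha</name><fileNumber>028-0001</fileNumber></manager>
  <manager><name>Beta</name><fileNumber>028-0002</fileNumber></manager>
</managers>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('m', 'urn:example:managers');

// A location path yields a DOMNodeList.
$nodes = $xpath->evaluate('//m:manager');

// count() yields an XPath number, which PHP receives as a float.
$count = $xpath->evaluate('count(//m:manager)');

// string() casts the first matching node to a PHP string.
$first = $xpath->evaluate('string(//m:manager[1]/m:name)');

var_dump($nodes->length, $count, $first);
```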

Looking to loop over 2 elements at the same time (PHP/XPath)

I'm trying to extract 2 elements using PHP cURL and XPath.
So far I get the elements separately in a foreach, but I would like to get them both at the same time:
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
foreach ($elements as $element) {
$url = $element->nodeValue;
//$title = $element->nodeValue;
}
When I echo each one outside the foreach I only get one element, and when it's echoed inside the foreach I get all of them.
My question is: how can I get both at the same time (URL and title), and what's the best way to insert them into MySQL using PDO?
Thank you.
There is no need, in this case, to use XPath twice. You could do one query and navigate to the associated other node(s).
For example, find all of the hrefs that you are interested in and get their ownerElement's (the <a>) node value.
$hrefs = $xpath->query("//p[@class='row']/a/@href");
foreach ($hrefs as $href) {
$url = $href->value;
$title = $href->ownerElement->nodeValue;
// Insert into db here
}
Or, find all of the <a>s that you are interested in and get their href attributes.
$anchors = $xpath->query("//p[@class='row']/a[@href]");
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute("href");
$title = $anchor->nodeValue;
// Insert into db here
}
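For the database part, one possible approach is a single prepared statement executed once per pair. This is only a sketch: the `links` table and its columns are hypothetical, the `$rows` array stands in for the pairs collected in the loop, and an in-memory SQLite database is used so that the example runs without a server. For MySQL only the DSN and credentials passed to PDO would differ.

```php
<?php
// Hypothetical data standing in for the url/title pairs collected in the loop.
$rows = [
    ['url' => 'http://example.com/a', 'title' => 'First'],
    ['url' => 'http://example.com/b', 'title' => 'Second'],
];

// In-memory SQLite keeps the sketch self-contained (requires the pdo_sqlite extension).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (url TEXT, title TEXT)'); // hypothetical table

// Prepare once, execute per row; bound parameters avoid SQL injection.
$stmt = $pdo->prepare('INSERT INTO links (url, title) VALUES (:url, :title)');

$pdo->beginTransaction();
foreach ($rows as $row) {
    $stmt->execute([':url' => $row['url'], ':title' => $row['title']]);
}
$pdo->commit();

$inserted = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
echo $inserted, PHP_EOL;
```

Wrapping the inserts in one transaction is a design choice that keeps the batch atomic and is usually faster than committing each row separately.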
You're overwriting $url on each iteration. Maybe use an array?
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[@class='row']/a/@href");
//$elements = $xpath->query("//p[@class='row']/a");
$urls = array();
foreach ($elements as $element){
array_push($urls, $element->nodeValue);
//$title = $element->nodeValue;
}

Simple HTML DOM gets only 1 element

I'm following a simplified version of the scraping tutorial by NetTuts here, which basically finds all divs with class=preview
http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/comment-page-1/#comments
This is my code. The problem is that when I count $items I get only 1, so it's getting only the first div with class=preview, not all of them.
$articles = array();
$html = new simple_html_dom();
$html->load_file('http://net.tutsplus.com/page/76/');
$items = $html->find('div[class=preview]');
echo "count: " . count($items);
Try using DOMDocument and DOMXPath:
$file = file_get_contents('http://net.tutsplus.com/page/76/');
$dom = new DOMDocument();
@$dom->loadHTML($file);
$domx = new DOMXPath($dom);
$nodelist = $domx->evaluate("//div[@class='preview']");
foreach ($nodelist as $node) { print $node->nodeValue; }
