I am trying to get the price of products in a webshop, but DOMXPath doesn't seem to be working.
The server is running PHP 5.5 and libxml is enabled. No errors are returned, only a node list with a length of zero.
ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(-1);
session_start();
$xmlsource = 'https://tennistoko.nl/product/professional-supreme-comfort-grip-3-st';
$d = new DOMDocument();
$d->loadHTML($xmlsource);
$xpath = new DOMXPath($d);
$nodes = $xpath->query('//*[@itemprop]'); // this catches all elements with an itemprop attribute
foreach ($nodes as $node) {
// do your stuff here with $node
print_r($node);
}
print_r($nodes);
loadHTML is for loading HTML from a string; to load from a file or URL, use loadHTMLFile.
$xmlsource = 'https://tennistoko.nl/product/professional-supreme-comfort-grip-3-st';
$d = new DOMDocument();
@$d->loadHTMLFile($xmlsource); // @ is for suppressing warnings
$xpath = new DOMXPath($d);
$nodes = $xpath->query('//*[@itemprop]'); // this catches all elements with an itemprop attribute
foreach ($nodes as $node) {
// do your stuff here with $node
print_r($node);
}
Try adding a trailing / to the URL, otherwise the document comes back empty, and use loadHTMLFile as Danijel suggested:
$xmlsource = 'https://tennistoko.nl/product/professional-supreme-comfort-grip-3-st/'; // trailing slash added here
$d = new DOMDocument();
@$d->loadHTMLFile($xmlsource); // @ is for suppressing warnings
$xpath = new DOMXPath($d);
$nodes = $xpath->query('//*[@itemprop]'); // this catches all elements with an itemprop attribute
foreach ($nodes as $node) {
// do your stuff here with $node
print_r($node);
}
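Once the document actually loads, the price itself can be read from the matching node. Here is a minimal sketch, assuming the shop marks the price up with itemprop="price" (the inline markup and the content attribute are hypothetical stand-ins for the real page, not taken from it). It also uses libxml_use_internal_errors() as an alternative to the @ operator:

```php
<?php
// Hypothetical product markup standing in for the downloaded page.
$html = '<html><body>
  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Comfort Grip</span>
    <span itemprop="price" content="7.95">7,95</span>
  </div>
</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$d = new DOMDocument();
$d->loadHTML($html);                // loadHTML() takes markup, not a URL
libxml_clear_errors();

$xpath = new DOMXPath($d);
$prices = [];
foreach ($xpath->query('//*[@itemprop="price"]') as $node) {
    // Prefer the machine-readable content attribute, fall back to the text.
    $prices[] = $node->hasAttribute('content')
        ? $node->getAttribute('content')
        : trim($node->textContent);
}
print_r($prices);
```

For the live page you would swap the inline string for loadHTMLFile($xmlsource) and check which itemprop values the shop really uses.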
Related
I have the following:
$node = $doc->getElementsByTagName('img');
if ($node->item(0) == null || $node->item(0) == '') {
// do stuff
} elseif ($node->item(0)->hasAttribute('src')) {
// do other stuff
} else {
// do more other stuff
}
What I want is to only return images from the body tag.
I have tried:
$body = $doc->getElementsByTagName('body');
foreach ($body as $body_node) {
$node = $body_node->getElementsByTagName('img');
}
However, if there is an image in the head, it still seems to get returned by
$node->item(0)->hasAttribute('src')
In my opinion there should never be an img in the head, but I find some URLs add them in a noscript tag in the head.
So how do I return only images from the body tag, excluding any found in the head tag?
Do it using DOMXPath:
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
$nodes is now a DOMNodeList that you can iterate over.
If you only want img nodes that have a src attribute:
$nodes = $xpath->query('//body//img[@src]');
Edit: Here is a fully working example:
<?php
$contents = file_get_contents('http://stackoverflow.com/');
$doc = new DOMDocument();
$doc->loadHTML($contents);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
foreach ($nodes as $node) {
echo $node->getAttribute('src') . "\n";
}
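To see the body-only filter in isolation, here is a small sketch on a made-up document. The markup and file names are hypothetical, and it is loaded with loadXML() purely so the tree is exactly as written (the HTML parser may relocate misplaced elements, so results on real pages can differ):

```php
<?php
// Hypothetical page: one image in <head> (inside <noscript>), two in <body>.
// Assumption: loadXML() is used so the tree matches the markup exactly.
$markup = '<html><head><noscript><img src="tracker.gif"/></noscript></head>'
        . '<body><p><img src="photo.jpg"/><img alt="no source"/></p></body></html>';

$doc = new DOMDocument();
$doc->loadXML($markup);

$xpath = new DOMXPath($doc);
$srcs = [];
// Only img elements that are descendants of body and carry a src attribute.
foreach ($xpath->query('//body//img[@src]') as $img) {
    $srcs[] = $img->getAttribute('src');
}
print_r($srcs);
```

The image under head and the image without a src are both skipped, so only photo.jpg ends up in $srcs.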
HTML is as follows:
<a>ABC<BR>DEF</a>
However, both nodeValue and textContent attributes show "ABCDEF" as the value.
Any way to show or parse the <BR>?
Maybe this will help you: DOMNode::C14N
It returns the canonicalized markup of the node as a string, tags included.
<?php
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DomXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
var_dump($node->c14n());
}
I know you have already solved your problem, but I wanted to add a more direct way of solving it...
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
$doc->loadHTML($a);
$xp = new DomXPath($doc);
$nodes = $xp->query("//a/node()");
$text = '';
foreach ($nodes as $node) {
$text .= $doc->saveHTML($node);
}
echo $text;
Outputs...
ABC<br>DEF
How do I echo and scrape a div class? I tried this but it doesn't work. I am using cURL to establish the connection. How do I echo it? I want it just how it is on the actual page.
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[@class='resultitem']");
//a URL you want to retrieve
foreach($anchors as $a) {
echo $a;
}
Neighbor,
I just made the snippet below. It uses your logic, with some tweaks, to display the elements with the specified class from the page fetched by file_get_contents.
Maybe you can plug in your values and try it?
(Note: I put the error reporting in there to catch a few bugs. It can be helpful to keep it on as you tweak.)
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');
$url = "http://www.tizag.com/cssT/cssid.php";
$class_to_scrape="display";
$html = file_get_contents($url);
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[@class='". $class_to_scrape ."']");
echo "ok, no php syntax errors. <br>Lets see what we scraped.<br>";
foreach ($anchors as $node) {
$full_content = innerHTML($node);
echo "<br>".$full_content."<br>" ;
}
/* this function preserves the inner content of the scraped element.
** http://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags
** So be sure to go and give that post an uptick too:)
**/
function innerHTML(DOMNode $node)
{
$doc = new DOMDocument();
foreach ($node->childNodes as $child) {
$doc->appendChild($doc->importNode($child, true));
}
return $doc->saveHTML();
}
?>
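Two details are worth spelling out. `echo $a` in the question cannot work because each result is a DOMElement object, not a string, and `@class='resultitem'` only matches when the attribute is exactly that value. Below is a hedged sketch on made-up markup: saveHTML($node) (available since PHP 5.3.6) returns the element with its tags, and the contains()/concat() idiom also matches elements that carry additional classes. The sample HTML is hypothetical.

```php
<?php
// Hypothetical markup: two divs carry the class, one of them with an extra class.
$html = '<html><body><div class="resultitem first"><b>One</b></div>'
      . '<div class="resultitem">Two</div><div class="other">Three</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "resultitem" as a whole word inside the space-separated class list.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " resultitem ")]';

$found = [];
foreach ($xpath->query($query) as $div) {
    $found[] = $doc->saveHTML($div);   // outer HTML of the matched element
}
echo implode("\n", $found), "\n";
```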
I am getting errors in this PHP XPath app that I cannot fix. I would love some help if possible.
<?php
//Get Username
$username = $_GET["u"];
$html = file_get_contents('http://us.playstation.com/publictrophy/index.htm?onlinename=' .$username);
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//*[@id="id-handle"]') as $node) {
echo $node, "\n";
}
foreach ($xpath->query('//*[@id="leveltext"]') as $node1) {
echo $node1, "\n";
}
?>
Put @ before $dom->loadHTML($html), because loadHTML usually raises a lot of warnings and notices on real-world markup:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
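Suppressing warnings is probably not the whole story, though: `echo $node` in the question tries to print a DOMElement object, which PHP cannot convert to a string. A minimal sketch of reading the text instead, using made-up markup in place of the real profile page (the ids come from the question, the values are hypothetical):

```php
<?php
// Hypothetical markup standing in for the downloaded profile page.
$html = '<html><body><div id="id-handle">player_one</div>'
      . '<div id="leveltext">12</div></body></html>';

libxml_use_internal_errors(true);   // alternative to the @ operator
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$values = [];
foreach (['id-handle', 'leveltext'] as $id) {
    foreach ($xpath->query('//*[@id="' . $id . '"]') as $node) {
        // A DOMElement cannot be echoed directly; read its text instead.
        $values[$id] = trim($node->nodeValue);
        echo $values[$id], "\n";
    }
}
```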
I have a Craigslist URL in $link and load that page's HTML into $linkhtml.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?
If DOMDocument looks too complicated to understand or use, you may try PHP Simple HTML DOM Parser, which provides a very easy way to parse HTML.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2
You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>
You can do this with XPath (untested, may contain errors):
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("/html/body/h2");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
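Two caveats about the XPath answers: `/html/body/h2` only matches an h2 that is a direct child of body, and `string(//h2/text())` returns only the first text node, so text inside nested tags is lost. A small sketch without XPath, assuming the first h2 in the document is the one you want (the sample markup is made up):

```php
<?php
// Hypothetical stand-in for the Craigslist page held in $linkhtml.
$linkhtml = '<html><body><div><h2>Blue <b>bike</b> for sale</h2></div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($linkhtml);
libxml_clear_errors();

// Take the first h2 wherever it sits; textContent includes nested text.
$h2 = $dom->getElementsByTagName('h2')->item(0);
$title = $h2 !== null ? trim($h2->textContent) : '';
echo $title, "\n";
```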