Xpaths not working correctly in PHP - php

I know that same issues have been discussed many times but I'm really stuck. I am trying to access a part of an HTML DOM tree:
<div class="price widget-tooltip widget-tooltip-tl price-breakdown">
<strong class="current-price">10€</strong>
</div>
I want to get the value of the <strong> node, which would be 10€, so I use the XPath:
//div//strong[contains(@class, 'current-price')]
which works perfectly in Chrome Dev Tools, but not in my PHP script:
// Create DOM object
$dom = new DOMDocument();
@$dom->loadHTML($header['content']);
$xpath = new DOMXPath($dom);
$prices = $xpath->query("//div//strong[contains(@class, 'current-price')]");
I tried different "versions" as suggested here and here (and in many other cases) without success. The problem is that I'm not getting anything as return value, just empty, nothing.
I isolated the issue and can confirm that it happens only when the class selector comes into play: if I use the path without the class selector, it works fine (returning many elements).
What am I doing wrong?
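For reference, here is a minimal sketch that runs the class-based query against the snippet from the question. It assumes the fetched HTML really contains that markup; if the price is injected by JavaScript, no XPath will find it in the raw response. The concat/normalize-space wrapper is a swapped-in variant of the original contains() test that matches current-price as a whole class token, and &euro; is used in the sample only so the result does not depend on how loadHTML() guesses the encoding.

```php
<?php
// Sample markup copied from the question (assumption: the real page contains it).
$html = '<div class="price widget-tooltip widget-tooltip-tl price-breakdown">'
      . '<strong class="current-price">10&euro;</strong>'
      . '</div>';

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Match "current-price" as a whole class token, not as a substring.
$query  = "//div//strong[contains(concat(' ', normalize-space(@class), ' '), ' current-price ')]";
$prices = $xpath->query($query);

// An empty result is a DOMNodeList with length 0, so check length first.
$price = $prices->length > 0 ? $prices->item(0)->nodeValue : null;
echo $price === null ? "no match\n" : $price . "\n";
```

If this works on the sample but not on the live page, dumping $dom->saveHTML() and searching it for current-price shows whether the element is present in the HTML that PHP actually received.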

Related

Xpath query is returning NULL

I am trying to maintain some PHP code which is doing web page scraping. The web page has changed so an update is needed, but I'm not so experienced with Xpath so am struggling.
Basically this is the section of html that is relevant
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-640x640.jpg?context=product-images/h3b/hd3/8796813918238/tea-tree-skin-clearing-foaming-cleanser_1-640x640.jpg" media="(min-width: 641px) and (max-width: 1024)">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-320x320.jpg?context=product-images/h09/h9a/8796814049310/tea-tree-skin-clearing-foaming-cleanser_1-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
I am trying to extract the srcset attribute from the IMG tag which is the value of "/medias/myimage.jpg". I'm using XPATH Helper chrome plugin to help me and I have the following xpath;
//div[@class="carousel-item-wrapper"]/picture/img/@srcset
In the plugin, it returns exact what I expect, so it appears to work fine.
If I also use an online xpath tester http://www.online-toolz.com/tools/xpath-editor.php then it also works OK.
But in my PHP code I get a null value.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
@$dom->loadHtml($html);
$xPath = new DOMXPath($dom);
//Other xPath queries executed OK.
$node = $xPath->query('//div[@class="carousel-item-wrapper"]/picture/img/@srcset')->item(0);
if ($node === NULL)
writelog("Node is NULL"); // <-- Writes NULL to the log file!
I have of course tried a lot of different variations on this, such as not specifying the attribute name, but all with no luck.
What am I doing wrong? I'm sure it must be something simple, but I can't spot it.
Other extracts using my PHP code on the same HTML document are working OK. So it is just this element causing me trouble.
PHP's DOMXPath class seems to have trouble with self-closing tags. You need to add a double forward-slash if you're looking to find a self-closing tag, so your new xPath query should be:
//div[@class="carousel-item-wrapper"]/picture//img/@srcset
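A small sketch of the descendant query on a shortened copy of the markup (the large.jpg and small.jpg file names are placeholders, not from the question). One possible explanation, which is my assumption and not confirmed in this thread, is that the parser does not treat <source> as a void element and so nests <img> inside it, which would make img a descendant of picture but not a direct child. Printing the node path lets you check where the parser really put the element on your PHP/libxml build.

```php
<?php
// Shortened sample; large.jpg and small.jpg are hypothetical file names.
$html = '<div class="carousel-item-wrapper"><picture class="">'
      . '<source srcset="/medias/large.jpg" media="(min-width: 641px)">'
      . '<source srcset="/medias/small.jpg" media="(max-width: 640px)">'
      . '<img srcset="/medias/myimage.jpg" alt="150 ML" class="">'
      . '</picture></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// "//img" accepts img at any depth below picture, not only as a direct child.
$attr   = $xpath->query('//div[@class="carousel-item-wrapper"]/picture//img/@srcset')->item(0);
$srcset = $attr !== null ? $attr->nodeValue : null;
echo $srcset === null ? "no match\n" : $srcset . "\n";

// Diagnostic: show the path the parser assigned to the first <img>.
$img = $xpath->query('//img')->item(0);
echo $img->getNodePath(), "\n";
```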

XPath -> Selecting element with class attribute

I want to get all organic search results from Google.
I need help defining the XPath to exclude the ads. The cite tag on the ads does not contain a class attribute, and the organic results have 2 different class values. My attempts at defining the XPath have failed. The Google results page looks something like this
Ad
<cite>example.com</cite>
Organic Result 1
<cite class="_Rm">example.com/page1.html</cite>
Organic Result 2
<cite class="_Rm bc">example.com > Breadcrumbs > Page2</cite>
Here is my code:
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.google.com/search?q=mortgage&num=100');
$xpath = new DOMXPath($html);
$nodes = $xpath->query('//cite');
foreach ($nodes as $n){
echo $n->nodeValue.'<br />'; // Show all links
}
Please help
Try //cite[@class='_Rm' or @class='_Rm bc']. This will select cite nodes with a class that is either _Rm or _Rm bc.
Assuming the portion of the HTML you want to get is not generated by client-side scripts (usually JavaScript), the following simple XPath will do the job:
$nodes = $xpath->query('//cite[@class]');
The above XPath gets all <cite> tags that have a class attribute with any value.
Otherwise, you need to find a way to run the client-side scripts, so that the HTML can be generated completely before you can apply above XPath query against the HTML.
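As a rough illustration, the //cite[@class] query can be tried on the three sample <cite> tags from the question. The wrapping <div> and the hard-coded string are assumptions made so the sketch runs without fetching Google.

```php
<?php
// The three cite tags from the question, wrapped in a div for parsing.
$html = '<div>'
      . '<cite>example.com</cite>'
      . '<cite class="_Rm">example.com/page1.html</cite>'
      . '<cite class="_Rm bc">example.com &gt; Breadcrumbs &gt; Page2</cite>'
      . '</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Only cite elements that carry a class attribute are selected,
// so the ad (no class) is skipped.
$organic = [];
foreach ($xpath->query('//cite[@class]') as $cite) {
    $organic[] = $cite->nodeValue;
}

print_r($organic);
```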

Getting information from ID's

First off, I'm brand new to PHP so I'm sorry if this is a stupid question, second of all sorry if this title is incorrect.
Now, what I'm trying to do is create an overlay for a game that I play. My code for the overlay works perfectly, and now I'm working on my HTML file which gets its information from a website and outputs it. The code on the website looks like this:
<span id="example1">Information I want</span>
<span id="example2">More Info I want</span>
...
<span id="example3">And some more</span>
Now what I want to do is create a PHP script which goes in and finds elements by their names and gives me the information in those span tags. Here's what I've tried so far, it's not working however (no surprise):
//Some HTML here
<?php
$doc = new DomDocument;
$doc->validateOnParse = true;
$doc->Load('www.website.com');
echo "Example1: " . $doc->getElementById('example1') . "\n";
?>
//More HTML
To be honest, I have no clue what I'm doing. If anyone could show me an example of how to do this properly, or to point me in the right direction I would appreciate it.
The text between open and close tags is a Text Node.
Just write $doc->getElementById('example1')->nodeValue
Your code seems along the right lines, but you're missing a few things.
First of all, your load call is literally looking for a file named "www.website.com". If it's a remote file, you must include the http:// prefix.
Then, you are attempting to echo out the node itself, whereas you want its value (ie. its contents).
Try $doc->getElementById("example1")->nodeValue instead.
That should do it. You may want to add libxml_use_internal_errors(true); so that any errors in the source file won't destroy your page with PHP errors. Also, I would suggest using loadHTMLFile instead of load, as this will be more lenient towards malformed documents.
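Putting those suggestions together, a minimal sketch might look like the following. It parses a hard-coded copy of the spans instead of the remote page (for a live site you would call loadHTMLFile with the full http:// URL), and spanText is a hypothetical helper name introduced here for illustration.

```php
<?php
// Local copy of the markup from the question, so the sketch runs offline.
$html = '<span id="example1">Information I want</span>'
      . '<span id="example2">More Info I want</span>'
      . '<span id="example3">And some more</span>';

// Keep malformed-HTML warnings out of the page output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: returns the text of the element with the given id,
// or null when no such element exists.
function spanText(DOMDocument $doc, string $id): ?string
{
    $node = $doc->getElementById($id);
    return $node === null ? null : $node->nodeValue;
}

echo "Example1: " . spanText($doc, 'example1') . "\n";
echo "Example2: " . spanText($doc, 'example2') . "\n";
```

Checking for null before reading nodeValue avoids a fatal error when the page layout changes and an id disappears.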
You can use getElementById:
$a = $doc->getElementById("example1");
var_dump($a); so you will see what you want to echo or otherwise use.
You can also name all the elements in the HTML example[] and then foreach over the example array, so you can get each element from the example array with just one line of code.

JS, PHP Dynamic Content and Google Crawlers

I have a series of about 25 static sites I created that share the same info, and I was having to change inane bits of copy here and there, so I wrote this JavaScript so that all the sites pull the content from one location (shortened to one example):
var dataLoc = "<?=$resourceLocation?>";
$("#listOne").load(dataLoc+"resources.html #listTypes");
When the page loads, it will find the div id listOne then replace it with the contents of the div in the file resources.html and only the contents of the div labeled listTypes there.
My question: Google is not crawling this dynamic content at all. I am told Google will crawl dynamically imported information, so what I'm curious to find out is what in my current approach needs to be improved.
I assumed JS was simply skipped by the Google spider, so I used PHP to access the same HTML file as before. It partly works, but not how I need it: it returns the text, but I need the markup as well (the <li>, <p>, <img> tags, and so on). Perhaps I could tweak this? (I am not a developer, so I have just tried a few dozen things I read in the PHP online help, and this is as close as I got.)
function parseContents($divID)
{
$page = file_get_contents('content/resources.html');
$doc = new DOMDocument();
@$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div)
{
if ($div->getAttribute('id') === $divID)
{
echo $div->nodeValue;
}
}
}
parseContents('listOfStuff');
Thanks for your help in understanding this a little better, let me know if I need to explain it any better :)
See Making AJAX Applications Crawlable published by Google.
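On the PHP side of the question (getting the markup and not only the text), one option is to serialize the div's child nodes with saveHTML() instead of reading nodeValue. This is a sketch under assumptions: divMarkup is a hypothetical name, and the sample page string stands in for content/resources.html.

```php
<?php
// Hypothetical helper: returns the inner markup of the div with the given id,
// or null when the page has no such div.
function divMarkup(string $page, string $divId): ?string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($page);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $divId) {
            // saveHTML($node) serializes one node with its tags,
            // whereas nodeValue returns only the text content.
            $markup = '';
            foreach ($div->childNodes as $child) {
                $markup .= $doc->saveHTML($child);
            }
            return $markup;
        }
    }
    return null;
}

// Stand-in for file_get_contents('content/resources.html').
$page = '<div id="listOfStuff"><ul><li>One</li><li><img src="a.png" alt="A"></li></ul></div>';
echo divMarkup($page, 'listOfStuff'), "\n";
```

The serializer may insert line breaks between block elements, so the output can differ in whitespace from the source file.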

PHP Scraping using XPath - html5 issue?

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document which matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/@value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on html5 - (as is the internal page) anyone have any experience of this?
If your selector starts with a single slash (/), it means an absolute path from the root. You need to use a double slash (//), which selects all matching elements regardless of their location.
print_r won't work for this. Everything was fine in your code except for actually getting value.
List classes in PHP usually have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
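The same idea with the length check made explicit, run against the pared-down page from the question so it works without fetching LinkedIn (an assumption: the live page may differ from that sample). It shows that print_r() printing an empty-looking DOMNodeList does not mean the query failed.

```php
<?php
// Pared-down page from the question, including the unclosed second input.
$html = '<!DOCTYPE html><html lang="en"><head></head><body><div>'
      . '<form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form></div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$result = $xpath->query("//input[@id='csrfToken-login']/@value");

// Inspect length and item() to see what was matched.
$token = $result->length > 0 ? $result->item(0)->nodeValue : null;
echo "matches: ", $result->length, "\n";
echo $token === null ? "no match\n" : $token . "\n";
```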
DOMXPath looks fine to me.
As for the xpath use descendant-or-self shortcut // to get to the input tag
//input[@id='SomeId']/@value
I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
I hope this helps,
Zachary
