I am trying to maintain some PHP code that scrapes a web page. The page has changed, so an update is needed, but I'm not very experienced with XPath and am struggling.
This is the relevant section of the HTML:
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-640x640.jpg?context=product-images/h3b/hd3/8796813918238/tea-tree-skin-clearing-foaming-cleanser_1-640x640.jpg" media="(min-width: 641px) and (max-width: 1024)">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-320x320.jpg?context=product-images/h09/h9a/8796814049310/tea-tree-skin-clearing-foaming-cleanser_1-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
I am trying to extract the srcset attribute from the IMG tag, whose value is "/medias/myimage.jpg". I'm using the XPath Helper Chrome plugin to help me, and I have the following XPath:
//div[@class="carousel-item-wrapper"]/picture/img/@srcset
In the plugin, it returns exactly what I expect, so it appears to work fine.
If I also use an online xpath tester http://www.online-toolz.com/tools/xpath-editor.php then it also works OK.
But in my PHP code I get a null value.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
@$dom->loadHtml($html);
$xPath = new DOMXPath($dom);
//Other xPath queries executed OK.
$node = $xPath->query('//div[@class="carousel-item-wrapper"]/picture/img/@srcset')->item(0);
if ($node === NULL)
    writelog("Node is NULL"); // <-- Writes NULL to the log file!
I have of course tried a lot of different variations on this, such as not specifying the attribute name, but all with no luck.
What am I doing wrong? I'm sure it must be something simple, but I can't spot it.
Other extracts using my PHP code on the same HTML document are working OK. So it is just this element causing me trouble.
PHP's DOMXPath class seems to have trouble with self-closing tags. Put a double forward slash (the descendant step) in front of the self-closing tag you are looking for, so that it matches at any depth below picture. Your new XPath query should be:
//div[@class="carousel-item-wrapper"]/picture//img/@srcset
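To see what the parser actually built, it can help to print the node path of the img element. The following is a minimal sketch, not the original poster's code: the markup is a trimmed copy of the snippet from the question with shortened srcset and media values, and the variable names are my own. One possible explanation, which is an assumption on my part and not confirmed in this thread, is that some libxml2 versions do not treat source as a void element and so nest the following elements inside it; the printed path would show that if it happens.

```php
<?php
// Trimmed copy of the markup from the question (srcset/media values shortened).
$html = <<<'HTML'
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/a-640x640.jpg" media="(min-width: 641px)">
<source srcset="/medias/a-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Descendant step (//img): matches the image at any depth below <picture>.
$attr = $xpath
    ->query('//div[@class="carousel-item-wrapper"]/picture//img/@srcset')
    ->item(0);
$srcset = $attr !== null ? $attr->nodeValue : null;
echo "srcset: ", var_export($srcset, true), "\n";

// Debugging aid: show where the parser placed <img> in the tree.
// (Assumption: on some libxml2 versions it may sit inside a <source> element.)
$img = $xpath->query('//img')->item(0);
if ($img !== null) {
    echo "img path: ", $img->getNodePath(), "\n";
}
```

If the printed path shows img as a direct child of picture, the child step picture/img should work as well; if it shows one or more source steps in between, only the descendant form will match.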
Related
I know that the same issue has been discussed many times, but I'm really stuck. I am trying to access a part of an HTML DOM tree:
<div class="price widget-tooltip widget-tooltip-tl price-breakdown">
<strong class="current-price">10€</strong>
</div>
I want to get the value of the <strong> node, which would be 10€, so I use the XPath:
//div//strong[contains(@class, 'current-price')]
which works perfectly in Chrome Dev Tools, but not in my PHP script:
// Create DOM object
$dom = new DOMDocument();
@$dom->loadHTML($header['content']);
$xpath = new DOMXPath($dom);
$prices = $xpath->query("//div//strong[contains(@class, 'current-price')]");
I tried different "versions" as suggested here and here (and in many other cases) without success. The problem is that I get nothing back: the result is simply empty.
I isolated the issue and can confirm that it happens only when the class selector comes into play. If I use the path without the class selector, it works fine (returning many elements).
What am I doing wrong?
At the URL
https://itunes.apple.com/us/app/wechat/id414478124?mt=8
there is an image that appears in the HTML in the following manner:
<div class="artwork">
<img class="artwork" width="175" height="175" src="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-load-auto-after-dom-load="" src-swap-high-dpi="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon350x350.png" alt="WeChat">
<span class="mask"></span>
</div>
Now, as you can see, both the <div> and the <img> have the same class name.
I use the following piece of code to extract the src from the image:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://itunes.apple.com/us/app/wechat/id414478124?mt=8');
libxml_clear_errors();
$xp = new DOMXPath($dom);
$image_src = $xp->query("//img[@class='artwork']");
echo $image_src->item(0)->getAttribute('src'). "<br/>";
But it returns only
https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png
which, when opened from the browser address bar, shows only a black page.
This happens because the static HTML served for that page has that address as the image's src. Either run the page through a JavaScript evaluator or read one of the other attributes, such as src-swap.
If you want the JavaScript-rendered page, you can probably use something like PhantomJS. In this case, though, the value is already there under a different attribute, so it is faster not to evaluate the JS at all.
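Reading the alternative attribute could look something like the sketch below. The markup is a hypothetical reconstruction of what the static page might serve (placeholder in src, real icon in src-swap), built from the two URLs quoted in the question; it is parsed from a string so the example does not depend on the live page.

```php
<?php
// Hypothetical static markup: `src` is the placeholder, `src-swap` holds the real icon.
$html = '<div class="artwork">'
      . '<img class="artwork" width="175" height="175"'
      . ' src="https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png"'
      . ' src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png"'
      . ' alt="WeChat"><span class="mask"></span></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp  = new DOMXPath($dom);
$img = $xp->query("//img[@class='artwork']")->item(0);

$image_src = null;
if ($img instanceof DOMElement) {
    // Prefer src-swap; fall back to src when the attribute is absent.
    $image_src = $img->hasAttribute('src-swap')
        ? $img->getAttribute('src-swap')
        : $img->getAttribute('src');
}
echo $image_src, "\n";
```

The fallback to src is a design choice for pages where the image is not lazily swapped; drop it if a placeholder URL would be worse than no URL.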
Is there a way to get and set inline styles of DOM elements inside an HTML fragment with PHP? Example:
<div style="background-color:black"></div>
I need to get whether the background-color is black and if it is, change it to white. (This is an example and not the actual goal)
I tried phpQuery, but it lacks the .css() method, while the branch that implements it doesn't seem to work (at least for me).
Basically, what I need is a port of jQuery's .css() method to PHP.
Per Ryan P's good suggestion above, the PHP DOM functions may help you out. Something like this might do what you want with that particular example.
$my_url = 'index.php';
$dom = new DOMDocument;
$dom->loadHTMLFile($my_url);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
    // Normalise the attribute so 'background-color:black' matches with or without a trailing semicolon
    $div_style = rtrim(trim($div->getAttribute('style')), ';');
    if ($div_style === 'background-color:black') {
        $div->setAttribute('style', 'background-color:white;');
    }
}
echo $dom->saveHTML();
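Comparing the whole style attribute only works when it holds exactly one declaration. A rough approximation of jQuery's .css() for inline styles is to split the attribute into a property map, change one entry, and write it back. The sketch below assumes simple declarations; parseInlineStyle() and buildInlineStyle() are hypothetical helper names, not part of PHP or phpQuery, and splitting on ';' would break on values that themselves contain a semicolon (for example some data: URLs).

```php
<?php
/** Hypothetical helper: split a style attribute into a property => value map. */
function parseInlineStyle(string $style): array
{
    $props = [];
    foreach (explode(';', $style) as $decl) {
        if (strpos($decl, ':') === false) {
            continue; // skip empty or malformed declarations
        }
        [$name, $value] = explode(':', $decl, 2);
        $props[strtolower(trim($name))] = trim($value);
    }
    return $props;
}

/** Hypothetical helper: turn the map back into a style attribute string. */
function buildInlineStyle(array $props): string
{
    $parts = [];
    foreach ($props as $name => $value) {
        $parts[] = "$name:$value";
    }
    return implode(';', $parts);
}

$dom = new DOMDocument;
$dom->loadHTML('<div style="background-color: black; color:red"></div>');

foreach ($dom->getElementsByTagName('div') as $div) {
    $style = parseInlineStyle($div->getAttribute('style'));
    // Only touch elements whose background is exactly "black".
    if (($style['background-color'] ?? null) === 'black') {
        $style['background-color'] = 'white';
        $div->setAttribute('style', buildInlineStyle($style));
    }
}

$newStyle = $dom->getElementsByTagName('div')->item(0)->getAttribute('style');
echo $newStyle, "\n";
```

Other declarations on the element (color in this example) survive the round trip, which the plain string comparison cannot guarantee.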
I'm trying to use preg_match to grab image URLs from another page, but the problem is that my PHP code always returns an empty result! I'm new to PHP.
Here is the structure of the image on that page...
<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>
My current code is:
preg_match_all('/a class="prs_link" href="([^"]+)"><img src=.+?><\/a>/',$screen,$results);
You will see literally hundreds of Q&As here on SO cautioning coders against using regex to parse HTML. There is a good reason for those comments and answers, so please heed them and avoid looking for a regex solution to parse HTML.
Here is one recommended way of parsing HTML (using DOM):
$html = '<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//a[@class='prs_link']/img/@src)");
echo "src=[$src]\n";
Output:
src=[THE IMAGE URL I WANT TO GET]
I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
I'm assuming this means it was unable to find anything in the document that matches my selector. I've tried a number of variations, just to see if I can get something back:
/input/#value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible because the parser chokes on HTML5 (which the internal page also uses). Does anyone have any experience of this?
If your selector starts with a single slash (/), it is an absolute path from the document root. You need to use a double slash (//), which selects all matching elements regardless of their location.
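The difference between the two forms can be checked against a cut-down version of the sample page from the question. This is a minimal sketch; the markup is shortened and the variable names are my own.

```php
<?php
// Cut-down version of the sample document from the question.
$html = '<!DOCTYPE html><html lang="en"><head></head><body>'
      . '<div><form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form></div></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// "/input" only matches an <input> that is the root element, so it finds nothing here.
$absolute = $xpath->query('/input');

// "//input" matches <input> elements anywhere in the document.
$descendant = $xpath->query('//input');

// An absolute path does work when every step from the root is spelled out.
$fullPath = $xpath->query('/html/body/div/form/input');

echo "/input: {$absolute->length}\n";
echo "//input: {$descendant->length}\n";
echo "/html/body/div/form/input: {$fullPath->length}\n";
```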
print_r won't show you the matched nodes here. Everything in your code was fine except for actually reading the value.
List classes in PHP's DOM extension, such as DOMNodeList, have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
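A slightly more defensive variant checks length before touching item(0), since item(0) is null when nothing matched. This is a sketch under assumptions: it parses a hypothetical one-line form (using the input from the question) instead of fetching the live login page, and $nodes and $token are my own names.

```php
<?php
// Hypothetical stand-in for the fetched page: only the input of interest.
$html = '<form><input type="hidden" name="csrfToken"'
      . ' value="ajax:3575644127378754050" id="csrfToken-login"></form>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

$token = null;
// query() returns false for an invalid expression and an empty list for no match.
if ($nodes !== false && $nodes->length > 0) {
    $token = $nodes->item(0)->nodeValue;   // the attribute's value as a string
}

echo "matches: ", ($nodes === false ? 0 : $nodes->length), "\n";
echo "token: ", var_export($token, true), "\n";
```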
DOMXPath looks fine to me.
As for the xpath use descendant-or-self shortcut // to get to the input tag
//input[@id='SomeId']/@value
I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
I hope this helps,
Zachary