I'm trying to use preg_match to grab image URLs from another page, but my PHP code always returns an empty result. I'm new to PHP.
Here is the structure of the image markup on that page:
<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>
My current code is:
preg_match_all('/a class="prs_link" href="([^"]+)"><img src=.+?><\/a>/',$screen,$results);
You will see literally hundreds of Q&As here on SO cautioning coders against using regex to parse HTML. There are good reasons behind those comments and answers, so please heed them and avoid looking for a regex solution to parse HTML.
Here is one recommended way of parsing HTML (using DOM):
$html = '<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//a[@class='prs_link']/img/@src)");
echo "src=[$src]\n";
Output:
src=[THE IMAGE URL I WANT TO GET]
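Since the question used preg_match_all, it presumably wants every matching image rather than just the first. A minimal sketch of that, assuming the page contains several links of the same shape; the two sample links and the file names one.jpg and two.jpg are made up for illustration:

```php
<?php
// Hypothetical sample markup: two links shaped like the snippet in the question.
$html = '<a class="prs_link" href="a.html"><img src="one.jpg" width="310" height="196"></a>'
      . '<a class="prs_link" href="b.html"><img src="two.jpg" width="310" height="196"></a>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList holding every matching src attribute node.
$sources = [];
foreach ($xpath->query("//a[@class='prs_link']/img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

Here evaluate("string(...)") would only give the first match, while query() lets you loop over all of them.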
Related
I am trying to maintain some PHP code that does web page scraping. The web page has changed, so an update is needed, but I'm not very experienced with XPath and am struggling.
Basically this is the section of html that is relevant
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-640x640.jpg?context=product-images/h3b/hd3/8796813918238/tea-tree-skin-clearing-foaming-cleanser_1-640x640.jpg" media="(min-width: 641px) and (max-width: 1024)">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-320x320.jpg?context=product-images/h09/h9a/8796814049310/tea-tree-skin-clearing-foaming-cleanser_1-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
I am trying to extract the srcset attribute from the img tag, which has the value "/medias/myimage.jpg". I'm using the XPath Helper Chrome plugin to help me, and I have the following XPath:
//div[@class="carousel-item-wrapper"]/picture/img/@srcset
In the plugin, it returns exactly what I expect, so it appears to work fine.
If I also use an online xpath tester http://www.online-toolz.com/tools/xpath-editor.php then it also works OK.
But in my PHP code I get a null value.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
@$dom->loadHtml($html);
$xPath = new DOMXPath($dom);
//Other xPath queries executed OK.
$node = $xPath->query('//div[@class="carousel-item-wrapper"]/picture/img/@srcset')->item(0);
if ($node === NULL)
writelog("Node is NULL"); // <-- Writes NULL to the log file!
I have of course tried a lot of different variations on this, such as not specifying the attribute name, but all with no luck.
What am I doing wrong? I'm sure it must be something simple, but I can't spot it.
Other extracts using my PHP code on the same HTML document are working OK. So it is just this element causing me trouble.
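PHP's DOMXPath class seems to have trouble with self-closing tags here. A likely explanation is that the underlying HTML parser does not treat the preceding source tags as self-closing, so the img ends up nested inside them instead of being a direct child of picture. Use a double forward slash, which matches descendants at any depth, to find the tag. Your new XPath query should be:
//div[@class="carousel-item-wrapper"]/picture//img/@srcset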
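A minimal sketch of the descendant query against a trimmed copy of the markup (the srcset values large.jpg and small.jpg are shortened placeholders, not the originals). It also prints the node path of the img, which is one way to check how the parser actually nested it; that path may differ between libxml versions, so no output is claimed for it:

```php
<?php
// Trimmed, hypothetical copy of the carousel markup from the question.
$html = '<div class="carousel-item-wrapper"><picture class="">'
      . '<source srcset="/medias/large.jpg" media="(min-width: 641px)">'
      . '<source srcset="/medias/small.jpg" media="(max-width: 640px)">'
      . '<img srcset="/medias/myimage.jpg" alt="150 ML" class="">'
      . '</picture></div>';

libxml_use_internal_errors(true);   // the parser may warn about tags it does not know
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);

// "//img" matches the img at any depth below picture.
$srcset = $xPath->evaluate(
    'string(//div[@class="carousel-item-wrapper"]/picture//img/@srcset)'
);
echo $srcset, "\n";

// Diagnostic: show where the img really sits in the parsed tree.
$img = $xPath->query('//img')->item(0);
echo $img->getNodePath(), "\n";
```

If the printed path shows source elements between picture and img, that would confirm why the direct-child step picture/img found nothing.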
On the page
https://itunes.apple.com/us/app/wechat/id414478124?mt=8
there is an image that appears in the HTML as follows:
<div class="artwork">
<img class="artwork" width="175" height="175" src="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-load-auto-after-dom-load="" src-swap-high-dpi="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon350x350.png" alt="WeChat">
<span class="mask"></span>
</div>
As you can see, both the <div> and the <img> have the same class name.
I use the following code to extract the src from the image:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://itunes.apple.com/us/app/wechat/id414478124?mt=8');
libxml_clear_errors();
$xp = new DOMXPath($dom);
$image_src = $xp->query("//img[@class='artwork']");
echo $image_src->item(0)->getAttribute('src'). "<br/>";
But it only returns
https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png
which, when opened from the browser address bar, shows only a black page.
That is because the static HTML page has that address as the src; presumably the real URL is swapped in later by JavaScript. Either run the page through a JavaScript evaluator or read one of the other attributes, such as src-swap.
If you want the JavaScript-rendered page, you could probably use something like PhantomJS. In this case, though, the URL is already in the static HTML under a different attribute, so it is faster to skip evaluating the JS altogether.
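A minimal sketch of reading src-swap instead of src. The markup below is a made-up stand-in for the static page (the example.com URLs are placeholders), on the assumption that src holds a placeholder image and src-swap holds the real icon:

```php
<?php
// Hypothetical static markup: src is a placeholder, src-swap is the real icon.
$html = '<div class="artwork">'
      . '<img class="artwork" width="175" height="175"'
      . ' src="https://example.com/placeholder.png"'
      . ' src-swap="https://example.com/icon175x175.png" alt="WeChat">'
      . '<span class="mask"></span></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp  = new DOMXPath($dom);
$img = $xp->query("//img[@class='artwork']")->item(0);

// Prefer src-swap when it is present; otherwise fall back to src.
$imageSrc = $img->hasAttribute('src-swap')
    ? $img->getAttribute('src-swap')
    : $img->getAttribute('src');

echo $imageSrc, "\n";
```

The fallback keeps the code working on pages where the image is not lazily swapped.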
With PHP's file_get_contents() I want to get only the post and image, but it fetches the whole page. (I know there are other ways to do this.)
Example:
$homepage = file_get_contents('http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5',
true);
echo $homepage;
It shows the full page. Is there any way to show only the post for cid=2&id=221107&hb=5?
Thanks a lot.
Use PHP's DomDocument to parse the page. You can filter it more if you wish, but this is the general idea.
$url = 'http://www.bdnews24.com/details.php?cid=2&id=221107&hb=5';
// Create new DomDocument
$doc = new DomDocument();
$doc->loadHTMLFile($url);
// Get the post
$post = $doc->getElementById('opage_mid_left');
var_dump($post);
Update:
Unless the image is a requirement, I'd use the printer-friendly version: http://www.bdnews24.com/pdetails.php?id=221107, it's much cleaner.
You will need to parse the resulting HTML using a DOM parser to get the HTML of only the part you want. I like PHP Simple HTML DOM Parser, but as Paul pointed out, PHP also has its own.
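A rough sketch of getting "the HTML of only the part you want" with PHP's built-in DOM classes. The page below is invented for illustration; only the id opage_mid_left comes from the earlier answer, and the real page may be structured differently:

```php
<?php
// Hypothetical page: a header, the post container, and a footer.
$html = '<html><body><div id="header">Site menu</div>'
      . '<div id="opage_mid_left"><h1>Headline</h1>'
      . '<img src="photo.jpg" alt=""><p>Story text.</p></div>'
      . '<div id="footer">Footer</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Look up the container, then serialize just that node back to HTML.
$post     = $doc->getElementById('opage_mid_left');
$postHtml = $post !== null ? $doc->saveHTML($post) : '';

echo $postHtml, "\n";
```

Passing a node to saveHTML() serializes that subtree only, so the header and footer are left out of the result.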
you can extract the
<div id="page">
//POST AND IMAGE EXIST HERE
</div>
part from the fetched contents using regex and push it on your page...
I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document that matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/@value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on html5 - (as is the internal page) anyone have any experience of this?
If your selector starts with a single slash (/), it is an absolute path from the root. You need to use a double slash (//), which selects all matching elements regardless of their location.
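To make the difference concrete, here is a small sketch on a cut-down copy of the sample page that counts matches for an absolute path, a descendant path, and the fully spelled-out path:

```php
<?php
// Cut-down copy of the sample page from the question.
$html = '<!DOCTYPE html><html lang="en"><head></head><body><div>'
      . '<form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '</form></div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// "/input" asks for an input that is the document's root element.
$absolute   = $xpath->query('/input')->length;
// "//input" asks for input elements anywhere in the document.
$descendant = $xpath->query('//input')->length;
// An absolute path works once every step from the root is spelled out.
$fullPath   = $xpath->query('/html/body/div/form/input')->length;

printf("/input: %d, //input: %d, full path: %d\n", $absolute, $descendant, $fullPath);
```

The root element of an HTML document is html, so /input and /div cannot match anything, while //input finds the element wherever it sits.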
print_r won't work for this. Everything was fine in your code except for actually getting value.
List classes in PHP usually have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
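A minimal sketch of the length check suggested above, run against the form from the question as an inline string rather than the live page, so it makes no claim about what LinkedIn currently serves:

```php
<?php
// The form from the question, as an inline string.
$html = '<form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// query() returns false for an invalid expression and an empty list for no match,
// so check both before calling item(0).
$token = null;
if ($nodes !== false && $nodes->length > 0) {
    $token = $nodes->item(0)->nodeValue;
}

echo $token === null ? "no match\n" : "token=[$token]\n";
```

Checking length first avoids calling a property on null when the selector matches nothing.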
DOMXPath looks fine to me.
As for the xpath use descendant-or-self shortcut // to get to the input tag
//input[@id='SomeId']/@value
I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
I hope this helps,
Zachary
I'm not sure whether this is possible.
I want a PHP script that, when executed, goes to a page (on a different domain), gets its HTML contents, and extracts the href of each link inside that HTML.
html code:
<div id="somediv">
<a href="http://yahoo.com">Yahoo</a>
<a href="http://google.com">Google</a>
<a href="http://facebook.com">Facebook</a>
</div>
The output code(which php will echo out) will be
http://yahoo.com
http://google.com
http://facebook.com
I have heard that cURL in PHP can do something like this, but not exactly this. I'm a bit confused, and I hope someone can guide me.
Thanks.
have a look at something like http://simplehtmldom.sourceforge.net/
Using DOM and XPath:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.example.com/"); // or you could load from a string using loadHTML();
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//div[@id='somediv']//a");
foreach($elements as $elem){
echo $elem->getAttribute('href') . "\n"; // one href per line
}
BTW: you should read up on DOM and XPath.
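Since the question mentions cURL, here is one way the two pieces might fit together: cURL only fetches the page, and DOM plus XPath does the extraction. The helper names fetchHtml and extractLinks are made up for this sketch, and the sample markup stands in for the remote page so the example runs offline:

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: collect the href of every link inside the div with the given id.
// $divId is inserted into the XPath as-is, so only pass trusted values.
function extractLinks(string $html, string $divId): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query("//div[@id='" . $divId . "']//a[@href]") as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Inline sample so the sketch runs without network access.
$sample = '<div id="somediv">'
        . '<a href="http://yahoo.com">Yahoo</a>'
        . '<a href="http://google.com">Google</a>'
        . '<a href="http://facebook.com">Facebook</a>'
        . '</div>';

$links = extractLinks($sample, 'somediv');
echo implode("\n", $links), "\n";
```

For a live page, you would pass the result of fetchHtml($url) to extractLinks() after checking that it is not null.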