Retrieve img src from itunes page - php

On the page
https://itunes.apple.com/us/app/wechat/id414478124?mt=8
there is an image that appears in the HTML in the following manner:
<div class="artwork">
<img class="artwork" width="175" height="175" src="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-load-auto-after-dom-load="" src-swap-high-dpi="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon350x350.png" alt="WeChat">
<span class="mask"></span>
</div>
Now as you can see, both the <div> and the <img> have the same class name.
I use this following piece of code to extract the src from the image
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://itunes.apple.com/us/app/wechat/id414478124?mt=8');
libxml_clear_errors();
$xp = new DOMXPath($dom);
$image_src = $xp->query("//img[@class='artwork']");
echo $image_src->item(0)->getAttribute('src'). "<br/>";
But it only returns
https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png
which, when opened through the browser address bar, shows only a black page.

This happens because the static HTML page has that placeholder address in the src attribute; the real icon URL only appears there once the page's JavaScript has run. Either run the page through a JavaScript evaluator or read one of the other attributes, such as src-swap.
If you want the JavaScript-rendered page, you could probably use something like PhantomJS, but since the URL is already present in a different attribute, it is faster to skip evaluating the JS altogether.
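A minimal sketch of the second option, reading src-swap and falling back to src. To stay self-contained it parses an inline copy of the markup from the question instead of fetching the live page, and it assumes the static HTML carries the placeholder in src and the real URL in src-swap, as described above.

```php
<?php
// Inline sample of the static markup (assumption: src holds the placeholder,
// src-swap holds the real icon URL, as the question and answer describe).
$html = '<div class="artwork">'
    . '<img class="artwork" width="175" height="175"'
    . ' src="https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png"'
    . ' src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png"'
    . ' alt="WeChat">'
    . '<span class="mask"></span></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp  = new DOMXPath($dom);
$img = $xp->query("//img[@class='artwork']")->item(0);

// getAttribute() returns an empty string when the attribute is missing,
// so fall back to the plain src in that case.
$image_src = $img->getAttribute('src-swap') ?: $img->getAttribute('src');
echo $image_src, "\n";
```

For the live page, swapping loadHTML($html) for the loadHTMLFile() call from the question should behave the same way, provided the page still uses these attributes.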

Related

How do I save as a HTML fragment, not as full DOM model

Here's the issue: I have a web page that saves HTML fragments on the server side. In PHP, when I load a fragment into the DOMDocument parser, add a custom attribute to a given element, and save the HTML to a file, the output also contains html, body, and other unnecessary elements. That is a problem because the fragment goes back to the browser to be inserted into the existing DOM, where it would be invalid (you cannot have nested HTML/BODY). Here's a quick example of my code:
$html="<div><magic></magic>
<video controls></video>
<a href='http://example.com'>Example</a><br>
<a href='http://google.com'>Google</a><br></div>
";
$dom = new DOMDocument();
$dom->loadHTML($html);
$html=$dom->C14N();
echo $html;
But it shows:
<html>
<body>
<div>
<magic></magic>
<video controls=""></video>
<a href="http://example.com">Example</a>
<br></br>
<a href="http://google.com">Google</a>
<br></br>
</div>
</body>
</html>
How do I save just the fragmented HTML? I came up with $dom->C14N() but it still adds html and body tags. It also adds </br> but that's no big deal.
At this point, I am resorting to preg_replace to remove html and body tags but it would be nice if there's a way to save it as a fragment.
You need to pass these libxml flags when loading the HTML, and then save it with saveHTML():
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html=$dom->saveHTML();
See PHP documentation:
LIBXML_HTML_NOIMPLIED (integer)
Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
LIBXML_HTML_NODEFDTD (integer)
Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
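A short sketch of the whole round trip under those flags: load a fragment, add a custom attribute, and save it without the html/body wrapper. The attribute name data-saved is a hypothetical example, and the sketch assumes the fragment has a single root element, as in the question.

```php
<?php
// Fragment with a single root element, modelled on the question's example.
$html = "<div><a href='http://example.com'>Example</a><br></div>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// With NOIMPLIED, the root <div> is the document element.
// 'data-saved' is a hypothetical custom attribute used for illustration.
$dom->documentElement->setAttribute('data-saved', '1');

$fragment = $dom->saveHTML();
echo $fragment;
```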

Xpath query is returning NULL

I am trying to maintain some PHP code which is doing web page scraping. The web page has changed so an update is needed, but I'm not so experienced with Xpath so am struggling.
Basically this is the section of html that is relevant
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-640x640.jpg?context=product-images/h3b/hd3/8796813918238/tea-tree-skin-clearing-foaming-cleanser_1-640x640.jpg" media="(min-width: 641px) and (max-width: 1024)">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-320x320.jpg?context=product-images/h09/h9a/8796814049310/tea-tree-skin-clearing-foaming-cleanser_1-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
I am trying to extract the srcset attribute from the IMG tag which is the value of "/medias/myimage.jpg". I'm using XPATH Helper chrome plugin to help me and I have the following xpath;
//div[@class="carousel-item-wrapper"]/picture/img/@srcset
In the plugin, it returns exactly what I expect, so it appears to work fine.
If I also use an online xpath tester http://www.online-toolz.com/tools/xpath-editor.php then it also works OK.
But in my PHP code I get a null value.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
@$dom->loadHtml($html);
$xPath = new DOMXPath($dom);
//Other xPath queries executed OK.
$node = $xPath->query('//div[@class="carousel-item-wrapper"]/picture/img/@srcset')->item(0);
if ($node === NULL)
writelog("Node is NULL"); // <-- Writes NULL to the log file!
I have of course tried a lot of different variations on this, such as not specifying the attribute name, but all with no luck.
What am I doing wrong? I'm sure it must be something simple, but I can't spot it.
Other extracts using my PHP code on the same HTML document are working OK. So it is just this element causing me trouble.
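PHP's DOMXPath class seems to have trouble with self-closing tags here. A likely cause (an assumption, not verified against this page) is that the underlying libxml HTML parser does not treat <source> as a void element, so the <img> ends up nested inside the <source> elements instead of being a direct child of <picture>. A double forward slash matches the <img> at any depth below <picture>, so your new XPath query should be:
//div[@class="carousel-item-wrapper"]/picture//img/@srcset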
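A self-contained sketch of that query, run against a shortened copy of the markup from the question (the two srcset values on the <source> elements are shortened placeholders, not the real paths). Using evaluate() with string() returns the attribute value directly, so there is no node to check for NULL.

```php
<?php
// Shortened copy of the question's markup; the <source> srcset values are placeholders.
$html = '<div class="carousel-item-wrapper">'
    . '<picture class="">'
    . '<source srcset="/medias/large.jpg" media="(min-width: 641px)">'
    . '<source srcset="/medias/small.jpg" media="(max-width: 640px)">'
    . '<img srcset="/medias/myimage.jpg" alt="150 ML" class="">'
    . '</picture>'
    . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // <picture> is unknown to the HTML4 parser; suppress the warning
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);

// picture//img matches the <img> whether it is a direct child of <picture>
// or nested deeper by the parser.
$srcset = $xPath->evaluate(
    'string(//div[@class="carousel-item-wrapper"]/picture//img/@srcset)'
);
echo $srcset, "\n";
```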

Find iFrame in HTML and check its SRC

I have a website where users can insert text, images, and also YouTube videos into a CKEditor-type textarea form.
YouTube videos are embedded with iframe elements, but I don't want users to be able to insert any iframe other than YouTube ones (I am sure you can guess why).
So when the form is submitted I want to scan the $text variable for all iframes and remove those that do not point to youtube.com or youtube-nocookie.com.
These are iFrames with allowed sources:
<iframe allowfullscreen="" frameborder="0" height="360" src="//www.youtube.com/embed/6dk-5HN4fvg" width="640"></iframe>
<iframe allowfullscreen="" frameborder="0" height="360" src="//www.youtube-nocookie.com/embed/IY37l4PDsao" width="640"></iframe>
The task:
find the iFrame
find the value of its SRC
check if it is an allowed domain
if not delete it, or disable it, but preserve the rest of the surrounding HTML
check if there is another
Here is one way of utilizing DOM and XPath to achieve this task.
$doc = new DOMDocument;
@$doc->loadHTML($html);
$doc->removeChild($doc->doctype);
$xp = new DOMXPath($doc);
$tag = $xp->query("//iframe[not(contains(@src, 'youtube.com') or
contains(@src, 'youtube-nocookie.com'))]");
foreach ($tag as $t) {
$t->parentNode->removeChild($t);
}
echo $doc->saveHTML();
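Note that contains() matches the domain anywhere in the src value, so a URL that merely mentions youtube.com in its query string would also pass. A stricter variant, sketched below, compares the parsed host against a whitelist. The $allowedHosts list and the sample $text are assumptions for illustration, and the sketch returns only the contents of <body> so the surrounding HTML is preserved without the wrapper elements.

```php
<?php
// Hypothetical whitelist of allowed iframe hosts.
$allowedHosts = ['www.youtube.com', 'youtube.com',
                 'www.youtube-nocookie.com', 'youtube-nocookie.com'];

// Sample submitted text: one allowed iframe, one that only mentions youtube.com.
$text = '<p>Intro</p>'
    . '<iframe src="//www.youtube.com/embed/6dk-5HN4fvg" width="640"></iframe>'
    . '<iframe src="http://evil.example/?ref=youtube.com"></iframe>'
    . '<p>Outro</p>';

$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($text);
libxml_clear_errors();

$xp = new DOMXPath($doc);
foreach ($xp->query('//iframe') as $iframe) {
    // parse_url() also handles scheme-relative URLs such as //www.youtube.com/...
    $host = parse_url($iframe->getAttribute('src'), PHP_URL_HOST);
    if (!is_string($host) || !in_array(strtolower($host), $allowedHosts, true)) {
        $iframe->parentNode->removeChild($iframe);
    }
}

// Serialize only the children of <body> to get the cleaned fragment back.
$clean = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $clean .= $doc->saveHTML($child);
}
echo $clean, "\n";
```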

Can I use Zend Dom Query to query other html attributes such as ALT

Using Zend Dom Query I would like to search HTML to find certain attributes.
Take the following image as an example.
<img id="active-main-image" src="/images/example.jpg" alt="Image 1234" class="product-image">
Instead of using $this->_xhtml->query('img#active-main-image'); I would like to find the image by using the alt attribute.
Pseudo -> $this->_xhtml->query('img alt Image 1234');
I can see why this method is not conventionally popular; however, when equipped with nothing but the alt value of a certain image on a page, I see no alternative.
Thanks
Zend_Dom_Query has a queryXpath method that will accept valid xpath queries.
Untested but this should work :
$dom = new Zend_Dom_Query($html);
$img = $dom->queryXpath("//img[@alt='Image 1234']");

How to get image url using preg_match

I'm trying to use preg_match to grab image URLs from another page, but the problem is that my PHP code always returns an empty result! I'm new to PHP.
Here is the structure of the image on that page...
<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>
My current code is:
preg_match_all('/a class="prs_link" href="([^"]+)"><img src=.+?><\/a>/',$screen,$results);
You will see literally hundreds of Q&As here on SO cautioning coders against using regex to parse HTML. There is good reason for those comments and answers, so please heed them and avoid regex solutions for parsing HTML.
Here is one recommended way of parsing HTML (using DOM):
$html = '<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//a[@class='prs_link']/img/@src)");
echo "src=[$src]\n";
Output:
src=[THE IMAGE URL I WANT TO GET]
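Since the question uses preg_match_all, the page presumably contains several such links. The string() form above returns only the first match; a sketch for collecting every URL could look like this (the two sample links and their file names are made up for illustration):

```php
<?php
// Two sample links in the structure from the question; the URLs are placeholders.
$html = '<a class="prs_link" href="xxxx"><img src="one.jpg" width="310" height="196"></a>'
      . '<a class="prs_link" href="yyyy"><img src="two.jpg" width="310" height="196"></a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() returns every matching src attribute node, not just the first one.
$urls = [];
foreach ($xpath->query("//a[@class='prs_link']/img/@src") as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```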
