Find iFrame in HTML and check its SRC - php

I have a website where users can insert a YouTube video, along with other content such as text and images, into a CKEditor-type textarea form.
YouTube videos are embedded as iFrame objects, but I don't want users to be able to insert any iFrame other than YouTube's (I am sure you can guess why).
So when the form is submitted, I want to scan the $text variable for all iFrames and remove any iFrame tags that do not point to youtube.com or youtube-nocookie.com.
These are iFrames with allowed sources:
<iframe allowfullscreen="" frameborder="0" height="360" src="//www.youtube.com/embed/6dk-5HN4fvg" width="640"></iframe>
<iframe allowfullscreen="" frameborder="0" height="360" src="//www.youtube-nocookie.com/embed/IY37l4PDsao" width="640"></iframe>
The task:
1. find the iFrame
2. find the value of its SRC
3. check if it is an allowed domain
4. if not, delete it or disable it, but preserve the rest of the surrounding HTML
5. check if there is another

Here is one way of utilizing DOM and XPath to achieve this task.
$doc = new DOMDocument;
// @ suppresses the warnings libxml raises on imperfect user HTML
@$doc->loadHTML($html);
$doc->removeChild($doc->doctype);
$xp = new DOMXPath($doc);
// Select every iframe whose src mentions neither allowed domain
$tag = $xp->query("//iframe[not(contains(@src, 'youtube.com') or
    contains(@src, 'youtube-nocookie.com'))]");
foreach ($tag as $t) {
    $t->parentNode->removeChild($t);
}
echo $doc->saveHTML();
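Note that contains(@src, 'youtube.com') also matches a URL such as http://evil.example/?x=youtube.com, because it only looks for the substring. A stricter variant could compare the parsed host against an allow list instead. The following is only a sketch under some assumptions: the function names isAllowedIframeSrc and stripForeignIframes are hypothetical, the wrapper div is just a trick to get the fragment back without the html/body tags libxml adds, and it assumes PHP 7 or later and input that DOMDocument can read in its default encoding.

```php
<?php
// Hypothetical helper: true only when the src host is an allowed domain or a subdomain of one.
function isAllowedIframeSrc(string $src, array $allowedHosts): bool
{
    // Give protocol-relative URLs ("//www.youtube.com/...") a scheme before parsing.
    if (strpos($src, '//') === 0) {
        $src = 'https:' . $src;
    }
    $host = parse_url($src, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative, empty, or malformed src
    }
    $host = strtolower($host);
    foreach ($allowedHosts as $allowed) {
        $suffix = '.' . $allowed;
        if ($host === $allowed || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: removes every iframe whose src host is not allowed.
function stripForeignIframes(string $html, array $allowedHosts): string
{
    $doc = new DOMDocument;
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName('iframe')) as $iframe) {
        if (!isAllowedIframeSrc($iframe->getAttribute('src'), $allowedHosts)) {
            $iframe->parentNode->removeChild($iframe);
        }
    }

    // Serialize only the children of the wrapper div (the first div in the document).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

It would be called as stripForeignIframes($text, ['youtube.com', 'youtube-nocookie.com']) on the submitted form value.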

Related

Xpath query is returning NULL

I am trying to maintain some PHP code which is doing web page scraping. The web page has changed so an update is needed, but I'm not so experienced with Xpath so am struggling.
Basically this is the section of html that is relevant
<div class="carousel-item-wrapper">
<picture class="">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-640x640.jpg?context=product-images/h3b/hd3/8796813918238/tea-tree-skin-clearing-foaming-cleanser_1-640x640.jpg" media="(min-width: 641px) and (max-width: 1024)">
<source srcset="/medias/tea-tree-skin-clearing-foaming-cleanser-1-320x320.jpg?context=product-images/h09/h9a/8796814049310/tea-tree-skin-clearing-foaming-cleanser_1-320x320.jpg" media="(max-width: 640px)">
<img srcset="/medias/myimage.jpg" alt="150 ML" class="">
</picture>
</div>
I am trying to extract the srcset attribute from the IMG tag which is the value of "/medias/myimage.jpg". I'm using XPATH Helper chrome plugin to help me and I have the following xpath;
//div[@class="carousel-item-wrapper"]/picture/img/@srcset
In the plugin, it returns exactly what I expect, so it appears to work fine.
If I also use an online xpath tester http://www.online-toolz.com/tools/xpath-editor.php then it also works OK.
But in my PHP code I get a null value.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
@$dom->loadHtml($html);
$xPath = new DOMXPath($dom);
//Other xPath queries executed OK.
$node = $xPath->query('//div[@class="carousel-item-wrapper"]/picture/img/@srcset')->item(0);
if ($node === NULL)
    writelog("Node is NULL"); // <-- Writes NULL to the log file!
I have of course tried a lot of different variations on this, such as not specifying the attribute name, but all with no luck.
What am I doing wrong? I'm sure it must be something simple, but I can't spot it.
Other extracts using my PHP code on the same HTML document are working OK. So it is just this element causing me trouble.
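PHP's DOMXPath class seems to have trouble with the unclosed tags here. A likely reason is that the libxml HTML parser behind DOMDocument predates HTML5 and may not treat <source> as a void element, so the <img> can end up nested inside the <source> elements instead of being a direct child of <picture>. Use a double forward slash (the descendant axis) so the img is found at any depth, so your new xPath query should be:
//div[@class="carousel-item-wrapper"]/picture//img/@srcset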
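One way to check where the parser actually put the img is to print its node path. The snippet below is a rough sketch with a shortened copy of the markup from the question (the two source URLs are abbreviated placeholders). It reads the attribute through the descendant axis and then prints the path returned by getNodePath(); the exact path depends on the libxml version, so no expected path is shown.

```php
<?php
// Shortened stand-in for the markup in the question.
$html = '<div class="carousel-item-wrapper"><picture class="">'
      . '<source srcset="/medias/a-640x640.jpg" media="(min-width: 641px)">'
      . '<source srcset="/medias/a-320x320.jpg" media="(max-width: 640px)">'
      . '<img srcset="/medias/myimage.jpg" alt="150 ML" class="">'
      . '</picture></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);

// Descendant axis: matches the img at any depth below the wrapper div.
$srcset = $xPath->evaluate('string(//div[@class="carousel-item-wrapper"]//img/@srcset)');

// getNodePath() shows the img's real position in the parsed tree.
$img = $xPath->query('//img')->item(0);
$path = $img->getNodePath();

echo $srcset, "\n", $path, "\n";
```

If the printed path contains source steps between picture and img, that would explain why the direct-child query /picture/img finds nothing.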

Retrieve img src from itunes page

In the url
https://itunes.apple.com/us/app/wechat/id414478124?mt=8
there's the image which is in the html in this following manner
<div class="artwork">
<img class="artwork" width="175" height="175" src="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" src-load-auto-after-dom-load="" src-swap-high-dpi="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon350x350.png" alt="WeChat">
<span class="mask"></span>
</div>
Now as you can see, both the <div> and the <img> have the same class name.
I use this following piece of code to extract the src from the image
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://itunes.apple.com/us/app/wechat/id414478124?mt=8');
libxml_clear_errors();
$xp = new DOMXPath($dom);
$image_src = $xp->query("//img[@class='artwork']");
echo $image_src->item(0)->getAttribute('src'). "<br/>";
But it returns me only
https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png
which, when opened in the browser address bar, shows only a black page.
That is because the static HTML of the page has that address as the src; the real image is presumably swapped in later by JavaScript. Either run the page through a JavaScript evaluator or read one of the other attributes, such as src-swap.
If you want the JavaScript-rendered page, you can probably use something like PhantomJS. In this case, though, the answer is already in the static HTML under a different attribute, so it is faster to skip evaluating the JS altogether.
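Reading src-swap could look roughly like this. The inline $html string is an assumption: it stands in for the downloaded page and combines the placeholder src the question reports with the src-swap value from the markup shown there. The fallback to src is also just an illustrative choice.

```php
<?php
// Stand-in for the static page: placeholder in src, real icon in src-swap.
$html = '<div class="artwork"><img class="artwork" width="175" height="175" '
      . 'src="https://s.mzstatic.com/htmlResources/1583/frameworks/images/p.png" '
      . 'src-swap="http://a3.mzstatic.com/us/r30/Purple1/v4/64/d2/e1/64d2e14d-9339-32f0-9382-77c158a90941/icon175x175.png" '
      . 'alt="WeChat"><span class="mask"></span></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Restricting the query to img elements avoids the div that shares the class name.
$img = $xp->query("//img[@class='artwork']")->item(0);

$imageSrc = null;
if ($img instanceof DOMElement) {
    // getAttribute() returns '' for a missing attribute, so ?: falls back to src.
    $imageSrc = $img->getAttribute('src-swap') ?: $img->getAttribute('src');
}
echo $imageSrc, "\n";
```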

youtube embed code with PHP

I have created a blog in PHP. Users can write something and post it. If the post includes a web address, a link is created automatically using this:
<?php
//...code
$row['comment'] = preg_replace('@(https?://([-\w.]+[-\w])+(:\d+)?(/([\w-.~:/?#\[\]\@!$&\'()*+,;=%]*)?)?)@', '<font color="#69aa35">$1</font>', $row['comment']);
?>
With this, the text is posted successfully and any web address is displayed in link format inside the text. Any idea how I can change this so that a YouTube link produces an embedded YouTube frame instead? For example, on Facebook, when you post a YouTube address, a YouTube frame is created and posted instead of a link.
I have improved my answer and tested this code:
<?php
// This is your comment string containing the YouTube link
$string = "Here is a link - https://www.youtube.com/watch?v=LJHFXenOPi4";
// Find all links in the input string (preg_match would stop after the first one)
preg_match_all('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/?:@=_#&%~,+$-]+/', $string, $matches);
foreach ($matches[0] as $url) {
    // Parse each URL within the comment data
    $input = parse_url($url);
    $host = isset($input['host']) ? $input['host'] : '';
    if ($host == 'youtube.com' || $host == 'www.youtube.com') {
        // If it is a YouTube link, parse the GET variables
        parse_str((string) parse_url($url, PHP_URL_QUERY), $variables);
        if (!empty($variables['v'])) {
            // Echo out the iframe with the relevant video ID
            echo '<iframe width="560" height="315" src="//www.youtube.com/embed/'.htmlspecialchars($variables['v']).'" frameborder="0" allowfullscreen></iframe>';
        }
    }
}
?>
I hope this is what you were looking for, it has worked for me on a few tests
You know the solution don't you? :) If the snippet contains youtube.com URL, then using the same pattern matching you can replace it with a youtube embed tag :)
Basically it will be something like this in pseudo-code.
Check the comment pasted to see if youtube URL present (modify the same regex you are using for finding URL, just make it specific for youtube)
If yes, then replace it with:
<iframe type="text/html"
width="640"
height="385"
src="<youtube URL>"
frameborder="0">
</iframe>
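The pseudo-code above could be sketched with preg_replace_callback along these lines. Several things here are assumptions rather than part of the answer: the function name embedYoutubeLinks is hypothetical, only watch?v= style URLs are handled, video IDs are assumed to be 11 characters of letters, digits, underscore, or hyphen, and the iframe points at the /embed/ form of the URL rather than the pasted address.

```php
<?php
// Hypothetical helper: replaces YouTube watch URLs in a comment with embed iframes.
function embedYoutubeLinks(string $comment): string
{
    // Matches http(s)://youtube.com/watch?...v=ID, with or without "www.".
    $pattern = '~https?://(?:www\.)?youtube\.com/watch\?(?:[^\s"<>]*?&)?v=([\w-]{11})[^\s"<>]*~i';

    return preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is the captured video ID; escape it before putting it in an attribute.
        $id = htmlspecialchars($m[1], ENT_QUOTES);
        return '<iframe type="text/html" width="640" height="385" '
             . 'src="//www.youtube.com/embed/' . $id . '" frameborder="0"></iframe>';
    }, $comment);
}

echo embedYoutubeLinks('Here is a link - https://www.youtube.com/watch?v=LJHFXenOPi4'), "\n";
```

Running this before the generic link replacement would keep the two steps from rewriting the same URL twice.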

Can I use Zend Dom Query to query other html attributes such as ALT

Using Zend Dom Query I would like to search HTML to find certain attributes.
Take the following image as an example.
<img id="active-main-image" src="/images/example.jpg" alt="Image 1234" class="product-image">
Instead of using $this->_xhtml->query('img#active-main-image'); I would like to find the image by using the alt attribute.
Pseudo -> $this->_xhtml->query('img alt Image 1234');
I can see why this method is not conventionally popular however, when equipped with nothing but the Alt value of a certain image on a page, I see no alternative.
Thanks
Zend_Dom_Query has a queryXpath method that will accept valid xpath queries.
Untested but this should work :
$dom = new Zend_Dom_Query($html);
$img = $dom->queryXpath("//img[@alt='Image 1234']");
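The same XPath expression can be tried without Zend, using PHP's built-in DOMXPath. This is only an equivalent sketch for checking the query itself, not the Zend_Dom_Query code from the answer:

```php
<?php
$html = '<img id="active-main-image" src="/images/example.jpg" alt="Image 1234" class="product-image">';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the image by its alt attribute instead of its id.
$img = $xpath->query("//img[@alt='Image 1234']")->item(0);
$src = $img instanceof DOMElement ? $img->getAttribute('src') : null;

echo $src, "\n";
```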

How to get image url using preg_match

I'm trying to use preg_match to grab image URLs from another page, but the problem is that my PHP code always returns an empty result! I'm new to PHP.
Here is the structure of the image on that page...
<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>
My current code is:
preg_match_all('/a class="prs_link" href="([^"]+)"><img src=.+?><\/a>/',$screen,$results);
You will see literally hundreds of Q&As here on SO cautioning coders against using regex to parse HTML. There is good reason for those comments and answers, so please heed them and avoid looking for a regex solution to parse HTML.
Here is one recommended way of parsing HTML (using DOM):
$html = '<a class="prs_link" href="xxxx"><img src="THE IMAGE URL I WANT TO GET" width="310" height="196"></a>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//a[@class='prs_link']/img/@src)");
echo "src=[$src]\n";
Output:
src=[THE IMAGE URL I WANT TO GET]
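Since string() returns only the first match, a page with several such links needs a loop over the matching attribute nodes. A small sketch, with made-up sample markup (the hrefs and image paths are placeholders):

```php
<?php
// Made-up sample: two prs_link anchors and one unrelated anchor.
$html = '<a class="prs_link" href="/one"><img src="/img/one.jpg" width="310" height="196"></a>'
      . '<a class="other" href="/skip"><img src="/img/skip.jpg"></a>'
      . '<a class="prs_link" href="/two"><img src="/img/two.jpg" width="310" height="196"></a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls = [];
// query() returns every matching src attribute node, in document order.
foreach ($xpath->query("//a[@class='prs_link']/img/@src") as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```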
