PHP - Parsing iFrame from an HTML Page

Hey guys, so I have a basic PHP application that loads a page with a video on it from one of those free TV streaming sites. You set the episode, season, and show you wish to view, and the server application parses the HTML tag of the iframe that contains that video. The application works by parsing the HTML page with PHP's preg_match_all() function to get all occurrences of the iframe HTML tag, using the following string as the pattern: "/iframe .*\>/". This works for about half of the video players on the site, but for some reason comes up dry with all of the others.
For example, the video at http://www.free-tv-video-online.me/player/novamov.php?id=huv5cpp5k8cia (hosted on a video site called novamov) is easily parsed. However, the video displayed at http://www.free-tv-video-online.me/player/gorillavid.php?id=8rerq4qpgsuw (hosted on gorillavid) is not found by the preg_match_all() function, despite it clearly being displayed in the HTML source when the element is inspected using Chrome. Why is my script not returning the proper results, and why is this behaviour dependent on the video player the video is using? Could someone please explain?

Try:
$dom = new DOMDocument();
libxml_use_internal_errors(true);      // real-world pages are rarely valid HTML
$dom->loadHTMLFile('yourURLHere');     // loadHTML() expects a string; loadHTMLFile() takes a URL or path
$ifrA = array();
foreach ($dom->getElementsByTagName('iframe') as $ifr) {
    $ifrA[] = $ifr->getAttribute('src');
}
Now the $ifrA array should hold your iframe srcs. Note that if a player page injects its iframe with JavaScript, the tag will appear in Chrome's element inspector but not in the raw HTML your script downloads, so neither preg_match_all() nor DOMDocument will find it; that is the most likely reason parsing works for some players and not others.

Related

How to extract all image URLs from an HTML source and download them using curl?

I am using curl to get the images from the HTML source code of an external webpage. I see img original='imageurl' in View Page Source in Firefox. But when I select a particular image, it shows img src='imageurl' in View Selection Source in Firefox.
How can I get this type of image using curl?
Currently I am using regex to get the image:
preg_match_all('/<img[^>]+>/i',$output, $result);
print_r($result);
But it doesn't display any image.
I am very confused about what to do here. Anyone have any thoughts?
I am very confused about what to do here.
The confusion probably results from the fact that you are using your web browser to view the source of the URL. Even though the source the browser displays is often the same data that curl would return, this is not always the case.
In particular, the Firefox feature View Selection Source does not display the selection from the original resource, but often something else. To prevent that, disable JavaScript in Firefox. Documents are often modified with JavaScript, and you want to see the original, not the modification, because curl is not able to run JavaScript; it can only get "the original".
Anyone have any thoughts?
1. Disable JavaScript in your browser.
2. Reload the page.
3. Locate the fragment of the HTML source code you're interested in.
4. Write it down, e.g. into a string.
5. Request the page with cURL and output the source.
6. Locate that string in there. If it's not in there, search the cURL result for the string you're interested in and use that instead.
7. Write a regular expression that is able to obtain what you need from that string.
8. Then use that regular expression in your program.
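A minimal sketch of steps 5 through 8, assuming CURLOPT_RETURNTRANSFER is set (the URL is a placeholder, and the pattern matches the src/original attributes mentioned in the question):
// Step 5: request the page with cURL and capture the raw source.
$ch = curl_init('http://example.com/page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$output = curl_exec($ch);
curl_close($ch);
// Step 6: check that the fragment you wrote down really is in the raw source.
if (strpos($output, '<img') === false) {
    echo "Fragment not found - the markup is probably added by JavaScript.\n";
}
// Steps 7-8: extract what you need with a regular expression.
preg_match_all('/<img[^>]+(?:src|original)=[\'"]([^\'"]+)[\'"]/i', $output, $matches);
print_r($matches[1]);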
Your web browser is reformatting the HTML according to how it understands/parses the HTML page.
When you choose "View Page Source" it shows you the original source code served from the server.
When you select content and choose "View Selection Source" it shows what the browser has parsed into DOM (what the browser understands) for the selected content.
I am guessing you're using Firefox.
If you are attempting to use cURL to process the HTML served from the server, you must not look at "View Selection Source"; always refer to "View Page Source".
Ultimately
You should rather refer to the ACTUAL result from cURL
For example:
$content = curl_exec($ch);           // assumes CURLOPT_RETURNTRANSFER is set
header("Content-type: text/plain");  // stop the browser from rendering the HTML
echo $content;
That should echo exactly what cURL has received from the server...
NOTE: This is a re-post of https://stackoverflow.com/questions/8754844/can-not-get-images-using-curl
Furthermore
If you want to fetch the actual image inside an <img src=""> tag, then you need to pinpoint the IMG tag in the resulting HTML response using preg_match, and do a separate cURL request to the IMG SRC.
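A hedged sketch of that two-step flow, reusing the $content variable from the curl_exec() example above (the output file name and the assumption that the src is an absolute URL are mine):
// Pinpoint the first <img> tag and pull out its src attribute.
if (preg_match('/<img[^>]+src=[\'"]([^\'"]+)[\'"]/i', $content, $m)) {
    $imgUrl = $m[1]; // assumes an absolute URL; relative paths must be resolved first
    // Separate cURL request for the image itself.
    $ch = curl_init($imgUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $imageData = curl_exec($ch);
    curl_close($ch);
    file_put_contents('downloaded_image.jpg', $imageData); // hypothetical file name
}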

Parsing comments, finding links and embedded video

Right now I have a variable: $blogbody which contains the entire contents of a blog.
I'm using the following to convert URLS to clickable links:
$blogbody = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "<a href=\"\\0\">\\0</a>", $blogbody);
And the following to resize embedded video:
$blogbody = preg_replace('/(width)=("[^"]*")/i', 'width="495"', $blogbody);
The problem I'm running into is the embedded video not working; it comes back with an Access Forbidden (403) error. If I remove the line that converts URLs to links, the embedded video works fine. I'm not sure how to get these two working together. If anyone has a better solution for converting URLs to clickable links and resizing embedded video, let me know!
This might be happening because the link which you use to embed the video also gets <a href=''> tags added around it. So instead of just converting all links, check that they don't have ' or " directly behind or in front of them; this will make sure that the embedded videos' links won't get anchor tags.
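A minimal sketch of that check, assuming $blogbody as in the question; the atomic group (?>...) keeps the regex from backtracking into a quoted URL and linking only part of it:
// Only link URLs that are not wrapped in quotes, so src="..." and
// href="..." attribute values inside embed code are left untouched.
$blogbody = preg_replace(
    '~(?<![\'"])\b((?>[a-z][a-z0-9+.-]*://[^<>\s\'"]+))(?![\'"])~i',
    '<a href="$1">$1</a>',
    $blogbody
);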

How can I load an external page with PHP and replace content on that page?

I'm building a PHP app that allows a user to upload an ad and preview it on specific pages of a specific website. Right now, I'm doing it by taking screenshots of the webpage, removing the ads, and placing my own ads in their place. This seems pretty stupid.
What would be the best way to get the contents of a URL and replace the code that appears within a certain element on that page?
Load the page source into a DOMDocument object
Do a search and replace using XPath on the DOMDocument object
Example:
<?php
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTMLFile("http://www.example.com");
$xpath = new DOMXPath($dom);
// Swap the logo image for Google's
$logoImage = $xpath->query("//div[@id='header-logo']//a//img")->item(0);
$logoImage->setAttribute('src', 'http://www.google.com/images/logos/ps_logo2.png');
// Point the logo link at google.com
$logoLink = $xpath->query("//div[@id='header-logo']//a")->item(0);
$logoLink->setAttribute('href', 'http://www.google.com/');
echo $dom->saveHTML();
?>
The example loads the source of example.com and replaces the logo with google.com's logo and a link to google.
The code does not have any validation but should be pretty easy to modify for your needs :)
I am not sure about the complete situation or exactly how you would like to perform this action.
However, based on your question, the best way is to use Ajax.
Through Ajax, pass the details of the page you want to display; in PHP, filter the page as you want it displayed and return the desired result.
Then, at the end of the Ajax request, display the result in the particular location.
You can even filter the result returned by the Ajax request in JavaScript.
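On the PHP side, that handler could be as small as the following sketch (the file name proxy.php and the whitelist are my assumptions; the Ajax call would request proxy.php?url=... and drop the response into the page):
<?php
// proxy.php - fetch a page server-side, filter it, return the result to Ajax.
$url = isset($_GET['url']) ? $_GET['url'] : '';
// Only proxy pages you explicitly allow.
$allowed = array('http://www.example.com/');
if (!in_array($url, $allowed, true)) {
    header('HTTP/1.1 400 Bad Request');
    exit;
}
$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate real-world markup
$dom->loadHTMLFile($url);
// ...filter or replace nodes here, e.g. with DOMXPath as in the answer above...
echo $dom->saveHTML();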

Getting the first URL of an image search result with the Google Image API in PHP

Do you know of a PHP script (a class would be nice) that gets the URL of the first image result of a Google API image search? Thanks.
Example.
<?php echo(geturl("searchterm")) ?>
I have found a solution to get the first image from a Google Image result using Simple HTML DOM, as Sarfraz suggested.
Kindly check the code below. Currently it is working fine for me.
require_once 'simple_html_dom.php'; // the Simple HTML DOM library
$search_keyword = str_replace(' ', '+', $search_keyword);
$newhtml = file_get_html("https://www.google.com/search?q=".$search_keyword."&tbm=isch");
$result_image_source = $newhtml->find('img', 0)->src;
echo '<img src="'.$result_image_source.'">';
You should be able to do that easily with Simple HTML DOM.
Note: See the examples on their site for more information.
An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
Find tags on an HTML page with selectors just like jQuery.

Facebook-like on-demand meta content scraper

Have you guys ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from the page behind the link, or a video thumbnail from a video-related link (like YouTube)?
Any ideas how one would copy this functionality? I'm thinking about a couple of Gearman workers, or even better, just JavaScript that does XHR requests and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
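As a sketch of that idea (assuming the Simple HTML DOM library used below, with a 50px threshold of my own choosing):
// $html is a Simple HTML DOM document; skip images whose declared
// width/height suggest spacers or buttons. Images without those
// attributes are skipped here as well.
$thumbs = array();
foreach ($html->find('img') as $img) {
    if ((int)$img->width >= 50 && (int)$img->height >= 50) {
        $thumbs[] = $img->src;
    }
}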
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done server-side.
class ScrapedInfo
{
    public $url;
    public $title;
    public $description;
    public $imageUrls;
}

function scrapeUrl($url)
{
    $info = new ScrapedInfo();
    $info->url = $url;
    $html = file_get_html($info->url);

    // Grab the page title
    $info->title = trim($html->find('title', 0)->plaintext);

    // Grab the page description
    foreach ($html->find('meta') as $meta)
        if ($meta->name == "description")
            $info->description = trim($meta->content);

    // Grab the image URLs
    $imgArr = array();
    foreach ($html->find('img') as $element) {
        $rawUrl = $element->src;
        // Turn any relative URLs into absolutes
        if (substr($rawUrl, 0, 4) != "http")
            $imgArr[] = $url.$rawUrl;
        else
            $imgArr[] = $rawUrl;
    }
    $info->imageUrls = $imgArr;
    return $info;
}
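For example, it can be called like this (example.com is a placeholder; the require pulls in file_get_html()):
require_once 'simple_html_dom.php';
$info = scrapeUrl('http://www.example.com/');
echo $info->title . "\n";
echo $info->description . "\n";
print_r($info->imageUrls);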
Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
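Checking for that tag is short with the Simple HTML DOM library from the answer above (a sketch; $html is a document loaded with file_get_html()):
// Prefer the page's declared thumbnail if present.
$link = $html->find('link[rel=image_src]', 0);
$thumbnail = $link ? $link->href : null;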
As I am developing a project like that, I can say it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info in particular, and trying to get auto-play behavior, is always tricky or sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface, so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.
