I want to scrape videos from other sites (e.g. from a live video site) and show them on my own site.
How can I scrape the <iframe> video embeds from another website? Is the process the same as for scraping images?
$html = file_get_contents('http://website.com/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$iframes = $dom->getElementsByTagName('frame');
foreach ($iframes as $iframe) {
$pic = $iframe->getAttribute('src');
echo '<li><frame src="'.$pic.'"';
}
This post is a little old, but still, here's my answer:
I'd recommend using cURL and XPath to fetch the page and parse the HTML. file_get_contents() can only open URLs when allow_url_fopen is enabled, and some hosts disable that setting for security reasons. You could do something like this:
<?php
function scrape($URL){
//cURL options
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, //return html data in string instead of printing it out on screen
CURLOPT_FOLLOWLOCATION => TRUE, //follow HTTP redirects (Location headers)
CURLOPT_CONNECTTIMEOUT => 60, //max time in seconds to wait for the connection
CURLOPT_HEADER => FALSE, //do not include response headers in the returned data
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0", //User Agent
CURLOPT_URL => $URL //SET THE URL
);
$ch = curl_init($URL);//initialize a cURL session
curl_setopt_array($ch, $options);//set the cURL options
$data = curl_exec($ch);//execute cURL (the scraping)
curl_close($ch);//close the cURL session
return $data;
}
function parse(&$data, $query, &$dom){
$Xpath = new DOMXpath($dom); //new Xpath object associated to the domDocument
$result = $Xpath->query($query);//run the Xpath query through the HTML
var_dump($result);
return $result;
}
//new domDocument
$dom = new DomDocument("1.0");
//Scrape and parse
$data = scrape('http://stream-tv-series.net/2013/02/22/new-girl-s1-e6-thanksgiving/'); //scrape the website
@$dom->loadHTML($data); //load the html data to the dom (@ suppresses warnings from malformed markup)
$XpathQuery = '//iframe'; //Your Xpath query could look something like this
$iframes = parse($data, $XpathQuery, $dom); //parse the HTML with Xpath
foreach($iframes as $iframe){
$src = $iframe->getAttribute('src'); //get the src attribute
echo '<li><iframe src="' . htmlspecialchars($src, ENT_QUOTES) . '"></iframe></li>'; //echo the iframes, escaping the scraped URL
}
?>
Here are some links that you could find useful:
cURL: http://php.net/manual/fr/book.curl.php
Xpath: http://www.w3schools.com/xpath/
There is also the DOMDocument documentation on php.net. I can't post the link, I don't have enough reputation.
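To try the parsing step without hitting a live site, here is a minimal sketch that runs the same kind of XPath query on an HTML string. The helper name extractIframeSources and the sample markup are made up for illustration; in practice the string would come from scrape().

```php
<?php
// Hypothetical helper: returns the src of every <iframe> found in an HTML string.
function extractIframeSources(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];
    // Select the src attribute nodes of all iframes directly.
    foreach ($xpath->query('//iframe/@src') as $attr) {
        $sources[] = $attr->nodeValue;
    }
    return $sources;
}

// Made-up sample markup standing in for a downloaded page.
$sample = '<html><body><p>Player</p>'
        . '<iframe src="http://example.com/embed/1"></iframe>'
        . '<iframe src="http://example.com/embed/2"></iframe>'
        . '</body></html>';

foreach (extractIframeSources($sample) as $src) {
    // Escape the scraped value before writing it back into HTML.
    echo '<li><iframe src="' . htmlspecialchars($src, ENT_QUOTES) . '"></iframe></li>', PHP_EOL;
}
```

Querying //iframe/@src instead of //iframe skips iframes that have no src attribute at all.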
Related
How can I extract a table from a page using PHP cURL? I have to enter a name before I get to the page that contains the table. I have written some code that fetches that page, but I don't know how to extract the table and paste it on my site with the same formatting (it contains text and hyperlinks).
<?php
function search($url,$data){
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_POST => 1,
CURLOPT_POSTFIELDS => $data,
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_HEADER => 0,
CURLOPT_TIMEOUT => -1,
CURLOPT_USERAGENT => "bot",
));
$result = curl_exec($curl);
if($result === false) { // check for errors only after the request has run
print_r(curl_error($curl));
die();
}
curl_close($curl);
return $result;
}
$data = "name=name&submit=submit";
$url = "www.extenal.com";
$test = search($url,$data);
echo $test;
$dom = new DOMDocument;
@$dom->loadHTML($test);
$nodes = $dom->getElementsByTagName('table');
return $nodes;
?>
Here is code to extract the HTML. I have used DOMXPath; see the link below to learn how to use wildcards to get a specific element from the HTML response:
<?php
$htmlreponse = "<table><tr><td>test 1</td><td>test 2</td></tr></table>";
$dom = new DOMDocument();
$dom->loadHtml($htmlreponse);
$xpath = new DOMXpath($dom);
foreach($xpath->query('//table') as $table){
echo $table->C14N();
//if you need only content then use this
echo $table->textContent;
}
Here you can learn more about DOMXPath; you can apply different wildcards to get specific data as well: http://php.net/manual/en/class.domxpath.php
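If the goal is to keep the table's markup (text and hyperlinks) rather than only its text, a minimal sketch could look like the following. The sample HTML and the id "results" are made up for illustration; the real page will need its own XPath expression.

```php
<?php
// Made-up sample response standing in for the page returned by cURL.
$html = '<html><body><h1>Results</h1>'
      . '<table id="results"><tr><td>Alice</td><td><a href="/user/1">profile</a></td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$table = $xpath->query('//table[@id="results"]')->item(0);

$tableHtml = '';
$links = [];
if ($table !== null) {
    // saveHTML() with a node argument serializes only that subtree, markup included.
    $tableHtml = $dom->saveHTML($table);
    // The leading dot makes the query relative to $table.
    foreach ($xpath->query('.//a/@href', $table) as $href) {
        $links[] = $href->nodeValue;
    }
}

echo $tableHtml, PHP_EOL;
print_r($links);
```

Relative hrefs such as /user/1 still point at the original site, so they would need to be rewritten to absolute URLs before the table is shown elsewhere.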
I want to add an Instagram followers feature to my project.
<?php
function callInstagram($url)
{
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_SSL_VERIFYHOST => 2));
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$url = "https://www.instagram.com/xyz/";
$dom = new domDocument();
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('script type');
?>
I am using DOM to get the content of the HTML from 'script type' onwards, but I am not able to get it.
You should actually call the callInstagram($url) function, otherwise your $result variable will be empty. The main routine should therefore begin like this (with added second line):
$url = "https://www.instagram.com/ravij28/";
$result = callInstagram($url);
$dom = new DOMDocument();
$dom->loadHTML($result);
[..]
Also, when you want to retrieve the scripts on the page, you need to use the tag name, which is just script, not script type. So, the last line of your snippet needs to read:
$tables = $dom->getElementsByTagName('script');
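As a rough illustration of what comes back, the sketch below loops over the script elements of a made-up HTML string and reads each one's type attribute and inline code. The sample markup is an assumption, not Instagram's actual page.

```php
<?php
// Made-up sample markup; a real page would come from callInstagram($url).
$html = '<html><head>'
      . '<script type="text/javascript">var a = 1;</script>'
      . '<script type="application/ld+json">{"name":"demo"}</script>'
      . '</head><body></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$scripts = [];
foreach ($dom->getElementsByTagName('script') as $script) {
    $scripts[] = [
        'type' => $script->getAttribute('type'), // empty string when the attribute is missing
        'code' => $script->textContent,          // the inline script source
    ];
}
print_r($scripts);
```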
I want to write code that, given a username, dumps the number of followers from the page source of any Instagram user (the value was highlighted in a screenshot of the page source).
I know a bit about cURL and the DOM.
<?php
function callInstagram($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => 2
    ));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$url = "https://www.instagram.com/xyz/";
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // must be set before loading to have any effect
$dom->loadHTML(callInstagram($url));
$tables = $dom->getElementsByTagName('script');
print_r($tables);
?>
Still building.
It looks like you are trying to get Instagram's data.
It's better to use Instagram's API to achieve your goal.
Link: https://www.instagram.com/developer/
Edit:
Another way, assuming you can get the whole HTML as a string:
Next, use a regex to extract the JSON string.
You can use this regex: _sharedData = (.*);
Finally, use json_decode to convert the JSON string into a PHP object or array.
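A minimal sketch of the regex route, under stated assumptions: the page fragment and the JSON layout (user, followed_by, count) are made up here, and the real page source may be structured differently or change at any time.

```php
<?php
// Made-up page fragment; the structure of the real JSON is an assumption.
$html = '<script type="text/javascript">window._sharedData = '
      . '{"user":{"username":"xyz","followed_by":{"count":123}}};</script>';

$followers = null;
// Capture the object between "_sharedData = " and the closing ";</script>".
if (preg_match('/_sharedData\s*=\s*(\{.*?\});<\/script>/s', $html, $m)) {
    $data = json_decode($m[1], true); // true => decode into associative arrays
    if (json_last_error() === JSON_ERROR_NONE) {
        // Hypothetical path to the follower count inside the decoded data.
        $followers = $data['user']['followed_by']['count'] ?? null;
    }
}
var_dump($followers);
```

Anchoring the pattern on ;</script> keeps the match from stopping at an earlier semicolon inside the JSON text.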
I am trying to write a PHP script to pull snow and other data from http://www.snowbird.com/mountain-report to display via an LED array. I am having trouble getting the data I need and can't find a way to make it work. I've read that PHP is not the best tool for this. Can I make this work, or would I have to use a different language? Here is the code I can't seem to get working.
<?php
include_once('simple_html_dom.php');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
$output = ($output);
$html = new DOMDocument();
$html = loadhtml( $content);
$ret1 = $html->find('div[id=twelve-hour]');
print_r ($ret1);
$ret2 = $html->find('#twenty-four-hour');
print_r ($ret2);
$ret3 = $html->find('#forty-eight-hour');
print_r ($ret3);
$ret4 = $html->find('#current-depth');
print_r ($ret4);
$ret5 = $html->find('#year-to-date');
print_r ($ret5);
?>
This is an ancient question, but it's easy enough to provide an answer for it. Use an XPath query to get the correct node's text value. (This should be as easy as passing the URL directly to DOMDocument::loadHTMLFile(), but the server filters requests based on the user agent, so we have to fake one.)
<?php
$ctx = stream_context_create(["http"=>[
"user_agent"=>"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"
]]);
$html = file_get_contents("http://www.snowbird.com/mountain-report/", false, $ctx); // second argument is use_include_path, not needed for a URL
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_NOWARNING|LIBXML_NOERROR);
$xp = new DomXpath($doc);
$root = $doc->getElementById("snowfall");
$snowfall = [
"12hour" => $xp->query("div[@id='twelve-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"24hour" => $xp->query("div[@id='twenty-four-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"48hour" => $xp->query("div[@id='forty-eight-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"current" => $xp->query("div[@id='current-depth']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"ytd" => $xp->query("div[@id='year-to-date']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
];
print_r($snowfall);
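The relative queries above can be tried offline with a small sketch like this one. The markup is made up to imitate the ids and classes used in the answer, and readInches is a hypothetical helper that returns null instead of failing when a node is missing.

```php
<?php
// Made-up markup that imitates the ids and classes used above; the real page may differ.
$html = '<div id="snowfall">'
      . '<div id="twelve-hour"><div class="total-inches">3</div></div>'
      . '<div id="current-depth"><div class="total-inches">88</div></div>'
      . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$root = $xp->query("//div[@id='snowfall']")->item(0);

// Hypothetical helper: returns the trimmed text, or null when the node is absent.
function readInches(DOMXPath $xp, DOMNode $root, string $id): ?string
{
    // No leading slash, so the path is evaluated relative to $root.
    $node = $xp->query("div[@id='$id']/div[@class='total-inches']", $root)->item(0);
    return $node === null ? null : trim($node->textContent);
}

$snowfall = [
    "12hour"  => readInches($xp, $root, "twelve-hour"),
    "current" => readInches($xp, $root, "current-depth"),
    "ytd"     => readInches($xp, $root, "year-to-date"), // not in the sample, so null
];
var_dump($snowfall);
```

Checking for null matters on a live page, because calling ->textContent on a missing item(0) raises an error as soon as the site changes its markup.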
I'm attempting to pull some image URLs from Steam store pages, such as:
http://store.steampowered.com/app/35700/
http://store.steampowered.com/app/252490/
Here's the code I'm using:
$url = 'http://store.steampowered.com/app/35700/';
$html = file_get_contents($url);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
It works fine with the first store page, but the second one redirects to an age verification page, and the script returns the images from there. I need a way for the script to get past the age verification and access the actual store page.
Any help would be appreciated.
Edit:
This is what's passed to the server when the age form is submitted:
snr=1_agecheck_agecheck__age-gate&ageDay=1&ageMonth=January&ageYear=1979
and the cookies that it sets:
lastagecheckage=1-January-1979; expires=Tue, 03 Mar 2015 19:53:42 GMT; path=/; domain=store.steampowered.com
birthtime=662716801; path=/; domain=store.steampowered.com
Edit2:
I can set the cookies using cURL but they aren't used by DOM loadHTML, so I get the same result as before. I need either a way for loadHTML to use specific cookies that I set, or another method of grabbing the image URLs that will use cookies set by cURL.
Solved! Here's the working code:
$url = 'http://store.steampowered.com/app/35700/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIE, "birthtime=28801"); // only name=value pairs belong here; path and domain are attributes the server sets, not cookies to send
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
$dom = new domDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$src = $image->getAttribute('src');
echo $src.PHP_EOL;
}
curl_close($ch);
You were looking for PHP answers, but I was trying to do the same thing in Python and this was the most relevant question. Your PHP answer helped me out, so maybe a Python solution will help someone. My solution using python-requests in Python 2.7:
import requests
url = 'http://store.steampowered.com/app/252490/'
# Only name/value pairs go here; path and domain are cookie attributes, not cookies.
cookie = {
'birthtime' : '28801'
}
r = requests.get(url, cookies=cookie)
assert (r.status_code == 200 and r.text.find('Please enter your birth date to continue') < 0), ("Failed to retrieve page for {url}. Error={code}.".format(url=url, code=r.status_code))
print r.text.encode('utf-8')