I want to scrape an HTML page, and I am using cURL in PHP to do it. I can successfully scrape the content of a specific <div>, e.g.
<div class="someDiv">ABC</div>
with the following working code:
<?php
$curl = curl_init('https://www.someUrl.com');
curl_setopt_array($curl, array(
    CURLOPT_ENCODING => '',
    CURLOPT_FOLLOWLOCATION => FALSE,
    CURLOPT_FRESH_CONNECT => TRUE,
    CURLOPT_SSL_VERIFYPEER => FALSE,
    CURLOPT_REFERER => 'http://www.google.com',
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    CURLOPT_VERBOSE => FALSE
));
$page = curl_exec($curl);
if (curl_errno($curl)) {
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);

$regex = '/<div class="someDiv">(.*?)<\/div>/s';
if (preg_match_all($regex, $page, $result)) {
    echo $result[1][0];
} else {
    print "Not found";
}
?>
Now I want to scrape an <img> nested inside a <span>. The code I want to scrape is as follows:
<span class="thumbnail">
<img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc" >
</span>
I want to get the data-thumb from the <img> tag nested inside a <span> having class="thumbnail".
Here we go again... don't use regex to parse HTML; use an HTML parser like DOMDocument along with DOMXPath, i.e.:
<?php
...
$page = curl_exec($curl);

libxml_use_internal_errors(true); //real-world HTML usually triggers parser warnings
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXpath($dom);

foreach ($xpath->query("//span[@class='thumbnail']/img") as $img) {
    echo $img->getAttribute('data-thumb');
}
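As a self-contained check of that XPath, the question's markup can be inlined as a string instead of being fetched with cURL:

```php
<?php
// Self-contained check of the XPath: the question's markup is inlined
// as a string instead of being fetched over the network.
$page = '<span class="thumbnail"><img src="image.gif" width="20" data-thumb="blabla/photo.jpg" height="20" alt="abc"></span>';

libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

foreach ($xpath->query("//span[@class='thumbnail']/img") as $img) {
    echo $img->getAttribute('data-thumb'), "\n"; // blabla/photo.jpg
}
```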
Related
How can I get a table of contents out using PHP cURL? I have to enter a name before getting to the page that has the table. I have written some code to get to that page, but I don't know how to extract the table and show it on my site with the same formatting (it contains text and hyperlinks).
<?php
function search($url, $data){
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_POST => 1,
        CURLOPT_POSTFIELDS => $data,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_HEADER => 0,
        CURLOPT_TIMEOUT => -1,
        CURLOPT_USERAGENT => "bot",
    ));
    $result = curl_exec($curl); //run the request first...
    if(curl_errno($curl)) { //...then check for errors
        print_r(curl_error($curl));
        die();
    }
    curl_close($curl);
    return $result;
}
$data = "name=name&submit=submit";
$url = "www.extenal.com";
$test = search($url, $data);
echo $test;
$dom = new DOMDocument;
@$dom->loadHTML($test); //load the fetched page ($result only exists inside the function)
$nodes = $dom->getElementsByTagName('table');
return $nodes;
?>
Here is code to extract the HTML. I have used DOMXPath; see the link below to learn how to target a specific element in the HTML response:
<?php
$htmlresponse = "<table><tr><td>test 1</td><td>test 2</td></tr></table>";
$dom = new DOMDocument();
$dom->loadHTML($htmlresponse);
$xpath = new DOMXpath($dom);
foreach($xpath->query('//table') as $table){
    echo $table->C14N();
    //if you need only the content, use this instead
    echo $table->textContent;
}
Here you can learn more about DOMXPath and the different expressions you can apply to get at specific data: http://php.net/manual/en/class.domxpath.php
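For instance, here are a few common DOMXPath query patterns against a small in-memory fragment (the table markup below is made up for illustration):

```php
<?php
// A few DOMXPath query patterns against a small in-memory HTML fragment.
// The table markup below is made up for illustration.
$html = '<table id="prices"><tr><td class="name">Widget</td><td class="price">9.99</td></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// select an element by id
$table = $xpath->query('//table[@id="prices"]')->item(0);

// select elements by class, at any depth
foreach ($xpath->query('//td[@class="price"]') as $td) {
    echo $td->textContent, "\n"; // 9.99
}

// select by attribute presence, whatever the tag (a "wildcard" element)
echo $xpath->query('//*[@class]')->length, "\n"; // 2
```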
I want to add a feature to my project that shows Instagram followers.
<?php
function callInstagram($url)
{
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_SSL_VERIFYHOST => 2));
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$url = "https://www.instagram.com/xyz/";
$dom = new domDocument();
$dom->loadHTML($result);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('script type');
?>
I am using DOM to get the content of the HTML from the 'script type' part onwards, but I am not able to get it.
You should actually call the callInstagram($url) function, otherwise your $result variable will be empty. The main routine should therefore begin like this (with added second line):
$url = "https://www.instagram.com/ravij28/";
$result = callInstagram($url);
$dom = new DOMDocument();
$dom->loadHTML($result);
[..]
Also, when you want to retrieve the scripts on the page, you need to use the tag name, which is just script, not script type. So, the last line of your snippet needs to read:
$tables = $dom->getElementsByTagName('script');
I am using PHP's curl for getting webpage data, and for extracting <a> tags from the <body> I am using DOM Document, but it is creating an error.
<?php
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => "http://www.google.co.in/?gfe_rd=cr&ei=B5GBVezbDeHA8geU8pfYBw",
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_USERAGENT => 'Webbot UA'
));
$result = curl_exec($ch);
curl_close($ch);
if (isset($result)){
$doc = new DomDocument;
$doc->Load($result);
var_dump($doc['a']);
}
?>
I would not use DOMDocument; use SimpleXMLElement::xpath, but that's just because I believe it's faster in execution. I may be wrong, though.
$xml = new SimpleXMLElement($result); // $result holds the fetched markup
foreach ($xml->xpath('//a') as $node) {
    echo 'a: ', $node, "\n";
}
To use DomDocument look at DOMDocument::getElementsByTagName
$books = $dom->getElementsByTagName('a');
foreach ($books as $book) {
echo $book->nodeValue, PHP_EOL;
}
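A self-contained sketch of that DOMDocument approach, with a made-up HTML fragment standing in for the fetched page:

```php
<?php
// DOMDocument approach on a made-up HTML fragment; in the real code the
// string would come from curl_exec().
$html = '<body><a href="/one">One</a> <a href="/two">Two</a></body>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->nodeValue, ' => ', $a->getAttribute('href'), PHP_EOL;
}
```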
I want to scrape video from other sites to my sites (e.g. from a live video site).
How can I scrape the <iframe> video from other websites? Is the process the same as that for scraping images?
$html = file_get_contents('http://website.com/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$iframes = $dom->getElementsByTagName('frame');
foreach ($iframes as $iframe) {
$pic = $iframe->getAttribute('src');
echo '<li><frame src="'.$pic.'"';
}
This post is a little old, but still, here's my answer:
I'd recommend you use cURL and XPath to scrape the site and parse the HTML data. file_get_contents has some security issues and some hosts may disable it. You could do something like this:
<?php
function scrape($URL){
//cURL options
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, //return html data in string instead of printing it out on screen
CURLOPT_FOLLOWLOCATION => TRUE, //follow header('Location: location');
CURLOPT_CONNECTTIMEOUT => 60, //max time to try to connect to page
CURLOPT_HEADER => FALSE, //include header
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0", //User Agent
CURLOPT_URL => $URL //SET THE URL
);
$ch = curl_init($URL);//initialize a cURL session
curl_setopt_array($ch, $options);//set the cURL options
$data = curl_exec($ch);//execute cURL (the scraping)
curl_close($ch);//close the cURL session
return $data;
}
function parse(&$data, $query, &$dom){
    $Xpath = new DOMXpath($dom); //new Xpath object associated to the DOMDocument
    $result = $Xpath->query($query); //run the Xpath query through the HTML
    return $result;
}
//new domDocument
$dom = new DomDocument("1.0");
//Scrape and parse
$data = scrape('http://stream-tv-series.net/2013/02/22/new-girl-s1-e6-thanksgiving/'); //scrape the website
@$dom->loadHTML($data); //load the html data into the dom (@ silences warnings on malformed HTML)
$XpathQuery = '//iframe'; //Your Xpath query could look something like this
$iframes = parse($data, $XpathQuery, $dom); //parse the HTML with Xpath
foreach($iframes as $iframe){
$src = $iframe->getAttribute('src'); //get the src attribute
echo '<li><iframe src="' . $src . '"></iframe></li>'; //echo the iframes
}
?>
Here are some links that you could find useful:
cURL: http://php.net/manual/fr/book.curl.php
Xpath: http://www.w3schools.com/xpath/
There is also the DOMDocument documentation on php.net. I can't post the link, I don't have enough reputation.
I am trying to load an XML file from a different domain name as a string. All I want is an array of the text within the <title></title> tags of the XML file. Since I am using PHP 4, I am thinking the easiest way would be to run a regex on it to get them. Can someone explain how to load the XML as a string? Thanks!
You could use cURL like the example below. I should add that regex-based XML parsing is generally not a good idea, and you may be better off using a real parser, especially if it gets any more complicated.
You may also want to add some regex modifiers to make it work across multiple lines etc., but I assume the question is more about fetching the content into a string.
<?php
$curl = curl_init('http://www.example.com');
//make content be returned by curl_exec rather than being printed immediately
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
if ($result !== false) {
if (preg_match('|<title>(.*)</title>|i', $result, $matches)) {
echo "Title is '{$matches[1]}'";
} else {
//did not find the title
}
} else {
//request failed
die (curl_error($curl));
}
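If DOM is available (PHP 5+; the question mentions PHP 4, where it may not be), the same extraction can be done with a real parser instead of a regex, and it naturally yields the array of titles the question asks for. A sketch with a stand-in XML string:

```php
<?php
// Parser-based alternative: collect every <title> into an array.
// The XML string below is a stand-in for the fetched content.
$xmlString = '<feed><item><title>First</title></item><item><title>Second</title></item></feed>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);

$titles = array();
foreach ($dom->getElementsByTagName('title') as $node) {
    $titles[] = $node->nodeValue;
}
print_r($titles); // Array ( [0] => First [1] => Second )
```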
First use
file_get_contents('http://www.example.com/');
to get the file and store it in a variable, then parse the XML. The link is:
http://php.net/manual/en/function.xml-parse.php
There are examples in the comments there.
If you're loading well-formed xml, skip the character-based parsing, and use the DOM functions:
$d = new DOMDocument;
$d->load("http://url/file.xml");
$titles = $d->getElementsByTagName('title');
if ($titles) {
echo $titles->item(0)->nodeValue;
}
If you can't use DOMDocument::load() due to how PHP is set up, then use cURL to grab the file and do:
$d = new DOMDocument;
$d->loadXML($grabbedfile);
...
I have this function as a snippet:
function getHTML($url) {
if($url == false || empty($url)) return false;
$options = array(
CURLOPT_URL => $url, // URL of the page
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 3, // stop after 3 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
//Ending all that cURL mess...
//Removing linebreaks,multiple whitespace and tabs for easier Regexing
$content = str_replace(array("\n", "\r", "\t", "\0", "\x0B"), '', $content);
$content = preg_replace('/\s\s+/', ' ', $content);
return $content;
}
That returns the HTML with no line breaks, tabs, or runs of whitespace: everything on a single line.
So now you do this preg_match:
$html = getHTML($url);
preg_match('|<title>(.*)</title>|iUsm', $html, $matches);
and $matches[1] will have the info you need.