I want to scrape a website that has 3 levels of scraping. First I get all the pages; then, on each page, I get the image, title, and URL that redirects me to a unique detail page which contains more info like description, date, etc. If I use foreach it gives me false results, and if I use for instead of foreach it returns just one object. How can I handle this (using for instead of foreach)?
<?php
$stackHref = array();
$eventDetail = array();
$sitecontent = file_get_contents('https://www.everfest.com/music/edm-festivals');
if ($sitecontent === FALSE) {
    $error_log .= 'Error on $sitecontent = file_get_contents(https://www.everfest.com/music/edm-festivals) ';
    //insert_error($error_log);
}
// echo $sitecontent;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sitecontent);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("(//ul[@class='pagination'])[1]/li/a/@href");
// $all_area_set= ' ';
//echo $sitecontent;
if (!isset($nodes)) {
    $error_log .= "Error on \$nodes = \$xpath->query((//ul[@class='pagination'])[1]/li/a/@href)";
    //insert_error($error_log);
    echo $error_log;
}
// get total pages
foreach ($nodes as $link) {
    $stackHref[] = 'https://www.everfest.com' . $link->nodeValue;
}
// loop through each page in order to scrape
$j = 0;
for ($i = 0; $i < count($stackHref); $i++) {
    $sitecontent = file_get_contents($stackHref[$i]);
    if ($sitecontent === FALSE) {
        $error_log .= 'Error on $sitecontent = file_get_contents(' . $stackHref[$i] . ') ';
        //insert_error($error_log);
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($sitecontent);
    libxml_use_internal_errors(FALSE);
    $innerXpath = new DOMXPath($dom);
    // get page link
    $pageLinks = $innerXpath->query('//div[@class="festival-card grow"]/a[1]/@href');
    for ($a = 0; $a < $pageLinks->length; $a++) {
        $eventDetail[$j]['pagelink'] = 'https://www.everfest.com' . $pageLinks[$a]->nodeValue;
        // get img src
        $images = $innerXpath->query("//div[contains(@class,'columns medium-6 large-4')]/div[contains(@class,'grow')]/a/img/@src");
        $eventDetail[$j]['img'] = $images[$a]->nodeValue;
        // get title
        $titles = $innerXpath->query("//div[contains(@class,'clearfix')]/a[1]/text()");
        $eventDetail[$j]['title'] = $titles[$a]->nodeValue;
        // go inside each page in order to get description, date, venue
        $sitecontent = file_get_contents($eventDetail[$j]['pagelink']);
        $dom = new DOMDocument();
        libxml_use_internal_errors(TRUE);
        $dom->loadHTML($sitecontent);
        libxml_use_internal_errors(FALSE);
        $deepxpath = new DOMXPath($dom);
        $descriptions = $deepxpath->query('//div[@class="columns"]/div[contains(@class,"card-white")]/p[contains(@class,"")]/span[1]/following-sibling::text()[1]');
        $eventDetail[$j]['description'] = $descriptions[$a]->nodeValue;
        // get date
        $dates = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[1]');
        $eventDetail[$j]['Date'] = $dates[$a]->nodeValue;
        // get venue
        $venues = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[2]');
        $eventDetail[$j++]['venue'] = $venues[$a]->nodeValue;
    }
}
?>
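One way to handle the third level (a minimal sketch, not the original poster's code): since each detail page describes a single event, query the detail document with item(0) instead of reusing the outer index $a, and guard against missing nodes. The XPath expressions below are copied from the question and assumed to still match the site's markup.
<?php
// Sketch only: fetch one detail page and return its single description/date/venue.
function scrapeDetail($url)
{
    $html = file_get_contents($url);
    if ($html === false) {
        return null; // request failed
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_use_internal_errors(false);
    $xp = new DOMXPath($dom);

    // Each detail page holds a single event, so always take the first match.
    $desc  = $xp->query('//div[@class="columns"]/div[contains(@class,"card-white")]/p[contains(@class,"")]/span[1]/following-sibling::text()[1]')->item(0);
    $date  = $xp->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[1]')->item(0);
    $venue = $xp->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[2]')->item(0);

    return array(
        'description' => $desc  ? trim($desc->nodeValue)  : null,
        'Date'        => $date  ? trim($date->nodeValue)  : null,
        'venue'       => $venue ? trim($venue->nodeValue) : null,
    );
}

// Inside the outer loop over $pageLinks, merge the detail data into the card data:
// $detail = scrapeDetail($eventDetail[$j]['pagelink']);
// if ($detail !== null) { $eventDetail[$j] = array_merge($eventDetail[$j], $detail); }
?>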
I have the following:
$node = $doc->getElementsByTagName('img');
if ($node->item(0) == null || $node->item(0) == '') {
// do stuff
} elseif ($node->item(0)->hasAttribute('src')) {
// do other stuff
} else {
// do more other stuff
}
What I want is to only return images from the body tag.
I have tried:
$body = $doc->getElementsByTagName('body');
foreach ($body as $body_node) {
$node = $body_node->getElementsByTagName('img');
}
However, if there is an image in the head it still seems to get returned by
$node->item(0)->hasAttribute('src')
Personally I don't think there should ever be an img in the head, but I find some URLs add them inside a noscript tag in the head.
So how do I return only images from the body tag, excluding any found in the head tag?
Do it using DOMXPath:
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
$nodes is now a DOMNodeList that you can iterate over.
If you only want img nodes that have a src attribute:
$nodes = $xpath->query('//body//img[@src]');
Edit: Here is a fully working example:
<?php
$contents = file_get_contents('http://stackoverflow.com/');
$doc = new DOMDocument();
$doc->loadHTML($contents);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query('//body//img');
foreach ($nodes as $node) {
echo $node->getAttribute('src') . "\n";
}
Excuse my English, everybody.
I get a white page when I try to query the content in the DIV container at the URL below.
$html = file_get_contents('https://www.imdb.com/search/title?title_type=feature,tv_movie&release_date=,2018'); //get the html returned from the following url
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$xpath = new DOMXPath($doc);
//get all the lister-item-image link wrappers
$row = $xpath->query("//div[contains(@class, 'lister-item-image') and contains(@class, 'float-left')]/a");
if($row->length > 0){
foreach($row as $row){
echo $row->nodeValue . "<br/>";
}
}
}
The content can be found within these DIVs:
<div class="lister-item-image float-left">
<a href="/title/tt1502407/?ref_=adv_li_i"
> <img alt="Halloween"
class="loadlate"
loadlate="https://m.media-amazon.com/images/M/MV5BMmMzNjJhYjUtNzFkZi00MWQ4LWJiMDEtYWM0NTAzNGZjMTI3XkEyXkFqcGdeQXVyOTE2OTMwNDk@._V1_UX67_CR0,0,67,98_AL_.jpg"
data-tconst="tt1502407"
height="98"
src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png"
width="67" />
</a> </div>
I mainly want to query the name, link, genre and length. A maximum of 50 should be displayed, with a "Next" link to query the next 50.
Thank you in advance for any help.
Working version:
Thanks to Mohammad.
$html = file_get_contents('https://www.imdb.com/search/title?title_type=feature,tv_movie&release_date=,2018'); //get the html returned from the following url
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$xpath = new DOMXPath($doc);
//get all the lister-item-image divs
$row = $xpath->query("//div[contains(@class, 'lister-item-image') and contains(@class, 'float-left')]");
if($row->length > 0){
foreach($row as $row){
echo $doc->saveHtml($row) . "<br/>";
}
}
}
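To get the name, link, genre and length instead of the raw HTML, one option (a sketch, not part of the original answer) is to walk each item's content block. The class names lister-item-content, lister-item-header, genre and runtime are assumptions based on IMDb's list markup and should be checked against the actual page.
$items = $xpath->query("//div[contains(@class, 'lister-item-content')]");
foreach ($items as $item) {
    // relative queries scoped to the current item
    $titleLink = $xpath->query(".//h3[contains(@class, 'lister-item-header')]/a", $item)->item(0);
    $genre     = $xpath->query(".//span[contains(@class, 'genre')]", $item)->item(0);
    $runtime   = $xpath->query(".//span[contains(@class, 'runtime')]", $item)->item(0);

    echo ($titleLink ? trim($titleLink->nodeValue) : '-') . ' | ';
    echo ($titleLink ? 'https://www.imdb.com' . $titleLink->getAttribute('href') : '-') . ' | ';
    echo ($genre ? trim($genre->nodeValue) : '-') . ' | ';
    echo ($runtime ? trim($runtime->nodeValue) : '-') . "<br/>";
}
For the next 50, the usual route is to request the same URL again with a start offset (for example &start=51) or to follow the page's own "Next" link; both are assumptions to verify against the markup.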
I have this script to extract data from multiple pages of the same website. There are some 120 pages.
Here is the code I'm using to get for a single page.
$html = file_get_contents('https://www.example.com/product?page=1');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('div');
foreach ($links as $link){
file_put_contents('products.txt', $link->getAttribute('data-product-name') .PHP_EOL, FILE_APPEND);
}
How can I do it for multiple pages? The links for those pages are incremental, so the next page will be https://www.example.com/product?page=2 and so on. How can I do it without creating a different file for each link?
What about this:
function extractContent($page)
{
$html = file_get_contents('https://www.example.com/product?page='.$page);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('div');
foreach ($links as $link) {
// skip empty attributes
if (empty($link->getAttribute('data-product-name'))) {
continue;
}
file_put_contents('products.txt', $link->getAttribute('data-product-name') .PHP_EOL, FILE_APPEND);
}
}
for ($i=1; $i<=120; $i++) {
extractContent($i);
}
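One small usage note (an addition, not part of the original answer): since the function appends to products.txt, you may want to truncate the file once before the loop so repeated runs don't duplicate entries, and optionally pause between requests.
// Optional: start from an empty file and be gentle with the server (sketch).
file_put_contents('products.txt', '');
for ($i = 1; $i <= 120; $i++) {
    extractContent($i);
    sleep(1); // small delay between requests
}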
<?php
$i=1;
while ($i<=5) {
# code...
$url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_0#'.$i;
echo $url;
$html= file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$classname="zg_title";
$elements = $xPath->query("//*[contains(@class, '$classname')]");
foreach ($elements as $e)
{
$lnk = $e->getAttribute('href');
$e->setAttribute("href", "http://www.amazon.in".$lnk);
$newdoc = new DOMDocument;
$e = $newdoc->importNode($e, true);
$newdoc->appendChild($e);
$html = $newdoc->saveHTML();
echo $html;
}
$i++;
}
?>
I am trying to crawl the Amazon bestsellers page, which lists the top 100 bestseller items with 20 items per page. In every loop the $i value is changed and appended to the URL, but only the first 20 items are displayed, five times over. I think this has something to do with the AJAX pagination, but I am not able to figure out what it is.
Try this:
<?php
$i=1;
while ($i<=5) {
# code...
$url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_'.$i.'?ie=UTF8&pg='.$i;
echo $url;
$html= file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$classname="zg_title";
$elements = $xPath->query("//*[contains(@class, '$classname')]");
foreach ($elements as $e)
{
$lnk = $e->getAttribute('href');
$e->setAttribute("href", "http://www.amazon.in".$lnk);
$newdoc = new DOMDocument;
$e = $newdoc->importNode($e, true);
$newdoc->appendChild($e);
$html = $newdoc->saveHTML();
echo $html;
}
$i++;
}
?>
Change your $url: the page number has to go into the pg query parameter (?ie=UTF8&pg=$i), not into the URL fragment after #, which is never sent to the server.
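For readability, the same paginated URL can also be assembled with http_build_query (just a sketch of the URL used in the answer above):
// Build the paginated bestseller URL; parameters match the answer above.
$base = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_' . $i;
$url  = $base . '?' . http_build_query(array('ie' => 'UTF8', 'pg' => $i));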
I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href within the tags that contain the class 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.
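As a follow-up usage sketch (not part of the original answer), the matched nodes' href attributes can be pulled out of that DOMNodeList like this:
// iterate the DOMNodeList and print each link target
foreach ($hrefs as $a) {
    echo $a->getAttribute('href'), "\n";
}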
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach($hyperlinks as $hyperlink) {
echo $hyperlink->getAttribute('href'), '<br>';
}
If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need: http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done: find() returns an array of elements, so loop and print each href
foreach ($links as $link) {
    echo $link->href . "\n";
}
Tested it and made some changes; this works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
$url = $link->getAttribute('href');
$class = $link->getAttribute('class');
// Check if the URL is not empty and if the class contains thumbnail
if(!empty($url) && strpos($class,'thumbnail') !== false) {
array_push($links, $url);
}
}
// Print results
print_r($links);
?>