<?php
$i = 1;
while ($i <= 5) {
    $url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_0#' . $i;
    echo $url;
    $html = file_get_contents($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xPath = new DOMXPath($dom);
    $classname = "zg_title";
    $elements = $xPath->query("//*[contains(@class, '$classname')]");
    foreach ($elements as $e) {
        $lnk = $e->getAttribute('href');
        $e->setAttribute("href", "http://www.amazon.in" . $lnk);
        $newdoc = new DOMDocument;
        $e = $newdoc->importNode($e, true);
        $newdoc->appendChild($e);
        $html = $newdoc->saveHTML();
        echo $html;
    }
    $i++;
}
?>
I am trying to crawl the Amazon bestsellers page, which lists the top 100 bestselling items across 5 pages of 20 items each. On every loop iteration the $i value changes and is appended to the URL, but only the first 20 items are displayed, five times over. I think this has something to do with the AJAX pagination, but I am not able to figure out what it is.
Try this:
<?php
$i = 1;
while ($i <= 5) {
    $url = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_' . $i . '?ie=UTF8&pg=' . $i;
    echo $url;
    $html = file_get_contents($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xPath = new DOMXPath($dom);
    $classname = "zg_title";
    $elements = $xPath->query("//*[contains(@class, '$classname')]");
    foreach ($elements as $e) {
        $lnk = $e->getAttribute('href');
        $e->setAttribute("href", "http://www.amazon.in" . $lnk);
        $newdoc = new DOMDocument;
        $e = $newdoc->importNode($e, true);
        $newdoc->appendChild($e);
        $html = $newdoc->saveHTML();
        echo $html;
    }
    $i++;
}
?>
Change your $url: the part after # is a fragment, which file_get_contents() never sends to the server, so every request was returning page 1. The pg query parameter in the URL above does the paging server-side.
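A quick sanity check with parse_url() makes the difference visible (the expected values for these example strings are shown in the comments):
<?php
$old = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_nav_0#2';
$new = 'http://www.amazon.in/gp/bestsellers/electronics/ref=zg_bs_electronics_pg_2?ie=UTF8&pg=2';

var_dump(parse_url($old, PHP_URL_FRAGMENT)); // "2"  -- stays on the client, Amazon never sees it
var_dump(parse_url($old, PHP_URL_QUERY));    // NULL -- the request carries no paging information
var_dump(parse_url($new, PHP_URL_QUERY));    // "ie=UTF8&pg=2" -- paging happens server-side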
I'm trying to parse a page with XPath, but I can't manage to get the body class.
Here is what I'm trying:
<?php
$url = 'http://figurinepop.com/mickey-paintbrush-disney-funko';
$html = file_get_contents($url);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);

$nodes = $xpath->query('//link[@rel="canonical"]/@href');
foreach ($nodes as $node) {
    $canonical = $node->nodeValue;
}

$nodes = $xpath->query('//html/body/@class');
foreach ($nodes as $node) {
    $bodyclass = $node->nodeValue;
}

$output['canonical'] = $canonical;
$output['bodyclass'] = $bodyclass;
echo '<pre>'; print_r($output); echo '</pre>';
?>
Here is what I get:
Array
(
    [canonical] => http://figurinepop.com/mickey-paintbrush-disney-funko
    [bodyclass] =>
)
It's working with many elements (title, canonical, div...) but not with the body class.
I've tested the XPath query with a Chrome extension and it seems well written.
What is wrong?
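One way to narrow this down is to check what file_get_contents() actually fetched: Chrome inspects the live DOM after JavaScript has run, so a class added by a script would show up there but not in the downloaded HTML. A small diagnostic sketch along those lines (same URL as above):
<?php
$html = file_get_contents('http://figurinepop.com/mickey-paintbrush-disney-funko');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
foreach (libxml_get_errors() as $err) {
    echo trim($err->message), "\n"; // see whether the parser complained about the markup
}
libxml_clear_errors();

// read the attribute straight off the <body> node, bypassing the XPath query
$body = $doc->getElementsByTagName('body')->item(0);
var_dump($body ? $body->getAttribute('class') : null);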
$html = file_get_contents("https://www.wireclub.com/chat/room/music");
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = array();
foreach($xpath->evaluate('//div[#class="message clearfix"]/node()') as $childNode) {
$result[] = $dom->saveHtml($childNode);
}
echo '<pre>'; var_dump($result);
I would like the content of each individual DIV in an array so each one can be processed individually.
This code is clumping every DIV together.
You could retrieve all the divs and get the nodeValue:
$dom = new DOMDocument();
$dom->loadHTML($html);
$myDivs = $dom->getElementsByTagName('div');
foreach ($myDivs as $key => $value) {
    $result[] = $value->nodeValue;
}
var_dump($result);
For selecting by class, you could use your own XPath code:
$xpath = new DOMXPath($dom);
$myElem = $xpath->query("//*[contains(@class, '$classname')]");
foreach ($myElem as $key => $value) {
    $result[] = $value->nodeValue;
}
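Putting the two pieces together for the original page, a minimal sketch (untested against the live site, and assuming the divs really carry class="message clearfix") that keeps each message div as its own array entry could look like this:
<?php
$html = file_get_contents("https://www.wireclub.com/chat/room/music");

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the page is unlikely to be perfectly valid HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classname = "message clearfix";

$result = array();
foreach ($xpath->query("//div[contains(@class, '$classname')]") as $div) {
    // saveHTML() on the node itself keeps every div separate instead of clumped together
    $result[] = $dom->saveHTML($div);
}
echo '<pre>'; var_dump($result);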
I want to scrape a website that has 3 levels of scraping. First I get all the pages; then on each page I get the image, title, and URL that redirects me to a unique page which contains more info like description, date, etc. If I use foreach it gives me false results, and if I use for instead of foreach it returns just one object. How can I handle this?
<?php
$stackHref = array();
$eventDetail = array();
$error_log = '';

$sitecontent = file_get_contents('https://www.everfest.com/music/edm-festivals');
if ($sitecontent === FALSE) {
    $error_log .= 'Error on $sitecontent = file_get_contents(https://www.everfest.com/music/edm-festivals) ';
    //insert_error($error_log);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sitecontent);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

// get the pagination links (total pages)
$nodes = $xpath->query("(//ul[@class='pagination'])[1]/li/a/@href");
if ($nodes === false) { // query() returns false when the expression is invalid
    $error_log .= "Error on \$nodes = \$xpath->query((//ul[@class='pagination'])[1]/li/a/@href)";
    //insert_error($error_log);
    echo $error_log;
}
foreach ($nodes as $link) {
    $stackHref[] = 'https://www.everfest.com' . $link->nodeValue;
}

// loop through each page in order to scrape
$j = 0;
for ($i = 0; $i < count($stackHref); $i++) {
    $sitecontent = file_get_contents($stackHref[$i]);
    if ($sitecontent === FALSE) {
        $error_log .= 'Error on $sitecontent = file_get_contents(' . $stackHref[$i] . ') ';
        //insert_error($error_log);
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($sitecontent);
    libxml_use_internal_errors(FALSE);
    $innerXpath = new DOMXPath($dom);

    // get page link
    $pageLinks = $innerXpath->query('//div[@class="festival-card grow"]/a[1]/@href');
    for ($a = 0; $a < $pageLinks->length; $a++) {
        $eventDetail[$j]['pagelink'] = 'https://www.everfest.com' . $pageLinks[$a]->nodeValue;

        // get img src
        $images = $innerXpath->query("//div[contains(@class,'columns medium-6 large-4')]/div[contains(@class,'grow')]/a/img/@src");
        $eventDetail[$j]['img'] = $images[$a]->nodeValue;

        // get title
        $titles = $innerXpath->query("//div[contains(@class,'clearfix')]/a[1]/text()");
        $eventDetail[$j]['title'] = $titles[$a]->nodeValue;

        // go inside each page in order to get description, date, venue
        $sitecontent = file_get_contents($eventDetail[$j]['pagelink']);
        $dom = new DOMDocument();
        libxml_use_internal_errors(TRUE);
        $dom->loadHTML($sitecontent);
        libxml_use_internal_errors(FALSE);
        $deepxpath = new DOMXPath($dom);

        // get description
        $descriptions = $deepxpath->query('//div[@class="columns"]/div[contains(@class,"card-white")]/p[contains(@class,"")]/span[1]/following-sibling::text()[1]');
        $eventDetail[$j]['description'] = $descriptions[$a]->nodeValue;

        // get date
        $dates = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[1]');
        $eventDetail[$j]['Date'] = $dates[$a]->nodeValue;

        // get venue
        $venues = $deepxpath->query('//div[@id="signup"]/div[@class="row"]/div[contains(@class,"columns")][1]/p/text()[2]');
        $eventDetail[$j++]['venue'] = $venues[$a]->nodeValue;
    }
}
?>
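For what it's worth, one way around the index mismatch is to query relative to each card node (DOMXPath::query() takes a context node as its second argument) and then open the detail page per card, where item(0) is enough because that page describes a single event. A rough sketch, with selectors borrowed and simplified from the question and not re-verified against the live site:
<?php
$eventDetail = array();

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('https://www.everfest.com/music/edm-festivals'));
$xpath = new DOMXPath($dom);

// one iteration per festival card; everything below is looked up inside $card only
foreach ($xpath->query("//div[contains(@class, 'festival-card')]") as $card) {
    $link = $xpath->query(".//a[1]/@href", $card)->item(0);
    $img  = $xpath->query(".//img/@src", $card)->item(0);

    $event = array(
        'pagelink' => $link ? 'https://www.everfest.com' . $link->nodeValue : null,
        'img'      => $img ? $img->nodeValue : null,
    );

    if ($event['pagelink']) {
        // the detail page covers exactly one event, so item(0) is safe here
        $detailDom = new DOMDocument();
        libxml_use_internal_errors(true);
        $detailDom->loadHTML(file_get_contents($event['pagelink']));
        $detailXpath = new DOMXPath($detailDom);

        $date = $detailXpath->query('//div[@id="signup"]//p/text()[1]')->item(0);
        $event['Date'] = $date ? trim($date->nodeValue) : null;
    }

    $eventDetail[] = $event;
}
print_r($eventDetail);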
I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href of the <a> tags that contain the class 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still got nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
  public $length =>
  int(25)
}
25 matches, I'd call it success.
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach ($hyperlinks as $hyperlink) {
    echo $hyperlink->getAttribute('href'), '<br>';
}
If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need: http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done.
echo $links;
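One caveat: find() returns an array of element objects, so echoing it directly only prints "Array". If the goal is the link targets, something like this (relying on simple_html_dom's attribute access, untested here) should print them:
foreach ($links as $link) {
    echo $link->href, "<br>\n";
}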
Tested it and made some changes; this works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();

// loop thru all the A elements found
foreach ($dom->getElementsByTagName('a') as $link) {
    $url = $link->getAttribute('href');
    $class = $link->getAttribute('class');
    // Check if the URL is not empty and if the class contains thumbnail
    if (!empty($url) && strpos($class, 'thumbnail') !== false) {
        array_push($links, $url);
    }
}

// Print results
print_r($links);
?>
I created a scraper for an automoto site. First I want to get all manufacturers and after that all links of models for each manufacturer, but with the code below I get only the first model on the list. Why?
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.auto-types.com');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='clearfix_center']/a/@href");

$output = array();
foreach ($entries as $e) {
    $dom2 = new DOMDocument();
    @$dom2->loadHTMLFile('http://www.auto-types.com' . $e->textContent);
    $xpath2 = new DOMXPath($dom2);

    $data = array();
    $data['newLinks'] = trim($xpath2->query("//div[@class='modelImage']/a/@href")->item(0)->textContent);
    $output[] = $data;
}
echo '<pre>' . print_r($output, true) . '</pre>';
?>
So I need to get mercedes/100, mercedes/200, mercedes/300, but with my script I only get the first link, mercedes/100.
Please help.
You need to iterate through the results instead of just taking the first item:
$items = $xpath2->query("//div[@class='modelImage']/a/@href");
$links = array();
foreach ($items as $item) {
    $links[] = $item->textContent;
}
$data['newLinks'] = implode(', ', $links);
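Slotted back into the original script, the whole thing might look like this (same structure as the question, only the innermost query is iterated; not re-tested against the site):
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.auto-types.com');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='clearfix_center']/a/@href");

$output = array();
foreach ($entries as $e) {
    $dom2 = new DOMDocument();
    @$dom2->loadHTMLFile('http://www.auto-types.com' . $e->textContent);
    $xpath2 = new DOMXPath($dom2);

    // collect every model link for this manufacturer, not just item(0)
    $links = array();
    foreach ($xpath2->query("//div[@class='modelImage']/a/@href") as $item) {
        $links[] = $item->textContent;
    }

    $output[] = array('newLinks' => implode(', ', $links));
}
echo '<pre>' . print_r($output, true) . '</pre>';
?>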