PHP: automatically go to the next page and scrape - php

I'm new to PHP and I'm trying to code a tool that scrapes Amazon product titles.
Right now I can scrape the first page, but I need the tool to keep going to the next page until there are no pages left, doing the same scraping task on each page.
Here is the code:
<?php
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for ($i = 0; $i < count($links[1]); $i++) {
    echo $links[1][$i] . '<br>';
}
?>
Any help is appreciated.

To get the HTML of all pages as one variable, this would do the trick:
<?php
$html = '';
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
// note: this assumes the captured span contents are URLs that can be fetched
for ($i = 0; $i < count($links[1]); $i++) {
    $html .= file_get_contents($links[1][$i]);
}
echo "all pages combined:\n".$html;
?>
However, more than likely your server will time out, run out of memory, or something else will go wrong. To scrape HTML content you'd be better off creating a URL list first, then scraping the pages one at a time. You could do this via an HTML page that calls the scraper via AJAX.
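A minimal sketch of the page-by-page approach, assuming the page number can be driven via the &page= parameter that already appears in your URL and that the same title regex works on every page:

<?php
$base = 'http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&ie=UTF8&page=';

for ($page = 1; ; $page++) {
    $file_string = @file_get_contents($base . $page);
    if ($file_string === false) {
        break; // request failed; stop paging
    }
    preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
    if (count($links[1]) === 0) {
        break; // no titles found; assume we are past the last page
    }
    foreach ($links[1] as $title) {
        echo $title . '<br>';
    }
    sleep(1); // be polite between requests
}
?>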

Related

:not CSS Selector Implementation Issue

I am crawling links from a website (this one), but the structure of the website creates unwanted additional output. Basically, the <a> tags contain the name of an article as well as additional information (images and the sources of those images). I would like to get rid of the additional information. I found the :not selector for this, but I guess I am implementing it wrong, because every combination I have tried gives me no output at all.
Here is the output.
Here is the code I need to alter:
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
(I have also tried figure:not and a couple of other combinations)
Does anyone know where I went wrong, and what I have to do to exclude the <figure> tag?
Here is my full code, not sure if that helps:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for ($i = 0; $i < $limit; $i++) {
    $post = $posts[$i];
    $post->href = 'http://www.theatlantic.com'.$post->href;
    echo strip_tags($post, '<p><a>'); //echo ($post);
}
?>
</div>
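Note: Simple HTML DOM's selector support is limited, and I don't believe it implements :not at all; the bracket form :not[figure] is in any case not standard CSS, which uses :not(figure). A minimal sketch of a workaround, assuming the goal is simply to drop the <figure> children: select the anchors, blank out each <figure>, then read the remaining text:

$posts = $html->find('ul[class=river] a[data-omni-click=inherit]');
foreach ($posts as $post) {
    // remove the unwanted <figure> children before reading the text
    foreach ($post->find('figure') as $figure) {
        $figure->outertext = '';
    }
    echo strip_tags($post, '<p><a>');
}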

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents contained within certain parts of a website using selectors, and I am using Simple HTML DOM to do this. However, for some reason more data is returned than is present in the selectors I specify. I have checked the Simple HTML DOM FAQ but did not see anything that could help me out, and I wasn't able to find anything on Stack Overflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags, like p class="dek has-dek", that are not contained within the h2 tags and should not be included. This is really strange, as I thought the code would only allow content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for ($i = 0; $i < $limit; $i++) {
    $post = $posts[$i];
    $post->find('h2[class=hed]', 0)->outertext = "";
    echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I get information about the author, details on the article, and more.
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change the code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach ($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach ($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
You really only need one loop. Consider this:
foreach ($html->find('ul.river > h2.hed') as $postNum => $h2) {
    if ($postNum >= 10) break;
    echo strip_tags($h2, '<p><a>') . "\n"; // the text
    echo $h2->parent->href . "\n";         // the href
}

How to display image url from website sub pages using php code

I am using the PHP code below to display images from webpages. It is able to display image URLs from the main page, but unable to display image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');
$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach ($html->find('img') as $img)
{
    echo $img->src . "<br />";
    echo $img . "<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com links to, then of course that script won't show any of those, since you're never loading those pages.
You basically have to write a spider that not only finds images but also anchor tags, and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code
$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];
while ( ! empty($todo)) {
    $link = array_shift($todo);
    $done[] = $link;
    $html = get html;
    $images += find <img> tags;
    $newLinks = find <a> tags;
    remove all external links and all links already in $done from $newLinks;
    $todo += $newLinks;
}
Or something like that...
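A minimal runnable version of that sketch, using Simple HTML DOM as the question already does; treating any URL that does not start with the site's own host as external is my assumption, and relative hrefs would still need resolving against the current page first:

<?php
include_once('simple_html_dom.php');

$todo   = ['http://fffmovieposters.com/'];
$done   = [];
$images = [];

while (!empty($todo)) {
    $link   = array_shift($todo);
    $done[] = $link;

    $page = @file_get_contents($link);
    if ($page === false) {
        continue; // skip pages that fail to load
    }
    $html = str_get_html($page);
    if (!$html) {
        continue; // skip pages that fail to parse
    }

    // collect image URLs from this page
    foreach ($html->find('img') as $img) {
        $images[] = $img->src;
    }

    // queue internal links we have not seen yet
    foreach ($html->find('a') as $a) {
        $href = $a->href;
        if (strpos($href, 'http://fffmovieposters.com') === 0
                && !in_array($href, $done)
                && !in_array($href, $todo)) {
            $todo[] = $href;
        }
    }
    $html->clear(); // free memory; simple_html_dom leaks otherwise
}

print_r($images);
?>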

How to crawl a site with server-generated content?

I am writing a simple PHP crawler that gets data from a website and inserts it into my database. I start with a predefined URL, then go through the contents of that page (from PHP's file_get_contents) and eventually call file_get_contents on links found on that page. The URLs I get from the links are fine when I echo them and open them in my browser on their own. However, when I use file_get_contents and echo the result, the page does not appear correctly, because of errors related to dynamically created server-side data from the site. The echoed page contents do not include the listed data from the server that I need, because the page cannot find the resources it relies on.
It appears that relative paths in the echoed webpage are preventing the desired content from being generated.
Can anyone point me in the right direction here?
Any help is appreciated!
Here is some of my code so far:
function crawl_all($url)
{
    $main_page = file_get_contents($url);
    while (strpos($main_page, '"fl"') > 0)
    {
        $subj_start = strpos($main_page, '"fl"');      // get start of subject row
        $main_page = substr($main_page, $subj_start);  // cut off everything before subject row
        $link_start = strpos($main_page, 'href') + 6;  // get the start of the subject link
        $main_page = substr($main_page, $link_start);  // cut off everything before subject link
        $link_end = strpos($main_page, '">') - 1;      // get the end of the subject link
        $link_length = $link_end + 1;
        $link = substr($main_page, 0, $link_length);   // get the subject link
        crawl_courses('https://whatever.com' . $link);
    }
}

/* Crawls all the courses for a subject. */
function crawl_courses($url)
{
    $subj_page = file_get_contents($url);
    echo $url;       // website looks fine when opened in browser
    echo $subj_page; // when echoed, the page does not contain most of the server-side generated data I need
    while (strpos($subj_page, '<td><a href') > 0)
    {
        $course_start = strpos($subj_page, '<td><a href');
        $subj_page = substr($subj_page, $course_start);
        $link_start = strpos($subj_page, 'href') + 6;
        $subj_page = substr($subj_page, $link_start);
        $link_end = strpos($subj_page, '">') - 1;
        $link_length = $link_end + 1;
        $link = substr($subj_page, 0, $link_length);
        //crawl_professors('https://whatever.com' . $link);
    }
}
Try the Advanced HTML DOM parser. It is here:
http://sourceforge.net/projects/advancedhtmldom/
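On the relative-path symptom itself: if the echoed page fails to render because its stylesheets, scripts, and images use relative URLs, one common workaround is to inject a <base> element pointing at the original site before echoing. A minimal sketch, assuming $subj_page holds the fetched HTML and reusing the placeholder domain from the question:

// assumes $subj_page holds the fetched HTML of the page at https://whatever.com
$base = '<base href="https://whatever.com/">';
$subj_page = preg_replace('/<head([^>]*)>/i', '<head$1>' . $base, $subj_page, 1);
echo $subj_page;

Note this only helps with resources referenced by relative URLs. If the missing data is generated by JavaScript in the browser, file_get_contents will never see it, and you would need a headless-browser approach instead.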

Trying to scrape the entire content of a div

I have this project I'm working on and I'd like to add a really small list of nearby places, using Facebook's Places in an iframe, featured from touch.facebook.com. I can easily just use touch.facebook.com/#/places_friends.php, but then that loads the headers and the other navigation bars (for messages, events, etc.) and I just want the content.
I'm pretty sure, from looking at the touch.facebook.com/#/places_friends.php source, all I need to load is the div "content". Anyway, I'm extremely new to PHP and I'm pretty sure what I'm trying to do is called web scraping.
For the sake of figuring things out on Stack Overflow, and so I don't need to worry about authentication or anything yet, I want to load the login page to see if I can at least get the scraper to work. Once I have working scraping code, I'm pretty sure I can handle the rest. It has to load everything inside the div. I've seen this done before, so I know it is possible, and it will look exactly like what you see when you try to log in at touch.facebook.com, but without the blue Facebook logo up top. That's what I'm trying to accomplish here.
So here's the login page. I'm trying to load the div which contains the text boxes and the actual login button. If it's done correctly, we should just see those, with no blue Facebook header bar above them.
I've tried
<?php
$page = file_get_contents('http://touch.facebook.com/login.php');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div) {
    if ($div->getAttribute('id') === 'login_form') {
        echo $div->nodeValue;
    }
}
?>
All that does is load a blank page.
I've also tried using http://simplehtmldom.sourceforge.net/
and I modified the basic selector example to:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://touch.facebook.com/login.php');
foreach ($html->find('div#login_form') as $e)
    echo $e->nodeValue;
?>
I've also tried
<?php
$stream = "http://touch.facebook.com/login.php";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[#id=login_form]");
for ($i = 0; $i < count($result); $i++) {
echo $result[$i];
}
?>
That did not work either.
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath('/html/body/div[@id="content"]');
for ($i = 0; $i < count($result); $i++) {
    echo $result[$i];
}
There was a syntax error in this line; I removed it here. Now just copy and paste and run this code.
I'm assuming that you can't use the Facebook API; if you can, then I strongly suggest you use it, because you will save yourself the whole scraping ordeal.
To scrape text, the best technique is XPath. If the HTML returned by touch.facebook.com is XHTML Transitional, which it should be, then you can use XPath. A sample should look like this:
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $nct->xpath("/html/body/div[#id=content]");
for ($i = 0; $i < $i < count($result); $i++){
echo $result[$i];
}
You need to learn about your comparison operators.
=== is for strict comparison; you should be using == here:
if ($div->getAttribute('id') == 'login_form')
{
}
Scraping isn't always the best idea for capturing data elsewhere. I would suggest using Facebook's API to retrieve the values you need; scraping will break any time Facebook decides to change their markup.
http://developers.facebook.com/docs/api
http://github.com/facebook/php-sdk/
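For illustration, a minimal sketch of the API route using the old php-sdk linked above; the appId/secret values are placeholders, and the place-search endpoint and its parameters are assumptions you would need to check against the current Graph API docs:

<?php
require 'facebook.php'; // from the php-sdk repository above

$facebook = new Facebook(array(
    'appId'  => 'YOUR_APP_ID',     // placeholder
    'secret' => 'YOUR_APP_SECRET', // placeholder
));

// hypothetical place search near a latitude/longitude
$places = $facebook->api('/search', 'GET', array(
    'type'     => 'place',
    'center'   => '37.76,-122.427',
    'distance' => 1000,
));

foreach ($places['data'] as $place) {
    echo $place['name'] . "\n";
}
?>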
