So, I want to crawl a webpage? [duplicate] - php

This question already has answers here and was closed 11 years ago.
Possible duplicates:
How to write a crawler?
Best methods to parse HTML
I've always wondered how to do something like this. I am not the owner/admin/webmaster of the site (http://poolga.com/); however, the information I wish to obtain is publicly available. This page (http://poolga.com/artists) is a directory of all of the artists who have contributed to the site. However, the links on this page go to another page, which contains an anchor tag with the link to the artist's actual website.
<a id="author-url" class="helv" target="_blank" href="http://aaaghr.com/">http://aaaghr.com/</a>
I hate having to command-click the links in the directory and then click the link to the artist's website. I would love a way to have a batch of 10 of the artist website links appear as tabs in the browser just for temporary viewing. However, just getting these hrefs into some sort of array would be a feat in itself. Any idea or direction / Google searches, in any programming language, would be great! Would this even be referred to as "crawling"? Thanks for reading!
UPDATE
I used Simple HTML DOM on my local PHP MAMP server with this script; it took a little while!
include_once('simple_html_dom.php');

// Collect the artist page URLs from the directory.
$artistPages = array();
foreach (file_get_html('http://poolga.com/artists')->find('div#artists ol li a') as $element) {
    $artistPages[] = $element->href;
}

// Visit each artist page and print the link to the artist's own site.
foreach ($artistPages as $page) {
    foreach (file_get_html($page)->find('a#author-url') as $element) {
        echo $element->href . '<br>';
    }
}

My favourite PHP library for navigating through the DOM is Simple HTML DOM.
include_once('simple_html_dom.php');

set_time_limit(0); // many pages to fetch, so disable the execution time limit

$poolga = file_get_html('http://poolga.com/artists');
$inRefs = $poolga->find('div#artists ol li a');
$links  = array();
foreach ($inRefs as $ref) {
    // Load each artist's page and grab the link to their own site.
    $site    = file_get_html($ref->href);
    $links[] = $site->find('a#author-url', 0)->href;
}
print_r($links);
Code, I think, is pretty self-explanatory.
Edit: Had a spelling mistake. The script will take a really, really long time to finish, since there are so many links; that's why I used set_time_limit(). Go do other stuff and let the script run.

Use some function to loop through the artist subpages (using jQuery as an example):
$("#artists li a").each(function () { /* fetch each artist page here */ });
(each entry is under a <li> inside the <div id="artists">)
Then you will have to read each page and search for the element <div id="artistSites"> or the <h2 id="author">:
$("#author a").attr("href");
The implementation details will depend on how different each page is. I only looked at two, so it may be a little more complicated than this.

Related

web scrape php with clickable links

I'm trying to do a fun little project where I basically take headlines, for example from a news site, scrape/mirror them onto an additional site using PHP, and then have the data displayed on the new site actually be clickable links back to the original site. If that's a bit confusing, let me show an example.
http://www.wilsonschlamme.com/test.php
Right there, I'm using PHP to scrape all the data from the Antrim Review (a local Michigan news site) contained in a <span> with a class.
I chose the span class because that's where their headlines are located. I'm just using Antrim for testing purposes; I have no affiliation with them.
What I'm wondering, and what I don't know how to do, is how to actually make these headlines that are redisplayed on my test site into clickable links. In other words, retain the <a href> of these headlines so they stay clickable links to the full articles. Put differently, on the Antrim website those headlines are clickable links to full pages; when mirrored on my test website presently, there are clearly no links, because nothing is grabbing that data.
Does anyone know how this could be done? or any thoughts? Would really appreciate it, this is a fun project, just lacking the knowledge on how to complete it.
Oh, and I know the Pokemon references down below are lolsy. It's because I'm working with code originally from a tutorial somewhere lol:
<?php
$html = file_get_contents('http://www.antrimreview.net/'); // get the html returned from the url
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); // disable libxml errors
if (!empty($html)) { // if any html is actually returned
    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); // remove errors for yucky html
    $pokemon_xpath = new DOMXPath($pokemon_doc);
    // get all the spans that have a class attribute
    $pokemon_row = $pokemon_xpath->query('//span[@class]');
    if ($pokemon_row->length > 0) {
        foreach ($pokemon_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
I actually found it simpler to just use a CNN RSS feed, for example, using surfing-waves to generate the code. Thanks for the suggestions anyway.
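For anyone going the RSS route: here is a minimal sketch using SimpleXML that keeps each headline clickable. The CNN feed URL is illustrative; swap in whichever feed you generate.
<?php
// Sketch: read an RSS feed and echo each headline as a clickable link
// back to the original article. The feed URL is just an example.
$feed = simplexml_load_file('http://rss.cnn.com/rss/cnn_topstories.rss');
if ($feed !== false) {
    foreach ($feed->channel->item as $item) {
        // each <item> carries the headline and the link to the source article
        echo '<a href="' . htmlspecialchars((string) $item->link) . '">'
           . htmlspecialchars((string) $item->title) . '</a><br/>';
    }
}
?>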

simple html php dom parser a way to gather information from all pages?

I wanted to know if it's possible to make this code search every page of that website so it pulls every image src from all pages. Currently it only pulls the image srcs from that one page. I tried using a while loop, but it only repeats the same results from the main page over and over. Any help would be great.
<?php
include_once('simple_html_dom.php');
// show errors
ini_set('display_errors', true);
error_reporting(E_ALL);

$html = file_get_html('http://betatv.net/');
$result = ($html);
// Note: this condition is always true and $result never changes, which is
// why the same front-page results repeat over and over.
while ($html = ($result)) {
    // find the show images and echo them out
    foreach ($html->find('.entry-content') as $cover_img) {
        foreach ($cover_img->find('img') as $cover_img_link) {
            // echo the image's src
            echo $cover_img_link->src . '<br>';
        }
    }
    echo '<br>';
}
// clean up memory
$html->clear();
unset($html);
?>
Proof that I own betatv.net: I added a link to this question on the front page.
Here is a nice example of a page crawler:
How do I make a simple crawler in PHP?
You just need to run your piece of code for each link it finds.
Also, if you own this page, I bet there is a better way to find all the images than crawling the site from the frontend.
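To make that concrete, here is a minimal sketch of such a crawler, assuming Simple HTML DOM is available. The page limit and same-host check are illustrative safeguards, and relative links are skipped for brevity.
<?php
include_once('simple_html_dom.php');

// Breadth-first sketch: visit same-site pages starting from the front page
// and run the image-extraction loop from the question on each one.
$base    = 'http://betatv.net/';
$queue   = array($base);
$visited = array();
$limit   = 20; // stop after this many pages so the script terminates

while (!empty($queue) && count($visited) < $limit) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $html = file_get_html($url);
    if (!$html) {
        continue; // fetch failed, move on
    }

    // same extraction as in the question: image srcs inside .entry-content
    foreach ($html->find('.entry-content img') as $img) {
        echo $img->src . '<br>';
    }

    // queue further absolute links that stay on the same host
    foreach ($html->find('a') as $a) {
        $href = $a->href;
        if (strpos($href, $base) === 0 && !isset($visited[$href])) {
            $queue[] = $href;
        }
    }

    $html->clear();
    unset($html);
}
?>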

Using php include to add second/third/fourth file from a folder?

This is a two-part question from a NOVICE, so if the answers could be explained carefully, that would be appreciated.
Currently I'm using this code to add the latest news article to my front page from the News folder; each article is a separate HTML page.
<?php
$files  = glob('news/*.html');
sort($files);
$newest = array_pop($files);
include($newest);
?>
But how would I go about adding the second, the third, the fourth, and so on, file from said folder without adding all of them?
Now the second question: how do I create an "echo" function in the same way to link to these articles? Currently I use this simple method, Grass Lands, but I have to manually do it every time a new article comes. I thought of using this. (Note: all the news HTML pages are named "20130207 Grass Lands.html", "20130206 Demons vs Fairyland", and so on.)
<a href="# <?php $files = glob('news/*.html');
sort($files);
$newest = array_pop($files);
echo $newest; ?> "> <?php $files = glob('news/*.html');
sort($files);
$newest = array_pop($files);
echo $newest; ?>
</a>
But the button ends up reading "news/20130207 Grass Lands.html". How do I cut out the "news/20130207" and the ".html" parts and just leave "Grass Lands"?
OK, so you want to generate static pages and still have the minimum functionality of a CMS.
The first thing you have to do is create a naming rule for the urls/files; each name has to be unique,
like: number-varchar1-varchar2-varchar3.html
The number needs to be incremented each time, not a random number!
Now, every time you need to list the articles/pages, you can do it in two ways:
a. load all the articles from a static page that you create/refresh each time you add news
b. load the files from that folder using a scan method
Now sort the links by splitting each filename with explode(), using the - as the delimiter; sort by the number, descending, because you want the newest news on top (a sketch follows below).
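A minimal sketch of option b under that naming rule (the folder and file names here are hypothetical, and the array dereference needs PHP 5.4+):
<?php
// Sketch: scan the folder and sort by the leading number, descending.
// Filenames follow the suggested rule, e.g. "12-my-article-title.html".
$files = glob('articles/*.html');

usort($files, function ($a, $b) {
    // the first dash-separated piece of the basename is the incrementing number
    $numA = (int) explode('-', basename($a))[0];
    $numB = (int) explode('-', basename($b))[0];
    return $numB - $numA; // descending: newest first
});

// include only the newest N articles instead of all of them
foreach (array_slice($files, 0, 5) as $f) {
    include($f);
}
?>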
But I have some questions:
How will you edit the news? Will you edit the files manually?
You will need .htaccess skills to use SEO-friendly urls; do you know how .htaccess rules work?
Why don't you use WordPress or the Yii framework?
Yii does miracles; I could teach you.
I have solved the second part of my own dilemma and want to leave the code here for future amateurs who want to know how to do this.
<?php
$files = glob('news/*news.php');
rsort($files); // reverse sort: filenames start with the date, so newest first
$before = '<a href="#';
$after  = '</div></a>';
foreach ($files as $f) {
    // Strip the leading "news/20130207 " (14 characters) and the trailing
    // " news.php" (9 characters), leaving just the article title.
    $f    = substr($f, 14, -9);
    $link = $before . $f . '"><div>' . $f . $after;
    echo $link;
}
?>
All this creates a button that looks like this:
<div>Demons vs Fairyland, another TD game Not as bad as it sounds</div>
Now, on the article side, I added this code to the top of each article, which creates an anchor from the file's name:
<a id="<?php echo substr(basename(__FILE__, ".php"), 9); ?>" name="<?php echo substr(basename(__FILE__, ".php"), 9); ?>"></a>
This creates an anchor that will look like this:
<a id="Demons vs Fairyland, another TD game Not as bad as it sounds" name="Demons vs Fairyland, another TD game Not as bad as it sounds"></a>
It took me days to do this; I hope the next guy will just use my code and save himself the torture.

JS, PHP Dynamic Content and Google Crawlers

I have a series of about 25 static sites I created that share the same info, and I was having to change inane bits of copy here and there, so I wrote this JavaScript so all the sites pull the content from one location (shortened to one example):
var dataLoc = "<?=$resourceLocation?>";
$("#listOne").load(dataLoc+"resources.html #listTypes");
When the page loads, it finds the div with id listOne and replaces its contents with the contents of the div labeled listTypes in the file resources.html, and only that div's contents.
My question: Google is not crawling this dynamic content at all. I am told Google will crawl dynamically imported information, so I'm curious what I am currently doing that needs to be improved.
I assumed JS was just skipped by the Google spider, so I used PHP to access the same HTML file as before, and it is working slightly, but not how I need it. This returns the text, but I need the markup as well: the <li>, <p>, <img> tags, and so on. Perhaps I could tweak this? (I am not a developer, so I have just tried a few dozen things I read in the PHP online help, and this is as close as I got.)
function parseContents($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc  = new DOMDocument();
    @$doc->loadHTML($page); // '@' suppresses warnings from messy HTML; '#' would comment the call out entirely
    $divs = $doc->getElementsByTagName('div');
    foreach ($divs as $div) {
        if ($div->getAttribute('id') === $divID) {
            echo $div->nodeValue; // nodeValue yields text only; all markup is stripped
        }
    }
}
parseContents('listOfStuff');
Thanks for your help in understanding this a little better; let me know if I need to explain it any better :)
See Making AJAX Applications Crawlable published by Google.
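On the PHP fallback in the question: nodeValue strips markup by design. A minimal tweak that serializes the div's children with saveHTML() keeps the inner tags instead; this is a sketch assuming the same file layout as the question, and saveHTML() with a node argument needs PHP 5.3.6+.
<?php
// Sketch: return the div's inner HTML rather than its bare text.
function parseContentsHtml($divID)
{
    $page = file_get_contents('content/resources.html');
    $doc  = new DOMDocument();
    @$doc->loadHTML($page);
    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $divID) {
            // saveHTML($node) serializes a node with its tags intact,
            // so <li>, <p>, <img> and friends survive
            foreach ($div->childNodes as $child) {
                echo $doc->saveHTML($child);
            }
        }
    }
}
parseContentsHtml('listOfStuff');
?>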

Scrape website to retrieve certain li elements

I'm running a lottery syndicate and want to automate our system to check for the lottery numbers (UK National Lottery)
The URL I am getting is https://www.national-lottery.co.uk/player/p/results/lotto.ftl
and I am using:
<?php
$html = file_get_contents("https://www.national-lottery.co.uk/player/p/results/lotto.ftl");
?>
I would like to be able to grab one area of the page, namely the numbers.
The problem is that there is a lot of content on that page, and I don't know the first step I would take to break it all down.
Does anyone know a way to do this in PHP or jQuery?
Thanks
What about an existing RSS feed? http://www.alllotto.co.uk/rss/latest.rss
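A minimal sketch of reading that feed with SimpleXML, assuming the latest draw is the first item and the numbers appear in its title or description (check the real feed's layout before relying on this):
<?php
// Sketch: pull the latest draw from the RSS feed instead of scraping the page.
$feed = simplexml_load_file('http://www.alllotto.co.uk/rss/latest.rss');
if ($feed !== false && isset($feed->channel->item[0])) {
    $latest = $feed->channel->item[0];
    echo $latest->title . "<br>";
    // if the numbers live in the description, pull the digits out of it
    preg_match_all('/\d+/', (string) $latest->description, $matches);
    print_r($matches[0]); // the drawn numbers, per the feed's layout
}
?>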
I would take a look at the PHP Simple HTML DOM Parser. It simplifies scraping and does what you're asking.
Using this, finding LI elements is as easy as this:
include_once('simple_html_dom.php');
$html = file_get_html('https://www.national-lottery.co.uk/player/p/results/lotto.ftl');
foreach ($html->find('li') as $element) {
    echo $element . '<br>';
}
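Since that page has a lot of content, the next step would be narrowing the selector to the container that actually holds the numbers. This continues from the snippet above; ul.balls is a hypothetical class name, so inspect the page to find the real one.
// narrow to the list that holds the drawn balls (hypothetical selector)
foreach ($html->find('ul.balls li') as $ball) {
    echo $ball->plaintext . ' ';
}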
