How to create a sitemap with page relationships - php

I'm currently trying to figure out a way to write a script (preferably PHP) that would crawl through a site and create a sitemap. In addition to the traditional standard listing of pages, I'd like the script to keep track of which pages link to other pages.
Example pages
A
B
C
D
I'd like the output to give me something like the following.
Page Name: A
Pages linking to Page A:
B
C
D
Page Name: B
Pages linking to Page B:
A
C
etc...
I've come across multiple standard sitemap scripts, but nothing that really accomplishes what I am looking for.
EDIT
Seems I didn't give enough info. Sorry about my lack of clarity there. Here is the code I currently have. I've used simple_html_dom.php to handle parsing and searching through the HTML for me.
<?php
include("simple_html_dom.php");
$url = 'page_url';
$html = new simple_html_dom();
$html->load_file($url);
$linkmap = array();
foreach($html->find('a') as $link):
    if(contains("cms/education", $link->href)):
        if(!array_key_exists($link->href, $linkmap)):
            $linkmap[$link->href] = array();
        endif;
    endif;
endforeach;
?>
Note: My little foreach loop just filters based on a specific substring in the url.
So, I have the necessary first level pages. Where I am stuck is in creating a loop that will not run indefinitely, while keeping track of the pages you have already visited.

Basically, you need two arrays to control the flow here. The first will keep track of the pages you need to look at and the second will track the pages you have already looked at. Then you just run your existing code on each page until there are none left:
<?php
include("simple_html_dom.php");
$urlsToCheck = array();
$urlsToCheck[] = 'page_url';
$urlsChecked = array();
$linkmap = array();
while(count($urlsToCheck) > 0)
{
    $url = array_pop($urlsToCheck);
    if (!in_array($url, $urlsChecked))
    {
        $urlsChecked[] = $url;
        $html = new simple_html_dom();
        $html->load_file($url);
        foreach($html->find('a') as $link):
            if(contains("cms/education", $link->href)):
                if((!in_array($link->href, $urlsToCheck)) && (!in_array($link->href, $urlsChecked)))
                    $urlsToCheck[] = $link->href;
                if(!array_key_exists($link->href, $linkmap)):
                    $linkmap[$link->href] = array();
                endif;
            endif;
        endforeach;
    }
}
?>
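To produce the "Pages linking to Page X" report the question asks for, the crawl also has to record the current page as a referrer of every link it finds. A sketch of the two missing pieces, using the same variable names as the code above ($url is the page currently being crawled):

```php
<?php
// Inside the foreach over the anchors, after confirming the link is in scope,
// record that $url links to $link->href.
if (!isset($linkmap[$link->href])) {
    $linkmap[$link->href] = array();
}
if (!in_array($url, $linkmap[$link->href])) {
    $linkmap[$link->href][] = $url;   // $url is a backlink of $link->href
}

// After the while loop has drained $urlsToCheck, print the report.
foreach ($linkmap as $page => $referrers) {
    echo "Page Name: $page\n";
    echo "Pages linking to Page $page:\n";
    foreach ($referrers as $referrer) {
        echo "  $referrer\n";
    }
}
```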

Related

Grabbing content of external site CSS class. (steam store)

I have been playing around with this code for a while but can't get it to work properly.
My goal is to display, or maybe even create, a table with IDs of data grabbed from the Steam store, for my own website and game library. The class is 'game_area_description'.
This is a study project of mine.
So I tried to get the table using the following code.
@section('selectedGame')
<?php
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
header("Access-Control-Allow-Origin: ${url}");
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//div[@class="game_area_description"]/a');
$link = $dom->saveHTML($elements->item(0));
echo $link;
?>
@endsection
I am using Laravel, by the way.
In some other cases I can get another piece of the website:
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
$content = file_get_contents($url);
$first_step = explode( '<div class="game_description_snippet">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo "<p>${second_step[0]}</p>";
Here it just takes the excerpt of the webpage, which works in some cases.
Here is the biggest issue: other than not being able to get all the information, I get an error saying $first_step[1] is not valid.
It is some CORS issue.
See, the webpage loads an age check in some cases, like "Batman: Arkham Knight"; the user needs to either log in or verify their age first, which keeps me from using the second block of code.
But the first gives me all kinds of errors, as the screenshot shows.
Does anyone know of a way to grab this part of the page, where the description of the game is?
The answer to my question was in the comments: apparently Steam has some undocumented APIs.
Here is the code (with Bootstrap CSS) that I used and am going to implement in my migration tables and seeder:
@section('selectedGame')
<div class="container border">
<!-- Content here -->
<?php
$url = "http://store.steampowered.com/api/appdetails?appids=".$game->appID;
$jsondata = file_get_contents($url);
$parsed = json_decode($jsondata,true);
$gameID = $game->appID;
$gameDescr = $parsed[$gameID]['data']['about_the_game'];
echo $gameDescr;
?>
</div>
@endsection
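One caveat worth noting: the appdetails endpoint wraps its payload under the app id together with a success flag, so guarding the lookup avoids warnings for delisted or region-locked apps. A sketch (the endpoint is undocumented, so this response shape could change; the app id below is just an arbitrary example):

```php
<?php
// Arbitrary example app id; in the Laravel view this would be $game->appID.
$gameID = 570;
$url = "https://store.steampowered.com/api/appdetails?appids=" . $gameID;
$jsondata = file_get_contents($url);
$parsed = json_decode($jsondata, true);

// Check the success flag before indexing into ['data'].
if (isset($parsed[$gameID]['success']) && $parsed[$gameID]['success'] === true) {
    echo $parsed[$gameID]['data']['about_the_game'];
} else {
    echo "No details available for app $gameID";
}
```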

Nesting simple-html-dom file_get_html($url)

I am attempting, unsuccessfully, to nest the use of file_get_html($url) from the simple-html-dom script.
Essentially I am requesting a page that has articles on it. These articles are being looped through successfully, and I can display the contents of these articles fine; however, the image is only visible on the individual article page (once clicked through to that specific article).
Is this possible, and why is this methodology not working? Maybe I need to specify a new file_get_html():
<?php
$simplehtmldom = get_template_directory() . "/simplehtmldom/simple_html_dom.php";
include_once($simplehtmldom);
// INIT FILE_GET_HTML
$articleshtml = file_get_html($articles_url);
$articlesdata = "";
// FIND ARTICLES
foreach($articleshtml->find('.articles') as $articlesdata) :
    $articlecount = 1;
    // FIND INDIVIDUAL ARTICLE
    foreach($articlesdata->find('.article') as $article) :
        // FIND LINK TO PAGE WITH IMAGE FOR ARTICLE
        foreach($article->find('a') as $articleimagelink) :
            // LINK TO HREF
            $articleimage_url = $articleimagelink->href;
            // NESTED FILE_GET_HTML TO GO TOWARDS IMAGE PAGE
            $articleimagehtml = file_get_html($articleimage_url);
            $articleimagedata = "";
            foreach($articleimagehtml->find('.media img') as $articleimagedata) :
                // MAKE USE OF IMAGE I WORKED EXTRA HARD TO FIND
            endforeach;
        endforeach;
    endforeach;
endforeach; ?>
My question is about the possibility of making nested file_get_html() requests, so I can search for the image for a specific article on a separate page, then return to the previous file_get_html loop and move on to the next article.
I believe if it were at all possible it would need me to set up something like:
$something = new simple_html_dom();
or
$something = file_get_html($url);
What can I try next?
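For reference, nested file_get_html() calls do work without any new wrapper object. A minimal sketch under the question's own selectors (.article, .media img) and variables ($articles_url), with clear() added because simple_html_dom holds circular references that can exhaust memory when many documents are loaded in a loop:

```php
<?php
include_once(get_template_directory() . "/simplehtmldom/simple_html_dom.php");

$articleshtml = file_get_html($articles_url);
foreach ($articleshtml->find('.article a') as $articleimagelink) {
    // Each article link leads to the page that actually contains the image.
    $articleimagehtml = file_get_html($articleimagelink->href);
    foreach ($articleimagehtml->find('.media img') as $img) {
        echo $img->src . "<br />";
    }
    // Free the inner document before the next iteration;
    // simple_html_dom leaks memory otherwise.
    $articleimagehtml->clear();
    unset($articleimagehtml);
}
```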

How to display image url from website sub pages using php code

I am using the PHP code below to display images from webpages. The code is able to display image URLs from the main page, but unable to display image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');
$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('img') as $img)
{
echo $img->src."<br />";
echo $img."<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course that script won't show any of those since you're not loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code
$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];
while( ! empty($todo))
$link = array_shift($todo);
$done[] = $link;
$html = get html;
$images += find <img> tags
$newLinks = find <a> tags
remove all external links and all links already in $done from $newLinks
$todo += $newLinks;
Or something like that...
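Fleshed out, that pseudocode might look like the following sketch (it assumes simple_html_dom.php is available; relative URLs would need resolving against the base URL before the host check, which is omitted here):

```php
<?php
include_once('simple_html_dom.php');

$todo   = ['http://fffmovieposters.com/'];
$done   = [];
$images = [];
$host   = parse_url($todo[0], PHP_URL_HOST);

while (!empty($todo)) {
    $link = array_shift($todo);
    if (in_array($link, $done)) {
        continue;            // already processed this page
    }
    $done[] = $link;

    $html = new simple_html_dom();
    if (!$html->load_file($link)) {
        continue;            // skip pages that fail to load
    }

    foreach ($html->find('img') as $img) {
        $images[] = $img->src;
    }
    foreach ($html->find('a') as $a) {
        $href = $a->href;
        // Queue only same-host links we have not seen or queued yet.
        if (parse_url($href, PHP_URL_HOST) === $host
                && !in_array($href, $done)
                && !in_array($href, $todo)) {
            $todo[] = $href;
        }
    }
    $html->clear();          // free the document before the next load
}

print_r($images);
```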

simple-html-dom loop not ending

I am trying to capture all the links and then go to the next page, until the end of the pages.
I just keep getting a loop. I think I am just glazed over, and was hoping that once again I can get some help today.
getLinks('http://www.homedepot.com/h_d1/N-5yc1vZaqns/h_d2/Navigation?catalogId=10053&langId=-1&storeId=10051&catStyle=ShowProducts#/?c=1&style=List');
function getLinks($URL) {
$html = file_get_contents($URL);
$dom = new simple_html_dom();
$dom -> load($html);
foreach ($dom->find('a[class=item_description]') as $href){
$url = $href->href;
echo $url;
}
if ($nextPage = $dom->find("a[class='paginationNumberStyle page_arrows']", 0)){
$nextPageURL = 'http://www.homedepot.com'.$nextPage->getAttribute('data-url');
$dom -> clear();
unset($dom);
getLinks($nextPageURL);
} else {
echo "\nEND";
$dom -> clear();
unset($dom);
}
}
In your code, you never keep track of where you've been.
Let's say you start on page A:
The first link on page A links to page B.
You open up page B and start crawling the links.
The first link on page B links to page A.
You open up page A and start crawling the links ....
This process will repeat indefinitely, because you'll end up crawling the same pages over and over. You need to keep a list of pages you've crawled and skip out if you've already crawled that page.
Also note that it may not be a simple loop like that.
A links to B
B links to C
C links to D
....
S links to T
T links to A
Not overly familiar with PHP, but something like:
if (array_key_exists($url, $arr)) {
    // we have already seen this url, so skip it
} else {
    $arr[$url] = true; // tell it that we know the url
}
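Applied to the getLinks() function from the question, the visited check could look like this sketch (same selectors and URL scheme as the question; passing the visited set by reference keeps it shared across the recursive calls):

```php
<?php
include_once('simple_html_dom.php');

function getLinks($URL, array &$visited = []) {
    if (array_key_exists($URL, $visited)) {
        return;                          // already crawled this page
    }
    $visited[$URL] = true;

    $dom = new simple_html_dom();
    $dom->load(file_get_contents($URL));

    foreach ($dom->find('a[class=item_description]') as $href) {
        echo $href->href . "\n";
    }

    if ($nextPage = $dom->find("a[class='paginationNumberStyle page_arrows']", 0)) {
        $nextPageURL = 'http://www.homedepot.com' . $nextPage->getAttribute('data-url');
        $dom->clear();
        getLinks($nextPageURL, $visited);
    } else {
        echo "\nEND";
        $dom->clear();
    }
}
```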
The problem is you're following previous arrows as well as next arrows. Your css selector needs to be adjusted to account for this.
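For example, if the "next" arrow can be told apart from the "previous" one by an attribute, the loop can select it specifically rather than taking the first arrow found (the title check below is a guess; inspect the real markup to find the actual distinguishing attribute):

```php
<?php
// Hypothetical: pick only the arrow whose title mentions "next".
$nextPage = null;
foreach ($dom->find("a[class='paginationNumberStyle page_arrows']") as $arrow) {
    if (stripos($arrow->title, 'next') !== false) {  // attribute name is an assumption
        $nextPage = $arrow;
        break;
    }
}
```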

Trying to scrape the entire content of a div

I have this project I'm working on, and I'd like to add a really small list of nearby places using Facebook's Places in an iframe, featured from touch.facebook.com. I can easily just use touch.facebook.com/#/places_friends.php, but then that loads the headers and the other navigation bars for things like messages, events etc., and I just want the content.
I'm pretty sure from looking at the touch.facebook.com/#/places_friends.php source that all I need to load is the div "content". Anyway, I'm extremely new to PHP, and I'm pretty sure what I'm trying to do is called web scraping.
For the sake of figuring things out on Stack Overflow, and not needing to worry about authentication or anything yet, I want to load the login page to see if I can at least get the scraper to work. Once I have working scraping code I'm pretty sure I can handle the rest. It has to load everything inside the div. I've seen this done before, so I know it is possible, and it will look exactly like what you see when you try to log in at touch.facebook.com, but without the blue Facebook logo up top. That's what I'm trying to accomplish here.
So here's the login page. I'm trying to load the div which contains the text boxes and the actual login button. If it's done correctly, we should just see those, with no blue Facebook header bar above them.
I've tried
<?php
$page = file_get_contents('http://touch.facebook.com/login.php');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
if ($div->getAttribute('id') === 'login_form') {
echo $div->nodeValue;
}
}
?>
all that does is load a blank page.
I've also tried using http://simplehtmldom.sourceforge.net/
and I modified the basic selector example to
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://touch.facebook.com/login.php');
foreach($html->find('div#login_form') as $e)
echo $e->innertext;
?>
I've also tried
<?php
$stream = "http://touch.facebook.com/login.php";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[@id='login_form']");
for ($i = 0; $i < count($result); $i++) {
    echo $result[$i];
}
?>
that did not work either
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[@id='content']");
for ($i = 0; $i < count($result); $i++) {
    echo $result[$i];
}
There was a syntax error in this line; I removed it, so now you can just copy, paste and run this code.
I'm assuming that you can't use the Facebook API; if you can, then I strongly suggest you use it, because you will save yourself from the whole scraping deal.
To scrape text, the best technique is using XPath. If the HTML returned by touch.facebook.com is XHTML Transitional, which it should be, then you should use XPath; a sample should look like this:
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[@id='content']");
for ($i = 0; $i < count($result); $i++) {
    echo $result[$i];
}
You need to learn about your comparison operators
=== is for comparing strictly, you should be using ==
if ($div->getAttribute('id') == 'login_form')
{
}
Scraping isn't always the best idea for capturing data elsewhere. I would suggest using Facebook's API to retrieve the values you need. Scraping will break any time Facebook decides to change their markup.
http://developers.facebook.com/docs/api
http://github.com/facebook/php-sdk/
