simple-html-dom loop not ending - php

I am trying to capture all the links on a page and then go to the next page, until I reach the last page.
I just keep getting an infinite loop. I think I have been staring at this for too long, and I was hoping I could get some help once again today.
getLinks('http://www.homedepot.com/h_d1/N-5yc1vZaqns/h_d2/Navigation?catalogId=10053&langId=-1&storeId=10051&catStyle=ShowProducts#/?c=1&style=List');
function getLinks($URL) {
$html = file_get_contents($URL);
$dom = new simple_html_dom();
$dom -> load($html);
foreach ($dom->find('a[class=item_description]') as $href){
$url = $href->href;
echo $url;
}
if ($nextPage = $dom->find("a[class='paginationNumberStyle page_arrows']", 0)){
$nextPageURL = 'http://www.homedepot.com'.$nextPage->getAttribute('data-url');
$dom -> clear();
unset($dom);
getLinks($nextPageURL);
} else {
echo "\nEND";
$dom -> clear();
unset($dom);
}
}

In your code, you never keep track of where you've been.
Let's say you start on page A:
The first link on page A links to page B.
You open up page B and start crawling the links.
The first link on page B links to page A.
You open up page A and start crawling the links ....
This process will repeat indefinitely, because you'll end up crawling the same pages over and over. You need to keep a list of pages you've crawled and skip out if you've already crawled that page.
Also note that it may not be a simple loop like that.
A links to B
B links to C
C links to D
....
S links to T
T links to A
I'm not overly familiar with PHP, but something like:
if (array_key_exists($url, $arr)) {
// the URL is already in the hash, so we have crawled it: skip it
} else {
$arr[$url] = true; // remember that we have now seen this URL
}
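As a minimal sketch of the visited-list idea, the loop below walks a chain of "next" links and stops as soon as it would revisit a page. The `$nextLinkOf` array and the `crawlPages()` function are hypothetical stand-ins for fetching a page and reading its pagination link; they are not part of simple-html-dom.

```php
<?php
// Hypothetical stand-in for "fetch the page and read its next link":
// each key is a page URL, each value is the URL its arrow points to.
$nextLinkOf = [
    '/page1' => '/page2',
    '/page2' => '/page3',
    '/page3' => '/page1', // points back to the start, forming a cycle
];

function crawlPages(string $start, array $nextLinkOf): array
{
    $visited = [];   // URL => true for every page already crawled
    $url = $start;
    while ($url !== null && !isset($visited[$url])) {
        $visited[$url] = true;              // mark before following any link
        $url = $nextLinkOf[$url] ?? null;   // null when the page has no next link
    }
    return array_keys($visited);            // pages in the order they were crawled
}

$order = crawlPages('/page1', $nextLinkOf);
echo implode("\n", $order), "\nEND\n";
```

Using the URL as an array key keeps the "have I been here?" check cheap, which matters more as the number of pages grows.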

The problem is you're following previous arrows as well as next arrows. Your css selector needs to be adjusted to account for this.
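One way to avoid following the "previous" arrow, sketched under assumptions: if both arrows share the same class, pick the first arrow whose target has not been crawled yet. The markup below is hypothetical (the real page may differ), `nextUnvisitedArrow()` is an illustrative helper, and PHP's built-in DOMDocument is used instead of simple-html-dom so the example is self-contained.

```php
<?php
// Returns the data-url of the first pagination arrow that has not been visited,
// or null when every arrow points to a page that was already crawled.
function nextUnvisitedArrow(string $html, array $visited): ?string
{
    libxml_use_internal_errors(true);   // tolerate non-strict, real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[contains(@class, "page_arrows")]') as $arrow) {
        $target = $arrow->getAttribute('data-url');
        if ($target !== '' && !isset($visited[$target])) {
            return $target;
        }
    }
    return null;
}

// Hypothetical markup for a middle page: a "previous" arrow, then a "next" arrow.
$page2 = '<div>'
    . '<a class="paginationNumberStyle page_arrows" data-url="/list?page=1">&lt;</a>'
    . '<a class="paginationNumberStyle page_arrows" data-url="/list?page=3">&gt;</a>'
    . '</div>';

$visited = ['/list?page=1' => true, '/list?page=2' => true];
$next = nextUnvisitedArrow($page2, $visited);
echo $next, "\n";
```

On the last page only the "previous" arrow remains, its target is already in `$visited`, and the function returns null, which ends the crawl.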

Related

Nesting simple-html-dom file_get_html($url)

I am attempting unsuccessfully to nest the use of file_get_html($url) from the simple-html-dom script.
Essentially, I am requesting a page that lists articles. I can loop through these articles and display their contents fine; however, the image is only available on the individual article page (once you click through to that specific article).
Is this possible, and why is this approach not working? Maybe I need to create a new file_get_html() instance:
<?php
$simplehtmldom = get_template_directory() . "/simplehtmldom/simple_html_dom.php";
include_once($simplehtmldom);
// INIT FILE_GET_HTML
$articleshtml = file_get_html($articles_url);
$articlesdata = "";
// FIND ARTICLES
foreach($articleshtml->find('.articles') as $articlesdata) :
$articlecount = 1;
// FIND INDIVIDUAL ARTICLE
foreach($articlesdata->find('.article') as $article) :
// FIND LINK TO PAGE WITH IMAGE FOR ARTICLE
foreach($article->find('a') as $articleimagelink) :
// LINK TO HREF
$articleimage_url = $articleimagelink->href;
// NESTED FILE_GET_HTML TO GO TOWARDS IMAGE PAGE
$articleimagehtml = file_get_html($articleimage_url);
$articleimagedata = "";
foreach($articleimagehtml->find('.media img') as $articleimagedata) :
// MAKE USE OF IMAGE I WORKED EXTRA HARD TO FIND
endforeach;
endforeach;
endforeach;
endforeach; ?>
My question is whether file_get_html() calls can be nested, so that I can look up the image for a specific article on a separate page, then return to the outer file_get_html loop and move on to the next article.
I believe that, if it is possible at all, I would need to set up something like:
something = new simplehtmldom;
or
something = new file_get_html(url);
What can I try next?

Simple html dom always loading the default first page and not the specified url

I want to scrape a few web pages. I am using PHP and the simple html dom parser.
For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the URL:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next page link, here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Just the page value is changed from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
//Getting the next page links
$aNext = $_htmlTemp->find('a.next', 0);
$nextLink = $aNext->href;
return $nextLink;
}
The above method returns the correct link with page value being 6.
Now when I try to load this next link, it fetches the first default page with page query absent from the url.
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
$originalHtml->load_file($nxtLink); //This line fetches default page
}
The whole flow is something like this:
$html->load_file($url);
//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
$listings = $originalHtml->find('div.searchResult');
foreach($listings as $item)
{
//Some logic here
}
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>';
$originalHtml->load_file($nxtLink);
}
else
{
//No next link -- stop the loop as we have covered all the pages
$shouldLoop = false;
}
} while($shouldLoop);
I have tried encoding the whole URL, and encoding only the query parameters, but I get the same result. I also tried creating new instances of simple_html_dom and then loading the file, with no luck. Please help.
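You need to run html_entity_decode on those links. I can see that they come back from simple-html-dom mangled, most likely with & still encoded as &amp;, so the page parameter is lost when you request the URL.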
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));
while($a = $html->find('a.next', 0)){
$url = html_entity_decode($a->href);
echo $url . "\n";
$html = str_get_html(file_get_contents($url));
}
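To see why the decoding step matters, here is a small self-contained sketch. The href is a made-up example.com URL, assumed to look the way a link appears in page source with `&` written as `&amp;`; it is not taken from the real site.

```php
<?php
// Hypothetical href as it might appear in the page source, with "&" encoded as "&amp;".
$rawHref = 'https://example.com/dealers?channel=motorhomes&amp;page=6';
$decodedHref = html_entity_decode($rawHref);

// Parse both query strings to compare the parameters each URL actually carries.
parse_str(parse_url($rawHref, PHP_URL_QUERY), $rawParams);
parse_str(parse_url($decodedHref, PHP_URL_QUERY), $decodedParams);

var_dump(isset($rawParams['page']));      // the raw href has no parameter named "page"
var_dump(isset($decodedParams['page']));  // the decoded href does
echo $decodedHref, "\n";
```

Without decoding, the parameter name is no longer `page`, which would explain a server falling back to its default first page.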

scraping images from url using php

I am trying to make a page that lets me grab and save images from another link. Here is what I want on my page:
A text box (to enter the URL I want to get images from).
A save dialog box to specify the path to save the images to.
However, I want to save images only from that URL, and only from inside a specific element.
For example, in my code I say: go to example.com and, from inside the element with class="images", grab all images.
Notes: not all images from the page, just the ones inside that element.
Whether the element has 3 images in it or 50 or 100, I don't care.
Here's what I tried in PHP, and it worked:
<?php
$html = file_get_contents('http://www.tgo-tv.net');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
echo $matches[ 1 ][ 0 ];
?>
This gets the image name and path, but what I am trying to build is a save dialog box; the code must save the image directly to that path instead of echoing it.
I hope that makes sense.
Edit 2
It's OK not to have a save dialog box. I can specify the save path in the code.
If you want something generic, you can use:
<?php
$the_site = "http://somesite.com";
$the_tag = "div"; // any tag: div, span, a, etc.
$the_class = "images";
$html = file_get_contents($the_site);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//'.$the_tag.'[contains(@class,"'.$the_class.'")]/img') as $item) {
$img_src = $item->getAttribute('src');
print $img_src."\n";
}
Usage:
Change the site, the tag (which can be div, span, a, etc.), and the class name.
For example, change the values to:
$the_site = "https://stackoverflow.com/questions/23674744/what-is-the-equivalent-of-python-any-and-all-functions-in-javascript";
$the_tag = "div";
$the_class = "gravatar-wrapper-32";
Output:
https://www.gravatar.com/avatar/67d8ca039ee1ffd5c6db0d29aeb4b168?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24da669dda96b6f17a802bdb7f6d429f?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=32&d=identicon&r=PG
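The question also asks for the images to be saved to a path set in the code. The following is a rough sketch of that step, run against an inline HTML string so it works without network access. `imageSources()` and `localPathFor()` are hypothetical helper names, the `/tmp/images` directory is an assumption, and the query uses `//img` so that images nested deeper inside the element are matched too.

```php
<?php
// Collects the src of every <img> nested anywhere inside elements
// of the given tag whose class attribute contains the given class name.
function imageSources(string $html, string $tag, string $class): array
{
    libxml_use_internal_errors(true);   // tolerate non-strict HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//' . $tag . '[contains(@class,"' . $class . '")]//img';
    $sources = [];
    foreach ($xpath->query($query) as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Builds the local path an image would be saved to.
// Assumed naming rule: keep the file name from the URL path.
function localPathFor(string $src, string $saveDir): string
{
    $name = basename(parse_url($src, PHP_URL_PATH));
    return rtrim($saveDir, '/') . '/' . $name;
}

$html = '<div class="images"><p><img src="http://example.com/a/cat.png"></p>'
      . '<img src="http://example.com/dog.jpg"></div>'
      . '<img src="http://example.com/outside.gif">';

$sources = imageSources($html, 'div', 'images');
foreach ($sources as $src) {
    $path = localPathFor($src, '/tmp/images');
    echo $src, ' -> ', $path, "\n";
    // To actually download: file_put_contents($path, file_get_contents($src));
}
```

The download line is left commented out so the sketch has no side effects; relative `src` values would also need to be resolved against the page URL before fetching.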
Maybe you should try the HTML DOM Parser for PHP. I found this tool recently and, to be honest, it works pretty well. It has jQuery-like selectors, as you can see on the site. I suggest you take a look and try something like:
<?php
require_once("./simple_html_dom.php");
$html = file_get_html("http://example.com"); // Placeholder URL: load the page you want to search
foreach ($html->find("div") as $container) // Start from the root (<html></html>) and find the parent tag you want to search in (e.g. "div" to search in all divs)
{
foreach ($container->find("img") as $img) // Search for img tags inside each element you found
{
echo $img->src . "<br>"; // Output the img's src attribute (for <img src="www.example.com/cat.png"> you get www.example.com/cat.png)
}
}
?>
I hope this helps.

PHP redirect but in parent frame

I have a small-time URL shortener at http://thetpg.tk that uses a simple PHP script and MySQL.
It takes the id, looks it up in the SQL database, and redirects to the link stored there using header().
But if I have a frameset whose source is something like http://thetpg.tk, the redirected link is loaded inside the frame instead of in the parent window.
For example, look at the page source of
(1) http://thetpgmusic.tk, which has the frame source
(2) http://thetpg.tk/b, which in turn redirects to
(3) http://thepirategamer.tk/music.php .
I want (1) to load (3) as the parent, but just by making changes in the functions in (2) .
So is there a function like
header(Location:http://thepirategamer.tk/music.php, '_parent');
in php, or is there any other way to implement it?
NOTE: I can't change anything in (2).
Thanks in advance ! :)
There are three solutions that can help you do this:
First solution:
This solution involves PHP if you use echo to generate your HTML. Whenever you output an a tag, make sure to add the attribute target='_parent' (the href below is a placeholder):
<?php
echo '<a href="http://example.com/" target="_parent"> Click here </a>';
?>
Problem:
This solution doesn't work if you need to redirect in the parent window from a page that you don't own (inside the iframe). The second solution solves this problem.
Second solution:
This second solution is entirely client-side, which means you need some JavaScript. Define a JavaScript function that adds target='_parent' to every a tag:
function init ()
{
TagNames = document.getElementById('iframe').contentWindow.document.getElementsByTagName('a');
for( var x=0; x < TagNames.length; x++ )
TagNames[x].onclick = function()
{
this.setAttribute('target','_parent');
}
};
Now all you need to do is call this function when the body is loaded, like this:
<body onload="init();"> ... </body>
Problem:
If you have a link that is just an anchor, like href="#", it will change the parent window to the child window. To solve this problem, use the third solution.
Third solution:
This solution is also client-side and uses JavaScript. It is like the second solution, except that you test whether the link is a URL to an external page or an anchor before you redirect. So you need to define a function that returns true for a link to an external page and false for a simple anchor, and then use it like this:
function init ()
{
TagNames = document.getElementById('iframe').contentWindow.document.getElementsByTagName('a');
for( var x=0; x < TagNames.length; x++ )
TagNames[x].onclick = function()
{
if ( is_external_url( this.href ) )
document.location = this.href;
}
};
and you also need to call this function when the body is loaded
<body onload="init();"> ... </body>
Don't forget to define is_external_url().
Update:
Here is a solution to get the URL of the innermost child. It is a simple function that looks for framesets and iframes inside the pages and follows their URLs:
function get_last_url($url)
{
$code = file_get_contents($url);
$start = strpos($code, '<frameset');
$end = strpos($code, '</frameset>');
if($start===false||$end===false)
{
$start = strpos($code, '<iframe');
$end = strpos($code, '</iframe>');
if($start===false||$end===false)
return $url;
}
$sub = substr($code, $start,$end-$start);
$sub = substr($sub, strpos($sub,'src="')+5);
$url = explode('"', $sub)[0];
return get_last_url($url); // recurse until a page with no frameset or iframe is reached
}
$url = get_last_url("http://thetpgmusic.tk/");
header('Location: ' . $url);
exit();
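As an alternative sketch of the same idea, the frame source can be read with PHP's built-in DOMDocument instead of strpos/substr, which also copes with single-quoted attributes. This is a swapped-in technique, not the code above: `firstFrameSrc()` and `resolveFinalUrl()` are hypothetical names, the fetcher is passed in as a callable so the example runs on in-memory pages, and the depth limit is an added safeguard against frames that point at each other.

```php
<?php
// Returns the src of the first <frame> or <iframe> in the HTML, or null if there is none.
function firstFrameSrc(string $html): ?string
{
    libxml_use_internal_errors(true);   // tolerate non-strict HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach (['frame', 'iframe'] as $tag) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $src = $node->getAttribute('src');
            if ($src !== '') {
                return $src;
            }
        }
    }
    return null;
}

// Follows frame sources until a page without frames is reached (or the depth limit hits).
function resolveFinalUrl(string $url, callable $fetch, int $maxDepth = 5): string
{
    for ($i = 0; $i < $maxDepth; $i++) {
        $src = firstFrameSrc($fetch($url));
        if ($src === null) {
            break;
        }
        $url = $src;
    }
    return $url;
}

// Hypothetical in-memory pages standing in for file_get_contents($url).
$pages = [
    'http://outer.test/' => "<html><body><iframe src='http://inner.test/'></iframe></body></html>",
    'http://inner.test/' => '<html><body><p>Final page</p></body></html>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url];
};

$final = resolveFinalUrl('http://outer.test/', $fetch);
echo $final, "\n";
```

In real use, `$fetch` could simply wrap file_get_contents, and the result would be passed to header('Location: ...') as in the answer above.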

How to create a sitemap with page relationships

I'm currently trying to figure out a way to write a script (preferably PHP) that would crawl through a site and create a sitemap. In addition to the traditional standard listing of pages, I'd like the script to keep track of which pages link to other pages.
Example pages
A
B
C
D
I'd like the output to give me something like the following.
Page Name: A
Pages linking to Page A:
B
C
D
Page Name: B
Pages linking to Page B:
A
C
etc...
I've come across multiple standard sitemap scripts, but nothing that really accomplishes what I am looking for.
EDIT
Seems I didn't give enough info. Sorry about my lack of clarity there. Here is the code I currently have. I've used simple_html_dom.php to take care of the tasks of parsing and searching through the html for me.
<?php
include("simple_html_dom.php");
$url = 'page_url';
$html = new simple_html_dom();
$html->load_file($url);
$linkmap = array();
foreach($html->find('a') as $link):
if(contains("cms/education",$link)):
if(!in_array($link, $linkmap)):
$linkmap[$link->href] = array();
endif;
endif;
endforeach;
?>
Note: My little foreach loop just filters based on a specific substring in the url.
So, I have the necessary first level pages. Where I am stuck is in creating a loop that will not run indefinitely, while keeping track of the pages you have already visited.
Basically, you need two arrays to control the flow here. The first will keep track of the pages you need to look at and the second will track the pages you have already looked at. Then you just run your existing code on each page until there are none left:
<?php
include("simple_html_dom.php");
$urlsToCheck = array();
$urlsToCheck[] = 'page_url';
$urlsChecked = array();
$linkmap = array(); // built up across all pages, so create it once outside the loop
while(count($urlsToCheck) > 0)
{
$url = array_pop($urlsToCheck);
if (!in_array($url, $urlsChecked))
{
$urlsChecked[] = $url;
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find('a') as $link):
$href = $link->href; // compare URL strings, not element objects
if(strpos($href, "cms/education") !== false): // same substring filter as in the question
if((!in_array($href, $urlsToCheck)) && (!in_array($href, $urlsChecked))):
$urlsToCheck[] = $href;
endif;
if(!array_key_exists($href, $linkmap)):
$linkmap[$href] = array();
endif;
endif;
endforeach;
$html->clear(); // free the parser's memory before loading the next page
}
}
?>
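To also record which pages link to each page, as the question asks, the same two-list loop can fill an inbound-link map while it crawls. This is only a sketch: the `$site` array is a hypothetical in-memory stand-in for loading a page and collecting its hrefs, and `buildInboundMap()` is an illustrative name.

```php
<?php
// Hypothetical in-memory site: page => list of pages it links to.
$site = [
    'A' => ['B', 'C'],
    'B' => ['A'],
    'C' => ['A', 'B', 'D'],
    'D' => ['A'],
];

// Crawls from $start and returns: page => sorted list of pages that link to it.
function buildInboundMap(string $start, callable $linksOf): array
{
    $toCheck = [$start];   // pages still to look at
    $checked = [];         // page => true once crawled
    $inbound = [];         // target => [source => true]

    while ($toCheck) {
        $page = array_pop($toCheck);
        if (isset($checked[$page])) {
            continue;      // already crawled, so skip it
        }
        $checked[$page] = true;
        $inbound[$page] = $inbound[$page] ?? [];
        foreach ($linksOf($page) as $target) {
            $inbound[$target][$page] = true;   // record "page links to target"
            if (!isset($checked[$target])) {
                $toCheck[] = $target;
            }
        }
    }

    // Turn each set of sources into a sorted list for stable output.
    ksort($inbound);
    foreach ($inbound as $target => $sources) {
        $list = array_keys($sources);
        sort($list);
        $inbound[$target] = $list;
    }
    return $inbound;
}

$map = buildInboundMap('A', function (string $page) use ($site): array {
    return $site[$page] ?? [];
});

foreach ($map as $page => $sources) {
    echo "Page Name: $page\nPages linking to Page $page:\n";
    echo implode("\n", $sources), "\n";
}
```

With simple_html_dom, the callable would load the page and return the filtered `href` values instead of reading from `$site`.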
