Nesting simple-html-dom file_get_html($url) - php

I am attempting unsuccessfully to nest the use of file_get_html($url) from the simple-html-dom script.
Essentially I am requesting a page that contains articles. These articles are looped through successfully and I can display their contents fine; however, the image is only visible on the individual article page (once you click through to that specific article).
Is this possible, and why is this approach not working? Maybe I need to specify a new file_get_html():
<?php
$simplehtmldom = get_template_directory() . "/simplehtmldom/simple_html_dom.php";
include_once($simplehtmldom);

// INIT FILE_GET_HTML
$articleshtml = file_get_html($articles_url);
$articlesdata = "";

// FIND ARTICLES
foreach ($articleshtml->find('.articles') as $articlesdata) :
    $articlecount = 1;
    // FIND INDIVIDUAL ARTICLE
    foreach ($articlesdata->find('.article') as $article) :
        // FIND LINK TO PAGE WITH IMAGE FOR ARTICLE
        foreach ($article->find('a') as $articleimagelink) :
            // LINK TO HREF
            $articleimage_url = $articleimagelink->href;
            // NESTED FILE_GET_HTML TO GO TOWARDS IMAGE PAGE
            $articleimagehtml = file_get_html($articleimage_url);
            $articleimagedata = "";
            foreach ($articleimagehtml->find('.media img') as $articleimagedata) :
                // MAKE USE OF IMAGE I WORKED EXTRA HARD TO FIND
            endforeach;
        endforeach;
    endforeach;
endforeach; ?>
My question is about whether it is possible to make nested requests with file_get_html() so that I can search for the image for a specific article on a separate page, then return to the previous file_get_html() loop and move on to the next article.
I believe that, if it is possible at all, it would need me to set up something like:
$something = new simple_html_dom();
or
$something = new file_get_html($url);
What can I try next?
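For reference, nested calls of this kind do generally work with simple-html-dom; the sketch below (not the original theme code; selectors and variable names are assumed from the question) assigns the inner document to its own variable and calls ->clear() on it before moving to the next article, since these objects can hold a lot of memory:
<?php
// Minimal illustration: one outer document for the listing page,
// a separate inner document per article page.
include_once(get_template_directory() . "/simplehtmldom/simple_html_dom.php");

$articleshtml = file_get_html($articles_url);

foreach ($articleshtml->find('.article a') as $articleimagelink) {
    // Nested request for the page that actually holds the image
    $articleimagehtml = file_get_html($articleimagelink->href);
    if ($articleimagehtml) {
        foreach ($articleimagehtml->find('.media img') as $img) {
            echo $img->src . "<br />";
        }
        $articleimagehtml->clear(); // free memory before the next article
        unset($articleimagehtml);
    }
}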

Related

Simple html dom always loading the default first page and not the specified url

I want to scrape a few web pages. I am using PHP and the simple html dom parser.
For instance trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the URL:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next-page link; here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Only the page value has changed, from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
    //Getting the next page link
    $aNext = $_htmlTemp->find('a.next', 0);
    $nextLink = $aNext->href;
    return $nextLink;
}
The above method returns the correct link, with the page value being 6.
Now when I try to load this next link, it fetches the first (default) page, with the page query absent from the URL.
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if (!empty($nxtLink))
{
    //Yay, we have the next link -- load the next link
    print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
    $originalHtml->load_file($nxtLink); //This line fetches default page
}
The whole flow is something like this:
$html->load_file($url);

//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;

//Main Array
$value = array();

do {
    $listings = $originalHtml->find('div.searchResult');
    foreach ($listings as $item)
    {
        //Some logic here
    }

    //After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml); //Returns string url
    if (!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }
} while ($shouldLoop);
I have tried encoding the whole URL, and only the query parameters, but I get the same result. I also tried creating new instances of simple_html_dom and then loading the file, with no luck. Please help.
You need to html_entity_decode() those links; I can see that they are getting mangled by simple-html-dom. Most likely the &amp; in the href is left encoded, so the page parameter never reaches the server and you get the default page.
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));

while ($a = $html->find('a.next', 0)) {
    $url = html_entity_decode($a->href);
    echo $url . "\n";
    $html = str_get_html(file_get_contents($url));
}
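For completeness, the same fix could also be applied inside the getNextLink() helper from the question (a sketch, not tested against the live site):
function getNextLink($_htmlTemp)
{
    //Getting the next page link
    $aNext = $_htmlTemp->find('a.next', 0);
    if (!$aNext) {
        return ''; // no next link -- we are on the last page
    }
    // Decode &amp; etc. so the page query parameter is not lost
    return html_entity_decode($aNext->href);
}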

scraping images from url using php

I am trying to make a page that allows me to grab and save images from another link, so here's what I want to add on my page:
a text box (to enter the URL that I want to get images from);
a save dialog box to specify the path to save the images.
What I am trying to do is save images only from that URL, and only from inside a specific element.
For example, in my code I say: go to example.com and, from inside the element with class="images", grab all images.
Note: not all images from the page, just the ones inside the element;
whether the element has 3 images in it or 50 or 100, I don't care.
Here's what I tried, and it worked, using PHP:
<?php
$html = file_get_contents('http://www.tgo-tv.net');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
echo $matches[ 1 ][ 0 ];
?>
This gets the image name and path, but what I am trying to make is a save dialog box, and the code must save the image directly to that path instead of echoing it out.
I hope you understand.
Edit 2
It's OK not to have a save dialog box; I just need to specify the save path in the code.
If you want something generic, you can use:
<?php
$the_site  = "http://somesite.com";
$the_tag   = "div";
$the_class = "images";

$html = file_get_contents($the_site);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//'.$the_tag.'[contains(@class,"'.$the_class.'")]/img') as $item) {
    $img_src = $item->getAttribute('src');
    print $img_src."\n";
}
Usage:
Change the site and the tag, which can be a div, span, a, etc.; also change the class name.
For example, change the values to:
$the_site = "https://stackoverflow.com/questions/23674744/what-is-the-equivalent-of-python-any-and-all-functions-in-javascript";
$the_tag = "div"; #
$the_class = "gravatar-wrapper-32";
Output:
https://www.gravatar.com/avatar/67d8ca039ee1ffd5c6db0d29aeb4b168?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24da669dda96b6f17a802bdb7f6d429f?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=32&d=identicon&r=PG
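To actually save the matched images rather than just print their URLs, the loop above could be extended roughly as follows (a sketch; it assumes the src values are absolute URLs and that a writable ./images directory exists):
$save_dir = __DIR__ . '/images'; // assumed, writable folder to save into
foreach ($xpath->query('//'.$the_tag.'[contains(@class,"'.$the_class.'")]/img') as $item) {
    $img_src = $item->getAttribute('src');
    $path = parse_url($img_src, PHP_URL_PATH);  // strip query string etc.
    $file_name = $path ? basename($path) : '';  // e.g. cat.png
    if ($file_name !== '') {
        // Download the image and write it into the save path
        file_put_contents($save_dir . '/' . $file_name, file_get_contents($img_src));
    }
}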
Maybe you should try HTML DOM Parser for PHP. I've found this tool recently and, to be honest, it works pretty well. It has jQuery-like selectors, as you can see on the site. I suggest you take a look and try something like:
<?php
require_once("./simple_html_dom.php");

$html = file_get_html("http://www.example.com"); // load the page you want to search

// Start from the root (<html></html>) and find the parent tag you want to
// search in (e.g. "div" if you want to search in all divs)
foreach ($html->find("div") as $div)
{
    // Search for img tags inside each div we found
    foreach ($div->find("img") as $img)
    {
        // Output the img's src attribute; if the found tag is
        // <img src="www.example.com/cat.png"> you will get www.example.com/cat.png
        echo $img->src . "<br>";
    }
}
?>
I hope I helped, more or less.
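Since the question only wants images inside a specific element, simple_html_dom's combined selectors can narrow the search to that element in one call; for instance (class name assumed):
foreach ($html->find("div.images img") as $img) { // only <img> tags inside <div class="images">
    echo $img->src . "<br>";
}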

How to display image url from website sub pages using php code

I am using the below-mentioned PHP code to display images from webpages. The code is able to display image URLs from the main page, but it is unable to display image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');
$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach ($html->find('img') as $img)
{
    echo $img->src."<br />";
    echo $img."<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course that script won't show any of those since you're not loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code
$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];

while ( ! empty($todo))
    $link = array_shift($todo);
    $done[] = $link;
    $html = get html;
    $images += find <img> tags
    $newLinks = find <a> tags
    remove all external links and all links already in $done from $newLinks
    $todo += $newLinks;
Or something like that...
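A more concrete sketch of that pseudocode, assuming simple_html_dom is available and restricting the crawl to the same host (the host check, selectors, and variable names are assumptions, not part of the original answer):
<?php
include_once('simple_html_dom.php');

$host   = 'fffmovieposters.com';
$todo   = ['http://fffmovieposters.com/'];
$done   = [];
$images = [];

while (!empty($todo)) {
    $link = array_shift($todo);
    if (in_array($link, $done)) {
        continue; // already processed this page
    }
    $done[] = $link;

    $html = new simple_html_dom();
    $html->load_file($link);

    // Collect image URLs on this page
    foreach ($html->find('img') as $img) {
        $images[] = $img->src;
    }

    // Queue internal links we have not seen yet
    foreach ($html->find('a') as $a) {
        $href = html_entity_decode($a->href);
        if (strpos($href, $host) !== false && !in_array($href, $done) && !in_array($href, $todo)) {
            $todo[] = $href;
        }
    }

    $html->clear(); // free memory before the next page
    unset($html);
}

print_r(array_unique($images));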

simple-html-dom loop not ending

I am trying to capture all the links and then go to the next page, until the end of the pages.
I just keep getting an endless loop. I think I am just glazed over, and was hoping that once again I can get some help today.
getLinks('http://www.homedepot.com/h_d1/N-5yc1vZaqns/h_d2/Navigation?catalogId=10053&langId=-1&storeId=10051&catStyle=ShowProducts#/?c=1&style=List');
function getLinks($URL) {
    $html = file_get_contents($URL);
    $dom = new simple_html_dom();
    $dom->load($html);

    foreach ($dom->find('a[class=item_description]') as $href) {
        $url = $href->href;
        echo $url;
    }

    if ($nextPage = $dom->find("a[class='paginationNumberStyle page_arrows']", 0)) {
        $nextPageURL = 'http://www.homedepot.com'.$nextPage->getAttribute('data-url');
        $dom->clear();
        unset($dom);
        getLinks($nextPageURL);
    } else {
        echo "\nEND";
        $dom->clear();
        unset($dom);
    }
}
In your code, you never keep track of where you've been.
Let's say you start on page A:
The first link on page A links to page B.
You open up page B and start crawling the links.
The first link on page B links to page A.
You open up page A and start crawling the links ....
This process will repeat indefinitely, because you'll end up crawling the same pages over and over. You need to keep a list of pages you've crawled and skip out if you've already crawled that page.
Also note that it may not be a simple loop like that.
A links to B
B links to C
C links to D
....
S links to T
T links to A
Not overly familiar with PHP, but something like:
$visited[$url] = true; // Tell it that we know this url

if (array_key_exists($url, $visited)) {
    // The url is already in the hash -- we've crawled it, so skip it
}
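Applied to the getLinks() function from the question, one possible sketch (the $visited array passed by reference is an assumption, not the asker's original code) would be:
function getLinks($URL, array &$visited = array()) {
    if (isset($visited[$URL])) {
        return; // already crawled this page -- stop the recursion here
    }
    $visited[$URL] = true;

    $html = file_get_contents($URL);
    $dom = new simple_html_dom();
    $dom->load($html);

    foreach ($dom->find('a[class=item_description]') as $href) {
        echo $href->href;
    }

    if ($nextPage = $dom->find("a[class='paginationNumberStyle page_arrows']", 0)) {
        $nextPageURL = 'http://www.homedepot.com'.$nextPage->getAttribute('data-url');
        $dom->clear();
        unset($dom);
        getLinks($nextPageURL, $visited); // pass the visited list along
    } else {
        echo "\nEND";
        $dom->clear();
        unset($dom);
    }
}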
The problem is you're following previous arrows as well as next arrows. Your css selector needs to be adjusted to account for this.

How to create a sitemap with page relationships

I'm currently trying to figure out a way to write a script (preferrably PHP) that would crawl through a site and create a sitemap. In addition to the traditional standard listing of pages, I'd like the script to keep track of which pages link to other pages.
Example pages
A
B
C
D
I'd like the output to give me something like the following.
Page Name: A
Pages linking to Page A:
B
C
D
Page Name: B
Pages linking to Page B:
A
C
etc...
I've come across multiple standard sitemap scripts, but nothing that really accomplishes what I am looking for.
EDIT
Seems I didn't give enough info. Sorry about my lack of clarity there. Here is the code I currently have. I've used simple_html_dom.php to take care of the tasks of parsing and searching through the html for me.
<?php
include("simple_html_dom.php");

$url = 'page_url';
$html = new simple_html_dom();
$html->load_file($url);

$linkmap = array();
foreach ($html->find('a') as $link):
    if (contains("cms/education", $link)):  // contains() is my own substring filter
        if (!in_array($link, $linkmap)):
            $linkmap[$link->href] = array();
        endif;
    endif;
endforeach;
?>
Note: My little foreach loop just filters based on a specific substring in the url.
So, I have the necessary first level pages. Where I am stuck is in creating a loop that will not run indefinitely, while keeping track of the pages you have already visited.
Basically, you need two arrays to control the flow here. The first will keep track of the pages you need to look at and the second will track the pages you have already looked at. Then you just run your existing code on each page until there are none left:
<?php
include("simple_html_dom.php");

$urlsToCheck = array();
$urlsToCheck[] = 'page_url';
$urlsChecked = array();
$linkmap = array();

while (count($urlsToCheck) > 0)
{
    $url = array_pop($urlsToCheck);
    if (!in_array($url, $urlsChecked))
    {
        $urlsChecked[] = $url;
        $html = new simple_html_dom();
        $html->load_file($url);
        foreach ($html->find('a') as $link):
            if (contains("cms/education", $link)):
                // Queue the linked page if we haven't seen or scheduled it yet
                if ((!in_array($link->href, $urlsToCheck)) && (!in_array($link->href, $urlsChecked)))
                    $urlsToCheck[] = $link->href;
                if (!array_key_exists($link->href, $linkmap)):
                    $linkmap[$link->href] = array();
                endif;
            endif;
        endforeach;
    }
}
?>
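To also capture the "pages linking to X" relationship the question asks for, the inner loop could record the current page as a referrer of each link it finds; a minimal sketch (variable names assumed) is:
// Inside the foreach over $html->find('a'), after the substring filter:
$target = $link->href;
if (!isset($linkmap[$target])) {
    $linkmap[$target] = array();
}
if (!in_array($url, $linkmap[$target])) {
    $linkmap[$target][] = $url; // remember that $url links to $target
}

// After the while loop, print the relationships:
foreach ($linkmap as $page => $referrers) {
    echo "Page Name: " . $page . "\n";
    echo "Pages linking to " . $page . ":\n";
    foreach ($referrers as $referrer) {
        echo "  " . $referrer . "\n";
    }
}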
