Web scrape not working with a link, no result - PHP

I've got a problem. I wrote this PHP script to obtain the price of a product from a website:
<?php
require('simple_html_dom.php'); //Library
$url = "https://www.exito.com/products/MP00550000000204/Televisor+LG+Led+43+Pulgadas+Full+HD+Smart+TV+43LJ550T"; //Link
$html = new simple_html_dom();
$html->load_file($url);
$post = $html->find('p[class=price offer]', 0)->plaintext;
$resultado = str_replace ( ".0", '', $post);
echo $resultado;
?>
So if I test with the link that is in the code, it works and shows me the price of the article: 1186900.
But when I change the link to another one from the same site (this one):
https://www.exito.com/products/0000225183192526/Carne+Res+Molida+En+Bandeja
and test the script, it does not show me anything.
I don't understand, because it is the same site and the price is in the same <p></p> element for all the articles.
What am I doing wrong?
NOTE: Remember that if you want to test the script you need to download the simple_html_dom.php here
I appreciate your help.
Thanks
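A diagnostic worth trying (my suggestion, not part of the original script): fetch the page yourself and check whether the price markup is present in the raw HTML at all. If the second product's price is filled in by JavaScript, simple_html_dom will never see it and find() will match nothing.
<?php
require('simple_html_dom.php');

// Product page that returns nothing (URL taken from the question)
$url = "https://www.exito.com/products/0000225183192526/Carne+Res+Molida+En+Bandeja";

$raw = file_get_contents($url);
echo $raw === false ? "request failed\n" : strlen($raw) . " bytes fetched\n";
echo strpos((string)$raw, 'price offer') === false
    ? "price markup not found in the static HTML (possibly rendered by JavaScript)\n"
    : "price markup is present in the static HTML\n";

$html = $raw ? str_get_html($raw) : false;
$p = $html ? $html->find('p[class=price offer]', 0) : null;
echo $p ? $p->plaintext : 'selector matched nothing';
?>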

Related

Grabbing content of external site CSS class. (steam store)

I have been playing around with this code for a while but can't get it to work properly.
My goal is to display, or maybe even build a table with IDs of, data grabbed from the Steam store for my own website and game library. The class is 'game_area_description'.
This is a study project of mine.
So I tried to get the table using the following code:
@section('selectedGame');
<?php
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
header("Access-Control-Allow-Origin: ${url}");
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//div[@class="game_area_description"]/a');
$link = $dom->saveHTML($elements->item(0));
echo $link;
?>
@endsection;
I am using Laravel, by the way.
In some other cases I can get another piece of the website:
$url = 'https://store.steampowered.com/app/'.$game->appID."/";
$content = file_get_contents($url);
$first_step = explode( '<div class="game_description_snippet">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo "<p>${second_step[0]}</p>";
Here it just takes the excerpt of the webpage, which works in some cases.
The biggest issue, other than not being able to get all the information, is that I get an error saying $first_step[1] is not valid.
It seems to be some CORS issue.
See, the webpage loads an age check in some cases, like "Batman Arkham Knight": the user needs to either log in or verify their age first,
which keeps me from using the second block of code.
And the first block gives me all kinds of errors, as the screenshot shows.
Does anyone know of a way to grab this part of the page,
where the description of the game is?
The answer to my question was in the comments:
apparently Steam has some undocumented APIs.
Here is the code (with Bootstrap CSS)
that I used and am going to implement in my migration tables and seeder:
@section('selectedGame');
<div class="container border">
<!-- Content here -->
<?php
$url = "http://store.steampowered.com/api/appdetails?appids=".$game->appID;
$jsondata = file_get_contents($url);
$parsed = json_decode($jsondata,true);
$gameID = $game->appID;
$gameDescr = $parsed[$gameID]['data']['about_the_game'];
echo $gameDescr;
?>
</div>
@endsection;
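One small check worth adding (my addition, not from the original answer): as far as I have seen, the appdetails response nests everything under the appid together with a success flag, roughly {"<appid>": {"success": true, "data": {...}}}, so it is safer to verify both json_decode's result and that flag before reading ['data']:
<?php
$url = "http://store.steampowered.com/api/appdetails?appids=" . $game->appID;
$jsondata = file_get_contents($url);
$parsed = $jsondata !== false ? json_decode($jsondata, true) : null;
$gameID = $game->appID;

// Guard against a failed request and against apps the API will not describe
if (is_array($parsed) && !empty($parsed[$gameID]['success'])) {
    echo $parsed[$gameID]['data']['about_the_game'];
} else {
    echo 'No description available for appID ' . $gameID;
}
?>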

Nesting simple-html-dom file_get_html($url)

I am attempting, unsuccessfully, to nest calls to file_get_html($url) from the simple-html-dom script.
Essentially I am requesting a page that has articles on it; these articles are looped through successfully and I can display their contents fine, however the image is only visible on the individual article page (once you click through to that specific article).
Is this possible, and why is this approach not working? Maybe I need to specify a new file_get_html():
<?php
$simplehtmldom = get_template_directory() . "/simplehtmldom/simple_html_dom.php";
include_once($simplehtmldom);

// INIT FILE_GET_HTML
$articleshtml = file_get_html($articles_url);
$articlesdata = "";

// FIND ARTICLES
foreach($articleshtml->find('.articles') as $articlesdata) :
    $articlecount = 1;
    // FIND INDIVIDUAL ARTICLE
    foreach($articlesdata->find('.article') as $article) :
        // FIND LINK TO PAGE WITH IMAGE FOR ARTICLE
        foreach($article->find('a') as $articleimagelink) :
            // LINK TO HREF
            $articleimage_url = $articleimagelink->href;
            // NESTED FILE_GET_HTML TO GO TOWARDS IMAGE PAGE
            $articleimagehtml = file_get_html($articleimage_url);
            $articleimagedata = "";
            foreach($articleimagehtml->find('.media img') as $articleimagedata) :
                // MAKE USE OF IMAGE I WORKED EXTRA HARD TO FIND
            endforeach;
        endforeach;
    endforeach;
endforeach; ?>
My question is about the possibility of making nested file_get_html() requests, so I can search for the image for a specific article on a separate page, then return to the previous loop and move on to the next article.
I believe if it were possible at all, it would need me to set up something like:
something = new simplehtmldom;
or
something = new file_get_html(url);
What can I try next?
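For what it's worth, nested calls are possible with simple-html-dom: every file_get_html() returns its own independent document object, so there is no need for a "new simplehtmldom". A minimal sketch of just the nesting (selectors and $articles_url taken from the question, everything else assumed), with clear() used to free the inner document between iterations:
<?php
include_once(get_template_directory() . "/simplehtmldom/simple_html_dom.php");

$articleshtml = file_get_html($articles_url);            // outer document: the article listing
foreach ($articleshtml->find('.article a') as $link) {
    $articleimagehtml = file_get_html($link->href);       // inner document: one article page
    if (!$articleimagehtml) {
        continue;                                          // skip pages that fail to load
    }
    foreach ($articleimagehtml->find('.media img') as $img) {
        echo $img->src . "<br />";                         // use the image found on the sub-page
    }
    $articleimagehtml->clear();                            // release the inner DOM before the next article
    unset($articleimagehtml);
}
?>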

Simple html dom always loading the default first page and not the specified url

I want to scrape a few web pages. I am using PHP and the simple html dom parser.
For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the url:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next page link, here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Just the page value is changed from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
    //Getting the next page links
    $aNext = $_htmlTemp->find('a.next', 0);
    $nextLink = $aNext->href;
    return $nextLink;
}
The above method returns the correct link, with the page value being 6.
But when I try to load this next link, it fetches the default first page, with the page query absent from the url.
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
    //Yay, we have the next link -- load the next link
    print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
    $originalHtml->load_file($nxtLink); //This line fetches default page
}
The whole flow is something like this:
$html->load_file($url);
//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do {
    $listings = $originalHtml->find('div.searchResult');
    foreach($listings as $item)
    {
        //Some logic here
    }
    //After loop we will have details of all the listing in this page -- so get next page link
    $nxtLink = getNextLink($originalHtml); //Returns string url
    if(!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link
        print 'Next Url: '.$nxtLink.'<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }
} while($shouldLoop);
I have tried encoding the whole url, and just the query parameters, but got the same result. I also tried creating new instances of simple_html_dom and then loading the file; no luck. Please help.
You need to html_entity_decode those links; I can see that they are getting mangled by simple-html-dom.
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));
while($a = $html->find('a.next', 0)){
    $url = html_entity_decode($a->href);
    echo $url . "\n";
    $html = str_get_html(file_get_contents($url));
}
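To illustrate what the decode fixes (the href below is just a representative example of the question's next link): the attribute presumably comes back with its HTML entities intact, so the query separator arrives as &amp; and the page parameter is effectively dropped when that URL is requested.
$href = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&amp;page=6';
echo html_entity_decode($href);
// https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6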

How to display image url from website sub pages using php code

I am using the PHP code below to display images from web pages. It is able to display image URLs from the main page but unable to display image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');
$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('img') as $img)
{
    echo $img->src."<br />";
    echo $img."<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course that script won't show any of those since you're not loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
A rough sketch of that spider, using simple_html_dom:
$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];
while( ! empty($todo)){
    $link = array_shift($todo);
    $done[] = $link;
    $html = file_get_html($link);
    if( ! $html){ continue; }
    foreach($html->find('img') as $img){            // collect every <img> url on this page
        $images[] = $img->src;
    }
    foreach($html->find('a') as $a){                // queue unseen internal links only
        $href = $a->href;
        if(strpos($href, 'http://fffmovieposters.com') === 0
            && ! in_array($href, $done) && ! in_array($href, $todo)){
            $todo[] = $href;
        }
    }
}
Or something like that...

Only show certain ID with PHP web scrape?

I'm working on a personal project that gets the content of my local weather station's school/business closings page and displays the results on my personal site. Since the site doesn't use an RSS feed (sadly), I was thinking of using a PHP scrape to get the contents of the page, but I only want to show a certain ID element. Is this possible?
My PHP code is,
<?php
$url = 'http://website.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
?>
I was thinking of using preg_match, but I'm not sure of the syntax or if that's even the right command. The ID element I want to show is #LeftColumnContent_closings_dg.
Here's an example using DOMDocument. It pulls the text from the first <h1> element with the id="test" ...
$html = '
<html>
<body>
<h1 id="test">test element text</h1>
<h1>test two</h1>
</body>
</html>
';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$res = $xpath->query('//h1[@id="test"]');
if ($res->item(0) !== NULL) {
    $test = $res->item(0)->nodeValue;
}
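To tie that example back to the question: assuming $output holds the curl result from above (and without knowing which element type carries that id, hence the //* in the XPath), the same approach would look roughly like this:
libxml_use_internal_errors(true);        // real-world pages rarely parse without warnings
$dom = new DOMDocument;
$dom->loadHTML($output);
$xpath = new DOMXPath($dom);
$res = $xpath->query('//*[@id="LeftColumnContent_closings_dg"]');  // id from the question
if ($res->item(0) !== NULL) {
    echo $dom->saveHTML($res->item(0));  // print the matched element's markup
}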
A library I've used with great success for this sort of thing is PHPQuery: http://code.google.com/p/phpquery/
You basically get your website into a string (like you have above), then do:
phpQuery::newDocument($output);
$titleElement = pq('title');
$title = $titleElement->html();
For instance - that would get the contents of the title element. The benefit is that all the methods are named after the jQuery ones, making it pretty easy to learn if you already know jQuery.
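Applied to the question's element, that would presumably be something along the lines of (id taken from the question, untested):
phpQuery::newDocument($output);
echo pq('#LeftColumnContent_closings_dg')->html();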
