simple_html_dom not finding my element - php

I'm trying to scrape prices off of Amazon for an exercise.
<?php
require('simple_html_dom.php');
$get = $_GET['id'];
$get_url = 'http://www.amazon.co.uk/s/field-keywords='.$get;
echo $get_url;
// Retrieve the DOM from a given URL
$html = file_get_html($get_url);
foreach ($html->find('li[class=newp]') as $e) {
    echo $e->plaintext . '<br>';
}
I tried a few different selectors:
li[class=newp]
.price
ul[class=rsltL]
but none of them returns anything. What am I doing wrong?
I tried returning the titles as well:
.lrg.bold
Tried Xpath, nothing.
Thanks

Your code is fine. It is very likely that your PHP settings are the culprit.
Put
error_reporting(E_ALL);
ini_set('display_errors', '1');
at the beginning of your PHP script and see if it prints out any useful error.
Also, note that simple_html_dom uses the file_get_contents() function internally to grab page content. So, you may want to run file_get_contents($get_url) to see what happens.
If that function does not work then it is definitely your PHP setting. In such case, I recommend starting another thread with that issue in the title.
This might help though:
PHP file_get_contents does not work on localhost
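Putting the pieces above together, a minimal diagnostic sketch (using the asker's URL; whether Amazon actually answers the request is a separate matter):

```php
<?php
// Surface warnings that display_errors = Off would normally hide.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$get_url = 'http://www.amazon.co.uk/s/field-keywords=' . urlencode($_GET['id'] ?? '');

// Test the underlying fetch first: simple_html_dom uses file_get_contents()
// internally, so if this call fails, the parser cannot work either.
$raw = file_get_contents($get_url);
if ($raw === false) {
    die('file_get_contents() failed - check allow_url_fopen and network access.');
}
echo 'Fetched ' . strlen($raw) . " bytes\n";
```

If this prints a byte count, the fetch works and the problem is in the selectors; if it fails, it is a PHP configuration or network issue.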

Related

Simple HTML DOM cannot get file

I have no clue what the solution might be.
I simply cannot get the html file of this Charizard, I don't get any response even though the link is correct. Bulbasaur is working fine, but I want this lovely Charizard...
include("simple_html_dom.php");
$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)');
$html2 = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)');
echo $html;
echo $html2;
Does this page have any protection or is Charizard only harder to catch?
I'd appreciate if you are able to help me with this.
Jonas :)
There are two problems here:
Length of the content fetched from this URL exceeds MAX_FILE_SIZE (defined in simple_html_dom.php)
The bug that was pointed out in the comments (https://github.com/sunra/php-simple-html-dom-parser/issues/37). This bug seems to be resolved in the forked repository maintained on GitHub, but it still exists in the original version (which no longer seems to be maintained).
To solve the first problem, edit simple_html_dom.php and change define('MAX_FILE_SIZE', 600000); to use a bigger number.
As a workaround for the second problem, pass correct parameters to file_get_html, and by that I mean to pass 0 for $offset:
$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)',
false,
null,
0); // this last one is the offset
var_dump($html);
Alternatively you can use the forked version of the library.
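If you go the forked-library route, it is published on Packagist as sunra/php-simple-html-dom-parser. A sketch of its typical usage (verify the class name against the fork's README):

```php
<?php
// Install first with: composer require sunra/php-simple-html-dom-parser
require 'vendor/autoload.php';

use Sunra\PhpSimple\HtmlDomParser;

// The fork wraps the familiar API in static methods on HtmlDomParser.
$html = HtmlDomParser::file_get_html(
    'https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)'
);
echo $html->find('h1', 0)->plaintext;
```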
I'm going to suggest an alternative library, because I don't think you will get this with simple_html_dom:
include 'advanced_html_dom.php';
$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)');
echo $html->find('h1', 0)->text() . PHP_EOL;
echo $html->find('big a[title*="Pokédex number"]', 0)->text() . PHP_EOL;
This gives:
Charizard (Pokémon)
#006
Since I haven't found file_get_html() in the PHP docs, maybe you'd prefer using file_get_contents($url) instead.

file_get_html & str_get_html with cURL are getting part of a page

This is a really weird situation that I can't explain.
I use simple HTML DOM and am trying to get the full code of this page:
http://ronilocks.com/
The thing is, I'm getting only part of what's actually on the page.
For instance: look at the page source code and see all the script tags that are in the plugins folder. There are quite a few.
When I check the same with the string I get back from simple HTML DOM none of them are there. Only wp-rocket.
(I used a clean file_get_html() and a file_get_contents() too and got the same result)
Any thoughts?
Thanks!
Edit: Is it possible that wp-rocket (installed on the page being scraped) knows that the page is being scraped and serves something different?
include 'simple_html_dom.php';
$html = file_get_html('http://ronilocks.com/');
echo count($html->find('a'));
// 425
I get 425. This looks right to me.

PHP - file_get_html not returning anything

I am trying to scrape data from this site, using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
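For example (the markup below is an invented stand-in for the real page, just to show the pattern of running a regex over the fetched string):

```php
<?php
// Stand-in for $html_str = file_get_contents($url); this sample markup
// is invented purely to demonstrate matching against the raw HTML string.
$html_str = '<div id="dtr-rating"><span>Diamond</span></div>';

// The /s modifier lets . match newlines, since real HTML spans lines.
if (preg_match('/<div id="dtr-rating">.*?<\/span>/s', $html_str, $data)) {
    var_dump($data[0]);
}
```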
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes Javascript to run and build the dynamic contents of the web page that you see displayed. That means the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything because it is simply not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the Javascript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.

file_get_html() not working with airbnb

I have a problem with file_get_html(); I don't understand why it doesn't work. Can you help me? My code:
$html = file_get_html('https://www.airbnb.fr/');
if ($html) {
echo "good";
}
Have a good day!
I think the server just blocks your request; you will not be able to fetch data from it using simple HTTP requests.
You can try using curl, proxies, or both (there are ready to use solutions for this, like: AngryCurl, or RollingCurl)
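A sketch of the cURL approach: some sites reject requests that lack a browser-like User-Agent header, so sending one is usually the first thing to try (the User-Agent string here is arbitrary, and whether this alone is enough for airbnb.fr is not guaranteed):

```php
<?php
// Fetch a page with cURL, sending a browser-like User-Agent header.
$ch = curl_init('https://www.airbnb.fr/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]);
$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);

// The result can then be handed to the parser:
// $html = str_get_html($body);
```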
It doesn't work because you have to include the simple_html_dom class to make it work. You can find the code on their official page:
http://simplehtmldom.sourceforge.net/
Then you can simply get the HTML and output it like this:
// Dump the full HTML (including tags) from the page
echo file_get_html('http://www.google.com/')->outertext;
or if you want to save the result in a variable
// Dump the full HTML (including tags) from the page
$html = file_get_html('http://www.google.com/')->outertext;
More info: http://simplehtmldom.sourceforge.net/

file_get_contents api link two connections with session

I have two API pages, api1.php and api2.php. On page 1 a session is set, and on page 2 this session needs to be returned.
Of course there will be additional functions, but my goal is to link those two API connections to one another by using a session.
api1.php:
session_start();
$api_key = 'dfdsakdsfjdskfjdskfdsjfdfewfifjjsd';
$_SESSION['api_key'] = $api_key;
setcookie('api_key', $api_key);
api2.php:
session_start();
echo $_SESSION['api_key'];
echo $_COOKIE['api_key'];
test.php:
$url = 'http://example.com/api1.php';
$content1 = file_get_contents($url);
$url2 = 'http://example.com/api2.php';
$content2 = file_get_contents($url2);
echo $content2;
As you may have noticed, I'm visiting the page test.php to obtain a result.
But no result is being returned.
Can somebody tell me why this is not working, and what may be an additional way of making all of this happen?
(Notice: the example.com are both the same site (mine))
Your code "links" correctly. The problem is actually in test.php! Instead of executing the code contained in both files, it retrieves the entire file. If you view the source you will note the PHP tags and your code. A better way to check whether this is working is to visit api1.php and api2.php separately. With some code adjustments you could also just use the include() or require() functions, which would look like this:
api2.php
echo $_SESSION['api_key'] . "\n<br/>\n";
echo $_COOKIE['api_key'];
test.php
include('api1.php');
include('api2.php');
It's worth noting that using the include and require functions executes the code in api1.php and api2.php as if that code were part of test.php.
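If the two endpoints really do need to be called over HTTP while sharing one session, plain file_get_contents() will not carry the session cookie from the first request to the second. One approach (a sketch, assuming the URLs from the question) is a cURL cookie jar:

```php
<?php
// Reuse one cURL handle with a cookie jar so the session cookie set by
// api1.php is sent back on the request to api2.php.
$jar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,  // write cookies here after each request
    CURLOPT_COOKIEFILE     => $jar,  // read them back on the next request
]);

curl_setopt($ch, CURLOPT_URL, 'http://example.com/api1.php');
curl_exec($ch);          // sets the session and receives the cookie

curl_setopt($ch, CURLOPT_URL, 'http://example.com/api2.php');
echo curl_exec($ch);     // this request now sees the same session

curl_close($ch);
```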
