I have a problem with file_get_html(): I don't understand why it doesn't work. Can you help me? My code:
$html = file_get_html('https://www.airbnb.fr/');
if ($html) {
echo "good";
}
Have a good day!
I think the server is simply blocking your request; you won't be able to fetch data from it with plain HTTP requests.
You can try using cURL, proxies, or both (there are ready-to-use solutions for this, such as AngryCurl or RollingCurl).
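As a rough illustration of the cURL suggestion, here is a minimal sketch that sends browser-like headers. The function name fetch_page(), the User-Agent string, and the timeout are assumptions made for this example; a site that blocks scrapers may still reject the request.

```php
<?php
// Sketch: fetch a page with cURL instead of a bare HTTP request.
// fetch_page() is a hypothetical helper; the User-Agent value is only an example.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_ENCODING       => '',   // accept and decode any supported compression
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as "no page".
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

If the call succeeds, the returned string can be handed to simple_html_dom's str_get_html() rather than calling file_get_html() on the URL.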
It doesn't work because you have to include the simple_html_dom library first. You can find the code on the official page:
http://simplehtmldom.sourceforge.net/
Then you can simply get the HTML and output it like this:
// Dump the page's HTML (outertext keeps the tags; use ->plaintext for text only)
echo file_get_html('http://www.google.com/')->outertext;
Or, if you want to save the result in a variable:
// Store the page's HTML, tags included
$html = file_get_html('http://www.google.com/')->outertext;
More info: http://simplehtmldom.sourceforge.net/
Related
This is a really weird situation that I can't explain.
I use simple HTML DOM and am trying to get the full code of this page:
http://ronilocks.com/
The thing is, I'm getting only part of what's actually on the page.
For instance: look at the page source code and see all the script tags that are in the plugins folder. There are quite a few.
When I check the same thing in the string I get back from simple HTML DOM, none of them are there. Only wp-rocket.
(I used a clean file_get_html() and a file_get_contents() too and got the same result)
Any thoughts?
Thanks!
Edit: Is it possible that wp-rocket (installed on the page being scraped) detects that the page is being scraped and serves something different?
include 'simple_html_dom.php';
$html = file_get_html('http://ronilocks.com/');
echo count($html->find('a'));
// 425
I get 425. This looks right to me.
I am trying to scrape data from this site. Using "inspect", I checked the class of the div, but when I try to get it, nothing is displayed:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/' . $memberName;
// plaintext strips the tags; use ->outertext if the regex needs the markup
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
// find() with an index returns null when nothing matches
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div ? $the_div->innertext : '';
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the page that you see displayed. In other words, the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find it because it is not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
Why does the code below not print the HTML content?
$url = 'http://clashofclans.com';
echo file_get_contents($url);
It works for every website except $url. I get this:
‹í}}{ÛƱïÿùÛ[É-á…IÛrŽÍØqzœØO¤º·'ÍÕ ˆ˜$ÔK÷;¿™¾åö9µËÅîìÌì¼íìxúå×o»çÿx÷Rd£á³/žâ¢ ƒñÕi#7P½g_hÚÓQ”Z8¦i”6fYßh7NæwY61¢_fñõiãÿ{nt“Ñ$ÈâËaÔÐÂdœEcêöíËÓ¨w•;Ž
Because the response content is gzipped.
Try gzdecode:
gzdecode(file_get_contents($url));
Consider using cURL instead, which does the decompression for you and should be more robust, as described in this SO answer.
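A small sketch of the gzdecode approach, guarded so that uncompressed responses pass through untouched. It relies on the fact that gzip data begins with the bytes 0x1f 0x8b; the helper name maybe_gunzip() is made up for this example, and the sample input is generated locally rather than fetched from the site.

```php
<?php
// Return the decoded body if $raw looks gzip-compressed, otherwise return it unchanged.
// maybe_gunzip() is a hypothetical helper name.
function maybe_gunzip(string $raw): string
{
    // Gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($raw, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($raw);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $raw;
}

// Local demonstration: no network request is made here.
$compressed = gzencode('<html>hello</html>');
echo maybe_gunzip($compressed), "\n";           // prints <html>hello</html>
echo maybe_gunzip('<html>plain</html>'), "\n";  // prints <html>plain</html>
```

With a real request this would be used as maybe_gunzip(file_get_contents($url)).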
I am trying to download the source code of a Twitter page with PHP:
$continut_pp = file_get_contents('https://twitter.com/');
echo $continut_pp;
The problem is that the result is null. I think the problem comes from HTTPS. How can I fetch the source of an HTTPS page in PHP?
Try using the file_get_contents function. Just give it the full web address and it should return the HTML source. I hope this helps.
As the first function I suggested did not work, you could try this one: var markup = document.documentElement.innerHTML;. However, it is JavaScript, not PHP.
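One thing that can be checked on the PHP side, assuming the failure really is related to HTTPS as the question suspects, is whether the installation can open https:// URLs at all. The sketch below only inspects the local configuration; the function name https_fetch_supported() is invented for this example, and a passing check does not guarantee that the remote site will answer.

```php
<?php
// Sketch: report whether file_get_contents() can open https:// URLs on this install.
// https_fetch_supported() is a hypothetical helper name.
function https_fetch_supported(): bool
{
    return extension_loaded('openssl')                        // TLS support is compiled in or loaded
        && in_array('https', stream_get_wrappers(), true)     // the https:// stream wrapper is registered
        && filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN); // URL fopen is enabled
}

echo https_fetch_supported()
    ? "https:// URLs can be opened\n"
    : "https:// URLs cannot be opened with this configuration\n";
```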
I had a big PHP script written out to scrape images from this site: "http://www.mcso.us/paid/", but when it didn't work I butchered my code to simply echo the whole page.
I found that the table with the image links I want doesn't show up. I believe it's because the remote site uses ASP to generate the table. Is there a way around this? Am I wrong? Please help.
<?php
include("simple_html_dom.php");
set_time_limit(0);
$baseURL = "http://www.mcso.us/paid/";
$html = file_get_html($baseURL);
echo $html;
?>
There's no obvious reason why their use of ASP would cause this. Have you tried navigating the page with JavaScript turned off? It's more likely that the tables are generated through JS.
Note that the search results are retrieved through AJAX (page http://www.mcso.us/paid/default.aspx) by making a POST request. You can reproduce that request with cURL (http://php.net/manual/en/book.curl.php): in Chrome, right-click, choose Inspect, open the Network tab, and run a search; you will see all the details there (POST variables, etc.).
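A minimal sketch of replaying such a POST with cURL follows. The helper post_form() and the field names lastName and firstName are placeholders, not the site's real parameters; the actual names have to be copied from the request shown in the browser's Network tab. ASP.NET pages also commonly expect hidden fields such as __VIEWSTATE to be posted back, so those may need to be included as well.

```php
<?php
// Sketch: send a form-encoded POST and return the response body (null on failure).
// post_form() is a hypothetical helper name.
function post_form(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // application/x-www-form-urlencoded body
        CURLOPT_RETURNTRANSFER => true,                      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Placeholder field names: replace them with the ones seen in the Network tab.
$fields = ['lastName' => 'Smith', 'firstName' => ''];
echo http_build_query($fields), "\n"; // prints lastName=Smith&firstName=
```

A call would then look like post_form('http://www.mcso.us/paid/default.aspx', $fields), with the response passed to str_get_html() for parsing.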