I'm a beginner in PHP and I'm writing a simple program that crawls a web site (no private information). The result I expected is raw HTML code, like:
<html><head><title>blabla blabla</title></head>...................
But when I check the result, the rendered page shows up instead of the raw code. For example:
include "Snoopy.class.php";
$snoopy = new Snoopy;
$snoopy->fetch("http://stackoverflow.com/");
echo $snoopy->results;
How do I get the result as HTML code? And do you have another good parsing library for PHP (like BeautifulSoup in Python or Jsoup in Java)?
** The result of the above code: not HTML code, but the rendered page **
To see the source code in your browser instead of having it render the HTML, your last line should be:
echo htmlspecialchars($snoopy->results);
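For context, a minimal sketch of the whole fix (assuming Snoopy.class.php sits next to the script); wrapping the output in <pre> keeps the line breaks readable:
<?php
// Assumes Snoopy.class.php is in the same directory as this script
include "Snoopy.class.php";

$snoopy = new Snoopy;
$snoopy->fetch("http://stackoverflow.com/");

// htmlspecialchars() escapes < > & " so the browser shows the markup
// instead of rendering it; <pre> preserves the line breaks
echo '<pre>' . htmlspecialchars($snoopy->results) . '</pre>';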
It's very simple:
// Include the Snoopy class and instantiate it
require "snoopy/Snoopy.class.php";
$snoopy = new Snoopy;
// This fetches the HTML
$snoopy->fetch("http://www.php.net/");
$html = $snoopy->results;
// This fetches the text with HTML tags stripped
$snoopy->fetchtext("http://www.php.net/");
$text = $snoopy->results;
// This fetches all the links
$snoopy->fetchlinks('http://www.php.net/');
$linksarray = $snoopy->results;
Snoopy works great for me, so I hope that helps.
If you want to fetch the HTML from a URL, you can do this simply with PHP's file_get_contents() function.
$url = 'http://stackoverflow.com/';
$html = file_get_contents($url);
// echo $url;  // wrong: this would only print the URL string
echo $html;   // right: this prints the fetched HTML
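As for a parsing library in the spirit of BeautifulSoup or Jsoup: besides Snoopy and simple_html_dom, PHP's built-in DOMDocument and DOMXPath can parse whatever you fetch, with no extra package. A rough sketch (the URL and the XPath query are just examples):
<?php
$url  = 'http://stackoverflow.com/';
$html = file_get_contents($url);

// Parse the HTML; suppress libxml warnings because real-world markup
// is rarely perfectly valid
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query the document with XPath, e.g. grab the <title> text
$xpath = new DOMXPath($doc);
$title = $xpath->query('//title')->item(0);
echo $title ? $title->textContent : 'no <title> found';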
Related
I am trying to scrape data from this site. Using "Inspect", I am checking the class of the div, but when I try to get it, nothing is displayed:
I'm trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/' . $memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
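For example (a sketch using a lightly adjusted version of the regex from the question: non-greedy, with the /s modifier so . can span newlines; a DOM parser is usually more reliable than regex for HTML):
$url      = 'https://destinytracker.com/d2/profile/pc/' . $memberName;
$html_str = file_get_contents($url);

// Run the pattern against the raw HTML string, not against an object
if (preg_match('/<div id="dtr-rating".*?span>/s', $html_str, $data)) {
    var_dump($data);
} else {
    echo 'no match';
}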
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
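It is worth guarding against find() coming back empty before touching innertext (a small sketch, assuming the library's usual behaviour of returning null when the indexed element does not exist):
$html_obj = file_get_html($url);
$the_div  = $html_obj->find('div[id=dtr-rating]', 0);

if ($the_div !== null) {
    echo $the_div->innertext;
} else {
    echo 'div#dtr-rating was not found in the fetched HTML';
}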
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different from what I first thought (I'll leave the other answer there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the page you see displayed. In other words, the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything because it's simply not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
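A quick way to confirm this from PHP (a sketch; it only inspects the raw server response, not the JavaScript-rendered page):
$raw = file_get_contents('https://destinytracker.com/d2/profile/pc/' . $memberName);

// If the marker is absent here, the content is built client-side by VueJS
if (strpos($raw, 'dtr-rating') === false) {
    echo "dtr-rating is not in the server's HTML; it is rendered by JavaScript";
} else {
    echo "dtr-rating found in the raw HTML";
}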
I have a problem with file_get_html(); I don't understand why it doesn't work. Can you help me? My code:
$html = file_get_html('https://www.airbnb.fr/');
if ($html) {
    echo "good";
}
Have a good day!
I think the server just blocks your request; you will not be able to fetch data from it using plain HTTP requests.
You can try using cURL, proxies, or both (there are ready-to-use solutions for this, like AngryCurl or RollingCurl).
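A minimal cURL sketch with a browser-like User-Agent, which is sometimes enough to get past simple blocks (no guarantee that it works for Airbnb specifically):
$ch = curl_init('https://www.airbnb.fr/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    CURLOPT_TIMEOUT        => 30,
]);
$html = curl_exec($ch);

if ($html === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo "good";
}
curl_close($ch);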
It doesn't work because you have to include the simple_html_dom class to make it work. You can find the code on the official page:
http://simplehtmldom.sourceforge.net/
Then you can simply get the HTML and output it like this:
// Dump the page's full HTML (use ->plaintext instead for text without tags)
echo file_get_html('http://www.google.com/')->outertext;
or if you want to save the result in a variable
// Same as above, but keeping the HTML in a variable
$html = file_get_html('http://www.google.com/')->outertext;
More info: http://simplehtmldom.sourceforge.net/
I'm attempting to use the Zoopla API (http://developer.zoopla.com/docs/read/Property_listings) to output specific data.
I have tested the API using a simple echo after the file_get_contents() call, which shows the data. Example code shown below (API key removed):
$url = "http://api.zoopla.co.uk/api/v1/property_listings.xml?postcode=CF11&api_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
$zoopla = file_get_contents($url);
echo $zoopla;
What I'm trying to code is a loop that will let me add HTML tags so that I can style the output. I've done something similar for an RSS feed but can't figure out a way to do it for this XML.
I have also tried an alternative approach using simplexml_load_file():
$xml = simplexml_load_file($url);
$agent_address = $xml->agent_address->agent_address[1]->agent_address;
echo $agent_address;
Any help would be greatly appreciated!
I found the answer to my own question!
Basically, what comes back from the URL is a string, not a file, so simplexml_load_file() wasn't the right fit.
So first we need to get the XML as a string and then parse it. Code as follows; works like a treat!
$zoopla = file_get_contents('http://api.zoopla.co.uk/api/v1/property_listings.xml?postcode=CF64&api_key=xxxxxxxxxxxxxxxxx');
$properties = simplexml_load_string($zoopla);
echo $properties->listing[2]->agent_phone;
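To add your own HTML tags for styling, a foreach over the listing elements is enough. A sketch (agent_phone and agent_address are the only element names taken from the snippets above; anything else would need checking against the actual feed):
$zoopla     = file_get_contents('http://api.zoopla.co.uk/api/v1/property_listings.xml?postcode=CF64&api_key=xxxxxxxxxxxxxxxxx');
$properties = simplexml_load_string($zoopla);

foreach ($properties->listing as $listing) {
    echo '<div class="listing">';
    echo '<p class="phone">'   . htmlspecialchars((string) $listing->agent_phone)   . '</p>';
    echo '<p class="address">' . htmlspecialchars((string) $listing->agent_address) . '</p>';
    echo '</div>';
}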
Why does the code below not print the HTML content?
$url = 'http://clashofclans.com';
echo file_get_contents($url);
It works for every website except this $url. I get this:
‹í}}{ÛƱïÿùÛ[É-á…IÛrŽÍØqzœØO¤º·'ÍÕ ˆ˜$ÔK÷;¿™¾åö9µËÅîìÌì¼íìxúå×o»çÿx÷Rd£á³/žâ¢ ƒñÕi#7P½g_hÚÓQ”Z8¦i”6fYßh7NæwY61¢_fñõiãÿ{nt“Ñ$ÈâËaÔÐÂdœEcêöíËÓ¨w•;Ž
Because the response content is gzipped.
Try gzdecode:
echo gzdecode(file_get_contents($url));
Consider using cURL instead, which does the decompression for you and should be more robust, as described in this SO answer.
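A sketch of the cURL route: setting CURLOPT_ENCODING to an empty string tells cURL to advertise the encodings it supports and to decompress the response transparently:
$ch = curl_init('http://clashofclans.com');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING       => '',  // accept gzip/deflate and decode automatically
]);
$html = curl_exec($ch);
curl_close($ch);

echo $html;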
Hello Stack Overflow, I'm trying to use the following library: simplehtmldom.sourceforge.net
The purpose of this little script is to grab the Stack Overflow logo and echo it, but for some strange reason it grabs every DOM element instead. Any idea what I'm doing wrong here?
<?php
include('simple_html_dom.php');
$request_url = 'http://stackoverflow.com/';
$html = file_get_html($request_url);
$element = $html->find('div[id=hlogo]');
echo $html->save($element);
Thank you in advance for taking your time to read this!
$html->find returns an array in the form that you're using it, so you need to access the first element of the array to get the results:
include('simple_html_dom.php');
$html = file_get_html('http://stackoverflow.com');
$logo = $html->find('#hlogo'); // find the id hlogo
echo $logo[0];
# prints out
# <div id="hlogo"> Stack Overflow </div>
You're also using the save() function incorrectly; from the docs:
// Dumps the internal DOM tree back into string
$str = $html->save();
// Dumps the internal DOM tree back into a file
$html->save('result.htm');
You're getting the whole page because $html contains the whole DOM!
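As a shortcut, find() also accepts an index as its second argument, so you can take the first match directly instead of indexing the returned array (it should return null if nothing matches):
// Grab only the first element matching #hlogo
$logo = $html->find('#hlogo', 0);
echo $logo;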