I need to scrape the short piece of text that Google returns for a query as part of its "Knowledge Graph" result (the panel that usually appears on the right-hand side), which Google pulls from Wikipedia. I can then convert that plain text to a voice answer. Using Simple HTML DOM I have no problem scraping this kind of info from Bing or Ask, but the DIV (and SPAN) this result is nested in on Google I just can't get at. Simple function below:
include('simple_html_dom.php');
$question = str_replace(' ', '+', $_GET['question']);
$address = 'http://www.google.co.uk/search?q=' . $question;
$ret = scraping_Google($address);
function scraping_Google($url) {
// create HTML DOM
$html = file_get_html($url);
// get the Knowledge Graph description
$ret = $html->find('div.kno-rdesc', 0)->plaintext;
// clean up memory
$html->clear();
unset($html);
return $ret;
}
echo $ret;
That div.kno-rdesc is where the content is nested (this I can easily verify with Chrome's element inspector), yet I have had no success parsing out this tiny piece of information. Is anybody able to help out? Cheers!
You don't need to scrape it. Google has an API for that. Tap into the power of Google's Knowledge Graph with Freebase data
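For example, here is a minimal sketch against the Knowledge Graph Search API (the successor to the Freebase API); the endpoint and response fields are my reading of the public API docs, and you would need your own API key:

<?php
// Sketch only: assumes the public Knowledge Graph Search API endpoint;
// the key below is a placeholder.
$apiKey = 'YOUR_API_KEY';
$query  = urlencode($_GET['question']);
$url    = 'https://kgsearch.googleapis.com/v1/entities:search'
        . '?query=' . $query . '&limit=1&key=' . $apiKey;

$json = json_decode(file_get_contents($url), true);

// The short Wikipedia-sourced description, if the entity has one,
// lives under detailedDescription.articleBody.
if (isset($json['itemListElement'][0]['result']['detailedDescription']['articleBody'])) {
    echo $json['itemListElement'][0]['result']['detailedDescription']['articleBody'];
}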
I am trying to scrape data from this site. Using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/' . $memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
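For instance, a sketch reusing your pattern (I made the quantifier lazy and added the s modifier so it doesn't swallow the rest of the page; the deeper problem is described in my other answer):

<?php
// Sketch: fetch the raw HTML as a string and run the regex on it.
// preg_match() needs a string, not the object file_get_html() returns.
$url      = 'https://destinytracker.com/d2/profile/pc/' . $memberName;
$html_str = file_get_contents($url);

if (preg_match('/<div id="dtr-rating".*?span>/s', $html_str, $data)) {
    var_dump($data);
}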
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML document in more detail, I see the problem is different from what I first thought (I'll leave that answer there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the page you see displayed. That means the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server. Your regex cannot find anything because it's simply not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.
I am using the Goutte library in a Laravel project to fetch page content and crawl it.
I can find any element of the DOM structure, except that on one of the sites the important content is placed inside a <script> tag.
The data sits in JavaScript variables and I want to crawl it without heavy string operations. A typical example of such a case:
$html="var article_content = "Details article string";
var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";
var article_images = new Array (
"http://img.sireasas.com/?i=reuters%2f2017-03-08%2f2017-03-
08t200344z_132005024_mt1aci14762686_rtrmadp_3_soccer-champions-fcb-
psg_reuters.jpg","",
"0000000000115043","",
"");";
Is there any way to crawl that JavaScript using selectors or DOM methods?
What I would do is get the content that exists inside the <script> tag and then extract whatever I want with regular expressions.
$doc = new DOMDocument();
$doc->loadHTML($yoursiteHTML);
foreach($doc->getElementsByTagName('script') as $content) {
// extract data
}
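For example, here is a rough sketch that pulls one of the variables out of your snippet (the variable name comes from your example; the pattern assumes simple double-quoted values):

<?php
// Sketch: find each <script> block and pull a JS variable out with a regex.
$doc = new DOMDocument();
@$doc->loadHTML($yoursiteHTML); // @ silences warnings from sloppy markup

foreach ($doc->getElementsByTagName('script') as $script) {
    // e.g. grab: var article_twtitle = "Article title";
    if (preg_match('/var\s+article_twtitle\s*=\s*"([^"]*)"/', $script->nodeValue, $m)) {
        echo $m[1]; // "Article title"
    }
}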
Goutte only receives the HTML response; it does not run JavaScript code the way a browser does, so it cannot see dynamically generated data.
Use PHP Simple HTML DOM Parser
$html = file_get_html('http://www.your-link-here.com/');
// Find all scripts
foreach($html->find('script') as $element)
echo $element->outertext . '<br>';
UPDATE: The source code is very much different from what Developer Tools shows.
Check out the source: view-source:http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002
Is that JavaScript that needs to be rendered by a browser into HTML? If so, how can I have PHP do that rendering so that I have HTML to parse? It's weird that you can use XPath Checker to return the items I'm looking for (see below), but you cannot access the full HTML!
(XPath: //table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id, 'tblContent') or contains(@id, 'tblListingHeader'))])
END UPDATE
I need to scrape some information off of this site for work on a regular basis, and I am attempting to write some PHP code to do it. Having read a number of other posts on SO, I think I have some namespace issues here; I have never encountered namespace problems before and used the approach shown in another SO post (to no avail :().
It appears the XPath query is just not matching anything, for whatever reason. If you have any guesses or solutions as to how to handle this issue, I am open to suggestions.
Also here is the output from my code:
object(DOMXPath)#2 (0) {
}
Debug 1
array(0) {
}
array(0) {
}
I left out the bottom of the code, where I var_dump $testarray and then create and var_dump $otherarray; their output is included above. Obviously both arrays will be empty if the DOMXPath query returns no nodes.
$string = 'http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002';
$machine_trader = file_get_contents($string);
$xml = new DOMDocument();
$xml->loadHTML($machine_trader);
$xpath = new DOMXPath($xml);
$rootNamespace = $xml->lookupNamespaceUri($xml->namespaceURI);
$xpath->registerNamespace('x', $rootNamespace);
$tableRows = $xpath->query("//x:table[contains(@id, 'ctl00_ContentPlaceHolder1') and (contains(@id, 'tblContent') or contains(@id, 'tblListingHeader'))]");
var_dump($xpath);
$testarray = array();
$otherarray = array();
foreach ( $tableRows as $row )
{
echo "Debug 1"."\n";
$testarray[] = $row->nodeValue;
}
This is not an XPath issue, insofar as the actual content comes from a form post which you haven't reached yet. The JS source code here does nothing more than authenticate a proper 'user' for the information request and then send the request via form submission.
At each request, the salt / encryption 'key' is randomized and changes, preventing simple scrapes.
You could rewrite that JavaScript to PHP and then issue two requests, battling the authentication process along the way.
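A very rough, hypothetical sketch of that two-request flow with cURL; the token field name and pattern below are placeholders, since the real names and the hashing logic have to be lifted from the page's JavaScript:

<?php
// Hypothetical sketch: '__TOKEN' is a placeholder, not the site's real
// field name; the page's JS token logic would need to be ported to PHP.
$ch = curl_init('http://www.machinerytrader.com/list/list.aspx?ETID=1&catid=1002');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => 'cookies.txt', // keep the session between requests
    CURLOPT_COOKIEFILE     => 'cookies.txt',
));
$first = curl_exec($ch);

// 1) Extract the per-request token from the first response (placeholder pattern).
preg_match('/name="__TOKEN" value="([^"]+)"/', $first, $m);

// 2) Re-submit the form with the token to get the real listing HTML.
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    '__TOKEN' => isset($m[1]) ? $m[1] : '',
)));
$listing = curl_exec($ch);
curl_close($ch);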
Or, rather than diddling with reverse-engineering this, you could switch your scraping to NodeJS and use something like PhantomJS, since it can evaluate JavaScript while still giving you programmatic access. Given the complexity of this task, it'd be much simpler to use the right tool.
I wanted to use PHP Simple HTML DOM Parser to grab the Google Apps Status table so I can create my own dashboard that only includes the Google Mail and Google Talk service status, as well as change the presentation (HTML, CSS).
As a test, I wanted to find/output the table element, but it doesn't display any results.
$html = file_get_html('http://www.google.com/appsstatus');
$e = $html->find("table", 0);
echo $e->outertext;
However, if I find/output the div elements, it does display results.
$html = file_get_html('http://www.google.com/appsstatus');
$e = $html->find("div", 0);
echo $e->outertext;
Any help would be much appreciated.
It's way easier than that. All of this data is tied up in a JSON feed.
http://www.google.com/appsstatus/json/en
On a simple level, you could do a file_get_contents() of that, knock the bit off the front that says dashboard.jsonp, and then do a json_decode() (doc), and you will have yourself a nice array with all the information you'd ever want to know about Google's service status. Dump it with print_r() to see where everything is.
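Something along these lines (a sketch; the exact shape of the jsonp wrapper is an assumption you should verify against the feed):

<?php
// Sketch: fetch the feed, strip the "dashboard.jsonp(...)" wrapper,
// and decode the remaining JSON into an array.
$raw = file_get_contents('http://www.google.com/appsstatus/json/en');

// Remove the leading "dashboard.jsonp(" and the trailing ");".
$json = preg_replace('/^dashboard\.jsonp\(|\);?\s*$/', '', $raw);

$data = json_decode($json, true);
print_r($data); // dump it to see where everything is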
Finding these types of things is super easy with Fiddler. I highly recommend it.
I would like to create a crawler using PHP that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in PHP?
I don't know how to recursively find all the pages on a website starting from a specific page, while excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match_all() to find <a href=""> tags and parse the URL out of them (see this question for some typical approaches).
Once you've extracted the raw href attribute, you can use parse_url() to break it into components and figure out whether it's a URL you want to fetch; remember also that the URLs may be relative to the page you've fetched. A quick sketch of both steps is below.
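A sketch of the extract-and-filter step, assuming you only want links on the starting domain:

<?php
// Sketch: naive regex extraction of href values, then filtering with parse_url().
$base = 'http://www.example.com';
$html = file_get_contents($base);

preg_match_all('/<a\s[^>]*href=["\']([^"\'#]+)["\']/i', $html, $matches);

foreach ($matches[1] as $href) {
    $parts = parse_url($href);
    if (empty($parts['host'])) {
        // Relative URL: it belongs to the site we're crawling.
        $href = rtrim($base, '/') . '/' . ltrim($href, '/');
    } elseif ($parts['host'] !== 'www.example.com') {
        continue; // external link, skip it
    }
    echo $href . "\n"; // candidate for the fetch queue
}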
Though fast, a regex isn't the best way of parsing HTML. You could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
if ( $anchors->length > 0 ) {
foreach ( $anchors as $anchor ) {
if ( $anchor->hasAttribute('href') ) {
$url = $anchor->getAttribute('href');
//now figure out whether to process this
//URL and add it to a list of URLs to be fetched
}
}
}
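Putting the pieces together, here is a minimal sketch of the recursive crawl itself (same-domain only, with a visited list to avoid loops; the relative-URL resolution is deliberately crude):

<?php
// Sketch: breadth-first crawl of a single domain using the DOM approach above.
function crawl($startUrl) {
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = array($startUrl);
    $visited = array();

    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $content = @file_get_contents($url);
        if ($content === false) {
            continue;
        }

        $dom = new DOMDocument();
        @$dom->loadHTML($content);
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            if (!$anchor->hasAttribute('href')) {
                continue;
            }
            $href = $anchor->getAttribute('href');
            if (parse_url($href, PHP_URL_HOST) === null) {
                // Relative URL: resolve against the site root (crude).
                $href = rtrim($startUrl, '/') . '/' . ltrim($href, '/');
            }
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($visited[$href])) {
                $queue[] = $href; // same-domain page we haven't seen yet
            }
        }
    }
    return array_keys($visited);
}

print_r(crawl('http://www.example.com'));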
Finally, rather than write it yourself, see also this question for other resources you could use.
Is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is an HTML file and the input (which site to crawl) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun, I also had it collect information on page and viewstate sizes. It will also list all non-HTML files and external URLs, just in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an Html Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off of the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
// Requires: using System.Net; using System.IO;
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";

    // Dispose of the response, stream, and reader when done.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.