DOM structure, get element by attribute name/value - php

I see a lot of answers on SO that pertain to this question, but either there are slight differences I couldn't overcome or I just couldn't reproduce the processes shown.
What I am trying to accomplish is to use cURL to get the HTML of a Google+ business page, iterate over the HTML, and for each review of the business scrape the review's HTML for display on that business's non-Google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus I am trying to do a foreach loop that finds and grabs the div and inner HTML of each div with the attribute guidedhelpid="userreviews".
I am successfully getting the HTML back via cURL and can parse it when targeting a standard tag name like "a", or when an element has an ID, but iterating over the HTML with the default PHP parser when looking for an attribute name is problematic.
How can I take the working code below and make it behave as intended in the second snippet, which of course is wrong as written?
WORKING CODE (finds, gets, and echoes all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output); // @ suppresses warnings from malformed HTML
foreach ($DOM->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->getAttribute('href');
    echo "<br />";
}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach ($DOM->getElementsByTagName('div[guidedhelpid=userreviews]') as $review) {
    echo $review;
    echo "<br />";
}
Any help in correcting this would be appreciated. I would prefer not to use simple_html_dom if I can accomplish this without it.

I suggest you use DOMXPath in this case. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[@guidedhelpid="userreviews"]');
if ($review->length > 0) { // if it exists
    echo $review->item(0)->nodeValue;
    // echoes:
    // John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}
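Since the goal in the question is to loop over every review rather than only the first match, a minimal follow-up sketch (reusing $dom and $xpath from the snippet above; the inner-HTML handling is my own assumption, not part of the original answer) could look like this:
foreach ($xpath->query('//div[@guidedhelpid="userreviews"]') as $div) {
    // Rebuild the inner HTML of each matched review div
    $innerHtml = '';
    foreach ($div->childNodes as $child) {
        $innerHtml .= $dom->saveHTML($child);
    }
    echo $innerHtml;
    echo "<br />";
}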

Related

Parsing RSS news not working

I would like to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I have tried using this code (but it's not working):
<?php
$xml = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x = $xmlDoc->getElementsByTagName('item');
for ($i = 0; $i <= 5; $i++) {
    $title = $x->item($i)->getElementsByTagName('title')
        ->item(0)->childNodes->item(0)->nodeValue;
    $link = $x->item($i)->getElementsByTagName('link')
        ->item(0)->childNodes->item(0)->nodeValue;
    echo $title;
    echo $link;
}
?>
However, the same code works to get RSS titles and links from other RSS pages, for example:
<?php
$xml = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US";
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x = $xmlDoc->getElementsByTagName('item');
for ($i = 0; $i <= 5; $i++) {
    $title = $x->item($i)->getElementsByTagName('title')
        ->item(0)->childNodes->item(0)->nodeValue;
    $link = $x->item($i)->getElementsByTagName('link')
        ->item(0)->childNodes->item(0)->nodeValue;
    echo $title;
    echo $link;
}
?>
Do you have any idea on how to make it work?
Thanks in advance!
Downloading Remote Documents
The problem is that you are trying to download a remote document with DOMDocument::load. The method is capable of downloading remote files, but it doesn't set the User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests if the User-Agent header is absent, and the URL you pasted into the question returns 403 Forbidden when the header is missing.
So you should either set the user agent via the INI setting:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with User-Agent header set, e.g.:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
Traversing the DOM
The next problem with your code is that you are fully relying on specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
There are many possible cases where your code will not work as expected: fewer than 5 items, missing elements, an empty document, etc. Besides, the code is not very readable. You should always check that a node exists before going deeper into its structure, e.g.:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
    // Print channel properties
    foreach ($channel->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        switch ($child->nodeName) {
            case 'title':
                echo "Title: ", $child->nodeValue, PHP_EOL;
                break;
            case 'description':
                echo "Description: ", $child->nodeValue, PHP_EOL;
                break;
        }
    }
}
You can parse the item elements in similar manner:
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
// ...
}
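For completeness, a sketch of what the body of that item loop might contain, applying the same check-before-descending rule to the title and link elements (the exact output format here is only an illustration):
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
    $titleNode = $item->getElementsByTagName('title')->item(0);
    $linkNode = $item->getElementsByTagName('link')->item(0);
    if ($titleNode !== null) {
        echo "Title: ", $titleNode->nodeValue, PHP_EOL;
    }
    if ($linkNode !== null) {
        echo "Link: ", $linkNode->nodeValue, PHP_EOL;
    }
}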
They have security in place when no user agent is set, so you'll have to use cURL and fake a user agent to get the XML content, e.g.:
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);
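From there, $xml can be handed to DOMDocument as in the previous answer; a rough sketch (assuming the feed downloaded successfully) might be:
$doc = new DOMDocument();
$doc->loadXML($xml);
foreach ($doc->getElementsByTagName('item') as $item) {
    $title = $item->getElementsByTagName('title')->item(0);
    $link = $item->getElementsByTagName('link')->item(0);
    if ($title !== null && $link !== null) {
        echo $title->nodeValue, ' - ', $link->nodeValue, PHP_EOL;
    }
}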

PHP cURL simple_html_dom request to get snow data from snowbird.com

I'm using PHP, cURL, and simple_html_dom to get snow data from snowbird.com. The problem is I can't seem to actually find the data I need. I am able to find the parent div and its name, but I can't find the actual snow info div. Here is my code; below it I'll paste a small part of the output.
<?php
require('simple_html_dom.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$content = curl_exec($ch);
curl_close($ch);
$html = new simple_html_dom();
$html->load($content);
$ret = $html->find('.horizSnowChartText');
$ret = serialize($ret);
$ret3 = new simple_html_dom();
$ret3->load($ret);
$es = $ret3->find('text');
$ret2 = $ret3->find('.total-inches');
print_r($ret2);
//print_r($es);
?>
And here is a picture of the output. You can see it skips the actual snow data and goes right to the inches mark ".
Do note that the html markup you're getting has multiple instances of .total-inches (multiple nodes with this class). If you want to explicitly get one, you can point to it directly using the second argument of ->find().
Example:
$ret2 = $html->find('.total-inches', 3);
// the second argument is a zero-based index, so this grabs the fourth match
If you want to check them all out, a simple foreach should suffice:
foreach($html->find('.current-conditions .snowfall-total .total-inches') as $in) {
echo $in , "\n";
}

get information from html table using Curl

I need to get some information about some plants and put it into a MySQL table.
My knowledge of cURL and DOM is pretty much nil, but I've come to this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
Now I'm really lost about how to move on from this point. Can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one time extraction and not a rolling project, personally I'd use quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "/<b>(.*?)<\/b>\s*<br>(.*?)<\/?(td|p)>/si";
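As a rough illustration (not part of the original answer), the pattern could be applied to the $data string fetched in the question with preg_match_all, collecting label/value pairs:
if (preg_match_all($pattern, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the bold label, $m[2] is the text that follows it
        echo trim($m[1]), ': ', trim($m[2]), PHP_EOL;
    }
}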
You'll need to do some pre and post cleaning before it will get them all though.
I don't envy you having this task...
Your best bet will be to wrap this in PHP ;)
Yes, this is an ugly hack for ugly HTML code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use simple_html_dom and you will be able to access any element, or any element's content, that you wish. Its API is very straightforward.
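As a rough sketch of that approach (the table index and selectors here are assumptions taken from the question's code, not tested against the live page), you could walk the rows of the table the question already located:
require 'simple_html_dom.php';
$html = str_get_html($data);      // $data is the cURL result from the question
$table = $html->find('table', 8); // same table index the question used
foreach ($table->find('tr') as $tr) {
    foreach ($tr->find('td') as $td) {
        echo trim($td->plaintext), "\t";
    }
    echo PHP_EOL;
}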
You can try something like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows except the header rows
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach ($table_rows as $row => $tr) {
    foreach ($tr->childNodes as $td) {
        $data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
    }
    $data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>
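Since the original goal was to store the scraped rows in MySQL, here is a minimal, hedged sketch using PDO; the credentials, table name, and column layout are all hypothetical and would need to match your own schema:
$pdo = new PDO('mysql:host=localhost;dbname=plantsdb;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO plants (label, value) VALUES (?, ?)');
foreach ($data as $row) {
    if (count($row) >= 2) {
        // first cell as label, second as value (hypothetical mapping)
        $stmt->execute(array($row[0], $row[1]));
    }
}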

how to get a list of links in a webpage in PHP? [duplicate]

Possible Duplicate:
Parse Website for URLs
How do I get all the links in a webpage using PHP?
I need to get a list of the links, e.g. from:
<a href="http://www.google.com">Google</a>
I want to fetch the href (http://www.google.com) and the text (Google).
The situation is: I'm building a crawler and I want it to get all the links that exist in a database table.
There are a couple of ways to do this, but the way I would approach it is something like the following.
Use cURL to fetch the page, i.e.:
// $target_url has the url to be fetched, e.g.: "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
If all goes well, page content is now all in $html.
Let's move on and load the page in a DOM Object:
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
So far so good, XPath to the rescue to scrape the links out of the DOM object:
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Loop through the result and get the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;
    // Do what you want with the link, print it out:
    echo $text, ' -> ', $link;
    // Or save it in an array for later processing:
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}
$hrefs is an object of type DOMNodeList and item() returns a DOMNode object for the specified index. So basically we’ve got a loop that retrieves each link as a DOMNode object.
This should pretty much do it for you.
The only part I am not 100% sure of is what happens if the link wraps an image rather than plain text; I have no idea, so you would need to test for and filter out those cases.
Hope this gives you an idea of how to scrape links, happy coding.

Why am I not getting back any images here?

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
Output is:
array(0) { }
However, in the page source I see this:
<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?
It appears $html's contents stop at the <body> tag for this page. Any idea why?
Yes, you must provide this page with a valid user agent.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);
outputs everything to the ending </html> including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
A simple wget or curl without the user agent returns content only up to the <body> tag.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
EDIT: My first post stated that there was still an issue with XPath... I was just not doing my due diligence, and the updated code above works great. I had forgotten to force cURL to output to a string rather than print to the screen (as it does by default).
Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine in it already.
[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img');
[...snip...]
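From there, $images is a DOMNodeList, so a simple loop (my own illustration) prints each image source:
foreach ($images as $img) {
    echo $img->getAttribute('src'), PHP_EOL;
}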
The IMG tag is generated by javascript.
If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.
Update #1
I believe it is because of the user agent string.
If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.
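If you want to stay with file_get_contents as in the question, a sketch that sends a User-Agent via a stream context (my own suggestion, not from the original answer) would be:
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0\r\n",
    ),
));
$html = file_get_contents('http://www.w3schools.com/js/js_loop_for.asp', false, $context);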
