Why am I not getting back any images here? - php

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
Output is:
array(0) { }
However, in the page source I see this:
<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?

It appears $html's contents stop at the <body> tag for this page. Any idea why?
Yes, you must provide this page with a valid user agent.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);
This outputs everything up to the closing </html>, including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />, whereas a simple wget or curl without the user agent returns only up to the <body> tag.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
EDIT: My first post stated that there was still an issue with XPath... I was just not doing my due diligence, and the updated code above works great. I had forgotten to force cURL to return a string rather than print to the screen (as it does by default).
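If you'd rather stick with file_get_contents instead of switching to cURL, a stream context can send the user agent too. A minimal sketch, assuming the same MozillaXYZ/1.0 placeholder string:
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
// Send a User-Agent header; without one this server truncates the page at <body>.
$context = stream_context_create(array(
    'http' => array('user_agent' => 'MozillaXYZ/1.0'),
));
$html = file_get_contents($url, false, $context);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
var_dump($xml->xpath('//img'));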

Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine built in.
[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img');
[...snip...]
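For completeness, a minimal sketch of the whole pipeline using only DOM and DOMXPath, reusing the cURL fetch from the answer above:
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
@$doc->loadHTML($html);
// DOMXPath::query returns a DOMNodeList of matching element nodes.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $img) {
    echo $img->getAttribute('src'), "\n";
}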

The IMG tag is generated by JavaScript.
If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.
Update #1
I believe it is because of the user-agent string.
If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user-agent, I get the page in whole.

Related

XPath: No elements found in grabbed website and html seems incomplete

Using XPath in a PHP (v8.1) environment, I am trying to fetch all IMG tags from a dummy website:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.someurl.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($response);
$xpath = new DOMXPath($doc);
$images = $xpath->evaluate("//img");
echo serialize($images); //gives me: O:11:"DOMNodeList":0:{}
echo $doc->saveHTML(); //outputs entire website in series with wiped <HTML>,<HEAD>,<BODY> tags
I don't understand why I don't get any results for whatever tags I am trying to address with XPath (in this case all img tags, but I've tried a bunch of variations!).
The second issue I am having is that, when looking at the output of the second echo instruction (which outputs the entire grabbed HTML), I realize that the HTML page is not complete. What I am getting is everything except the <HTML></HTML>, <HEAD></HEAD> and <BODY></BODY> tags (but the actual contents still exist!), as if everything was appended in series. Is it supposed to be this way?

php curl to get img src value on a page

I want to get the img src values on a page. If it is
https://www.google.com
then the result will be like:
https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
https://www.google.com/ff.png
https://www.google.com/idk.jpg
I want something like this!
Thanks
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "https://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <img> tags
foreach($dom->getElementsByTagName('img') as $link) {
# Show the <img src>
echo $link->getAttribute('src');
echo "<br />";
}
?>
Here it is
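One thing to watch: getAttribute('src') returns whatever is in the markup, which is often a relative path (e.g. /images/logo.png) rather than the full URL the question asks for. A rough sketch of prepending the base URL; the prefix checks are deliberately simplistic:
$base = 'https://www.google.com';
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if (strpos($src, '//') === 0) {
        // Protocol-relative URL: just add a scheme.
        $src = 'https:' . $src;
    } elseif (strpos($src, 'http') !== 0) {
        // Root-relative path: prepend the base URL.
        $src = $base . $src;
    }
    echo $src . "<br />";
}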

DOM structure, get element by attribute name/value

I see a lot of answers on SO that pertain to the question, but either there are slight differences that I couldn't overcome, or maybe I just couldn't repeat the processes shown.
What I am trying to accomplish is to use cURL to get the HTML from a Google+ business page, iterate over the HTML, and for each review of the business scrape the review's HTML for display on that business's non-Google+ webpage.
Every review shares this parent div structure:
<div class="ZWa nAa" guidedhelpid="userreviews"> .....
Thus I am trying to do a foreach loop based on finding and grabbing the div and innerHTML for each div with the attribute guidedhelpid="userreviews".
I am successfully getting the HTML back via cURL and can parse it when targeting a standard tag name like "a", or when it has an ID, but iterating over the HTML using the PHP default parser when looking for an attribute name is problematic.
How can I take the successful code below and make it work as intended, as shown in the second code block, which of course is wrong?
WORKING CODE (finds, gets, and echoes all "a" tags in $output)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";}
THEORETICALLY NEEDED CODE: (Find every review by custom attribute in HTML and echo them)
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$DOM = new DOMDocument;
@$DOM->loadHTML($output);
foreach($DOM->getElementsByTagName('div[guidehelpid=userreviews]') as $review) {
echo $review;
echo "<br />"; }
Any help in correcting this would be appreciated. I would prefer not to use "simple_html_dom" if I can accomplish this without it.
I suggest you use DOMXPath in this case too. Example:
$url = "https://plus.google.com/+Mcgowansac/about";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$output = curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($output);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$review = $xpath->query('//div[@guidedhelpid="userreviews"]');
if($review->length > 0) { // if it exists
echo $review->item(0)->nodeValue;
// echoes
// John DeRemer reviewed 3 months ago Last fall, we had a major issue with mold which required major ... and so on
}
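Since the question wanted a foreach over every review, note that DOMXPath::query returns a DOMNodeList you can iterate directly instead of only reading item(0). A small sketch along those lines:
foreach ($xpath->query('//div[@guidedhelpid="userreviews"]') as $review) {
    // nodeValue gives plain text; saveHTML($review) serializes the div's markup.
    echo $dom->saveHTML($review);
    echo "<br />";
}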

Load xml from another page

I'm trying to load this page
https://developers.facebook.com/blog/feed
into my site, with no luck. I'm using this code:
<?php
$xml = simplexml_load_file('https://developers.facebook.com/blog/feed/');
print_r($xml);
?>
but I get many lines of errors like this:
Warning: simplexml_load_file() [function.simplexml-load-file]:
https://developers.facebook.com/blog/feed/:10: parser error :
xmlParseEntityRef: no name in /fb_feed/fb_feed.php on line 2
Thanks to all who can help me.
I think this is a problem with the XML feed itself.
See this article.
Load the string with file_get_contents, and do a str_replace on the ampersand, converting & to &amp;.
So leaving you with:
$xml = simplexml_load_string(str_replace('&', '&amp;', file_get_contents('https://developers.facebook.com/blog/feed/')));
EDIT:
Just seen in the comments that this has been tackled before, and the str_replace can be improved from my original to:
$xml = simplexml_load_string(str_replace(array("&amp;", "&"), array("&", "&amp;"), file_get_contents('https://developers.facebook.com/blog/feed/')));
This avoids converting already correctly encoded ampersands.
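The trick is that str_replace with arrays applies the replacements from left to right: first every &amp; is decoded back to a bare &, then every bare & (including the ones just produced) is re-encoded, so nothing ends up double-encoded. A quick demonstration:
$in  = 'Fish &amp; Chips & Mushy Peas';
$out = str_replace(array('&amp;', '&'), array('&', '&amp;'), $in);
echo $out; // Fish &amp; Chips &amp; Mushy Peas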
EDIT 2 :
Facebook redirects requests from file_get_contents to a browser select page. So we need to 'trick' it into thinking we're using a regular browser.
$url = 'https://developers.facebook.com/blog/feed/';
$crl = curl_init();
$timeout = 5;
curl_setopt($crl, CURLOPT_URL, $url);
curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($crl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$ret = curl_exec($crl);
curl_close($crl);
$xml = simplexml_load_string(str_replace(array("&amp;", "&"), array("&", "&amp;"), $ret));
var_dump($xml);
The first answer should work in most cases, but edit 2 is for the Facebook Dev blog, or any other that redirects based on the user-agent header.
It could be the case that you need to encode the URL, as this page suggests:
simplexml_load_file(rawurlencode('https://developers.facebook.com/blog/feed/'))
If that doesn't work, you can try to load the file with file_get_contents and pass the return value to the XML parser:
simplexml_load_string( file_get_contents('https://developers.facebook.com/blog/feed/') );
<?php
$url = "https://developers.facebook.com/blog/feed/";
$xml = str_replace('&', '&amp;', file_get_contents($url));
$xml = simplexml_load_string($xml);
print_r($xml);
?>

get information from html table using Curl

I need to get some information about some plants and put it into a MySQL table.
My knowledge of cURL and DOM is pretty much nil, but I've come to this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
Now I'm really lost about how to move on from this point. Can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one-time extraction and not a rolling project, personally I'd use a quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "/<b>(.*?)<\/b>\s*<br>(.*?)<\/?(td|p)>/si";
You'll need to do some pre and post cleaning before it will get them all though.
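To actually apply that pattern, you'd run it through preg_match_all and walk the capture groups; a rough sketch, with trim and strip_tags standing in for the cleaning step:
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $title = trim(strip_tags($m[1])); // the <b>...</b> label
    $value = trim(strip_tags($m[2])); // the text after the <br>
    echo "$title => $value\n";
}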
I don't envy you having this task...
Your best bet will be to wrap this in PHP ;)
Yes, this is an ugly hack for ugly HTML code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use Simple HTML DOM and you will be able to access any element and any element's content you wish. Its API is very straightforward.
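For example, sticking with the simple_html_dom include and the table index 8 from the question's own code, the rows and cells could be walked like this (a rough sketch):
$html = str_get_html($data);
$table = $html->find('table', 8); // the ninth table on the page, as in the question
foreach ($table->find('tr') as $row) {
    foreach ($row->find('td') as $cell) {
        echo trim($cell->plaintext), "\t";
    }
    echo "\n";
}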
You can try something like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows and rows which are not headers
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach($table_rows as $row => $tr) {
foreach($tr->childNodes as $td) {
$data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
}
$data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>
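Since the question also asks about getting the data into a MySQL table, here is a hedged sketch of inserting the scraped $data rows with PDO; the credentials, table name, and columns are all assumptions, not something defined by the page:
// Hypothetical schema: plants(name, detail)
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO plants (name, detail) VALUES (?, ?)');
foreach ($data as $row) {
    if (count($row) >= 2) {
        $stmt->execute(array($row[0], $row[1]));
    }
}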
