simple xpath query not working - php

This snippet of code is not working. It produces:
Notice: Trying to get property of non-object in test.php on line 13
The XPath query looks correct, and the page at the URL clearly contains a matching <p> tag.
I even tried replacing the query with '//html', but no luck.
I use XPath all the time, and this is strange behaviour.
<?php
$_url = 'http://www.portaleaste.com/it/Aste/Detail/876989';
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $_url);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
$result2 = curl_exec($ch2);
curl_close($ch2);
$doc2 = new DOMDocument();
@$doc2->load($result2);
$xpath2 = new DOMXpath($doc2);
$txt = $xpath2->query('//p[@id="descrizione"]')->item(0)->nodeValue;
echo $txt;
?>

There is nothing wrong with your XPath query: the syntax is correct and the node does exist. The problematic line is this:
@$doc2->load($result2);
// DOMDocument::load — Load XML from a file
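You are not loading the response from your cURL request properly. load() expects a file path, but $result2 holds the HTML itself as a string, so the document stays empty and the query matches nothing.
To load the response, use this instead:
@$doc2->loadHTML($result2);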
// DOMDocument::loadHTML — Load HTML from a string
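A minimal sketch of the same fix, using an inline HTML string in place of the cURL response so it runs without network access. The helper name firstNodeText and the sample markup are made up for illustration. It also checks that the query matched something before reading nodeValue, which is what avoids the "non-object" notice.

```php
<?php
// Hypothetical helper: text of the first node matching $query, or null if none.
function firstNodeText(string $html, string $query): ?string
{
    if ($html === '') {
        return null; // nothing to parse, e.g. the HTTP request failed
    }

    $doc = new DOMDocument();
    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html); // loadHTML() parses a string; load() expects a file path
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or no match
    }
    return $nodes->item(0)->nodeValue;
}

// Made-up sample page standing in for $result2.
$sample = '<html><body><p id="descrizione">Lotto unico</p></body></html>';
echo firstNodeText($sample, '//p[@id="descrizione"]'), "\n";
```

With the real page you would pass $result2 instead of $sample and handle the null case before echoing.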

Related

PHP - simplexml_load_file() - I/O warning : failed to load external entity [duplicate]

I'm trying to create a small application that will simply read an RSS feed and then lay out the info on the page.
All the instructions I find make this seem simple, but for some reason it just isn't working. I have the following:
include_once(ABSPATH.WPINC.'/rss.php');
$feed = file_get_contents('http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int');
$items = simplexml_load_file($feed);
That's it. It then breaks on the third line with the following error:
Error: [2] simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> <?xm
The rest of the XML file is shown.
I have turned on allow_url_fopen and allow_url_include in my settings, but still nothing.
I've tried multiple feeds, and they all end up with the same result.
I'm going mad here.
simplexml_load_file() interprets an XML file (either a file on your disk or a URL) into an object. What you have in $feed is a string.
You have two options:
Use file_get_contents() to get the XML feed as a string, and use simplexml_load_string():
$feed = file_get_contents('...');
$items = simplexml_load_string($feed);
Load the XML feed directly using simplexml_load_file():
$items = simplexml_load_file('...');
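A short sketch of the first option with an error check added, assuming a made-up inline feed in place of the downloaded string. simplexml_load_string() returns false when the string is not well-formed XML, so the result is tested before use.

```php
<?php
// Made-up sample feed standing in for the string returned by file_get_contents().
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
  </channel>
</rss>
XML;

// Keep parse errors in libxml's buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$items = simplexml_load_string($feed);
if ($items === false) {
    // Report every parse error libxml collected.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    libxml_clear_errors();
} else {
    echo $items->channel->title, "\n";
}
```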
You can also load the content with cURL, if file_get_contents isn't enabled on your server.
Example:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,"http://feeds.bbci.co.uk/sport/0/football/rss.xml?edition=int");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
$output = curl_exec($ch);
curl_close($ch);
$items = simplexml_load_string($output);
This also works:
$url = "http://www.some-url";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xmlresponse = curl_exec($ch);
$xml=simplexml_load_string($xmlresponse);
Then I just run a for loop to grab the values from the nodes, like this:
$html = ''; // start with an empty string so .= has something to append to
for ($i = 0; $i < 20; $i++) {
    $title = $xml->channel->item[$i]->title;
    $link = $xml->channel->item[$i]->link;
    $desc = $xml->channel->item[$i]->description;
    $html .= "<div><h3>$title</h3>$link<br />$desc</div><hr>";
}
echo $html;
Note that your node names will differ, your HTML might be structured differently, and your loop bound may need to be higher or lower; a feed with fewer than 20 items will trigger notices with the bound shown here.
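As a hedged alternative to the fixed-count loop above, foreach visits only the items that actually exist. The two-item feed below is made up for illustration, and escaping the values with htmlspecialchars() is my own assumption about how the text should be shown; skip it if your feed's descriptions are meant to be rendered as HTML.

```php
<?php
// Made-up two-item feed standing in for the cURL response.
$xmlresponse = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <item><title>First story</title><link>http://example.com/1</link><description>One</description></item>
    <item><title>Second story</title><link>http://example.com/2</link><description>Two</description></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($xmlresponse);

$html = '';
// foreach iterates over exactly the <item> elements present, however many there are.
foreach ($xml->channel->item as $item) {
    $title = htmlspecialchars((string) $item->title);
    $link = htmlspecialchars((string) $item->link);
    $desc = htmlspecialchars((string) $item->description);
    $html .= "<div><h3>$title</h3>$link<br />$desc</div><hr>";
}
echo $html, "\n";
```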
$url = 'http://legis.senado.leg.br/dadosabertos/materia/tramitando';
// simplexml_load_file() accepts a URL directly, so no separate file_get_contents() call is needed.
$xml = simplexml_load_file($url);

PHP output keeps saying 'DOMDocument::loadHTML(): Empty string supplied as input in'

I have this code that retrieves every link in $curl_scraped_page:
require_once ('simple_html_dom.php');
$des_array = array();
$url = 'http://citeseerx.ist.psu.edu/search?q=mean&t=doc&sort=rlv';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
Then I want to get the abstract for each link I scraped (from the page that link points to). I also get other things like the title, description and so on, but the problem lies only with the abstract:
foreach ($html->find('div.result h3 a') as $des) {
$des2 = 'http://citeseerx.ist.psu.edu' . $des->href;
$ch = curl_init($des2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page2 = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page2);//line 72
libxml_use_internal_errors(false);
$xpath2 = new DomXPath($dom);
$thing = $xpath2->query('//p[preceding::h3[preceding::div]]')->item(1)->textContent; //line 75
array_push($des_array, $thing);
}
curl_close ($ch);
This is the display code:
for ($i = 0; $i < 10; $i++) {
echo $des_array[$i];
}
When I checked it on my browser, it gave me this, thrice:
Warning: DOMDocument::loadHTML(): Empty string supplied as input in C:\xampp\htdocs\MSP\Citeseerx.php on line 72
Notice: Trying to get property of non-object in C:\xampp\htdocs\MSP\Citeseerx.php on line 75
I realised I pushed an empty string to the $des_array. So I tried this:
if (empty($thing)){
array_push($des_array,'');
}
else{
array_push($des_array, $thing);
}
And this: if ($thing!=''){..}.
It still gave me that error.
What should I do?
Thanks..
curl_exec() may return false. In that case, check curl_error() to see what went wrong. For example, if the href attribute does not begin with /, you will pass an invalid URL to curl_init(). You can also use curl_getinfo() to get more information about the server response.
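A rough sketch of that check, wrapped in a helper. The name fetchPage and its by-reference $error parameter are hypothetical, and the demo URL uses a deliberately unsupported scheme so the failure path runs without touching the network.

```php
<?php
// Hypothetical helper: returns the response body, or null with $error filled in.
function fetchPage(string $url, ?string &$error = null): ?string
{
    $error = null;
    $ch = curl_init($url);
    if ($ch === false) {
        $error = 'curl_init() failed';
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // do not hang on dead hosts

    $body = curl_exec($ch);
    if ($body === false) {
        // Transfer failed: ask cURL why.
        $error = curl_error($ch);
    } elseif ($body === '') {
        // Transfer succeeded but there is nothing to parse.
        $error = 'empty body, HTTP status ' . curl_getinfo($ch, CURLINFO_HTTP_CODE);
    }
    return $error === null ? $body : null;
}

// Deliberately invalid scheme, so this fails immediately and offline.
$page = fetchPage('notaprotocol://example.invalid/', $error);
if ($page === null) {
    echo "Skipping page: $error\n";
}
```

Inside the foreach loop from the question, the same null check would let you skip a link (for example with continue) before calling loadHtml(), so the empty-string warning never fires.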
Actually the $curl_scraped_page should be an handle for an open file not a variable since you are returning the transfer as a. Binary it should be read to file you can't pass to a varible since it is not a string

Dealing with "PHP HTML DOM Parser", including the same library into two different files

I have two PHP files (in the same folder) that use the library simple_html_dom.php.
The first one, caridefine.php, has this:
include('simple_html_dom.php');
$url = 'http://www.statistics.com/index.php?page=glossary&term_id=209';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page);
$xpath = new DomXPath($dom);
print $xpath->evaluate('string(//p[preceding::b]/text())');
The second one, caridefine2.php has this:
include('simple_html_dom.php');
$url = 'http://www.statsoft.com//textbook/statistics-glossary/z/?button=0#Z Distribution (Standard Normal)';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
foreach ($html->find('/p/a [size="4"]') as $font) {
$link = $font->parent();
$paragraph = $link->parent();
$text = str_replace($link->plaintext, '', $paragraph->plaintext);
echo $text;
}
Separately, each of them worked fine: I ran caridefine.php and it worked well, and so did caridefine2.php.
But when I tried to load these two files in another PHP file:
<div class="examples">
<?php
$this->load->view('definer/caridefine.php');
?>
</div>
<div class="examples">
<?php
$this->load->view('definer/caridefine2.php');
?>
</div>
Neither of them worked. I just got a blank page, and when I pressed CTRL+U the source showed: Cannot redeclare file_get_html() (previously declared in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php:70) in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php on line 85
I googled the problem and found that "if you load many objects without clearing the previous ones, it can be a problem."
I've tried calling $html->clear() and unset($dom). It made no difference.
What is causing this?
Thanks..
I have analysed my own problem.
Here is the correction:
Change include('simple_html_dom.php'); in each file into: require_once('simple_html_dom.php');
What happened was that simple_html_dom.php was included twice, once by each view, so its functions were declared a second time and PHP stopped with the redeclare error.
That should do it.
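A small self-contained sketch of why require_once helps. It writes a throwaway library file to the temp directory (the function name demo_helper is made up) and pulls it in twice, the way the two views each pull in simple_html_dom.php.

```php
<?php
// Create a throwaway "library" file that declares one function.
$lib = tempnam(sys_get_temp_dir(), 'lib');
file_put_contents($lib, '<?php function demo_helper() { return "loaded"; }');

require_once $lib; // first call: the file is executed and demo_helper() is declared
require_once $lib; // second call: PHP sees the file was already included and skips it

echo demo_helper(), "\n";

unlink($lib); // clean up the temporary file
```

Replacing either require_once with a plain include would run the file a second time and stop with the same kind of "Cannot redeclare" fatal error as in the question.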

PHP Not parsing rss using cURL properly

I just want to get the name of the 'channel' tag, i.e. CHANNEL. The script works fine when I use it to parse the RSS from Google, but when I use it for some other provider it outputs '#text' instead of 'channel', which is the intended output. The following is my script; please help me out.
$url = 'http://ibnlive.in.com/ibnrss/rss/sports/cricket.xml';
$get = perform_curl($url);
$xml = new DOMDocument();
$xml -> loadXML($get['remote_content']);
$fetch = $xml -> documentElement;
$gettitle = $fetch -> firstChild -> nodeName;
echo $gettitle;
function perform_curl($rss_feed_provider_url){
$url = $rss_feed_provider_url;
$curl_handle = curl_init();
// Do we have a cURL session?
if ($curl_handle) {
// Set the required CURL options that we need.
// Set the URL option.
curl_setopt($curl_handle, CURLOPT_URL, $url);
// Set the HEADER option. We don't want the HTTP headers in the output.
curl_setopt($curl_handle, CURLOPT_HEADER, false);
// Set the FOLLOWLOCATION option. We will follow if location header is present.
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
// Instead of using WRITEFUNCTION callbacks, we are going to receive the remote contents as a return value for the curl_exec function.
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
// Try to fetch the remote URL contents.
// This function will block until the contents are received.
$remote_contents = curl_exec($curl_handle);
// Do the cleanup of CURL.
curl_close($curl_handle);
$remote_contents = utf8_encode($remote_contents);
$handle = @simplexml_load_string($remote_contents);
$return_result = array();
if(is_object($handle)){
$return_result['handle'] = true;
$return_result['remote_content'] = $remote_contents;
return $return_result;
}
else{
$return_result['handle'] = false;
$return_result['content_error'] = 'INVALID RSS SOURCE, PLEASE CHECK IF THE SOURCE IS A VALID XML DOCUMENT.';
return $return_result;
}
} // End of if ($curl_handle)
else{
$return_result['curl_error'] = 'CURL INITIALIZATION FAILED.';
return false;
}
}
It gives '#text' instead of 'channel' because $fetch->firstChild->nodeType is 3, which is a text node (XML_TEXT_NODE), i.e. just some text. You could select channel with:
echo $fetch->getElementsByTagName('channel')->item(0)->nodeName;
and
$gettitle = $fetch -> firstChild -> nodeValue;
var_dump($gettitle);
gives you
string(5) "
"
that is, the spaces and newline that sit between the XML tags because of formatting.
PS: the RSS feed at your link also fails validation at http://validator.w3.org/feed/
Take a look at the XML: it has been pretty-printed with whitespace, so it is being parsed correctly. The first child of the root node is a text node. I'd suggest using SimpleXML if you want an easier time of it, or using XPath queries on your DomDocument to obtain the tags of interest.
Here's how you'd use SimpleXML
$xml = new SimpleXMLElement($get['remote_content']);
print $xml->channel[0]->title;
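For the XPath route, a minimal sketch using a made-up, pretty-printed feed in place of $get['remote_content']. It shows the whitespace text node first, then selects the first element child with an expression that ignores text nodes.

```php
<?php
// Made-up pretty-printed feed; the newline and indent after <rss> become a text node.
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($feed);
$root = $doc->documentElement;

// The first child is the whitespace between <rss> and <channel>.
echo $root->firstChild->nodeName, "\n";   // #text

// '*' matches element nodes only, so whitespace text nodes are skipped.
$xpath = new DOMXPath($doc);
$first = $xpath->query('*[1]', $root)->item(0);
echo $first->nodeName, "\n";              // channel
```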

php proDOM parsing error

I am using the following code to parse a DOM document, but at the end I get the error
"google.ac" is null or not an object
line 402
char 1
My guess is that line 402 contains a tag and a lot of ";".
How can I fix this?
<?php
//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
//@$dom->saveHTMLFile('newfolder/abc.html')
$dom->loadHTML('$data');
// find all ul
$list = $dom->getElementsByTagName('ul');
// get few list items
$rows = $list->item(30)->getElementsByTagName('li');
// get anchors from the table
$links = $list->item(30)->getElementsByTagName('a');
foreach ($links as $link) {
echo "<fieldset>";
$links = $link->getElementsByAttribute('imgurl');
$dom->saveXML($links);
}
?>
There are a few issues with the code:
You should add the cURL option CURLOPT_RETURNTRANSFER in order to capture the output: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. By default the output is sent straight to the browser, so in the code above $data will always be TRUE or FALSE (http://www.php.net/manual/en/function.curl-exec.php).
$dom->loadHTML('$data'); is not correct and not required: inside single quotes, '$data' is the literal text $data, not the variable's contents.
The method of reading the 'li' and 'a' tags might not be correct, because $list->item(30) always points to the element at index 30 (the 31st <ul>), which may not exist.
Anyway, coming to the fixes. I'm not sure if you checked the HTML returned by the cURL request, but it seems different from what we discussed in the original post. In other words, the HTML returned by cURL does not contain the required <ul> and <li> elements. It instead contains <td> and <a> elements.
Add-on: I'm not sure why the HTML for the same page differs between the browser and PHP, but here is an explanation that I think fits. The page uses JavaScript that renders some of the HTML dynamically on page load. This dynamic HTML can be seen in the browser but not from PHP, so I assume the <ul> and <li> tags are generated dynamically. Anyway, that isn't our concern for now.
Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:
<?php
$ch = curl_init(); // create a new cURL resource
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($ch); // fetch the page; with RETURNTRANSFER set, the HTML comes back as a string
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // avoid warnings
$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
$href = $itemA->getAttribute('href'); // read the value of 'href'
if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' should begin with "/imgres?"
$qryString = substr($href, strpos($href, '?') + 1);
parse_str($qryString, $arrHref); // read the query parameters from 'href' URI
echo '<br>' . $arrHref['imgurl'] . '<br>';
}
}
}
I hope the above makes sense. But please note that this parsing might fail if Google modifies their HTML.
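The query-string step can be tried on its own, without fetching anything. This is only a sketch: the helper name imgUrlFromHref and the sample href are made up, and it assumes the links still start with "/imgres?" as in the snippet above.

```php
<?php
// Hypothetical helper: pull the 'imgurl' parameter out of an "/imgres?..." href.
function imgUrlFromHref(string $href): ?string
{
    $prefix = '/imgres?';
    if (strpos($href, $prefix) !== 0) {
        return null; // not an image-result link
    }
    // Everything after the '?' is the query string.
    $query = substr($href, strlen($prefix));
    parse_str($query, $params); // decodes the key=value pairs into $params
    return isset($params['imgurl']) ? $params['imgurl'] : null;
}

// Made-up href in the shape the loop above expects.
$href = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Fbook.jpg&h=300&w=200';
echo imgUrlFromHref($href), "\n";
```

Returning null for links without an imgurl parameter avoids the undefined-index notice that a direct $arrHref['imgurl'] lookup can raise.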
