PHP not parsing RSS using cURL properly

I just want to get the name of the 'channel' tag, i.e. CHANNEL. The script works fine when I use it to parse the RSS from Google, but when I use it for some other provider it outputs '#text' instead of 'channel', which is the intended output. The following is my script; please help me out.
$url = 'http://ibnlive.in.com/ibnrss/rss/sports/cricket.xml';
$get = perform_curl($url);
$xml = new DOMDocument();
$xml->loadXML($get['remote_content']);
$fetch = $xml->documentElement;
$gettitle = $fetch->firstChild->nodeName;
echo $gettitle;
function perform_curl($rss_feed_provider_url){
    $url = $rss_feed_provider_url;
    $curl_handle = curl_init();
    // Do we have a cURL session?
    if ($curl_handle) {
        // Set the required cURL options that we need.
        // Set the URL option.
        curl_setopt($curl_handle, CURLOPT_URL, $url);
        // Set the HEADER option. We don't want the HTTP headers in the output.
        curl_setopt($curl_handle, CURLOPT_HEADER, false);
        // Set the FOLLOWLOCATION option. We will follow if a Location header is present.
        curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
        // Instead of using WRITEFUNCTION callbacks, receive the remote contents
        // as the return value of curl_exec.
        curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
        // Try to fetch the remote URL contents.
        // This call blocks until the contents are received.
        $remote_contents = curl_exec($curl_handle);
        // Clean up cURL.
        curl_close($curl_handle);
        $remote_contents = utf8_encode($remote_contents);
        $handle = @simplexml_load_string($remote_contents);
        $return_result = array();
        if (is_object($handle)) {
            $return_result['handle'] = true;
            $return_result['remote_content'] = $remote_contents;
            return $return_result;
        } else {
            $return_result['handle'] = false;
            $return_result['content_error'] = 'INVALID RSS SOURCE, PLEASE CHECK IF THE SOURCE IS A VALID XML DOCUMENT.';
            return $return_result;
        }
    } // End of if ($curl_handle)
    else {
        $return_result['curl_error'] = 'CURL INITIALIZATION FAILED.';
        return false;
    }
}

It gives the output '#text' instead of 'channel' because $fetch->firstChild->nodeType is 3, which is a TEXT_NODE, i.e. just some text. You could select channel by
echo $fetch->getElementsByTagName('channel')->item(0)->nodeName;
and
$gettitle = $fetch -> firstChild -> nodeValue;
var_dump($gettitle);
gives you
string(5) "
"
i.e. spaces and a newline symbol, which appear between the XML tags due to formatting.
P.S. The RSS feed at your link fails validation at http://validator.w3.org/feed/.
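A related tweak, if you want firstChild to land on the <channel> element itself: a minimal sketch (reusing $get from the question) that tells DOMDocument to discard the purely cosmetic whitespace before loading:
$xml = new DOMDocument();
// Drop ignorable whitespace so firstChild is the <channel> element,
// not the text node holding the newline between tags.
$xml->preserveWhiteSpace = false;
$xml->loadXML($get['remote_content']);
echo $xml->documentElement->firstChild->nodeName; // channel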

Take a look at the XML - it's been pretty-printed with whitespace, so it is being parsed correctly. The first child of the root node is a text node. I'd suggest using SimpleXML if you want an easier time of it, or use XPath queries on your DOMDocument to obtain the tags of interest (see the sketch below).
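For the XPath route, a minimal sketch (again reusing $get from the question, and assuming the feed's root element is <rss>) might look like this:
$doc = new DOMDocument();
$doc->loadXML($get['remote_content']);
$xpath = new DOMXPath($doc);
// Select the <channel> element directly, ignoring whitespace text nodes
$channel = $xpath->query('/rss/channel')->item(0);
if ($channel !== null) {
    echo $channel->nodeName; // channel
}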
Here's how you'd use SimpleXML
$xml = new SimpleXMLElement($get['remote_content']);
print $xml->channel[0]->title;

Related

Getting whole HTML element with PHP

I want to get the whole <article> element, which represents one listing containing the image, title, its link, and description, but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode('<article>', $content);
$second_step = explode('</article>', $first_step[3]);
echo $second_step[0];
?>
You should definitely be using cURL for this type of request.
function curl_download($url){
    // Is cURL installed?
    if (!function_exists('curl_init')){
        die('cURL is not installed!');
    }
    $ch = curl_init();
    // URL to download
    curl_setopt($ch, CURLOPT_URL, $url);
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
For best results, combine it with Simple HTML DOM Parser and use it like:
// Parse the downloaded HTML string first
$html = str_get_html(curl_download('http://www.polkmugshot.com/'));
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I get you right, but I guess you need a PHP DOM parser. I suggest this one (it's a great PHP library for parsing HTML).
You can also get the whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//article");
foreach ($articles as $article) {
    // do something useful here
}
Read about SimpleXML here.
Extract the articles with DOMDocument. Working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd = new DOMDocument();
@$domd->loadHTML($content); // suppress warnings from malformed markup
foreach($domd->getElementsByTagName("article") as $article){
    var_dump($domd->saveHTML($article));
}
And as pointed out by @Guns, you'd better use cURL, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini.
2: until around PHP 5.5.0, file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be many seconds after all content is sent, while cURL only reads until it has received the number of bytes given in the Content-Length HTTP header, which makes for much faster transfers (luckily this was fixed).
3: cURL supports gzip- and deflate-compressed transfers, which again makes for much faster transfers when the content is compressible, such as HTML, while file_get_contents will always transfer plain (see the sketch after this list).
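To get point 3's benefit you do have to ask for it; a minimal sketch (one extra option inside the curl_download() function above, before curl_exec()):
// An empty string tells cURL to advertise every encoding it supports
// (gzip, deflate, ...) and to decompress the response transparently.
curl_setopt($ch, CURLOPT_ENCODING, '');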

simple xpath query not working

This snippet of code is not working:
Notice: Trying to get property of non-object in test.php on line 13
but the XPath query seems obviously correct... and the URL provided obviously has a <p id="descrizione"> tag.
I tried to replace the query even with '//html' but no luck.
I always use XPath and this is strange behaviour.
<?php
$_url = 'http://www.portaleaste.com/it/Aste/Detail/876989';
$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $_url);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
$result2 = curl_exec($ch2);
curl_close($ch2);
$doc2 = new DOMDocument();
@$doc2->load($result2);
$xpath2 = new DOMXpath($doc2);
$txt = $xpath2->query('//p[@id="descrizione"]')->item(0)->nodeValue;
echo $txt;
?>
There is nothing wrong with your XPath query; it is correct syntax and the node does exist. The problematic line is this:
@$doc2->load($result2);
// DOMDocument::load - Load XML from a file
You are not loading the result page that you got from your cURL request properly. To load the response, use this instead:
@$doc2->loadHTML($result2);
// DOMDocument::loadHTML - Load HTML from a string
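Since the notice came from calling ->nodeValue on a non-object, it is also worth guarding the lookup; a minimal sketch:
$nodes = $xpath2->query('//p[@id="descrizione"]');
if ($nodes !== false && $nodes->length > 0) {
    echo $nodes->item(0)->nodeValue;
} else {
    echo 'no matching element found';
}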

simplexml_load_file($feedURL) returns bool(false) even though the RSS works

I am trying to make a simple widget that loads a YouTube RSS feed and shows the first few videos.
The problem is that even though the RSS address is correct, it always dumps false:
$feedURL = 'http://gdata.youtube.com/feeds/api/users/ninpetit/uploads?alt=rss&v=2';
$sxml = simplexml_load_file($feedURL);
var_dump($sxml); /* output: bool(false) */
What am I doing wrong? Is there any alternative to simplexml_load_file?
P.S. This code is being executed on a shared server.
EDIT
I'm successfully getting the data via cURL, but simplexml_load_file will still return false if I pass it $data:
$feedURL = 'http://gdata.youtube.com/feeds/api/users/ninpetit/uploads?alt=rss&v=2';
$ch = curl_init($feedURL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
echo $data.'<br>'; /* shows data!! */
$sxml = simplexml_load_file($data); /* also false */
If you have XML data in your $data string, you can easily parse it using the simplexml_load_string() function.
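A minimal sketch combining the cURL fetch from the edit with simplexml_load_string() (assuming the feed is standard RSS 2.0, i.e. channel/item elements):
$feedURL = 'http://gdata.youtube.com/feeds/api/users/ninpetit/uploads?alt=rss&v=2';
$ch = curl_init($feedURL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
// Parse the string we already have, instead of handing
// simplexml_load_file() something that is not a filename or URL.
$sxml = simplexml_load_string($data);
if ($sxml !== false) {
    foreach ($sxml->channel->item as $item) {
        echo $item->title . '<br>';
    }
}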

php parse exchange rate feed XML

I am trying to use the currency exchange rate feeds of the European Central Bank (ECB):
http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml
They have provided documentation on how to parse the XML, but none of the options works for me. I checked that allow_url_fopen=On is set.
http://www.ecb.int/stats/exchange/eurofxref/html/index.en.html
For instance, I used the snippet below, but it doesn't echo anything and it seems the $XML object is always empty.
<?php
// This is a PHP (5) script example on how eurofxref-daily.xml can be parsed
// Read the eurofxref-daily.xml file into memory.
// For the next command you will need the config option allow_url_fopen=On (default)
$XML = simplexml_load_file("http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml");
// the file is updated daily between 2.15 p.m. and 3.00 p.m. CET
foreach($XML->Cube->Cube->Cube as $rate){
    // Output the value of 1 EUR for a currency code
    echo '1€='.$rate["rate"].' '.$rate["currency"].'<br/>';
    //--------------------------------------------------
    // Here you can add your code for inserting
    // $rate["rate"] and $rate["currency"] into your database
    //--------------------------------------------------
}
?>
Update:
As I am behind a proxy in my test environment, I tried this, but I still don't get to read the XML:
function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_close($ch);
    return curl_exec($ch);
}
$address = urlencode($address);
$data = curl("http://www.ecb.int/stats/eurofxref/eurofxref-daily.xml");
$XML = simplexml_load_file($data);
var_dump($XML); // returns bool(false)
Please help me. Thanks!
I didn't find any relevant settings in php.ini. Check with phpinfo() whether you have SimpleXML support and cURL support enabled. (You should have them both, and especially SimpleXML, since you're using it and it returns false rather than complaining about a missing function.)
Proxy might be an issue here. See this and this answer. Using cURL could be an answer to your problem (see the sketch below).
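If the proxy is what's blocking you, a minimal sketch (the proxy host, port, and credentials below are placeholders; substitute your environment's real values):
$ch = curl_init('http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Hypothetical proxy settings - replace with your environment's values
curl_setopt($ch, CURLOPT_PROXY, 'proxy.example.com:8080');
//curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:password'); // if the proxy needs auth
$data = curl_exec($ch);
curl_close($ch);
$XML = simplexml_load_string($data);
var_dump($XML);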
Here's one alternative, found here:
$url = file_get_contents('http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml');
$xml = new SimpleXMLElement($url);
// file_put_contents - same as fopen, write, and close;
// we need to output "asXML" - SimpleXML returns an object based upon the raw XML
file_put_contents(dirname(__FILE__)."/loc.xml", $xml->asXML());
foreach($xml->Cube->Cube->Cube as $rate){
    echo '1€='.$rate["rate"].' '.$rate["currency"].'<br/>';
}
This solution works for me:
$data = [];
$url = "http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml";
$xmlRaw = file_get_contents($url);
$doc = new DOMDocument();
$doc->preserveWhiteSpace = FALSE;
$doc->loadXML($xmlRaw);
$node1 = $doc->getElementsByTagName('Cube')->item(0);
foreach ($node1->childNodes as $node2) {
    $value = [];
    foreach ($node2->childNodes as $node3) {
        $value['date'] = $node2->getAttribute('time');
        $value['currency'] = $node3->getAttribute('currency');
        $value['rate'] = $node3->getAttribute('rate');
        $data[] = $value;
        unset($value);
    }
}
echo "<pre>" . print_r($data, true) . "</pre>";

PHP DOM parsing error

I am using the following code for parsing a DOM document, but at the end I get this error:
"google.ac" is null or not an object
line 402
char 1
I guess line 402 contains a tag and a lot of ";". How can I fix this?
<?php
//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
//@$dom->saveHTMLFile('newfolder/abc.html')
$dom->loadHTML('$data');
// find all ul
$list = $dom->getElementsByTagName('ul');
// get few list items
$rows = $list->item(30)->getElementsByTagName('li');
// get anchors from the table
$links = $list->item(30)->getElementsByTagName('a');
foreach ($links as $link) {
    echo "<fieldset>";
    $links = $link->getElementsByAttribute('imgurl');
    $dom->saveXML($links);
}
?>
There are a few issues with the code:
1: You should add the cURL option CURLOPT_RETURNTRANSFER in order to capture the output; by default the output is sent to the browser. Like this: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. In the code above, $data will always be TRUE or FALSE (http://www.php.net/manual/en/function.curl-exec.php).
2: $dom->loadHTML('$data'); is not correct and not required.
3: The method of reading 'li' and 'a' tags might not be correct, because $list->item(30) will always point to the 30th element.
Anyway, coming to the fixes. I'm not sure if you checked the HTML returned by the cURL request, but it seems different from what we discussed in the original post. In other words, the HTML returned by cURL does not contain the required <ul> and <li> elements. It instead contains <td> and <a> elements.
Add-on: I'm not very sure why the HTML for the same page is different when seen from the browser versus when read from PHP, but here is a reasoning that might fit. The page uses JavaScript code that renders some HTML dynamically on page load. This dynamic HTML can be seen when viewed from the browser but not from PHP. Hence, I assume the <ul> and <li> tags are dynamically generated. Anyway, that isn't our concern for now.
Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:
<?php
$ch = curl_init(); // create a new cURL resource
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($ch); // grab URL and pass it to the browser
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // avoid warnings
$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
    if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
        $href = $itemA->getAttribute('href'); // read the value of 'href'
        if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' begins with "/imgres?"
            $qryString = substr($href, strpos($href, '?') + 1);
            parse_str($qryString, $arrHref); // read the query parameters from the 'href' URI
            echo '<br>' . $arrHref['imgurl'] . '<br>';
        }
    }
}
I hope the above makes sense. But please note that the above parsing might fail if Google modifies their HTML.
