I have used several different scripts that people have suggested for parsing RSS, including Magpie and the SimpleXML feature in PHP, but none seem to handle RSS 2.0 well, because they will not give me back the full content chunk. Does anyone have a suggestion for reading a feed like the one found at http://chacha102.com/feed/ and getting the full content instead of only the description?
Without reading any documentation of the RSS "content" namespace and how it is to be used, here is a working SimpleXML script. The trick is using the namespace when retrieving the content.
/* the namespace of the RSS "content" module */
$content_ns = "http://purl.org/rss/1.0/modules/content/";

/* load the file */
$rss = file_get_contents("http://chacha102.com/feed/");

/* create SimpleXML object */
$xml = new SimpleXMLElement($rss);

$root = $xml->channel; /* our root element */

foreach ($root->item as $item) { /* loop over every item in the channel */
    print "Description: <br>" . $item->description . "<br><br>";
    print "Full content: <div>";
    foreach ($item->children($content_ns) as $content_node) {
        /* loop over all children in the "content" namespace */
        print $content_node . "\n";
    }
    print "</div>";
}
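If the feed exposes its full text in a content:encoded element (as WordPress feeds typically do), a shorter variant of the inner loop should also work; the element name here is an assumption about the feed:

/* Variant sketch: read the content:encoded element directly
   (assumes the feed uses that element name). */
foreach ($xml->channel->item as $item) {
    print "Full content: <div>" . $item->children($content_ns)->encoded . "</div>";
}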
What do you have that's not working right now? Parsing RSS should be a trivial process. Try stepping back from excessive libraries and just use a few simple XPath queries or the DOMDocument object in PHP.
see: PHP DOMDocument
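A minimal sketch of that approach, assuming the feed from the question and the content namespace that WordPress feeds use:

/* DOMDocument/DOMXPath sketch: fetch each item's description and
   full content:encoded text. */
$dom = new DOMDocument();
$dom->load('http://chacha102.com/feed/');

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('content', 'http://purl.org/rss/1.0/modules/content/');

foreach ($xpath->query('//item') as $item) {
    $desc = $xpath->query('description', $item)->item(0);
    $full = $xpath->query('content:encoded', $item)->item(0);
    echo $desc ? $desc->nodeValue : '', "\n";
    echo $full ? $full->nodeValue : '', "\n";
}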
I'm trying to use SimplePie to pull a list of links via RSS feeds and then scrape those feeds using Simple HTML DOM to pull out images. I'm able to get SimplePie working to pull the links and store them in an array. I can also use the Simple HTML DOM parser to get the image link that I'm looking for. The problem is that when I try to use SimplePie and Simple HTML DOM at the same time, I get a 500 error. Here's the code:
set_time_limit(0);
error_reporting(0);

$rss = new SimplePie();
$rss->set_feed_url('http://contently.com/strategist/feed/');
$rss->init();

foreach ($rss->get_items() as $item) {
    $urls[] = $item->get_permalink();
}
unset($rss);

/*
$urls = array(
    'https://contently.com/strategist/2016/01/22/whats-in-a-spotify-name-and-5-other-stories-you-should-read/',
    'https://contently.com/strategist/2016/01/22/how-to-make-content-marketing-work-inside-a-financial-services-company/',
    'https://contently.com/strategist/2016/01/22/glenn-greenwald-talks-buzzfeed-freelancing-the-future-journalism/',
    ...
    'https://contently.com/strategist/2016/01/19/update-a-simpler-unified-workflow/');
*/

foreach ($urls as $url) {
    $html = new simple_html_dom();
    $html->load_file($url);
    $images = $html->find('img[class=wp-post-image]', 0);
    echo $images;
    $html->clear();
    unset($html);
}
I commented out the urls array, but it is identical to the array created by the SimplePie loop (I created it manually from the results). It fails on the find command the first time through the loop. If I comment out the $rss->init() line and use the static url array, the code all runs with no errors, but doesn't give me the result I want - of course. Any help is greatly appreciated!
There's a strange incompatibility between simple_html_dom and SimplePie. When loading HTML, simple_html_dom->root is not populated, which causes an error in any subsequent operation.
Curiously, it works fine for me when switching from object mode to function mode:
$html = file_get_html( $url );
instead of:
$html = new simple_html_dom();
$html->load_file($url);
In any case, simple_html_dom is known for causing problems, above all with memory usage.
Edited:
OK, I have found the bug.
It resides in simple_html_dom->load_file(), which calls the standard function file_get_contents() and then checks the result through error_get_last(); if an error is found, it unsets its own data. But if an error occurred earlier (in my test, SimplePie output the warning "./cache is not writeable"), this previous error is interpreted by simple_html_dom as a file_get_contents() failure.
If you have PHP 7 installed, you can call error_clear_last() after unset($rss), and your code should work. Otherwise, you can use my code above, or pre-load the HTML data into a variable and then call simple_html_dom->load() instead of simple_html_dom->load_file().
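A minimal sketch of that PHP 7 fix, put together from the question's own code (error_clear_last() requires PHP >= 7.0):

set_time_limit(0);

$rss = new SimplePie();
$rss->set_feed_url('http://contently.com/strategist/feed/');
$rss->init();

$urls = array();
foreach ($rss->get_items() as $item) {
    $urls[] = $item->get_permalink();
}
unset($rss);

/* Clear any warning SimplePie left behind (e.g. "./cache is not writeable"),
   so simple_html_dom's error_get_last() check does not misfire. */
error_clear_last();

foreach ($urls as $url) {
    $html = new simple_html_dom();
    $html->load_file($url);
    $image = $html->find('img[class=wp-post-image]', 0);
    echo $image;
    $html->clear();
    unset($html);
}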
Very stumped by this one. In PHP, I'm fetching a YouTube user's vids feed and trying to access the nodes, like so:
$url = 'http://gdata.youtube.com/feeds/api/users/HCAFCOfficial/uploads';
$xml = simplexml_load_file($url);
So far, so fine. Really basic stuff. I can see the data comes back by running:
echo '<p>Found '.count($xml->xpath('*')).' nodes.</p>'; //41
echo '<textarea>';print_r($xml);echo '</textarea>';
Both print what I would expect, and the print_r replicates the XML structure.
However, I have no idea why this is returning zero:
echo '<p>Found '.count($xml->xpath('entry')).' "entry" nodes.</p>';
There blatantly are entry nodes in the XML. This is confirmed by running:
foreach($xml->xpath('*') as $node) echo '<p>['.$node->getName().']</p>';
...which duly outputs "[entry]" 25 times. So perhaps this is a bug in SimpleXML? This is part of a wider feed caching system and I'm not having any trouble with other, non-YT feeds, only YT ones.
[UPDATE]
This question shows that it works if you do
count($xml->entry)
But I'm curious as to why count($xml->xpath('entry')) doesn't also work...
[Update 2]
I can happily traverse YT's alternate feed format just fine:
http://gdata.youtube.com/feeds/base/users/{user id}/uploads?alt=rss&v=2
This is happening because the feed is an Atom document with a defined default namespace.
<feed xmlns="http://www.w3.org/2005/Atom" ...
Since a namespace is defined, you have to define it for your xpath call too. Doing something like this works:
$url = 'http://gdata.youtube.com/feeds/api/users/HCAFCOfficial/uploads';
$xml = simplexml_load_file($url);
$xml->registerXPathNamespace('ns', 'http://www.w3.org/2005/Atom');
$results = $xml->xpath('ns:entry');
echo count($results);
The main thing to know here is that SimpleXML respects any and all defined namespaces and you need to handle them accordingly, including the default namespace. You'll notice that the second feed you listed does not define a default namespace and so the xpath call works fine as is.
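As a side note, a prefix-free alternative (a sketch along standard SimpleXML lines, not tied to this specific feed) is to select the children in the Atom namespace directly:

/* Alternative sketch: select children in the Atom namespace without
   registering an XPath prefix. */
$entries = $xml->children('http://www.w3.org/2005/Atom')->entry;
echo count($entries);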
I need to traverse a DBpedia XML resource file to get the abstract and some other basic information, like formation year and budget.
An example of this would be the US EPA. (The bottom of the page has links to different data formats of the same file.)
I only need the first rdf:Description element of the XML file. A snippet of the code:
$xml_result = file_get_contents($xml_url);
$xml_data = simplexml_load_string($xml_result);
$namespaces = $xml_data->getNamespaces(true);
//print_r($namespaces);
$current = $xml_data->children($namespaces['rdf']);
This only gets me the rdf elements inside the first rdf:Description. How do I get access to other elements, like the dbpedia-owl namespace elements inside the Description element?
You can use multiple namespaces; see https://stackoverflow.com/a/13350242/865201
Without testing it, I think you can use something like
$xml_data->children($namespaces['rdf'])->Description->children($namespaces['dbpedia-owl'])->anotherElement;
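A slightly fuller sketch along those lines (untested; it assumes the document declares the rdf and dbpedia-owl prefixes, as the DBpedia exports do):

/* Untested sketch: iterate the rdf:Description nodes and list their
   dbpedia-owl children, stopping after the first Description. */
$xml_data = simplexml_load_string(file_get_contents($xml_url));
$namespaces = $xml_data->getNamespaces(true);

foreach ($xml_data->children($namespaces['rdf'])->Description as $description) {
    foreach ($description->children($namespaces['dbpedia-owl']) as $name => $value) {
        echo $name, ': ', $value, "\n";
    }
    break; /* only the first rdf:Description is needed */
}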
I'm currently trying to parse some data from a forum. Here is the code:
$xml = simplexml_load_file('https://forums.eveonline.com');
$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach ($names as $name) {
    echo $name . "<br/>";
}
Anyway, the problem is that I'm using the Google XPath extension to help me get the path, and I'm guessing that Google is changing the HTML enough that the nodes don't come up when I run this search from my website. Is there some way I can make the host look at the site through Google Chrome so that it gets the right code? What would you suggest?
Thanks!
My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.
The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.
/* Use internal libxml errors -- suppress warnings in production, show them while debugging */
libxml_use_internal_errors(true);

/* Create a new DOMDocument object */
$dom = new DOMDocument;

/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");

/* Create a new XPath object */
$xpath = new DOMXPath($dom);

/* Query all <td> nodes containing the specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");

/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");

/* Traverse the DOMNodeList object to output each DOMNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
A double slash '//' makes XPath search the whole subtree rather than only direct children. So the XPath '//table' would get all tables in the document.
You can also use this deeper in your XPath structure, like 'html/body/div/div/form//table', to get all tables under 'html/body/div/div/form'.
This way you can make your code a bit more resilient against changes in the HTML source.
I do suggest learning a little about XPath if you want to use it. Copy-paste only gets you so far.
A simple explanation of the syntax can be found at w3schools.com/xml/xpath_syntax.asp
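For example, a hypothetical illustration using a DOMXPath object like the one in the answer above:

/* The first query depends on the exact nesting; the second finds
   tables under the form even if intermediate markup changes. */
$brittle   = $xpath->query("html/body/div/div/form/div/table");
$resilient = $xpath->query("//form//table");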
I have functionality like the code below and am getting the error "String could not be parsed as XML":
$category_feed_url = "http://www.news4u.com/blogs/category/articles/feed/";
$file = file_get_contents($category_feed_url);
$xml = new SimpleXMLElement($file);
foreach ($xml->channel->item as $feed)
{
    echo $feed->link;
    echo $feed->title;
    ...
Why has this error occurred?
The URL points to an HTML document.
It is possible for a document to be both HTML and XML, but this one isn't.
It fails because you are trying to parse not-XML as if it was XML.
See How to parse and process HTML with PHP? for guidance in parsing HTML using PHP.
You seem to be expecting an RSS feed, though, and that document doesn't resemble one or reference one. The site looks rather spammy; possibly that URI used to point to an RSS feed, but the domain has now fallen to a link-farm spammer. If so, you should find an alternative source for the information you were collecting.
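If you want to fail gracefully when a URL stops serving XML, a defensive sketch (using the URL from the question) could look like this:

/* Defensive sketch: collect libxml errors instead of letting the
   SimpleXMLElement constructor throw on non-XML input. */
libxml_use_internal_errors(true);

$file = file_get_contents("http://www.news4u.com/blogs/category/articles/feed/");
$xml  = simplexml_load_string($file);

if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    libxml_clear_errors();
}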
"String could not be parsed as XML", your link is an html page.