I'm currently using Magpie RSS, but it sometimes falls over when the RSS or Atom feed isn't well-formed. Are there any other options for parsing RSS and Atom feeds with PHP?
I've always used the SimpleXML functions built into PHP to parse XML documents. It's one of the few generic parsers out there with an intuitive structure, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any, you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and try again.
Consider this very rough, simple class using SimpleXML:
class BlogPost
{
    var $date;
    var $ts;
    var $link;
    var $title;
    var $text;
    var $summary;
}

class BlogFeed
{
    var $posts = array();

    function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        if (!($x = simplexml_load_file($file_or_url)))
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            $post->ts    = strtotime((string) $item->pubDate);
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images,
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);

            $this->posts[] = $post;
        }
    }

    private function resolveFile($file_or_url)
    {
        // Treat anything that isn't an http(s) URL as a local file
        // under the shared XML directory
        if (!preg_match('|^https?:|', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] . '/shared/xml/' . $file_or_url;
        else
            $feed_uri = $file_or_url;
        return $feed_uri;
    }

    private function summarizeText($summary)
    {
        $summary = strip_tags($summary);

        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len)
            $summary = substr($summary, 0, $max_len) . '...';
        return $summary;
    }
}
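For completeness, usage would look something like this (the feed URL here is just a placeholder):

$feed = new BlogFeed('http://www.example.com/feed.rss');
foreach ($feed->posts as $post) {
    echo $post->date . ': ' . $post->title . "\n";
    echo $post->summary . "\n\n";
}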
With four lines, I can import an RSS feed into an array:
$feed = file_get_contents('http://yourdomains.com/feed.rss');
$xml = simplexml_load_string($feed);
$json = json_encode($xml);
$array = json_decode($json, TRUE);
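Assuming a standard RSS 2.0 layout, the items then come out as plain array entries; for example:

// Title of the first item (the exact keys depend on the feed's XML structure)
echo $array['channel']['item'][0]['title'];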
For a more complex solution:
$feed = new DOMDocument();
$feed->load('file.rss');

$json = array();
$json['title'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('link')->item(0)->firstChild->nodeValue;

$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
$json['item'] = array();

foreach ($items as $key => $item) {
    $title       = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
    $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
    $pubDate     = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
    $guid        = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

    $json['item'][$key]['title']       = $title;
    $json['item'][$key]['description'] = $description;
    $json['item'][$key]['pubdate']     = $pubDate;
    $json['item'][$key]['guid']        = $guid;
}

echo json_encode($json);
Your other options include:
SimplePie
Last RSS
PHP Universal Feed Parser
I would like to introduce a simple script to parse RSS:
$i = 0; // counter
$url = "http://www.banki.ru/xml/news.rss"; // URL to parse
$rss = simplexml_load_file($url); // XML parser

// Channel title + image with src
print '<h2><img style="vertical-align: middle;" src="' . $rss->channel->image->url . '" /> ' . $rss->channel->title . '</h2>';

// RSS items loop
foreach ($rss->channel->item as $item) {
    if ($i < 10) { // print only the first 10 items
        print $item->title . '<br />';
    }
    $i++;
}
If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed's creator a bozo.
Otherwise you're paving the way to the same mess that HTML ended up in.
The HTML Tidy library is able to fix some malformed XML files. Running your feeds through it before passing them to the parser may help.
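A minimal sketch with PHP's tidy extension, assuming the raw feed text is already in $raw:

// Repair the markup as XML; tidy_repair_string() returns the cleaned-up string
$clean = tidy_repair_string($raw, array('input-xml' => true, 'output-xml' => true));
$xml = simplexml_load_string($clean);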
I use SimplePie to parse a Google Reader feed; it works pretty well and has a decent feature set.
Of course, I haven't tested it with non-well-formed RSS/Atom feeds, so I don't know how it copes with those; I'm assuming Google's are fairly standards-compliant! :)
Personally, I use BNC Advanced Feed Parser; I like its template system, which is very easy to use.
The PHP RSS reader - http://www.scriptol.com/rss/rss-reader.php - is a complete but simple parser used by thousands of users.
Another great free parser - http://bncscripts.com/free-php-rss-parser/
It's very light (only 3 kB) and simple to use!
Related
I'm trying to parse RSS feeds from some media outlets. My script works for most of them. The problem is that I need to aggregate all of them, even though some are malformed.
I can't get the description from these two feeds. How should I proceed?
Here is my script:
<?php
function RSS_items ($url) {
    $i = 0;
    $y = array();
    $doc = new DOMDocument();
    $doc->load($url);
    $channels = $doc->getElementsByTagName('channel');

    foreach ($channels as $channel) {
        $items = $channel->getElementsByTagName('item');
        foreach ($items as $item) {
            $i++;
            $y[$i]['title']       = $item->getElementsByTagName('title')->item(0)->firstChild->textContent;
            $y[$i]['link']        = $item->getElementsByTagName('link')->item(0)->firstChild->textContent;
            $y[$i]['updated']     = $item->getElementsByTagName('pubDate')->item(0)->firstChild->textContent;
            $y[$i]['description'] = $item->getElementsByTagName('description')->item(0)->firstChild->textContent;
        }
    }

    echo '<pre>';
    print_r($y);
    echo '</pre>';
}

// the two malformed feeds
RSS_items('http://www.lefigaro.fr/rss/figaro_actualites-a-la-une.xml');
RSS_items('https://francais.rt.com/rss');
?>
The problem in your code is the use of the firstChild property, which selects an element's first child. But in the target XML, the description tag doesn't have the child node you are trying to select. Remove it and read textContent directly. The result should look like this:
$item->getElementsByTagName('description')->item(0)->textContent;
I am using PHP Simple HTML DOM Parser to get data from another site. First I get the URLs of my trades on this site, and then I send another request to each trade URL to get the comments. I want to build an array of comments so I can sort them later. Why can't I create the array?
It looks like this:
include_once('simple_html_dom.php');

$result = array();
$html = file_get_html('http://csgolounge.com/profile?id=' . $steamid);

foreach ($html->find('div.tradepoll') as $trade)
{
    $tradeid = $trade->find('.tradeheader')[0]->find('a')[0]->href;
    $html = file_get_html('http://csgolounge.com/' . $tradeid);

    foreach ($html->find('div.message') as $message)
    {
        if (!$message->find('p', 0))
        {
            $left  = $message->find('.msgleft')[0];
            $right = $message->find('.msgright')[0];

            // information about comments
            $time = trim(strip_tags_content($left->innertext));
            $text = $left->find('.msgtxt')[0];

            $result[$time]['time'] = $time;
            $result[$time]['text'] = $text;
        }
    }
}

echo json_encode($result);
If I echo $time or $text, I always get the data successfully.
I found what the problem was.
The Simple HTML DOM Parser does not clean up memory in the DOM each time file_get_html or str_get_html is called, so it needs to be done explicitly each time you have finished with the current DOM.
So I added $html->clear(); at the end of the loop.
Credits: electrictoolbox.com
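For illustration, a sketch of the placement; here the inner document gets its own variable ($trade_html, a name introduced for clarity) so clearing it doesn't touch the outer one:

foreach ($html->find('div.tradepoll') as $trade)
{
    $tradeid = $trade->find('.tradeheader')[0]->find('a')[0]->href;
    $trade_html = file_get_html('http://csgolounge.com/' . $tradeid);

    // ... process the messages as before ...

    // Release the memory held by this DOM before the next iteration
    $trade_html->clear();
    unset($trade_html);
}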
I am totally new to PHP development and I would like to extract the contents of a meta tag.
I have this code that extracts the contents of the #squad element.
// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
    // id is the page ID for the selector
    array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
    array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach ($sitesToCheck as $site) {
    $url = $site["url"];

    // Calculate the cached page name, set oldContent = ""
    $fileName = md5($url);
    $oldContent = "";
    $currentContent = "";

    // Get the URL's current page content
    $html = file_get_html($url);

    // Find content by querying with a selector, just like a selector engine!
    foreach ($html->find($site["selector"]) as $element) {
        $currentContent = $element->plaintext;
    }

    // If a cached file exists
    if (file_exists($savePath . $fileName)) {
        // Retrieve the old content
        $oldContent = file_get_contents($savePath . $fileName);
    }

    // If different, notify!
    if ($oldContent && $currentContent != $oldContent) {
        // Build simple email content
        $emailContent = "Hey, the following page has changed!\n\n" . $url . "\n\n";
    }

    // Save new content
    file_put_contents($savePath . $fileName, $currentContent);
}

// Send the email if there's content!
if ($emailContent) {
    // Sendmail!
    mail("me@myself.name", "Sites Have Changed!", $emailContent, "From: alerts@myself.name\r\n");

    // Debug
    echo $emailContent;
}
But I want to change this code so that it extracts the number of comments instead.
Here is the meta tag from which I want to extract the number of comments:
<meta item="desc" content="Comments:645">
Am I clear enough? If anything is unclear, just ask.
Thanks for the help.
There are two ways to do this. You could use the native PHP function get_meta_tags(), like so:
$tags = get_meta_tags('http://yoursite.com');
$comments = $tags['desc'];
Or you could use RegEx, but the above would be much more practical.
What you are looking for might be screen scraping.
This is the process where a programming language like PHP, Python, or Ruby loads a website into memory and uses various selectors to grab content from it.
Screen scraping is mostly used on websites that offer a lot of interesting data but have no JSON or XML APIs.
Having googled around for it, I stumbled on this post:
PHP equivalent of PyQuery or Nokogiri?
This article explains more about screen scraping for the web:
http://en.wikipedia.org/wiki/Web_scraping
Look at using DOMDocument:
$dom = new DOMDocument;
$dom->loadHTML($htmlPage);
$metas = $dom->documentElement->getElementsByTagName('meta');

$ar = array();
foreach ($metas as $meta) {
    $name  = $meta->getAttribute('name');
    $value = $meta->getAttribute('content');
    $ar[$name] = $value;
}

print_r($ar); // print array of meta values
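Since the tag in the question uses an item attribute rather than name, and its content is the string Comments:645, you would key on that attribute and strip the prefix; a minimal sketch:

foreach ($metas as $meta) {
    if ($meta->getAttribute('item') == 'desc') {
        // content is "Comments:645"; keep only the number after the colon
        $count = (int) substr($meta->getAttribute('content'), strlen('Comments:'));
        echo $count; // 645
    }
}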
I have a PHP file set up to pull in ONE XML data feed. What I would like to do is load up to four feeds into it and, if possible, make it select a random item too, then parse that into a jQuery News Ticker.
My current PHP is as follows...
<?php
$feed = new DOMDocument();
$feed->load('/feed');

$json = array();
$json['title'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('link')->item(0)->firstChild->nodeValue;

$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
$json['item'] = array();
$i = 0;

foreach ($items as $item) {
    $title       = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
    $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
    $pubDate     = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
    $guid        = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

    // Note: increment $i once per item, not once per field
    $json['item'][$i]['title']       = $title;
    $json['item'][$i]['description'] = $description;
    $json['item'][$i]['pubdate']     = $pubDate;
    $json['item'][$i]['guid']        = $guid;
    $i++;

    echo '<li class="news-item">' . $title . '</li>';
}

//echo json_encode($json);
?>
How can I modify this to load more than one feed into the file?
Thanks in advance
The simplest approach to doing this is wrapping another loop around the code you have. It's not the cleanest way but will probably suffice for the purpose.
In general, IMO, it's always beneficial to learn the basics of the language first. E.g. PHP manual on foreach
This is roughly what the loop needs to look like:
$my_feeds = array("http://.....", "http://.....", "http://.....");

foreach ($my_feeds as $my_feed)
{
    // This is where your code starts
    $feed = new DOMDocument();
    $feed->load($my_feed); // <-- notice the variable

    $json = array();

    // ... and the rest of the code
}
This will walk through all the URLs in $my_feeds, open each RSS source, fetch all the items from it, and output them.
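As for the random item the question mentions, one option (a sketch, assuming the items from all feeds have been collected into a single $json['item'] array as in the question's code) is:

// array_rand() returns a random key from the combined item list
$random_key  = array_rand($json['item']);
$random_item = $json['item'][$random_key];
echo '<li class="news-item">' . $random_item['title'] . '</li>';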
If I'm reading your question right, what you may want to do is turn your code into a function, which you would then run inside a foreach loop for each url (which you could store in an array or other data structure).
Edit: If you don't know much about functions, this tutorial section might help you. http://devzone.zend.com/9/php-101-part-6-functionally-yours/
I need to read in and parse data from a third-party website which sends XML data. All of this needs to be done server-side.
What is the best way to do this using PHP?
You can obtain the remote XML data with, e.g.
$xmldata = file_get_contents("http://www.example.com/xmldata");
or with curl. Then use SimpleXML, DOM, whatever.
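The curl variant would look roughly like this (the URL is a placeholder):

// Fetch the remote XML with curl and return it as a string
$ch = curl_init('http://www.example.com/xmldata');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xmldata = curl_exec($ch);
curl_close($ch);

$xml = simplexml_load_string($xmldata);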
A good way of parsing XML is often to use the XPP (XML Pull Parsing) approach. PHP has an implementation of it called XMLReader.
http://php.net/manual/en/class.xmlreader.php
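A minimal sketch of pulling item titles out of a feed with XMLReader (the URL is a placeholder):

$reader = new XMLReader();
$reader->open('http://www.example.com/xmldata');

while ($reader->read()) {
    // Stop at each opening <item> element and hand its subtree to SimpleXML
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        $item = new SimpleXMLElement($reader->readOuterXml());
        echo $item->title, "\n";
    }
}

$reader->close();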
I would suggest you use DOMDocument (a PHP built-in class).
A simple example of its power is the following code:
/***********************************************************************************************
 Takes the RSS news feed found at $url and prints it as HTML code.
 Each news item is rendered in a <div class="rss"> block in the order: date + title + description.
***********************************************************************************************/
function Render($url, $max_feeds = 1000)
{
    $doc = new DOMDocument();
    if (@$doc->load($url, LIBXML_NOCDATA | LIBXML_NOBLANKS))
    {
        $feed_count = 0;
        $items = $doc->getElementsByTagName("item");
        //echo $items->length; //DEBUG

        foreach ($items as $item)
        {
            if ($feed_count > $max_feeds)
                break;

            // Unfortunately, inside an <item> node the elements are not always in the same order,
            // therefore we have to call getElementsByTagName many times.
            // WARNING: using the iconv function instead of utf8_decode because the latter did not
            // convert some characters properly, like apostrophe 0x19 from techsport.it feeds.
            $title       = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("title")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
            $description = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("description")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"
            $link        = iconv('UTF-8', 'CP1252', $item->getElementsByTagName("link")->item(0)->firstChild->textContent); //can use "CP1252//TRANSLIT"

            // pubDate tag is not mandatory in RSS [RSS2 spec: http://cyber.law.harvard.edu/rss/rss.html]
            $pub_date = $item->getElementsByTagName("pubDate");
            $date_html = "";
            // play with the date here if you want

            echo "<div class='rss'>\n<p class='title'><a href='" . $link . "'>" . $title . "</a></p>\n<p class='description'>" . $description . "</p>\n</div>\n\n";
            $feed_count++;
        }
    }
    else
        echo "<div class='rss'>Service not available.</div>";
}
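Usage (with a placeholder feed URL):

// Render at most 10 news entries from the feed
Render('http://www.example.com/feed.rss', 10);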
I have been using SimpleXML for a while.