Get image url from RSS feed description using PHP - php

I want to get the url of an image (where it's stored) from the description of an RSS feed, so that I can put that url inside a variable.
I know that to get the link of an RSS feed post I have to use $feed->channel->item->link; where $feed is $feed=simplexml_load_file("link_of_the_feed");.
But what if I have to get the image url of the post? Do I have to use something like $feed->channel->item->image;?
I really don't know. Maybe I need an RSS parser like MagpieRSS, which I tried without results?
Thanks in advance.

If the image node is at the top level of the item node, then yes. If it's deeper than the item node, you'll have to traverse it accordingly. It would be helpful if you posted your XML.
EDIT: you can also check out my answer here on how to parse through an XML file with PHP.

You're on the right track! But it all depends on the format that the RSS Feed is set up in.
The item node actually contains a whole bunch of different fields, of which link is only one. Take a look here for information on the other fields that the item node contains.
Now, if the RSS feed points directly to the image file, then you can just use item->link. More likely, however, the link points to a blog post or something similar that has the image embedded in it. In this case, you can process $feed->channel->item->description to find what you need. The description node contains an escaped HTML summary of the post, and you can use a regular expression on it to find the source of the image. Also remember: you might need to decode the description using htmlspecialchars_decode() before you process it with the regular expression - in my own experience, descriptions often come with special characters escaped.
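As a rough sketch of that approach (not a definitive implementation): the feed below is a made-up sample, and first_image_url() is a hypothetical helper name. SimpleXML already decodes XML entities when a node is cast to a string, so htmlspecialchars_decode() should only be needed for feeds whose descriptions are escaped twice.

```php
<?php
// Made-up sample feed; a real one would be loaded with simplexml_load_file($url).
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example post</title>
      <link>http://example.com/post</link>
      <description>&lt;p&gt;Hello&lt;/p&gt;&lt;img src="http://example.com/pic.jpg" alt="x" /&gt;</description>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns the src of the first <img> tag, or null if none is found.
function first_image_url(string $html): ?string
{
    if (preg_match('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

$feed = simplexml_load_string($rss);
// Casting to string yields the description with XML entities already decoded.
$description = (string) $feed->channel->item->description;
$imageUrl = first_image_url($description);
echo $imageUrl, "\n";
```

The regular expression only handles quoted src attributes; for messy markup, loading the description into DOMDocument and reading the img elements would be more robust.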
I know that's a lot of information, but once you get started it's really not as hard as it sounds. Good luck!

Related

Regex PHP to get data from a website

I want to scrape the following data (the pink part in the image) from http://www.kitco.com/market/
I was able to scrape the data from The World Spot Price - Asia/Europe/NY markets HTML table using the code below, but I am not able to get the London Fix data from the table below it. What changes should I make to the regular expression? I have tried many combinations, but none of them work.
My code looks like the following:
$html= get_url_contents("http://www.kitco.com/market/");
//echo $html;
preg_match_all('!Gold\s+([0-9.]+)\s+([0-9.]+)!i',$html,$matches);
$patt = "/<td[^>]*width=['\"]68['\"][^>]*>([0-9\.]+)<\/td>\s*<td[^>]*width=['\"]68['\"][^>]*>([0-9\.]+)<\/td>/i";
Please do not parse HTML with regular expressions (you can see why in this mandatory post).
That being said, you can use an HTML parser, such as the Simple HTML DOM Parser to process the table. Take a look at this previous SO post to get started in the right direction.
EDIT: As per your comment, you could try to do something like so: <td bgcolor=".+?">\s*<p>\s*(.+?)\s*</p>\s*</td>. I do however, advise against this approach.
This will match and put the values into regex groups, which you can then, later access.
NOTE: Also as per your comment, the regex you propose is susceptible to style changes, so if they change the width of the columns, your regex will most likely fail.
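For comparison, here is a minimal sketch of the parser-based route using PHP's built-in DOM extension rather than the Simple HTML DOM Parser mentioned above. The table markup and the numbers in it are invented for illustration and do not reflect the real page; table_rows() is a hypothetical helper name.

```php
<?php
// Invented sample markup; the real page structure will differ.
$html = <<<HTML
<table>
  <tr><td>Gold</td><td>1650.50</td><td>1651.50</td></tr>
  <tr><td>Silver</td><td>31.20</td><td>31.30</td></tr>
</table>
HTML;

// Hypothetical helper: returns every table row as an array of trimmed cell texts.
function table_rows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

$rows = table_rows($html);
print_r($rows);
```

Because the cells are selected by their position in the table rather than by attributes such as width, purely cosmetic changes to the page are less likely to break the extraction.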

Removing unwanted info from an RSS feed from facebook

Let's say, for example, we go to this page:
https://www.facebook.com/feeds/page.php?format=rss20&id=133869316660964
How can I programmatically strip data out of that, or pull pieces of data into a variable / array? For example, if you want the 2 latest posts (the part in the tag), is it possible to access that?
Sorry for such an open question, but I haven't found a solution after some time of searching.
Thanks.
Parse it with PHP’s DOM extension: http://php.net/manual/en/book.dom.php
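A minimal sketch of what that could look like, assuming the feed is standard RSS 2.0 with the newest posts listed first. The sample feed and the latest_items() helper are made up for illustration; a real feed would first be fetched as a string.

```php
<?php
// Made-up sample feed standing in for the downloaded RSS document.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example page</title>
    <item><title>Newest post</title><link>http://example.com/3</link></item>
    <item><title>Middle post</title><link>http://example.com/2</link></item>
    <item><title>Oldest post</title><link>http://example.com/1</link></item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns title and link of the first $limit <item> elements.
function latest_items(string $xml, int $limit): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $items = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        if (count($items) >= $limit) {
            break;
        }
        $title = $item->getElementsByTagName('title')->item(0);
        $link = $item->getElementsByTagName('link')->item(0);
        $items[] = [
            'title' => $title ? $title->textContent : '',
            'link' => $link ? $link->textContent : '',
        ];
    }
    return $items;
}

$latest = latest_items($rss, 2);
print_r($latest);
```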

Get feed titles from any RSS feed with php

I'm attempting to build an RSS reader. At the moment I'm using PHP's SimpleXML, for example:
$xml->item->title
However, this depends on the structure of the RSS feed itself; if the structure is different, it won't work. So I was wondering if there's a broader, less specific way to grab all the titles from an RSS feed.
Thanks a lot
There is an RSS specification document out there. You can find it at http://cyber.law.harvard.edu/rss/rss.html
Therefore, an RSS file always has the same structure, but be careful: there are also other feed formats, such as Atom.
You could use xPATH for searching within the RSS: http://nl.php.net/manual/en/simplexmlelement.xpath.php
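One possible way to do that, sketched under the assumption that only RSS (item elements) and Atom (entry elements) need to be covered. The feed_titles() helper and both sample feeds are invented for illustration. The XPath function local-name() is used so the Atom namespace does not have to be registered.

```php
<?php
// Hypothetical helper: collects the title of every RSS <item> or Atom <entry>.
function feed_titles(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }
    // local-name() compares element names while ignoring their namespace.
    $nodes = $feed->xpath(
        '//*[local-name()="item" or local-name()="entry"]/*[local-name()="title"]'
    );
    if (!$nodes) {
        return [];
    }
    return array_map(function ($node) {
        return (string) $node;
    }, $nodes);
}

// Invented sample feeds, one per format.
$rssSample = '<rss version="2.0"><channel><title>Channel</title>'
    . '<item><title>R1</title></item><item><title>R2</title></item>'
    . '</channel></rss>';
$atomSample = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Feed</title>'
    . '<entry><title>A1</title></entry></feed>';

$rssTitles = feed_titles($rssSample);
$atomTitles = feed_titles($atomSample);
print_r($rssTitles);
print_r($atomTitles);
```

The channel and feed titles are skipped on purpose, since the query only looks at titles that sit directly inside an item or entry.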
Maybe it's an option for you to use an existing contribution like this instead of reinventing the wheel: http://www.phpclasses.org/package/3724-PHP-Parse-and-display-items-of-an-RSS-feed.html
You can use a regex to filter the RSS file and split it into titles, etc., whatever you want, since you define which tags to grab data from.
Using something like: $regex = '/<(\w+)[^>]*>(.*?)<\/\1>/s';
preg_match_all($regex, $text, $match);
There are a lot of feed formats, and writing code that complies with all of them is a difficult task.
I recommend using SimplePie: http://simplepie.org/
Or you can also use the Google Feed API. Here is an example using SimplePie:
http://simplepie.org/wiki/setup/sample_page

How to filter data after using get contents

I want to know how to find a number on a remote website and make it a variable.
For example, if I want to find the stock quote for "AMZN", I would use curl or get contents on the page "http://stock-quotes.com/AMZN" to make it a variable string called $contents
Now that I have $contents, how would I find that AMZN quote? I was thinking of using a regular expression to narrow down the line, like finding "AMZN=35 points", and then perform another function to delete the "AMZN=" and " points" at the start and end of the string so that "35" is all that's left.
Is that how people do it?
1.) DOM Element
2.) Simple XML
3.) preg_match
4.) strpos
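For the preg_match option, a capture group makes the second "delete the prefix and suffix" step unnecessary. The sketch below assumes the page really contains text in the "AMZN=35 points" form from the question; the sample string and the extract_quote() helper are made up for illustration.

```php
<?php
// Hypothetical helper: returns the number between "SYMBOL=" and " points", or null.
function extract_quote(string $contents, string $symbol): ?float
{
    // preg_quote() keeps special characters in the symbol from being read as regex syntax.
    $pattern = '/' . preg_quote($symbol, '/') . '=(\d+(?:\.\d+)?) points/';
    if (preg_match($pattern, $contents, $m)) {
        return (float) $m[1]; // $m[1] holds only the captured number
    }
    return null;
}

// Made-up stand-in for the string fetched with curl or file_get_contents().
$contents = '<div>MSFT=28 points</div><div>AMZN=35 points</div>';
$quote = extract_quote($contents, 'AMZN');
var_dump($quote);
```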
What I've always done (say in spidering, etc.) is to use the simple_html_dom library in PHP, then inspect the markup for the site.
The downside, as mentioned before, is that if the markup changes, you'll need to modify your code, but usually it's fairly easy, and if you use a source that has informative markup (consistent class names on the elements you need, etc.), then it's even easier.
Library link: http://simplehtmldom.sourceforge.net/

Intelligently grab first paragraph/starting text

I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing?
update
For further clarification, I'm building a section of my site where users can submit links like on Facebook, it'll grab an image from their site as well as text to go with the link. I'm using PHP and trying to determine the best method of doing this.
I say "intelligently" because I'd like to try to get content on that page that's important, not just the first paragraph, but the first paragraph of the most important content.
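If the page you want to grab is foreign, or even if it is local but you don't know its structure in advance, I'd say the best way to achieve this is to use the PHP DOM functions.
function get_first_paragraph($url)
{
    $page = file_get_contents($url);
    if ($page === false) {
        return null; /* the page could not be fetched */
    }
    $doc = new DOMDocument();
    /* Real-world HTML is rarely valid, so collect parser warnings silently */
    libxml_use_internal_errors(true);
    $doc->loadHTML($page);
    libxml_clear_errors();
    /* Gets all the paragraphs */
    $paragraphs = $doc->getElementsByTagName('p');
    /* Extracts the first one; item() returns null if there is none */
    $first = $paragraphs->item(0);
    /* Returns the paragraph's content */
    return $first ? $first->textContent : null;
}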
Short answer: you can't.
In order to have a PHP script "intelligently" fetch the "most important" content from a page, the script would have to understand the content on the page. PHP is no natural language processor, nor is this a trivial area of study. There might be some NLP toolkits for PHP, but I still doubt it would be easy then.
A solution that can be achieved with reasonable effort would be to fetch the entire page with an HTML parser and then look for elements with certain class names or ids commonly found in blog engines. You could also parse for hAtom Microformats. Or you could look for meta tags within the document, which carry more clearly defined information.
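A minimal sketch of that fallback idea, assuming a particular order of preference: the Open Graph description, then the standard meta description, then the first paragraph inside an element with the hAtom class entry-content, then any first paragraph. The order, the page_summary() helper name, and the sample markup are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: tries progressively less specific sources for a summary.
function page_summary(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $queries = [
        '//meta[@property="og:description"]/@content',
        '//meta[@name="description"]/@content',
        '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]//p',
        '//p',
    ];
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null;
}

// Invented sample pages: one with a meta description, one without.
$withMeta = '<html><head><meta name="description" content="Short summary."></head>'
    . '<body><p>Navigation</p><div class="entry-content"><p>Main text.</p></div></body></html>';
$withoutMeta = '<html><body><p>Navigation</p>'
    . '<div class="entry-content"><p>Main text.</p></div></body></html>';

$summaryA = page_summary($withMeta);
$summaryB = page_summary($withoutMeta);
echo $summaryA, "\n", $summaryB, "\n";
```

This does not make the script understand the content; it only prefers sources where the page author has already marked what is important.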
I wrote a Python script a while ago to extract a web page's main article content. It uses a heuristic to scan all text nodes in a document and group together nodes at similar depths, and then assume the largest grouping is the main article.
Of course, this method has its limitations, and no method will work on 100% of web pages. This is just one approach, and there are many other ways you might accomplish it. You may also want to look at similar past questions on this subject.
