Trying to Parse Only the Images from an RSS Feed


First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando

The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
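As a small illustration of that equivalence, the sketch below parses two made-up <description> elements, one using CDATA and one using escaped entities. The sample strings and variable names are assumptions for the demo, not taken from the real feed.

```php
<?php
// Two hypothetical documents carrying the same HTML fragment.
$with_cdata = '<description><![CDATA[<img src="a.jpg" />]]></description>';
$escaped    = '<description>&lt;img src="a.jpg" /&gt;</description>';

$a = simplexml_load_string($with_cdata);
$b = simplexml_load_string($escaped);

// Both elements contain only text, so casting to string gives the raw HTML.
var_dump((string)$a === (string)$b);

// Neither element has child elements, which is why an XPath
// looking for description/img finds nothing.
var_dump(count($a->children()));
```

In both cases the parser sees a single run of text, so the HTML has to be parsed in a second step.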
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo (string)$image['src'];
}
}

I don't have much experience with xPath, but you could try the following:
$imgs = $xml->xpath('//item//img');
This will select all img elements which are inside item elements, regardless of whether there are other elements in between. The leading // searches for item anywhere in the document, not just directly under the root. Otherwise, you'd need something like /rss/channel/item....
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
echo '<img src="' . $image['src'] . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.
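For the CSS route, one possible sketch is below. It assumes the image URLs have already been collected into a plain array; the second URL, the function name render_image_row and the inline flex styling are made up for illustration.

```php
<?php
// Hypothetical list of URLs, standing in for the values collected from the feed.
$urls = [
    'http://mywebsite.com/myimages/1521197-main.jpg',
    'http://mywebsite.com/myimages/example-2.jpg',
];

// Build one flex container so the images sit side by side.
function render_image_row(array $urls): string
{
    $html = '<div style="display:flex; gap:8px;">';
    foreach ($urls as $url) {
        // Escape the URL so that it is safe inside an HTML attribute.
        $html .= '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" alt="" />';
    }
    return $html . '</div>';
}

echo render_image_row($urls);
```

Moving the style into a stylesheet class would be tidier, but the inline version keeps the example in one place.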

Related

Xpath query for HTML table within XML in PHP DOMDocument

I have an XML file with following tree structure.
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>Videos</title>
<link>https://www.example.com/r/videos/</link>
<description>A long description of the video.</description>
<image>...</image>
<atom:link rel="self" href="http://www.example.com/videos/.xml" type="application/rss+xml"/>
<item>
<title>The most used Jazz lick in history.</title>
<link>
http://www.example.com/
</link>
<guid isPermaLink="true">
http://www.example.com/
</guid>
<pubDate>Mon, 07 Sep 2015 14:43:34 +0000</pubDate>
<description>
<table>
<tr>
<td>
<a href="http://www.example.com/">
<img src="http://www.example.com/.jpg" alt="The most used Jazz lick in history." title="The most used Jazz lick in history." />
</a>
</td>
<td> submitted by
jcepiano
<br/>
[link]
<a href="http://www.example.com/">
[508 comments]
</a>
</td>
</tr>
</table>
</description>
<media:title>The most used Jazz lick in history.</media:title>
<media:thumbnail url="http://example.jpg"/>
</item>
</channel>
</rss>
Here, the html table element is embedded inside XML and that's confusing me.
Now I want to pick the text node values for //channel/item/title and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
I am using PHP DOMDocument();
I have been looking for a perfect solution for this for 2 days now, can you please let me know how would this happen?
Also I need to count the total number of items in the feed, right now I am doing like this:
...
$queryResult = $xpathvar->query('//item/title');
$total = 1;
foreach($queryResult as $result){
$total++;
}
echo $title;
And I also need a reference link for XPath query selectors' rules.
Thanks in advance! :)

You wrote that you wanted the length of the result set of the following query:
$queryResult = $xpathvar->query('//item/title');
I assume that $xpathvar here is of type DOMXPath. If so, it has a length property as described here. Instead of using foreach, simply use:
$length = $xpathvar->query('//item/title')->length;
Now I want to pick the text node values for //channel/item/title
Which you can get with the expression //channel/item/title/text().
and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Your expression here selects any tr, the first td under that, then the first a. But the first a does not have a value of "[link]" in your source. If you want that, though, you can use:
//channel/item/description/table/tr/td[1]/a[1]/@href
but it looks like you rather want:
//channel/item/description/table/tr/td/a[. = "[link]"][1]/@href
which finds the first a element in the tree that has the value (text node) that is "[link]".
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
Not sure if this was a separate question or meant to explain the previous one. Regardless, the answer is the same as in the previous one, unless you explicitly want to search for the 2nd a etc. (i.e., search by position), in which case you can use numeric predicates.
Note: you start most of your expressions with //expr, which essentially means: search the whole tree at any depth for the expression expr. This is potentially expensive, and if all you need is a (relative) root node for which you know the starting point or expression, it is better, and far more performant, to use a direct path. In your case, you can replace //channel with /*/channel (because it is the first under the root element).
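Put together, the points above might look like the following sketch. The inline feed is a simplified, hypothetical stand-in for the real one (it assumes the table is real XML inside <description> and that an <a> element contains exactly the text "[link]").

```php
<?php
// Hypothetical, simplified feed; the element names follow the question.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <item>
      <title>First</title>
      <description>
        <table><tr>
          <td><a href="http://www.example.com/thumb">thumb</a></td>
          <td><a href="http://www.example.com/video">[link]</a>
              <a href="http://www.example.com/comments">[2 comments]</a></td>
        </tr></table>
      </description>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Count the items without a loop, using a direct path instead of //.
$count = $xpath->query('/*/channel/item')->length;

// Select the href attribute of every <a> whose string value is "[link]".
$hrefs = [];
foreach ($xpath->query('/*/channel/item/description/table/tr/td/a[. = "[link]"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo $count, "\n", implode("\n", $hrefs), "\n";
```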

I finally could make it work with the code below
$url = "https://www.example.com/r/videos/.xml";
$feed_dom = new DOMDocument;
// preserveWhiteSpace only takes effect if it is set before loading
$feed_dom->preserveWhiteSpace = false;
$feed_dom->load($url);
$items = $feed_dom->getElementsByTagName('item');
foreach($items as $item){
$title = $item->getElementsByTagName('title')->item(0)->nodeValue;
$desc_table = $item->getElementsByTagName('description')->item(0)->nodeValue;
echo $title . "<br>";
$table_dom = new DOMDocument;
$table_dom->preserveWhiteSpace = false;
$table_dom->loadHTML($desc_table);
$xpath = new DOMXpath($table_dom);
$yt_link_node = $xpath->query("//table/tr/td[2]/a[2]");
foreach($yt_link_node as $yt_link){
$yt = $yt_link->getAttribute('href');
echo $yt . "<br>";
echo "<br>";
}
}
I thank Abel, your help was greatly useful to achieve the tasks. :)

PHP - Parse XML with HTML elements inside

I'm trying to read XML which has HTML inside an element. It is NOT enclosed in CDATA tags, which is the problem because any XML parser I use tries to parse it as XML.
The point in the XML where it dies:
<item>
<title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title>
</item>
Error message:
Warning: XMLReader::readOuterXml(): (xml file here) parser error : Opening and ending tag mismatch: img line 1 and title in (php file here)
I know how to get HTML out of an XML element but the parser doesn't like the fact that it's an open tag and it can't find the closing tag so it dies and I can't get any further.
Now, I don't actually need the <title> element so if there is a way to ignore it, that would work as the information I need is in only two child nodes of the <item> parent.
If anyone can see a workaround to this, that would be great.
Update
Using Christian Gollhardt's suggestions, I've managed to load the XML into an object but I get the same problem I did before where I have issues getting the CDATA from the <description> element.
This is the CDATA I should get:
<description>
<![CDATA[<a href="https://twitter.com/menomatters" >#menomatters</a> <a href="https://twitter.com/physicool1" >#physicool1</a> will chill my own "personal summer". <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"><img src="https://abs.twimg.com/emoji/v1/72x72/2600.png" draggable="false" alt="☀️" aria-label="Emoji: Black sun with rays">]]>
</description>
This is what I end up with:
["description"]=>
string(54) "#menomatters will chill my own "personal summer". ]]>"
Looks like an issue with closing tags again?

Take a look at DOMDocument. You can either work directly with it, or you can write a function which gives you a cleaned document.
Clean Methods:
function tidyXml($xml) {
$doc = new DOMDocument();
if (@$doc->loadHTML($xml)) {
$output = '';
//Dom Document creates <html><body><myxml></body></html>, so we need to remove it
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
$output .= $doc->saveXML($child);
}
return $output;
} else {
throw new Exception('Document can not be cleaned');
}
}
function getSimpleXml($xml) {
return new SimpleXMLElement(tidyXml($xml));
}
Implementation
$xml= '<item><title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title></item>';
$myxml = getSimpleXml($xml);
$titleNodeCollection =$myxml->xpath('/item/title');
foreach ($titleNodeCollection as $titleNode) {
$titleText = (string)$titleNode;
$imageUrl = (string)$titleNode->img['src'];
$innerContent = str_replace(['<title>', '</title>'], '', $titleNode->asXML());
var_dump($titleText, $imageUrl, $innerContent);
}
Enjoy!

Trying to Parse Images and Text from an RSS Feed

This is a continuation of the thread here: Trying to Parse Only the Images from an RSS Feed
This time I want to parse both Images and Certain Items from an RSS feed. A Sampling of the RSS feed looks like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
I have the following code below where I try to parse image and text:
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1');
$descriptions = $xml->xpath('//item/description');
$mytitle= $xml->xpath('//item/title');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo "<img id=poster class=poster src={$image['src']}> {$mytitle}";
}
}
The above code extracts the images beautifully.... However, it does not extract the $mytitle (which would be "Article One") tag as I try on the last line of my code. This is supposed to extract from all items in the RSS feed.
Can anyone help me figure this one out please.
Many thanks,
Hernando

xpath() always returns an array (see http://www.php.net/manual/en/simplexmlelement.xpath.php), even if just one element is the result. If you know you will expect one element, you can simply use $mytitle[0].
You will have to iterate over each <item/> element, as otherwise you can't know which description and which title belong together. So the following should work:
$xml = simplexml_load_file('test.xml');
$items = $xml->xpath('//item');
foreach ( $items as $item) {
$descriptions = $item->description;
$mytitle = $item->title;
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo "<img id=\"poster\" class=\"poster\" src=\"{$image['src']}\"> {$mytitle}";
}
}
}
By the way, I also added "" to your <img/> element. I guess you want that, as this looks very much like XML/HTML.

Parsing meta data in RSS feed PHP

I am trying to extract the IMG SRC value out of the below RSS feed (only partial feed below).
I am currently using XML parser to get the rest of the items - which works fine (e.g.):
foreach($xml['RSS']['CHANNEL']['ITEM'] as $item)
{
...
$title = $item['TITLE'];
$description = $item['DESCRIPTION'];
$link = $item['LINK'];
$desc_imgsrc = <how do i get this for below RSS feed??>;
...
}
However, how do I get the IMG SRC value from the below RSS feed into a PHP variable? Specifically, I am trying to extract the "http://thumbnails.---.com/VCPS/sm.jpg" string into the $desc_imgsrc variable above. How can I adapt the above code to do that? Thanks.
<item>
<title>Electric Cars - all about them</title>
<metadata:title xmlns:metadata="http://search.--.com/rss/2.0/Metadata">This is the title metadata</metadata:title>
<description>This is the description</description>
<metadata:description xmlns:metadata="http://search.---.com/rss/2.0/>
<![CDATA[<div class="rss_image" style="float:left;padding-right:10px;"><img border="0" vspace="0" hspace="0" width="10" src="http://thumbnails.---.com/VCPS/sm.jpg"></div><div class="rss_abstract" style="font:Arial 12px;width:100%;float:left;clear:both">This is the description</div>]]></metadata:description>
<pubDate>Fri, 25 Nov 2011 07:00 GMT</pubDate>

This is HTML (XML) inside an XML CDATA element. CDATA (character data) is not parsed by the XML parser. You need to extract the value the same way you did with the other elements. Then you can parse the element value, either by using a regular expression or even better use an XML parser again (if the HTML data is valid XML).

$doc = new DomDocument;
@$doc->loadHTML(...); // html string
// use @ to suppress the warning due to mixture of xml and html
$items = $doc->getElementsByTagName('img');
foreach ($items as $item)
{
$src = $item->getAttribute('src');
}
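A fuller sketch of the same idea follows. It uses SimpleXML rather than the array-based parser from the question, and the namespace URI, feed fragment and image URL are hypothetical stand-ins, since the real ones are elided in the question.

```php
<?php
// Hypothetical namespace URI and feed fragment; the real values are elided in the question.
$ns = 'http://example.com/rss/2.0/Metadata';
$xml = <<<XML
<rss version="2.0"><channel><item>
  <title>Electric Cars</title>
  <metadata:description xmlns:metadata="$ns"><![CDATA[<div class="rss_image"><img width="10" src="http://example.com/sm.jpg"></div>]]></metadata:description>
</item></channel></rss>
XML;

$feed = simplexml_load_string($xml);
$desc_imgsrc = null;

foreach ($feed->channel->item as $item) {
    // children($ns) selects the child elements that live in the given namespace.
    $html = (string)$item->children($ns)->description;

    // Collect libxml warnings instead of printing them while parsing the HTML.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Take the first <img>, if the description has one.
    $img = $dom->getElementsByTagName('img')->item(0);
    $desc_imgsrc = $img ? $img->getAttribute('src') : null;
    echo $desc_imgsrc, "\n";
}
```

libxml_use_internal_errors() is used here as an alternative to the @ operator; either approach keeps the HTML parser's warnings out of the output.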

Help parsing XML with DOMDocument

I am trying to parse a youtube playlist field.
The URL is: http://gdata.youtube.com/feeds/api/playlists/664AA68C6E6BA19B?v=2
I need: Title, Video ID, and Default thumbnail.
I can easily get the title but I'm a little lost when it comes to the nested elements
$data = new DOMDocument();
if($data->load("http://gdata.youtube.com/feeds/api/playlists/664AA68C6E6BA19B?v=2"))
{
foreach ($data->getElementsByTagName('entry') as $video)
{
$title = $video->getElementsByTagName('title')->item(0)->nodeValue;
$id = ??
$thumb = ??
}
}
Here is the XML (I have stripped out the elements that are irrelevant for this example)
<entry gd:etag="W/"AkYGSXc9cSp7ImA9Wx9VGEk."">
<title>A GoPro Weekend On The Ice</title>
<media:group>
<media:thumbnail url="http://i.ytimg.com/vi/yk6wkfVNFQE/default.jpg" height="90" width="120" time="00:02:07" yt:name="default" />
<yt:videoid>yk6wkfVNFQE</yt:videoid>
</media:group>
</entry>
I need the "videoid" and the "url" from thumbnail-default
Thank you!

Similar to the getElementsByTagName() that you're already using, to access namespaced elements (recognisable by namespace:element-name) you can use the getElementsByTagNameNS() method.
The documentation (linked above) should give you the technical lowdown on how to use it. Suffice to say, it will be similar to the following (also using getAttribute()).
$yt = 'http://gdata.youtube.com/schemas/2007';
$media = 'http://search.yahoo.com/mrss/';
// Inside your loop
$id = $video->getElementsByTagNameNS($yt, 'videoid')->item(0)->nodeValue;
$thumb = $video->getElementsByTagNameNS($media, 'thumbnail')->item(0)->getAttribute('url');
Hopefully that should give you a spring-board to leap into accessing namespaced items within your XML documents.
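As a self-contained sketch of how this could fit into the loop from the question: the inline XML below is a trimmed stand-in for the playlist feed, with the namespace declarations assumed to sit on a <feed> root element, and the $videos array is a made-up name for collecting the results.

```php
<?php
$yt    = 'http://gdata.youtube.com/schemas/2007';
$media = 'http://search.yahoo.com/mrss/';

// Trimmed-down stand-in for the playlist feed, with namespace declarations added.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="$media" xmlns:yt="$yt">
  <entry>
    <title>A GoPro Weekend On The Ice</title>
    <media:group>
      <media:thumbnail url="http://i.ytimg.com/vi/yk6wkfVNFQE/default.jpg" height="90" width="120" yt:name="default" />
      <yt:videoid>yk6wkfVNFQE</yt:videoid>
    </media:group>
  </entry>
</feed>
XML;

$data = new DOMDocument();
$data->loadXML($xml);

$videos = [];
foreach ($data->getElementsByTagName('entry') as $video) {
    $thumb = null;
    foreach ($video->getElementsByTagNameNS($media, 'thumbnail') as $candidate) {
        // Pick the thumbnail whose yt:name attribute is "default".
        if ($candidate->getAttributeNS($yt, 'name') === 'default') {
            $thumb = $candidate->getAttribute('url');
            break;
        }
    }
    $videos[] = [
        'title' => $video->getElementsByTagName('title')->item(0)->nodeValue,
        'id'    => $video->getElementsByTagNameNS($yt, 'videoid')->item(0)->nodeValue,
        'thumb' => $thumb,
    ];
}

print_r($videos);
```

Checking yt:name matters when an entry carries several thumbnails, since item(0) alone would simply return whichever comes first.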
