Parsing metadata in an RSS feed with PHP

I am trying to extract the IMG SRC value from the RSS feed below (only part of the feed is shown).
I am currently using an XML parser to get the rest of the items, which works fine, e.g.:
foreach($xml['RSS']['CHANNEL']['ITEM'] as $item)
{
...
$title = $item['TITLE'];
$description = $item['DESCRIPTION'];
$link = $item['LINK'];
$desc_imgsrc = <how do i get this for below RSS feed??>;
...
}
However, how do I get the IMG SRC value from the RSS feed below into a PHP variable? Specifically, I am trying to extract the string "http://thumbnails.---.com/VCPS/sm.jpg" into the $desc_imgsrc variable above. How can I adapt the code above to do that? Thanks.
<item>
<title>Electric Cars - all about them</title>
<metadata:title xmlns:metadata="http://search.--.com/rss/2.0/Metadata">This is the title metadata</metadata:title>
<description>This is the description</description>
<metadata:description xmlns:metadata="http://search.---.com/rss/2.0/>
<![CDATA[<div class="rss_image" style="float:left;padding-right:10px;"><img border="0" vspace="0" hspace="0" width="10" src="http://thumbnails.---.com/VCPS/sm.jpg"></div><div class="rss_abstract" style="font:Arial 12px;width:100%;float:left;clear:both">This is the description</div>]]></metadata:description>
<pubDate>Fri, 25 Nov 2011 07:00 GMT</pubDate>

The <img> tag is HTML inside an XML CDATA section. CDATA (character data) is not parsed by the XML parser, so the markup arrives as plain text. First extract the element's value the same way you did for the other elements. Then parse that value, either with a regular expression or, better, with a second parser pass (an XML parser if the HTML is well-formed XML, otherwise an HTML parser such as DOMDocument::loadHTML):

$doc = new DOMDocument;
// $html is a placeholder for the HTML string taken from the CDATA section
// use @ to suppress the warnings caused by the mixture of XML and HTML
@$doc->loadHTML($html);
$items = $doc->getElementsByTagName('img');
foreach ($items as $item)
{
    $src = $item->getAttribute('src');
}
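Putting the two steps together for the feed in the question might look like the sketch below. It assumes SimpleXML rather than the array-based parser from the question, and the namespace URI `http://example.com/rss/2.0/Metadata`, the image URL, and the helper name `firstImageSrc` are placeholders invented for illustration (the real values are elided in the question).

```php
<?php
// Hypothetical sample feed: the namespace URI and image URL are placeholders.
$feed = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>Electric Cars - all about them</title>
      <metadata:description xmlns:metadata="http://example.com/rss/2.0/Metadata"><![CDATA[<div class="rss_image"><img width="10" src="http://example.com/VCPS/sm.jpg"></div><div class="rss_abstract">This is the description</div>]]></metadata:description>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns the src of the first <img> in an HTML fragment, or null.
function firstImageSrc(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input on PHP 8
    }
    $doc = new DOMDocument();
    // @ suppresses the warnings libxml raises for sloppy HTML
    if (!@$doc->loadHTML($html)) {
        return null;
    }
    $imgs = $doc->getElementsByTagName('img');
    return $imgs->length > 0 ? $imgs->item(0)->getAttribute('src') : null;
}

$xml = simplexml_load_string($feed);
$desc_imgsrc = null;
foreach ($xml->channel->item as $item) {
    // Step 1: read the namespaced element; the string cast yields the CDATA text
    $meta = $item->children('http://example.com/rss/2.0/Metadata');
    // Step 2: parse that text as HTML and pull out the src attribute
    $desc_imgsrc = firstImageSrc((string) $meta->description);
    echo $desc_imgsrc, PHP_EOL;
}
```

With the array-based parser from the question, only step 2 is needed: pass whatever string the parser returns for the metadata description to the helper.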

Related

PHP not returning full XML file contents [duplicate]

I'm trying to read an RSS feed from Flickr but it has some nodes which are not readable by Simple XML (media:thumbnail, flickr:profile, and so on).
How do I get round this? My head hurts when I look at the documentation for the DOM. So I'd like to avoid it as I don't want to learn.
I'm trying to get the thumbnail by the way.
The solution is explained in this nice article. You need the children() method for accessing XML elements which contain a namespace. This code snippet is quoted from the article:
$feed = simplexml_load_file('http://www.sitepoint.com/recent.rdf');
foreach ($feed->item as $item) {
$ns_dc = $item->children('http://purl.org/dc/elements/1.1/');
echo $ns_dc->date;
}
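For the thumbnail case in the question, a minimal sketch of the same children() technique follows. It assumes the feed uses the Media RSS namespace `http://search.yahoo.com/mrss/` for the `media:` prefix and stores the image location in a `url` attribute; the sample feed and its URL are made up for illustration.

```php
<?php
// Hypothetical sample feed using the Media RSS namespace for media:thumbnail.
$feed = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample photo</title>
      <media:thumbnail url="http://example.com/thumb.jpg" width="75" height="75" />
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);
$thumbnails = [];
foreach ($xml->channel->item as $item) {
    // Select the children that live in the media namespace
    $media = $item->children('http://search.yahoo.com/mrss/');
    // The url attribute itself is not namespaced, so attributes() without arguments finds it
    $thumbnails[] = (string) $media->thumbnail->attributes()->url;
}
print_r($thumbnails);
```

Passing the namespace URI rather than the prefix keeps the code working even if the feed later declares the same namespace under a different prefix.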
With the latest version, you can now reference colon nodes with curly brackets.
$item->{'itunes:duration'}
You're dealing with a namespace? I think you need to use the ->children method.
$ns_dc = $item->children('http://namespace.org/');
Can you provide a snippet with the xml declaration?
An even simpler way to access namespaced XML nodes in PHP, without declaring the namespace, is DOMDocument::getElementsByTagName(), which matches on the local name of the tag.
To get the value of <su:authorEmail> from the following source:
<item>
<title>My important article</title>
<pubDate>Mon, 29 Feb 2017 00:00:00 +0000</pubDate>
<link>https://myxmlsource.com/32984</link>
<guid>https://myxmlsource.com/32984</guid>
<author>Blogs, Jo</author>
<su:departments>
<su:department>Human Affairs</su:department>
</su:departments>
<su:authorHash>4f329b923419b3cb2c654d615e22588c</su:authorHash>
<su:authorEmail>hIwW14tLc+4l/oo7agmRrcjwe531u+mO/3IG3xe5jMg=</su:authorEmail>
<dc:identifier>/32984/Download/0032984-11042.docx</dc:identifier>
<dc:format>Journal article</dc:format>
<dc:creator>Blogs, Jo</dc:creator>
<slash:comments>0</slash:comments>
</item>
Use the following code:
$rss = new DOMDocument();
$rss->load('https://myxmlsource.com/rss/xml');
$nodes = $rss->getElementsByTagName('item');
foreach ($nodes as $node) {
    // getElementsByTagName() matches on the local name, so the su: prefix is omitted
    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    $author = $node->getElementsByTagName('author')->item(0)->nodeValue;
    $authorHash = $node->getElementsByTagName('authorHash')->item(0)->nodeValue;
    $department = $node->getElementsByTagName('department')->item(0)->nodeValue;
    // decryptEmail() is not a PHP built-in; it is assumed to be defined elsewhere
    $email = decryptEmail($node->getElementsByTagName('authorEmail')->item(0)->nodeValue);
}
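If two namespaces in a feed share a local name, or an element may be missing, a namespace-aware variant is safer. The sketch below uses DOMElement::getElementsByTagNameNS() and checks the result length before reading; the namespace URI `http://example.com/ns/su` is a placeholder, since the real declaration is not shown in the source above.

```php
<?php
// Hypothetical feed: the su namespace URI is a placeholder.
$feed = <<<'XML'
<rss version="2.0" xmlns:su="http://example.com/ns/su">
  <channel>
    <item>
      <title>My important article</title>
      <su:authorHash>4f329b923419b3cb2c654d615e22588c</su:authorHash>
    </item>
    <item>
      <title>Article without a hash</title>
    </item>
  </channel>
</rss>
XML;

$rss = new DOMDocument();
$rss->loadXML($feed);

$hashes = [];
foreach ($rss->getElementsByTagName('item') as $node) {
    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    // Match on namespace URI plus local name instead of the local name alone
    $matches = $node->getElementsByTagNameNS('http://example.com/ns/su', 'authorHash');
    // item(0) would be null for a missing element, so check the length first
    $hashes[$title] = $matches->length > 0 ? $matches->item(0)->nodeValue : null;
}
print_r($hashes);
```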

PHP - Parse XML with HTML elements inside

I'm trying to read XML which has HTML inside an element. It is NOT enclosed in CDATA tags, which is the problem because any XML parser I use tries to parse it as XML.
The point in the XML where it dies:
<item>
<title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title>
</item>
Error message:
Warning: XMLReader::readOuterXml(): (xml file here) parser error : Opening and ending tag mismatch: img line 1 and title in (php file here)
I know how to get HTML out of an XML element, but the parser treats <img> as an open tag, cannot find its closing tag, and dies, so I can't get any further.
Now, I don't actually need the <title> element so if there is a way to ignore it, that would work as the information I need is in only two child nodes of the <item> parent.
If anyone can see a workaround to this, that would be great.
Update
Using Christian Gollhardt's suggestions, I've managed to load the XML into an object but I get the same problem I did before where I have issues getting the CDATA from the <description> element.
This is the CDATA I should get:
<description>
<![CDATA[<a href="https://twitter.com/menomatters" >#menomatters</a> <a href="https://twitter.com/physicool1" >#physicool1</a> will chill my own "personal summer". <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"><img src="https://abs.twimg.com/emoji/v1/72x72/2600.png" draggable="false" alt="☀️" aria-label="Emoji: Black sun with rays">]]>
</description>
This is what I end up with:
["description"]=>
string(54) "#menomatters will chill my own "personal summer". ]]>"
Looks like an issue with closing tags again?
Take a look at DOMDocument. You can either work directly with it, or you can write a function which gives you a cleaned document.
Clean Methods:
function tidyXml($xml) {
    $doc = new DOMDocument();
    // @ suppresses the warnings libxml raises for unknown or unclosed tags
    if (@$doc->loadHTML($xml)) {
        $output = '';
        // DOMDocument wraps the input as <html><body><myxml></body></html>, so keep only the children of <body>
        foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
            $output .= $doc->saveXML($child);
        }
        return $output;
    } else {
        throw new Exception('Document can not be cleaned');
    }
}
function getSimpleXml($xml) {
return new SimpleXMLElement(tidyXml($xml));
}
Implementation
$xml = '<item><title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title></item>';
$myxml = getSimpleXml($xml);
$titleNodeCollection = $myxml->xpath('/item/title');
foreach ($titleNodeCollection as $titleNode) {
    $titleText = (string)$titleNode;
    $imageUrl = (string)$titleNode->img['src'];
    $innerContent = str_replace(['<title>', '</title>'], '', $titleNode->asXML());
    var_dump($titleText, $imageUrl, $innerContent);
}
Enjoy!
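For the CDATA problem described in the update, a likely cause is that running the whole document through the HTML parser also reinterprets the CDATA section as HTML. Since the question says the <title> element is not needed, one alternative (not the approach from the answer above) is to strip the malformed <title> with a regular expression and then parse the rest as ordinary XML. This is only a sketch: it assumes <title> is the only malformed element and that "</title>" never appears inside a title, and the sample uses alt="fire" in place of the emoji.

```php
<?php
// Sample item modelled on the question: <title> holds an unclosed <img>,
// while <description> is properly wrapped in CDATA.
$raw = <<<'XML'
<item>
<title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="fire"></title>
<description><![CDATA[<a href="https://twitter.com/menomatters">#menomatters</a> will chill my own "personal summer". <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" alt="fire">]]></description>
</item>
XML;

// Remove every <title>...</title> block (non-greedy, spanning newlines)
$clean = preg_replace('~<title\b[^>]*>.*?</title>~s', '', $raw);

// What remains is well-formed, so a normal XML parser accepts it
$item = simplexml_load_string($clean);

// The string cast returns the CDATA content untouched, tags included
$description = (string) $item->description;
echo $description, PHP_EOL;
```

The resulting string can then be handed to DOMDocument::loadHTML() if the links or images inside it are needed.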

Trying to Parse Images and Text from an RSS Feed

This is a continuation of the thread here: Trying to Parse Only the Images from an RSS Feed
This time I want to parse both Images and Certain Items from an RSS feed. A Sampling of the RSS feed looks like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
I have the following code below where I try to parse image and text:
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1');
$descriptions = $xml->xpath('//item/description');
$mytitle= $xml->xpath('//item/title');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo "<img id=poster class=poster src={$image['src']}> {$mytitle}";
}
}
The above code extracts the images beautifully.... However, it does not extract the $mytitle (which would be "Article One") tag as I try on the last line of my code. This is supposed to extract from all items in the RSS feed.
Can anyone help me figure this one out please.
Many thanks,
Hernando
xpath() always returns an array (see http://www.php.net/manual/en/simplexmlelement.xpath.php), even when only one element matches. If you expect exactly one element, you can simply use $mytitle[0].
You will have to iterate over each <item/> element, as otherwise you can't know which description and which title belong together. So the following should work:
$xml = simplexml_load_file('test.xml');
$items = $xml->xpath('//item');
foreach ( $items as $item) {
$descriptions = $item->description;
$mytitle = $item->title;
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo "<img id=\"poster\" class=\"poster\" src=\"{$image['src']}\"> {$mytitle}";
}
}
}
By the way, I also added quotes ("") around the attribute values of your <img/> element. I guess you want that, as this looks very much like XML/HTML.
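The per-item pairing can also be collected into an array first, which separates parsing from output. The following is a self-contained sketch under a few assumptions: the sample feed is invented, the function name `imagesByTitle` is hypothetical, and titles are assumed to be unique because they are used as array keys.

```php
<?php
// Invented two-item feed: one description with an image, one without.
$feed = <<<'XML'
<rss version="2.0"><channel>
<item>
<title>Article One</title>
<description><![CDATA[<div><img src="http://example.com/myimages/1521197-main.jpg" width="120" border="0" /><ul><li>Rating: 5</li></ul></div>]]></description>
</item>
<item>
<title>Article Two</title>
<description><![CDATA[<p>No image here</p>]]></description>
</item>
</channel></rss>
XML;

// Hypothetical helper: maps each item title to the image URLs in its description.
function imagesByTitle(string $rssXml): array
{
    $result = [];
    $xml = simplexml_load_string($rssXml);
    foreach ($xml->xpath('//item') as $item) {
        $html = trim((string) $item->description);
        $srcs = [];
        if ($html !== '') {
            $dom = new DOMDocument();
            @$dom->loadHTML($html); // @ hides warnings about sloppy HTML
            foreach ($dom->getElementsByTagName('img') as $img) {
                $srcs[] = $img->getAttribute('src');
            }
        }
        $result[(string) $item->title] = $srcs;
    }
    return $result;
}

$map = imagesByTitle($feed);
foreach ($map as $title => $srcs) {
    foreach ($srcs as $src) {
        // Escape both values before writing them into HTML
        echo '<img class="poster" src="' . htmlspecialchars($src) . '"> ' . htmlspecialchars($title), PHP_EOL;
    }
}
```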

Trying to Parse Only the Images from an RSS Feed

First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando
The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
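The equivalence between CDATA and escaping can be checked directly. In this small sketch (the two sample strings are invented for illustration), both documents yield the same text, and in neither case does the parser see an <img> child element.

```php
<?php
// The same HTML fragment embedded two ways: as CDATA and as escaped text
$withCdata = '<description><![CDATA[<img src="a.jpg">]]></description>';
$escaped   = '<description>' . htmlspecialchars('<img src="a.jpg">') . '</description>';

$fromCdata   = simplexml_load_string($withCdata);
$fromEscaped = simplexml_load_string($escaped);

// Both string casts give the raw markup as plain text
$a = (string) $fromCdata;
$b = (string) $fromEscaped;
var_dump($a === $b);

// Neither document contains an actual <img> element
var_dump(count($fromCdata->children()), count($fromEscaped->children()));
```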
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo (string)$image['src'];
}
}
I don't have much experience with xPath, but you could try the following:
$imgs = $xml->xpath('item//img');
This will select all img elements which are inside item elements, regardless of whether there are other elements in between. Removing the leading slash will search for item anywhere in the document, not just from the root. Otherwise, you'd need something like /rss/channel/item....
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
    // src is an attribute, so it is read with array syntax rather than ->src
    echo '<img src="' . $image['src'] . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.
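A CSS-based version could look like the sketch below, which wraps the images in a flex container so they sit side by side. The function name `renderImageRow`, the inline style, and the sample URLs are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical helper: renders image URLs side by side inside a flex container.
function renderImageRow(array $srcs): string
{
    $html = '<div style="display:flex;gap:10px">';
    foreach ($srcs as $src) {
        // Escape the URL so characters such as & and " cannot break the attribute
        $html .= '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '" alt="">';
    }
    return $html . '</div>';
}

$row = renderImageRow([
    'http://example.com/a.jpg',
    'http://example.com/b.jpg?x=1&y=2',
]);
echo $row, PHP_EOL;
```

In a real page the style would normally move into a stylesheet class instead of an inline attribute.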
