Help parsing XML with DOMDocument - php

I am trying to parse a YouTube playlist feed.
The URL is: http://gdata.youtube.com/feeds/api/playlists/664AA68C6E6BA19B?v=2
I need: Title, Video ID, and Default thumbnail.
I can easily get the title but I'm a little lost when it comes to the nested elements
$data = new DOMDocument();
if ($data->load("http://gdata.youtube.com/feeds/api/playlists/664AA68C6E6BA19B?v=2"))
{
    foreach ($data->getElementsByTagName('entry') as $video)
    {
        $title = $video->getElementsByTagName('title')->item(0)->nodeValue;
        $id = ??
        $thumb = ??
    }
}
Here is the XML (I have stripped out the elements that are irrelevant for this example)
<entry gd:etag="W/&quot;AkYGSXc9cSp7ImA9Wx9VGEk.&quot;">
<title>A GoPro Weekend On The Ice</title>
<media:group>
<media:thumbnail url="http://i.ytimg.com/vi/yk6wkfVNFQE/default.jpg" height="90" width="120" time="00:02:07" yt:name="default" />
<yt:videoid>yk6wkfVNFQE</yt:videoid>
</media:group>
</entry>
I need the "videoid" and the "url" from thumbnail-default
Thank you!

Similar to the getElementsByTagName() method that you're already using, you can access namespaced elements (recognisable by their prefix:element-name form) with the getElementsByTagNameNS() method.
The documentation gives the full technical details. In short, it takes the namespace URI and the local element name, so usage will be similar to the following (which also uses getAttribute()).
$yt = 'http://gdata.youtube.com/schemas/2007';
$media = 'http://search.yahoo.com/mrss/';
// Inside your loop
$id = $video->getElementsByTagNameNS($yt, 'videoid')->item(0)->nodeValue;
$thumb = $video->getElementsByTagNameNS($media, 'thumbnail')->item(0)->getAttribute('url');
Hopefully that should give you a spring-board to leap into accessing namespaced items within your XML documents.
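To make the approach above concrete, here is a minimal sketch that runs against an inline copy of the sample entry instead of the live feed. The xmlns declarations on the feed element and the second (hqdefault) thumbnail are assumptions added for illustration; they are not part of the XML posted in the question.

```php
<?php
// Inline sample; the namespace declarations and the hqdefault thumbnail
// are illustrative assumptions, not copied from the real feed.
$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:yt="http://gdata.youtube.com/schemas/2007">
  <entry>
    <title>A GoPro Weekend On The Ice</title>
    <media:group>
      <media:thumbnail url="http://i.ytimg.com/vi/yk6wkfVNFQE/default.jpg" height="90" width="120" yt:name="default" />
      <media:thumbnail url="http://i.ytimg.com/vi/yk6wkfVNFQE/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
      <yt:videoid>yk6wkfVNFQE</yt:videoid>
    </media:group>
  </entry>
</feed>
XML;

$yt    = 'http://gdata.youtube.com/schemas/2007';
$media = 'http://search.yahoo.com/mrss/';

$data = new DOMDocument();
$data->loadXML($xml);

$videos = [];
foreach ($data->getElementsByTagName('entry') as $video) {
    $title = $video->getElementsByTagName('title')->item(0)->nodeValue;
    $id    = $video->getElementsByTagNameNS($yt, 'videoid')->item(0)->nodeValue;

    // Select the thumbnail whose yt:name attribute is "default" rather
    // than assuming it is the first one listed.
    $thumb = null;
    foreach ($video->getElementsByTagNameNS($media, 'thumbnail') as $node) {
        if ($node->getAttributeNS($yt, 'name') === 'default') {
            $thumb = $node->getAttribute('url');
            break;
        }
    }

    $videos[] = ['title' => $title, 'id' => $id, 'thumb' => $thumb];
}

print_r($videos);
```

Checking the yt:name attribute is a design choice: it keeps the code working if the feed lists the thumbnails in a different order.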

Related

PHP not returning full XML file contents [duplicate]

I'm trying to read an RSS feed from Flickr but it has some nodes which are not readable by Simple XML (media:thumbnail, flickr:profile, and so on).
How do I get round this? My head hurts when I look at the documentation for the DOM. So I'd like to avoid it as I don't want to learn.
I'm trying to get the thumbnail by the way.
The solution is explained in this nice article. You need the children() method for accessing XML elements which contain a namespace. This code snippet is quoted from the article:
$feed = simplexml_load_file('http://www.sitepoint.com/recent.rdf');
foreach ($feed->item as $item) {
$ns_dc = $item->children('http://purl.org/dc/elements/1.1/');
echo $ns_dc->date;
}
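Some answers suggest referencing prefixed nodes with curly brackets, as below. Note that curly-bracket property syntax in SimpleXML is meant for names containing characters such as hyphens; it does not resolve a namespace prefix, so children() remains the reliable approach.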
$item->{'itunes:duration'}
You're dealing with a namespace? I think you need to use the ->children method.
$ns_dc = $item->children('http://namespace.org/');
Can you provide a snippet with the xml declaration?
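Since the question asks for the thumbnail specifically, the children() approach might look like the sketch below. The feed fragment is hypothetical (the real Flickr feed is not shown in the question); only the Media RSS namespace URI is taken as given.

```php
<?php
// Hypothetical feed fragment for illustration.
$xml = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample photo</title>
      <media:thumbnail url="http://example.com/thumb.jpg" height="75" width="75" />
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($xml);

$thumbs = [];
foreach ($feed->channel->item as $item) {
    // children() with a namespace URI exposes the elements in that namespace.
    $media = $item->children('http://search.yahoo.com/mrss/');

    // The url attribute has no prefix, so it is in no namespace;
    // attributes() without arguments returns such attributes.
    $thumbs[] = (string) $media->thumbnail->attributes()->url;
}

print_r($thumbs);
```

Passing the namespace URI rather than the prefix keeps the code working even if the feed later declares the same namespace under a different prefix.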
An even simpler way to access namespaced XML nodes in PHP, without declaring a namespace, is DOMDocument's getElementsByTagName(), which matches on the element's local name and ignores the prefix.
In order to get the value of <su:authorEmail> from the following source
<item>
<title>My important article</title>
<pubDate>Mon, 29 Feb 2017 00:00:00 +0000</pubDate>
<link>https://myxmlsource.com/32984</link>
<guid>https://myxmlsource.com/32984</guid>
<author>Blogs, Jo</author>
<su:departments>
<su:department>Human Affairs</su:department>
</su:departments>
<su:authorHash>4f329b923419b3cb2c654d615e22588c</su:authorHash>
<su:authorEmail>hIwW14tLc+4l/oo7agmRrcjwe531u+mO/3IG3xe5jMg=</su:authorEmail>
<dc:identifier>/32984/Download/0032984-11042.docx</dc:identifier>
<dc:format>Journal article</dc:format>
<dc:creator>Blogs, Jo</dc:creator>
<slash:comments>0</slash:comments>
</item>
Use the following code:
$rss = new DOMDocument();
$rss->load('https://myxmlsource.com/rss/xml');
$nodes = $rss->getElementsByTagName('item');
foreach ($nodes as $node) {
    $title = $node->getElementsByTagName('title')->item(0)->nodeValue;
    $author = $node->getElementsByTagName('author')->item(0)->nodeValue;
    // The su: elements are found by local name alone; no prefix is needed.
    $authorHash = $node->getElementsByTagName('authorHash')->item(0)->nodeValue;
    $department = $node->getElementsByTagName('department')->item(0)->nodeValue;
    // decryptEmail() is the poster's own helper and is not defined here.
    $email = decryptEmail($node->getElementsByTagName('authorEmail')->item(0)->nodeValue);
}
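One caveat with matching on local name only: if two namespaces use the same element name, both are returned. The sketch below illustrates this with a made-up item; the su namespace URI and the su:creator element are hypothetical and exist only to create the clash.

```php
<?php
// Hypothetical item: the "su" namespace URI and <su:creator> are invented
// here purely to show a name clash between two namespaces.
$xml = <<<'XML'
<item xmlns:su="http://example.com/ns/su" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <su:creator>internal-id-42</su:creator>
  <dc:creator>Blogs, Jo</dc:creator>
</item>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Matches every element whose local name is "creator", whatever its prefix.
$byLocalName = $doc->getElementsByTagName('creator');

// Matches only the Dublin Core element.
$byNamespace = $doc->getElementsByTagNameNS('http://purl.org/dc/elements/1.1/', 'creator');

echo $byLocalName->length, "\n";
echo $byNamespace->item(0)->nodeValue, "\n";
```

If the feed never reuses a local name across namespaces, the shorter getElementsByTagName() call is fine; otherwise the namespace-aware variant avoids picking up the wrong node.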

Trying to Parse Only the Images from an RSS Feed

First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando
The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
    // The description may not be valid XML, so use a more forgiving HTML parser mode
    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );
    // Switch back to SimpleXML for readability
    $description_sxml = simplexml_import_dom( $description_dom );
    // Find all images, and extract their 'src' param
    $imgs = $description_sxml->xpath('//img');
    foreach($imgs as $image) {
        echo (string)$image['src'];
    }
}
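The same two-step idea can be tried without network access. The sketch below uses a shortened inline copy of the sample feed and collects the URLs into an array. It also silences the warnings that loadHTML() tends to raise on imperfect markup, which is an addition not present in the code above.

```php
<?php
// Shortened inline copy of the sample feed from the question.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>Article One</title>
      <description><![CDATA[<div>
        <img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
        <ul><li>Rating: 5</li></ul></div>]]></description>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);

$urls = [];
foreach ($xml->xpath('//item/description') as $description) {
    // Step 1: the CDATA content is plain text as far as the XML parser cares.
    $html = trim((string) $description);
    if ($html === '') {
        continue; // loadHTML() rejects an empty string
    }

    // Step 2: parse that text as HTML, collecting parser complaints
    // internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $urls[] = $img->getAttribute('src');
    }
}

print_r($urls);
```

Collecting the URLs first, rather than echoing them inside the loop, makes it easier to decide later how to display them.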
I don't have much experience with XPath, but you could try the following:
$imgs = $xml->xpath('//item//img');
This selects all img elements that are inside item elements, regardless of whether other elements sit in between. The leading // searches for item anywhere in the document; a path without it is relative to the current node, and a single leading slash starts at the root, so you would need something like /rss/channel/item.... Note that this only matches img elements that are real XML nodes, not markup inside a CDATA section.
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
    // src is an attribute, so it is read with array syntax, not ->src
    echo '<img src="' . htmlspecialchars((string) $image['src']) . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.
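As a rough sketch of the CSS route, the helper below wraps the images in a flex container so they sit side by side. The function name renderImageRow() and the second URL are hypothetical, and an inline style is used only to keep the example in one piece; a stylesheet rule would normally be preferable.

```php
<?php
// Hypothetical list of image URLs, e.g. collected from the feed.
$urls = [
    'http://mywebsite.com/myimages/1521197-main.jpg',
    'http://mywebsite.com/myimages/1521198-main.jpg',
];

// Builds one flex container holding an <img> per URL.
function renderImageRow(array $urls): string
{
    $html = '<div style="display: flex; gap: 8px;">';
    foreach ($urls as $url) {
        // Escape the URL so quotes or ampersands cannot break the attribute.
        $html .= '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" alt="" />';
    }
    return $html . '</div>';
}

echo renderImageRow($urls);
```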

Selecting xml elements with the same name with simple_load_xml in PHP

I have a (simplified) XML document like this:
<article>
<title>My Article</title>
<image src="someurl.jpg" />
<image src="someotherurl.jpg" />
</article>
How do I select the <image> elements? They have the same name. To select the <title> I simply do this:
$xml = simplexml_load_file( "theurltomyxml.xml" );
$article = $xml->article;
$title = $article->title;
But how do I get the images? They have the same name! Just writing $article->image won't work.
I know this is an older question/answer but I had a similar issue and solved it by using the second solution by ajreal with a few adjustments of my own. I had a series of top level nodes (the xml was not formatted properly and didn't split the elements into parent nodes - out of my control). So I used a for loop that counts the elements then used ajreal's solution to echo back the contents I wanted with the iteration of $i.
My use was a bit different than above so I've tried to change it to make it more relevant to your images issue. Anyone please let me know if I made a mistake.
$campaigns = $xml->children();
// Use < rather than <= so the loop stops at the last valid index
for ($i = 0; $i < $campaigns->count(); $i++) {
    echo $campaigns[$i]->article->title . $campaigns[$i]->article->image[0];
}
You can do this:
foreach ($xml->xpath("/article/image") as $img)
{
    ...
}
Or, since $xml->image is a list of image nodes, you can use normal array access:
$xml->image[0];
$xml->image[1];
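Putting the array-style access together with the sample from the question, a minimal sketch might look like this. It assumes article is the root element, so $xml already refers to it and $xml->article is not needed.

```php
<?php
$xml = simplexml_load_string(
    '<article>
        <title>My Article</title>
        <image src="someurl.jpg" />
        <image src="someotherurl.jpg" />
    </article>'
);

// $xml is the root <article> element itself, so its children are read directly.
$sources = [];
foreach ($xml->image as $image) {
    // Attributes are read with array syntax and cast to string.
    $sources[] = (string) $image['src'];
}

print_r($sources);
echo count($xml->image), "\n";
```

Iterating with foreach avoids having to know in advance how many image elements there are.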

PHP -- SimpleXMLElement -- parsing im:image

How do I extract im:image elements? For instance, I can do this:
$feed=file_get_contents($url);
$xml = new SimpleXMLElement($feed);
$title = $xml->entry[0]->title;
$html = $xml->entry[0]->content;
But I can't get this:
$img = $xml->entry[0]->im;
How do I target those? I'm willing to use DOMDocument() as well.
EDIT:
<entry>
<im:image height="55">
http://foo.com/foo.jpg
</im:image>
</entry>
The im is just a namespace. You want 'image' element, not im – Dmitri Snytkine 20 hours ago
Yes, that's true, and the clue I needed.
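For completeness, here is one way that clue can be turned into code with SimpleXML. The feed wrapper and the xmlns:im URI are assumptions added so the snippet parses on its own; the real feed's declaration should be used instead.

```php
<?php
// The <feed> wrapper and the xmlns:im URI are assumed for illustration.
$feed = <<<'XML'
<feed xmlns:im="http://itunes.apple.com/rss">
  <entry>
    <title>Foo</title>
    <im:image height="55">
      http://foo.com/foo.jpg
    </im:image>
  </entry>
</feed>
XML;

$xml = new SimpleXMLElement($feed);

// Passing true as the second argument treats 'im' as a prefix, not a URI.
$image = $xml->entry[0]->children('im', true)->image;

// The element text includes the surrounding whitespace, so trim it.
$img    = trim((string) $image);
$height = (string) $image->attributes()->height;

echo $img, ' (', $height, ")\n";
```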
