XPath query for HTML table within XML in PHP DOMDocument

I have an XML file with the following tree structure.
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>Videos</title>
<link>https://www.example.com/r/videos/</link>
<description>A long description of the video.</description>
<image>...</image>
<atom:link rel="self" href="http://www.example.com/videos/.xml" type="application/rss+xml"/>
<item>
<title>The most used Jazz lick in history.</title>
<link>
http://www.example.com/
</link>
<guid isPermaLink="true">
http://www.example.com/
</guid>
<pubDate>Mon, 07 Sep 2015 14:43:34 +0000</pubDate>
<description>
<table>
<tr>
<td>
<a href="http://www.example.com/">
<img src="http://www.example.com/.jpg" alt="The most used Jazz lick in history." title="The most used Jazz lick in history." />
</a>
</td>
<td> submitted by
jcepiano
<br/>
[link]
<a href="http://www.example.com/">
[508 comments]
</a>
</td>
</tr>
</table>
</description>
<media:title>The most used Jazz lick in history.</media:title>
<media:thumbnail url="http://example.jpg"/>
</item>
</channel>
</rss>
Here, the HTML table element is embedded inside the XML, and that is what confuses me.
Now I want to pick the text node values for //channel/item/title and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
I am using PHP DOMDocument();
I have been looking for a solution to this for 2 days now. Could you please tell me how to do this?
I also need to count the total number of items in the feed. Right now I am doing it like this:
...
$queryResult = $xpathvar->query('//item/title');
$total = 1;
foreach($queryResult as $result){
$total++;
}
echo $title;
And I also need a reference link for XPath query selectors' rules.
Thanks in advance! :)

You wrote that you wanted the length of the result set of the following query:
$queryResult = $xpathvar->query('//item/title');
I assume that $xpathvar here is of type DOMXPath. If so, its query() method returns a DOMNodeList, which has a length property. Instead of counting with foreach, simply use:
$length = $xpathvar->query('//item/title')->length;
Now I want to pick the text node values for //channel/item/title
Which you can get with the expression //channel/item/title/text().
and href value for //channel/item/description/table/tr/td[1]/a[1] (with a text node value = "[link]")
Your expression here selects any tr, the first td under that, then the first a. But the first a does not have a value of "[link]" in your source. If you want that, though, you can use:
//channel/item/description/table/tr/td[1]/a[1]/@href
but it looks like you rather want:
//channel/item/description/table/tr/td/a[. = "[link]"][1]/@href
which finds, within each td, the first a element whose value (text node) is "[link]", and returns its href attribute.
Above in 2nd case, I am looking for the value of 2nd a (with a text node value = "[link]"), inside 2nd td inside tr, table, description, item, channel.
Not sure if this was a separate question or meant to explain the previous one. Regardless, the answer is the same as the previous one, unless you explicitly want to search for the 2nd a etc. (i.e., search by position), in which case you can use numeric predicates such as td[2]/a[2].
Note: you start most of your expressions with //expr, which essentially means: search the whole tree at any depth for the expression expr. This is potentially expensive, and if you know where the node sits in the tree, it is better, and usually faster, to use a direct path. In your case, you can replace //channel with /rss/channel or /*/channel (because channel is a direct child of the root element).
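To make the points above concrete, here is a minimal sketch. It assumes a cut-down inline feed (the sample XML, titles and URLs are made up for illustration) in which "[link]" is the text of an a element and the HTML table is real XML markup, as in the question's listing:

```php
<?php
// Hypothetical cut-down feed; a real one would be loaded with $doc->load($url).
$xml = <<<XML
<rss version="2.0">
  <channel>
    <item>
      <title>First video</title>
      <description>
        <table><tr>
          <td><a href="http://www.example.com/thumb1">thumbnail</a></td>
          <td><a href="http://www.example.com/video1">[link]</a> <a href="http://www.example.com/c1">[5 comments]</a></td>
        </tr></table>
      </description>
    </item>
    <item>
      <title>Second video</title>
      <description>
        <table><tr>
          <td><a href="http://www.example.com/thumb2">thumbnail</a></td>
          <td><a href="http://www.example.com/video2">[link]</a> <a href="http://www.example.com/c2">[7 comments]</a></td>
        </tr></table>
      </description>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList; its length property is the item count.
$titles = $xpath->query('/rss/channel/item/title');
$count = $titles->length;

// Collect the href attribute of every a element whose text is "[link]".
$hrefs = [];
$expr = '/rss/channel/item/description/table/tr/td/a[. = "[link]"]/@href';
foreach ($xpath->query($expr) as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo $count, "\n", implode("\n", $hrefs), "\n";
```

If the description contains escaped HTML or CDATA instead of real elements, the second query matches nothing, and the description text has to be parsed separately.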

I finally could make it work with the code below
$url = "https://www.example.com/r/videos/.xml";
$feed_dom = new DOMDocument;
// preserveWhiteSpace only has an effect when set before loading
$feed_dom->preserveWhiteSpace = false;
$feed_dom->load($url);
$items = $feed_dom->getElementsByTagName('item');
foreach ($items as $item) {
    $title = $item->getElementsByTagName('title')->item(0)->nodeValue;
    // The description's text content is the HTML table as a string
    $desc_table = $item->getElementsByTagName('description')->item(0)->nodeValue;
    echo $title . "<br>";
    // Parse the embedded HTML as a document of its own
    $table_dom = new DOMDocument;
    $table_dom->preserveWhiteSpace = false;
    $table_dom->loadHTML($desc_table);
    $xpath = new DOMXPath($table_dom);
    // Second a element inside the second td of each row
    $yt_link_node = $xpath->query("//table/tr/td[2]/a[2]");
    foreach ($yt_link_node as $yt_link) {
        $yt = $yt_link->getAttribute('href');
        echo $yt . "<br>";
        echo "<br>";
    }
}
Thank you, Abel; your help was very useful in getting this done. :)

Related

php xpath query to get parent node based on value in repeating child nodes

I have an XML file structured as follows:
<pictures>
<picture>
<title></title>
<description></description>
<facts>
<date></date>
<place>United States</place>
</facts>
<people>
<person>John</person>
<person>Sue</person>
</people>
</picture>
<picture>
<title></title>
<description></description>
<facts>
<date></date>
<place>Canada</place>
</facts>
<people>
<person>Sue</person>
<person>Jane</person>
</people>
</picture>
<picture>
<title></title>
<description></description>
<facts>
<date></date>
<place>Canada</place>
</facts>
<people>
<person>John</person>
<person>Joe</person>
<person>Harry</person>
</people>
</picture>
</pictures>
In one case, I need to search for pictures where place="Canada". I have an XPath that does this fine, as such:
$place = "Canada";
$pics = ($pictures->xpath("//*[place='$place']"));
This pulls the entire "picture" node, so I am able to display title, description, etc.
I have another need to find all pictures where person = $person. I use the same type query as above:
$person = "John";
$pics = ($pictures->xpath("//*[person='$person']"));
In this case, the query apparently knows there are 2 pictures with John, but I don't get any of the values for the other nodes. I'm guessing it has something to do with the repeating child node, but can't figure out how to restructure the XPath to pull all of the picture node for each where I have a match on person. I tried using attributes instead of values (and modified the query accordingly), but got the same result.
Can anyone advise what I'm missing here?
Let's replace the variables first. That takes PHP out of the picture. The problem is just the proper XPath expression.
//*[place='Canada']
matches any element node that has a child element node place with the text content Canada.
This is the facts element node - not the picture.
Getting the picture node is slightly different:
//picture[facts/place='Canada']
This would select ANY picture node at ANY DEPTH that matches the condition.
picture[facts/place='Canada']
Would return the same result with the provided XML, but is more specific and matches only picture element nodes that are children of the document element.
Now validating the people node is about the same:
picture[people/person="John"]
You can even combine the two conditions:
picture[facts/place="Canada" and people/person="John"]
Here is a small demo:
$element = new SimpleXMLElement($xml);
$expressions = [
'//*[place="Canada"]',
'//picture[facts/place="Canada"]',
'picture[facts/place="Canada"]',
'picture[people/person="John"]',
'picture[facts/place="Canada" and people/person="John"]',
];
foreach ($expressions as $expression) {
echo $expression, "\n", str_repeat('-', 60), "\n";
foreach ($element->xpath($expression) as $index => $found) {
echo '#', $index, "\n", $found->asXml(), "\n";
}
echo "\n";
}
HINT: You're using dynamic values in your XPath expressions. String literals in XPath 1.0 do not support any kind of escaping, so a quote in the variable can break your expression. See this answer.
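One common workaround for the quoting problem is to build the literal with whichever quote the value does not contain, and to fall back to concat() when it contains both. A minimal sketch follows; the helper name xpath_literal and the sample data are hypothetical, not part of SimpleXML or the original question:

```php
<?php
// Hypothetical helper: turn an arbitrary PHP string into an XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: split on ' and glue the parts back with concat().
    $parts = explode("'", $value);
    return "concat('" . implode("', \"'\", '", $parts) . "')";
}

// Made-up sample data with an apostrophe in a name.
$xml = <<<XML
<pictures>
  <picture><people><person>O'Brien</person><person>Sue</person></people></picture>
  <picture><people><person>Sue</person></people></picture>
</pictures>
XML;

$pictures = new SimpleXMLElement($xml);
$person = "O'Brien";
$pics = $pictures->xpath('picture[people/person=' . xpath_literal($person) . ']');

echo count($pics), "\n";
```

The same helper can be used for the place condition, so neither value needs to be interpolated into the expression directly.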

Trying to Parse Only the Images from an RSS Feed

First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando
The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
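A quick way to see this for yourself is to compare what XPath finds with what the description contains as text. This is only a sketch using a made-up, cut-down feed string in place of the real URL:

```php
<?php
// Hypothetical one-item feed with the HTML wrapped in CDATA.
$rss = <<<XML
<rss version="2.0"><channel><item>
<description><![CDATA[<div><img src="http://mywebsite.com/myimages/a.jpg" width="120"></div>]]></description>
</item></channel></rss>
XML;

$xml = simplexml_load_string($rss);

// No img elements exist in the XML tree, so this comes back empty.
$imgElements = $xml->xpath('//img');

// The markup is only available as the text of the description element.
$text = (string) $xml->channel->item->description;

echo count($imgElements), "\n", $text, "\n";
```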
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo (string)$image['src'];
}
}
I don't have much experience with xPath, but you could try the following:
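$imgs = $xml->xpath('//item//img');
This selects all img elements that are inside item elements, regardless of whether there are other elements in between. The leading // searches for item at any depth in the document; a path without it, such as item//img, is evaluated relative to the context node, and a single leading slash starts at the document root, so you'd need something like /rss/channel/item.... Note that this only finds img elements that are real elements in the XML tree.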
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
echo '<img src="' . $image['src'] . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.

XPath multidimensional arrays in PHP

I'm scraping a website that's mostly table based. I have <tr> tags that each represent a category and <td> tags inside these that represent properties of the category.
Using Xpath I get the <tr> fine but with all the <td> info inside it bunched as one string:
$html_string = file_get_contents('testpage.html');
$dom = new DOMDocument();
$dom->loadHTML($html_string);
$xpath = new DOMXpath($dom);
$context_nodes = $xpath->query('//table[@id="category"]/tr[not(starts-with(@id, "category"))]');
And I can get each <td> fine, but with no reference back to its category, with:
$context_nodes = $xpath->query('//table[@id="category"]/tr[not(starts-with(@id, "category"))]/td');
What I would like to do later is be able to reference the properties of each category. I presumed I could do so with $context_nodes[2] etc., thinking that the array it created was a multidimensional string array. This doesn't seem to be the case.
How would I go about creating an array from the xpath info where I can grab a property of a category based on identifying what category I specifically want. E.g. train[1][2]?
Your second attempt is on the right lines. PHP (or, rather, libxml) retains a reference to the context the nodes you selected were returned from, allowing you to do precisely what you need in your case.
XML
<root>
<cat name="category 1">
<prop>prop 1.1</prop>
<prop>prop 1.2</prop>
</cat>
<cat name="category 2">
<prop>prop 2.1</prop>
<prop>prop 2.2</prop>
</cat>
</root>
PHP
$xml = new SimpleXMLElement($xml);
$props = $xml->xpath('cat/prop');
foreach($props as $prop) {
//let's go back up...
$parent_cat = $prop->xpath('parent::*/@name');
echo '<p>'.$prop.' (property of '.$parent_cat[0].')</p>';
}
Notice how we navigate back up the tree, from the point of the prop node, to reference the parent category. Not sure if this is what you meant but hope it helps.
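For the table case in the question, the same idea works in the other direction: select the rows first, then run a relative query for the cells with each row as the context node. The following is a sketch only; the inline HTML is a made-up stand-in for testpage.html, and $rows is a hypothetical name for the resulting nested array:

```php
<?php
// Hypothetical stand-in for the scraped page.
$html_string = '<table id="category">'
    . '<tr id="category-head"><td>Name</td><td>Speed</td></tr>'
    . '<tr><td>train A</td><td>fast</td></tr>'
    . '<tr><td>train B</td><td>slow</td></tr>'
    . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($html_string);
$xpath = new DOMXPath($dom);

// One inner array per row, one entry per cell.
$rows = [];
$row_expr = '//table[@id="category"]//tr[not(starts-with(@id, "category"))]';
foreach ($xpath->query($row_expr) as $tr) {
    $cells = [];
    // The second argument makes the query relative to this row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

echo $rows[1][0], "\n";
```

With that structure, $rows[1][0] addresses the first property of the second category, which is roughly the train[1][2] style of access asked about.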

Selecting XML elements with the same name with simplexml_load_file in PHP

I have a XML (simplified) like this:
<article>
<title>My Article</title>
<image src="someurl.jpg" />
<image src="someotherurl.jpg" />
</article>
How do I select the <image> elements? They have the same name. To select the <title> i simply do this:
$xml = simplexml_load_file( "theurltomyxml.xml" );
$article = $xml->article;
$title = $article->title;
But how do I get the images? They have the same name! Just writing $article->image won't work.
I know this is an older question/answer but I had a similar issue and solved it by using the second solution by ajreal with a few adjustments of my own. I had a series of top level nodes (the xml was not formatted properly and didn't split the elements into parent nodes - out of my control). So I used a for loop that counts the elements then used ajreal's solution to echo back the contents I wanted with the iteration of $i.
My use was a bit different than above so I've tried to change it to make it more relevant to your images issue. Anyone please let me know if I made a mistake.
$campaigns = $xml->children();
for ($i = 0; $i < $campaigns->count(); $i++) {
echo $campaigns[$i]->article->title . $campaigns[$i]->article->image[0]['src'];
}
You can do this :-
foreach ($xml->xpath("/article/image") as $img)
{
...
}
Or, since $xml->image is a list of image nodes, you can access them like a normal array:
$xml->image[0];
$xml->image[1];

How to get the tag "<yweather:condition>" from Yahoo Weather RSS in PHP?

<?php
$doc = new DOMDocument();
$doc->load('http://weather.yahooapis.com/forecastrss?p=VEXX0024&u=c');
$channel = $doc->getElementsByTagName("channel");
foreach($channel as $chnl)
{
$item = $chnl->getElementsByTagName("item");
foreach($item as $itemgotten)
{
$describe = $itemgotten->getElementsByTagName("description");
$description = $describe->item(0)->nodeValue;
echo $description;
}
}
?>
As you can see, this is a simple script that returns the content of the <description> tag from the above URL. The thing is that I don't need that content; I need what is inside the <yweather:condition> tag, specifically the attributes code, temp and text. How do I do that simply with my current code? Thanks
Ex of the tag content:
<yweather:condition text="Partly Cloudy" code="30" temp="30" date="Fri, 16 Jul 2010 8:30 am AST" />
Something like:
echo $itemgotten->getElementsByTagNameNS(
"http://xml.weather.yahoo.com/ns/rss/1.0","condition")->item(0)->
getAttribute("temp");
The key is that you have to use getElementsByTagNameNS instead of getElementsByTagName and specify "http://xml.weather.yahoo.com/ns/rss/1.0" as the namespace.
You know yweather is a shortcut for http://xml.weather.yahoo.com/ns/rss/1.0 because the XML file includes an xmlns attribute:
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
How about:
DOMElement::getAttribute — Returns value of attribute
And if you need to get anything with namespaces, there are corresponding methods ending in NS.
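Putting the two answers together, a minimal sketch might look like this. It assumes an inline copy of the relevant part of the feed (trimmed by hand for illustration) instead of loading the live URL:

```php
<?php
// Hand-trimmed sample of the feed; the real script would use $doc->load($url).
$rss = <<<XML
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <channel>
    <item>
      <yweather:condition text="Partly Cloudy" code="30" temp="30" date="Fri, 16 Jul 2010 8:30 am AST" />
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($rss);

// The namespace URI, not the yweather prefix, identifies the element.
$ns = 'http://xml.weather.yahoo.com/ns/rss/1.0';
$condition = $doc->getElementsByTagNameNS($ns, 'condition')->item(0);

$weather = [
    'text' => $condition->getAttribute('text'),
    'code' => $condition->getAttribute('code'),
    'temp' => $condition->getAttribute('temp'),
];

echo $weather['text'], ' ', $weather['temp'], "\n";
```

The attributes themselves carry no prefix, so plain getAttribute() is enough once the element has been found.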
