How to get iTunes-specific child nodes of RSS feeds?

I'm trying to process an RSS feed using PHP, and there are some tags, such as 'itunes:image', which I need to process. The code I'm using is below, and for some reason these elements are not returning any value: the resulting node list has a length of 0.
How can I read these tags and get their attributes?
$f = $_REQUEST['feed'];
$feed = new DOMDocument();
$feed->load($f);
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
$arrt = $item->getElementsByTagName('itunes:image');
print_r($arrt);
}

getElementsByTagName is specified by the DOM, and PHP is just following that. It doesn't consider namespaces. Instead, use getElementsByTagNameNS, which requires the full namespace URI (not the prefix). This appears to be http://www.itunes.com/dtds/podcast-1.0.dtd*. So:
$imgs = $item->getElementsByTagNameNS('http://www.itunes.com/dtds/podcast-1.0.dtd', 'image');
// Set a fallback first, then overwrite it if the item has an itunes:image element
$urlImage = '';
if ($imgs->length > 0) {
    $urlImage = $imgs->item(0)->getAttribute('href');
}
Or put the namespace in a constant.
You might be able to get away with simply removing the prefix and getting all image tags of any namespace with getElementsByTagName.
Make sure to check whether a given item has an itunes:image element at all (as the example above does); in the example podcast, some items don't, and I suspect that was also giving you trouble. Note that getElementsByTagNameNS returns a node list, so you have to check its length and take item(0) before calling getAttribute. (If there's no href attribute, getAttribute returns an empty string in PHP without erroring out.)
*In case you're wondering, there is no actual DTD file hosted at that location, and there hasn't been for about ten years.
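A minimal, self-contained sketch of the namespace-aware lookup described above. The inline feed, the ITUNES_NS constant and the imageUrls() helper are made up for illustration; only the namespace URI comes from the answer.

```php
<?php
// Namespace URI from the answer above; the constant name is an assumption.
const ITUNES_NS = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

// Hypothetical two-item feed: only the first item carries an itunes:image.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>Episode 1</title>
      <itunes:image href="http://example.com/ep1.jpg"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: maps each item title to its image URL ('' if none).
function imageUrls(string $xml): array
{
    $feed = new DOMDocument();
    $feed->loadXML($xml);
    $urls = [];
    foreach ($feed->getElementsByTagName('item') as $item) {
        $title = $item->getElementsByTagName('title')->item(0)->textContent;
        // Look the element up by namespace URI and local name, not by prefix.
        $imgs = $item->getElementsByTagNameNS(ITUNES_NS, 'image');
        $urls[$title] = $imgs->length > 0 ? $imgs->item(0)->getAttribute('href') : '';
    }
    return $urls;
}

print_r(imageUrls($rss));
```

Items without an itunes:image simply map to an empty string here, which avoids calling getAttribute on a missing node.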

<?php
$rss_feed = simplexml_load_file("url link");
if ($rss_feed !== false) {
    foreach ($rss_feed->channel->item as $feed_item) {
        // children('itunes', true) selects the item's children by namespace prefix
        $itunes = $feed_item->children('itunes', true);
        if (isset($itunes->image)) {
            echo $itunes->image->attributes()->href;
        }
    }
}
?>
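For comparison, here is a runnable sketch of the same SimpleXML approach against an inline feed. The feed content and the $hrefs array are assumptions for illustration; a real feed URL would be loaded with simplexml_load_file() instead.

```php
<?php
// Hypothetical inline feed standing in for the real URL.
$rss = <<<XML
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item><title>Episode 1</title><itunes:image href="http://example.com/ep1.jpg"/></item>
    <item><title>Episode 2</title></item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);
$hrefs = [];
foreach ($feed->channel->item as $item) {
    // Second argument true: 'itunes' is a prefix, not a namespace URI.
    $itunes = $item->children('itunes', true);
    // Cast to string so the array holds plain text rather than objects.
    $hrefs[(string) $item->title] = isset($itunes->image)
        ? (string) $itunes->image->attributes()->href
        : '';
}
print_r($hrefs);
```

Reading the href through attributes() rather than array access is deliberate: the attribute itself has no namespace, even though its element does.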

Related

Extract pattern from xml file using PHP?

I have a remote XML file. I need to read it, find some values, and save them in an array.
I have loaded the file with the following (no problem with this):
$xml_external_path = 'http://example.com/my-file.xml';
$xml = file_get_contents($xml_external_path);
In this file there are many instances of:
<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>
I need to extract just the numbers from these strings and save them in an array. I guess I need to use a pattern like:
$pattern = '/<unico>(.*?)<\/unico>/';
But I'm not sure what to do next. Keep in mind that it is an .xml file.
Result should be a populated array like this:
$my_array = array (4241, 234, 534534,2345334);

It is better to use XPath to read through an XML file. XPath is a query language for XML documents; in PHP it is available through the DOMXPath class, which works on a DOMDocument. You query the document with path expressions whose syntax resembles Unix paths: // means anywhere in the document and ./ means relative to the selected node. DOMXPath::query() returns a DOMNodeList with all the nodes that match the expression. The following code will do what you want (the fragment is wrapped in a root element here, because loadXML() requires a single root element):
$xmlFile = "<root>
<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>
</root>";
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xpath = new DOMXPath($xmlDoc);
// This code returns a DOMNodeList of all nodes with the unico tags in the file.
$unicos = $xpath->query("//unico");
//This returns an integer of how many nodes were found that matched the pattern
echo $unicos->length;
You can find more info on XPath and its syntax here: XPath on Wikipedia#syntax
DOMNodeList implements Traversable, so you can use foreach() to traverse it. If you really want a flat array, you can convert it with a simple loop, as in question #15807314:
$unicosArr = array();
foreach($unicos as $node){
$unicosArr[] = $node->nodeValue;
}

Using preg_match_all:
<?php
$xml = '<unico>4241</unico>
<unico>234</unico>
<unico>534534</unico>
<unico>2345334</unico>';
$pattern = '/<unico>(.*?)<\/unico>/';
preg_match_all($pattern, $xml, $result);
// $result[0] holds the full matches including the tags; $result[1] holds the captured numbers
print_r($result[1]);

You could try this; it loops through each line of the file and finds whatever is between the <unico> tags.
<?php
$file = "./your.xml";
$pattern = '/<unico>(.*?)<\/unico>/';
$allVars = array();
$currentFile = fopen($file, "r");
if ($currentFile) {
    // Read through the file line by line
    while (($m_sLine = fgets($currentFile)) !== false) {
        // Capture the text between the <unico> tags on this line
        if (preg_match($pattern, $m_sLine, $match)) {
            $allVars[] = $match[1];
        }
    }
    // Only close the handle if it was opened successfully
    fclose($currentFile);
}
print_r($allVars);
Is this sort of what you want? :)

PHP return value after XML exploration

I have a PHP array with a lot of URLs of XML user files:
$tab_users[0]=john.xml
$tab_users[1]=chris.xml
$tab_users[n...]=phil.xml
For each user, the <zoom> tag is either filled or empty, depending on whether the user filled it in:
john.xml = <zoom>Some content here</zoom>
chris.xml = <zoom/>
phil.xml = <zoom/>
I'm trying to explore the users' data and display the first filled <zoom> tag, but randomized: each time you reload the page, the <div id="zoom"> content is different.
$rand=rand(0,$n); // $n is the number of users
$datas_zoom=zoom($n,$rand);
My PHP function
function zoom($n,$rand) {
global $tab_users;
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
... some other stuff here
// no "zoom" value found
if ($txt_zoom =='') {
echo 'RAND='.$rand.' XML='.$tab_users[$rand].'<br />';
$datas_zoom=zoom($r,$n,$rand); } // random zoom fct again and again till...
}
else {
echo 'ZOOM='.$txt_zoom.'<br />';
return $txt_zoom; // we got it!
}
}
echo '<br />Return='.$datas_zoom;
The problem is: when by chance the first XML file explored contains "zoom" information, the function returns it, but if not, nothing is returned. An example of the results when the first one happens to be the good one:
// for RAND=0, XML=john.xml
ZOOM=Anything here
Return=Some content here // we're lucky
Unlucky:
RAND=1 XML=chris.xml
RAND=2 XML=phil.xml
// then for RAND=0 and XML=john.xml
ZOOM=Anything here
// content found but Return is empty
Return=
What's wrong?

I suggest importing the values into a database table, generating a single local file, or something like that, so that you don't have to open and parse all the XML files for each request.
Reading multiple files is a lot slower than reading a single file. And with a database, even the random logic can be moved to SQL.
You are currently using SimpleXML, but fetching a single value from an XML document is actually easier with DOM. SimpleXMLElement::xpath() only supports XPath expressions that return a node list, but DOMXPath::evaluate() can return the scalar value directly:
$document = new DOMDocument();
$document->load($xmlFile);
$xpath = new DOMXpath($document);
$zoomValue = $xpath->evaluate('string(//zoom[1])');
//zoom[1] will fetch the first zoom element node in a node list. Casting the list into a string will return the text content of the first node or an empty string if the list was empty (no node found).
For the sake of this example assume that you generated an XML like this
<zooms>
<zoom user="u1">z1</zoom>
<zoom user="u2">z2</zoom>
</zooms>
In this case you can use Xpath to fetch all zoom nodes and get a random node from the list.
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$zooms = $xpath->evaluate('//zoom');
$zoom = $zooms->item(mt_rand(0, $zooms->length - 1));
var_dump(
[
'user' => $zoom->getAttribute('user'),
'zoom' => $zoom->textContent
]
);
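The string() cast described in this answer can be checked with a small sketch. The firstZoom() helper and the sample documents are hypothetical; the XPath expression is the one from the answer.

```php
<?php
// Hypothetical helper: returns the text of the first <zoom>,
// or '' if that element is empty or missing.
function firstZoom(string $xml): string
{
    $document = new DOMDocument();
    $document->loadXML($xml);
    $xpath = new DOMXPath($document);
    // string() converts the node list to the text content of its first node.
    return $xpath->evaluate('string(//zoom[1])');
}

var_dump(firstZoom('<user><zoom>Some content here</zoom></user>'));
var_dump(firstZoom('<user><zoom/></user>'));
var_dump(firstZoom('<user></user>'));
```

Because an empty element and a missing element both yield an empty string, the caller only needs a single comparison against '' to decide whether to try the next file.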

Your main issue is that you are not returning any value when there is no zoom found.
$datas_zoom=zoom($r,$n,$rand); // no return keyword here!
When you're using recursion, you usually want to "chain" return values on and on until you find the one you need. $datas_zoom is not a global variable, so it will not "leak out" of your function. Please read PHP's variable scope documentation for more info.
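To make the point about chaining return values concrete, here is a small sketch that leaves the XML out entirely. The $zooms array and both function names are made up for illustration; the only difference between the two functions is the return keyword on the recursive call.

```php
<?php
// Hypothetical stand-in for the XML lookups: '' means the zoom is not filled.
$zooms = ['', '', 'Some content here'];

// Broken: the recursive result is computed but never handed back to the caller.
function findWithoutReturn(array $zooms, int $i = 0)
{
    if ($i >= count($zooms)) {
        return null;
    }
    if ($zooms[$i] !== '') {
        return $zooms[$i];
    }
    findWithoutReturn($zooms, $i + 1); // result is discarded here
}

// Fixed: every level passes the inner result up the chain.
function findWithReturn(array $zooms, int $i = 0)
{
    if ($i >= count($zooms)) {
        return null;
    }
    if ($zooms[$i] !== '') {
        return $zooms[$i];
    }
    return findWithReturn($zooms, $i + 1);
}

var_dump(findWithoutReturn($zooms));
var_dump(findWithReturn($zooms));
```

The first call only yields a value when the very first element is filled, which matches the "lucky" and "unlucky" runs described in the question.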
Then again, you're calling the zoom function with three arguments ($r, $n, $rand) while the function only accepts two ($n and $rand). Also, $r is undefined, $n is not used at all, and you are most likely trying to use the same $rand value again and again, which obviously cannot work.
Also note that there are too many closing braces in your code.
I think the best approach for your problem is to shuffle the array and then work through it like a queue, without recursion (which should be slightly faster):
function zoom($tab_users) {
// shuffle an array once
shuffle($tab_users);
// init variable
$txt_zoom = null;
// repeat until zoom is found or there
// are no more elements in array
do {
$rand = array_pop($tab_users);
$datas_user = new SimpleXMLElement($rand, null, true);
$tag=$datas_user->xpath('/user');
//if zoom found
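// cast to string so the function returns plain text rather than a SimpleXMLElement
if ((string) $tag[0]->zoom !== '') {
    $txt_zoom = (string) $tag[0]->zoom;
}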
} while(!$txt_zoom && !empty($tab_users));
return $txt_zoom;
}
$datas_zoom = zoom($tab_users); // your zoom is here!
Please read more about php scopes, php functions and recursion.

There's no reason for recursion. A simple loop would do.
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
// xpath() returns a plain array, so use count(); valid indexes run from 0 to count - 1
$max = count($tag) - 1;
while(true) {
$test_index = rand(0, $max);
if ($tag[$test_index]->zoom != "") {
break;
}
}
Of course, you might want to add a bit more logic to handle the case where NO zooms have text set, in which case the above would be an infinite loop.
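One way to sidestep the infinite-loop case is to filter out the empty values first and only then pick at random. This is a sketch under assumptions, not the answerer's code: pickRandomFilled() is a hypothetical helper, and it works on plain strings that would already have been read from the XML files.

```php
<?php
// Hypothetical helper: picks a random non-empty value, or null if there is none.
function pickRandomFilled(array $values): ?string
{
    // Keep only the non-empty entries, reindexed from 0.
    $filled = array_values(array_filter($values, function ($v) {
        return $v !== '';
    }));
    if ($filled === []) {
        return null; // nothing to pick, so no loop can spin forever
    }
    return $filled[random_int(0, count($filled) - 1)];
}

var_dump(pickRandomFilled(['', 'Some content here', '']));
var_dump(pickRandomFilled(['', '']));
```

The trade-off is that every value has to be loaded up front, whereas the loop in the answer stops at the first hit.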

Remove tags with Simple HTML DOM parser [duplicate]

I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?

There are no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';

When you only clear the outer text, you delete the HTML content itself, but if you perform another find for the same elements, they will still appear in the result. The reason is that the Simple HTML DOM object still keeps its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same variable. This way the object is recreated without the deleted content, and the Simple HTML DOM object is built without it.
Here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.

I think you have some difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;

I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.

The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach($html->find('somecondition') as $item){
if (somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
$item->outertext='';
foreach($foo as $bar){
if(!$bar->getAttribute('softDelete')){
//do something
}
}
}

This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}

Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.

Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing

Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}

This works now:
$element->remove();
You can see the documentation for the method here.

Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of node.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
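For readers without Simple HTML DOM installed, the steps from the question (get the HTML string, remove the images, limit to x words, output) can be sketched with PHP's built-in DOMDocument instead. This swaps in a different library from the one the answers use, and the snippet() helper and its sample input are hypothetical.

```php
<?php
// Hypothetical helper: strips <img> elements and returns the first $maxWords words of text.
function snippet(string $html, int $maxWords): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Copy the live node list to an array first: removing nodes while
    // iterating a live DOMNodeList can skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // Split the remaining text on whitespace and keep the first words.
    $text = trim($doc->documentElement->textContent);
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $maxWords));
}

echo snippet('<p>Breaking <img src="a.png" alt="x"> news from the ticker desk</p>', 3);
```

Since only the text is needed for a ticker, the images would not appear in textContent anyway; removing them explicitly matters if the markup itself is output afterwards.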

Using PHP to get DOM Element

I'm struggling big time understanding how to use the DOMElement object in PHP. I found this code, but I'm not really sure it's applicable to me:
$dom = new DOMDocument();
$dom->loadHTML("index.php");
$div = $dom->getElementsByTagName('div');
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
Basically what I need is to search the DOM for an element with a particular id, after which point I need to extract a non-standard attribute (i.e. one that I made up and put on with JS) so I can see the value of that. The reason is I need one piece from the $_GET and one piece that is in the HTML based from a redirect. If someone could just explain how I use DOMDocument for this purpose, that would be helpful. I'm really struggling understanding what's going on and how to properly implement it, because I clearly am not doing it right.
EDIT (Where I'm at based on comment):
This is my code lines 4-26 for reference:
<div id="column_profile">
<?php
require_once($_SERVER["DOCUMENT_ROOT"] . "/peripheral/profile.php");
$searchResults = isset($_GET["s"]) ? performSearch($_GET["s"]) : "";
$dom = new DOMDocument();
$dom->load("index.php");
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
$div = $dom->getElementById('currentLocation');
$attr = $div->getAttribute('srckey');
echo "<h1>{$attr}</a>";
?>
</div>
<div id="column_main">
Here is the error message I'm getting:
Warning: DOMDocument::load() [domdocument.load]: Extra content at the end of the document in ../public_html/index.php, line: 26 in ../public_html/index.php on line 10
Fatal error: Call to a member function getAttribute() on a non-object in ../public_html/index.php on line 21

getElementsByTagName returns you a list of elements, so first you need to loop through the elements, then through their attributes.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
In your case, you said you needed a specific ID. Those are supposed to be unique, so to do that, you can use (note getElementById might not work unless you call $dom->validate() first):
$div = $dom->getElementById('divID');
Then to get your attribute:
$attr = $div->getAttribute('customAttr');
EDIT: $dom->load() just reads the contents of the file; it doesn't execute them, so index.php won't be run this way. (Note also that $dom->loadHTML() expects a string of markup, not a file name.) You might have to do something like:
$dom->loadHTML(file_get_contents('http://localhost/index.php'));
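Putting the pieces of this answer together, a minimal sketch might look like the following. The markup string is hypothetical, though its id and custom srckey attribute mirror the question. It assumes that documents parsed with loadHTML() treat id as an ID attribute, so getElementById works here without calling validate(); if that does not hold in your setup, an XPath query on @id is a fallback.

```php
<?php
// Hypothetical markup; the id and the custom srckey attribute mirror the question.
$html = '<html><body>'
      . '<div id="column_profile"></div>'
      . '<div id="currentLocation" srckey="abc123">Here</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$div = $dom->getElementById('currentLocation');
// getElementById returns null when nothing matches, so check before use.
$srckey = $div !== null ? $div->getAttribute('srckey') : '';
echo $srckey;
```

The null check is what prevents the "Call to a member function getAttribute() on a non-object" error from the question when the element is not found.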

You won't have access to the HTML if the redirect is from an external server. Let me put it this way: the DOM does not exist at the point you are trying to parse it. What you can do is pass the text to a DOM parser and then manipulate the elements that way. Or the better way would be to add it as another GET variable.
EDIT: Are you also aware that the client can change the HTML and have it pass whatever they want? (Using a tool like Firebug)
