Reading itunes XML file with PHP DOM method - php

I'm having trouble getting information out of my iTunes XML feed, which you can see here: http://c3carlingford.org.au/podcast/C3CiTunesFeed.xml
I need to get the information from each of the inner <item> tags. An example of one of these is as follows:
<item>
<title>What to do when a viper bites you</title>
<itunes:subtitle/>
<itunes:summary/>
<!-- 4000 Characters Max ******** -->
<itunes:author>Ps. Phil Buechler</itunes:author>
<itunes:image href="http://www.c3carlingford.org.au/podcast/itunes_cover_art.jpg"/>
<enclosure url="http://www.ccccarlingford.org.au/podcast/C3C-20120722PM.mp3" length="14158931" type="audio/mpeg"/>
<guid isPermaLink="false">61bc701c-b374-40ea-bc36-6c1cdaae8042</guid>
<pubDate>Sun, 22 Jul 2012 19:30:00 +1100</pubDate>
<itunes:duration>40:01</itunes:duration>
<itunes:keywords>
Worship, Reach, Build, Holy Spirit, Worship, C3 Carlingford
</itunes:keywords>
</item>
I have had some success so far.
I have been able to get the title out:
<?php
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->load('http://c3carlingford.org.au/podcast/C3CiTunesFeed.xml');
$items = $dom->getElementsByTagName('item');
foreach($items as $item){
$title = $item->getElementsByTagName('title')->item(0)->nodeValue;
echo $title . '<br />';
};
?>
But I can't seem to get anything else out. I'm new to all this!
What I need to get out includes:
The <itunes:author> value.
The url attribute value from the <enclosure> tag.
Could someone help me get these two values out?

You can use DOMXPath to do this and make your life a lot easier:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadXML( $xml); // $xml = file_get_contents( "http://www.c3carlingford.org.au/podcast/C3CiTunesFeed.xml")
// Initialize XPath
$xpath = new DOMXpath( $doc);
// Register the itunes namespace
$xpath->registerNamespace( 'itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');
$items = $doc->getElementsByTagName('item');
foreach( $items as $item) {
$title = $xpath->query( 'title', $item)->item(0)->nodeValue;
$author = $xpath->query( 'itunes:author', $item)->item(0)->nodeValue;
$enclosure = $xpath->query( 'enclosure', $item)->item(0);
$url = $enclosure->attributes->getNamedItem('url')->value;
echo "$title - $author - $url\n";
}
For the example item above, this will output:
What to do when a viper bites you - Ps. Phil Buechler - http://www.ccccarlingford.org.au/podcast/C3C-20120722PM.mp3
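If you would rather stay with plain DOM methods and skip XPath, the same two values can be read with getElementsByTagNameNS() and getAttribute(). The following is only a minimal sketch: the inline $xml string is a cut-down, hypothetical copy of the item from the question so that it runs without fetching the feed, and $rows is just an illustrative variable name.

```php
<?php
// Hypothetical sample data: a cut-down copy of the <item> from the question.
$xml = <<<XML
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
  <channel>
    <item>
      <title>What to do when a viper bites you</title>
      <itunes:author>Ps. Phil Buechler</itunes:author>
      <enclosure url="http://www.ccccarlingford.org.au/podcast/C3C-20120722PM.mp3" length="14158931" type="audio/mpeg"/>
    </item>
  </channel>
</rss>
XML;

$itunesNs = 'http://www.itunes.com/dtds/podcast-1.0.dtd';

$dom = new DOMDocument();
$dom->loadXML($xml);

$rows = [];
foreach ($dom->getElementsByTagName('item') as $item) {
    // Namespaced element: match on namespace URI + local name, not on the prefix.
    $authorNode = $item->getElementsByTagNameNS($itunesNs, 'author')->item(0);
    $enclosureNode = $item->getElementsByTagName('enclosure')->item(0);

    $rows[] = [
        'title'  => $item->getElementsByTagName('title')->item(0)->nodeValue,
        'author' => $authorNode ? $authorNode->nodeValue : '',
        // getAttribute() returns an empty string when the attribute is absent.
        'url'    => $enclosureNode ? $enclosureNode->getAttribute('url') : '',
    ];
}

foreach ($rows as $row) {
    echo "{$row['title']} - {$row['author']} - {$row['url']}\n";
}
```

The null checks are there because item(0) returns null when an item has no matching element, which would otherwise cause an error on feeds with incomplete entries.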

Yes, you can do it with SimpleXML.
Here is some sample code:
<?php
$x = simplexml_load_file("http://c3carlingford.org.au/podcast/C3CiTunesFeed.xml");
foreach ($x->channel->item as $item) {
$otherNode = $item->children('http://www.itunes.com/dtds/podcast-1.0.dtd');
echo $item->title .'---'.$otherNode->author;
echo "\n";
}
?>
Output:
What to do when a viper bites you---Ps. Phil Buechler
Living Water, Let the River Flow!---Ps. Phil Buechler
The Call of God to Forgive One Another AM & PM---Ps. Richard Botta
The Call of God to Evangelise AM & PM---Rob Waugh
The Call of God to Love One Another AM & PM---Rob Waugh
Hope this helps!

You can use SimpleXML's children() method with the namespace prefix:
$item->children('itunes', TRUE);
This gives you an object holding all the itunes: elements (itunes:duration, itunes:subtitle, ...) as properties:
<?php
$x = simplexml_load_file("http://c3carlingford.org.au/podcast/C3CiTunesFeed.xml");
foreach ($x->channel->item as $item) {
$otherNode = $item->children('itunes', TRUE);
echo $otherNode->duration;
echo "\n";
echo $otherNode->author;
echo "\n";
echo $otherNode->subtitle;
echo "\n";
}
?>
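The question also asked for the url attribute of <enclosure>, which the SimpleXML answers do not show. As a rough sketch, assuming a cut-down inline copy of the feed (the $xml string and the $episodes array are illustrative, not part of the original code), attributes of a non-namespaced element can be read with array syntax:

```php
<?php
// Hypothetical sample data standing in for the real feed.
$xml = <<<XML
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
  <channel>
    <item>
      <title>What to do when a viper bites you</title>
      <itunes:author>Ps. Phil Buechler</itunes:author>
      <itunes:duration>40:01</itunes:duration>
      <enclosure url="http://www.ccccarlingford.org.au/podcast/C3C-20120722PM.mp3" length="14158931" type="audio/mpeg"/>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($xml);

$episodes = [];
foreach ($feed->channel->item as $item) {
    // Children in the itunes namespace, looked up by prefix (second argument true).
    $itunes = $item->children('itunes', true);

    $episodes[] = [
        'title'    => (string) $item->title,
        'author'   => (string) $itunes->author,
        'duration' => (string) $itunes->duration,
        // <enclosure> has no namespace, so its attributes are available with array syntax.
        'url'      => (string) $item->enclosure['url'],
    ];
}

print_r($episodes);
```

The (string) casts matter: without them each value stays a SimpleXMLElement object, which can surprise you when comparing values or storing them.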


How to retrieve an attribute from a namespaced element

I am trying to get the url attributes from the <media:content> elements in this RSS feed:
https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en
Here's what I have so far:
$feed_url = "https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en";
$rss = file_get_contents($feed_url);
$rss = new SimpleXMLElement($rss);
$items = $rss->channel->item;
foreach ($items as $item) {
print_r($item);
echo "<hr>";
}
This code works for all elements except the ones with a colon in the name (a namespace prefix), like <media:content> or <dc:contributor>. If I open the XML feed in my browser I can see the tag I am looking for:
<media:content url="https://lh6.googleusercontent.com/proxy/nxX8kqpFKSDvYg_bf_QrdsS0PYNMFPGspYmTlZlIo0IzyyhYhURxQc5nrpnzfrNBZkWQywioGXdPclazSIEwiz5wklsBePHOCft9qdHl2EmqIES_SMl5orim2xM2eHYalvIgFFeGYvp7cQaCQpKAObhPGQ--diqZg4Io3MSW8f6PXlRAbUcPvpDxB-KRqBj53bbROhoUYuqxkA=-w150-h150-c" medium="image" width="150" height="150"/>
</item>
I tried various solutions from other threads but it didn't work for me. Example:
$xml_object = $rss->channel->item[0];
$ns_media = $xml_object->children('http://search.yahoo.com/mrss/');
I don't know what I'm doing wrong, so I would appreciate some help.
You're missing a call to the attributes() method of your SimpleXMLElement instance:
foreach ($items as $item) {
$media_content_url = $item->children('http://search.yahoo.com/mrss/')->attributes()->url;
// ...
}
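Here is a self-contained sketch of that approach that names the content element explicitly and skips items without one. The inline feed and the example.com URL are made up for illustration; only the namespace URI comes from the question.

```php
<?php
// Hypothetical sample feed; the second item deliberately has no <media:content>.
$xml = <<<XML
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>First</title>
      <media:content url="https://example.com/a.jpg" medium="image" width="150" height="150"/>
    </item>
    <item>
      <title>No image</title>
    </item>
  </channel>
</rss>
XML;

$rss = new SimpleXMLElement($xml);

$urls = [];
foreach ($rss->channel->item as $item) {
    // Select the children that live in the media namespace.
    $media = $item->children('http://search.yahoo.com/mrss/');

    // isset() guards against items that carry no <media:content>.
    if (isset($media->content)) {
        // The url attribute itself is not namespaced, so attributes() needs no argument.
        $urls[] = (string) $media->content->attributes()->url;
    }
}

print_r($urls);
```

Note that print_r($item) will still not show the namespaced children; they only appear once you ask for them through children().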
I'm not familiar with SimpleXML, but this is simple enough to do with DomDocument:
$feed_url = "https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en";
$rss = file_get_contents($feed_url);
$dom = new DomDocument;
$dom->loadXML($rss);
$nodes = $dom->getElementsByTagNameNS("http://search.yahoo.com/mrss/", "content");
foreach ($nodes as $node) {
printf("%s\n", $node->getAttribute("url"));
}
Output:
https://lh6.googleusercontent.com/proxy/m7yanlDdWIjGc1XmsY6AHB5DqqJcgSe1Z7vs9DUC5NbD-FfQqJzEY8uIadNckLJFu7O6rcuh4W-CsXRg2vjr_KLOWhwNG5shhfdetcUkY5dMHa0uN1GBC5iY0svkP-Wxcm7JJ_kJMh6sctcvJ5Hfbb2Vor8KPlnYXUk_Y3jxYeCgmDBTqeRKwQ1pTMtWtJ_7fK5P5PSdKQKjUNnfVODZjHg_c4PwFWw3Cw=-w150-h150-c
https://lh3.googleusercontent.com/proxy/j7vDbXvscxGVLF8xo2DGkEgmgyQ9-u5vE0RWJjmAp84xOuy4v-Ff6cHADsLiC2Zd2KE7s04sCgtT_WNx4K5vxjDw_jbFRwQhlBgpL-YdXMgvDgakxzx8xWDO5bdpHaVssEGXgkxCnXnHXBRgb67vXeY6XnbgeEp7Fe5ohK1fpyk_hE3IYGyHdJnTxiH_=-w150-h150-c
https://lh6.googleusercontent.com/proxy/nxX8kqpFKSDvYg_bf_QrdsS0PYNMFPGspYmTlZlIo0IzyyhYhURxQc5nrpnzfrNBZkWQywioGXdPclazSIEwiz5wklsBePHOCft9qdHl2EmqIES_SMl5orim2xM2eHYalvIgFFeGYvp7cQaCQpKAObhPGQ--diqZg4Io3MSW8f6PXlRAbUcPvpDxB-KRqBj53bbROhoUYuqxkA=-w150-h150-c
https://lh3.googleusercontent.com/proxy/PbDyKTNQAyxkLNnyQFm00dHkNyKoASc3zKJjw7tjRtfmebHfbP_Ov_5RfcsG1RL8gyFaMSvVltd7IQJns6x_N_thPQTWz7E3ER0RlqLhZBYjmM-cp9xUkCdICiFyfkY0XGx-xGSh6zq5C_SpuzAxCVdhoOkqW4Lz_kyw-KN1fUJB6b8VgDGFvssIfmurSm3qCdJYeFJAx7x6lh_NQS1GNeNdbVBf0RoE2jiZfK6SYgFCX2s9KifQ7Sld_0plNrvTyW6VSR9D0AEwlBClWXNfoMmB_NYl4j03ELoUIjB0fRpUxV7YAqiIC1nSxqn92Q=-w150-h150-c
https://lh6.googleusercontent.com/proxy/DcJhP_BX_r7IbiwFgYt7MOL8RKQMizjCEWAn4YBcWdDy-PYOncpb_PrDZ0H5cMlxSFk9X8SQz5WEAX1xJRV4RWBiPwSH6uDJr7bt1Dh4H7MyYAaB_66BnASNA-fw0pszLPYEgfkwZsRyEZNT045MRYXA3Q=-w150-h150-c
https://lh5.googleusercontent.com/proxy/ECDRkzc4AO3OP1V0PNEVw1OhBYwuDRJVdDzF9lFNW34D8aNO8s54aWJfuR_LhDz-wKCRRvS64ggZnsg2UkCE5EYJghnBkQlpmwktgFYcKW0OnXP_-Ynh37EHR9nG9lyRoM1-3ebj=-w150-h150-c
https://lh6.googleusercontent.com/proxy/9xL1YVQyHg2mzitNgeiHRkjq_vJBxmOb0iAb1bBfJcqFLlWOJWkzRmLYrc9-hmK4nGcLnFfLMEb-bTnZmlWRM72_9ibysFUU_z-77ZK5PhX-f8vfIoWp=-w150-h150-c
https://lh5.googleusercontent.com/proxy/FcDG9_8xHTVUP2uHZ6cMAnIAdDxd-Kg29IksHUEDJCX8_mjTd2voG8BITnqpPEXKtFEImDogPfJfHNlqJr7X2I0VHlkesJvRnE2D-aLRxiNJfc2Lh6WIb0PrRy2nluPe6IJOhUulh0AzZ8JXJVBPgYnfeifItdhBsCTz7QtKGN4DbLZzDAVRL28mHNzaFBlCjCMGyhrbR3jmmlLWqj5K6lbfdoS0jxbLDKkNY8ywVq61rua1YHe-J6ZOY40ESL_0hf27KTgIJFSNq99yGX2sMw=-w150-h150-c
https://lh3.googleusercontent.com/proxy/ciJtTGr7RBlyJD2JGS-Ps4OUOcHxph6Csa1RCJOhcIjjqnjMHnqDzBU2MwEBoieNz37zmC69cPcoUi9696CfpMYh_cry5O6xmTT-1BnlyJhGMTeDR2mIbf3-3VEJ5YsNHmyozvClGvbR6_ZOaMgH0w3AWwf_bZppmqXp7Bul8rXDOkIDMeHrmKCpQcJff-lAbV7hnud3h0JH0--7zw5wCCOj57dNwIA=-w150-h150-c
https://lh5.googleusercontent.com/proxy/nyoQUYC-IPq8Td_GPYc60euyS5cgKwUk7ta1ISJl9wafgrGt1HkhVtPSpoO36KZl_8em4B9bBuP_OXmR0RZlGk1yLwcfAK42NknrGy5H0bLwJqouJ0sE0a21EwardDsVGe1XhGXETO92NfSG2Tikpl9pUFLiJEE-ySdL2d5CK9LA82P4DG6FM5eW=-w150-h150-c
https://lh3.googleusercontent.com/proxy/F2QF_T-xeBkwMyMljtdxwaXMQmNvyG9YCv2QmdBBSmNCe6okG7AIuElWuXXI45IjTA8fuyEZbeGEHBJdWIeyxcjiwapXjzAIxm1lxrmMdzLgJyD4C867KZtcTS1NTWyJebHY4u3gBQ2Z=-w150-h150-c
https://lh4.googleusercontent.com/proxy/rbbBxl7QWG4BBIlsvJUIfL3fr8j4f7L_LoRc3NvfWcNOGT7oX7U1_CpJ71oE04TD0Ax75WmDJrlBQNYPGcsQnOid874Z6P3nwNpdNtZlytX_6FlXqefr58IQ3fB-sivfI20EmQVnRfaBXUozjpHbjW225jeI1hxWc1U15MC_rMuJryLEC_CV4OLg=-w150-h150-c
https://lh5.googleusercontent.com/proxy/7TRaIMikpkiHsWKNenXItiKUUSnRhEl92XSOmDHl828VWobr_M8crMxMLvfG9BDbVc4SioxINBmDyDLhxyHiLlk_c_-6ocsm6ATjrUbWHuc-FTba=-w150-h150-c
https://lh5.googleusercontent.com/proxy/VDXcSNdIqLdhl6IFXyQlwOlzDlhPGaPTOWN0XMuX_DHozxXowzuQMWGAnFNIizgXavLZ5rVcw9rGl7NWTHMyRboJVqjzRRtoOs1GJCb0dylyUsHSt5qUSOZULdyBDPkW6HXVDHHyR40EeR4CS9nOX0M=-w150-h150-c
https://lh5.googleusercontent.com/proxy/YlbnCDloJxYZBVFhx-k2JZzYzFhcBA6DMAim5QTgNoyB4-Q_8DjgR2-JV73ARnUmpROHYfdZiKwfEUCdPB7tUSJ-uJuSRFgfQ_t8CV6rQ8zAXiIKOuoNO20AMArh5NXKr99nP_FoiOLf6mIEJw--URXUP0Tg4-i7bCXXdIPIvVpWNaDxKesNa1MRzIWkzHnYoGuy5QGL7byi32Y0ld8mHP2KFbXT2aga0f3S5rl-ikFkRxpaSxk0coE=-w150-h150-c
https://lh6.googleusercontent.com/proxy/XgDFopqdfCYmrzYi5NKRDYZYZgmJ2fyAn-9eN9QnXgmBezTviTAYV4ct07peVeZMerMND3ZwgZ-lK8Uv7B6FivV6LXqAJN4E7OVvtKYToSriuCRK4QOTuf6oFXG0KTetG-QJCoHZT77mWJtCGb0jG9tch59MJ0aWWZ1NA5X7wF4aMtoLSHkvAK2cuspI85CpPMj2ayu6wiG-0GT8fcAwRsVW64773g=-w150-h150-c
https://lh6.googleusercontent.com/proxy/WTq4Z72Acy7ykLcNmv9b_B2IQerfE4M7V00dxJ1o65IBz4OPyDzKkDBrVLAvqKdSjOHuTsHwTw5_UBINqKU3tFIOeEjCdsOs3yLkqG52YI-NGYvtw0EUohgu6Ps=-w150-h150-c
https://lh6.googleusercontent.com/proxy/uyiP15MOcfSPSHn6FU_LMK08w-0WiaDKQwC0e5rC67ZmdEqYqDYQDjHCkkH0UB0vxLRNpUw0jyz_YCvZ713hPuAeZM4xk7ZItIzNnkXRv_H-n8196YojFvVHZjbRYNZguxHtN7uAT1hxNvrRlRL-pdG9eiUs58rx_w=-w150-h150-c
https://lh4.googleusercontent.com/proxy/MBgkZaPJYX173801aVCotrRKb5pTCO3orWgu0cgsCuglj0bAQW7Za7HhaQUVk2NIUxg8MQz0qiizFGBTSZXdcUwpsmTPpz2rXYuZpqxgSAsxjZH4RUZW9P74EuM-_KzJQ7fKZx7sVK26kLxXr206i6DWH8LCCFa9utvSWevAS2OlYSYc17a09Z6Af5NJ7FoIJ6jxh1wnsuZhfwebq_4zQGDg4AF9XVrKayVrYExcBtx-kYqgjAqTiswm7YoO1yJGBUNkXh1aXC76C50WEOiYemdoXeygD1KIsRpExccCQZZ_NDjIgRMjBvPAGy-1Ue1xwCiXvgFu1P7AtZ2g0XA=-w150-h150-c
https://lh6.googleusercontent.com/proxy/9TiiPwVKm7hFE4kzBEeQ2nrwR8h3hjJ1KLMw6G5s-3_FzO0QorKfKYkubFvF-YDgMYyujUloeApsM00xuBWgQYJE4vRrcdXFoAik032DyRykwAYl7e87Qjqb2wnNMUpfmX1PTZo_OK6VC69sGQY9ISWevaI09tKI6w=-w150-h150-c
https://lh4.googleusercontent.com/proxy/bePdkudsqFAyihVJPg94KH4SKhVTwB0BSFiXEsCdQEZubli_o9RtbEctWtw1X5CC_x9JqM9vBPhWMyP29eBmkrCzi04osGEwiTOoaeLI3WhPl49w0UKeNvOONgkK3ZbPhJ4RFutptw71nhLoiWX2uAUDChy_zgxYmg=-w150-h150-c
https://lh5.googleusercontent.com/proxy/GgkEfaSJZU_ZrtKUpU3aqKBS-u1fUkz3wDJH7m3iPlFduw_C0yfprGaOGubYqxBdILL69inogeniN3zM83hh2EaBGS5wNJ8PfmSe1bUOy605NxMR19SPSgeLu0hGlrc_d16v-V1OtwxDt2SVDM382UQ=-w150-h150-c
https://lh3.googleusercontent.com/proxy/C04R8mKirMGoXC7SvbOwbMAh2BpjAUcqRhyFEoZhIg2bl55t5jgF6zLwXvAAe4O95ETW1fp1sIQSGgzCoxlFBp51LCEzyfvDiKkxt0LpYzHmNTeIxmGTlmBkRv4wRNGquW0ZBp1AWnjoqaGosgMWv0kQI6QTkgFTEI5tuhrLppr_0Xcfy4JNSqo8bSVxa_fb27Iih5Evf1RUSS6Umnc6wW3cHip0icT7QmfdebQs3LUvSUqHaaeDikc60NOQdZRX8tUcODic84RoOn41vM8NvBrZZuQgImC8GTPavaMTUIsEOK4QxSNUdxKduRcPpa0p=-w150-h150-c
https://lh4.googleusercontent.com/proxy/qpUSiYScbVAjAv63NKloqPadlLXD7xWo0eocfLMerlUozukyVTS4QWZYcJBPmkHuxJZCh85Zh1mEepVEeZH3JSMxQRrcE-4Apawmnw=-w150-h150-c
https://lh6.googleusercontent.com/proxy/CNvMvcbMHHckCXbyFXkRZnnuPr17TEzvspLGwIobu15dDsrlHt-3QKzi7kcHKvTpJZlCr2l-HhWOPBfJJzPRrd2gn734awWdE5jXRcqrnfly4bwnIPokO68_luWur73lg-k=-w150-h150-c
https://lh4.googleusercontent.com/proxy/eDks_caYi82LNyG2L2AMQENYCuL7LfaH5rhL-qT6QiVnb5r142EFLTe4u61mZf-1xE2UkJB9GfcUy4x5IfNOU-JCd78FMn--f1CldoEY9y7ouU5cdZ8=-w150-h150-c
https://lh6.googleusercontent.com/proxy/WsWb8Woo-ogEIuDkvaBpsxrizSDUwM_k6h9w1ma-d_i4f3c9Bpefe6llemcMlZNODP5hx_raBrZ6dlclfDXJpirHGgVuTFp3W_mCdrGWO1LCsQf6Nz3iyjgJIbFFv12K3rC9sy2sfV3kgpQRURxi50MwLLG4lUcDx8LIiHWk5bG-VR9IBpygAMPtL5LJoRN8fkg9Vh7RA-J8kkbDm8-xirGXhkYheENaly7yH3qpIo_3aBYrHzS1GsCULOfpjdEuw2OISw=-w150-h150-c

Get xml node full path in Php / SimpleXml

I need the full path of an xml node.
I saw the answer in this question but I wasn't able to use it.
Below the code I used on a php web tester with no success:
$xml = <<<EOF
<root>
<First>
<Martha>Text01</Martha>
<Lucy>Text02</Lucy>
<Bob>
<Jhon>Text03</Jhon>
</Bob>
<Frank>One</Frank>
<Jessy>Two</Jessy>
</First>
<Second>
<Mary>
<Jhon>Text04</Jhon>
<Frank>Text05</Frank>
<Jessy>Text06</Jessy>
</Mary>
</Second>
</root>
EOF;
$MyXml = new SimpleXMLElement($xml);
$Jhons = $MyXml->xpath('//Jhon');
foreach ($Jhons as $Jhon){
echo (string) $Jhon;
//No one of the following works
echo (string) $Jhon->xpath('./node()/path()');
echo (string) $Jhon->xpath('./path()');
echo (string) $Jhon->xpath('.path()');
echo (string) $Jhon->path();
echo '<br/> ';
}
I need: "/root/First/Bob/Jhon" and "/root/Second/Mary/Jhon"
You can use the much more powerful DOM API (DOMDocument in PHP) to do this:
$MyXml = new SimpleXMLElement($xml);
$Jhons = $MyXml->xpath('//Jhon');
foreach ($Jhons as $Jhon){
$dom = dom_import_simplexml($Jhon);
echo $dom->getNodePath().PHP_EOL;
}
The dom_import_simplexml($Jhon) converts the node and then getNodePath() displays the path...
This gives ( for the example)
/root/First/Bob/Jhon
/root/Second/Mary/Jhon
Or, if you just want to stick to SimpleXML, you can use the XPath axis ancestor-or-self to list the current node and each of its ancestors:
$MyXml = new SimpleXMLElement($xml);
$Jhons = $MyXml->xpath('//Jhon');
foreach ($Jhons as $Jhon){
$parent = $Jhon->xpath("ancestor-or-self::*");
foreach ( $parent as $p ) {
echo "/".$p->getName();
}
echo PHP_EOL;
}
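The ancestor-or-self loop can be wrapped in a small helper so the path comes back as a string instead of being echoed. This is only a sketch: simpleXmlNodePath is a made-up name, and the XML is a shortened version of the sample from the question.

```php
<?php
/**
 * Build a slash-separated path for a SimpleXML node.
 * Hypothetical helper; it does not disambiguate same-named siblings.
 */
function simpleXmlNodePath(SimpleXMLElement $node): string
{
    $names = [];
    // ancestor-or-self::* yields the root first and the node itself last.
    foreach ($node->xpath('ancestor-or-self::*') as $step) {
        $names[] = $step->getName();
    }
    return '/' . implode('/', $names);
}

// Shortened version of the sample document from the question.
$xml = '<root><First><Bob><Jhon>Text03</Jhon></Bob></First>'
     . '<Second><Mary><Jhon>Text04</Jhon></Mary></Second></root>';

$doc = new SimpleXMLElement($xml);

$paths = [];
foreach ($doc->xpath('//Jhon') as $jhon) {
    $paths[] = simpleXmlNodePath($jhon);
}

echo implode(PHP_EOL, $paths), PHP_EOL;
```

If two siblings share a name, this helper returns the same path for both; in that case the getNodePath() route through dom_import_simplexml() is probably the better fit.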

How can I get the Plain text AND the HTML of a DOM element created from XML?

We have thousands of Closed Caption XML files that we have to import to a database as plain text, as well as preserve the HTML markup for conversion to another CC format. I have been able to extract the plain text quite easily, but can't seem to find the correct way of extracting the raw HTML as well.
Is there a way to accomplish something like "->htmlContent" in the same way that ->textContent works below?
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
$xml = @file_get_contents('http://blah-blah-blah/16TH.xml', 0, $ctx);
$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName( "p" );
foreach( $ptags as $p ) {
$text = $p->textContent;
}
Typical <p> being processed:
<p begin="00:00:14.83" end="00:00:18.83" tts:textAlign="left">
<metadata ccrow="12" cccol="8"/>
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
</p>
Successful ->textContent Result
(male narrator) THE 16TH AND 17TH CENTURIES WERE THE FORMATIVE 200 YEARS
Desired HTML Result
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
In other words, you would like to save specific nodes: br elements and text nodes. You can do this with DOM + XPath:
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
Output:
string(86) "
(male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
"
The Xpath Expression
Any descendant br: .//br
Any descendant text node: .//text()
Combined expression: .//br|.//text()
Namespaces
If your XML uses namespaces, you have to register and use them.
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');
foreach ($xpath->evaluate('//tt:p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
I couldn't see the forest for the trees...quite a simple solution after I realized that strip_tags() was failing because of the closing tags of the BR tag:
foreach( $ptags as $p ) {
$text = $p->textContent;
$html = $p->ownerDocument->saveXML($p); // Raw HTML
$html = str_ireplace('<br></br>','<br>',$html); // Cleanup the BR usage
$html = strip_tags($html,'<br>'); // Strip the tags I don't need
}
There's likely a more elegant solution with the DOM, or with regex, but this did get it done.
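One possible DOM-only variant, sketched here under the assumption that <metadata> is the only child element to drop, walks the direct children of each <p> and serializes them one by one, so no string replacement or strip_tags() is needed. The inline $xml is a simplified copy of the sample <p> (the tts:textAlign attribute is left out to avoid declaring its namespace).

```php
<?php
// Simplified, hypothetical copy of the <p> element from the question.
$xml = '<p begin="00:00:14.83" end="00:00:18.83">'
     . '<metadata ccrow="12" cccol="8"/>'
     . '(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS'
     . '</p>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$results = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $markup = '';
    foreach ($p->childNodes as $child) {
        // Skip the <metadata> element; keep text nodes and <br>.
        if ($child->nodeName === 'metadata') {
            continue;
        }
        // saveXML() with a node argument serializes just that node.
        $markup .= $dom->saveXML($child);
    }
    $results[] = ['text' => $p->textContent, 'html' => $markup];
}

print_r($results);
```

saveXML() writes the empty <br></br> pair in its short form, <br/>, so the markup is already normalized when it comes out.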

DOMDocument removing html elements

Here is my code:
$text = '<div class="cgus_post"><div class="imgbox"><img src="/cgmedia/default.gif"></div>
<h2 id="post-15055">
Willie Nelson Celebrates 80th Birthday Stoned and Auditioning for Gandalf</h2>
<p>This video pretty much sums up why Willie Nelson is fucking awesome. Willie decided to celebrate his 80th birthday by recording an ‘audition’ for Peter Jackson. Willie wants to take the reigns from Ian McKellan in The Hobbit 2, and decided to show off his acting skills and give some of his own wizardly advice. The result is hilarious. Watch …</p>
<br class="clear">
</div>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'cgus_post';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach($nodes as $node){
echo $node->nodeValue;
}
The problem I am having is that I am querying for the div with the class cgus_post and it's returning just the text. How do I have it return the HTML elements as well?
Here's my innerHTML function that I always use:
function innerHTML(DOMNode $node, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($node->childNodes as $inner_node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($inner_node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
So then you do:
$dom = new DOMDocument();
$dom->loadHTML($html);
echo htmlentities(innerHTML($dom->documentElement->childNodes->item(0)->firstChild));
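On PHP 5.3.6 and later, saveHTML() accepts a node argument, so a temporary document is not strictly needed. The sketch below assumes a shortened, made-up version of the markup from the question and collects both the outer and the inner HTML of the matched element; $outer and $inner are illustrative names.

```php
<?php
// Shortened, hypothetical version of the markup from the question.
$html = '<div class="cgus_post"><h2 id="post-15055">Title</h2><p>Body text</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' cgus_post ')]"
);

$outer = '';
$inner = '';
foreach ($nodes as $node) {
    // Outer HTML: the matched element including its own tag.
    $outer .= $dom->saveHTML($node);

    // Inner HTML: serialize each child node in turn.
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $inner, "\n";
```

Use $outer if you want the wrapping div itself, and $inner if you only want what is inside it.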

PHP XML file filter on match

I am having a heck of a time getting this working...
What I want to do is filter an XML file by city (or market, in this case).
This is the xml data.
<itemset>
<item>
<id>2171</id>
<market>Vancouver</market>
<url>http://</url></item>
<item>
<id>2172</id>
<market>Toronto</market>
<url>http://</url></item>
<item>
<id>2171</id>
<market>Vancouver</market>
<url>http://</url></item>
This is my code...
<?php
$source = 'get-xml-feed.php.xml';
$xml = new SimpleXMLElement($source);
$result = $xml->xpath('//item/[contains(market, \'Toronto\')]');
while(list( , $node) = each($result)) {
echo '//Item/[contains(Market, \'Toronto\')]',$node,"\n";
}
?>
If I can get this working, I would like to access each element (item[0], item[1]) based on the filtered results.
Thanks
I think this implements what you are looking for using XPath:
<?php
$source = file_get_contents('get-xml-feed.php.xml');
$xml = new SimpleXMLElement($source);
foreach ($xml as $node)
{
$row = simplexml_load_string($node->asXML());
$result = $row->xpath("//item/market[.='Toronto']");
if (!empty($result))
{
var_dump($row);
}
}
?>
As another answer mentioned, unless you are wed to the use of XPath it's probably more trouble than it's worth for this application: just load the XML and treat the result as an array.
I propose using simplexml_load_file. The learning curve is less steep than using the specific XML objects + XPath. It returns an object in the format you describe.
Try this and you'll see what I mean:
<?php
$source = 'get-xml-feed.php.xml';
$xml = simplexml_load_file($source);
var_dump($xml);
?>
There is also simplexml_load_string if you just have an XML snippet.
<?php
$source = 'get-xml-feed.php.xml';
//$xml = new SimpleXMLElement($source);
$dom = new DOMDocument();
// @ suppresses parser warnings from the lenient HTML loader
@$dom->loadHTMLFile($source);
$xml = simplexml_import_dom($dom);
$result = $xml->xpath("//item/market[.='Toronto']/..");
// each() was removed in PHP 8, so iterate with foreach instead
foreach ($result as $node) {
print_r($node);
}
?>
This will get you the parent nodeset when it contains a node with "Toronto" in it. It returns $node as a simplexml element so you will have to deal with it accordingly (I just printed it as an array).
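The same filter can also be written with the predicate directly on item, which avoids the trailing /.. step. The sketch below assumes a well-formed inline copy of the data (the third id is changed to 2173 for illustration) and shows how the filtered results can be indexed like an array.

```php
<?php
// Hypothetical, well-formed sample data based on the question.
$xml = <<<XML
<itemset>
  <item><id>2171</id><market>Vancouver</market><url>http://</url></item>
  <item><id>2172</id><market>Toronto</market><url>http://</url></item>
  <item><id>2173</id><market>Vancouver</market><url>http://</url></item>
</itemset>
XML;

$itemset = simplexml_load_string($xml);
$market = 'Toronto';

// The predicate sits on <item>, so the matches are the <item> elements themselves.
$matches = $itemset->xpath("//item[market='$market']");

$ids = [];
foreach ($matches as $i => $item) {
    $ids[] = (string) $item->id;
    echo "item[$i]: {$item->id} {$item->market} {$item->url}\n";
}
```

xpath() returns a plain PHP array, so $matches[0], $matches[1] and count($matches) all work as expected. If $market ever comes from user input, it would need quoting or validation before being placed into the expression.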
