Parsing RSS with PHP - php

I'm trying to parse RSS: http://www.mlssoccer.com/rss/en.xml .
$feed = new DOMDocument();
$feed->load($url)
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
// do some stuff
}
The thing is: I'm getting "$title" and "$pubDate" without a problem, but for some reason "$description" is always empty, there's nothing in it. What could be the reason for such behaviour and how to fix it?

The problem was with CDATA you need to use textContent instead of nodeValue to retreive value beetween
<?php
$feed = new DOMDocument();
$feed->load('http://www.mlssoccer.com/rss/en.xml');
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
foreach($items as $key => $item)
{
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
$description = $item->getElementsByTagName('description')->item(0)->textContent; // textContent
}

Here can be whitespaces between the opening <description> tag and the opening <![CDATA[. This is a text node.
So if you access the firstChild of description, you might fetch that whitespace text node.
In a generic way you can set the DOMdocument to ignore whitespace nodes:
$feed = new DOMDocument();
$feed->preserveWhiteSpace = FALSE;
$feed->load($url);
Additionally you should check out XPath, it makes reading a DOM much easier:
$xpath = new DOMXpath($feed);
foreach ($xpath->evaluate('//channel/item') as $item) {
$title = $xpath->evaluate('string(title)', $item);
$pubDate = $xpath->evaluate('string(pubDate)', $item);
$description = $xpath->evaluate('string(description)', $item);
// do some stuff
var_dump([$title, $pubData, $description]);
}

Related

How to retrieve an attribute from a namespaced element

I am trying to get the url attributes from the <media:content> elements in this RSS feed:
https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en
Here's what I have so far:
$feed_url = "https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en";
$rss = file_get_contents($feed_url);
$rss = new SimpleXMLElement($rss);
$items = $rss->channel->item;
foreach ($items as $item) {
print_r($item);
echo "<hr>";
}
This code works for all elements except the ones with a semicolon in the name, like <media:content> or <dc:contributor>. If I open the XML feed in my browser I can see the tag I am looking for:
<media:content url="https://lh6.googleusercontent.com/proxy/nxX8kqpFKSDvYg_bf_QrdsS0PYNMFPGspYmTlZlIo0IzyyhYhURxQc5nrpnzfrNBZkWQywioGXdPclazSIEwiz5wklsBePHOCft9qdHl2EmqIES_SMl5orim2xM2eHYalvIgFFeGYvp7cQaCQpKAObhPGQ--diqZg4Io3MSW8f6PXlRAbUcPvpDxB-KRqBj53bbROhoUYuqxkA=-w150-h150-c" medium="image" width="150" height="150"/>
</item>
I tried various solutions from other threads but it didn't work for me. Example:
$xml_object = $rss->channel->item[0];
$ns_media = $xml_object->children('http://search.yahoo.com/mrss/');
I don't know what I'm doing wrong so I would appreciate some help
You're missing a call to the attributes() method of your SimpleXMLElement instance:
foreach ($items as $item) {
$media_content_url = $item->children('http://search.yahoo.com/mrss/')->attributes()->url;
// ...
}
I'm not familiar with SimpleXML, but this is simple enough to do with DomDocument:
$feed_url = "https://news.google.com/rss/search?q=test&as_qdr=w1&scoring=n&num=100&hl=en-CA&gl=CA&ceid=CA:en";
$rss = file_get_contents($feed_url);
$dom = new DomDocument;
$dom->loadXML($rss);
$nodes = $dom->getElementsByTagNameNS("http://search.yahoo.com/mrss/", "content");
foreach ($nodes as $node) {
printf("%s\n", $node->getAttribute("url"));
}
Output:
https://lh6.googleusercontent.com/proxy/m7yanlDdWIjGc1XmsY6AHB5DqqJcgSe1Z7vs9DUC5NbD-FfQqJzEY8uIadNckLJFu7O6rcuh4W-CsXRg2vjr_KLOWhwNG5shhfdetcUkY5dMHa0uN1GBC5iY0svkP-Wxcm7JJ_kJMh6sctcvJ5Hfbb2Vor8KPlnYXUk_Y3jxYeCgmDBTqeRKwQ1pTMtWtJ_7fK5P5PSdKQKjUNnfVODZjHg_c4PwFWw3Cw=-w150-h150-c
https://lh3.googleusercontent.com/proxy/j7vDbXvscxGVLF8xo2DGkEgmgyQ9-u5vE0RWJjmAp84xOuy4v-Ff6cHADsLiC2Zd2KE7s04sCgtT_WNx4K5vxjDw_jbFRwQhlBgpL-YdXMgvDgakxzx8xWDO5bdpHaVssEGXgkxCnXnHXBRgb67vXeY6XnbgeEp7Fe5ohK1fpyk_hE3IYGyHdJnTxiH_=-w150-h150-c
https://lh6.googleusercontent.com/proxy/nxX8kqpFKSDvYg_bf_QrdsS0PYNMFPGspYmTlZlIo0IzyyhYhURxQc5nrpnzfrNBZkWQywioGXdPclazSIEwiz5wklsBePHOCft9qdHl2EmqIES_SMl5orim2xM2eHYalvIgFFeGYvp7cQaCQpKAObhPGQ--diqZg4Io3MSW8f6PXlRAbUcPvpDxB-KRqBj53bbROhoUYuqxkA=-w150-h150-c
https://lh3.googleusercontent.com/proxy/PbDyKTNQAyxkLNnyQFm00dHkNyKoASc3zKJjw7tjRtfmebHfbP_Ov_5RfcsG1RL8gyFaMSvVltd7IQJns6x_N_thPQTWz7E3ER0RlqLhZBYjmM-cp9xUkCdICiFyfkY0XGx-xGSh6zq5C_SpuzAxCVdhoOkqW4Lz_kyw-KN1fUJB6b8VgDGFvssIfmurSm3qCdJYeFJAx7x6lh_NQS1GNeNdbVBf0RoE2jiZfK6SYgFCX2s9KifQ7Sld_0plNrvTyW6VSR9D0AEwlBClWXNfoMmB_NYl4j03ELoUIjB0fRpUxV7YAqiIC1nSxqn92Q=-w150-h150-c
https://lh6.googleusercontent.com/proxy/DcJhP_BX_r7IbiwFgYt7MOL8RKQMizjCEWAn4YBcWdDy-PYOncpb_PrDZ0H5cMlxSFk9X8SQz5WEAX1xJRV4RWBiPwSH6uDJr7bt1Dh4H7MyYAaB_66BnASNA-fw0pszLPYEgfkwZsRyEZNT045MRYXA3Q=-w150-h150-c
https://lh5.googleusercontent.com/proxy/ECDRkzc4AO3OP1V0PNEVw1OhBYwuDRJVdDzF9lFNW34D8aNO8s54aWJfuR_LhDz-wKCRRvS64ggZnsg2UkCE5EYJghnBkQlpmwktgFYcKW0OnXP_-Ynh37EHR9nG9lyRoM1-3ebj=-w150-h150-c
https://lh6.googleusercontent.com/proxy/9xL1YVQyHg2mzitNgeiHRkjq_vJBxmOb0iAb1bBfJcqFLlWOJWkzRmLYrc9-hmK4nGcLnFfLMEb-bTnZmlWRM72_9ibysFUU_z-77ZK5PhX-f8vfIoWp=-w150-h150-c
https://lh5.googleusercontent.com/proxy/FcDG9_8xHTVUP2uHZ6cMAnIAdDxd-Kg29IksHUEDJCX8_mjTd2voG8BITnqpPEXKtFEImDogPfJfHNlqJr7X2I0VHlkesJvRnE2D-aLRxiNJfc2Lh6WIb0PrRy2nluPe6IJOhUulh0AzZ8JXJVBPgYnfeifItdhBsCTz7QtKGN4DbLZzDAVRL28mHNzaFBlCjCMGyhrbR3jmmlLWqj5K6lbfdoS0jxbLDKkNY8ywVq61rua1YHe-J6ZOY40ESL_0hf27KTgIJFSNq99yGX2sMw=-w150-h150-c
https://lh3.googleusercontent.com/proxy/ciJtTGr7RBlyJD2JGS-Ps4OUOcHxph6Csa1RCJOhcIjjqnjMHnqDzBU2MwEBoieNz37zmC69cPcoUi9696CfpMYh_cry5O6xmTT-1BnlyJhGMTeDR2mIbf3-3VEJ5YsNHmyozvClGvbR6_ZOaMgH0w3AWwf_bZppmqXp7Bul8rXDOkIDMeHrmKCpQcJff-lAbV7hnud3h0JH0--7zw5wCCOj57dNwIA=-w150-h150-c
https://lh5.googleusercontent.com/proxy/nyoQUYC-IPq8Td_GPYc60euyS5cgKwUk7ta1ISJl9wafgrGt1HkhVtPSpoO36KZl_8em4B9bBuP_OXmR0RZlGk1yLwcfAK42NknrGy5H0bLwJqouJ0sE0a21EwardDsVGe1XhGXETO92NfSG2Tikpl9pUFLiJEE-ySdL2d5CK9LA82P4DG6FM5eW=-w150-h150-c
https://lh3.googleusercontent.com/proxy/F2QF_T-xeBkwMyMljtdxwaXMQmNvyG9YCv2QmdBBSmNCe6okG7AIuElWuXXI45IjTA8fuyEZbeGEHBJdWIeyxcjiwapXjzAIxm1lxrmMdzLgJyD4C867KZtcTS1NTWyJebHY4u3gBQ2Z=-w150-h150-c
https://lh4.googleusercontent.com/proxy/rbbBxl7QWG4BBIlsvJUIfL3fr8j4f7L_LoRc3NvfWcNOGT7oX7U1_CpJ71oE04TD0Ax75WmDJrlBQNYPGcsQnOid874Z6P3nwNpdNtZlytX_6FlXqefr58IQ3fB-sivfI20EmQVnRfaBXUozjpHbjW225jeI1hxWc1U15MC_rMuJryLEC_CV4OLg=-w150-h150-c
https://lh5.googleusercontent.com/proxy/7TRaIMikpkiHsWKNenXItiKUUSnRhEl92XSOmDHl828VWobr_M8crMxMLvfG9BDbVc4SioxINBmDyDLhxyHiLlk_c_-6ocsm6ATjrUbWHuc-FTba=-w150-h150-c
https://lh5.googleusercontent.com/proxy/VDXcSNdIqLdhl6IFXyQlwOlzDlhPGaPTOWN0XMuX_DHozxXowzuQMWGAnFNIizgXavLZ5rVcw9rGl7NWTHMyRboJVqjzRRtoOs1GJCb0dylyUsHSt5qUSOZULdyBDPkW6HXVDHHyR40EeR4CS9nOX0M=-w150-h150-c
https://lh5.googleusercontent.com/proxy/YlbnCDloJxYZBVFhx-k2JZzYzFhcBA6DMAim5QTgNoyB4-Q_8DjgR2-JV73ARnUmpROHYfdZiKwfEUCdPB7tUSJ-uJuSRFgfQ_t8CV6rQ8zAXiIKOuoNO20AMArh5NXKr99nP_FoiOLf6mIEJw--URXUP0Tg4-i7bCXXdIPIvVpWNaDxKesNa1MRzIWkzHnYoGuy5QGL7byi32Y0ld8mHP2KFbXT2aga0f3S5rl-ikFkRxpaSxk0coE=-w150-h150-c
https://lh6.googleusercontent.com/proxy/XgDFopqdfCYmrzYi5NKRDYZYZgmJ2fyAn-9eN9QnXgmBezTviTAYV4ct07peVeZMerMND3ZwgZ-lK8Uv7B6FivV6LXqAJN4E7OVvtKYToSriuCRK4QOTuf6oFXG0KTetG-QJCoHZT77mWJtCGb0jG9tch59MJ0aWWZ1NA5X7wF4aMtoLSHkvAK2cuspI85CpPMj2ayu6wiG-0GT8fcAwRsVW64773g=-w150-h150-c
https://lh6.googleusercontent.com/proxy/WTq4Z72Acy7ykLcNmv9b_B2IQerfE4M7V00dxJ1o65IBz4OPyDzKkDBrVLAvqKdSjOHuTsHwTw5_UBINqKU3tFIOeEjCdsOs3yLkqG52YI-NGYvtw0EUohgu6Ps=-w150-h150-c
https://lh6.googleusercontent.com/proxy/uyiP15MOcfSPSHn6FU_LMK08w-0WiaDKQwC0e5rC67ZmdEqYqDYQDjHCkkH0UB0vxLRNpUw0jyz_YCvZ713hPuAeZM4xk7ZItIzNnkXRv_H-n8196YojFvVHZjbRYNZguxHtN7uAT1hxNvrRlRL-pdG9eiUs58rx_w=-w150-h150-c
https://lh4.googleusercontent.com/proxy/MBgkZaPJYX173801aVCotrRKb5pTCO3orWgu0cgsCuglj0bAQW7Za7HhaQUVk2NIUxg8MQz0qiizFGBTSZXdcUwpsmTPpz2rXYuZpqxgSAsxjZH4RUZW9P74EuM-_KzJQ7fKZx7sVK26kLxXr206i6DWH8LCCFa9utvSWevAS2OlYSYc17a09Z6Af5NJ7FoIJ6jxh1wnsuZhfwebq_4zQGDg4AF9XVrKayVrYExcBtx-kYqgjAqTiswm7YoO1yJGBUNkXh1aXC76C50WEOiYemdoXeygD1KIsRpExccCQZZ_NDjIgRMjBvPAGy-1Ue1xwCiXvgFu1P7AtZ2g0XA=-w150-h150-c
https://lh6.googleusercontent.com/proxy/9TiiPwVKm7hFE4kzBEeQ2nrwR8h3hjJ1KLMw6G5s-3_FzO0QorKfKYkubFvF-YDgMYyujUloeApsM00xuBWgQYJE4vRrcdXFoAik032DyRykwAYl7e87Qjqb2wnNMUpfmX1PTZo_OK6VC69sGQY9ISWevaI09tKI6w=-w150-h150-c
https://lh4.googleusercontent.com/proxy/bePdkudsqFAyihVJPg94KH4SKhVTwB0BSFiXEsCdQEZubli_o9RtbEctWtw1X5CC_x9JqM9vBPhWMyP29eBmkrCzi04osGEwiTOoaeLI3WhPl49w0UKeNvOONgkK3ZbPhJ4RFutptw71nhLoiWX2uAUDChy_zgxYmg=-w150-h150-c
https://lh5.googleusercontent.com/proxy/GgkEfaSJZU_ZrtKUpU3aqKBS-u1fUkz3wDJH7m3iPlFduw_C0yfprGaOGubYqxBdILL69inogeniN3zM83hh2EaBGS5wNJ8PfmSe1bUOy605NxMR19SPSgeLu0hGlrc_d16v-V1OtwxDt2SVDM382UQ=-w150-h150-c
https://lh3.googleusercontent.com/proxy/C04R8mKirMGoXC7SvbOwbMAh2BpjAUcqRhyFEoZhIg2bl55t5jgF6zLwXvAAe4O95ETW1fp1sIQSGgzCoxlFBp51LCEzyfvDiKkxt0LpYzHmNTeIxmGTlmBkRv4wRNGquW0ZBp1AWnjoqaGosgMWv0kQI6QTkgFTEI5tuhrLppr_0Xcfy4JNSqo8bSVxa_fb27Iih5Evf1RUSS6Umnc6wW3cHip0icT7QmfdebQs3LUvSUqHaaeDikc60NOQdZRX8tUcODic84RoOn41vM8NvBrZZuQgImC8GTPavaMTUIsEOK4QxSNUdxKduRcPpa0p=-w150-h150-c
https://lh4.googleusercontent.com/proxy/qpUSiYScbVAjAv63NKloqPadlLXD7xWo0eocfLMerlUozukyVTS4QWZYcJBPmkHuxJZCh85Zh1mEepVEeZH3JSMxQRrcE-4Apawmnw=-w150-h150-c
https://lh6.googleusercontent.com/proxy/CNvMvcbMHHckCXbyFXkRZnnuPr17TEzvspLGwIobu15dDsrlHt-3QKzi7kcHKvTpJZlCr2l-HhWOPBfJJzPRrd2gn734awWdE5jXRcqrnfly4bwnIPokO68_luWur73lg-k=-w150-h150-c
https://lh4.googleusercontent.com/proxy/eDks_caYi82LNyG2L2AMQENYCuL7LfaH5rhL-qT6QiVnb5r142EFLTe4u61mZf-1xE2UkJB9GfcUy4x5IfNOU-JCd78FMn--f1CldoEY9y7ouU5cdZ8=-w150-h150-c
https://lh6.googleusercontent.com/proxy/WsWb8Woo-ogEIuDkvaBpsxrizSDUwM_k6h9w1ma-d_i4f3c9Bpefe6llemcMlZNODP5hx_raBrZ6dlclfDXJpirHGgVuTFp3W_mCdrGWO1LCsQf6Nz3iyjgJIbFFv12K3rC9sy2sfV3kgpQRURxi50MwLLG4lUcDx8LIiHWk5bG-VR9IBpygAMPtL5LJoRN8fkg9Vh7RA-J8kkbDm8-xirGXhkYheENaly7yH3qpIo_3aBYrHzS1GsCULOfpjdEuw2OISw=-w150-h150-c

Getting data from HTML using DOMDocument

I'm trying to get data from HTML using DOM. I can get some data, but can't figure out how to get the rest. Here is an image highlighting the data I want.
http://i.imgur.com/Es51s5s.png
here is the code itself
http://pastebin.com/Re8qEivv
and here my PHP code
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
foreach ($td1->childNodes as $node){
$title = $node->textContent;
}
foreach ($td2->childNodes as $node){
$type = $node->textContent;
}
}
Figured it out
$html = file_get_contents('result.html');
$dom = new DOMDocument;
$dom->loadHTML($html);
$tr = $dom->getElementsByTagName('tr');
foreach ($tr as $row){
$td = $row->getElementsByTagName('td');
$td1 = $td->item(1);
$td2 = $td->item(2);
$title = $td1->childNodes->item(0)->textContent;
$firstURL = $td1->getElementsByTagName('a')->item(0)->getAttribute('href');
$type = $td2->childNodes->item(0)->textContent;
$imageURL = $td2->getElementsByTagName('img')->item(0)->getAttribute('src');
}
I have used following class.
http://sourceforge.net/projects/simplehtmldom/
This is very simple and easy to use class.
You can use
$html->find('#RosterReport > tbody', 0);
to find specific table
$html->find('tr')
$html->find('td')
to find table rows or columns
Note $html is variable have full html dom content.

PHP nodevalue stripping html tags

I have seem similar solutions else where but I haven't been able to convert to work with my own code.
I have a function that splits an html string between the paragraph tags and returns in an array. Code is as follows...
$dom = new DOMDocument();
$dom->loadHTML($string);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//p");
$result = array();
foreach ($entries as $entry) {
$result[] = '<' . $entry->tagName . '>' . $entry->nodeValue . '</' . $entry->tagName . '>';
}
return $result;
Can someone assist me to remove the nodeValue element from this so it returns the paragraph content with html tags complete?
The html I am testing against is this: http://adam-makes-websites.com/tests/htmltest/test.html
A full test of what im doing with the code (as it stands with the suggestion to use ownerDocument->saveHTML applied) is here: http://adam-makes-websites.com/tests/htmltest/runtest.txt
The output from the test can be seen here: http://adam-makes-websites.com/tests/htmltest/runtest.php
You need to call saveHTML on the ownerDocument property:
$result[] = $entry->ownerDocument->saveHTML($entry);
$dom = new DOMDocument();
$dom->loadHTML($string);
$entries = $dom->getElementsByTagName('p');
$new_dom = new DOMDocument();
foreach ($entries as $entry) {
$new_dom->appendChild($new_dom->importNode($entry, TRUE));
}
$result = $new_dom->saveHTML()

looking to loop for 2 element in the same time (php /xpath )

I'm trying to extract 2 elements using PHP Curl and Xpath!
So far have the element separated in foreach but I would like to have them in the same time:
#$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[#class='row']/a/#href");
//$elements = $xpath->query("//p[#class='row']/a");
foreach ($elements as $element) {
$url = $element->nodeValue;
//$title = $element->nodeValue;
}
When I echo each one out of the foreach I only get 1 element and when its echoed inside the foreach i get all of them.
My question is how can I get them both at the same time (url and title ) and whats the best way to add them into myqsl using pdo.
thank you
There is no need, in this case, to use XPath twice. You could do one query and navigate to the associated other node(s).
For example, find all of the hrefs that you are interested in and get their ownerElement's (the <a>) node value.
$hrefs = $xpath->query("//p[#class='row']/a/#href");
foreach ($hrefs as $href) {
$url = $href->value;
$title = $href->ownerElement->nodeValue;
// Insert into db here
}
Or, find all of the <a>s that you are interested in and get their href attributes.
$anchors = $xpath->query("//p[#class='row']/a[#href]");
foreach ($anchors as $anchor) {
$url = $anchor->getAttribute("href");
$title = $anchor->nodeValue;
// Insert into db here
}
You're overwriting $url on each iteration. Maybe use an array?
#$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->evaluate("//p[#class='row']/a/#href");
//$elements = $xpath->query("//p[#class='row']/a");
$urls = array();
foreach ($elements as $element){
array_push($urls, $element->nodeValue);
//$title = $element->nodeValue;
}

Trying to use PHP DOM to replace node text without changing child nodes

I am trying to use the dom object to simplify the implementation of a glossary tooltip. What I need to do is to replace a text element in a paragraph, but NOT in an anchor tag that may be embedded in the paragraph.
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementByTagName("p");
foreach ($nodes as $node) {
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
I get:
'...<p>Replace this element not this element</p>...'
I want:
'...<p>Replace this element not this tag</p>...'
How do I implement this such that only the parent node text is changed and the child node (a tag) is not changed?
Try this:
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
Hope this helps.
UPDATE
To answer #paul's question in the comments below, you can create
$html = '<p>Replace this tag not this tag</p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
//create the element which should replace the text in the original string
$elem = $document->createElement( 'dfn', 'tag' );
$attr = $document->createAttribute('title');
$attr->value = 'element';
$elem->appendChild( $attr );
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
//dump the new string here, which replaces the source string
$node->nodeValue = str_replace("tag",$document->saveHTML($elem),$node->nodeValue);
}
echo $document->saveHTML();

Categories