PHP - RSS Parser XML - php

Question: How to Parse <media:content URL="IMG" /> from XML?
OK. This is like asking why 1+1 = 2. And 2+2=Not Available.
Original link:
How to Parse XML With SimpleXML and PHP // By: John Morris.
https://www.youtube.com/watch?v=_1F1Iq1IIS8
Using his method I can easily reach the items in the New York Times RSS feed with the following code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>How to Parse XML with SimpleXML and PHP</title>
</head>
<body>
<?php
$url = 'http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml';
$xml = simplexml_load_file($url) or die("Can't connect to URL");
?><pre><?php //print_r($xml); ?></pre><?php
foreach ($xml->channel->item as $item) {
printf('<li><a href="%s">%s</a></li>', $item->link, $item->title);
}
?>
</body>
</html>
GIVES:
Sparky Lyle in Monument Park? Fans Say Yes, but He Disagrees
The Thickly Accented American Behind the N.B.A. in France
On Pro Basketball: ‘That Got Ugly in a Hurry’: More Playoff Pain Delivered by the Spurs
...
But to reach media:content, simplexml_load_file alone does not seem to be enough: the media:content tags do not show up in the parsed result.
So I searched around on the web and found this example on Stack Overflow:
get media:description and media:content url from xml
But using the Code:
<?php
function feeds()
{
$url = "http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml"; // xmld.xml contains above data
$feeds = file_get_contents($url);
$rss = simplexml_load_string($feeds);
foreach($rss->channel->item as $entry) {
if($entry->children('media', true)->content->attributes()) {
$md = $entry->children('media', true)->content->attributes();
print_r("$md->url");
}
}
}
?>
This gave me no errors, but also a blank page.
It seems most people (from my googling) have little idea how to really use media:content, so I am turning to Stack Overflow in the hope that someone can provide an answer. I'm even willing to not use SimpleXML.
What I want is to grab the media:content URLs of the images and use them on an external site.
Also, if possible, I would like to put the parsed XML items into a SQL database.

I came up with this:
<?php
$url = "http://rss.nytimes.com/services/xml/rss/nyt/Sports.xml"; // xmld.xml contains above data
$feeds = file_get_contents($url);
$rss = simplexml_load_string($feeds);
$items = [];
foreach($rss->channel->item as $entry) {
$image = 'N/A';
$description = 'N/A';
foreach ($entry->children('media', true) as $k => $v) {
$attributes = $v->attributes();
if ($k == 'content') {
if (property_exists($attributes, 'url')) {
$image = $attributes->url;
}
}
if ($k == 'description') {
$description = $v;
}
}
$items[] = [
'link' => $entry->link,
'title' => $entry->title,
'image' => $image,
'description' => $description,
];
}
print_r($items);
?>
Giving:
Array
(
[0] => Array
(
[link] => SimpleXMLElement Object
(
[0] => https://www.nytimes.com/2017/04/17/sports/basketball/a-court-used-for-playing-hoops-since-1893-where-paris.html?partner=rss&emc=rss
)
[title] => SimpleXMLElement Object
(
[0] => A Court Used for Playing Hoops Since 1893. Where? Paris.
)
[image] => SimpleXMLElement Object
(
[0] => https://static01.nyt.com/images/2017/04/05/sports/basketball/05oldcourt10/05oldcourt10-moth-v13.jpg
)
[description] => SimpleXMLElement Object
(
[0] => The Y.M.C.A. in Paris says its basketball court, with its herringbone pattern and loose slats, is the oldest one in the world. It has been continuously functional since the building opened in 1893.
)
)
.....
And you can iterate over the items:
foreach ($items as $item) {
printf('<img src="%s">', $item['image']);
printf('<a href="%s">%s</a>', $item['link'], $item['title']);
}
Hope this helps.
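One likely reason the snippet in the question printed nothing is that feeds() is defined but never called. The answer's loop can also be written with string casts so the array holds plain strings rather than SimpleXMLElement objects, which makes the rows easy to insert into a database. The following is only a sketch under assumptions: the inline feed is a made-up stand-in for the live NYT feed, the items table and its columns are hypothetical, and it assumes the pdo_sqlite extension is available.

```php
<?php
// Inline sample standing in for the live feed (an assumption, so the sketch runs offline).
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample title</title>
      <link>https://example.com/story</link>
      <media:content url="https://example.com/image.jpg" medium="image"/>
      <media:description>Sample description</media:description>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($feed);
$items = [];
foreach ($rss->channel->item as $entry) {
    // children('media', true) selects the child elements that use the "media" prefix.
    $media = $entry->children('media', true);
    $items[] = [
        'link'        => (string) $entry->link,
        'title'       => (string) $entry->title,
        'image'       => isset($media->content) ? (string) $media->content->attributes()->url : 'N/A',
        'description' => isset($media->description) ? (string) $media->description : 'N/A',
    ];
}

// Hypothetical schema: an in-memory SQLite table, filled through a prepared statement.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE items (link TEXT PRIMARY KEY, title TEXT, image TEXT, description TEXT)');
$stmt = $pdo->prepare(
    'INSERT INTO items (link, title, image, description) VALUES (:link, :title, :image, :description)'
);
foreach ($items as $item) {
    $stmt->execute($item); // array keys map to the named placeholders
}

print_r($items);
```

For a real database you would swap the DSN for your own connection string; the prepared statement keeps feed content from being interpreted as SQL.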

Related

How to parse html inside xml's tag

I need help getting data from a description tag that contains <a>, <img> and some text. The XML I am trying to parse is this
I managed to get all the data I need, except for the description tag, where I got the <a> tag along with the description text. What I need is the img's src and the description text.
My code :
foreach ($rss->getElementsByTagName('item') as $node) {
/*$test = $node->getElementsByTagName('description');
$test = $test->item(0)->textContent;*/
var_dump($test);
exit;
$nodes = $node->getElementsByTagName('content');
if(!is_object($nodes) || $nodes === null || $nodes->length==0){
$linkthumbNode = $node->getElementsByTagName('image');
if(isset($linkthumbNode) && $linkthumbNode->length >0){
$linkthumb=$linkthumbNode->item(0)->nodeValue;
if(empty($linkthumb)||$linkthumb == " "){
$linkthumb = $linkthumbNode->item(0)->getAttribute('src');
}
}else{
$linkthumb = "NO IMAGE";
}
}else{
$linkthumb = $nodes->item(0)->getAttribute('url');
}
$title = $node->getElementsByTagName('title')->item(0)->nodeValue;
$desc = $node->getElementsByTagName('description')->item(0)->textContent;
$link = $node->getElementsByTagName('link')->item(0)->nodeValue;
$img = $linkthumb;
$date = $node->getElementsByTagName('pubDate');
if(isset($date) && $date->length >0){
$date = $date->item(0)->nodeValue;
}else{
$date = "no date provided";
}
$item = array (
'title' => $title,
'desc' => $desc,
'link' => $link,
'img' => $img,
'date' => $date,
);
array_push($feed, $item);
}
the xml description tag is :
<description>
<img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="http://timesofindia.indiatimes.com/photo/20984744.cms" />Nine food combinations that will make staying healthy and looking fit easier
</description>
what I need: http://timesofindia.indiatimes.com/photo/20984744.cms as image and Nine food combinations that will make staying healthy and looking fit easier as my description.
Can someone help me? I'm not that great at PHP and parsing XML.
Maybe I am a little late to the party, but if an answer is still needed, check out my solution. I use PHP DOMDocument and regular expressions since I haven't found a simple way to get the needed data using only XML-extensions.
$rss = file_get_contents('https://timesofindia.indiatimes.com/rssfeeds/2886704.cms');
$feed = new DOMDocument();
$feed->loadXML($rss);
$items = array();
foreach($feed->getElementsByTagName('item') as $item) {
$arr = array();
foreach($item->childNodes as $child) {
if($child->nodeName === 'title' || $child->nodeName === 'link') $arr[$child->nodeName] = $child->nodeValue;
if($child->nodeName === 'pubDate') $arr['date'] = $child->nodeValue;
if($child->nodeName === 'description') {
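// [^\'\"]+ stops at the closing quote instead of running on to the last quote in the string
if (preg_match('/(?<=src=[\'\"])[^\'\"]+/i', $child->nodeValue, $matches)) {
$arr['img'] = $matches[0];
}
if (preg_match('/[^>]+$/', $child->nodeValue, $matches)) {
$arr['desc'] = trim($matches[0]);
}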
}
}
array_push($items, $arr);
}
print_r($items);
The output is like this and seems to be what you needed:
Array ( [0] => Array ( [title] => 5 reasons you get sore after sex [img] => https://timesofindia.indiatimes.com/photo/61101815.cms [desc] => Sometimes, a super-filmy, almost-perfect sex leaves you all euphoric but only to end with soreness later. So, what is it that is going wrong? Can it be remedied? [link] => https://timesofindia.indiatimes.com/life-style/health-fitness/health-news/5-reasons-you-get-sore-after-sex/life-style/health-fitness/health-news/5-reasons-you-get-sore-after-sex/photostory/61101724.cms [date] => Mon, 16 Oct 2017 10:21:27 GMT )...
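As an alternative to the regular expressions, the description's HTML fragment can itself be loaded into a second DOMDocument. This is a rough sketch, not the answerer's method: splitDescription is a hypothetical helper name, and it assumes the description has already been read as a string (for example from $child->nodeValue).

```php
<?php
// Hypothetical helper: splits a description fragment into image URL and plain text.
function splitDescription(string $descriptionHtml): array
{
    $doc = new DOMDocument();
    // Real-world fragments are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Wrap in a <div> so the fragment has a single root. Note that loadHTML
    // assumes ISO-8859-1 unless told otherwise, so non-ASCII text needs extra care.
    $doc->loadHTML('<div>' . $descriptionHtml . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img  = $doc->getElementsByTagName('img')->item(0);
    $body = $doc->getElementsByTagName('body')->item(0);

    return [
        'img'  => $img !== null ? $img->getAttribute('src') : null,
        'desc' => $body !== null ? trim($body->textContent) : '',
    ];
}

$sample = '<img border="0" hspace="10" align="left" '
    . 'style="margin-top:3px;margin-right:5px;" '
    . 'src="http://timesofindia.indiatimes.com/photo/20984744.cms" />'
    . 'Nine food combinations that will make staying healthy and looking fit easier';

$parts = splitDescription($sample);
echo $parts['img'], "\n", $parts['desc'], "\n";
```

The trade-off is a little more overhead per item in exchange for not depending on the exact attribute order or quoting inside the description.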

A regexp to retrieve either og:url meta or link rel="canonical"

I'm trying to write a script to scrape the canonical URL from a remote URL.
I'm not a professional developer, so if something is ugly in my code, any explanation would (and will) be appreciated.
What I'm trying to do is either look for:
<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />
<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />
... and extract the URL out of it.
My code so far :
$content = file_get_contents($url);
$content = strtolower($content);
$content = preg_replace("'<style[^>]*>.*</style>'siU",'',$content); // strip css
$content = preg_replace("'<script[^>]*>.*</script>'siU",'',$content); // strip js
$split = explode("\n",$content); // Separate each line
foreach ($split as $k => $v) // For each line
{
if (strpos(' '.$v,'<meta') || strpos(' '.$v,'<link')) // If contains a <meta or <link
{
// Check with regex and if found, return what I need (the URL)
}
}
return $split_content;
I've been fighting with regex for hours, trying to figure out how to do this, but it seems it's well above my knowledge.
Would someone know how I need to define this rule?
Plus, does my script seem okay to you, or is there room for improvement?
Thanks a bunch!
Using DOMDocument this is how you can get the property and content
$html = '<meta property="og:url" content="http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('meta') as $meta) {
if ($meta->hasAttributes()) {
foreach ($meta->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output ::
Array
(
[property] => og:url
[content] => http://www.my-canonical-url.com/is-here-and-look-no-dynamic-parameters-186.html
)
You can get the same for the second URL:
$html = '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
$attr = array();
foreach ($dom->getElementsByTagName('link') as $link) {
if ($link->hasAttributes()) {
foreach ($link->attributes as $attribute) {
$attr[$attribute->nodeName] = $attribute->nodeValue;
}
}
}
print_r($attr);
Output ::
Array
(
[rel] => canonical
[href] => http://www.another-canonical-url.com/is-here
)
Consider using DOMDocument: simply load your HTML into the DOMDocument object, use getElementsByTagName, and then loop over the results until one of them has the right attributes, as if you were writing JavaScript.
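The two lookups can be combined so that one call returns either value from a full page. The sketch below is an illustration only: findCanonicalUrl is a hypothetical name, and preferring link rel="canonical" over og:url is an assumed ordering, not something the question specifies. It also avoids calling strtolower on the whole document, since that would lowercase the URL itself.

```php
<?php
// Hypothetical helper: returns the canonical URL of an HTML page, or null if none is found.
function findCanonicalUrl(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumed preference order: <link rel="canonical"> first, then og:url.
    $queries = [
        '//link[@rel="canonical"]/@href',
        '//meta[@property="og:url"]/@content',
    ];
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            return $nodes->item(0)->nodeValue;
        }
    }
    return null;
}

$page = '<html><head>'
    . '<meta property="og:url" content="http://www.my-canonical-url.com/page.html" />'
    . '<link rel="canonical" href="http://www.another-canonical-url.com/is-here" />'
    . '</head><body></body></html>';

echo findCanonicalUrl($page), "\n";
```

In practice $page would come from file_get_contents($url); the XPath expressions make the line-by-line splitting in the question unnecessary.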

How do I obtain the canonical value using PHP DomDocument?

<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />
I need to get the canonical href value using Dom. How do I do this?
There are multiple ways to do this.
Using XML:
<?php
$html = "<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />";
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
print_r($attr);
?>
which outputs:
SimpleXMLElement Object
(
[#attributes] => Array
(
[rel] => canonical
[href] => http://test.com/asdfsdf/sdf/
)
)
or, using Dom:
<?php
$html = "<link rel='canonical' href='http://test.com/asdfsdf/sdf/' />";
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('link');
foreach ($nodes as $node)
{
if ($node->getAttribute('rel') === 'canonical')
{
echo($node->getAttribute('href'));
}
}
?>
which outputs:
http://test.com/asdfsdf/sdf/
In both examples, more code is required if you're parsing an entire HTML file, but they demonstrate most of the structure that you'll need.
Code modified from this answer and the documentation on Dom.

Accessing SimpleXMLElements within an array

I've had a look round similar articles such as this one and I can't quite get it to work, quite possible I'm just misunderstanding.
I've got a simple script which parses a bit of xml and prints out specific fields - what I'm having trouble doing is accessing the data of SimpleXMLElement Objects.
XML (simplified for clarity)
<channel>
<item>
<title><![CDATA[Title is in here ...]]></title>
<description>Our description is in here!</description>
</item>
</channel>
PHP
$url = "file.xml";
$xml = simplexml_load_file($url, 'SimpleXMLElement', LIBXML_NOCDATA);
foreach ($xml->channel->item as $item) {
$articles = array();
$articles['title'] = $item->title;
$articles['description'] = $item->description;
}
Up to this point, everything seems ok. I end up with an array of content which I can confirm with print_r, this is what I get back:
Array
(
[title] => SimpleXMLElement Object
(
[0] => Title is in here ...
)
[description] => SimpleXMLElement Object
(
[0] => Our description is in here!
)
)
The key question
How do I then access [title][0] or [description][0]?
I've tried a couple of variations with no success, most likely a rookie error somewhere!
foreach ($articles as $article) {
echo $article->title;
}
and
foreach ($articles as $article) {
echo $article['title'][0];
}
and
foreach ($articles as $article) {
echo $article['title'];
}
If you really don't want to simply pass the SimpleXMLelement but put the values in an array first....
<?php
// $xml = simplexml_load_file($url, 'SimpleXMLElement', LIBXML_NOCDATA);
$xml = getData();
$articles = array();
foreach ($xml->channel->item as $item) {
// with (string)$item->title you get rid of the SimpleXMLElements and store plain strings
// but you could also keep the SimpleXMLElements here - the output is still the same.
$articles[] = array(
'title'=>(string)$item->title,
'description'=>(string)$item->description
);
}
// ....
foreach( $articles as $a ) {
echo $a['title'], ' - ', $a['description'], "\n";
}
function getData() {
return new SimpleXMLElement('<foo><channel>
<item>
<title><![CDATA[Title1 is in here ...]]></title>
<description>Our description1 is in here!</description>
</item>
<item>
<title><![CDATA[Title2 is in here ...]]></title>
<description>Our description2 is in here!</description>
</item>
</channel></foo>');
}
prints
Title1 is in here ... - Our description1 is in here!
Title2 is in here ... - Our description2 is in here!
I think you have an error when you assign values to the array:
foreach ($xml->channel->item as $item) {
$articles = array();
$articles['title'] = $item->title;
$articles['description'] = $item->description;
}
Because $articles = array(); runs on every iteration of the foreach, each pass throws away the previous item. Create the array once before the loop and append to it:
$articles = array();
foreach ($xml->channel->item as $item) {
$articles[] = array(
'title' => $item->title,
'description' => $item->description,
);
}
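To tie the two answers together: each stored value is a SimpleXMLElement that converts to its text when echoed or cast, so there is no need to reach for a [0] index. A minimal sketch, using an inline document shaped like the question's XML (the rss wrapper element is an assumption so that $xml->channel resolves):

```php
<?php
// Inline sample mirroring the question's structure; the <rss> root is assumed.
$xml = new SimpleXMLElement(
    '<rss><channel><item>'
    . '<title><![CDATA[Title is in here ...]]></title>'
    . '<description>Our description is in here!</description>'
    . '</item></channel></rss>'
);

$articles = [];
foreach ($xml->channel->item as $item) {
    // Stored as SimpleXMLElement objects, exactly as in the question.
    $articles[] = ['title' => $item->title, 'description' => $item->description];
}

foreach ($articles as $article) {
    // echo converts the element to its text content.
    echo $article['title'], ' - ', $article['description'], "\n";
    // An explicit cast yields a plain string, e.g. for comparisons or array keys.
    $title = (string) $article['title'];
}
```

Casting at the point of storage, as the first answer does, is usually the safer habit, because the array no longer holds references into the parsed document.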

reading twitter's rss search feed with simple xml

Having some trouble selecting some nodes in the rss feed for twitter's search
the rss url is here
http://search.twitter.com/search.rss?q=twitfile
each item looks like this
<item>
<title>RT #TwittBoy: TwitFile - Comparte tus archivos en Twitter (hasta 200Mb) http://bit.ly/xYNsM</title>
<link>http://twitter.com/MarielaCelita/statuses/5990165590</link>
<description>RT <a href="http://twitter.com/TwittBoy">#TwittBoy</a>: <b>TwitFile</b> - Comparte tus archivos en Twitter (hasta 200Mb) <a href="http://bit.ly/xYNsM">http://bit.ly/xYNsM</a></description>
<pubDate>Mon, 23 Nov 2009 22:45:39 +0000</pubDate>
<guid>http://twitter.com/MarielaCelita/statuses/5990165590</guid>
<author>MarielaCelita#twitter.com (M.Celita Lijerón)</author>
<media:content type="image/jpg" width="48" height="48" url="http://a3.twimg.com/profile_images/537676869/orkut_normal.jpg"/>
<google:image_link>http://a3.twimg.com/profile_images/537676869/orkut_normal.jpg</google:image_link>
</item>
My php is below
foreach ($twitter_xml->channel->item as $key) {
$screenname = $key->{"author"};
$date = $key->{"pubDate"};
$profimg = $key->{"google:image_link"};
$link = $key->{"link"};
$title = $key->{"title"};
echo"
<li>
<h5><a href=$link>$author</a></h5>
<p class=info><a href=$link>$title</a></p>
</li>
";
}
The problem is that nothing from the RSS feed is being echoed: if there are 20 results, it loops 20 times, just with no data.
In the code, $screenname is assigned a value but you are echoing $author.
To get elements within namespaces like google:image_link ,you will have to do this:
$g = $key->children("http://base.google.com/ns/1.0");
$profimg = $g->{"image_link"};
If you are wondering where I got "http://base.google.com/ns/1.0" from, the namespace is declared in the second line of the RSS feed.
$url="http://search.twitter.com/search.rss?q=twitfile";
$twitter_xml = simplexml_load_file($url);
foreach ($twitter_xml->channel->item as $key) {
$author = $key->{"author"};
$date = $key->{"pubDate"};
$link = $key->{"link"};
$title = $key->{"title"};
$g = $key->children("http://base.google.com/ns/1.0");
$profimg = $g->{"image_link"};
echo"
<li>
<h5><a href=$link>$author</a></h5>
<p class=info><a href=$link>$title</a></p>
</li>
";
$xml = $twitter_xml;
}
This code works.
Set error_reporting(E_ALL); and you'll see that $author isn't defined.
You can't access <google:image_link/> this way, you'll have to use XPath or children()
$key->children("google", true)->image_link;
If you use SimpleDOM, there's a shortcut that returns the first element of an XPath result:
$key->firstOf("google:image_link");
if (!$xml = simplexml_load_file('http://search.twitter.com/search.atom?q='.urlencode ($terms)))
{
throw new RuntimeException('Unable to load or parse search results feed');
}
if (!count($entries = $xml->entry))
{
throw new RuntimeException('No entry found');
}
for($i=0;$i<count($entries);$i++)
{
$title[$i] = $entries[$i]->title;
// etc.: continue with description and the other fields
}
I made this and it works :)) $sea_name is the keyword you're looking for...
<?php
function twitter_feed( $sea_name ){
$endpoint = 'http://search.twitter.com/search.rss?q='.urlencode($sea_name); // URL to call
$resp = simplexml_load_file($endpoint);
// Check to see if the response was loaded, else print an error
if ($resp) {
$results = '';
$counter=0;
// If the response was loaded, parse it and build links
foreach($resp->channel->item as $item) {
//var_dump($item);
preg_match("/\((.*?)\)/", $item->author, $blah);
$content = $item->children("http://search.yahoo.com/mrss/" );
$imageUrl = getXmlAttribute( $content, "url" );
echo '
<div class="twitter-item">
<img src="'.$imageUrl.'" />
<span class="twit">'.$blah[1].'</span><br />
<span class="twit-content">'.$item->title.'</span>
<br style="clear:both; line-height:0;margin:0;padding:0;">
</div>';
$counter++;
}
}
// If there was no response, print an error
else {
$results = "Oops! Must not have gotten the response!";
}
echo $results;
}
function getXmlAttribute( SimpleXMLElement $xmlElement, $attribute ) {
foreach( $xmlElement->attributes() as $name => $value ) {
if( $name == $attribute ) {
return (string)$value;
}
}
}
?>
The object will contain something like:
<!-- SimpleXMLElement Object
(
[title] => Before I go to bed, I just want to say I've just seen Peter Kay's CIN cartoon video for the 1st time... one word... WOW.
[link] => http://twitter.com/Alex_Segal/statuses/5993710015
[description] => Before I go to bed, I just want to say I&apos;ve just seen <b>Peter</b> <b>Kay</b>&apos;s CIN cartoon video for the 1st time... one word... WOW.
[pubDate] => Tue, 24 Nov 2009 01:00:00 +0000
[guid] => http://twitter.com/Alex_Segal/statuses/5993710015
[author] => Alex_Segal#twitter.com (Alex Segal)
)
-->
You can use any of these inside the foreach loop and echo them, such as $item->author, $item->link, etc. For any other attributes you can use the getXmlAttribute function.
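The namespace handling discussed in these answers can be tried offline. The sketch below is illustrative only: the inline feed is invented sample data (the Twitter search feed itself is not assumed to be reachable), and it reads the namespace URIs from the document rather than hard-coding them.

```php
<?php
// Invented sample item carrying the same two namespaces as the feed in the question.
$feed = <<<XML
<rss version="2.0"
     xmlns:google="http://base.google.com/ns/1.0"
     xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample tweet</title>
      <author>someone@example.com (Some One)</author>
      <media:content type="image/jpg" width="48" height="48" url="http://example.com/avatar.jpg"/>
      <google:image_link>http://example.com/avatar.jpg</google:image_link>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);
// Map of prefix => namespace URI as declared on the root element.
$namespaces = $xml->getDocNamespaces();

foreach ($xml->channel->item as $item) {
    // Two equivalent ways to reach <google:image_link>: by URI or by prefix.
    $byUri    = (string) $item->children($namespaces['google'])->image_link;
    $byPrefix = (string) $item->children('google', true)->image_link;
    // <media:content> is namespaced, but its url attribute is not.
    $avatar   = (string) $item->children('media', true)->content->attributes()->url;
    // Pull the display name out of "address (Name)".
    $name = preg_match('/\((.*?)\)/', (string) $item->author, $m) ? $m[1] : '';

    echo $name, ': ', $byUri, "\n";
}
```

Looking elements up by URI is the more robust choice, since a feed is free to change its prefixes while keeping the same namespaces.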
