I am having trouble accessing a portion of an RSS feed. I have read the questions "Accessing date as XML node in PHP" and "Problem getting author from a wordpress RSS feed using SimpleXmlElement", but I still cannot get this to work.
<item>
<title>This is the title</title>
<dc:creator>!an (#myHandle)</dc:creator>
<description><![CDATA[<p class="TweetTextSize js-tweet-text tweet-text" lang="en">What is this tweeet</p>]]></description>
<pubDate>Tue, 03 Jan 2017 22:01:54 +0000</pubDate>
<guid>1234567</guid>
<link>1234567</link>
<twitter:source/>
<twitter:place/>
</item>
This is the portion of the RSS I am getting. Here is the full RSS: https://twitrss.me/twitter_search_to_rss/?term=test+this+feed
I am trying to access this with PHP code like so:
$feed = new DOMDocument();
$feed->load('https://twitrss.me/twitter_search_to_rss/?term=test+this+feed');
foreach ($feed->getElementsByTagName('channel')->item(0) as $item) {
$ns_dc = $item->children('http://purl.org/dc/elements/1.1/');
echo $ns_dc->creator;
}
but I keep getting this error:
Invalid argument supplied for foreach()
I have tried changing it to:
$feed->getElementsByTagName('channel')->item(0)
but then nothing outputs.
I also tried:
$feed->getElementsByTagName('channel')
but then get the error:
Call to undefined method DOMElement::children()
Question
Please can someone tell me, using PHP how to access the dc:creator tag and assign that value to a variable?
Edit
$i = 1;
foreach ($feed->channel as $row) {
echo $i;
foreach ($row->children() as $key => $val) {
if ($key == "item") {
if ($i == 1) {
$ns_dc = $val->children('http://purl.org/dc/elements/1.1/');
echo $ns_dc->creator;
}
}
}
$i = $i + 1;
}
Since you are dealing with XML content, I found it easier to use SimpleXML instead of DOMDocument. Each <item> is a direct child of <channel>, so you can loop over the children of channel and pick out the ones whose key is item.
$feed = simplexml_load_file('https://twitrss.me/twitter_search_to_rss/?term=test+this+feed');
foreach ($feed->channel as $row) {
    foreach ($row->children() as $key => $val) {
        if ($key == "item") {
            $ns_dc = $val->children('http://purl.org/dc/elements/1.1/');
            echo $ns_dc->creator;
        }
    }
}
This code outputs the "creator" string you expect.
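If you would rather stay with DOMDocument, as in the question, the same value can probably be reached with getElementsByTagNameNS(). This is only a minimal sketch: the inline $rss string is made-up sample data shaped like the <item> in the question, and the $creators array is just an illustrative name for "assign that value to a variable".

```php
<?php
// Hypothetical sample data standing in for the live feed.
$rss = <<<'XML'
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>This is the title</title>
      <dc:creator>!an (#myHandle)</dc:creator>
    </item>
    <item>
      <title>Second title</title>
      <dc:creator>someone (#otherHandle)</dc:creator>
    </item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss); // with the real feed: $feed->load($url);

// Namespace URI that the dc: prefix is bound to in the feed.
$dcNs = 'http://purl.org/dc/elements/1.1/';

$creators = [];
// A DOMNodeList can be iterated directly; a single DOMElement cannot,
// which is why foreach over ->item(0) fails in the question.
foreach ($feed->getElementsByTagName('item') as $item) {
    $nodes = $item->getElementsByTagNameNS($dcNs, 'creator');
    if ($nodes->length > 0) {
        $creators[] = $nodes->item(0)->textContent;
    }
}

print_r($creators);
```

Note that children() is a SimpleXML method, so it is not available on DOMElement; that is where the "Call to undefined method" error comes from.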
Related
Here is the XML I am parsing:
https://seekingalpha.com/api/sa/combined/AAPL.xml
When I grab and parse the XML with simplexml_load_file($url) and then do a var_dump on that, it shows that the only children of every "item" are "title", "link", "guid", and "pubDate."
I am trying to access the node "sa:author_name." Why isn't it a child of "item"? Maybe I am misunderstanding something about how XML files are structured. Help me my children are missing lol
To get the data in sa:author_name you have to use the namespace https://seekingalpha.com/api/1.0.
You can for example use a foreach and loop the children using the namespace.
$url = "https://seekingalpha.com/api/sa/combined/AAPL.xml";
$xml = simplexml_load_file($url);
foreach ($xml->channel->item as $item) {
    foreach ($item->children("https://seekingalpha.com/api/1.0") as $child) {
        if ($child->getName() === "author_name") {
            echo $child . "<br>";
        }
    }
}
Another way you could do it is using an xpath expression:
$authorNames = $xml->xpath('/rss/channel/item/sa:author_name');
foreach ($authorNames as $authorName) {
echo $authorName . "<br>";
}
Which will result in:
Yoel Minkoff
DoctoRx
SA Transcripts
Bill Maurer
etc..
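For reference, here is a small offline sketch of both approaches. It assumes a cut-down feed (the $rss string and the author names in it are invented), and it registers the namespace explicitly under a prefix of our own choosing (s), which should keep the XPath working even if the feed's own prefix ever changes.

```php
<?php
// Hypothetical, shortened feed; only the namespace URI is taken from the answer above.
$rss = <<<'XML'
<rss version="2.0" xmlns:sa="https://seekingalpha.com/api/1.0">
  <channel>
    <item>
      <title>First article</title>
      <sa:author_name>Author One</sa:author_name>
    </item>
    <item>
      <title>Second article</title>
      <sa:author_name>Author Two</sa:author_name>
    </item>
  </channel>
</rss>
XML;

$saNs = 'https://seekingalpha.com/api/1.0';
$xml = simplexml_load_string($rss);

// 1) Property access on the namespaced children of each item.
$direct = [];
foreach ($xml->channel->item as $item) {
    $direct[] = (string) $item->children($saNs)->author_name;
}

// 2) XPath, with the namespace bound to a prefix we pick ourselves.
$xml->registerXPathNamespace('s', $saNs);
$viaXpath = [];
foreach ($xml->xpath('/rss/channel/item/s:author_name') as $node) {
    $viaXpath[] = (string) $node;
}

print_r($direct);
print_r($viaXpath);
```

The (string) casts matter if you want to store the values: without them you keep SimpleXMLElement objects rather than plain strings.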
I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
    echo "<h2>Reddit Items</h2><hr><br><br>";
    //$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
    $feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
    $xml = simplexml_load_file($feedURL);
    //define each xml entry from reddit as an item
    foreach ($xml->entry as $item) {
        foreach ($item->content as $content) {
            $newContent = (string)$content;
            $html = str_get_html($newContent);
            foreach ($html->find('table') as $table) {
                $links = $table->find('span', '0');
                //echo $links;
                foreach ($links->find('a') as $link) {
                    echo $link->href;
                }
            }
        }
    }
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts the YouTube link from the content of each entry.
function extract_youtube_link($xml) {
    $entries = $xml['entry'];
    $videos = [];
    foreach ($entries as $entry) {
        $content = html_entity_decode($entry['content']);
        // [^"]* stops at the closing quote, so only the href of the [link] anchor is captured
        preg_match_all('/<span><a href="([^"]*)">\[link\]/', $content, $matches);
        if (!empty($matches[1][0])) {
            $videos[] = array(
                'entry_title' => $entry['title'],
                // strip the leading "/u/" from the author name
                'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
                'author_reddit_url' => $entry['author']['uri'],
                'video_url' => $matches[1][0]
            );
        }
    }
    return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The function returns a multidimensional array in which each element has the keys entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element, you don't need to parse the whole thing. One way of doing it is to use the DOMXPath class and query the XML directly. The documentation should guide you through:
http://php.net/manual/es/class.domxpath.php
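A rough sketch of that DOMXPath idea, under a few assumptions: the $contents strings are invented HTML fragments shaped like the table described in the question (the real feed markup may differ), and extractLinkHrefs() is just an illustrative helper name. It selects the anchor whose text is [link] instead of relying on element positions, collects the hrefs in an array, and then picks one at random, which is what the update in the question was after.

```php
<?php
// Hypothetical <content> payloads, already cast to strings.
$contents = [
    '<table><tr><td><a href="https://www.reddit.com/r/videos/comments/abc/"><img src="t1.jpg"/></a></td>'
        . '<td>submitted by <a href="https://www.reddit.com/user/someone">/u/someone</a><br/>'
        . '<span><a href="https://www.youtube.com/watch?v=EXAMPLE1">[link]</a></span> '
        . '<span><a href="https://www.reddit.com/r/videos/comments/abc/">[comments]</a></span></td></tr></table>',
    '<table><tr><td>submitted by <a href="https://www.reddit.com/user/other">/u/other</a><br/>'
        . '<span><a href="https://example.com/video2">[link]</a></span> '
        . '<span><a href="https://www.reddit.com/r/videos/comments/def/">[comments]</a></span></td></tr></table>',
];

// Returns the href of every <a> inside a <span> whose text is exactly "[link]".
function extractLinkHrefs(string $html): array
{
    // Silence libxml warnings about fragment markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//span/a[normalize-space(.) = "[link]"]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

$videoUrls = [];
foreach ($contents as $content) {
    $videoUrls = array_merge($videoUrls, extractLinkHrefs($content));
}

print_r($videoUrls);

// Returning the array from a function and picking from it avoids a global.
$randomUrl = $videoUrls[array_rand($videoUrls)];
echo $randomUrl, PHP_EOL;
```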
I want to scrape the post titles of a blog, and I wrote the code below. I am stuck figuring out how to loop through every page.
$dom = file_get_html('http://demos.appthemes.com/clipper/');
scrape('http://demos.appthemes.com/clipper/');
function scrape($URL)
{
    $dom = file_get_html($URL);
    foreach ($dom->find('.item-frame h1 a') as $items) {
        $item = array('courseTitle' => $items->text());
        var_dump($item);
    }
}
for ($pages = 0; $pages < 3; $pages++) {
    if ($next = $dom->find('a[class=page]', $pages)) {
        $URL = $next->href;
        $dom->clear();
        unset($dom);
        scrape($URL);
    }
}
Partial results did appear, but the script then stopped with the error: Undefined variable: dom in on line 23
unset($dom); destroys the $dom variable, so on the second loop iteration ($pages == 1) the call to $dom->find() fails with the "Undefined variable" error.
I did not fully follow the logic of the loop, but try removing the $dom->clear(); and unset($dom); lines.
Hope it helps.
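Another way to sidestep the problem might be to collect all the pagination links from the first page before fetching anything else, so the loop never depends on a document that has already been released. The sketch below is an assumption-heavy illustration: it uses DOMDocument/DOMXPath instead of Simple HTML DOM so that it runs without the library, fetchPage() is a hypothetical stand-in for file_get_html() that serves tiny inline pages, and the example.test URLs and coupon titles are invented.

```php
<?php
// Hypothetical stand-in for file_get_html(): maps URLs to small inline pages.
function fetchPage(string $url): DOMXPath
{
    $pages = [
        'http://example.test/' =>
            '<div class="item-frame"><h1><a href="/c1">Coupon A</a></h1></div>'
            . '<a class="page" href="http://example.test/page/2/">2</a>'
            . '<a class="page" href="http://example.test/page/3/">3</a>',
        'http://example.test/page/2/' =>
            '<div class="item-frame"><h1><a href="/c2">Coupon B</a></h1></div>',
        'http://example.test/page/3/' =>
            '<div class="item-frame"><h1><a href="/c3">Coupon C</a></h1></div>',
    ];

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($pages[$url]);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($dom);
}

// XPath equivalent of the '.item-frame h1 a' selector from the question.
function scrapeTitles(DOMXPath $xpath): array
{
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " item-frame ")]//h1/a';
    $titles = [];
    foreach ($xpath->query($query) as $a) {
        $titles[] = trim($a->textContent);
    }
    return $titles;
}

$first = fetchPage('http://example.test/');

// Step 1: read every pagination href while the first page is still loaded.
$pageUrls = [];
foreach ($first->query('//a[@class="page"]') as $link) {
    $pageUrls[] = $link->getAttribute('href');
}

// Step 2: scrape the first page, then each collected page in turn.
$allTitles = scrapeTitles($first);
foreach ($pageUrls as $url) {
    $allTitles = array_merge($allTitles, scrapeTitles(fetchPage($url)));
}

print_r($allTitles);
```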
I am parsing the following RSS feed (relevant part shown)
<item>
<title>xxx</title>
<link>xxx</link>
<guid>xxx</guid>
<description>xxx</description>
<prx:proxy>
<prx:ip>101.226.74.168</prx:ip>
<prx:port>8080</prx:port>
<prx:type>Anonymous</prx:type>
<prx:ssl>false</prx:ssl>
<prx:check_timestamp>1369199066</prx:check_timestamp>
<prx:country_code>CN</prx:country_code>
<prx:latency>20585</prx:latency>
<prx:reliability>9593</prx:reliability>
</prx:proxy>
<prx:proxy>...</prx:proxy>
<prx:proxy>...</prx:proxy>
<pubDate>xxx</pubDate>
</item>
<item>...</item>
<item>...</item>
<item>...</item>
Using the php code
$proxylist_rss = file_get_contents('http://www.xxx.com/xxx.xml');
$proxylist_xml = new SimpleXmlElement($proxylist_rss);
foreach($proxylist_xml->channel->item as $item) {
var_dump($item); // Ok, Everything marked with xxx
var_dump($item->title); // Ok, title
foreach($item->proxy() as $entry) {
var_dump($entry); //empty
}
}
While I can access everything marked with xxx, I cannot access anything inside prx:proxy, mainly because : cannot appear in a valid PHP variable name.
The question is how to reach prx:ip, for example.
Thanks!
Take a look at SimpleXMLElement::children, you can access the namespaced elements with that.
For example: -
<?php
$xml = '<xml xmlns:prx="http://example.org/">
<item>
<title>xxx</title>
<link>xxx</link>
<guid>xxx</guid>
<description>xxx</description>
<prx:proxy>
<prx:ip>101.226.74.168</prx:ip>
<prx:port>8080</prx:port>
<prx:type>Anonymous</prx:type>
<prx:ssl>false</prx:ssl>
<prx:check_timestamp>1369199066</prx:check_timestamp>
<prx:country_code>CN</prx:country_code>
<prx:latency>20585</prx:latency>
<prx:reliability>9593</prx:reliability>
</prx:proxy>
</item>
</xml>';
$sxe = new SimpleXMLElement($xml);
foreach($sxe->item as $item)
{
    // 'prx' is treated as a prefix because the second argument is true
    $proxy = $item->children('prx', true)->proxy;
    echo $proxy->ip; //101.226.74.168
}
Anthony.
I would just strip out the "prx:"...
$proxylist_rss = file_get_contents('http://www.xxx.com/xxx.xml');
$proxylist_rss = str_replace('prx:', '', $proxylist_rss);
$proxylist_xml = new SimpleXmlElement($proxylist_rss);
foreach($proxylist_xml->channel->item as $item) {
foreach($item->proxy as $entry) {
var_dump($entry);
}
}
http://phpfiddle.org/main/code/jsz-vga
Try it like this:
$proxylist_rss = file_get_contents('http://www.xxx.com/xxx.xml');
$feed = simplexml_load_string($proxylist_rss);
$ns = $feed->getNamespaces(true);
foreach ($feed->channel->item as $item) {
    var_dump($item);
    var_dump($item->title);
    $proxy = $item->children($ns["prx"]);
    $proxy = $proxy->proxy;
    foreach ($proxy as $key => $value) {
        var_dump($value);
    }
}
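If the goal is to end up with plain PHP values rather than var_dump output, something along these lines may help. It is a sketch only: the $rss string is a trimmed, partly invented version of the feed in the question, the namespace URI http://example.org/ is a placeholder (the real feed will declare its own), and the $proxies array layout is my own choice.

```php
<?php
// Hypothetical trimmed feed; the second proxy and the namespace URI are made up.
$rss = <<<'XML'
<rss version="2.0" xmlns:prx="http://example.org/">
  <channel>
    <item>
      <title>xxx</title>
      <prx:proxy>
        <prx:ip>101.226.74.168</prx:ip>
        <prx:port>8080</prx:port>
        <prx:ssl>false</prx:ssl>
      </prx:proxy>
      <prx:proxy>
        <prx:ip>192.0.2.10</prx:ip>
        <prx:port>3128</prx:port>
        <prx:ssl>true</prx:ssl>
      </prx:proxy>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($rss);
$ns = $feed->getNamespaces(true); // prefix => URI for every namespace in the document

$proxies = [];
foreach ($feed->channel->item as $item) {
    // Every prx:proxy inside this item, not just the first one.
    foreach ($item->children($ns['prx'])->proxy as $proxy) {
        $fields = $proxy->children($ns['prx']);
        $proxies[] = [
            'ip'   => (string) $fields->ip,
            'port' => (int) $fields->port,
            'ssl'  => ((string) $fields->ssl) === 'true',
        ];
    }
}

print_r($proxies);
```

Casting each field explicitly avoids carrying SimpleXMLElement objects around, which can behave unexpectedly in comparisons or when serialized.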
I am using Simple HTML DOM Parser. When I request a page with file_get_html(), I get two values: a title and a URL. For each of those URLs I want to call file_get_html() again.
But the second request returns the same data as the first one.
Here is the script:
foreach ($urls as $value) {
    $html = file_get_html($value);
    foreach ($html->find('div[class=data] a') as $content) {
        $url2 = 'http://abc.com/' . $content->href;
        $childHtml = file_get_html($url2);
        echo $childHtml; // Problem is here: I am getting the previous page's HTML
    }
}
What am I doing wrong here?
This is the main crawling code
$urls = GenerateURLS($currentmonth);
$tracker = 0;
$urlHolderArray = array();
foreach ($urls as $value) {
    $html = file_get_html($value); //Here I am requesting the html dom
    foreach ($html->find('div[id=centrepanel] div[class=events_listing_container] div[class=events_info_container] div[class=events_image] a') as $content) {
        $proxyURL = "http://www.junkclub.com/" . $content->href;
        array_push($urlHolderArray, $proxyURL);
    }
}
echo '<pre>';
print_r($urlHolderArray);
echo '</pre>';
foreach ($urlHolderArray as $link) {
    $htmlCon = file_get_html($link);
}
echo $htmlCon;