I want to parse the Google News RSS feed with PHP. I managed to run this code:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
foreach($news->channel->item as $item) {
echo "<strong>" . $item->title . "</strong><br />";
echo strip_tags($item->description) ."<br /><br />";
}
?>
However, I'm unable to solve the following problems:
How can I get the hyperlink of the news title?
Each Google News item has many related news links in its footer, and my code above includes them too. How can I remove those from the description?
How can I also get the image for each news item? (Google displays a thumbnail image for each one.)
Thanks.
There we go, just what you need for your particular situation:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feeds = array();
$i = 0;
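foreach ($news->channel->item as $item)
{
// Cast once so the string functions below receive a plain string
$description = (string) $item->description;
preg_match('#src="([^"]+)"#', $description, $match);
$parts = explode('<font size="-1">', $description);
$feeds[$i]['title'] = (string) $item->title;
$feeds[$i]['link'] = (string) $item->link;
// Fall back to an empty string when a piece is missing from the description
$feeds[$i]['image'] = isset($match[1]) ? $match[1] : '';
$feeds[$i]['site_title'] = isset($parts[1]) ? strip_tags($parts[1]) : '';
$feeds[$i]['story'] = isset($parts[2]) ? strip_tags($parts[2]) : '';
$i++;
}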
echo '<pre>';
print_r($feeds);
echo '</pre>';
?>
And the output should look like this:
[2] => Array
(
[title] => Los Alamos Nuclear Lab Under Siege From Wildfire - ABC News
[link] => http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGxBe4YsZArH0kSwEjq_zDm_h-N4A&url=http://abcnews.go.com/Technology/wireStory?id%3D13951623
[image] => http://nt2.ggpht.com/news/tbn/OhH43xORRwiW1M/6.jpg
[site_title] => ABC News
[story] => A wildfire burning near the desert birthplace of the atomic bomb advanced on the Los Alamos laboratory and thousands of outdoor drums of plutonium-contaminated waste Tuesday as authorities stepped up ...
)
I'd recommend checking out SimplePie. I've used it for several different projects and it works great (and abstracts away all of the headache you're currently dealing with).
Now, if you're writing this code simply because you want to learn how to do it, you should probably ignore this answer. :)
To get the URL for a news item, use $item->link.
If there's a common delimiter for the related news links, you could use regex to cut off everything after it.
Google puts the thumbnail image HTML code inside the description field of the feed. You could regex out everything between the open and close brackets for the image declaration to get the HTML for it.
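The last two suggestions can be tried without touching the network. Below is a minimal sketch, assuming the description HTML looks roughly like the hard-coded sample; the sample markup, the example.com URLs, and the function names extractThumbnail and extractStory are made up for illustration, and the real feed markup may differ.

```php
<?php
// Hypothetical description HTML, shaped like what the answers above describe.
$description = '<table><tr><td><img src="http://example.com/thumb.jpg" /></td>'
    . '<td><font size="-1">Example Site</font><br />'
    . '<font size="-1">Story text here ...</font><br />'
    . '<font size="-1"><a href="http://example.com/related">Related story</a></font>'
    . '</td></tr></table>';

// Return the src of the first <img> in an HTML fragment, or null if there is none.
function extractThumbnail(string $html): ?string
{
    return preg_match('#<img[^>]+src="([^"]+)"#i', $html, $m) ? $m[1] : null;
}

// Split on a common delimiter and keep only the third piece (the story text),
// which leaves the related links that follow it out of the result.
function extractStory(string $html, string $delimiter = '<font size="-1">'): string
{
    $parts = explode($delimiter, $html);
    return isset($parts[2]) ? trim(strip_tags($parts[2])) : '';
}

$image = extractThumbnail($description);
$story = extractStory($description);

echo $image, "\n", $story, "\n";
```

Both helpers return an empty value instead of raising a notice when the expected markup is missing, which matters because the feed format is outside your control.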
Why is this code able to fetch data from the first page below and insert it into numbered arrays, while it fails to do the same for the second page?
http://nimishprabhu.com
https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php
The page shows arrays numbered like the following, which is not correct:
Array ( [0] => mailto:support@fiverr.com )
Array ( [0] => https://collector.fiverr.com/api/v1/collector/noScript.gif?appId=PXK3bezZfO
[1] => https://collector.fiverr.com/api/v1/collector/pxPixel.gif?appId=PXK3bezZfO )
Array ( [0] => One Small Step )
Code:
<?php
/*
2.
FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every image on a webpage or say, each
and every hyperlink.
We will be using “find” function to extract this information from the
object. Doing it using Simple HTML DOM Parser :
*/
include('simple_html_dom.php');
$html = file_get_html('https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php');
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
echo "<br />";
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
$images[] = $img->src;
}
print_r($images);
echo "<br />";
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
Any suggestions and code samples are welcome, as this is for learning purposes.
I am a self-taught student.
The reason is that the page you are trying to download (fiverr.com) is JavaScript-based and loads its content dynamically. This will not work with a plain PHP scraper, because PHP only sees the HTML that the server sent; it does not parse or run JavaScript. Because this is for learning purposes, you can simply try a different website.
However, if you want a working solution, you should look into Selenium. It is a browser automation tool: through its WebDriver it controls a real browser (which can run headless), so the page's JavaScript runs as it would for a normal visitor. That lets you parse fully rendered pages from websites like fiverr.com.
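To see why a server-side parser comes back nearly empty on such pages, here is a minimal sketch using PHP's built-in DOM extension. The HTML string is an invented stand-in for a JavaScript-driven page, not Fiverr's actual response.

```php
<?php
// Hypothetical response body: the server sends only an empty container,
// and the script would add the links later, inside a browser.
$serverHtml = '<!DOCTYPE html><html><head><title>Search</title></head><body>'
    . '<div id="app"></div>'
    . '<script>'
    . 'var a = document.createElement("a");'
    . 'a.href = "/gig/1";'
    . 'a.textContent = "Gig";'
    . 'document.getElementById("app").appendChild(a);'
    . '</script>'
    . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($serverHtml);
libxml_clear_errors();

// The parser never executes the script, so the link it would create is absent.
$linkCount = $doc->getElementsByTagName('a')->length;
$divCount = $doc->getElementsByTagName('div')->length;

echo "links: $linkCount, divs: $divCount\n";
```

The container div is found, but the link that only exists after the script runs is not, which matches the near-empty arrays in the question.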
I have created a script that colors every other word in a paragraph green, which works. However, the original paragraph also appears above the new paragraph, which I do not want.
The solution may be simple, but I can't get my head around it.
Can anyone point me in the right direction?
Code:
<?php
$storyOfTheDay= "Once upon a time there was an old woman who loved baking gingerbread. She would bake gingerbread cookies, cakes, houses and gingerbread people, all decorated with chocolate and peppermint, caramel candies and colored frosting.
She lived with her husband on a farm at the edge of town. The sweet spicy smell of gingerbread brought children skipping and running to see what would be offered that day.
Unfortunately the children gobbled up the treats so fast that the old woman had a hard time keeping her supply of flour and spices to continue making the batches of gingerbread. Sometimes she suspected little hands of having reached through her kitchen window because gingerbread pieces and cookies would disappear.";
$storyOfTheDay = preg_split("/\s+/", $storyOfTheDay);
//Adding <span> to odd array index items
foreach (array_chunk($storyOfTheDay , 2) as $chunk) {
$storyOfTheDay[] = $chunk[0];
if(!empty( $chunk[1]))
{
$storyOfTheDay[] = $chunk[1]= "<span style='color:green'>". $chunk[1] ."</span>";
}
}
$storyOfTheDay = join(" ", $storyOfTheDay);
echo $storyOfTheDay;
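You keep appending to the same array ($storyOfTheDay), which already holds the original words, so the original text is printed before the highlighted copy. Build a new array instead: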
$storyOfTheDay = preg_split("/\s+/", $storyOfTheDay);
$newStoryOfTheDay = [];
//Adding <span> to odd array index items
foreach (array_chunk($storyOfTheDay , 2) as $chunk) {
$newStoryOfTheDay[] = $chunk[0];
if( isset($chunk[1]) ){ // isset() also keeps a word such as "0", which empty() would skip
$newStoryOfTheDay[] = "<span style='color:green'>". $chunk[1] ."</span>";
}
}
$newStoryOfTheDay = join(" ", $newStoryOfTheDay);
echo $newStoryOfTheDay;
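An alternative to array_chunk is to check whether each word's index is odd. This is only a sketch of the same idea; the function name highlightEveryOtherWord is made up for illustration.

```php
<?php
// Wrap every second word (array indexes 1, 3, 5, ...) in a green <span>.
function highlightEveryOtherWord(string $text): string
{
    // Split on any run of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $out = []; // a fresh array, so the original words are never output twice
    foreach ($words as $index => $word) {
        $out[] = ($index % 2 === 1)
            ? "<span style='color:green'>" . $word . "</span>"
            : $word;
    }

    return implode(' ', $out);
}

$highlighted = highlightEveryOtherWord('Once upon a time');
echo $highlighted, "\n";
```

Because the result is built in $out and returned, the input string is left untouched and nothing is printed twice.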
Please help with XPath.
The following script scrapes the main body of a URL using XPath:
<?php
//sentimen order
if (PHP_SAPI != 'cli') {
echo "<pre>";
}
require_once __DIR__ . '/../autoload.php';
$sentiment = new \PHPInsight\Sentiment();
require_once 'Xpath.php';
$startUrl = "http://news.sky.com/story/1445575/suspect-held-over-shooting-of-ferguson-police/";
$xpath = new XPATH($startUrl);
// We start from the root element
$query = '/html/body/div[2]/div[3]/article/div/div[2]/div[2]/p[3]';
$strQuery = $xpath->query($query);
$strNode = $strQuery->item(0)->nodeValue;
$result = array($strNode);
foreach ($result as $string) {
// calculations:
$scores = $sentiment->score($string);
$class = $sentiment->categorise($string);
// output:
echo "Strings $string \n";
echo "Dominant: $class, scores: ";
print_r($scores);
echo "\n";
}
The script above runs well except for the array loop: XPath does not scrape all of the content, only the first line of the main body.
I think the problem lies in the array and the foreach loop.
Can anyone please help me fix this loop?
You only fetch one paragraph. Additionally, you only put one string into the array.
You're perhaps looking for something more along these lines:
foreach ($xpath->query('
//header/h1
|//header/p
|//header//p[@class="last-updated__text"]
|//div[@class="story__content"]/p') as $p) {
echo string_normalize($p->textContent), "\n\n";
}
function string_normalize($string)
{
return preg_replace('~\s+~u', ' ', trim($string));
}
Output:
Shooting Of Ferguson Police: Suspect Charged
A prosecutor says the 20-year-old suspect claims he fired the shots in a dispute with other individuals and did not aim at police.
05:19, UK, Monday 16 March 2015
By Sky News US Team
A suspect has been charged in connection with the shooting and wounding last week of two police officers in Ferguson, Missouri.
St Louis County prosecutor Robert McCulloch told a news conference the accused was 20-year-old Jeffrey Williams.
He said the suspect, a local resident, was facing two counts of assault in the first degree.
Williams, who was arrested on Saturday night, is also charged with firing a handgun from a vehicle.
"He has acknowledged his participation in firing the shots," Mr McCulloch told reporters.
...
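The question's Xpath.php wrapper is not shown, so here is a self-contained sketch of the same union query using PHP's built-in DOMDocument and DOMXPath instead. The HTML sample is invented for illustration and only loosely mirrors the structure the query above assumes.

```php
<?php
// Hypothetical article markup; the real page structure may differ.
$html = '<html><body><article>'
    . '<header><h1>Example   headline</h1><p>Example summary.</p></header>'
    . '<div class="story__content">'
    . "<p>First paragraph.</p><p>Second\n   paragraph.</p>"
    . '</div>'
    . '</article></body></html>';

libxml_use_internal_errors(true); // older libxml versions warn about HTML5 tags
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The union operator (|) selects every node matched by any of the paths.
$query = '//header/h1 | //header/p | //div[@class="story__content"]/p';

$lines = [];
foreach ($xpath->query($query) as $node) {
    // Collapse runs of whitespace, as string_normalize() does above.
    $lines[] = preg_replace('~\s+~u', ' ', trim($node->textContent));
}

print_r($lines);
```

Each matched node becomes one array entry, so a sentiment loop like the one in the question can then run over every paragraph instead of a single string.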
I just started using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) and have some problems parsing XML.
I can perfectly parse all the links from HTML documents, but parsing links from RSS feeds (XML format) doesn't work. For example, I want to parse all the links from http://www.bing.com/search?q=ipod&count=50&first=0&format=rss so I use this code:
$content = file_get_html('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
foreach($content->find('item') as $entry)
{
$item['title'] = $entry->find('title', 0)->plaintext;
$item['description'] = $entry->find('description', 0)->plaintext;
$item['link'] = $entry->find('link', 0)->plaintext;
$parsed_results_array[] = $item;
}
print_r($parsed_results_array);
The script parses the title and description, but the link element is empty. Any ideas? My guess is that "link" is a reserved word or something, so how do I get the parser to work?
I suggest you use the right tool for this job: SimpleXML. Plus, it's built in :)
$xml = simplexml_load_file('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$parsed_results_array = array();
foreach($xml as $entry) {
foreach($entry->item as $item) {
// $parsed_results_array[] = json_decode(json_encode($item), true);
$items['title'] = (string) $item->title;
$items['description'] = (string) $item->description;
$items['link'] = (string) $item->link;
$parsed_results_array[] = $items;
}
}
echo '<pre>';
print_r($parsed_results_array);
Should yield something like:
Array
(
[0] => Array
(
[title] => Apple - iPod
[description] => Learn about iPod, Apple TV, and more. Download iTunes for free and purchase iTunes Gift Cards. Check out the most popular TV shows, movies, and music.
[link] => http://www.apple.com/ipod/
)
[1] => Array
(
[title] => iPod - Wikipedia, the free encyclopedia
[description] => The iPod is a line of portable media players designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after ...
[link] => http://en.wikipedia.org/wiki/IPod
)
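The same approach can be tested offline with simplexml_load_string. This is a minimal sketch that assumes a standard RSS 2.0 layout (rss > channel > item); the feed content below is invented.

```php
<?php
// Hypothetical RSS 2.0 document standing in for the live feed.
$rss = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Example results</title>
    <item>
      <title>First result</title>
      <link>http://example.com/first</link>
      <description>First description.</description>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.com/second</link>
      <description>Second description.</description>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);

$results = [];
foreach ($xml->channel->item as $item) {
    // Cast each SimpleXMLElement to a string to get a plain value.
    $results[] = [
        'title' => (string) $item->title,
        'link' => (string) $item->link,
        'description' => (string) $item->description,
    ];
}

print_r($results);
```

Addressing $xml->channel->item directly avoids the nested loop when the feed has a single channel, which is the usual case for RSS 2.0.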
If you are used to PHP Simple HTML DOM, you can keep using it!
Mixing too many approaches gets confusing, and simplehtmldom is already easy and powerful.
Be sure you start like this:
require_once('lib/simple_html_dom.php');
$content = file_get_contents('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$xml = new simple_html_dom();
$xml->load($content);
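Then you can go ahead with your queries!
Edit the simple_html_dom class: in the protected $self_closing_tags array, delete the "link" key. The parser treats <link> as a self-closing tag, as in HTML, so the text inside an RSS <link> element is dropped.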
BEFORE:
protected $self_closing_tags = array('img'=>1, 'br'=>1,'link'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
AFTER:
protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
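The same effect can be reproduced with PHP's built-in parsers, which may help explain why the link came back empty. This sketch is an analogy only: it uses DOMDocument and SimpleXML rather than simple_html_dom, and the fragment is made up.

```php
<?php
$fragment = '<item><link>http://example.com/page</link></item>';

// Parsed as XML: <link> is an ordinary element and keeps its text.
$xml = simplexml_load_string($fragment);
$xmlLink = (string) $xml->link;

// Parsed as HTML: <link> is a void (self-closing) element, so the parser
// closes it immediately and the URL ends up outside of it.
libxml_use_internal_errors(true); // <item> is not an HTML tag; silence warnings
$doc = new DOMDocument();
$doc->loadHTML($fragment);
libxml_clear_errors();

$node = $doc->getElementsByTagName('link')->item(0);
$htmlLink = $node ? $node->textContent : '';

var_dump($xmlLink, $htmlLink);
```

An HTML-oriented parser has no way to know that this particular link element belongs to an RSS item, which is why either an XML parser or the edited tag list is needed.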