I want to parse the Google News RSS feed with PHP. I managed to run this code:
<?
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
foreach($news->channel->item as $item) {
echo "<strong>" . $item->title . "</strong><br />";
echo strip_tags($item->description) ."<br /><br />";
}
?>
However, I'm unable to solve the following problems:
How can I get the hyperlink of the news title?
Each Google News item has many related news links in its footer, and my code above includes them too. How can I remove those from the description?
How can I also get the image for each news item? (Google displays a thumbnail image for each one.)
Thanks.
There we go, just what you need for your particular situation:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feeds = array();
$i = 0;
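foreach ($news->channel->item as $item)
{
    // The description is an HTML fragment; cast it once so the string functions get a real string.
    $description = (string) $item->description;
    // The thumbnail is the first src="..." attribute in the description, if there is one.
    $image = preg_match('#src="([^"]+)"#', $description, $match) ? $match[1] : '';
    $parts = explode('<font size="-1">', $description);
    $feeds[$i]['title'] = (string) $item->title;
    $feeds[$i]['link'] = (string) $item->link;
    $feeds[$i]['image'] = $image;
    // Guard against items whose description has fewer parts than expected.
    $feeds[$i]['site_title'] = isset($parts[1]) ? strip_tags($parts[1]) : '';
    $feeds[$i]['story'] = isset($parts[2]) ? strip_tags($parts[2]) : '';
    $i++;
}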
echo '<pre>';
print_r($feeds);
echo '</pre>';
?>
And the output should look like this:
[2] => Array
(
[title] => Los Alamos Nuclear Lab Under Siege From Wildfire - ABC News
[link] => http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGxBe4YsZArH0kSwEjq_zDm_h-N4A&url=http://abcnews.go.com/Technology/wireStory?id%3D13951623
[image] => http://nt2.ggpht.com/news/tbn/OhH43xORRwiW1M/6.jpg
[site_title] => ABC News
[story] => A wildfire burning near the desert birthplace of the atomic bomb advanced on the Los Alamos laboratory and thousands of outdoor drums of plutonium-contaminated waste Tuesday as authorities stepped up ...
)
I'd recommend checking out SimplePie. I've used it for several different projects and it works great (and abstracts away all of the headache you're currently dealing with).
Now, if you're writing this code simply because you want to learn how to do it, you should probably ignore this answer. :)
To get the URL for a news item, use $item->link.
If there's a common delimiter for the related news links, you could use regex to cut off everything after it.
Google puts the thumbnail image HTML code inside the description field of the feed. You could regex out everything between the open and close brackets for the image declaration to get the HTML for it.
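The two regex suggestions above can be sketched roughly as follows. The sample description string is made up for illustration (the real feed's markup may differ), the `<font size="-1">` delimiter is only the assumption used in the earlier answer, and `splitDescription` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: pull the thumbnail URL and the lead text out of a
// description HTML fragment. The delimiter is an assumption about the markup.
function splitDescription(string $html, string $delimiter = '<font size="-1">'): array
{
    // First <img ... src="..."> in the fragment, or null when there is no image.
    $image = preg_match('#<img[^>]*\ssrc="([^"]+)"#i', $html, $m) ? $m[1] : null;

    // Part 1 is the site name, part 2 the story text; anything after the
    // third delimiter is treated as "related links" and ignored.
    $parts = explode($delimiter, $html, 4);
    $story = isset($parts[2]) ? strip_tags($parts[2]) : strip_tags($html);

    return ['image' => $image, 'story' => trim($story)];
}

// Made-up sample input, not real feed output.
$sample = '<table><tr><td><img src="http://example.com/thumb.jpg" /></td><td>'
    . '<font size="-1">Example Site</font><br />'
    . '<font size="-1">Short summary of the story ...</font><br />'
    . '<font size="-1"><a href="http://example.com/other">Related story</a></font>'
    . '</td></tr></table>';

$result = splitDescription($sample);
echo $result['image'], "\n";
echo $result['story'], "\n";
```

If the feed's markup changes, only the delimiter and the image pattern should need adjusting.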
Why can this code fetch data from the first page below and insert it into numbered arrays, but fail to do the same for the second page?
http://nimishprabhu.com
https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php
For the second page, the script prints arrays like the following, which is not correct:
Array ( [0] => mailto:support#fiverr.com )
Array ( [0] => https://collector.fiverr.com/api/v1/collector/noScript.gif?appId=PXK3bezZfO
[1] => https://collector.fiverr.com/api/v1/collector/pxPixel.gif?appId=PXK3bezZfO )
Array ( [0] => One Small Step )
Code:
<?php
/*
2.
FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every image on a webpage or say, each
and every hyperlink.
We will be using “find” function to extract this information from the
object. Doing it using Simple HTML DOM Parser :
*/
include('simple_html_dom.php');
$html = file_get_html('https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php');
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
echo "<br />";
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
$images[] = $img->src;
}
print_r($images);
echo "<br />";
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
Any suggestions and code samples are welcome; this is for learning purposes.
I am a self-taught student.
The reason is that the page you are trying to download (fiverr.com) is JavaScript-based, with dynamically loaded content. This will not work in PHP, because PHP only sees the HTML that was sent by the server; it can't parse and run JavaScript. Since this is for learning purposes, you can simply try a different website.
However, if you want a working solution, you should look into Selenium. It is a browser automation tool: it drives a real browser (which can run headless), so the page's JavaScript runs just as it would for a normal visitor. Through its WebDriver API you can then read the fully rendered page of sites like fiverr.com.
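To see the limitation in isolation, here is a small sketch using PHP's built-in DOMDocument on a made-up page (the markup and URLs are invented for illustration). One link is in the static HTML, the other would only exist after a browser ran the script, so a server-side parser finds just the first.

```php
<?php
// Made-up page: "/about" is static, "/gig/1" would only be added by JavaScript.
$staticHtml = <<<'HTML'
<html><body>
<a href="/about">About</a>
<div id="results"></div>
<script>
  var a = document.createElement('a');
  a.href = '/gig/1';
  document.getElementById('results').appendChild(a);
</script>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about imperfect HTML quiet
$dom->loadHTML($staticHtml);
libxml_clear_errors();

// Collect the href of every <a> element the parser can see.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links);   // only the static link is present
```

The script is kept as plain text by the parser and never executed, which is why the second link never shows up.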
I have created a script that makes every other word in a paragraph green, which works. However, the original paragraph appears above the new paragraph, which I do not want.
The solution to this may be simple, but I can't get my head around it.
Can anyone point me in the right direction?
Code:
<?php
$storyOfTheDay= "Once upon a time there was an old woman who loved baking gingerbread. She would bake gingerbread cookies, cakes, houses and gingerbread people, all decorated with chocolate and peppermint, caramel candies and colored frosting.
She lived with her husband on a farm at the edge of town. The sweet spicy smell of gingerbread brought children skipping and running to see what would be offered that day.
Unfortunately the children gobbled up the treats so fast that the old woman had a hard time keeping her supply of flour and spices to continue making the batches of gingerbread. Sometimes she suspected little hands of having reached through her kitchen window because gingerbread pieces and cookies would disappear.";
$storyOfTheDay = preg_split("/\s+/", $storyOfTheDay);
//Adding <span> to odd array index items
foreach (array_chunk($storyOfTheDay , 2) as $chunk) {
$storyOfTheDay[] = $chunk[0];
if(!empty( $chunk[1]))
{
$storyOfTheDay[] = $chunk[1]= "<span style='color:green'>". $chunk[1] ."</span>";
}
}
$storyOfTheDay = join(" ", $storyOfTheDay);
echo $storyOfTheDay;
You keep appending to the same array ($storyOfTheDay) that holds the original words, so the original text stays at the front. Build a new array instead:
$storyOfTheDay = preg_split("/\s+/", $storyOfTheDay);
$newStoryOfTheDay = [];
//Adding <span> to odd array index items
foreach (array_chunk($storyOfTheDay , 2) as $chunk) {
$newStoryOfTheDay[] = $chunk[0];
if( isset($chunk[1]) ){ // isset rather than !empty, so a word such as "0" is still wrapped
$newStoryOfTheDay[] = "<span style='color:green'>". $chunk[1] ."</span>";
}
}
$newStoryOfTheDay = join(" ", $newStoryOfTheDay);
echo $newStoryOfTheDay;
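The same idea can also be written without array_chunk, by checking whether each word's index is odd. This is only a sketch; `highlightEveryOtherWord` is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: wrap every second word (indexes 1, 3, 5, ...) in a span.
function highlightEveryOtherWord(string $text, string $color = 'green'): string
{
    // PREG_SPLIT_NO_EMPTY drops empty strings caused by leading/trailing whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $index => $word) {
        if ($index % 2 === 1) {
            $words[$index] = "<span style='color:{$color}'>" . $word . "</span>";
        }
    }

    // The original $text is never echoed, so only the highlighted version appears.
    return implode(' ', $words);
}

echo highlightEveryOtherWord('Once upon a time');
// prints: Once <span style='color:green'>upon</span> a <span style='color:green'>time</span>
```

Because the function returns a new string instead of appending to its input, the duplicated-paragraph problem cannot occur.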
I just started using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) and have some problems parsing XML.
I can perfectly parse all the links from HTML documents, but parsing links from RSS feeds (XML format) doesn't work. For example, I want to parse all the links from http://www.bing.com/search?q=ipod&count=50&first=0&format=rss so I use this code:
$content = file_get_html('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
foreach($content->find('item') as $entry)
{
$item['title'] = $entry->find('title', 0)->plaintext;
$item['description'] = $entry->find('description', 0)->plaintext;
$item['link'] = $entry->find('link', 0)->plaintext;
$parsed_results_array[] = $item;
}
print_r($parsed_results_array);
The script parses the title and description, but the link element is empty. Any ideas? My guess is that "link" is a reserved word or something similar, so how do I get the parser to work?
I suggest you use the right tool for this job: SimpleXML. Plus, it's built in :)
$xml = simplexml_load_file('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$parsed_results_array = array();
foreach($xml as $entry) {
foreach($entry->item as $item) {
// $parsed_results_array[] = json_decode(json_encode($item), true);
$items['title'] = (string) $item->title;
$items['description'] = (string) $item->description;
$items['link'] = (string) $item->link;
$parsed_results_array[] = $items;
}
}
echo '<pre>';
print_r($parsed_results_array);
Should yield something like:
Array
(
[0] => Array
(
[title] => Apple - iPod
[description] => Learn about iPod, Apple TV, and more. Download iTunes for free and purchase iTunes Gift Cards. Check out the most popular TV shows, movies, and music.
[link] => http://www.apple.com/ipod/
)
[1] => Array
(
[title] => iPod - Wikipedia, the free encyclopedia
[description] => The iPod is a line of portable media players designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after ...
[link] => http://en.wikipedia.org/wiki/IPod
)
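To try the SimpleXML approach without a network request, the sketch below parses a made-up RSS string. The feed content and the `parseRssItems` helper are invented for illustration, and the code assumes the usual RSS 2.0 layout (`rss > channel > item`).

```php
<?php
// Hypothetical helper: return title/description/link for each <item> of an RSS 2.0 string.
function parseRssItems(string $rss): array
{
    libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
    $xml = simplexml_load_string($rss);
    libxml_clear_errors();
    if ($xml === false || !isset($xml->channel)) {
        return [];                      // not well-formed XML, or not an RSS 2.0 layout
    }

    $items = [];
    foreach ($xml->channel->item as $item) {
        $items[] = [
            'title'       => (string) $item->title,
            'description' => (string) $item->description,
            'link'        => (string) $item->link,
        ];
    }
    return $items;
}

// Made-up feed for illustration.
$rss = <<<'XML'
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><description>First item</description><link>http://example.com/1</link></item>
  <item><title>Second</title><description>Second item</description><link>http://example.com/2</link></item>
</channel></rss>
XML;

$items = parseRssItems($rss);
print_r($items);
```

Returning an empty array on a parse failure keeps the calling code simple; whether that or an exception is preferable depends on how the results are used.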
If you are used to PHP Simple HTML DOM, you can keep using it!
Mixing too many approaches gets confusing, and simplehtmldom is already easy and powerful.
Just be sure to start like this:
require_once('lib/simple_html_dom.php');
$content = file_get_contents('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$xml = new simple_html_dom();
$xml->load($content);
Then you can go on with your queries!
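Edit the simple_html_dom class: in the protected $self_closing_tags property, delete the "link" key. Otherwise the parser treats <link> as a self-closing HTML tag, so its text comes back empty.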
BEFORE:
protected $self_closing_tags = array('img'=>1, 'br'=>1,'link'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
AFTER:
protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
I'm not looking to scrape Google. This is just a one-time thing to get about 300 URLs a bit faster than doing it manually.
I can't seem to get a DOMDocument to be created though. It always ends up as an empty object.
search_list.txt contains my list of search terms. Right now I'm testing it with just 1 term, "legos".
The script correctly downloads the search results page. I viewed it in a web browser and it looked fine.
search_list.txt
legos
getresults.php
<?php
$search_list = 'search_list.txt'; // file containing search terms
$results = 'results.txt';
$handle = fopen($search_list,'r'); // was $vendor_list, which is never defined
while($line = fgets($handle)) {
$fp = fopen($results,'w');
$ch = curl_init('http://www.google.com/'
. 'search?q=' . urlencode($line));
curl_setopt($ch,CURLOPT_FILE,$fp);
curl_setopt($ch,CURLOPT_HEADER,0);
curl_exec($ch);
curl_close($ch);
fclose($fp);
unset($ch,$fp);
}
fclose($handle);
$dom = DOMDocument::loadHTML(file_get_contents($results));
echo print_r($dom,true); // EMPTY
$search_div = $dom->getElementById('search');
if(is_null($search_div)) { // ALWAYS NULL
echo 'Search_div is null';
} else {
echo print_r($search_div,true);
}
?>
I made some changes:
Instead of fopen/fgets, I use file().
Instead of curl, I use simple_html_dom::load_file.
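<?php
include('simple_html_dom.php');
$search_list = 'search_list.txt'; // file containing search terms
$result_list = 'results.txt'; // file the results are written to
// file() returns the lines as an array; the flags drop newlines and blank lines
$searching_list = file($search_list, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$html = new simple_html_dom();
foreach ($searching_list as $key => $searching_word) {
    $html->load_file('http://www.google.com/'.'search?q='.urlencode($searching_word));
    $search_div = $html->find("div[id='search']");
    echo $search_div[0]; // See content of the search div.
    // FILE_APPEND keeps the results of earlier search terms
    file_put_contents($result_list, (string) $search_div[0], FILE_APPEND);
}
?>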
You can see the results with echo $search_div[0];.
It shows the whole content of the search div.
I searched for 'asd' =) ...
In my results, the output started like this:
<div id="search"><div id="ires"><ol><li class="g"><h3 class="r"><b>Atrial septal defect</b> - Wikipedia, the free encyclopedia</h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>en.wikipedia.org/wiki/<b>Atrial_septal_defect</b></cite><span class="flc"> - Cached - Similar</span></div><span class="st"><b>Atrial septal defect</b> (<b>ASD</b>)
And ended like this:
</span><br></div></li><li class="g"><h3 class="r"><b>Achievement School District</b></h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>achievementschooldistrict.org/</cite><span class="flc"> - Cached</span></div><span class="st"><b>Achievement School District</b> · The <b>ASD</b> · Driving Results · Campuses · Join Our <br> Team · Enroll A Student · <b>ASD</b> News · Contact Us <b>...</b></span><br></div></li></ol></div></div>
UPDATE
This part is based on a comment by Buttle Butk.
If the first result of the Google search is all you need, you can use this code to build a link that leads to it.
<?php
$search_list = 'search_list.txt'; // file containing search terms
$result_list = 'results.txt'; // file the links are written to
$order_language = "en";
$searching_list = file($search_list, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($searching_list as $key => $searching_word) {
    $link = 'https://www.google.com.tr/search?hl='.$order_language.'&q='.urlencode($searching_word).'&btnI=1';
    echo $link;
    // write the whole link (not just its first character), one per line
    file_put_contents($result_list, $link . PHP_EOL, FILE_APPEND);
}
?>
I searched for 'asd' =) again ...
The result:
https://www.google.com.tr/search?hl=en&q=asd&btnI=1
When I copied and pasted this link into Chrome, it redirected me to the first result for 'asd':
http://www.asd-europe.org/
I hope this helps.
Have a good day.
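For readers who would rather stay with the built-in DOM extension from the question, here is a minimal sketch of loading HTML through a DOMDocument instance and querying an element by id with DOMXPath. The markup is made up to stand in for a downloaded page, `linksInsideId` is a hypothetical helper, and the code assumes the id contains no quote characters.

```php
<?php
// Hypothetical helper: return the href of every link inside the element with the given id.
function linksInsideId(string $html, string $id): array
{
    if ($html === '') {
        return [];                      // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
    $loaded = $dom->loadHTML($html);    // called on an instance, not statically
    libxml_clear_errors();
    if (!$loaded) {
        return [];
    }

    // Assumption: $id contains no quotes, so it can be embedded in the XPath string.
    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query("//*[@id='$id']//a[@href]") as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Made-up markup standing in for a downloaded results page.
$html = '<html><body><div id="nav"><a href="/home">Home</a></div>'
      . '<div id="search"><ol><li><a href="http://example.com/a">A</a></li>'
      . '<li><a href="http://example.com/b">B</a></li></ol></div></body></html>';

print_r(linksInsideId($html, 'search'));
```

When inspecting a DOMDocument, `$dom->saveHTML()` tends to be more informative than print_r, which may show little or nothing for DOM objects depending on the PHP version.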