I just started using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/) and have some problems parsing XML.
I can perfectly parse all the links from HTML documents, but parsing links from RSS feeds (XML format) doesn't work. For example, I want to parse all the links from http://www.bing.com/search?q=ipod&count=50&first=0&format=rss so I use this code:
$content = file_get_html('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
foreach($content->find('item') as $entry)
{
$item['title'] = $entry->find('title', 0)->plaintext;
$item['description'] = $entry->find('description', 0)->plaintext;
$item['link'] = $entry->find('link', 0)->plaintext;
$parsed_results_array[] = $item;
}
print_r($parsed_results_array);
The script parses the title and description, but the link element is empty. Any ideas? My guess is that "link" is a reserved word or something, so how do I get the parser to work?
I suggest you use the right tool for this job: SimpleXML. Plus, it's built in :)
$xml = simplexml_load_file('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$parsed_results_array = array();
foreach($xml as $entry) {
foreach($entry->item as $item) {
// $parsed_results_array[] = json_decode(json_encode($item), true);
$items['title'] = (string) $item->title;
$items['description'] = (string) $item->description;
$items['link'] = (string) $item->link;
$parsed_results_array[] = $items;
}
}
echo '<pre>';
print_r($parsed_results_array);
Should yield something like:
Array
(
[0] => Array
(
[title] => Apple - iPod
[description] => Learn about iPod, Apple TV, and more. Download iTunes for free and purchase iTunes Gift Cards. Check out the most popular TV shows, movies, and music.
[link] => http://www.apple.com/ipod/
)
[1] => Array
(
[title] => iPod - Wikipedia, the free encyclopedia
[description] => The iPod is a line of portable media players designed and marketed by Apple Inc. The first line was released on October 23, 2001, about 8½ months after ...
[link] => http://en.wikipedia.org/wiki/IPod
)
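For trying this out offline, here is a minimal sketch of the same idea. It assumes a feed shaped like a standard RSS 2.0 document; the inline `$rss` string is made-up sample data, not Bing's real output. It reads the items through `$xml->channel->item` directly, so the outer loop is not needed:

```php
<?php
// Hypothetical sample feed, standing in for the real RSS response.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample results</title>
    <item>
      <title>First result</title>
      <link>http://example.com/first</link>
      <description>First description</description>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.com/second</link>
      <description>Second description</description>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
if ($xml === false) {
    exit("Could not parse the feed\n");
}

$parsed_results_array = array();
foreach ($xml->channel->item as $item) {
    // Cast each SimpleXMLElement to a plain string before storing it.
    $parsed_results_array[] = array(
        'title'       => (string) $item->title,
        'description' => (string) $item->description,
        'link'        => (string) $item->link,
    );
}

print_r($parsed_results_array);
```

With a live feed, `simplexml_load_file($url)` would replace `simplexml_load_string($rss)`; the rest stays the same.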
If you are used to PHP Simple HTML DOM, you can keep using it!
Mixing too many approaches causes confusion, and simplehtmldom is already easy and powerful.
Be sure you start like this:
require_once('lib/simple_html_dom.php');
$content = file_get_contents('http://www.bing.com/search?q=ipod&count=50&first=0&format=rss');
$xml = new simple_html_dom();
$xml->load($content);
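Then you can run your queries!
Edit the simple_html_dom class: the parser treats "link" as a self-closing HTML tag, so the element's text is dropped.
In the protected $self_closing_tags array,
delete the key "link".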
BEFORE:
protected $self_closing_tags = array('img'=>1, 'br'=>1,'link'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
AFTER:
protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1);
Why can this code fetch data from the first page below and put it into numbered arrays, while it fails to do the same for the second page?
http://nimishprabhu.com
https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php
The page shows arrays numbered like the following, which is not correct:
Array ( [0] => mailto:support@fiverr.com )
Array ( [0] => https://collector.fiverr.com/api/v1/collector/noScript.gif?appId=PXK3bezZfO
[1] => https://collector.fiverr.com/api/v1/collector/pxPixel.gif?appId=PXK3bezZfO )
Array ( [0] => One Small Step )
Code:
<?php
/*
2.
FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every image on a webpage or say, each
and every hyperlink.
We will be using “find” function to extract this information from the
object. Doing it using Simple HTML DOM Parser :
*/
include('simple_html_dom.php');
$html = file_get_html('https://www.fiverr.com/search/gigs?utf8=%E2%9C%93&source=guest-homepage&locale=en&search_in=everywhere&query=php');
//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
$links[] = $a->href;
}
print_r($links);
echo "<br />";
//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
$images[] = $img->src;
}
print_r($images);
echo "<br />";
//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
Any suggestions and code samples are welcome; this is for learning purposes.
I am a self-taught student.
The reason is that the page you are trying to download (fiverr.com) is JavaScript-based, with dynamically loaded content. That will not work with this approach, because PHP only sees the HTML the server sent; it can't parse and run JavaScript. Since this is for learning purposes, you can simply try a different website.
However, if you want a working solution, look into Selenium. It is a browser automation tool: it drives a real browser (which can run headless), so the page's JavaScript runs just as it does for a normal visitor. Through its WebDriver API you can then read the fully rendered content of sites like fiverr.com.
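To see the difference for yourself, here is a rough sketch using PHP's built-in DOMDocument on two made-up HTML strings (both are hypothetical; neither is fiverr.com's actual markup, and `extract_links` is just an illustrative helper name). The first is a static page; the second is the kind of near-empty shell a JavaScript-driven site may send, where the links would only appear after a script runs in a browser:

```php
<?php
// Collect the href of every <a> element found in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect real-world markup.
    @$doc->loadHTML($html);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Hypothetical static page: the links are in the HTML the server sends.
$static_html = '<html><body><h1>Blog</h1>'
    . '<a href="/about">About</a><a href="/contact">Contact</a>'
    . '</body></html>';

// Hypothetical JavaScript shell: the links would be added by app.js later.
$dynamic_html = '<html><body><div id="app"></div>'
    . '<script src="app.js"></script></body></html>';

print_r(extract_links($static_html));  // the two links from the static page
print_r(extract_links($dynamic_html)); // nothing to find: no <a> in the markup
```

If a page gives you far fewer links than you see in your browser, comparing "view source" with the rendered page is a quick way to check whether this is the cause.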
I want to parse Google News rss with PHP. I managed to run this code:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
foreach($news->channel->item as $item) {
echo "<strong>" . $item->title . "</strong><br />";
echo strip_tags($item->description) ."<br /><br />";
}
?>
However, I'm unable to solve the following problems:
How can I get the hyperlink of the news title?
Each Google News item has many related news links in its footer (and my code above includes them too). How can I remove those from the description?
How can I also get the image for each news item? (Google displays a thumbnail image for each one.)
Thanks.
There we go, just what you need for your particular situation:
<?php
$news = simplexml_load_file('http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&topic=n&output=rss');
$feeds = array();
$i = 0;
foreach ($news->channel->item as $item)
{
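$description = (string) $item->description;
// Thumbnail URL: the first src="..." attribute in the description HTML
preg_match('#src="([^"]+)"#', $description, $match);
$parts = explode('<font size="-1">', $description);
$feeds[$i]['title'] = (string) $item->title;
$feeds[$i]['link'] = (string) $item->link;
// Fall back to an empty string when a piece is missing
$feeds[$i]['image'] = isset($match[1]) ? $match[1] : '';
$feeds[$i]['site_title'] = isset($parts[1]) ? strip_tags($parts[1]) : '';
$feeds[$i]['story'] = isset($parts[2]) ? strip_tags($parts[2]) : '';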
$i++;
}
echo '<pre>';
print_r($feeds);
echo '</pre>';
?>
And the output should look like this:
[2] => Array
(
[title] => Los Alamos Nuclear Lab Under Siege From Wildfire - ABC News
[link] => http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNGxBe4YsZArH0kSwEjq_zDm_h-N4A&url=http://abcnews.go.com/Technology/wireStory?id%3D13951623
[image] => http://nt2.ggpht.com/news/tbn/OhH43xORRwiW1M/6.jpg
[site_title] => ABC News
[story] => A wildfire burning near the desert birthplace of the atomic bomb advanced on the Los Alamos laboratory and thousands of outdoor drums of plutonium-contaminated waste Tuesday as authorities stepped up ...
)
I'd recommend checking out SimplePie. I've used it for several different projects and it works great (and abstracts away all of the headache you're currently dealing with).
Now, if you're writing this code simply because you want to learn how to do it, you should probably ignore this answer. :)
To get the URL for a news item, use $item->link.
If there's a common delimiter for the related news links, you could use regex to cut off everything after it.
Google puts the thumbnail image HTML code inside the description field of the feed. You could regex out everything between the open and close brackets for the image declaration to get the HTML for it.
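The last two suggestions can be sketched roughly as follows. The `$description` string is hypothetical sample markup, and the `<br />` delimiter is an assumption made for illustration; inspect the real feed's description to find what actually separates the story from the related links:

```php
<?php
// Hypothetical description HTML; the real feed's markup may differ.
$description = '<table><tr><td><img src="http://example.com/thumb.jpg" /></td>'
    . '<td><font size="-1">Main story text goes here.</font>'
    . '<br /><font size="-1"><a href="http://example.com/r1">Related story 1</a></font>'
    . '</td></tr></table>';

// Thumbnail: capture the src attribute of the first <img> tag.
$image = null;
if (preg_match('#<img[^>]+src="([^"]+)"#i', $description, $match)) {
    $image = $match[1];
}

// Story: cut everything from the assumed delimiter onward, then strip tags.
$delimiter = '<br />'; // assumption: the related links start after this tag
$pos = strpos($description, $delimiter);
$main = ($pos === false) ? $description : substr($description, 0, $pos);
$story = trim(strip_tags($main));

echo $image, "\n";
echo $story, "\n";
```

A plain `strpos`/`substr` cut is used here instead of a regex because the delimiter is a fixed string; a regex only becomes necessary if the delimiter varies.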
I cannot fetch the meta description tag from some sites; one in particular is YouTube.
I have tried using "get_meta_tags", but it does not return the description. I have tried several regexes as well. The title returns fine.
Try getting the description from this link: http://www.youtube.com/watch?v=xci0-26M-bk
$url = 'http://www.youtube.com/watch?v=xci0-26M-bk';
if ($fp = @fopen($url, 'r')) {
$file = file($url);
$file = implode("", $file);
$tags = get_meta_tags($url);
$description = trim($tags['description']);
}
$description returns blank...
Why don't you pass the $url directly to get_meta_tags()?
If I do:
print_r(get_meta_tags('http://www.youtube.com/watch?v=xci0-26M-bk'));
I get:
Array ( [title] => Catfish Blues - Jimi Hendrix Experience [description] => Couldn't ignore Jimi any longer. This is a track recorded for Radio one about 1967. Love the simplicity (seemingly)and the, understated, power. [keywords] => Jimi, Hendrix, sixties )
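A small offline sketch of the same call, assuming nothing about YouTube's current markup: the HTML below is made up and written to a temporary file, since get_meta_tags() accepts a local filename as well as a URL. It also guards against a missing description key:

```php
<?php
// Hypothetical page, written to a temporary file so the sketch runs offline.
$html = '<html><head><title>Sample video</title>'
    . '<meta name="description" content="A short sample description.">'
    . '<meta name="keywords" content="sample, video">'
    . '</head><body></body></html>';

$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

// get_meta_tags() takes a filename or URL and returns name => content pairs.
$tags = get_meta_tags($path);
unlink($path);

// Guard against pages that have no description tag at all.
$description = isset($tags['description']) ? trim($tags['description']) : '';
echo $description, "\n";
```

If the description comes back empty for a real URL, dumping the whole `$tags` array is a quick way to see which meta names the page actually exposes.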
Below is the XML I am working with. There are more items; this is the first set. How can I get these elements into an array? I have been trying with PHP's SimpleXML etc., but I just can't do it.
<response xmlns:lf="http://api.lemonfree.com/ns/1.0">
<lf:request_type>listing</lf:request_type>
<lf:response_code>0</lf:response_code>
<lf:result type="listing" count="10">
<lf:item id="56832429">
<lf:attr name="title">Used 2005 Ford Mustang V6 Deluxe</lf:attr>
<lf:attr name="year">2005</lf:attr>
<lf:attr name="make">FORD</lf:attr>
<lf:attr name="model">MUSTANG</lf:attr>
<lf:attr name="vin">1ZVFT80N555169501</lf:attr>
<lf:attr name="price">12987</lf:attr>
<lf:attr name="mileage">42242</lf:attr>
<lf:attr name="auction">no</lf:attr>
<lf:attr name="city">Grand Rapids</lf:attr>
<lf:attr name="state">Michigan</lf:attr>
<lf:attr name="image">http://www.lemonfree.com/images/stock_images/thumbnails/2005_38_557_80.jpg</lf:attr>
<lf:attr name="link">http://www.lemonfree.com/56832429.html</lf:attr>
</lf:item>
<!-- more items -->
</lf:result>
</response>
Thanks guys
EDIT: I want the first item's data in easy-to-access variables. I've been struggling for a couple of days to get SimpleXML to work, as I am new to PHP, so I thought manipulating an array would be easier.
Why do you want them in an array? They are structured already, use them as XML directly.
There is SimpleXML and DOMDocument, now it depends on what you want to do with the data (you failed to mention that) which one serves you better. Expand your question to get code samples.
EDIT: Here is an example of how you could handle your document with SimpleXML:
$url = "http://api.lemonfree.com/listings?key=xxxx&make=ford&model=mustang";
$ns_lf = "http://api.lemonfree.com/ns/1.0";
$response = simplexml_load_file($url);
// children() fetches all nodes of a given namespace
$result = $response->children($ns_lf)->result;
// dump the entire <lf:result> to see what it looks like
print_r($result);
// once the namespace was handled, you can go on normally (-> syntax)
foreach ($result->item as $item) {
$title = $item->xpath("lf:attr[@name='title']");
$state = $item->xpath("lf:attr[@name='state']");
// xpath() always returns an array of matches, hence the [0]
echo( $title[0].", ".$state[0] );
}
Perhaps you should look at SimplePie; it parses XML feeds into an array or an object (well, one of the two :D). I think it works well for namespaces and attributes too.
Some benefits include its free, open-source license and its support community.
SimpleXML is the best way to read/write XML files. Usually it's as easy as using arrays, except in your case, because XML namespaces are involved and they complicate things. Also, the format used to store the attributes kind of sucks, so instead of being obvious and easy to use it's kind of complicated. Here's what you're looking for, so that you can move on to something more interesting:
$response = simplexml_load_file($url);
$items = array();
foreach ($response->xpath('//lf:item') as $item)
{
$id = (string) $item['id'];
foreach ($item->xpath('lf:attr') as $attr)
{
$name = (string) $attr['name'];
$items[$id][$name] = (string) $attr;
}
}
You'll have everything you need in the $items array, use print_r() to see what's inside. $url should be the URL of that lemonfree API thing. The code assumes there can't be multiple values for one attribute (e.g. multiple images.)
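Here is a self-contained variant of that loop for experimenting without an API key. It inlines a trimmed copy of the XML from the question and, as an extra precaution that is not part of the answer above, binds the `lf` prefix explicitly with registerXPathNamespace() so the XPath does not rely on the prefix the document happens to use:

```php
<?php
// Trimmed copy of the XML from the question, inlined so no API key is needed.
$xml = <<<XML
<response xmlns:lf="http://api.lemonfree.com/ns/1.0">
  <lf:result type="listing" count="10">
    <lf:item id="56832429">
      <lf:attr name="title">Used 2005 Ford Mustang V6 Deluxe</lf:attr>
      <lf:attr name="year">2005</lf:attr>
      <lf:attr name="state">Michigan</lf:attr>
    </lf:item>
  </lf:result>
</response>
XML;

$ns_lf = 'http://api.lemonfree.com/ns/1.0';
$response = simplexml_load_string($xml);
$response->registerXPathNamespace('lf', $ns_lf);

$items = array();
foreach ($response->xpath('//lf:item') as $item) {
    $id = (string) $item['id'];
    // Each element needs its own prefix binding before calling xpath() on it.
    $item->registerXPathNamespace('lf', $ns_lf);
    foreach ($item->xpath('lf:attr') as $attr) {
        $items[$id][(string) $attr['name']] = (string) $attr;
    }
}

// The first item's fields as a plain associative array.
$first = reset($items);
echo $first['title'], ', ', $first['state'], "\n";
```

From there, `$first['price']`, `$first['link']` and so on are available for the full document in the same way.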
Good luck.
This is the way I parsed XML returned from a client's email address book system into an array so I could use it on a page. It uses an XML parser that is part of PHP, I think.
Here is its documentation: http://www.php.net/manual/en/ref.xml.php
$user_info = $your_xml; // the XML string to parse
// SETS $xml_array AS ARRAY
$xml_array = array();
// SETS UP XML PARSER AND PARSES $user_info INTO AN ARRAY
$xml_parser = xml_parser_create();
xml_parse_into_struct($xml_parser, $user_info, $values, $index);
xml_parser_free($xml_parser);
foreach($values as $key => $value)
{
// $value['level'] relates to the nesting level of a tag, x will need to be a number
if($value['level']==x)
{
$tag_name = $value['tag'];
// INSERTS DETAILS INTO ARRAY $contact_array SETTING KEY = $tag_name VALUE = value for that tag
$xml_array[strtolower($tag_name)] = $value['value'];
}
}
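If you var_dump($values) you should see what level the data you want is on; I think it'll be 4 for the above XML. You can then filter out anything you don't want by changing $value['level']==x to the required level, i.e. $value['level']==4.
This returns $xml_array keyed by lower-cased tag name. Note that in the XML above every value sits in an lf:attr tag, so to end up with $xml_array['title'] = 'Used 2005 Ford Mustang V6 Deluxe' etc. you would key on the tag's name attribute ($value['attributes']['NAME'], upper-cased by the parser's default case folding) instead of on $value['tag'].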
Hope that helps some
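As a rough sketch of this approach adapted to the lemonfree XML from the question (trimmed and inlined here), the loop can key the array on each tag's name attribute. The level of 4 and the upper-cased `NAME` key both follow from the parser's defaults for this particular document; treat them as assumptions to verify with var_dump($values) on your own data:

```php
<?php
// Trimmed copy of the XML from the question.
$user_info = '<response xmlns:lf="http://api.lemonfree.com/ns/1.0">'
    . '<lf:result type="listing" count="10"><lf:item id="56832429">'
    . '<lf:attr name="title">Used 2005 Ford Mustang V6 Deluxe</lf:attr>'
    . '<lf:attr name="price">12987</lf:attr>'
    . '</lf:item></lf:result></response>';

$xml_parser = xml_parser_create();
xml_parse_into_struct($xml_parser, $user_info, $values, $index);

$xml_array = array();
foreach ($values as $value) {
    // <lf:attr> elements sit at nesting level 4 in this document.
    if ($value['level'] == 4 && isset($value['attributes']['NAME'])) {
        // Case folding is on by default, so attribute names arrive upper-cased.
        $key = $value['attributes']['NAME'];
        $xml_array[$key] = isset($value['value']) ? $value['value'] : '';
    }
}

print_r($xml_array);
```

This handles a single item; with several items, the values would need to be grouped per lf:item (for example by tracking the open and close entries at level 3) so later items do not overwrite earlier ones.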