Scraping data from amazon - php

I'm aware that there is an Amazon API for pulling their data, but I'm trying to learn scraping for my own knowledge, and pulling data from Amazon seems like a good test.
<?php
ini_set('display_errors',1);
ini_set('display_startup_errors',1);
error_reporting(-1);
include('../includes/simple_html_dom.php');
$html = file_get_html('http://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$');
foreach($html->find('a-section') as $element) {
echo $element->plaintext . '<br />';
}
echo $ret;
?>
All I'm trying to do is pull the product description from the link, but I'm not sure why it isn't working. I'm not getting any errors, or any data at all.

The class for the product description is simply productDescriptionWrapper, so use that CSS selector in your sample code:
foreach($html->find('.productDescriptionWrapper') as $element) {
echo $element->plaintext . '<br />';
}
simplehtmldom uses CSS selectors very similar to jQuery's. If you want all divs, use ->find('div'); if you want all anchors with a class of 'hotProduct', use ->find('a.hotProduct'), and so on.
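To see why find('a-section') matched nothing: without a leading dot, a selector such as 'a-section' is read as a tag name, not a class. Here is a minimal sketch of the difference. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so it runs without the library, and the markup is made up for illustration:

```php
<?php
// Made-up markup: both divs carry the class "a-section"; no element is *named* a-section.
$markup = '<div class="a-section productDescriptionWrapper">A sturdy kettle.</div>'
        . '<div class="a-section">Unrelated block.</div>';

$doc = new DOMDocument();
@$doc->loadHTML($markup); // @ hides warnings about imperfect HTML

$xpath = new DOMXPath($doc);

// Tag selector: looks for elements whose tag name is a-section (there are none).
$byTag = $xpath->query('//a-section');

// Class selector: an XPath equivalent of the CSS selector '.productDescriptionWrapper'.
$byClass = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' productDescriptionWrapper ')]"
);

echo 'Matches by tag: ', $byTag->length, "\n";
echo 'Matches by class: ', $byClass->length, "\n";
foreach ($byClass as $node) {
    echo trim($node->textContent), "\n";
}
```

With Simple HTML DOM the same idea would be find('.a-section') for the class and find('a-section') for a tag of that name.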

It doesn't work because the product description is being added by JavaScript into an iFrame.

First check whether you are getting any HTML from Amazon at all. It might be blocking your request.
$url = "https://www.amazon.co.uk/gp/product/B00AZYBFGY/ref=s9_simh_gw_p86_d0_i1?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1MP0FXRF8V70NWAN3ZWW&pf_r$";
$htmlContent = file_get_contents($url);
echo $htmlContent;
$html = str_get_html($htmlContent);
Note the https://; you have http://, which may be why you get nothing.
Once you have the HTML, you can move on.
Try different selectors:
foreach($html->find('div[id=productDescription]') as $element) {
echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=content]') as $element) {
echo $element->plaintext . '<br />';
}
foreach($html->find('div[id=feature-bullets]') as $element) {
echo $element->plaintext . '<br />';
}
The echo $htmlContent line should display the page itself, perhaps with some CSS missing.
If the HTML is in place, you can try the selectors above.

Related

Fetch rss with php - Conditional for Enclosured image and not Enclosured

I'm working on a project and this is new to me. I need to fetch RSS content from websites and display the description, title and images (thumbnails). I've noticed that some feeds provide thumbnails in an enclosure tag and others don't. I have the code for both cases, but I need to understand how to create a conditional like:
If the rss returns enclosure image { Do something }
Else { get the common thumb }
Here follow the code that grab the images:
ENCLOSURE TAG IMAGE:
if ($enclosure = $block->get_enclosure())
{
echo "<img src=\"" . $enclosure->get_link() . "\">";
}
NOT ENCLOSURE:
if ($enclosure = $block->get_enclosure())
{
echo '<img src="'.$enclosure->get_thumbnail().'" title="'.$block->get_title().'" width="200" height="200">';
}
=================================================================================================
PS: The two snippets are almost the same; the only difference is get_thumbnail versus get_link.
Is there a way I can create a conditional to use the correct code and always show the thumbnail?
Thanks, everyone, in advance!
EDITED
Here is the full code i have right now:
include_once(ABSPATH . WPINC . '/feed.php');
if(function_exists('fetch_feed')) {
$feed = fetch_feed('http://feeds.bbci.co.uk/news/world/africa/rss.xml'); // this is the external website's RSS feed URL
if (!is_wp_error($feed)) : $feed->init();
$feed->set_output_encoding('UTF-8'); // this is the encoding parameter, and can be left unchanged in almost every case
$feed->handle_content_type(); // this double-checks the encoding type
$feed->set_cache_duration(21600); // 21,600 seconds is six hours
$feed->handle_content_type();
$limit = $feed->get_item_quantity(18); // fetches the 18 most recent RSS feed stories
$items = $feed->get_items(0, $limit); // this sets the limit and array for parsing the feed
endif;
}
$blocks = array_slice($items, 0, 3); // Only the first three items (indexes 0 to 2) are displayed here
foreach ($blocks as $block) {
//echo $block->get_date("m d Y");
echo '<div class="single">';
if ($enclosure = $block->get_enclosure())
{
echo '<img class="image_post" src="'.$enclosure->get_link().'" title="'.$block->get_title().'" width="150" height="100">';
}
echo '<div class="description">';
echo '<h3>'. $block->get_title() .'</h3>';
echo '<p>'.$block->get_description().'</p>';
echo '</div>';
echo '<div class="clear"></div>';
echo '</div>';
}
And here are the XML pieces with 2 different tags for images:
Using Thumbnails: view-source:http://feeds.bbci.co.uk/news/world/africa/rss.xml
Using Enclosure: http://feeds.news24.com/articles/news24/SouthAfrica/rss
Is there a way I can create a conditional to use the correct code and always show the thumbnail?
Sure there is. You haven't said what is blocking you, so I have to guess at the reason, and I can imagine several.
Is the problem a decision with more than two alternatives?
You already handle the case of a feed item either having no image or having one:
if ($enclosure = $block->get_enclosure())
{
echo '<img class="image_post" src="'.$enclosure->get_link().'" title="'.$block->get_title().'" width="150" height="100">';
}
In your current scenario there is only one additional alternative, which makes three in total, because the enclosure may carry a thumbnail instead of a link:
No image (no enclosure)
Image from link (enclosure with link)
Image from thumbnail (enclosure with thumbnail)
And you don't know how to turn that into a decision. This is basically what else-if is for:
if (!$enclosure = $block->get_enclosure())
{
echo "no enclosure: ", "-/-", "\n";
} elseif ($enclosure->get_link()) {
echo "enclosure link: ", $enclosure->get_link(), "\n";
} elseif ($enclosure->get_thumbnail()) {
echo "enclosure thumbnail: ", $enclosure->get_thumbnail(), "\n";
}
This produces the output directly in each branch. However, if you assign the image URL to a variable instead, you can decide on the output later:
$image = NULL;
if (!$enclosure = $block->get_enclosure())
{
// nothing to do
} elseif ($enclosure->get_link()) {
$image = $enclosure->get_link();
} elseif ($enclosure->get_thumbnail()) {
$image = $enclosure->get_thumbnail();
}
if (isset($image)) {
// display image
}
And if you then move this more or less complex decision into a function of its own, the code becomes even easier to read:
$image = feed_item_get_image($block);
if (isset($image)) {
// display image
}
This works quite well until the decision becomes even more complex, but that would be out of scope for an answer on Stack Overflow.
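The answer leaves feed_item_get_image() undefined. Here is a minimal sketch of what it might look like, assuming the item is a SimplePie item as returned by fetch_feed(). Only the function name comes from the answer; the body is an illustration built from the else-if chain shown earlier:

```php
<?php
/**
 * Return an image URL for a feed item, or null if none is available.
 * Hypothetical helper: expects an object offering get_enclosure(),
 * whose result offers get_link() and get_thumbnail().
 */
function feed_item_get_image($item)
{
    $enclosure = $item->get_enclosure();
    if (!$enclosure) {
        return null; // no enclosure at all
    }
    if ($enclosure->get_link()) {
        return $enclosure->get_link(); // enclosure with link
    }
    if ($enclosure->get_thumbnail()) {
        return $enclosure->get_thumbnail(); // enclosure with thumbnail
    }
    return null; // enclosure present but no usable image
}
```

Because the function returns null when there is nothing to show, the isset($image) check from the answer keeps working unchanged.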

Get title and meta tags from Facebook page with PHP

How can I get the title and all the meta tags from a Facebook page?
I have this code:
function getMetaData($url){
// get meta tags
$meta=get_meta_tags($url);
// store page
$page=file_get_contents($url);
// find where the title CONTENT begins
$titleStart=strpos($page,'<title>')+7;
// find how long the title is
$titleLength=strpos($page,'</title>')-$titleStart;
// extract title from $page
$meta['title']=substr($page,$titleStart,$titleLength);
// return array of data
return $meta;
}
$tags=getMetaData('https://www.facebook.com/showmeapp?filter=3');
echo 'Title: '.$tags['title'];
echo '<br />';
echo 'Description: '.$tags['description'];
echo '<br />';
echo 'Keywords: '.$tags['keywords'];
Example: https://www.facebook.com/showmeapp?filter=3
You really don't need to deal with the meta tags; just call the Graph API:
$data = file_get_contents('https://graph.facebook.com/bladauhu'); //example page
Works even without an App, for every public page.
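The Graph API replies with JSON, so the string still needs decoding. Here is a small sketch, assuming the reply is a JSON object with a name field. The payload below is made up and stands in for the contents of $data:

```php
<?php
// Made-up payload standing in for the Graph API response body.
$data = '{"id":"123456789","name":"Example Page"}';

// Decode into an associative array; json_decode() returns null on invalid JSON.
$page = json_decode($data, true);

if (!is_array($page)) {
    echo "Response was not valid JSON\n";
} else {
    echo 'Title: ', $page['name'] ?? '(no name field)', "\n";
}
```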
I think the URL is not correct. Change your URL to https://graph.facebook.com/showmeapp
Use Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Try below code:
<?php
require_once('simple_html_dom.php');
$html = file_get_html('https://www.facebook.com/showmeapp');
$title= $html->find('meta[name="title"]',0)->content;
echo 'Title: '.$title ;
echo '<br />';
$description= $html->find('meta[name="description"]',0)->content;
echo 'Description: '.$description;
echo '<br />';
$keywords= $html->find('meta[name="keywords"]',0)->content;
echo 'Keywords: '.$keywords;
?>

php code to extract all text links not image link

I want to extract all text links from a web page using the simplehtmldom class, but I don't want image links.
<?
foreach($html->find('a[href]') as $element)
echo $element->href . '<br>';
?>
The code above shows all anchor links that have an href attribute.
<a href="/contact">contact</a>
<a href="/about">about</a>
<a href="/home"><img src="logo.png" /></a>
I want only /contact and /about, not /home, because it contains an image instead of text.
<?php
foreach($html->find('a[href]') as $element)
{
// Skip anchors with no visible text (for example, those wrapping only an image)
if (trim($element->plaintext) === '')
continue;
echo $element->href . '<br>';
}
?>
<?php
foreach($html->find('a[href]') as $element){
// Test the anchor's inner markup, not its href, for an <img> tag
if(!preg_match('%<img%i', $element->innertext)){
echo $element->href . '<br>';
}
}
?>
It is possible to do that with a CSS selector, for example in phpQuery:
$html->find('a:not(:has(img))')
This is not a feature that is likely to ever come to Simple HTML DOM, though.
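The same filter can be written as an XPath query. Here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM or phpQuery. The markup mirrors the three links from the question:

```php
<?php
// Sample markup: two text links and one link that wraps only an image.
$markup = '<a href="/contact">contact</a>'
        . '<a href="/about">about</a>'
        . '<a href="/home"><img src="logo.png" /></a>';

$doc = new DOMDocument();
@$doc->loadHTML($markup); // @ hides warnings about imperfect HTML

$xpath = new DOMXPath($doc);

// Anchors that have an href and contain no <img> descendant.
$textLinks = [];
foreach ($xpath->query('//a[@href][not(.//img)]') as $anchor) {
    $textLinks[] = $anchor->getAttribute('href');
}

echo implode("\n", $textLinks), "\n";
```

This excludes any anchor containing an image, even one that also has text. If mixed links should be kept, the plaintext check from the first answer is the closer fit.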

How to get page title in php?

I have this function to get title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function makes my page take too long to respond. Someone told me to get the title from the request headers of the website only, which won't read the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use the DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); //put url or filename
$title = $html->find('title', 0); // index 0 returns the first match instead of an array
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find the title element(s)
foreach($html->find('title') as $element)
echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';
$oHtml = file_get_html($url); // $url holds the page address
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;
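On the speed concern in the question: the title lives in the HTML body, not in the HTTP headers, so some of the document has to be read either way. One option is to cap how much is downloaded. This is a rough sketch that assumes the title appears within the first 16 KB of the page. extractTitle() and fetchHead() are hypothetical helper names, and the regex approach is kept from the question for brevity despite the caveat above:

```php
<?php
// Hypothetical helper: pull the <title> text out of a (possibly truncated) HTML string.
function extractTitle(string $html): ?string
{
    // i = case-insensitive, s = dot also matches newlines, .*? = stop at the first </title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Hypothetical helper: read at most $maxBytes from the start of $url.
function fetchHead(string $url, int $maxBytes = 16384): string
{
    // The fifth argument of file_get_contents() limits how many bytes are read.
    $data = @file_get_contents($url, false, null, 0, $maxBytes);
    return $data === false ? '' : $data;
}

// Demonstration on a local string, so the sketch runs without network access.
$sample = "<html><head><TITLE>\n Example &amp; Co </TITLE></head><body></body></html>";
echo extractTitle($sample), "\n";
```

For a real page the call would be extractTitle(fetchHead($Url)). If the title sits beyond the byte limit the function returns null, so the limit is a trade-off between speed and coverage.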

Simple HTML DOM Parser error handling

I'm using the SimpleHTMLDOM parser to scrape a website and I would like to know if there's any error handling method. For example, if the link is broken there is no point in continuing and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
{
if(empty($element->href))
{
continue; //will skip <a> without href
}
echo $element->href . "<br>\n";
}
?>
Perhaps a loop with continue?
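If the goal is to stop before parsing when a link is broken, one option is to fetch the page yourself and check the result first. This minimal sketch assumes the failure shows up as file_get_contents() returning false or an empty body. fetchPage() is a hypothetical helper name, and a missing local file stands in for a broken link so the example runs offline:

```php
<?php
// Hypothetical helper: return the page body, or null when the fetch fails.
function fetchPage(string $url): ?string
{
    // file_get_contents() returns false on failure; @ silences the warning.
    $body = @file_get_contents($url);
    if ($body === false || trim($body) === '') {
        return null;
    }
    return $body;
}

// A path that does not exist, standing in for a broken link.
$body = fetchPage(__DIR__ . '/gr-does-not-exist-12345.html');

if ($body === null) {
    echo "Could not load the page, skipping.\n";
} else {
    // Only now hand the markup to the parser, e.g. $html = str_get_html($body);
    echo 'Fetched ', strlen($body), " bytes.\n";
}
```

The parser is only invoked in the else branch, so find() is never called on a failed load.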
