I'm trying to load an HTML page from an Amazon URL and extract the product price with a simple PHP function in Yii.
I start by fetching the entire page with the PHP function file_get_contents, and then extract only the price from the HTML with DOM.
I'm using a DOM parser to read the HTML. It has convenient functions for reading the tags of an HTML file. This is the parser:
http://simplehtmldom.sourceforge.net/
The URL that PHP analyzes can be from amazon.com, amazon.co.uk, amazon.it, etc. In the future this feature will also be used to analyze URLs other than Amazon.
I created a simple function that extracts the price from a URL; here it is:
public function findAmazonPriceFromUrl($url) {
    Yii::import('ext.HtmlDOMParser.*');
    require_once('simple_html_dom.php');
    $html = file_get_html($url);
    $item = $html->getElementsById('actualPriceValue');
    if ($item) {
        $price = $item[0]->firstChild()->innertext;
    } else {
        $item = $html->getElementsById('current-price');
        $price = $item[0]->innertext;
    }
    return $price;
}
The file_get_html function is the following:
function file_get_html($url) {
    $dom = new simple_html_dom();
    $contents = file_get_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE) {
        return false;
    }
    $dom->load($contents);
    return $dom;
}
I noticed that after a few requests (various links), I always get an error from the server (Error 500). I checked my Apache log file, but everything looks fine.
Could Amazon be blocking my requests after a while? How can I fix it?
Thanks in advance for the help.
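(For what it's worth, a quick way to check whether Amazon is actually rejecting the requests is to look at what file_get_contents() returns and at the status line PHP records in $http_response_header. A minimal diagnostic sketch, not part of the original function, assuming allow_url_fopen is enabled:)
// Diagnostic sketch: see what the remote server actually answered.
$contents = @file_get_contents($url);
if ($contents === false) {
    // $http_response_header is filled in by PHP's HTTP wrapper after the call.
    $status = isset($http_response_header[0]) ? $http_response_header[0] : 'no HTTP response';
    echo 'Request failed: ' . $status;
}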
I had the same problem, and this is my fix: I run the script again if the image is not parsed. The image is parsed first in my PHP script, so I check whether it works and whether Amazon returns the information. I hope it helps.
if ($html->find('#main-image')) {
    foreach ($html->find('#main-image') as $e) {
        echo '<span href="'. $e->src . '" class="imgblock parseimg">
                  <img src="'. $e->src . '" class="resultimg" alt="'.$name.'" title="'.$name.'">
              </span>
              <input type="hidden" name="my-item-img" value="'. $e->src . '" />';
    }
} else {
    gethtml($url, $domain);
    die;
}
I am new to scraping websites, and I was interested in getting the ticket prices from this website:
https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495
I can see the ticket prices in the p#price-selected-label.filters-selected-label tag, but I can't seem to access it. I tried a few things and looked at a few tutorials, but I either get a blank result or some error. The code is based on http://blog.endpoint.com/2016/07/scrape-web-content-with-php-no-api-no.html
<?php
require('simple_html_dom.php');

// Create DOM from URL or file
$html = file_get_html('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495');

// creating an array of elements
$videos = [];

// Find top ten videos
$i = 1;

$videoDetails = $html->find('p#price-selected-label.filters-selected-label')->innertext;
// $videoDetails = $html->find('p#price-selected-label.filters-selected-label', 0);
echo $videoDetails;
/*
foreach ($html->find('li.expanded-shelf-content-item-wrapper') as $video) {
    if ($i > 10) {
        break;
    }
    // Find item link element
    $videoDetails = $video->find('a.yt-uix-tile-link', 0);
    // get title attribute
    $videoTitle = $videoDetails->title;
    // get href attribute
    $videoUrl = 'https://youtube.com' . $videoDetails->href;
    // push to a list of videos
    $videos[] = [
        'title' => $videoTitle,
        'url' => $videoUrl
    ];
    $i++;
}
var_dump($videos);
*/
You can't get it because JavaScript renders it, so it's not available in the original HTML that your library fetches.
Use PhantomJS (it will execute the JavaScript):
Download PhantomJS and place the executable in a path that your PHP binary can reach.
Place the following 2 files in the same directory:
get-website.php
<?php
$phantom_script = dirname(__FILE__) . '/get-website.js';
$response = exec('phantomjs ' . $phantom_script);
echo htmlspecialchars($response);
?>
get-website.js
var webPage = require('webpage');
var page = webPage.create();

page.open('https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495', function(status) {
    if (status === "success") {
        page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
            // Run inside the page context, where the injected jQuery is available.
            var priceText = page.evaluate(function() {
                return $('p#price-selected-label.filters-selected-label').text();
            });
            console.log(priceText);
            phantom.exit(); // exit only after the callback has run
        });
    } else {
        phantom.exit();
    }
});
Browse to get-website.php and the contents of the target site, https://www.cheaptickets.com/events/tickets/firefly-music-festival-4-day-pass-2867495, will be returned after the inline JavaScript executes. You can also call this from the command line using php /path/to/get-website.php.
My scraping code works for just about every site I've come across while testing... except for nytimes.com articles. I use AJAX with the following PHP code (I've left out some details to focus on my specific problem):
$link = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$article = new DOMDocument;
$article->loadHTMLFile($link);
//generate image array
$images = $article->getElementsByTagName("img");
foreach ($images as $image) {
$source = $image->getAttribute("src");
echo '<img src="' . $source . '" alt="alt"><br><br>';
}
My problem is that the main images on nytimes pages don't even seem to get picked up by getElementsByTagName. Pinterest manages to scrape the main images from this site, for example: http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp, whereas I cannot. Any suggestions?
OK, so this is what I have tried so far, as I found your question interesting.
When I run this in the browser console using jQuery, I do get results for the images. My query was:
var a = new Array();
$('img[src]').each(function() { a.push($(this).attr('src')); });
console.log(a);
Also see the screenshot of the results.
Note that console.log(arrayname) works in the Chrome browser.
So ideally your code should work. Please consider adding an is_null check like I've done.
Below is the code where I try loading the URL using a different approach (perhaps a better one too) and get to the root cause of why you only get the single NYT logo image.
The resulting HTML screenshot is attached.
<?php
$html = file_get_contents("http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp");
echo $html;

$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>"); // @ suppresses malformed-markup warnings

$xpath = new DOMXpath($doc);
$images = $xpath->query("//*/img");
if (!is_null($images)) {
    echo $images->length; // DOMNodeList exposes the number of matches via ->length
    foreach ($images as $image) {
        $source = $image->getAttribute('src');
        echo '<img src="' . $source . '" alt="alt"><br><br>';
    }
}
?>
You can't get the content via the feed unless you are authenticated.
You can try one of the following:
Use the context parameter of the file_get_contents method (see the sketch below).
Consume the RSS/ATOM feeds of the article.
Download the page as HTML yourself and then load it with file_get_contents. PS: this works.
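For the context-parameter suggestion, a minimal sketch: the User-Agent string and timeout below are arbitrary examples, not values the site is known to require, and $link is the URL variable from the question's code.
// Build a stream context so file_get_contents() sends browser-like headers.
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'GET',
        'header'  => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\nAccept: text/html\r\n",
        'timeout' => 10,
    ),
));
$html = file_get_contents($link, false, $context);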
I am trying to get the title of a website. This code works perfectly on my computer, but on the server it does not run smoothly: the server cannot fetch the URL content, while on my computer it redirects without a problem.
<?php
ini_set('max_execution_time', 300);
$url = "http://www.cricinfo.com/ci/engine/match/companion/597928.html";
if (strpos($url, "companion") !== false) {
    $url = str_replace("/companion", "", $url);
}
$html = file_get_contents($url);
echo $html;

// parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

// get and display what you need:
$title = $nodes->item(0)->nodeValue;
$msg1 = current(explode("|", $title));
$msg = rawurlencode($msg1);
echo $msg;
if (empty($msg)) {
    echo "no data to send";
} else {
    header("Location:fullonsms.php?msg=" . $msg);
}
exit();
?>
The output on the server is here: http://sendmysms.bugs3.com/cricket/fetch.php
It appears that the fopen wrappers aren't enabled. As you can see in the notes section of the PHP docs for file_get_contents, allow_url_fopen must be set to true in order to open a URL with file_get_contents. Try running the following on the server to see if you can use file_get_contents with a URL:
echo "urls ";
echo (ini_get('allow_url_fopen')) ? "allowed" : "not allowed";
echo " in file_get_contents.";
If that says 'urls not allowed in file_get_contents' then you'll need to update the setting via the php.ini, a .htaccess file, apache config, or some such equivalent. That is, if you would like to continue using file_get_contents to access the url. Another option is to use curl if you have the php curl extension installed.
P.S. I know this is a problem with the call to file_get_contents, since you can see that his script echoes the $html variable after he sets it. His link to his script on the server doesn't output any HTML, which tells me this is an issue with grabbing the HTML rather than with the HTML parser.
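For the curl option mentioned above, a minimal sketch, assuming the curl extension is installed; $url is the variable from the question's script.
// Fetch the page with curl instead of file_get_contents().
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects, if any
$html = curl_exec($ch);
if ($html === false) {
    echo 'curl error: ' . curl_error($ch);
}
curl_close($ch);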
I have this function to get the title of a website:
function getTitle($Url) {
    $str = file_get_contents($Url);
    if (strlen($str) > 0) {
        preg_match("/\<title\>(.*)\<\/title\>/", $str, $title);
        return $title[1];
    }
}
However, this function makes my page take too long to respond. Someone told me to get the title from the website's response headers only, so that the whole file isn't read, but I don't know how. Can anyone please tell me which code and functions I should use to do this? Thank you very much.
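For reference, the title lives in the HTML body rather than in the HTTP headers, so a header-only request can't return it; what you can do is stop downloading as soon as </title> appears. A rough sketch only: the function name and size limits are arbitrary, and it assumes allow_url_fopen is enabled.
// Stream the page in chunks and stop once the closing </title> tag shows up.
function getTitlePartial($url, $chunkSize = 1024, $maxBytes = 65536) {
    $handle = @fopen($url, 'r');
    if ($handle === false) {
        return null;
    }
    $buffer = '';
    while (!feof($handle) && strlen($buffer) < $maxBytes) {
        $buffer .= fread($handle, $chunkSize);
        if (stripos($buffer, '</title>') !== false) {
            break; // we have read enough of the page to extract the title
        }
    }
    fclose($handle);
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $buffer, $m)) {
        return trim($m[1]);
    }
    return null;
}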
Using a regex is not a good idea for HTML; use the DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); // put a URL or filename here
$title = $html->find('title', 0); // find() returns an array of matches, so take the first one
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Print the title text
foreach ($html->find('title') as $element)
    echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this; it should work:
include_once 'simple_html_dom.php';

$oHtml = str_get_html(file_get_contents($url)); // str_get_html() expects an HTML string, not a URL
$Title = array_shift($oHtml->find('title'))->innertext;
$Description = array_shift($oHtml->find("meta[name='description']"))->content;
$keywords = array_shift($oHtml->find("meta[name='keywords']"))->content;

echo $Title;
echo $Description;
echo $keywords;
I'm looking to create a PHP script where a user provides a link to a webpage, and the script gets the contents of that webpage and, based on its contents, parses them.
For example, if a user provides a YouTube link:
http://www.youtube.com/watch?v=xxxxxxxxxxx
Then, it will grab the basic information about that video (thumbnail, embed code?)
Or they might provide a vimeo link:
http://www.vimeo.com/xxxxxx
Or even if they were to provide any link, without a video attached, such as:
http://www.google.com/
And it could grab just the page Title or some meta content.
I'm thinking I'd have to use file_get_contents, but I'm not exactly sure how to use it in this context.
I'm not looking for someone to write the entire code, but perhaps provide me with some tools so that I can accomplish this.
You can use either the curl extension or the http library. You send an HTTP request, and the library lets you read the information from the HTTP response.
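As a rough illustration of the branching the question describes, you can inspect the host with parse_url() before deciding how to handle the link; the host checks and the $userUrl variable below are just examples.
// Decide how to treat a user-supplied link based on its host (illustrative only).
$host = parse_url($userUrl, PHP_URL_HOST);
if ($host && stripos($host, 'youtube.com') !== false) {
    // video link: fetch the thumbnail / embed code for the video
} elseif ($host && stripos($host, 'vimeo.com') !== false) {
    // same idea for Vimeo
} else {
    // anything else: fall back to grabbing the page <title> and meta tags
    $page = file_get_contents($userUrl);
}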
I know this question is quite old, but I'll answer just in case someone hits it looking for the same thing.
Use oEmbed (http://oembed.com/) for YouTube, Vimeo, Wordpress, Slideshare, Hulu, Flickr and many other services. If not in the list or you want to make it more precise, you can use this:
http://simplehtmldom.sourceforge.net/
It's a sort of jQuery for PHP, meaning you can use HTML selectors to get portions of the code (i.e.: all the images, get the contents of a div, return only text (no HTML) contents of a node, etc).
You could do something like this (could be done more elegantly but this is just an example):
require_once("simple_html_dom.php");
function getContent ($item, $contentLength)
{
$raw;
$content = "";
$html;
$images = "";
if (isset ($item->content) && $item->content != "")
{
$raw = $item->content;
$html = str_get_html ($raw);
$content = str_replace("\n", "<BR /><BR />\n\n", trim($html->plaintext));
try
{
foreach($html->find('img') as $image) {
if ($image->width != "1")
{
// Don't include images smaller than 100px height
$include = false;
$height = $image->width;
if ($height != "" && $height >= 100)
{
$include = true;
}
/*else
{
list($width, $height, $type, $attr) = getimagesize($image->src);
if ($height != "" && $height >= 100)
$include = true;
}*/
if ($include == true)
{
$images = $images . '<div class="theImage"><img src="'.$image->src.'" alt="'.$image->alt.'" class="postImage" border="0" /></div>';
}
}
}
}
catch (Exception $e) {
// Do nothing
}
$images = '<div id="images">'.$images.'</div>';
}
else
{
$raw = $item->summary;
$content = str_get_html ($raw)->plaintext;
}
return (substr($content, 0 , $contentLength) . (strlen ($content) > $contentLength ? "..." : "") . $images);
}
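For the oEmbed route mentioned at the top of this answer, a minimal sketch; the YouTube endpoint and the field names below follow the oEmbed spec, but treat the exact URL and response shape as assumptions to verify.
// Ask the provider's oEmbed endpoint for title / thumbnail / embed HTML.
$videoUrl = 'http://www.youtube.com/watch?v=xxxxxxxxxxx';
$endpoint = 'https://www.youtube.com/oembed?format=json&url=' . urlencode($videoUrl);
$json = @file_get_contents($endpoint);
if ($json !== false) {
    $data = json_decode($json, true);
    // Typical oEmbed fields: title, thumbnail_url, html (the ready-made embed code).
    echo $data['title'] . ' - ' . $data['thumbnail_url'];
}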
file_get_contents() would work in this case, assuming that you have allow_url_fopen set to true in your php.ini. What you would do is something like:
$pageContent = @file_get_contents($url);
if ($pageContent) {
preg_match_all('#<embed.*</embed>#', $pageContent, $matches);
$embedStrings = $matches[0];
}
That said, file_get_contents() won't give you much in the way of error handling other than receiving the content on success or false on failure. If you would like richer control over the request and access to the HTTP response codes, use the curl functions and, in particular, curl_getinfo() to look at the response codes, MIME types, encoding, etc. Once you get the content via either curl or file_get_contents(), your code for parsing it to look for the HTML of interest will be the same.
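A small sketch of that curl route, checking the response code and content type with curl_getinfo() before parsing; the 200/text-html checks are just one reasonable policy, and $url stands for whatever link the user supplied.
// Fetch with curl and inspect the response before handing it to the parser.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$pageContent = curl_exec($ch);
$httpCode    = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

if ($httpCode === 200 && strpos((string) $contentType, 'text/html') !== false) {
    // safe to pass $pageContent to the HTML parser
}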
Maybe Thumbshots or Snap already have some of the functionality you want?
I know that's not exactly what you are looking for, but at least for the embedded stuff it might be handy. Also, txwikinger already answered your other question, but maybe that helps you anyway.