I am trying to retrieve the price of an Amazon product.
I tried 2 methods:
file_get_contents -> regex -> it works.
using DOMXPath -> does not work for some reason.
I noticed that when JavaScript is enabled, the XPath of the price differs from its XPath when JavaScript is disabled.
Anyway, how can I retrieve the price using xpath?
This is what I am doing, but the code returns nothing (even though the same code works on other websites):
(The XPath was taken using Firebug.)
$url = 'http://www.amazon.com/dp/product/B00TRQPSXM/';
$path = '/html/body/div[3]/form/table[3]/tbody/tr[1]/td/div/table/tbody/tr[2]';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXpath($dom);
$elements = $xpath->query($path);
if ($elements) {
    foreach ($elements as $element) {
        echo $element->nodeName . '<br>';
        echo $element->nodeValue . '<br>';
    }
}
Your request will be blocked after a couple of tries every time; Amazon checks for robot access. Instead of scraping their site, which by the way is against Amazon's terms of service, use their API, found at http://developer.amazonservices.com. You will get the price information you are after with this operation.
There is also a PHP SDK you can use.
Either way, file_get_contents() is not an option here. If you want to scrape the page, use cURL and make it look like a unique visitor.
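A minimal cURL sketch along those lines (the User-Agent string is an arbitrary browser-like example, and Amazon may still detect and block automated requests):

```php
<?php
// Sketch only: fetch a page with cURL while presenting browser-like
// headers. The User-Agent value is an example, not a magic bypass.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        CURLOPT_TIMEOUT        => 10,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // string on success, false on failure
}
```

You would then feed the returned HTML into DOMDocument as before, but be aware that none of this changes the terms-of-service situation.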
Related
I am playing around with website-scraping techniques. For example, for this link, it always returns empty for the description.
The reason is that it is populated by JS with the following code. How do we go about these kinds of scenarios?
// Frontend JS
P.when('DynamicIframe').execute(function(DynamicIframe){
var BookDescriptionIframe = null,
bookDescEncodedData = "book desc data",
bookDescriptionAvailableHeight,
minBookDescriptionInitialHeight = 112,
options = {},
iframeId = "bookDesc_iframe";
I am using PHP DOMXPath as below:
$file = 'sample.html';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
// I am saving the returned html to a file and reading the file.
@$dom->loadHTMLFile($file); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
// This xpath works on chrome console, but not here
// because the content is dynamically created via js
$desc = $xpath->query('//*[@id="bookDesc_iframe"]');
Whenever you see this kind of JavaScript-generated content, especially from big players like Amazon or Google, you should immediately suspect a graceful-degradation implementation.
Meaning the content is also rendered for environments where JavaScript doesn't work, such as text-mode browsers like Links, for better browser coverage.
Look out for a <noscript> tag; you may find one, and with that you can solve the problem.
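As a sketch of that idea (the markup below is a made-up stand-in for the real page, not Amazon's actual HTML):

```php
<?php
// Illustrative only: extract fallback content from a <noscript> tag.
// The element ids and text here are invented for the example.
$html = <<<HTML
<html><body>
  <div id="bookDesc_iframe"></div>
  <noscript><div id="bookDescription_feature_div">A plain-HTML book description.</div></noscript>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// DOMDocument does not execute scripts, and it parses the contents of
// <noscript> like any other markup, so the fallback is reachable directly.
$node = $xpath->query('//noscript//div[@id="bookDescription_feature_div"]')->item(0);
echo $node ? trim($node->textContent) : 'not found';
```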
I am trying to find a way of displaying the text from a website on a different site.
I own both sites, and they both run on WordPress (I know this may make it more difficult). I just need one page to mirror the text of the other, so that when the original page is updated, the mirror updates as well.
I have some experience in PHP and HTML, and I would rather not use JS.
I have been looking at some posts that suggest cURL and file_get_contents but have had no luck editing it to work with my sites.
Is this even possible?
Look forward to your answers!
Both cURL and file_get_contents() are fine for getting the full HTML output from a URL. For example, with file_get_contents() you can do it like this:
<?php
$content = file_get_contents('http://elssolutions.co.uk/about-els');
echo $content;
However, if you need just a portion of the page, DOMDocument and DOMXPath are far better options, as the latter also lets you query the DOM. Below is a working example.
<?php
// The `id` of the node in the target document to get the contents of
$url = 'http://elssolutions.co.uk/about-els';
$id = 'comp-iudvhnkb';
$dom = new DOMDocument();
// Silence `DOMDocument` errors/warnings on html5-tags
libxml_use_internal_errors(true);
// Loading content from external url
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Querying DOM for target `id`
$xpathResultset = $xpath->query("//*[@id='$id']")->item(0);
// Getting plain html
$content = $dom->saveHTML($xpathResultset);
echo $content;
I'm having an issue using XPath on the Google App Engine for PHP.
So I have the following code:
function getDataXpath($url_str, $xpath_exp_str)
{
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url_str);
libxml_use_internal_errors(false);
$xpath = new DOMXpath($doc);
$elements = $xpath->query($xpath_exp_str);
// query() returns false (not null) on a malformed expression
if ($elements !== false) {
return $elements;
}
return false;
}
And then I simply run it like this to get the nodes:
getDataXpath($url_str, $xpath_exp_str);
So on my local PHP install (v 5.5.19), when I run the following:
$url_str = 'http://www.alexa.com/topsites/category;0/Top/Shopping';
$xpath_exp_str = "//ul/li[#class='site-listing']/div/p/a";
$xpath_data = getDataXpath($url_str, $xpath_exp_str);
print_r($xpath_data);
I get the following result:
DOMNodeList Object ( [length] => 25 );
and this is correct.
However, when I run the same code on Google App Engine for PHP (v 5.5.26), I get the following:
DOMNodeList Object ( [length] => 0 );
Has anyone had this issue, and how did you fix it?
So it appears that Amazon might be blocking programmatic access to the Alexa TopSites pages. I'm actually subscribed to their new API, but it doesn't allow you to categorize responses (e.g. top e-commerce sites) like you can on the website, which is why I'm resorting to XPath.
I tried the same script on some other URLs and I didn't have any issues.
Anyway, it works when I run it locally (in the browser and on the command line), so I'll just have to skip Google App Engine for now. It breaks my workflow, especially since this was part of a much bigger automation effort, but it's out of my hands at this point.
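One thing worth trying before giving up on App Engine (purely a guess: the default PHP stream user agent may be what gets filtered) is sending a browser-like User-Agent through a libxml stream context:

```php
<?php
// Speculative workaround: give DOMDocument's URL fetching a browser-like
// User-Agent via a stream context. Whether this is what Alexa/Amazon
// filters on is an assumption, not a confirmed cause.
$context = stream_context_create([
    'http' => [
        'header'  => "User-Agent: Mozilla/5.0 (X11; Linux x86_64)\r\n",
        'timeout' => 10,
    ],
]);
libxml_set_streams_context($context);

// Demonstrated against a local file so the snippet is self-contained;
// on App Engine you would pass the Alexa URL instead.
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, '<html><body><ul><li class="site-listing"><div><p><a href="#">Example</a></p></div></li></ul></body></html>');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($file);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
$elements = $xpath->query("//ul/li[@class='site-listing']/div/p/a");
echo $elements->length; // 1
```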
How do I load a URL obtained from $_SERVER['REQUEST_URI'] into DOMDocument?
I am trying to load a dynamic webpage into DOMDocument to be parsed for certain words. Ultimately I want to create a glossary for my site (Tiki Wiki CMS). I started very simple and right now I am only trying to load a page and parse the text for testing purposes.
I am new to DOMDocument. After reading several articles on this site and in the PHP Manual, I know that I have to load an HTML page with loadHTMLFile and then parse it with getElementById or getElementsByTagName in order to do stuff with it. This works fine for static pages, but my main problem is that I cannot enter a static URL into loadHTMLFile, because parsing should be performed when the site is loaded by the user.
Here's the code that DID work:
$url = 'http://mysite.org/bbk/tiki-index.php?page=pagetext';
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
So, I thought I could use $_SERVER['REQUEST_URI'] for the job, but it did not work.
This did NOT work (no error message):
$url = $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
After checking what the $url output was, I decided to add http://mysite.org to it to make it identical to the url that worked. However, no luck either and this time I got an internal server error.
This did NOT work either (Internal Server Error):
$url = 'http://mysite.org' . $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
I think I am missing something substantial here. I thought it might simply not be possible to use DOMDocument in this way, so I searched the web again for whether $_SERVER['REQUEST_URI'] can be used in combination with DOMDocument at all, but I didn't find an answer. I hope somebody here can help. Any suggestions, including third-party parsers, would be helpful, except anything that requires parsing with regex. Tiki Wiki CMS already has a glossary option done with regex, but it is very buggy.
Thanks.
UPDATE
I haven't found an answer to the problem, but I think I know where my mistake was. I was expecting $_SERVER['REQUEST_URI'] to run on a dynamic page that was not completely built yet. I ran the script on the main setup page, so I guess the HTML was not rendered yet when I tried to point $_SERVER['REQUEST_URI'] to it. When I noticed that this might be the problem, I abandoned the idea of parsing the document with DOMDocument and used a JavaScript solution that can be loaded after the document is ready.
I can think of two things you can do (they probably won't solve your problem directly, but will help you greatly in solving it):
$_SERVER['REQUEST_URI'] may not contain what you think it does. Try echoing or var_dumping it, and see whether it matches your expectations.
Enable error reporting. The reason you are seeing a generic 500 error page is that error reporting is disabled; enable it using error_reporting().
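A typical development-time setup for the second point (note that display_errors is needed in addition to error_reporting() for errors to actually appear in the response):

```php
<?php
// Development only: error_reporting() sets which errors are collected,
// while display_errors decides whether they reach the output.
// Never leave display_errors on in production.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Then sanity-check the superglobal itself:
var_dump($_SERVER['REQUEST_URI'] ?? '(not set)');
```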
Also note that DOMDocument only parses HTML. If DOM nodes are generated and added to the page by a client-side language, or by CSS pseudo-elements, they won't show up unless you also deploy a JS/CSS engine (which is not trivial).
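A tiny illustration of that point, using made-up markup:

```php
<?php
// DOMDocument never executes scripts, so nodes a browser would create
// via JS are simply absent from the parsed tree.
$html = '<html><body><div id="static">here</div>'
      . '<script>var d = document.createElement("div"); d.id = "dynamic"; document.body.appendChild(d);</script>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

echo $xpath->query('//div[@id="static"]')->length;  // 1
echo $xpath->query('//div[@id="dynamic"]')->length; // 0
```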
I need to store the XML that I get from Google Analytics; its format is an XML file. I need to create a PHP script that will read the XML file from Google Analytics and store it on my server under a user-defined name. I tried this:
<?php
$dom = new DOMDocument();
$dom->load('https://www.google.com/analytics/reporting/export?fmt=1&id=346044461&pdr=20100611-20100711&cmp=average&rpt=DashboardReport');
$dom->save('books3.xml');
?>
Can you help me?
DOMDocument::load() can silently fail here (for example if URL wrappers are disabled, or if Google rejects the request), and you never check whether the load actually succeeded before saving.
You'd need something more along the lines of:
<?php
$remoteUri = 'https://www.google.com/analytics/reporting/export?...';
$doc = new DOMDocument();
$doc->loadXML(file_get_contents($remoteUri));
$xml = $doc->saveXML($doc->documentElement);
file_put_contents($yourLocalFilePath, $xml);
or if you just want a completely verbatim copy locally:
<?php
$remoteUri = ...
file_put_contents($yourLocalFilePath, file_get_contents($remoteUri));
The second, simpler version doesn't attempt to parse any XML and will therefore have no clue whether something is wrong with the received document.
Depending on your server, you might have to resort to more complex methods of getting the file, for example if URL wrappers for fopen aren't enabled, or if your Google endpoint wants to use cookies.
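A hedged sketch of such a fallback using cURL, which works even with allow_url_fopen disabled (the Analytics URL is the placeholder from the question; cookie/auth handling is left out):

```php
<?php
// Fallback when URL wrappers for fopen aren't enabled: let cURL stream
// the response body straight into a local file.
function downloadToFile($uri, $localPath)
{
    $fp = fopen($localPath, 'w');
    $ch = curl_init($uri);
    curl_setopt_array($ch, [
        CURLOPT_FILE           => $fp,   // write the body directly to the file
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 15,
    ]);
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok; // true on success, false on failure
}

// Usage (URL elided as in the question):
// downloadToFile('https://www.google.com/analytics/reporting/export?...', 'books3.xml');
```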