I am experimenting with website-scraping techniques. For the example link, the description always comes back empty. The reason is that it is populated by JS with the following code. How do we handle these kinds of scenarios?
// Frontend JS
P.when('DynamicIframe').execute(function (DynamicIframe) {
    var BookDescriptionIframe = null,
        bookDescEncodedData = "book desc data",
        bookDescriptionAvailableHeight,
        minBookDescriptionInitialHeight = 112,
        options = {},
        iframeId = "bookDesc_iframe";
I am using PHP's DOMXPath as below:
$file = 'sample.html';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
// I am saving the returned html to a file and reading the file.
@$dom->loadHTMLFile($file);
$xpath = new DOMXPath($dom);
// This XPath works in the Chrome console, but not here,
// because the content is dynamically created via JS
$desc = $xpath->query('//*[@id="bookDesc_iframe"]');
Every time you see this kind of JavaScript-generated content, especially from big players like Amazon or Google, you should immediately suspect a graceful-degradation implementation.
That is, the same content is also served for environments where JavaScript doesn't run, such as the Links browser, for better browser coverage.
Look out for a <noscript> tag; you may well find one, and with it you can solve the problem.
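As a minimal sketch of that approach, the snippet below pulls the text out of a <noscript> fallback with DOMXPath. Note that the markup string and the container id are invented for illustration; the real page's structure will differ.

```php
<?php
// Hypothetical markup: a JS-rendered description container that also
// ships a <noscript> fallback (assumption for this demo).
$html = '<div id="bookDescription_feature_div">
  <noscript><div>Plain-HTML book description fallback.</div></noscript>
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings on loose HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Grab the <noscript> inside the description container
$noscript = $xpath->query('//div[@id="bookDescription_feature_div"]/noscript')->item(0);
if ($noscript !== null) {
    echo trim($noscript->textContent); // → Plain-HTML book description fallback.
}
```

Because PHP never executes scripts, the <noscript> subtree is parsed like any other markup, so a plain XPath query reaches it.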
Related
I am trying to find a way of displaying the text from a website on a different site.
I own both sites, and they both run on WordPress (I know this may make it more difficult). I just need a page that mirrors the text from the original page, and when the original page is updated, the mirror should also update.
I have some experience with PHP and HTML, and I would also rather not use JS.
I have been looking at some posts that suggest cURL and file_get_contents, but have had no luck adapting them to work with my sites.
Is this even possible?
Look forward to your answers!
Both cURL and file_get_contents() are fine for getting the full HTML output from a URL. For example, with file_get_contents() you can do it like this:
<?php
$content = file_get_contents('http://elssolutions.co.uk/about-els');
echo $content;
However, in case you need just a portion of the page, DOMDocument and DOMXPath are far better options, as with the latter you can also query the DOM. Below is a working example.
<?php
// The `id` of the node in the target document to get the contents of
$url = 'http://elssolutions.co.uk/about-els';
$id = 'comp-iudvhnkb';
$dom = new DOMDocument();
// Silence `DOMDocument` errors/warnings on html5-tags
libxml_use_internal_errors(true);
// Loading content from external url
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Querying DOM for target `id`
$xpathResultset = $xpath->query("//*[@id='$id']")->item(0);
// Getting plain html
$content = $dom->saveHTML($xpathResultset);
echo $content;
I'm trying to get an element from a website, but I can't find elements appended by JavaScript. Is there a solution for this problem?
Code here:
$dom = new Dom();
$obj = $dom->loadFromUrl($url);
$element = $obj->find('.c-payment');
echo count($element);
The result is 0, but the element exists on the website.
When you read a web page's content with PHP, you get only the static content (what the web server provides). The dynamic part of the content (generated by JavaScript) does not exist at that moment, because PHP does not execute the JavaScript code.
You can try the V8 JavaScript engine integration (the V8Js extension), but I do not think you can easily achieve what you want with it.
Maybe this will be useful for you: https://github.com/scraperlab/browserext
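The static-versus-dynamic point can be demonstrated with a self-contained sketch; the markup below is made up for illustration, with a script that would create the .c-payment element in a browser:

```php
<?php
// PHP sees only the markup the server sent; elements that a <script>
// would append later are simply not in the parsed DOM.
$html = '<html><body>
<div id="app"></div>
<script>
  // A browser would run this and create the element; PHP never will.
  var el = document.createElement("div");
  el.className = "c-payment";
  document.getElementById("app").appendChild(el);
</script>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Zero matches: .c-payment only exists after the JS runs in a browser
echo $xpath->query('//div[@class="c-payment"]')->length; // → 0
```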
I am using the Goutte Laravel library in a project to get page content and crawl it.
I can find any element of the DOM structure, except that on one of the sites I found the important content placed in a <script> tag.
The data is placed in a JavaScript variable, and I want to crawl it without heavy string operations. A typical example of such a case:
var article_content = "Details article string";
var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";
var article_images = new Array(
    "http://img.sireasas.com/?i=reuters%2f2017-03-08%2f2017-03-08t200344z_132005024_mt1aci14762686_rtrmadp_3_soccer-champions-fcb-psg_reuters.jpg", "",
    "0000000000115043", "",
    "");
Is there any way to crawl the JavaScript using selectors or DOM methods?
What I would do is get the content that exists inside the script tag and then extract whatever I want through regular expressions.
$doc = new DOMDocument();
$doc->loadHTML($yoursiteHTML);
foreach ($doc->getElementsByTagName('script') as $content) {
    // extract data from $content->textContent
}
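For instance, here is a self-contained sketch of the regex step, using the variables from the question; the HTML wrapper string is invented for the demo, and the pattern only handles simple `var name = "value";` assignments:

```php
<?php
// Demo input: the question's JS variables embedded in a page.
$yoursiteHTML = '<html><body><script>
var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";
</script></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($yoursiteHTML);
libxml_clear_errors();

$vars = [];
foreach ($doc->getElementsByTagName('script') as $script) {
    // Capture simple `var name = "value";` assignments from the script body
    if (preg_match_all('/var\s+(\w+)\s*=\s*"([^"]*)"/', $script->textContent, $m, PREG_SET_ORDER)) {
        foreach ($m as $match) {
            $vars[$match[1]] = $match[2];
        }
    }
}
echo $vars['article_twtitle']; // → Article title
```

Multi-line values like the new Array(...) call would need a more permissive pattern, but the idea is the same: treat the script body as text and let the regex do the extraction.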
Goutte only receives the HTML response and does not run JavaScript code, so it cannot get dynamic data the way a browser does.
Use PHP Simple HTML DOM Parser:
$html = file_get_html('http://www.your-link-here.com/');
// Find all script tags
foreach ($html->find('script') as $element) {
    echo $element->outertext . '<br>';
}
I am trying to retrieve the price of an Amazon product.
I tried 2 methods:
file_get_contents -> regex -> it works.
Using DOMXPath -> does not work for some reason.
I noticed that the XPath of the price differs depending on whether JavaScript is enabled or disabled.
Anyway, how can I retrieve the price using XPath?
This is what I am doing, but the code returns nothing (even though it works on other websites):
(The XPath was taken using Firebug.)
$url = 'http://www.amazon.com/dp/product/B00TRQPSXM/';
// Note: browser dev tools insert <tbody> nodes that may not exist in the raw
// HTML, so a copied path containing /tbody often fails outside the browser.
$path = '/html/body/div[3]/form/table[3]/tbody/tr[1]/td/div/table/tbody/tr[2]';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elements = $xpath->query($path);
if ($elements) {
    foreach ($elements as $element) {
        echo $element->nodeName . '<br>';
        echo $element->nodeValue . '<br>';
    }
}
Your request will be blocked after a couple of tries every time; Amazon checks for robot access. Instead of scraping their site, which by the way is against Amazon's terms of service (or whatever it's called), use their API at http://developer.amazonservices.com. You will get the price information you are after with this operation.
There is also a PHP SDK you can use.
Either way, file_get_contents() is not an option here; if you want to scrape the page, use cURL and make the request look like it comes from a regular visitor.
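A minimal cURL sketch of that idea follows. The URL and header values are illustrative; even with browser-like headers, Amazon may still block the request or serve different markup, and scraping can violate their terms of service.

```php
<?php
// Fetch a page with cURL while sending browser-like request headers.
$ch = curl_init('http://www.amazon.com/dp/product/B00TRQPSXM/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,   // follow redirects
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    CURLOPT_HTTPHEADER     => [
        'Accept: text/html,application/xhtml+xml',
        'Accept-Language: en-US,en;q=0.9',
    ],
]);
$html = curl_exec($ch); // string on success, false on failure
curl_close($ch);
```

From there, $html can be fed to DOMDocument and DOMXPath as in the snippets above.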
Does anybody have any idea how they do it? I currently use OffLiberty.com to parse Mixcloud links and get the raw MP3 URL for use in a custom HTML5 player, for iOS compatibility. I was just wondering whether anyone knew exactly how their process works, so I could create something similar that would 'cut out the middleman', so to speak; then my end users wouldn't have to go to an external site to get a link to the MP3 for the mix they want to post. Just a thought really, not terribly important if it can't be done, but it would be a nice touch :)
Does anybody have any idea?
Note that I'm against content scraping, and you should ask those websites for permission to scrape their MP3 URLs. Otherwise, if I were them, I'd block you right now and ad vitam æternam.
Anyway, you can parse their HTML using DOMDocument.
For example :
<?php
// just so you don't see parse errors
$internal_errors = libxml_use_internal_errors(true);
// initialize the document
$doc = new DomDocument();
// load a page
$doc->loadHTMLFile('http://www.mixcloud.com/LaidBackRadio/le-motel-on-the-road/');
// initialize XPATH for the document
$xpath = new DomXPath($doc);
// span with "data-preview-url" seems to contain MP3 url
// we request them inside a DomNodeList http://www.php.net/manual/en/class.domnodelist.php
$mp3 = $xpath->query('//span[@data-preview-url]');
foreach ($mp3 as $m) {
    // print the attribute value
    echo $m->attributes->getNamedItem('data-preview-url')->nodeValue . '<br/>';
}
libxml_use_internal_errors($internal_errors);