Crawl the site and get data from HTML string - php

I am using the Goutte library in a Laravel project to fetch page content and crawl it.
I can find any element of the DOM structure, except that on one site the important content is placed inside a <script> tag.
The data sits in JavaScript variables, and I want to extract it without heavy string operations. A typical example of such a case:
$html = 'var article_content = "Details article string";
var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";
var article_images = new Array (
"http://img.sireasas.com/?i=reuters%2f2017-03-08%2f2017-03-08t200344z_132005024_mt1aci14762686_rtrmadp_3_soccer-champions-fcb-psg_reuters.jpg","",
"0000000000115043","",
"");';
Is there any way to crawl the JavaScript using selectors or DOM methods?

What I would do is get the content inside the <script> tag and then extract whatever I wanted with regular expressions.
$doc = new DOMDocument();
$doc->loadHTML($yoursiteHTML);
foreach ($doc->getElementsByTagName('script') as $content) {
    // extract data from $content->nodeValue
}
Goutte only receives the HTML response and does not run JavaScript code to fetch dynamic data the way a browser does.
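Since the variables are plain text inside the script node, a couple of regular expressions are usually enough once you have its nodeValue. A minimal sketch (the helper extractJsVar() and the sample string are mine, not from the question):

```php
<?php
// Sample script content, as it would come from $content->nodeValue
$js = 'var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";';

// Capture the double-quoted string assigned to a given JS variable
function extractJsVar(string $js, string $name): ?string
{
    $pattern = '/var\s+' . preg_quote($name, '/') . '\s*=\s*"([^"]*)"/';
    return preg_match($pattern, $js, $m) ? $m[1] : null;
}

echo extractJsVar($js, 'article_twtitle'); // Article title
```

This breaks down for values that contain escaped quotes or span complex expressions; for anything beyond simple string assignments, a real JS parser would be safer.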

Use PHP Simple HTML DOM Parser (requires including simple_html_dom.php):
$html = file_get_html('http://www.your-link-here.com/');
// Find all script tags
foreach ($html->find('script') as $element)
    echo $element->outertext . '<br>';

Related

How to get dynamically created html <audio> tag in php

I am trying to read the HTML <audio> tag in PHP, but it is created dynamically.
This is the URL I'm using to read it:
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName('audio')) as $node) {
$this->printnode($node);
}
In the printnode() function it shows that no <audio> tag exists, because it is created dynamically.
After seeing the structure: yes, the URL for the actual audio is loaded dynamically via JS.
But the audio playlist data is still visible. Use that:
$xpath = new DOMXPath($dom);
$playlist_data = $xpath->evaluate('string(//script[@id="playlist-data"])');
$data = json_decode($playlist_data, true);
echo $data['audio'];
It's inside another <script> tag, in JSON string format. So basically, access this node and get its value as a string. You'll get the JSON string; load it into json_decode() as usual, and the parser will return an array. Then access the audio URL like any normal array element.
Sidenote: I used XPath out of personal preference; you can use:
$playlist_data = $dom->getElementById('playlist-data')->nodeValue;
if you prefer.
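Put together as a runnable sketch (the markup and the JSON payload here are made up to mirror the page described in the answer):

```php
<?php
// Stand-in for the fetched page: playlist data in a JSON <script> tag
$html = '<html><body>
<script id="playlist-data" type="application/json">{"audio":"http://example.com/episode.mp3"}</script>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings about non-standard markup

// Pull the script tag's text content and decode the JSON inside it
$xpath = new DOMXPath($dom);
$playlist_data = $xpath->evaluate('string(//script[@id="playlist-data"])');
$data = json_decode($playlist_data, true);

echo $data['audio']; // http://example.com/episode.mp3
```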

Pull content from one wordpress site to another wordpress site

I am trying to find a way of displaying the text from one website on a different site.
I own both sites, and they both run on WordPress (I know this may make it more difficult). I just need one page to mirror the text from the other, so that when the original page is updated, the mirror updates too.
I have some experience in PHP and HTML, and I would rather not use JS.
I have been looking at some posts that suggest cURL and file_get_contents() but have had no luck adapting them to my sites.
Is this even possible?
Looking forward to your answers!
Both cURL and file_get_contents() are fine for getting the full HTML output from a URL. For example, with file_get_contents() you can do it like this:
<?php
$content = file_get_contents('http://elssolutions.co.uk/about-els');
echo $content;
However, if you need just a portion of the page, DOMDocument and DOMXPath are far better options, as the latter also lets you query the DOM. Below is a working example.
<?php
// The `id` of the node in the target document to get the contents of
$url = 'http://elssolutions.co.uk/about-els';
$id = 'comp-iudvhnkb';
$dom = new DOMDocument();
// Silence `DOMDocument` errors/warnings on html5-tags
libxml_use_internal_errors(true);
// Loading content from external url
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Querying DOM for target `id`
$xpathResultset = $xpath->query("//*[@id='$id']")->item(0);
// Getting plain html
$content = $dom->saveHTML($xpathResultset);
echo $content;

PHP HTML DOM Parse - Can't find element appended javascript

I'm trying to get an element from a website, but I can't find an element appended by JavaScript. Is there a solution for this problem?
Code here:
$dom = new Dom;
$obj = $dom->loadFromUrl($url);
$element = $obj->find(".c-payment");
echo count($element);
Result = 0, but the element exists on the website.
When you read a web page's content with PHP, you get only the static content (what the web server provides). The dynamic part of the content (generated by JavaScript) does not exist at that moment, because PHP does not execute the JavaScript code.
You can try the V8 JavaScript engine integration (the v8js extension), but I do not think you can easily achieve what you want.
Maybe it will be useful for you: https://github.com/scraperlab/browserext

Finding and Echoing out a Specific ID from HTML document with PHP

I am grabbing the contents of Google with PHP. How can I search $page for the element with the id "lga" and echo out another property? Say #lga is an image: how would I echo out its source?
No, I'm not actually going to do this with Google; it is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source, so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
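For reference, a minimal end-to-end sketch with the warning suppression mentioned above (the snail.png markup is the example from the question):

```php
<?php
$html = '<body><img id="lga" src="snail.png" /></body>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// getElementById works here because loadHTML registers HTML id attributes
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
echo $imageSrc; // snail.png
```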

Find all hrefs in page and replace with link maintaining previous link - PHP

I'm trying to find all href links on a webpage and replace each link with my own proxy link.
For example
<a href="https://www.google.com">Google</a>
needs to become
<a href="http://www.example.com/?loadpage=https%3A%2F%2Fwww.google.com">Google</a>
Use PHP's DomDocument to parse the page
$doc = new DOMDocument();
// load the string into the DOM (this is your page's HTML), see below for more info
$doc->loadHTML('<a href="https://www.google.com">Google</a>');
// Loop through each <a> tag in the DOM and change its href attribute
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $link = $anchor->getAttribute('href');
    $link = 'http://www.example.com/?loadpage=' . urlencode($link);
    $anchor->setAttribute('href', $link);
}
echo $doc->saveHTML();
Check it out here: http://codepad.org/9enqx3Rv
If you don't have the HTML as a string, you may use cURL (docs) to grab it, or you can use the loadHTMLFile() method of DOMDocument.
Documentation
DomDocument - http://php.net/manual/en/class.domdocument.php
DomElement - http://www.php.net/manual/en/class.domelement.php
DomElement::getAttribute - http://www.php.net/manual/en/domelement.getattribute.php
DOMElement::setAttribute - http://www.php.net/manual/en/domelement.setattribute.php
urlencode - http://php.net/manual/en/function.urlencode.php
DomDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
cURL - http://php.net/manual/en/book.curl.php
Just another option: if you would like to have the links replaced by jQuery, you could also do the following:
$(document).find('a').each(function(key, element){
    var curValue = $(element).attr('href');
    $(element).attr('href', 'http://www.example.com/?loadpage=' + encodeURIComponent(curValue));
});
However, a more robust way is doing it in PHP, of course.
Simplest way I can think of to do this:
$loader = "http://www.example.com/?loadpage=";
$page_contents = str_ireplace(array('href="', "href='"), array('href="' . $loader, "href='" . $loader), $page_contents);
But that might have problems with URLs containing ? or &, or if the text (not code) of the document contains href=".
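One way around those edge cases, while still working on the raw string, is preg_replace_callback(), which rewrites only href attributes and URL-encodes the original value (a sketch; $loader and the sample markup follow the answer above):

```php
<?php
$loader = 'http://www.example.com/?loadpage=';
$page_contents = '<a href="https://www.google.com/?q=php&n=10">Google</a>';

// Match href="..." or href='...' and wrap the URL behind the proxy link
$rewritten = preg_replace_callback(
    '/href=(["\'])(.*?)\1/i',
    function (array $m) use ($loader) {
        return 'href=' . $m[1] . $loader . urlencode($m[2]) . $m[1];
    },
    $page_contents
);

echo $rewritten;
```

The encoded URL survives ? and & intact, though the DOMDocument approach above is still the safer choice for full pages.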
