PHP HTML DOM Parser - Can't find element appended by JavaScript

I'm trying to get an element from a website, but I can't find elements appended by JavaScript. Is there a solution for this problem?
Code here:
$dom = new Dom;
$obj = $dom->loadFromUrl($url);
$element = $obj->find(".c-payment");
echo count($element);
The result is 0, but the element is present on the website.

When you read a web page's content with PHP, you get only the static content (what the web server serves). The dynamic part of the content (generated by JavaScript) does not exist at that moment, because PHP does not execute JavaScript code.
You can try to use V8 Javascript Engine Integration, but I don't think you can easily achieve what you want that way.
Maybe it will be useful for you: https://github.com/scraperlab/browserext
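To see this for yourself, you can search the static HTML for the class from the question with DOMXPath. This is a minimal sketch: the HTML string stands in for what file_get_contents($url) would return, and the class name c-payment is taken from the question above.

```php
<?php
// Sketch: check whether ".c-payment" exists in the *static* HTML the
// server sends. In practice you would do: $html = file_get_contents($url);
$html = '<html><body><div class="header">Static part</div></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from imperfect real-world HTML

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " c-payment ")]'
);

echo $nodes->length; // 0: the element is only added later by JavaScript
```

If this prints 0 while the element is visible in your browser's inspector, the element is created client-side and no pure-PHP parser will find it.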

Related

PHP access DOM within your page

Let's say you wanted to parse the DOM with PHP. You can easily achieve this using the DOMDocument class.
However, in order to do so, you would need to load some HTML using loadHTML or loadHTMLFile and provide the functions with a string containing HTML (or a file path in the case of loadHTMLFile).
As an example, if you just wanted to get an element with a specific ID (in PHP, not JavaScript), WITHIN your page, what can you do?
If you have PHP code generating the page, you could use the output buffer to generate the page in memory, edit the generated page and then flush it to the browser. You can only change the DOM before the browser gets it.
You could do the following:
ob_start(); // Should be called before any output is generated
// ... PHP code that outputs HTML ...
$generated_html = ob_get_clean(); // Store generated HTML to string
// Load and manipulate HTML
$doc = new DOMDocument();
$doc->loadHTML($generated_html);
// ... Manipulate the generated HTML ...
echo $doc->saveHTML(); // echo the modified HTML
However, since you are generating the HTML, it would make more sense to change whatever you need to change before it's generated, to reduce processing time.
If you want to change the HTML of a page which is already shown in the browser you'll need another way (such as JS/AJAX) since at that point PHP can't possibly access the DOM.
The getElementById method can be invoked on the DOMDocument instance with an id string to get the element:
$element = $testDOMDocument->getElementById('test-id');
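Putting the two ideas together, here is a minimal end-to-end sketch; the HTML string and the id test-id are made up for illustration. Note that after loadHTML(), getElementById() works on id attributes without any extra setup, because PHP's HTML parser treats them as ID attributes.

```php
<?php
// Sketch: load generated HTML, edit an element by id, output the result.
$html = '<div id="test-id">old</div>';

$testDOMDocument = new DOMDocument();
@$testDOMDocument->loadHTML($html);

$element = $testDOMDocument->getElementById('test-id');
$element->nodeValue = 'new'; // edit before flushing to the browser

echo $testDOMDocument->saveHTML();
```

This is the same pattern as the output-buffer example above, with the buffering steps omitted.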

PHP - file_get_html not returning anything

I am trying to scrape data from this site, using "inspect" I am checking the class of the div, but when I try to get it, it doesn't display anything:
Trying to get the "Diamond" below "Supremacy".
What I am using:
<?php
include('simple_html_dom.php');
$memberName = $_GET['memberName'];
$html = file_get_html('https://destinytracker.com/d2/profile/pc/'.$memberName.'');
preg_match("/<div id=\"dtr-rating\".*span>/", $html, $data);
var_dump($data);
?>
FYI, simple_html_dom is a package available on SourceForge at http://simplehtmldom.sourceforge.net/. See the documentation.
file_get_html(), from simple_html_dom, does not return a string; it returns an object that has methods you can call to traverse the HTML document. To get a string from the object, do:
$url = 'https://destinytracker.com/d2/profile/pc/'.$memberName;
$html_str = file_get_html($url)->plaintext;
But if you are going to do that, you might as well just do:
$html_str = file_get_contents($url);
and then run your regex on $html_str.
BUT ... if you want to use the power of simple_html_dom ...
$html_obj = file_get_html($url);
$the_div = $html_obj->find('div[id=dtr-rating]', 0);
$inner_str = $the_div->innertext;
I'm not sure how to do exactly what you want, because when I look at the source of the web link you provided, I cannot find a <div> with id="dtr-rating".
My other answer is about using simple_html_dom. After looking at the HTML doc in more detail, I see the problem is different than I first thought (I'll leave it there for pointers on better use of simple_html_dom).
I see that the web page you are scraping is a VueJS application. That means the HTML sent by the web server causes JavaScript to run and build the dynamic contents of the web page you see displayed. That means the <div> you are looking for with your regex DOES NOT EXIST in the HTML sent by the server; your regex cannot find anything because it's not there.
In Chrome, press Ctrl+U to see what the web server sent (no "Supremacy"). Press Ctrl+Shift+I and look under the "Elements" tab to see the HTML after the JavaScript has done its magic (this does have "Supremacy").
This means you won't be able to get the initial HTML of the web page and scrape it to get the data you want.

Crawl the site and get data from HTML string

I am using the Goutte Laravel library in a project to get page content and crawl it.
I can find any element of the DOM structure, except that on one of the sites the important content is placed in a <script> tag.
The data is placed in a JavaScript variable and I want to crawl it without heavy string operations. A typical example of such a case:
var article_content = "Details article string";
var article_twtag = "#Madrid #Barcelona";
var article_twtitle = "Article title";
var article_images = new Array(
    "http://img.sireasas.com/?i=reuters%2f2017-03-08%2f2017-03-08t200344z_132005024_mt1aci14762686_rtrmadp_3_soccer-champions-fcb-psg_reuters.jpg", "",
    "0000000000115043", "",
    "");
Is there any way to crawl the javascript using selector or DOM methods ?
What I would do is get the content inside the script tag and then extract whatever I wanted with regular expressions.
$doc = new DOMDocument();
$doc->loadHTML($yoursiteHTML);
foreach($doc->getElementsByTagName('script') as $content) {
// extract data
}
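For the "extract data" step, a sketch like the following would pull out one of the variables from the example above (here article_twtitle; the inline HTML is a stand-in for the real page, and the regex only handles simple double-quoted values without escapes):

```php
<?php
// Sketch: pull the value of the Javascript variable article_twtitle
// out of an inline <script> block with a regular expression.
$yoursiteHTML = '<html><body><script>
var article_twtitle = "Article title";
var article_twtag = "#Madrid #Barcelona";
</script></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($yoursiteHTML);

$title = null;
foreach ($doc->getElementsByTagName('script') as $content) {
    // naive match: a double-quoted value with no escaped quotes inside
    if (preg_match('/article_twtitle\s*=\s*"([^"]*)"/', $content->textContent, $m)) {
        $title = $m[1];
    }
}
echo $title; // Article title
```

The same loop-plus-preg_match pattern works for the other variables; array literals like article_images would need a slightly more involved expression.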
Goutte only receives the HTML response and does not run JavaScript code the way a browser does, so it cannot see dynamically generated data.
Use PHP Simple HTML DOM Parser
$html = file_get_html('http://www.your-link-here.com/');
// Find all scripts
foreach($html->find('script') as $element)
echo $element->outertext . '<br>';

Check if div ID exists (PHP)

Is it possible to check if an element exists with PHP?
I'm aware of the JavaScript method already, but I just want to avoid it if possible.
If you have the HTML server side in a string, you can use DOMDocument:
<?php
$html = '<html><body><div id="first"></div><div id="second"></div></body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$element = $dom->getElementById('second');
// this will be null if it isn't found
var_dump($element);
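A small helper wrapping the approach above may be convenient; the function name divIdExists is made up for this sketch:

```php
<?php
// Helper: returns true if an element with the given id exists in an
// HTML string, using DOMDocument::getElementById as shown above.
function divIdExists(string $html, string $id): bool
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings on imperfect HTML
    return $dom->getElementById($id) !== null;
}

$html = '<html><body><div id="first"></div><div id="second"></div></body></html>';
var_dump(divIdExists($html, 'second')); // bool(true)
var_dump(divIdExists($html, 'third'));  // bool(false)
```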
Not directly, because PHP is server-side only.
But if you really wish to do so, you could send the whole markup of your page to a PHP script on your server using an AJAX request, parse it there to find out whether a div with the specified ID exists (see Shabbyrobe's post), and return the result in your AJAX response. Of course this would be very inefficient and is not recommended when you can easily check it with JavaScript.
No. PHP can only serve content, it has no control or view of the DOM except what you ask it to create.

Basic web-crawling question: How to create a list of all pages on a website using php?

I would like to create a crawler using php that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in php?
I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href=""> tags and parse the URL out of them (see this question for some typical approaches).
Once you've extracted the raw href attribute, you can use parse_url() to break it into its components and figure out whether it's a URL you want to fetch. Remember also that URLs may be relative to the page you fetched them from.
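As a sketch of that classification step, the helper below uses parse_url() to decide whether an href stays on the domain being crawled. The function name isInternal is made up, and it only handles the common cases (absolute URLs and relative paths); full RFC 3986 resolution needs more work.

```php
<?php
// Sketch: decide whether a raw href belongs to the domain being crawled.
function isInternal(string $href, string $host): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false;            // seriously malformed href
    }
    if (!isset($parts['host'])) {
        return true;             // relative URL: same site
    }
    return strcasecmp($parts['host'], $host) === 0;
}

var_dump(isInternal('/about.html', 'www.example.com'));              // bool(true)
var_dump(isInternal('http://www.example.com/x', 'www.example.com')); // bool(true)
var_dump(isInternal('http://other.com/x', 'www.example.com'));       // bool(false)
```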
Though fast, a regex isn't the best way to parse HTML. You could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');
            //now figure out whether to process this
            //URL and add it to a list of URLs to be fetched
        }
    }
}
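The recursive part, "start at the homepage and follow links without leaving the site", can be sketched as a breadth-first loop over a visited set. To keep the sketch testable, the function below crawls an in-memory "site" (a page-to-HTML map) instead of fetching over the network; a real crawler would replace the array lookup with file_get_contents() and the internal-link check described above.

```php
<?php
// Toy sketch of a breadth-first crawl over an in-memory "site"
// (page => HTML), so the link-following logic is visible without
// any network access.
function crawl(array $site, string $start): array
{
    $seen  = [];
    $queue = [$start];
    while ($queue) {
        $page = array_shift($queue);
        if (isset($seen[$page]) || !isset($site[$page])) {
            continue;            // already visited, or external/missing page
        }
        $seen[$page] = true;

        $dom = new DOMDocument();
        @$dom->loadHTML($site[$page]);
        foreach ($dom->getElementsByTagName('a') as $a) {
            if ($a->hasAttribute('href')) {
                $queue[] = $a->getAttribute('href');
            }
        }
    }
    return array_keys($seen);
}

$site = [
    '/'        => '<a href="/about">About</a><a href="/contact">Contact</a>',
    '/about'   => '<a href="/">Home</a>',
    '/contact' => '<a href="http://external.com/">External</a>',
];
print_r(crawl($site, '/')); // '/', '/about', '/contact'
```

The $seen set is what prevents infinite loops on circular links; external URLs simply fail the $site lookup and are skipped.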
Finally, rather than write it yourself, see also this question for other resources you could use.
is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is an HTML file and the input (what site to crawl) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-HTML files and external URLs, in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an Html Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off of the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();
    return htmlText;
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
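In this thread's PHP context, a rough counterpart of GetWebText is file_get_contents() with a stream context carrying a custom User-Agent. This is a sketch: the function name and user-agent string are made up, and error handling is omitted.

```php
<?php
// Rough PHP counterpart of the .NET GetWebText above: fetch a page's
// HTML with a custom User-Agent header. Returns false on failure.
function getWebText($url)
{
    $context = stream_context_create([
        'http' => ['user_agent' => 'A PHP Web Crawler'],
    ]);
    return file_get_contents($url, false, $context);
}
```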
