How to get the DOM with headless-chromium-php

I'm using headless-chromium-php, but the getHtml function seems to return the original HTML source code:
https://github.com/chrome-php/chrome#get-the-page-html
Instead, I want the DOM as displayed in the Chrome browser, i.e. the HTML after the browser has rendered the page and run its scripts. How can I do that?

As you surmise, you need to wait for the page to finish loading, including any JavaScript rendering; have a look at the example earlier in that documentation
(https://github.com/chrome-php/chrome#evaluate-script-on-the-page) to get the inner HTML.
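Putting that advice together, here is a minimal sketch, assuming chrome-php/chrome is installed via Composer and using example.com as a placeholder URL: navigate, wait for the load event, then evaluate `outerHTML` inside the page so you get the live DOM rather than the original source.

```php
<?php
// Sketch, assuming chrome-php/chrome is installed via Composer.
require __DIR__ . '/vendor/autoload.php';

use HeadlessChromium\BrowserFactory;

$browser = (new BrowserFactory())->createBrowser();
try {
    $page = $browser->createPage();
    // Wait until the page's load event has fired, so scripts have run.
    $page->navigate('https://example.com')->waitForNavigation();

    // Read the live DOM back out of the browser, not the original source.
    $renderedHtml = $page->evaluate('document.documentElement.outerHTML')
                         ->getReturnValue();
    echo $renderedHtml;
} finally {
    $browser->close();
}
```

If the page keeps loading content after the load event, you may need a longer wait strategy before evaluating.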

Related

Browsers are changing HTML markup. How do I force all browsers to use the original markup?

I have a problem and hope someone can help me.
I use an iframe with src="http://my.own.domain/some/path/file-1".
That URL serves content from "http://some-site.com/path1/path2/path3/qwerty.html".
But before sending the content, I pre-process the links and resources.
For example, if the CSS is <link href="/css/style1">, I add the protocol and host to it, producing something like <link href="http://some-site.com/css/style1">.
Then I click on some page element and read the current node's information with JS (name and attributes of the current node, name and attributes of the parent node, and so on up to the html tag).
I send this data to a PHP script via Ajax.
In PHP I convert it to an XPath selector, and I can see that my selector is incorrect:
//html
/body[0]
/div[@id='wrap']
/div[@id='main']
/table[contains(@class, 'content-wrapper')][1]
/tbody[1]
/tr[1]
/td[contains(@class, 'content-wrap')][1]
/div[contains(@class, 'content')][1]
/div[contains(@class, 'node')][1]
/div[contains(@class, 'techs')][1]
/table[1]
/tbody[1]
/tr[4]
/td[contains(@class, 'techs-right')][1]
But the original markup of that page is:
//html
/body[0]
/div[@id='wrap']
/div[@id='main']
/table[contains(@class, 'content-wrapper')][1]
(no /tbody[1] step here)
/tr[1]
/td[contains(@class, 'content-wrap')][1]
/div[contains(@class, 'content')][1]
/div[contains(@class, 'node')][1]
/div[contains(@class, 'techs')][1]
/table[1]
(no /tbody[1] step here)
/tr[4]
/td[contains(@class, 'techs-right')][1]
It seems the browser is fixing the incorrect markup by inserting the missing tbody elements, but this is a hitch for me. How can I turn this off?
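This behavior cannot be turned off: the HTML parser inserts a tbody around table rows as part of normal HTML parsing, so the DOM the browser shows will always contain it even when the source does not. A workaround is to normalize on the PHP side instead; a sketch, where `stripTbodySteps` is a hypothetical helper name:

```php
<?php
// Sketch: since tbody insertion can't be disabled, normalize the XPath
// built from the live DOM so it also matches the original markup.
// Dropping every "/tbody[n]" step makes the selector tolerant of both.
function stripTbodySteps(string $xpath): string
{
    return preg_replace('~/tbody\[\d+\]~', '', $xpath);
}

$fromBrowser = "//html/body/table[1]/tbody[1]/tr[4]/td[2]";
echo stripTbodySteps($fromBrowser); // //html/body/table[1]/tr[4]/td[2]
```

Alternatively, emit `//tr` (descendant axis) instead of `/tbody[1]/tr` when building the selector, which matches whether or not the tbody is present.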

I want to get data from another website and display it on mine, but with my own style.css

My school has a very annoying way to view my timetable ("rooster" in Dutch): you have to click through 5 links to get to it.
This is the link for my class (it updates weekly without the link changing):
https://webuntis.a12.nl/WebUntis/?school=roc%20a12#Timetable?type=1&departmentId=0&id=2147
I want to display the content from that page on my website, but with my own stylesheet.
I don't mean this:
<?php
$homepage = file_get_contents('http://www.example.com/');
echo $homepage;
?>
or an iframe....
I think this can be done better using jQuery and Ajax. You can get jQuery to load the target page, use selectors to strip out what you need, then attach it to your document tree. You should then be able to style it any way you like.
I would recommend using the cURL library: http://www.php.net/manual/en/curl.examples.php
But you have to extract the part of the page you want to display, because you will get the whole HTML document.
You'd probably read the whole page into a string variable (using file_get_contents as you mentioned, for example) and then parse the content. Here you have some possibilities:
Regular expressions
Walking the DOM tree (e.g. using PHP's DOMDocument classes)
After that, you'd most likely replace all the style="..." or class="..." information with your own.
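A sketch of the cURL-plus-DOM approach described above; the URL, the element id, and the helper names are placeholders for illustration:

```php
<?php
// Sketch: fetch a page with cURL, then pull out one fragment with
// DOMDocument. The "content" id and function names are placeholders.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return body instead of printing
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return (string) $html;
}

function extractById(string $html, string $id): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);               // suppress warnings on sloppy markup
    $node = $doc->getElementById($id);
    return $node ? $doc->saveHTML($node) : '';
}

// Demo on an inline string (for a live page: extractById(fetchHtml($url), 'content')):
$html = '<div id="content"><h1>Demo</h1></div>';
echo extractById($html, 'content'); // <div id="content"><h1>Demo</h1></div>
```

From there you could strip the site's own class/style attributes from the fragment so your stylesheet takes over.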

Dynamically create elements from unrendered code using jQuery and/or PHP

I want to store HTML that isn't to be rendered until needed, either within a tag that can hold raw HTML without rendering it on page load, or within a PHP or jQuery variable for later use. I then want to be able to insert the HTML into the DOM on a button click and have it render.
I've tried storing it within an xmp tag, since that can hold HTML with the < and > characters without using character codes, but when I inserted it into the DOM, the updated source showed it had been copied, yet it wouldn't render on screen. I also tried storing it within a code tag, which worked in a desktop browser but not in Mobile Safari. Since this is a web app, mobile browser compatibility is important.
Does anyone know of a good method for doing this?
Try <script> tags with a type of text/plain or text/html:
<script type="text/plain" id="example">
    <div class="example">
        <h2>Hello</h2>
        <p>World</p>
    </div>
</script>
$(".button").click(function () {
    var html = $("#example").text();
    $("#destination").html(html);
});
It depends on where you want to generate the content in question. If it's easier for your setup to generate it on the server side, you can use CSS to hide those parts (like display:none) and then either remove the CSS property, or grab the nodes with JavaScript and put them elsewhere with something like this:
$('.target').html($('.hidden_node').html());
If you want to generate the content on the JS side, you can build it as a long string and just shove it into the target, or you can use jQuery's node-generation syntax:
$('<div />').attr({
    'class': 'test'   // quote 'class': it is a reserved word in older browsers
}).appendTo("body");
Or you can use one of the various javascript templating solutions like mustache or handlebars.

How to extract all image URLs from an HTML source and download them using cURL?

I am using cURL to get the images from the HTML source code of an external web page. In Firefox's "View Page Source" I see img original='imageurl', but when I select a particular image, "View Selection Source" shows img src='imageurl'.
How can I get this type of image using cURL?
Currently I am using regex to get the image:
preg_match_all('/<img[^>]+>/i',$output, $result);
print_r($result);
But it doesn't display any image.
I am very confused about what to do here. Anyone have any thoughts?
I am very confused about what to do here.
The confusion probably results from using your web browser to view the source of a URL. Even though the source the browser displays is often the same data that curl would return, this is not always the case.
In particular, Firefox's "View Selection Source" feature will not display the selection from the original resource, but often something else. To prevent that, you need to disable JavaScript in Firefox. Documents are often modified with JavaScript, and you want to see the original: curl is not able to run JavaScript, so it can only get "the original".
Anyone have any thoughts?
Disable javascript in your browser.
Reload the page.
Locate the fragment of the HTML-source-code you're interested in.
Write it down, e.g. into a string.
Request the page with CURL. Output the source.
Locate that string in there. If it's not there, search the curl result for the string you're interested in and use that instead.
Write a regular expression that is able to obtain what you need from that string.
Use that regular expression in your program then.
Your web browser is reformatting the HTML according to how it understands/parses the HTML page.
When you choose "View Page Source" it shows you the original source code served from the server.
When you select content and choose "View Selection Source" it shows what the browser has parsed into DOM (what the browser understands) for the selected content.
I am guessing you're using Firefox.
If you are attempting to use cURL to process the HTML served from the server, you must not look at "View Selection Source"; always refer to "View Page Source".
Ultimately
You should rather refer to the ACTUAL result from cURL
For example:
$content = curl_exec($ch);
header("Content-type: text/plain");
echo $content;
That should echo exactly what cURL has received from the server...
NOTE: This is a re-post of https://stackoverflow.com/questions/8754844/can-not-get-images-using-curl
Furthermore
If you want to fetch the actual image inside an <img src=""> tag, you need to pin-point the img tag in the HTML response using preg_match, and make a separate cURL request for the img src.
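A sketch of that two-step approach, with hypothetical helper names; the extraction function is pure, so it works the same on the string cURL returned for the page:

```php
<?php
// Sketch: pull src attributes out of already-fetched HTML, then download
// each image with its own cURL request.
function extractImageUrls(string $html): array
{
    // Capture the src value of every img tag (single or double quotes).
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m);
    return $m[1];
}

function downloadImage(string $url, string $savePath): void
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    file_put_contents($savePath, curl_exec($ch));
    curl_close($ch);
}

// Demo on an inline string:
$html = '<p><img src="/a.png"><img class="x" src="/b.jpg"></p>';
echo implode(', ', extractImageUrls($html)); // /a.png, /b.jpg
```

Note that relative URLs like /a.png must be resolved against the page's base URL before passing them to downloadImage.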

phpQuery - make php script wait until iframe content has loaded

I'm using the phpQuery library (http://code.google.com/p/phpquery/) to parse web pages, but have stumbled across a problem with sites that use Ajax to display all their content.
I have worked out that I can get all the content if I load it in to an iframe (the code below works):
$temp = phpQuery::newDocumentHTML('<iframe src="" id="test">a</iframe>')->find('iframe[id=test]')->attr('src', 'http://www.example.com/');
echo $temp;
BUT, my question is, how can I get my PHP script to wait until the iframe has loaded before proceeding?
Below is the jQuery equivalent but I was wondering if anybody knows how to do the equivalent using phpQuery?
$(iFrame).attr('src', 'http://www.example.com');
$(iFrame).load(function(){
alert("Loaded");
});
Thanks in advance.
BUT, my question is, how can I get my PHP script to wait until the iframe has loaded before proceeding?
This is not how PHP-side HTML parsing works. phpQuery just parses the HTML code; it doesn't do anything with it, such as loading and rendering iframes or running JavaScript events.
There is probably a way to do what you want to do - if you tell us what that is!
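For instance, since phpQuery never loads an iframe, the nearest equivalent is to fetch the iframe's URL directly as its own document; a sketch, assuming phpQuery is on the include path:

```php
<?php
// Sketch: instead of waiting for an iframe to "load", fetch the
// iframe's URL directly as its own phpQuery document and query that.
require 'phpQuery/phpQuery.php';   // path is an assumption

$doc = phpQuery::newDocumentFileHTML('http://www.example.com/');
echo pq('title')->text();

// Caveat: content injected by Ajax will still be missing, because
// phpQuery only parses static HTML and never executes scripts. For
// Ajax-built pages you would need to request the Ajax endpoints
// themselves, or drive a real browser.
```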
