As the title says, my question is: how do I output (let's say, save as a text file on the server, or pass the result to some other PHP function using AJAX) all the DOM content of a page?
I did some homework: I tried cURL, which can output the page content using "curl http://google.ca > dom.txt".
However, this approach will not save content that JavaScript generated; in other words, the JavaScript code will not run.
Another approach is to embed some JavaScript code into a page, let that page load the website we want to capture, and then use the JavaScript to save the whole DOM after everything has loaded.
I am not sure whether PhantomJS can do such a job; if so, how?
Can anybody give a detailed answer on how to achieve this?
I am open to any solutions; this program will run on my server to provide a service.
Thank you in advance.
Why not:
jQuery(document).ready(function($) {
    $.post(
        '/your_filename.php',          // server-side script that receives the markup
        { html: $("html").html() },    // pass an object so jQuery URL-encodes the DOM snapshot
        function(response) {
            alert(response);
        }
    );
});
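For completeness, a minimal sketch of what the receiving script might look like, assuming the snippet above posts to /your_filename.php. The output path "dom.txt" is a placeholder; pick a safe, writable location on your server.
<?php
// Receive the DOM snapshot posted by the jQuery snippet above and
// save it to a file on the server.
if (isset($_POST['html'])) {
    file_put_contents('dom.txt', $_POST['html']);
    echo 'saved'; // this string comes back in the alert() above
} else {
    echo 'no html received';
}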
You can get the contents of the html element (including both head and body) using document.documentElement.innerHTML. If you need everything, you can prepend a serialization of document.doctype (the node itself does not stringify usefully, so use something like XMLSerializer) to document.documentElement.outerHTML.
Note that outerHTML isn't quite cross-browser (it works in IE and Chrome, but not Firefox) - for a way to simulate outerHTML for Firefox, see this question: How do I do OuterHTML in firefox?
JavaScript is a client-side language, so running it on a server is going to require specialized technology. PHP actually has the ability to work with DOM content, as it can build and modify DOM elements before transmitting them to the client; read more about that here.
I'm not really sure what you are trying to accomplish by doing this, but it sounds like you are trying too hard: you are sending code to the client so that the client can turn around and send code back to the server, so that the server can save it as a file? If that really is what you need to do, follow Brilliand's and iambriansreed's advice and scoop up the DOM with JavaScript/jQuery.
Related
I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part of it is loaded later via scripting (an AJAX POST), and I cannot retrieve that part (it is a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
cURL does nothing more than download a file from a URL. It doesn't care whether it's HTML, JavaScript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything, parse anything, or display anything; it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, run some JavaScript that downloads something else, then run more JavaScript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not cURL.
Since your goal involves "running some JavaScript code", it should be fairly clear that it is not achievable without having a JavaScript interpreter available. This means it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
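For illustration, here is a rough sketch of driving PhantomJS from PHP via the shell. The dump_dom.js helper named here is hypothetical: it would open the URL it is given on the command line, wait for the page to finish rendering, and print the final HTML to stdout.
<?php
// Rough sketch: shell out to PhantomJS and capture the rendered DOM.
// Assumes the phantomjs binary is installed and on the PATH, and that
// dump_dom.js (hypothetical) prints the page's final HTML to stdout.
$url  = 'http://example.com/page-with-js-content';  // placeholder URL
$html = shell_exec('phantomjs dump_dom.js ' . escapeshellarg($url));

if ($html === null) {
    die('PhantomJS produced no output');
}

file_put_contents('rendered_dom.html', $html);  // save the result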
I hope that helps.
(*) Yes, I'm aware of the PHP extension that provides a JS interpreter inside PHP, and it would provide a way to solve the problem, but it's experimental and unfinished, it would still be difficult to build a solution around, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No; the only way you can do this is to make a separate cURL request to the AJAX endpoint and put the two results together afterwards.
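A hedged sketch of that idea follows; both URLs and the placeholder element are made up, and you would find the real AJAX URL in your browser's network inspector.
<?php
// Fetch a URL with cURL and return the response body.
function fetch($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$page  = fetch('http://example.com/page');            // the base page
$table = fetch('http://example.com/ajax/table-data'); // the AJAX endpoint

// Naive merge: drop the table markup into the spot where the page's
// own JavaScript would normally insert it (placeholder id made up).
$full = str_replace('<div id="table-placeholder"></div>', $table, $page);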
I have a small script that pulls HTML from another site using Javascript.
I want to include the static HTML that gets pulled into a PHP page, without any of the JavaScript code appearing in the final page that gets displayed.
I tried doing an include of the file containing the JavaScript code in the PHP page, but it just included the actual JavaScript and not the results of running it.
So how would I go about doing this?
You would need to fetch the page, execute the JavaScript in it, then extract the data you wanted from the generated DOM.
The usual approach to this is to use a web automation tool such as Selenium.
You simply can't.
You need to understand that PHP and JavaScript operate in different places: PHP on the server and JavaScript on the client.
Your only solution is to change the way all this is done and use file_get_contents(url) from PHP to fetch the same content your JavaScript used to fetch. That way there is no JavaScript any more, and you can still pre-process your page with the remote content.
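A minimal sketch of that suggestion, assuming you have identified the endpoint the JavaScript was requesting (the URL below is a placeholder):
<?php
// Fetch the same content the client-side JavaScript used to fetch,
// directly from PHP, so no JavaScript is needed any more.
$remote = file_get_contents('http://example.com/data-endpoint');

if ($remote === false) {
    die('Could not fetch the remote content');
}

echo $remote; // embed the fetched markup into the page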
You wouldn't be able to do this directly from within PHP, since you'd need to run Javascript code.
I'd suggest passing the URL (and any required actions, such as click events) to a headless browser such as Phantom or Zombie, and capturing the DOM from it once the JS engine has done its work.
You could also use a real browser, but of course you don't need a UI in your case, and it might actually get in the way of what you're trying to do, so a headless browser might be better.
This sort of thing would normally be used for automated testing of a site (i.e. functional testing).
There is a PHP tool named Mink which can run these sorts of scripts from within a PHP program. It is aimed at writing test scripts, but I would imagine you could use it for your purposes.
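For example, a rough outline using Mink's Selenium2 driver; this assumes Mink is installed via Composer and a Selenium server is running locally, so treat it as a sketch rather than a recipe.
<?php
require 'vendor/autoload.php';

use Behat\Mink\Session;
use Behat\Mink\Driver\Selenium2Driver;

// Start a real browser session through Selenium (assumes a Selenium
// server is running on its default local port).
$driver  = new Selenium2Driver('firefox');
$session = new Session($driver);
$session->start();

// Load the page, let its JavaScript run, then grab the resulting DOM.
$session->visit('http://example.com/page-with-js-content'); // placeholder URL
$html = $session->getPage()->getContent();

$session->stop();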
Hope that helps.
When I use Firebug or the Chrome inspector on this page http://www.facebook.com/GaryFromCooper?sk=wall (right click, Inspect Element), I can see a hidden input named "link_data".
But when I fetch the page with cURL from PHP and read the HTML, there is no such hidden input...
So I guess it must only exist in the live DOM.
But I couldn't find any way to read the DOM after my cURL request; I tried PHP's DOM functions, but that doesn't work...
Can someone help me?
I just want to retrieve the "link_data" value from the http://www.facebook.com/GaryFromCooper?sk=wall page... using cURL.
Thanks for your help
It's probably inserted with JavaScript. cURL is just a tool for transferring data, not executing JavaScript :P
Considering this involves Facebook, there's probably a really good reason why you can't just 'scrape' that value.
You're better off using the Facebook API to get the data you need; if anything changes on Facebook's part, you won't be affected.
http://developers.facebook.com/
It might be a DOM node inserted by JS. See this curl FAQ: cURL doesn't support JavaScript.
I have been working on parsing some of the data from the WoW Armory and have run into a bit of a snag. When it comes to the site serving up the achievements that players have received, it uses JavaScript to interpret a string such as #73:1283 in order to display the requested information. (I made this number up, but the data for the requests is formatted like this.)
Is it possible to pull data from a page that requires JavaScript to display its data with PHP?
How do you parse data from a site that has been loaded after the DOM is ready or complete, using PHP?
By using Firebug, I was able to look at the HTTP headers to see what AJAX calls were being made to generate the content on these pages: http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement#96:14861 and http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement#96
It looks like the page is making an asynchronous call to load this page: http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement/14861 when the part after the hash is 96:14861, and a call to http://us.battle.net/wow/en/character/black-dragonflight/glitchshot/achievement/96 when the part after the hash is just 96. Both of those pages return XML that can be parsed to render HTML.
So generally speaking, if there's just one number after the hash, just put http://.../achievement/<number here> as the URL. If there are two numbers, put the second number at the end of the URL instead.
What you'll need to do, rather than pulling the Javascript and interpreting it, is make HTTP requests to those URLs by yourself in PHP (using cURL, for example) and parse the data on your own.
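A rough sketch of that approach follows; the URL is one of the endpoints identified above, and loadHTML is used because it is forgiving about markup, even though the response is described as XML.
<?php
// Request the achievement data directly, bypassing the JavaScript.
$url = 'http://us.battle.net/wow/en/character/black-dragonflight/'
     . 'glitchshot/achievement/14861';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);
curl_close($ch);

// Parse the returned markup so it can be traversed and extracted.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate non-strict markup
$doc->loadHTML($response);
libxml_clear_errors();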
I would really recommend learning JavaScript and jQuery, since it will be very hard for you to really build a good site that pulls information from the WoW Armory without understanding all the AJAX loads that are going on in the background.
I would recommend seeing if you can replicate the query the JavaScript sends, in PHP. I don't believe there is a way to process JavaScript in PHP, and there certainly isn't a simple or scalable one.
I would attempt to scan the source of the first page you downloaded with PHP for strings of the format you mention. Then, if the JS on their site is querying something like http://www.wow.com/armory.php?id=#72:1284, you can just download the source of that next. You can find out how the JS is querying the server with something like Firebug or the Inspector in Chrome or Safari.
So in summary:
1. Check to find the JS URL format and whether you can replicate it.
2. Create PHP to get the main page and extract all such strings (see the sketch after this list).
3. Create PHP to loop through these strings and fetch the corresponding pages (with the URLs the JS requests).
4. Do whatever you wanted to with that information.
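A hedged sketch of step 2; the URL is a placeholder, and the pattern simply matches the "#96:14861" style of string from the question.
<?php
// Scan the downloaded page source for strings like "#96:14861".
$source = file_get_contents('http://example.com/armory-page'); // placeholder

preg_match_all('/#(\d+)(?::(\d+))?/', $source, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the first number; $m[2], when present, is the second,
    // which (per the answer above) becomes the end of the request URL.
    $id = isset($m[2]) && $m[2] !== '' ? $m[2] : $m[1];
    echo "Would fetch .../achievement/{$id}\n";
}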
You can try jQuery's $(document).ready() function, which runs JavaScript code once the page's DOM has loaded.
Example:
<div id="wowoData">#4325325</div>
<script>
$(document).ready(
function(){
$("#wowoData").css("border","1px solid red");
}
)
</script>
I am dealing with a problem where I need to do a few things on the SERVER SIDE using JAVASCRIPT (I am using a PHP + Apache combination):
1. read the source of a URL using cURL
2. run it through some server-side JavaScript and get a DOM out of it
3. traverse and parse the DOM using pre-existing JavaScript code (this code works fine in a browser)
I googled and found http://pecl.php.net/package/spidermonkey, which allows us to run JavaScript on the server. Is there a better way to achieve this? Can we use the Mozilla engine to get a DOM out of HTML source code and process it using JavaScript?
Thanks in advance
You can check Jaxer.org, where you tell your JavaScript where to run.
(screenshot: http://jaxer.org/images//Picture+4_0.png)
hope it helps, Sinan.
PHP contains a DOM parser; I would recommend using that to achieve the same results, rather than using server-side JavaScript.
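A minimal sketch of that suggestion using PHP's built-in DOMDocument (the URL is a placeholder):
<?php
// Load a page and parse it with PHP's built-in DOM extension.
$html = file_get_contents('http://example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

// Example traversal: print the href of every link in the document.
foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href'), "\n";
}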
You might want to use something other than JavaScript, but if you really need this, you can run Firefox under Xvfb and connect to it remotely from PHP. It's not exactly trivial to set up, but it's possible.
You might want to try with something like SimpleBrowser instead.
You might want to try installing GromJS, but success depends on the complexity of your JS code. As far as I can see, GromJS does not have a DOM :(
A much more complex project, Narwhal, does have a DOM and a lot more.
For more information, refer to Mozilla hub about ServerJS.