I need to get the contents of a website through PHP; however, the content is only available when JavaScript is enabled. The workaround I am using now is an AppleScript that opens the website in Safari, selects all of the page content, copies it to the clipboard, and pastes it.
That will be really hard to achieve, I guess. If you observe the JS on that page that is responsible for getting the content ready, you may discover it's just another AJAX call that you can make directly from your PHP script.
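For example, a minimal sketch of that approach with cURL, assuming you have already found the endpoint with your browser's network inspector (the URL, parameter, and header below are purely illustrative):

    <?php
    // Hypothetical AJAX endpoint discovered by watching the page's requests.
    $ch = curl_init('https://example.com/ajax/content.php?id=42');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Some endpoints check this header to tell AJAX calls from page loads.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    $content = curl_exec($ch);
    curl_close($ch);
    echo $content;

If the endpoint returns JSON rather than HTML, run the result through json_decode() before using it.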
Best possible solution: ask the website owner for API/export access ;)
If that is not possible, you can only pray that you can analyze the requests that are initiated via JavaScript and imitate them.
(Possible tools: Firefox with the Firebug or Tamper Data plugin.)
Warning: the owner of the website might not like this approach; in fact, scraping the data automatically may be disallowed.
What do you mean by:
the content is only available when JavaScript is enabled
Does the page pull data from somewhere via JS? Would it be easier to analyse where the data is coming from and access that place directly from PHP?
I am currently trying to load an HTML page via cURL. I can retrieve the static HTML content, but part of the page is loaded later by script (an AJAX POST), and I cannot retrieve that part (it is a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
CURL does nothing more than download a file from a URL -- it doesn't care whether it's HTML, Javascript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything, it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some Javascript that downloads something else, then run more Javascript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not CURL.
Since your goal involves "running some Javascript code", it should be fairly clear that it is not achievable without having a Javascript interpreter available. This means that it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
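If you'd rather not pull in a library, here is a rough sketch of driving PhantomJS straight from PHP by shelling out (this assumes phantomjs is installed and on the PATH; the URL and the fixed two-second wait are placeholders):

    <?php
    // A throwaway PhantomJS script: open the URL passed on the command line,
    // give the page's JavaScript a moment to run, then print the rendered DOM.
    $js = "var page = require('webpage').create();\n"
        . "page.open(require('system').args[1], function () {\n"
        . "    window.setTimeout(function () {\n"
        . "        console.log(page.content);\n"
        . "        phantom.exit();\n"
        . "    }, 2000);\n"
        . "});\n";
    file_put_contents('/tmp/render.js', $js);

    $url  = 'https://example.com/page-with-dynamic-table.html';
    $html = shell_exec('phantomjs /tmp/render.js ' . escapeshellarg($url));
    // $html now holds the finished HTML, ready for DOMDocument or similar.

In real code you would poll for the table's presence instead of sleeping a fixed two seconds, but the shape is the same.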
I hope that helps.
(*) yes, I'm aware of the PHP extension that provides a JS interpreter inside of PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, would still be difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No, the only way you can do that is to make a separate cURL request that replays the AJAX request, and put the two results together afterwards, along the lines of the sketch below.
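A hedged sketch of that, where both URLs and the placeholder <div> are assumptions about your particular page:

    <?php
    function fetch_url($url, $postFields = null) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        if ($postFields !== null) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
        }
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    // Fetch the static page, then replay the AJAX POST its JS would make.
    $page  = fetch_url('https://example.com/page.html');
    $table = fetch_url('https://example.com/ajax/table.php', array('page' => 1));

    // Splice the table into the empty container the JavaScript would have filled.
    $full = str_replace('<div id="results"></div>',
                        '<div id="results">' . $table . '</div>',
                        $page);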
I have a small script that pulls HTML from another site using Javascript.
I want to include that static HTML that gets pulled into a PHP page, without any of the Javascript code appearing in the final page that gets displayed.
I tried doing an include of the file with the Javascript code in the PHP page, but that just included the actual Javascript source, not the results of running it.
So how would I go about doing this?
You would need to fetch the page, execute the JavaScript in it, then extract the data you wanted from the generated DOM.
The usual approach to this is to use a web automation tool such as Selenium.
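A sketch of what that could look like from PHP, assuming the php-webdriver library and a Selenium server running locally (the URL and the fixed sleep are placeholders; real code would wait for a specific element):

    <?php
    require 'vendor/autoload.php';

    use Facebook\WebDriver\Remote\RemoteWebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;

    // Connect to a locally running Selenium server.
    $driver = RemoteWebDriver::create('http://localhost:4444/wd/hub',
                                      DesiredCapabilities::firefox());
    $driver->get('https://example.com/js-powered-page');
    sleep(2); // crude wait for the page's JavaScript to finish
    $html = $driver->getPageSource(); // the generated DOM, after the JS ran
    $driver->quit();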
You simply can't.
You need to understand that PHP and Javascript run in different places: PHP on the server and Javascript on the client.
Your only solution is to change the way all this is done and use file_get_contents($url) from PHP to fetch the same content your Javascript used to get. That way there is no Javascript any more, and you can still pre-process your page with the remote content.
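A minimal sketch, assuming the page's Javascript sends a POST to an endpoint you have identified (the URL and parameters are placeholders):

    <?php
    // Replay the request the client-side code would have made.
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query(array('id' => 42)),
        ),
    ));
    $content = file_get_contents('https://example.com/ajax/data.php', false, $context);
    if ($content === false) {
        die('Request failed');
    }
    // $content can now be embedded in your page before it is sent to the client.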
You wouldn't be able to do this directly from within PHP, since you'd need to run Javascript code.
I'd suggest passing the URL (and any required actions such as click events, etc.) to a headless browser such as Phantom or Zombie, and capturing the DOM from it once the JS engine has done its work.
You could also use a real browser, but of course you don't need a UI in your case, and it might actually get in the way of what you're trying to do, so a headless browser might be better.
This sort of thing would normally be used for automated testing of a site (i.e. functional testing).
There is a PHP tool named Mink which can run these sorts of scripts from within a PHP program. It is aimed at writing test scripts, but I would imagine you could use it for your purposes.
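A guess at what that could look like, assuming the behat/mink and behat/mink-selenium2-driver packages plus a running Selenium server (a sketch, not a tested recipe):

    <?php
    require 'vendor/autoload.php';

    use Behat\Mink\Mink;
    use Behat\Mink\Session;
    use Behat\Mink\Driver\Selenium2Driver;

    // Register a browser session backed by Selenium.
    $mink = new Mink(array('browser' => new Session(new Selenium2Driver('firefox'))));
    $mink->setDefaultSessionName('browser');

    $session = $mink->getSession();
    $session->visit('https://example.com/js-powered-page'); // placeholder URL
    $html = $session->getPage()->getContent(); // the DOM after the JS engine ran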
Hope that helps.
I'm generating some content through an API, accessed by JavaScript, and I cannot grab the source code of what is plainly displayed, post-load, in the browser. I can highlight the text and view the source of the selection (which is a Firefox feature), but I will be using cURL to capture the data automatically with PHP... How can I capture the data? Is there a way to update the source (maybe through a DOM update) so it displays somehow? Any help is appreciated.
You can't just request some HTML source and expect the results of modifying it with JS to be in place without running the JS, so if you want to get the content in PHP, you will have to either:
Push the HTML through something that will execute the JavaScript (I'd probably look to WWW::Mechanize::Firefox if I were using Perl; it uses MozRepl. I don't know if PHP has a similarly nice API for it), or
Reverse engineer the JavaScript and do whatever it does to get the data yourself.
You can pull up the page source using Google Chrome from within developer tools (wrench in the top right -> Tools -> Developer tools, or Control+Shift+I (that's an uppercase i)). The source code shown in the developer tools represents the up-to-date source code of the page, including things that were generated dynamically by JavaScript after the page initially loads.
I'm sure other browsers have similar capabilities, I just happen to know Chrome's method off the top of my head.
If your development environment is Linux/Unix, you could incorporate PhantomJS, which is a very nifty tool that executes the JavaScript and passes back the output. The way I would recommend doing this is with a shell_exec() call in which you run the PhantomJS CLI.
Hope this helps.
What I basically want to do is get content from a website and load it into a div of another website. This should be no problem so far.
The problem is that the content to be fetched is located on a different server, and I have no source access to it.
I'd prefer a solution using JavaScript or jQuery.
Can I use an .htaccess redirect to fetch the content from a remote server with client-side (JS) techniques?
I'm open to other solutions as well, though.
Thanks a lot in advance!
You can't execute an AJAX call against a different domain, due to the same-origin policy. You can add a <script> tag to the DOM which points at a Javascript file on another domain. If this JS file contains some JSON data that you can use, you're all set.
The only problem is you need to get at the JSON data somehow, which is where JSON-P callbacks come into the picture. If the foreign resource supports JSON-P, it will give you something that looks like
    your_callback({
        // JSON data
    });
You then specify your code in the callback.
See JSONP for more.
If JSONP isn't an option, then the best bet is to probably fetch the data server-side, say with a cron job every few minutes, and store it locally on your own site.
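For instance, a tiny sketch of such a cron job (the URL and cache path are placeholders):

    <?php
    // fetch_remote.php -- run every few minutes from cron, e.g.:
    //   */5 * * * * php /path/to/fetch_remote.php
    $data = file_get_contents('https://other-site.example/data.json');
    if ($data !== false) {
        file_put_contents(__DIR__ . '/cache/remote-data.json', $data);
    }

Your page then reads the cached file locally, with no cross-domain request at all.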
You can use a server-side XMLHTTP request to grab your content from the other server. You can then parse it on your server (a.k.a. screen scraping) and serve up the portion you want along with your web page.
If the content from the other website is just an HTML doc that you want to display on your site, you could also use an iframe to pull it in. You won't have access to any of its content because of browser security rules.
You will likely have to "scrape" the data you need and store it on your server.
This is a great tutorial on how to cache data from an external site. It is actually written to fetch and store XML, so it'll need some modification. Also, if your site doesn't allow file_get_contents() on URLs, then you may have to modify it to use cURL.
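A hedged adaptation of that pattern for HTML, with a cURL fallback for hosts where allow_url_fopen is off (the URL, cache path, and ten-minute TTL are placeholders):

    <?php
    $url       = 'https://other-site.example/page.html';
    $cacheFile = __DIR__ . '/cache/external.html';
    $ttl       = 600; // re-fetch after ten minutes

    if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
        if (ini_get('allow_url_fopen')) {
            $html = file_get_contents($url);
        } else {
            // Fallback for hosts that disable URL wrappers.
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $html = curl_exec($ch);
            curl_close($ch);
        }
        if ($html !== false) {
            file_put_contents($cacheFile, $html);
        }
    }
    echo file_get_contents($cacheFile);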
While cross-site scripting is generally regarded as negative, I've run into several situations where it's necessary.
I was recently working within the confines of a very limiting content management system. I needed to include database code within the page, but the hosting server didn't have anything usable available. I set up a couple bare-bones scripts on my own server, originally thinking that I could use AJAX to import the contents of my scripts directly into the template of the CMS (thus retaining dynamic images, menu items, CSS, etc.). I was wrong.
Due to the limitations of XMLHttpRequest objects, it's not possible to grab content from a different domain. So I thought iFrame - even though I'm not a fan of frames, I thought that I could create a frame that matched the width and height of the content so that it would appear native. Again, I was blocked by cross-site scripting "protections." While I could indeed load a remote file into the iFrame, I couldn't execute JavaScript to modify its size on either the host page or inside the loaded page.
In this particular scenario, I wasn't able to point a subdomain to my server. I also couldn't create a script on the CMS server that could proxy content from my server, so my last thought was to use a remote JavaScript.
A remote JavaScript works. It breaks when the user has JavaScript disabled, which is a downside; but it works. The "problem" I was having with using a remote JavaScript was that I had to use the JS function document.write() to output any content. Any output that isn't JS causes script errors. In addition to using document.write() for every line, you also have to ensure that the content is escaped - or else you end up with more script errors.
My solution was as follows:
My script received a GET parameter ("page"), then looked for the file ({$page}.php) and read its contents into a variable. However, I had to use awkward buffering techniques in order to actually execute the included scripts (for things like database interaction), then strip all line-break characters (\n) from the final content and escape all required characters. The end result is that my original script (which outputs JavaScript) accesses seemingly "standard" scripts on my server and converts their standard output to JavaScript for displaying within the CMS template.
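A hedged reconstruction of that technique (my guess at its shape, not the author's actual code; the whitelist and file names are illustrative):

    <?php
    // Only serve pages we explicitly allow -- never include raw user input.
    $page = basename(isset($_GET['page']) ? $_GET['page'] : '');
    $allowed = array('news', 'events'); // hypothetical page names
    if (!in_array($page, $allowed, true)) {
        exit;
    }

    // Buffer the included script so its output (including database-driven
    // HTML) is captured as a string instead of being sent to the client.
    ob_start();
    include __DIR__ . '/' . $page . '.php';
    $content = ob_get_clean();

    // Emit everything as one document.write() call. json_encode handles the
    // escaping (quotes, newlines, and it turns </script> into <\/script>).
    header('Content-Type: application/javascript');
    echo 'document.write(' . json_encode($content) . ');';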
While this solution works, it seems like there may be a better way to accomplish the same thing. What is the best way to make cross-site scripting work specifically for the purpose of including content from a completely different domain?
You've got three choices:
Create a server-side proxy script (see the sketch after this list).
Create a remote script to read in remote dynamic HTML. Use a library like jQuery to make this easier; you can use its load function to inject HTML where needed. EDIT: What I originally meant for option 2 was utilizing JSONP, which requires the server-side script to recognize the "callback=?" param.
Use a client-side Flash proxy and set up a crossdomain.xml file in your server's web root.
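For the first choice, a minimal sketch of such a proxy; the whitelist is essential, since an open proxy that fetches whatever URL the client asks for is a security hole (the hostname is a placeholder):

    <?php
    // proxy.php?url=... -- fetch an approved remote resource server-side.
    $allowed = array('https://other-site.example/widget.html');
    $url = isset($_GET['url']) ? $_GET['url'] : '';
    if (!in_array($url, $allowed, true)) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
    echo file_get_contents($url);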
Personally, I would call that other domain from the server, and get and parse the data there for use in your page. That way you avoid any problems and you get the power of a server-side language/platform for getting and parsing the data.
Not sure if that would work for your specific scenario...hard to know even with your verbose description...
You could try easyXDM; by including very little code, you can pass data or method calls between documents on different domains.
I've come across that YDN server-side proxy script before. It says it's built to work with Yahoo's Search APIs.
Will it work with any domain, if you simply trim the Yahoo API code out? Or do you need to replace it with the domain you want it to work with?
Content in a remote iframe can be accessed by local JavaScript, but only when both sites share a common parent domain; each page then has to set document.domain to that parent.
Eg:
Site A (at www.example.com) contains an iframe with src='http://sub.example.com/home.php' pointing at Site B, and home.php looks like this:

    <?php /* ...PHP stuff... */ ?>
    <script type="text/javascript">document.domain = 'example.com';</script>

The page on Site A must also set document.domain = 'example.com' for the two documents to see each other.