I need to load content from a remote URI into a PHP variable locally. The remote page only shows content when JavaScript is turned on. How can I get around this?
Essentially, how can I use cURL for pages that require JavaScript-loaded content?
Mink was the only PHP headless browser that I could find.
As noted, Selenium is another popular choice. I don't know how well these will perform if you have a lot of scraping to do, though; they seem to be geared more towards testing.
A number of other languages have them, as listed in the link below. Since PHP does not process JavaScript, you will need another tool. Headless browsers expose the JavaScript engine and allow you to interact with the browser programmatically.
headless internet browser?
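As a rough sketch (not a drop-in solution), fetching a JavaScript-rendered page with Mink's Selenium2 driver could look like this. It assumes the behat/mink and behat/mink-selenium2-driver Composer packages plus a locally running Selenium server, and the URL is just a placeholder:

    <?php
    // Rough sketch: load a JS-driven page via Mink and grab the rendered DOM.
    require 'vendor/autoload.php';

    use Behat\Mink\Driver\Selenium2Driver;
    use Behat\Mink\Session;

    $session = new Session(new Selenium2Driver('chrome'));
    $session->start();
    $session->visit('https://example.com/js-only-page');   // placeholder URL
    $html = $session->getPage()->getContent();             // DOM after JavaScript has run
    $session->stop();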
To do this you have to emulate a browser using a browser automation tool such as Selenium. This will involve slightly more than just a simple GET request, though.
http://seleniumhq.org/
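If you drive Selenium from PHP, the php-webdriver bindings are one way to do it. A minimal sketch, assuming the php-webdriver/webdriver Composer package and a Selenium server on its default port (the URL is a placeholder):

    <?php
    require 'vendor/autoload.php';

    use Facebook\WebDriver\Remote\RemoteWebDriver;
    use Facebook\WebDriver\Remote\DesiredCapabilities;

    // Connect to a locally running Selenium server and open the page in Chrome.
    $driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
    $driver->get('https://example.com/js-only-page');   // placeholder URL
    $html = $driver->getPageSource();                    // HTML after the browser ran the JavaScript
    $driver->quit();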
I am trying to get the HTML code for any web page, but only after it is fully loaded.
I tried cURL and file_get_contents, and I understand by now that they do not wait for JavaScript to finish.
I know by now that the solution is to use a headless browser. I tried PhantomJS, but it is a little inefficient, since the only method I found to make it wait is to set a constant timeout period.
Also, I found out that, in general, it is almost impossible to tell when a page has actually fully loaded, and that the best approach is to keep checking the network activity until it completely stops.
I believe that checking for the existence of some content on the page would work just fine for my use, but as far as I know, the only way to implement that is the Puppeteer package, which only works well with NodeJS, not PHP.
So, does anyone know of an efficient method to get the HTML code after the page has fully loaded in PHP, without going through the complex process of integrating other programming languages or platforms?
I don't think you'll be able to accomplish it with PHP alone, since PHP is not a browser and can't run JavaScript. You can use something like headless Chromium and run chrome --headless --disable-gpu --dump-dom https://www.chromestatus.com/, which unfortunately can't tell you exactly when the page is "fully loaded", but I'm sure you can run it on some kind of delay.
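From PHP you could shell out to that command and capture the output into a variable, roughly like this (assumes the chrome binary is on the PATH; the URL is just the example from above):

    <?php
    // Rough sketch: have headless Chrome render the page and dump the DOM,
    // then read the result back into a PHP variable.
    $url = 'https://www.chromestatus.com/';
    $cmd = sprintf(
        'chrome --headless --disable-gpu --dump-dom %s 2>/dev/null',
        escapeshellarg($url)
    );
    $html = shell_exec($cmd);   // rendered HTML as a string, or null on failure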
Maybe something like using JavaScript, once the page has finished loading, to put all of the page content into a variable and then send that variable to a PHP script via Ajax?
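The PHP side of that could be a tiny endpoint that stores whatever the browser posts; the field name and file path below are made up for illustration, and the client-side JavaScript would POST document.documentElement.outerHTML to it once the page has loaded:

    <?php
    // save_dom.php - hypothetical endpoint receiving the rendered DOM from the browser.
    $html = $_POST['html'] ?? '';
    if ($html !== '') {
        file_put_contents(__DIR__ . '/snapshot.html', $html);
        echo 'saved';
    }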
So I am using AJAX to call a server file which uses WordPress to populate a page's content and return it, which I then use to populate fields. What I am confused about is: how do I create the snapshot, and what do I have to do, besides using #!, to let Google know I am creating one? And why do I do this at all? The escaped fragments are a little unclear too, and I would appreciate a more detailed explanation. Does anyone have any tutorials that walk through a process similar to what I am doing?
David
Google's crawlers don't typically run your JavaScript. They hit your page, scrape your HTML, and move on. This is much more efficient than loading your page and all of its resources, running your JavaScript, guessing at when everything finished loading, and then scraping data out of the DOM.
If your site uses AJAX to populate the page with content, this is a problem for Google and others. Your page is effectively empty... void of any content... in its HTML state. It requires your JavaScript to fill it in. Since the crawlers don't run your JavaScript, your page isn't all that useful to the crawler.
These days, there are an awful lot of sites that blur the line between web-based applications and content-driven sites. These sites (like yours) require client-side code to run to produce the content. Google doesn't have the resources to do this on every site they encounter, but they do provide an option. That's the info you found about escaped anchor fragments.
Google has given you the opportunity to do the work of scraping the full finished DOM for them. They have put the CPU and memory burden of running your JavaScript back on you. You can signify to Google that this is encouraged by using links with #!. Google sees this and knows that they can then request the same page, but convert everything after #! (which isn't sent to the server) to ?_escaped_fragment_= and make a request to your server. At this point, your server should generate a snapshot of the complete finished DOM, after the JavaScript has run.
The good news is that these days you don't have to hack a lot of code in place to do it. I've written a server to do this using PhantomJS. (I'm trying to get permission to open the source code up, but it's in legal limbo, sorry!) Basically, PhantomJS is a full WebKit web browser, but it runs without a GUI. You can use PhantomJS to load your site, run all the JavaScript, and then, when it's ready, scrape the HTML back out of the page and send that version to Google. This doesn't require you to do anything special, other than fixing your routing to point requests containing _escaped_fragment_ at your snapshot server.
You can do this in about 20 lines of code. PhantomJS even has a mini web server built into it, but they recommend not using it for production code.
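The routing side of it can be very small. As a hedged sketch (the snapshot server address and query parameter below are assumptions, not a fixed convention):

    <?php
    // If Google requests the page with ?_escaped_fragment_=..., return a
    // pre-rendered snapshot instead of the normal JavaScript-driven page.
    if (isset($_GET['_escaped_fragment_'])) {
        $fragment = $_GET['_escaped_fragment_'];
        // Ask a local snapshot service (e.g. one built on PhantomJS) for the
        // fully rendered DOM of this route, then hand it straight back.
        echo file_get_contents(
            'http://localhost:8080/snapshot?path=' . urlencode($fragment)
        );
        exit;
    }
    // ...otherwise fall through and serve the normal page.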
I hope this helps clear up some confusion!
I am working in a complex web framework with lots of JavaScript in the frontend and lots of PHP in the backend. Since I'm new to it, finding out the workflow is quite a hassle. Is there a way to log the complete sequence of function calls (in PHP, JS or both) from the moment a request is sent until the response is processed (or until the JS that runs after the request has been sent is executed)? That would be really helpful.
There's no perfect solution here, but you will probably just have to use browser debugging tools like Firefox's Firebug or Chrome's debugger. Using these tools you can see JavaScript errors, AJAX requests, PHP (server) responses, individual page file loads, etc.
All I can think of would be using the Network tab in the Chrome developer tools to see the sequence of events, or maybe even better, the HTTPFox plugin in Firefox.
That will help you to find out the execution order of JS calls and which PHP files are accessed via AJAX.
What's happening on the server side isn't easy to follow, but you could debug your code using Xdebug.
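For example, Xdebug's function tracing can log every PHP call made while handling a request. A rough sketch, assuming the Xdebug extension is installed and tracing is enabled (the exact ini settings differ between Xdebug 2 and 3, and the file path is just an example):

    <?php
    // Start a function trace at the top of the front controller...
    xdebug_start_trace('/tmp/request_trace');   // writes /tmp/request_trace.xt

    // ... the framework handles the request here ...

    // ...and stop it at the end; the .xt file lists every call in order.
    xdebug_stop_trace();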
I have created an HTML page which sends custom data to a PHP file, which then processes and evaluates it.
My next task is to make this into a GUI with the requirements:
1. A box for a custom search with a button (it then posts this into the PHP)
2. A box where xml/json request can be seen
3. A box where the xml/json response can be seen
4. A box where the parsed version is translated and made to look pretty.
***MUST CONNECT TO INTERNET, PHP ESTABLISHES CONNECTION BUT DO NOT WANT A GUI ISSUE
Any suggestions on programs, languages, etc. which can help me communicate with PHP in GUI form? It needs to be able to access the internet!
I was thinking perhaps Visual Basic, as that's the only one I've ever used that really uses GUIs, but I'm wondering what you all think!
Thanks!
Basically, what you're asking for is a web browser, with a very simple little HTML/Javascript front-end web page to make the PHP calls and display the results. I'm not entirely sure what it is about a browser environment that makes you think it's unsuitable, but it's basically exactly what you're asking for.
If a full-blown web browser really isn't suitable, you could try using a web browser control inside a simple GUI app. This would still work exactly the same, but would be without the browser controls, such as the URL bar.
Just use a browser.
If you don't want to do that -- build a browser.
If you are just looking for basically a web-based REST testing tool, try the Firefox RESTClient plug-in.
Why don't you use a framework?
You may take a look here:
AppJS for Linux, Windows and Mac, using HTML, CSS and JavaScript
Adobe AIR: cross-platform, using ActionScript/Flex or HTML/JavaScript
Titanium: HTML/CSS (no longer supported)
PhoneGap: mainly used for cross-phone-platform development, but here's a Windows implementation of it (you should read the README.md ...)
You may also check this from Mozilla
I have a simple PHP-driven website running and I'm trying to figure out how it handles PHP pages. Some of my PHP documents are routing logic and some are just includes for individual pages. How do I go about making this work offline?
What I thought was that I'd have to re-create the routing logic in JavaScript. Is that my only option? In that case, is it even possible to have the site be driven by PHP while online and switch to JS offline? I can't make sense of it.
If your site is fairly static, HTML5's cache manifest may get you most of the way there. Have PHP output a cache.manifest file in the correct format with all your routing system's URLs and those URLs will be stored locally in a compliant browser. Attempting to access them will pull them out of the cache if possible.
If you're looking for something more dynamic, though, you're going to have to do more legwork.
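A rough sketch of the PHP side, just to illustrate the idea (the URL list and file name are placeholders):

    <?php
    // cache.manifest.php - emit an HTML5 application cache manifest listing the
    // routes and assets that should be available offline.
    header('Content-Type: text/cache-manifest');

    $urls = [
        '/',
        '/about',
        '/css/site.css',
        '/js/app.js',
    ];

    echo "CACHE MANIFEST\n";
    echo '# version ' . date('YmdHis') . "\n\n";
    echo "CACHE:\n";
    echo implode("\n", $urls) . "\n";

Your pages would then reference it via the manifest attribute, e.g. <html manifest="cache.manifest.php">.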
Here's some good info on offline caching.
It is important to remember that PHP is processed on the server. The result of your PHP code is all that is sent to your browser. Your browser has absolutely no knowledge that PHP was even used to make the page!
If you have some dynamic code that must run offline, then you must use JavaScript. If this is just for testing on your own machine, put a web server running PHP on your dev machine and access it via http://localhost.
HTML5 offline caching does not work to make your pages interact; it works only to make a particular page available offline. Basically, it works on a URL-by-URL basis. If you absolutely need offline functionality, you will be forced to make it work in JS.
Also, make sure your manifest includes all resources used by all pages.
Hope this helps!
It seems obvious that you should not cache a server-side scripting language file in your browser. PHP/JSP/ASP etc. are all server-side languages; the client cannot fulfill a request that needs to be generated dynamically, and, most importantly, there is no server running on the client side. So I think we should go for JS whenever we want to do such things.