I'm currently using curl to log in to a site and grab the HTML for one of its pages. My problem is that the page has some AJAX links on it (clicking a link results in HTML changes). How would I be able to simulate those clicks and get the HTML of the final state using PHP? From researching this, it seems like I need some sort of headless browser? Is there something like that I can use from PHP?
I'm not aware of any headless browser that supports JavaScript/AJAX and that you can drive with PHP. If you want to drive a real browser with PHP, see http://seleniumhq.org/
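For instance, a minimal sketch using the php-webdriver bindings; this assumes a Selenium server is running at localhost:4444 and the library is installed via Composer, and the URL and link text are placeholders:
<?php
// Drive a real browser from PHP via Selenium (sketch, not a drop-in solution).
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;

$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::firefox());
$driver->get('http://www.example.com/page');                       // placeholder URL
$driver->findElement(WebDriverBy::linkText('Load more'))->click(); // placeholder link text
$html = $driver->getPageSource();                                  // HTML after the AJAX click has run
$driver->quit();
?>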
Had this exact problem a few minutes ago. This works like a charm. Use .live() as the top answer here explains.
Reload javascript file after an AJAX request
Tested and works.
Please let me know: is it possible to scrape some info loaded via AJAX with PHP? I have only used SIMPLE_HTML_DOM for static pages.
Thanks for any advice.
Scraping the entire site
Scraping dynamic content requires you to actually render the page. A PHP server-side scraper will just do a simple file_get_contents or similar. Most server-based scrapers won't render the entire site and therefore don't load the dynamic content generated by the AJAX calls.
Something like Selenium should do the trick. A quick Google search finds numerous examples of how to set it up. Here is one
Scraping JUST the Ajax calls
Though I wouldn't consider this scraping, you can always examine an AJAX call using your browser's dev tools. In Chrome, while on the site, hit F12 to open the dev tools console.
You should then see the dev tools window. Hit the Network tab and then hit Chrome's refresh button. This will show every request made between you and the site. You can then filter out specific requests; for example, if you are interested in AJAX calls, select XHR.
You can then click on any of the listed items in the table to get more information.
Using file_get_contents on the AJAX call
Depending on how robust the APIs behind these AJAX calls are, you could do something like the following.
<?php
// Request the AJAX endpoint directly (URL taken from the dev tools network tab)
$url = "http://www.example.com/test.php?ajax=call";
$content = file_get_contents($url);
?>
If the return is JSON, then add
$data = json_decode($content);
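Some endpoints only respond to requests that look like real AJAX; a stream context lets file_get_contents send the X-Requested-With header a browser would. A minimal sketch (whether the endpoint actually checks this header is an assumption, and the URL is the same placeholder as above):
<?php
// Send the header an XMLHttpRequest would carry, then decode the JSON reply.
$context = stream_context_create(array(
    'http' => array(
        'header' => "X-Requested-With: XMLHttpRequest\r\n",
    ),
));
$content = file_get_contents("http://www.example.com/test.php?ajax=call", false, $context);
$data = json_decode($content, true); // true returns an associative array
?>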
However, you are going to have to do this for each AJAX request on the site. Beyond that, you are going to have to use a solution similar to the ones presented [here].
Finally, you can also use PhantomJS to render an entire site.
Summary
If all you want is the data returned by specific AJAX calls, you might be able to get it using file_get_contents. However, if you are trying to scrape an entire site that also uses AJAX to manipulate the document, then you will NOT be able to use SIMPLE_HTML_DOM alone.
Finally, I worked around my problem: I took the POST URL with all its parameters from the AJAX call and made the same request myself, then parsed the result with the SIMPLE_HTML_DOM class.
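In case it helps anyone, roughly what that looks like; the endpoint, POST fields, and selector below are placeholders read out of the request shown in the browser's network tab:
<?php
// Replay the AJAX POST with curl, then parse the returned HTML fragment
// with SIMPLE_HTML_DOM. All names here are illustrative placeholders.
include 'simple_html_dom.php';

$ch = curl_init("http://www.example.com/ajax/list.php");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('page' => 2, 'sort' => 'date')));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$fragment = curl_exec($ch);
curl_close($ch);

$dom = str_get_html($fragment);
foreach ($dom->find('div.item') as $item) {
    echo $item->plaintext . "\n";
}
?>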
I want to get the HTML source of a webpage generated by JavaScript, using cURL (PHP).
I tried curl but I just get the JavaScript code :(
Can I use Ruby to solve my problem?!
The JavaScript is executed by the browser to generate the HTML. If you make a request with cURL, it will just show you the raw HTML content. You would need a JavaScript engine to process the JavaScript after receiving the response body.
Just look at any web inspector tool (in Chrome, Ctrl+Shift+I). There you can see the changes that the JavaScript makes reflected in the page. I don't think curl or any curl-like tool can do this.
This is a tough problem because the JavaScript has to run to get the right code. What I would do is download all the code locally and then add an AJAX call to it, so it can send the source back to you after all the JS has run. Then run the code in a browser.
If you need to do this many times, you could queue the pages that need to be loaded in a database and load each page with PHP. Once the JS has sent the code back to the server, the page can refresh and pull the next one off the queue; a minimal receiving endpoint is sketched below.
Let me know if you need me to clarify anything.
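For the server side of that round trip, something like this could receive the rendered source; collect.php, the field names, and the storage path are all made-up placeholders:
<?php
// collect.php - hypothetical endpoint that stores the rendered HTML
// posted back by the injected AJAX call.
$url  = isset($_POST['url'])  ? $_POST['url']  : '';
$html = isset($_POST['html']) ? $_POST['html'] : '';
if ($url !== '' && $html !== '') {
    file_put_contents('rendered/' . md5($url) . '.html', $html);
}
?>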
This can be done with headless browser activity, like PhantomJS: a great way to build whatever logic you want and then read the result back for PHP from the console output. You can try https://github.com/jonnnnyw/php-phantomjs and also https://github.com/ariya/phantomjs
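As a rough sketch of the simplest route, shelling out to the phantomjs binary from PHP; this assumes phantomjs is installed and on the PATH, and render.js and the URL are placeholders:
<?php
// Write a small PhantomJS script that prints the page's HTML after the
// JavaScript has run, then capture its console output in PHP.
$script = <<<'JS'
var page = require('webpage').create();
var url  = require('system').args[1];
page.open(url, function (status) {
    if (status === 'success') {
        console.log(page.content); // HTML after the JavaScript has run
    }
    phantom.exit();
});
JS;
file_put_contents('render.js', $script);

$url  = 'http://www.example.com/ajax-page'; // placeholder
$html = shell_exec('phantomjs render.js ' . escapeshellarg($url));
?>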
As many of you have noticed, when you click a link to switch from page to page in Google+ or Facebook, the URL changes and the body changes, but some parts of the page don't, like the chat box. I believe AJAX can change specific content on the page by requesting a PHP page from the server and getting some results back, but that won't change the URL.
Actually, I didn't know exactly how to search for this on Google, so any keywords/names/links will be strongly appreciated.
I'm using the jQuery library for JavaScript and the Symfony2 framework for PHP, if this helps.
Look at the jQuery .load() method.
http://api.jquery.com/load/
All you need to do is use a selector and pass .load() the URL to fetch:
$('#mydiv').load('newcontent.html');
Very powerful function. Look it up!
edit:
Sorry, I missed the URL-change part. The trick a lot of times with the URL is the hashtag. If you look closely at the URL there will be a "#" pound symbol in there somewhere. This allows the site to store its current state without a reload.
Currently there is no way to change the URL in the browser, save for the bit after the hashtag, without fully reloading the page.
You can use either an iframe or AJAX to keep part of your page static. To change the URL, you can use the hash hack:
window.location.hash = "pageidentifier"
or you can use the HTML5 History API (history.pushState) described in the URL provided by arxanas.
I am working on a PHP-based scraper/crawler, which works fine until it gets a .NET-generated href link, __doPostBack(...). Any idea how to deal with this and crawl the pages behind those links?
Instead of trying to automate clicking the JavaScript button, which requires additional libraries in PHP, try replicating the request your browser sends after clicking the button. There are various Firefox extensions that will help you examine the request, such as TamperData, Firebug, and Live HTTP Headers.
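A rough sketch of what replaying a __doPostBack(...) request looks like with curl; __EVENTTARGET, __EVENTARGUMENT, and __VIEWSTATE are standard ASP.NET form fields, but all the values and the URL below are placeholders you read out of the page and the captured request:
<?php
// Replay the postback that __doPostBack('target', 'arg') would trigger.
$viewstate       = '...'; // scrape from the page's hidden __VIEWSTATE input first
$eventvalidation = '...'; // likewise, if the page has a hidden __EVENTVALIDATION input

$fields = array(
    '__EVENTTARGET'     => 'ctl00$Main$lnkNextPage', // first __doPostBack argument
    '__EVENTARGUMENT'   => '',                       // second __doPostBack argument
    '__VIEWSTATE'       => $viewstate,
    '__EVENTVALIDATION' => $eventvalidation,
);
$ch = curl_init("http://www.example.com/list.aspx");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$nextPage = curl_exec($ch); // HTML of the page behind the link
curl_close($ch);
?>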
How do I hide an iframe URL in the HTML source code? I have two applications; one of them loads a URL from the other into its iframe, so that URL is displayed in its source code. I don't want to expose the other application's URL in the source code.
I think you would need to set the iframe URL via JavaScript. The JavaScript could then be obfuscated, so that the URL is not in plain text. See the following link for the obfuscator:
http://www.javascriptobfuscator.com/Default.aspx
e.g., with jQuery...
$("#myiFrame").attr('src','http://www.google.com');
becomes:
var _0xc1cb=["\x73\x72\x63","\x68\x74\x74\x70\x3A\x2F\x2F\x77\x77\x77\x2E\x67\x6F\x6F\x67\x6C\x65\x2E\x63\x6F\x6D","\x61\x74\x74\x72","\x23\x6D\x79\x69\x46\x72\x61\x6D\x65"];$(_0xc1cb[3])[_0xc1cb[2]](_0xc1cb[0],_0xc1cb[1]);
You can't hide it per se, but you can run it through something like TinyURL so that anyone interested would need to take an extra step. That's the only thing I can think of. However, if you are displaying that page in a frame, what's the harm in having the URL in the source code? There really isn't a good, foolproof way to prevent someone determined from finding the location of that iframe page.
You can create a PHP script which uses curl to fetch the URL through localhost, then use this script as your iframe source.
If you have an issue with relative links and subdirectories, you can put your curl script inside the subdirectory.
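A minimal sketch of such a script; proxy.php and the target URL are placeholders:
<?php
// proxy.php - same-origin curl pass-through. The target URL stays
// server-side, so it never appears in your page's source.
$ch = curl_init("http://other-app.example.com/page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
echo curl_exec($ch);
curl_close($ch);
?>
Your iframe's src would then point at proxy.php instead of the real URL.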