Is there a way to retrieve the fully rendered HTML from a page after its JavaScript has run? If I use curl, it simply retrieves the base HTML, but lacks the post-rendering of iframes, JavaScript processing, etc.
What would be the best way to accomplish this?
As no one else has answered (except the comment above, but I'll come to that later), I'll try to help as much as possible.
There is no "simple" answer. PHP can't process JavaScript or navigate the DOM natively, so you need something that can.
Your options as I see it:
If you are after screen grab (which is what I'm hoping as you also want Flash to load), I suggest you use one of the commercial APIs that are out there for doing this. You can find some in this list http://www.programmableweb.com/apitag/?q=thumbnail, for example http://www.programmableweb.com/api/convertapi-web2image
Otherwise you need to run something yourself that can handle JavaScript and the DOM on, or connected to, your server. For this, you'd need an automated browser that you can run server-side and get the information you need. Follow the list in Bergi's comment above and test for a suitable solution. The main one, Selenium, is great for "unit testing" on a known website, but I'm not sure how I'd script it to handle random sites, for example. As you would (presumably) only have one "automated browser" and you don't know how long each page will take to load, you'd need to queue the requests and handle one at a time. You'd also need to ensure pop-up alert()s are handled, install all the third-party libraries (you say you want Flash?!), and handle redirects, timeouts and potential memory hogs (if running this non-stop, you'll periodically want to kill your browser and restart it to clean out the memory!). You'd also need to handle virus attacks, pop-up windows and requests to close the browser completely.
Thirdly, VB has a web-browser component. I used it for a project a long time ago to do something similar-ish, but on a known site. Whether it's possible with .NET (to me, it's a huge security risk), and how you program for unknowns (e.g. pop-ups and Flash), I have no idea. But if you're desperate, an adventurous .NET developer may be able to suggest more.
In summary - if you want more than a screen grab and can choose option 1, good luck ;)
If you're looking for something scriptable with no GUI you could use a headless browser. I've used PhantomJS for similar tasks.
If still relevant, I found that the easy way to do this is using PhantomJS as a service:
// Fetches the rendered page text through the PhantomJsCloud HTTP API.
public string GetPagePhantomJs(string url)
{
    using (var client = new System.Net.Http.HttpClient())
    {
        client.DefaultRequestHeaders.ExpectContinue = false;
        // Request body for the service: the target URL and the output type.
        var pageRequestJson = new System.Net.Http.StringContent(@"{'url':'" + url + "','renderType':'plainText','outputAsJson':false }");
        // Blocking on .Result keeps the method synchronous, as in the original.
        var response = client.PostAsync("https://PhantomJsCloud.com/api/browser/v2/SECRET_KEY/", pageRequestJson).Result;
        return response.Content.ReadAsStringAsync().Result;
    }
}
It is really simple. When subscribing to the service there is a free plan that allows 500 pages/day. Replace SECRET_KEY with the key you receive.
Use a "terminal" browser like w3m or lynx. Even if the site you want to access needs login, this is possible, for example:
curl [-u login:pass] http://www.a_page.com | w3m -T text/html -dump
or
curl [-u login:pass] http://www.a_page.com | lynx -stdin -dump
This should give you the whole html with all frames etc.
Look at the command-line tool IECapt.exe.
It has no javascript support, but lynx was useful for me in a situation where I needed to do processing of data from a webpage. This way I got the (plaintext) rendering and didn't have to filter through the raw html tags as with curl.
lynx -nonumbers -dump -width=9999999 ${url} | grep ... et cetera.
Related
I am currently trying to load an HTML page via cURL. I can retrieve the HTML content, but part is loaded later via scripting (AJAX POST). I can not recover the HTML part (this is a table).
Is it possible to load a page entirely?
Thank you for your answers
No, you cannot do this.
cURL does nothing more than download a file from a URL. It doesn't care whether it's HTML, JavaScript, an image, a spreadsheet, or any other arbitrary data; it just downloads. It doesn't run anything or parse anything or display anything, it just downloads.
You are asking for something more than that. You need to download, parse the result as HTML, then run some Javascript that downloads something else, then run more Javascript that parses that result into more HTML and inserts it into the original HTML.
What you're basically looking for is a full-blown web browser, not CURL.
Since your goal involves "running some JavaScript code", it should be fairly clear that it is not achievable without having a JavaScript interpreter available. This means that it is obviously not going to work inside of a PHP program (*). You're going to need to move beyond PHP. You're going to need a browser.
The solution I'd suggest is to use a very specialised browser called PhantomJS. This is actually a full Webkit browser, but without a user interface. It's specifically designed for automated testing of websites and other similar tasks. Your requirement fits it pretty well: write a script to get PhantomJS to open your URL, wait for the table to finish rendering, and grab the finished HTML code.
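As a rough sketch of what such a PhantomJS script could look like (not a definitive implementation): it opens the page, polls until the table has rows, then prints the rendered HTML. The URL, the '#results tr' selector and the pollUntil helper are all assumptions made up for illustration, so adjust them to your page. The PhantomJS part only runs when the global phantom object exists.

```javascript
// Polls check() until it returns true or timeoutMs has elapsed, then calls done(ok).
// schedule and now are injectable only so the helper can be tried without real timers.
function pollUntil(check, done, opts) {
  var o = opts || {};
  var intervalMs = o.intervalMs || 250;
  var timeoutMs = o.timeoutMs || 10000;
  var schedule = o.schedule || setTimeout;
  var now = o.now || Date.now;
  var start = now();
  (function tick() {
    if (check()) { done(true); return; }
    if (now() - start >= timeoutMs) { done(false); return; }
    schedule(tick, intervalMs);
  })();
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://example.com/page-with-ajax-table'; // placeholder URL (assumption)
  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    pollUntil(function () {
      // '#results tr' is a hypothetical selector for the AJAX-filled table rows.
      return page.evaluate(function () {
        return document.querySelectorAll('#results tr').length > 0;
      });
    }, function (ok) {
      if (ok) console.log(page.content); // HTML after the scripts have run
      phantom.exit(ok ? 0 : 1);
    });
  });
}
```

Saved as, say, render.js, it would be run with "phantomjs render.js" and the HTML read from standard output, which PHP could capture via exec().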
You'll need to install PhantomJS on your server, and then use a library like this one to control it from your PHP code.
I hope that helps.
(*) yes, I'm aware of the PHP extension that provides a JS interpreter inside of PHP, and it would provide a way to solve the problem, but it's experimental, unfinished, would be still difficult to implement as a solution, and I don't think it's a particularly good idea anyway, so let's not consider it for the purposes of this answer.
No, the only way you can do that is to make a separate curl request to the URL the AJAX call uses and put the two results together afterwards.
I think the title asks the question. I usually use PHP for parsing and web scraping, but I have a really hard time scraping JavaScript-generated content; in most cases I can't do it.
Example: parsing a div that only appears once some JavaScript is executed.
I read that Ruby has a parser library for JavaScript, so the question is: which language is best for writing a web scraper that will effectively scrape JavaScript-generated content? Is there a library for PHP, like the one for Ruby, for parsing JavaScript content?
There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.
The idea is, let the browser do the work, as the page is made for a browser and not your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application running things.
This may be way overkill for what you need though. I'll leave it up to you to decide.
You should look at some GUI-less/headless browsers. There are some written for Java. I didn't find one for PHP.
Look at :
HTMLUnit
Golf
You can try using something like Selenium, which allows you to automate browser tasks.
On the other hand, you can go into details on what happens when the js code is executed. For example, if the js code is requesting something from the server by POSTing some data, you could emulate that in the regular fashion.
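If the content really does come from a single POST, emulating it could look roughly like the sketch below. It assumes Node 18+ (where fetch is built in); the endpoint URL, the field names and the buildFormPost helper are hypothetical, and the real values would come from watching the request in the browser's network tab.

```javascript
// Builds the options for a form-encoded POST, mimicking what the page's own script sends.
function buildFormPost(fields) {
  return {
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      // Some sites use this header to recognise AJAX calls (assumption: yours may not need it).
      'X-Requested-With': 'XMLHttpRequest'
    },
    body: new URLSearchParams(fields).toString()
  };
}

// Hypothetical endpoint and fields; returns the fragment the browser would have inserted.
async function loadFragment() {
  const res = await fetch('http://example.com/ajax/table', buildFormPost({ page: '1', sort: 'name' }));
  return res.text();
}
```

The same request can of course be made with cURL from PHP; the point is only that the POST body and headers have to match what the page's script sends.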
You should look at PhantomJS and CasperJS (headless browsers).
In the Ruby world, the gem for running PhantomJS would be Poltergeist.
There is another article about some of the options you have in ruby here too (however they are not all js capable)
⚠ In this question, PHP is used in an unusual way. It is not used as a server side language ("no browser is open"). It is intended to be run on my own computer, simulating mouse move on my computer.
Is it possible to simulate mouse's move in PHP ? By that I mean to do something like :
$mouse->moveToCoordinate($x, $y); // will move the mouse to the coordinate ($x, $y) of the screen
$mouse->moveVector($x, $y); // will move from the current point to (current X + $x, current Y + $y)
$mouse->click(); // will simulate a mouse click on the screen.
This should be usable, even if no browser is open (so cannot use the classic browser-side javascript solution).
1 - use exec() and: Simulate mouse movement in Ubuntu. Basically, use any other language, compile it if needed, and call the executable with arguments through the command line.
2 - PHP-QT might do the trick
IT IS POSSIBLE !!!
People have suggested to use another language (javascript), but for this problem, it's not possible to use a browser. So other languages will do the trick.
Thanks for your message though, and if anybody has other solutions, I'd be interested to know them.
PHP is a server-side scripting language and cannot do that. You should do that with JavaScript. It's possible to do that from PHP (write the needed JavaScript in PHP and send it to the client). The most real-time solution is using AJAX, but you will still suffer round-trip lag depending on client speed.
Just as an exercise.
It might be possible to write a standalone desktop PHP app that has access to the user's pointer. For that you have to use bindings such as http://gtk.php.net/ (there were Qt bindings some time ago, but the project seems to be dead).
And even then it might be hard. PHP-GTK is not well documented at this moment.
+1 To everything that was said before.
I'll add that more details on the goal are needed.
Depending on what you really want (a click to do what? On what? etc.), you can still use cURL to reach a page, parse it and follow the link you want (if it's a link you want to click), fill in a whole form and submit it, etc.
You can access the HTML code and save it in a file on your server (if that's what you need), etc.
Anyway, as everyone said, PHP is server-side and, even as CLI, you need to have a server on your localhost that will just execute a PHP script, and PHP doesn't have access to the mouse or mouse movement without a client-side language like JavaScript.
IMHO, I think you're going about whatever it is you're trying to do in the wrong way. There is no way to control the user's mouse unless you're using some sort of remote desktop app, as that would be a security issue. That said, I could take a guess at some possible things you could do:
set focus on an object using JavaScript
click something using JavaScript
write an AppleScript (if on a Mac) to click something in the Finder or automate a process
hth
EDIT
It should also be noted that if you use AppleScript Studio you have access to Objective-C, which would let you write code to change the mouse position. But I don't recommend it: the user should control the mouse and nothing else should.
It's not so hard. Look at the example.
You can easily edit it and send an AJAX HTTP request for the x,y positions and return xstart->x, ystart->y.
The hard part is making the object avoid other objects.
I want to change my HTML page as an image. Is there a way in PHP to change or save an HTML page as an image?
This is not easy; as NullUserException says in his comment, you would need to render the HTML page on the server-side, which is not something PHP (or any other server-sided language) has built in.
The approach that comes to mind would be to write a program (probably not in PHP, but rather something like C# or C++) that runs on your server, fires up a web browser, and does a series of screen captures (possibly combined with page scrolls). As this is a very nontrivial and bug-prone process, I would suggest looking into third-party components that are capable of doing this.
You would then execute this program from PHP, and when it's done running, display the results from the file it output.
I would advise you to use an external service with an api. This list might be a good start: http://blogs.sitepoint.com/2008/07/10/9-ways-to-put-site-screenshots-in-your-web-app/
Thumbalizr seems great; they also provide a PHP script so you can cache the images locally:
http://www.thumbalizr.com/apitools.php
Try taking a look at browsershots.org - source code is available for it if you want to install it locally. Essentially it uses a browser to take screenshots, and can be controlled via an XML-RPC interface, which you can call from PHP.
As others have said this is not a simple job, and not something you can do directly in PHP, so use an external service.
(I'm not affiliated with browsershots.org in any way)
I was recently visiting a site and noticed that the page had a section that said it noticed that I was using AdBlocking software and could I kindly turn it off to help support a small site like that.
I was just wondering how you would do that? Would it be best done client-side or server-side?
This is something that simply can't be done server side - there's zilch reason for a person to knock on your door and say "Look at me, I have AdblockPlus!". On the client side, Adblock is actively trying to influence the page content, which is something you can watch happen and so tell that they are using an ad blocker.
Anyway, I happen to know that newgrounds.com is doing this too. (Their new layout was screwed up for Adblock Plus users; as a response they made a contest for the best "if you're not going to help us through our ads, go and buy something in the store" banner.)
A quick look in the source of newgrounds told me they are doing this with some simple javascript.
First in the document:
var user_is_leecher = true;
Next there is an external script tag: src=checkabp?thisistotrickabp=***address of ad affiliate***
Now the joke: they simply trust adblock plus to filter that script out, as all that's in there is: user_is_leecher = false;
From there, they can do just about anything.
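A minimal sketch of that flag-and-bait pattern, as I understand it from the description above and not Newgrounds' actual code. The names adsBlocked, baitScriptBody and blockerNotice are made up for illustration; in a real page the body of baitScriptBody would be the entire contents of the external file, so it only runs when the ad blocker lets that request through.

```javascript
// 1. Inline, first in the document: assume ads are blocked until proven otherwise.
var adsBlocked = true;

// 2. Stand-in for the external bait file (a hypothetical ad-looking URL).
//    If the ad blocker filters the file out, this code never runs.
function baitScriptBody() {
  adsBlocked = false;
}

// 3. Later in the page: decide what to show based on the flag.
function blockerNotice() {
  return adsBlocked
    ? 'You seem to be blocking ads. Please consider supporting the site.'
    : '';
}
```

The appeal of the design is that the page never inspects the blocker at all; it simply observes whether a script that looks like an ad was allowed to load.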
All of the methods mentioned here rely on the ad blockers to strip out code. This doesn't work for some ad blockers (like NetBarrier on Mac). You also have to keep updating your code when the ad blockers catch on.
To detect if the user is blocking ads, all you have to do is find a function in the ad javascript and try testing for it. It doesn't matter what method they're using to block the ad. Here's what it looks like for Google Adsense ads:
if (typeof(window.google_render_ad) == "undefined")
{
    // They're blocking ads, do something else.
}
This method is outlined here: http://www.metamorphosite.com/detect-web-popup-blocker-software-adblock-spam
You could do it on the server side by pairing requests for HTML pages with requests for the corresponding ads (probably with a unique identifier for each request). But this is just an idea; I've never tried it and have never even seen it used.
I found this part in the code which seems to look like how they did it:
/*MOOTOOLS*/
window.addEvent('domready', function(){
    $$('.cat-item').each(function(el) {
        var fx = new Fx.Morph(el,{ duration:300, link:'cancel' });
        el.addEvents({
            'mouseenter': function() { fx.start({ 'padding-left': 25 }); },
            'mouseleave': function() { fx.start({ 'padding-left': 15 }); }
        });
    });
    if ($$(".google-sense468")[0] && $$(".google-sense468")[0].clientHeight == 0 && $('block-warning')) $('block-warning').setStyle('display','block');
});
/*MOOTOOLS END*/
I guess there are several ways of doing it, but probably the easiest one would be to have some kind of background image, or text, that will be replaced when the ad is loaded. Thus, if the ad gets loaded, you see the ad. If the ad doesn't load, you see the text.
This example would be client-side, done with JavaScript, or plain CSS might even suffice.
There might be some server-side gimmicks that could do this too, but they would be unnecessarily elaborate and clunky. One method that springs to mind would include some kind of API with the advertiser that could be asked "did the user from IP such.and.such load any images?" and in that way get the answer. But I doubt there are such services - it would be much easier to do on the client side.
I believe that it is much easier to do this on the client side than on the server side. Ad blockers are installed on the client, so they can manipulate the DOM and block AJAX requests. That's why I believe it makes more sense to detect on the client than on the server.
Anyway, this is a standalone simple plugin that detects users with ad blockers enabled, it's open-source and the full code is on github:
https://github.com/retargetly/mockingbird
It's more publisher oriented so they can easily show messages on the ads containers or in a popup. The plugin is frequently updated, and it's worth a try. This is the fiddle also:
http://jsfiddle.net/retargetly/9vsha32h/
The only method you need to use is
mockingbird.adsBlocked(Obj)
The call can be done anywhere in the code and you don't need jQuery to make it work.
Wish you luck !
I don't think there is an easy way to do this. What you can do is create a "trap". Make a PHP script listen on a very obvious URL like yourdomain.com/ad.png. You can probably achieve this with URL rewriting. If this URL is requested, you can note this in a session variable and send back a 1x1 blank PNG.
On the next request you can see whether ad.png has been loaded. If it hasn't you can guess that the client is using some form of AdBlock software. Make sure you set the appropriate http headers to prevent clients from caching "ad.png".
This is the only server side approach I can think of at the moment and it has some flaws.
The png file can be cached regardless of the http headers
This will not work for the first http request
Some extra server load as browsers will keep hitting ad.png for each request
That the image gets loaded from the server is no guarantee for it actually being displayed
Probably more side effects that I haven't thought of
Please make a comment on this post if you decide to try it out.
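To make the bookkeeping concrete, here is a rough sketch of the trap logic only, written in JavaScript with an in-memory Map standing in for the session store. The createAdTrap name and the verdict strings are invented for illustration; a PHP version would keep the same two values in $_SESSION.

```javascript
// In-memory stand-in for the session store (a real site would use its session layer).
function createAdTrap() {
  const sessions = new Map(); // sessionId -> { pagesServed, adLoaded }

  function entry(id) {
    if (!sessions.has(id)) sessions.set(id, { pagesServed: 0, adLoaded: false });
    return sessions.get(id);
  }

  return {
    // Call when the bait URL (e.g. /ad.png) is requested.
    recordAdRequest(id) {
      entry(id).adLoaded = true;
    },

    // Call on each page request; returns a guess based on earlier requests.
    recordPageRequest(id) {
      const s = entry(id);
      let verdict;
      if (s.pagesServed === 0) {
        verdict = 'unknown'; // first request: nothing to compare against yet
      } else {
        verdict = s.adLoaded ? 'ads-loaded' : 'probably-blocked';
      }
      s.pagesServed += 1;
      return verdict;
    }
  };
}
```

It has the same flaws as listed above: the first request is always 'unknown', and a cached ad.png can produce a false 'probably-blocked'.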
Regarding a client-side solution: this shouldn't be too difficult. You can create a tiny JavaScript to run on page load complete. This script can check that the page contains the DOM nodes holding the ads. If you do this when the page is loaded completely (not only the DOM), you can check the width and height of your ad images. The most obvious drawback of this solution is that clients can disable JavaScript.
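A small sketch of that size check, under the assumption that every ad container carries a class such as '.ad-slot' (a made-up name; use whatever your markup has). The findCollapsedAds helper is also hypothetical and only looks at the offsetWidth and offsetHeight of the elements it is given.

```javascript
// Returns the ad containers that ended up with no visible size.
function findCollapsedAds(adElements) {
  return Array.prototype.filter.call(adElements, function (el) {
    return el.offsetWidth === 0 || el.offsetHeight === 0;
  });
}

// Browser wiring; skipped outside a browser. '.ad-slot' is a made-up class name.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  // 'load' fires after images and other resources, not just the DOM.
  window.addEventListener('load', function () {
    var collapsed = findCollapsedAds(document.querySelectorAll('.ad-slot'));
    if (collapsed.length > 0) {
      console.log('Ads appear to be blocked');
    }
  });
}
```

If the blocker removes the nodes entirely, the list is empty, so you may also want to compare the number of slots found against the number you expect.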
A few good answers here, so I'll just add this:
Use some ad management system (you can write your own). With that, track every ad that's being displayed (and make it obvious, like ads.php or showad.php or whatever). If that script is never called, the user is using SOME form of ad blocking software.
Be sure to handle each and every ad through that handler, though. Mod_Rewrite isn't required, it can be done using simple PHP.
What you can do to detect the adblocker on the server side is something like:
<?php
header('Content-Type: application/javascript');
//Save it to session
session_start();
$_SESSION['noAdblocker']=true;
?>
noAdblocker=true;
Save this file as ads.php
Now the index.php:
<?php
session_start();
$_SESSION['noAdblocker']=false;
?>
<!DOCTYPE HTML><html><head>
<!-- Now place the "ad-script" -->
<script src="ads.php"></script>
</head><body></body></html>
You can add javascript-code to your page, that is only executed if there's no adblocker, e.g. use "ad" as variable-name, use "ad.js" as file-name.
This code sends an AJAX event to the server, saying "this user doesn't use an adblocker". So if you don't receive that event, you know that this user is blocking ads or even JavaScript altogether.