Automated Download of URL Content from site requiring cookies - php

I'm attempting to automatically download content at regular time intervals from a site requiring users to log in. The content I'm seeking to download is a small .js file (<10 kb).
As the site will display the desired data only when I'm logged in, I'm unable to simply use functions such as urlwrite (in MATLAB) to download the data.
I'm not sure whether the libcurl library in PHP would be able to solve the problem easily.
As suggested in the answer to this similar question (Fetching data from a site requiring POST data?), I've tried to use the Zend_Http_Client, but haven't been able to get it to work.
In summary, I'd like help on automatically downloading URL content from a site requiring user log-in (and presumably submission of cookies).
In addition to this, I'd appreciate advice on which software is best for automated download of such data at regular time intervals.
(If you do require the exact URL I am trying to download from to test a solution, please leave a comment below.)

It depends on the type of login the site uses. If it uses HTTP authentication, use the curl option CURLOPT_HTTPAUTH (see setopt, http://php.net/manual/en/function.curl-setopt.php). Otherwise, as said, use CURLOPT_COOKIEJAR and possibly CURLOPT_COOKIEFILE.
Another option is the standalone utility wget. The FAQ contains a nice explanation of both login methods: http://wget.addictivecode.org/FrequentlyAskedQuestions#password-protected
If this is the first time you use curl: don't forget to set CURLOPT_RETURNTRANSFER to true (if false, the content is sent to stdout) and CURLOPT_HEADER to false to get the content without headers.
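A minimal sketch of the cookie-based approach, assuming the site accepts a plain form POST. The helper names, URLs, form field names, and cookie file path are all placeholders for illustration; none of them come from the site in question.

```php
<?php
// Build the cURL options for a request that shares one cookie file.
// $postFields is null for a plain GET. (Hypothetical helper, not part of cURL.)
function cookieSessionOptions(string $url, string $cookieFile, ?array $postFields = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_HEADER         => false,       // body only, no response headers
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are loaded from here for the request
    ];
    if ($postFields !== null) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

// Run one request and return the body, or null on failure.
// The handle is released when the function returns, which flushes the cookie jar.
function fetchWithCookies(array $options): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (placeholder values): log in first, then fetch the protected file.
// $jar = sys_get_temp_dir() . '/site_cookies.txt';
// fetchWithCookies(cookieSessionOptions('https://example.com/login', $jar,
//     ['username' => 'me', 'password' => 'secret']));
// $js = fetchWithCookies(cookieSessionOptions('https://example.com/data.js', $jar));
```

For the "regular time intervals" part, a script like this could be run from a scheduler such as cron; the login request would then simply be repeated whenever the stored cookie has expired.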

If your only concern is login, rather than cookies in general, check the answer to this question: How do I use libcurl to login to a secure website and get at the html behind the login

Related

PHP handle download request when session is expired

I'm looking for some help in how best to handle page navigation/redirection from a PHP application. We don't offer many downloads so this has only just now come up as an issue. The gist is that a user loads a webpage to view some data and this page offers a hyperlink to download the data into a spreadsheet (dynamically built). The issue that I'm struggling to come up with a slick solution to is if the user sits on the webpage for long enough to where their session expires in PHP. Suppose in that case the user comes back to the page and clicks the download link.
There are two scenarios I need to handle. The first is with old browsers like IE (have to support it for the time being). IE doesn't support the download attribute for ANCHOR elements. Therefore, when the link is clicked and the session is invalid, the user is presented with a login form but the browser URL now reflects the endpoint of the download. Upon logging in, the download functions correctly but the user is left at the login form because the presence of the Content-Type: attachment makes the browser not navigate. I am looking for how to best get the user back to what is essentially the initial HTTP_REFERER when the download was requested. The only idea I can come up with is either a standard endpoint or query string parameter to use so that my login form handling code can properly redirect after successful login for a download request.
The other scenario is for modern browsers that support the download attribute. My code sets the HTTP response code to 401 when it determines the login form needs to be rendered (maybe that's not correct, though). I do not see anything within $_SERVER that indicates the request was a download, which suggests, again, a standard endpoint or query string parameter to use for identification. Modern browsers handle this case well by simply denying the download and showing that the request needs authorization. So this works well as long as setting the status to 401 on all login form renders is correct; otherwise, I'd again need some way to know that the requested endpoint is a download.
I'd like to avoid any kind of JavaScript solution if possible.
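The query string idea from the question could be sketched roughly as follows. The parameter name return_to and the helper are hypothetical; the assumption is that the download link (or the login redirect) carries the path of the page the user was viewing, and that the login handler redirects there after a successful login.

```php
<?php
// Return a safe local path to send the user to after a successful login.
// Only same-site absolute paths are accepted, to avoid open redirects.
// $query is normally $_GET; 'return_to' is a made-up parameter name.
function postLoginTarget(array $query, string $default = '/'): string
{
    $target = $query['return_to'] ?? '';
    if (!is_string($target)) {
        return $default;
    }
    // Must start with exactly one "/": "//host" and "/\host" would leave the site.
    if (preg_match('#^/(?![/\\\\])#', $target) === 1) {
        return $target;
    }
    return $default;
}

// In the login handler, after the credentials check succeeds:
// header('Location: ' . postLoginTarget($_GET));
// exit;
```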

POST, GET and cookies via PHP

I posted this question earlier but it was misinterpreted by those reading it and was closed before I had time to clarify. If you don't understand what I mean, please ask!
I have a site, let's call it "site A". On "site A", there is a log in page. On this page, you POST a username and password to a PHP script. If the login details are correct, the PHP script sets a cookie on the browser. This cookie is called "SESSION".
When you view the site, it checks whether "SESSION" is valid, and displays either the information or the login page.
I want to connect to the page via PHP and POST the login details. I then want to store the "SESSION" cookie via PHP, and display the contents of the page (again, via PHP).
How would I do this?
You can use PHP as a web client as well. You can use the cURL library to make requests from PHP.
You can use setopt to set all kinds of options for your cURL session, including POST (CURLOPT_POST) and the POST variables (CURLOPT_POSTFIELDS), but also choose a kind of authentication (CURLOPT_HTTPAUTH) in case the site doesn't use a normal POST for this.
I found an example that might be useful here: http://davidwalsh.name/curl-post, although you can find many other examples by Googling for something like 'php curl post'.
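As a rough sketch of the flow described in the question (POST the login, keep the "SESSION" cookie in PHP, then request the page with it), assuming the cookie arrives in a Set-Cookie header on the login response. The function names, URLs, and form field names are placeholders.

```php
<?php
// Pull one cookie value out of raw HTTP response headers.
// Returns null when no Set-Cookie header carries that name.
function extractCookie(string $rawHeaders, string $name): ?string
{
    $value = null;
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; the cookie name is matched exactly.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m) && $m[1] === $name) {
            $value = $m[2]; // a later Set-Cookie for the same name wins
        }
    }
    return $value;
}

// POST the login form and return the SESSION cookie value (null on failure).
function loginAndGetSession(string $loginUrl, array $fields): ?string
{
    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true, // keep headers so Set-Cookie can be read
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        return null;
    }
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return extractCookie(substr($response, 0, $headerSize), 'SESSION');
}

// Fetch a page while sending the stored cookie back to the site.
function fetchWithSession(string $url, string $session): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIE         => 'SESSION=' . $session,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (placeholder values):
// $session = loginAndGetSession('https://site-a.example/login.php',
//     ['username' => 'me', 'password' => 'secret']);
// if ($session !== null) { echo fetchWithSession('https://site-a.example/', $session); }
```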

Return HTML or XML based on request in PHP

There's an existing website written in PHP. Originally only the website existed, but now an Android application is being built that would benefit from re-using some of the PHP logic.
The PHP site was structured such that there are many pages that perform an action, set success/error information in $_SESSION, and then redirect to a visual page without outputting any content themselves. For example, there's action_login.php:
The page accepts a username and password (from GET or POST variables), validates the credentials, sets success/failure messages in $_SESSION, and then redirects to the logged-in homepage on success or back to the login screen on failure. Let's call this behavior the "HTML response".
The Android application will need to call the same page but somehow tell it that it wants an "XML response" instead. When the page detects this, it will output success/error message in an XML format instead of putting them in $_SESSION and won't redirect. That's the idea anyway. This helps prevent duplicate code. I don't want to have action_login.php and action_login.xml.php floating around.
I've read that the Accept Header isn't reliable enough to use (see: Unacceptable Browser HTTP Accept Headers (Yes, You Safari and Internet Explorer)). My fallback solution is to POST xml=1 or use {url}?xml=1 for GET requests. Is there a better way?
No frameworks are being used, this is plain PHP.
That's what the Accept header is for. Have the Android app request the page as application/xml and then check what was requested in your script. You might also be interested in mod_negotiation when using Apache. Or use WURFL to detect the User-Agent and serve XML when it is Android.
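A minimal sketch of such a check, with the xml=1 flag from the question kept as a fallback. Because some browsers also list application/xml in their Accept header, this version only treats the request as an XML request when text/html is absent; that rule is an assumption made for illustration, not full content negotiation.

```php
<?php
// Decide whether the client asked for an XML response.
// $server is normally $_SERVER and $params is $_GET + $_POST; both are passed
// in so the function can be exercised without a real request.
function wantsXml(array $server, array $params = []): bool
{
    // Explicit fallback flag from the question: ?xml=1 or POST xml=1.
    if (isset($params['xml']) && $params['xml'] === '1') {
        return true;
    }
    $types = [];
    foreach (explode(',', strtolower($server['HTTP_ACCEPT'] ?? '')) as $part) {
        // Drop parameters such as ";q=0.9" and keep the bare media type.
        $types[] = trim(explode(';', $part)[0]);
    }
    // Browsers that mention application/xml also ask for text/html,
    // so XML is only served when HTML was not requested at all.
    return in_array('application/xml', $types, true)
        && !in_array('text/html', $types, true);
}

// In action_login.php this could select the response style:
// if (wantsXml($_SERVER, $_GET + $_POST)) { /* output XML */ } else { /* set $_SESSION, redirect */ }
```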
I'd go with the Android app sending a cookie with every request (really I would prefer the Accept header, but with the problems you pointed out with WebKit I understand your reluctance to do so). The cookie simplifies the server-side code, which no longer has to check $_GET['xml'] or $_POST['xml']. It also avoids a problem with the query string approach: if an Android user shares a URL of your application that contains ?xml=1, whoever opens it in a desktop browser would receive XML instead of the normal web output.
I wouldn't rely on $_SESSION for mobile applications, because users on mobile platforms (or at least I) tend to open your app, play for 5 minutes, put the phone in their pocket, and return to your app 2 hours later. Do you want to set a session lifetime that long?
Why not set a specific session value for the app, and then only set the header if that value is present? Something along the lines of:
session_start();
$_SESSION['app'] = "android app"; // set once, when the app identifies itself
if (isset($_SESSION['app']) && $_SESSION['app'] === "android app") {
    // header(...) call for the XML response goes here
}
I'm not really sure how to implement this in an app, as I've done very little work with apps, but I hope this helps your thought process.

Prevent remote script using PHP CURL from logging into website

What are some methods that could be used to secure a login page against logins made by a remote PHP script using cURL? Checking the referrer and user agent won't work, since those can be set with cURL. The ideal solution would not use a CAPTCHA; the point of this question is to figure out whether that is possible.
One approach is to include some JavaScript in your login form, and make it so that the form cannot possibly be successfully submitted unless that JavaScript has run. This makes your login form usable only by clients that execute JavaScript, which cURL does not. If the necessary JavaScript is some kind of challenge/response that differs every time (for instance use something like http://www.ohdave.com/rsa/ to make it non-trivial), the presence of the correctly set value in the form is good evidence that JavaScript ran.
You won't be able to stop all automated scripts, though. It is easy enough to write scripts that drive an actual browser engine, and they will pass this test.
There isn't any way to prevent it simply. If the script knows the user name and password they will be able to login.
You could use a captcha so that automated logins won't be able to read it, but that will be a burden on actual users as well.
If you are concerned about it being used to try and brute force a login, then you could require some additional information after several attempts:
Disable the account and require reactivation via email
Require a captcha after several unsuccessful attempts
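The escalation described above might be sketched like this. The thresholds and the function name are illustrative assumptions; the failed-attempt count is assumed to be tracked per account (or per IP) somewhere else, for example in the database.

```php
<?php
// Illustrative thresholds, not values taken from the answer.
const CAPTCHA_AFTER = 3;  // failed attempts before a captcha is required
const LOCK_AFTER    = 10; // failed attempts before the account is disabled

// Map the number of consecutive failed logins to the next step.
function loginPolicy(int $failedAttempts): string
{
    if ($failedAttempts >= LOCK_AFTER) {
        return 'locked';   // disable the account, require reactivation via email
    }
    if ($failedAttempts >= CAPTCHA_AFTER) {
        return 'captcha';  // ask for a captcha in addition to the password
    }
    return 'allow';        // normal login form
}
```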
If I understand correctly:
you have a login page that executes a login script,
and the login script is being hit by a remote cURL script...
Solution
In the login page, place a hidden element with a secret unique code that can be used only once, and save this code in the session. In the login script, look up the code in the session and compare it with what was posted to the script; the two should be the same to proceed. Then clear the code from the session...
more about subject: http://en.wikipedia.org/wiki/Cross-site_request_forgery
cURL is no different from any other client (e.g. a browser). You could use a nonce tied to a session in a hidden input field to prevent POST requests from being made directly, but there are still ways around that. It's also a good idea to limit the number of login attempts per minute to make brute-force attacks more difficult, if that's what you're worried about.
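A minimal sketch of the session-bound nonce mentioned in these answers. The function names and the 'login_token' key are made up for illustration; $session stands in for $_SESSION after session_start(). As noted above, a script that first requests the form and reuses its cookie can still obtain a valid token, so this only stops blind direct POSTs.

```php
<?php
// Create a one-time token, remember it in the session, and return it so it
// can be written into a hidden form field.
function issueLoginToken(array &$session): string
{
    $token = bin2hex(random_bytes(32)); // 64 hex characters
    $session['login_token'] = $token;
    return $token;
}

// Compare the posted token with the stored one and invalidate it either way,
// so each token can be used at most once.
function consumeLoginToken(array &$session, string $posted): bool
{
    $expected = $session['login_token'] ?? null;
    unset($session['login_token']);
    return is_string($expected) && hash_equals($expected, $posted);
}

// Login page:   echo '<input type="hidden" name="token" value="' . issueLoginToken($_SESSION) . '">';
// Login script: if (!consumeLoginToken($_SESSION, $_POST['token'] ?? '')) { /* reject the request */ }
```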

Page sends file to curl, I want to get the download link instead

There is a page to which I need to POST a password, after which I get a file to download.
The POST goes to the same page address; the page loads again and pops up the download manager (the download starts automatically).
Now I want to do the same with curl. I POSTed the data to the URL and it sends me the file back, but I don't want my script to download the whole file; I only want to get a link so that I can download it myself.
How can I do that?
Actually, you most probably can't. Such password-protected download systems usually check either cookies or browser/environment-based variables. Getting the link itself shouldn't be a problem; however, you could not use it outside this generator's scope anyway.
First you need to POST that password with curl to the specific form, assuming the form takes you to the download page. Then use a regex (regular expression) on the response to filter out the data you want, and save it in another variable to re-use it.
There is for sure a redirection after you hit 1st page with POST. Look for that redirection with curl and read http response headers: Content-Location or Location or even Refresh
To prevent the automatic download, make sure curl does not follow redirects. The option is CURLOPT_FOLLOWLOCATION; PHP's curl only follows Location redirects when it is set to true, so leave it false. In a browser the redirect happens in a split second, so humans don't actually see it.
I don't fully understand what you want to do, but if you just want a link, then have the PHP script perform the entire curl POST when the user clicks it. No matter what, the web server will require the password before granting access to the file; you can't skip that step.
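A rough sketch of the header-based approach from these answers: POST the password without following redirects and read the target from the Location, Content-Location, or Refresh header. The function names, URL, and form field are placeholders, and this only helps if the site really answers with a redirect rather than with the file itself.

```php
<?php
// Look for a redirect target in raw response headers, checking Location,
// Content-Location, and Refresh. Returns null when none is present.
function findRedirectTarget(string $rawHeaders): ?string
{
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        if (preg_match('/^(?:Content-)?Location:\s*(\S+)/i', $line, $m)) {
            return $m[1];
        }
        // Refresh looks like "Refresh: 0; url=https://example.com/file.zip"
        if (preg_match('/^Refresh:\s*\d+\s*;\s*url=(\S+)/i', $line, $m)) {
            return $m[1];
        }
    }
    return null;
}

// POST the form without following redirects and return the link, if any.
function postForLink(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,  // include response headers in the result
        CURLOPT_FOLLOWLOCATION => false, // stop at the redirect instead of downloading
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        return null;
    }
    // Only the header part of the response is scanned.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return findRedirectTarget(substr($response, 0, $headerSize));
}

// Usage (placeholder values):
// $link = postForLink('https://example.com/download.php', ['password' => 'secret']);
```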
