Trying to log into a site with the cURL extension of PHP - php

Basically, I'm trying to log into a site. I've got it logging in, but the site redirects to another part of the site, and upon doing so, it redirects my browser as well.
For example:
It successfully logs into http://example.com/login.php
But then my browser goes to http://mysite.com/site.php?page=loggedin
I just want it to return the contents of the page, not be redirected to it.
How would I do this?
As requested, here is my code
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, $loginURL);
//Some setopts
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWREDIRECT, FALSE);
curl_setopt($ch, CURLOPT_REFERRER, $referrer);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
echo $output;

Figured it out. The webpage was echoing a meta refresh, and since I was echoing the output, my browser followed.
Removed the echo $output; and it no longer does that.
I feel kind of dumb for not recognizing that in the beginning.
Thanks everyone.

Using cURL you have to find the redirect and follow it, then return that page's content. I'm not sure why your browser would be redirecting unless you have some weird header code that you are returning from the login page.

set CURLOPT_FOLLOWLOCATION to false.
curl_setopt($ch , CURLOPT_FOLLOWLOCATION , FALSE);
this might help you.

Related

Get the destination URL only without loading the contents with cURL

If I visit the website https://example.com/a/abc?name=jack, I get redirected to https://example.com/b/uuid-123. I want only the end URL which is https://example.com/b/uuid-123. The contents of https://example.com/b/uuid-123 is about 1mb but I am not interested in the content. I only want the redirected URL and not the content. How can I get the redirected URL without having to also load the 1mb of content which wastes my bandwidth and time.
I have seen a few questions about redirection on stackoverflow but nothing on how not to load the content.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com/a/abc?name=jack');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$end_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo('End URL is ' . $end_url);
For clarity i'll add it as an answer as well.
You can tell curl to only retrieve the headers by setting the CURLOPT_NOBODY to true.
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
From these headers you can parse the location part to get the redirected URL.
Edit for potential future readers: CURLOPT_HEADER will also need to be set to true, i had left this out as you already had it included in your code.

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all redirects that the website sends back: it is putting several of the redirect URLs as relative to my server and my PHP script's path, instead of the website's server and the path that the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping through the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx"
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx"
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off: CURLOPT_FOLLOWLOCATION
And do each request individually. You can get the redirect location from curl_getinfo("redirect_url")
Or you can set CURLOPT_MAXREDIRS to the number of successful redirects, then do a separate curl request for the problem redirect location
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if no curl error, get the resposne header
$data = curl_exec($ch);
if (curl_errno($ch)){
$data .= 'Retreive Base Page Error: ' . curl_error($ch);
echo $data;
}
else {
$skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
$responseHeader = substr($data,0,$skip);
$data= substr($data,$skip);
$info = curl_getinfo($ch);
$info = var_export($info,true);
}
echo $responseHeader . $info . $data;
A better way to web scraping a webpage is to use 2 PHP Packages = Guzzle + DomCrawler.
I made a lot of tests with this combination and i came to the conclusion that this is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problem! ;)

Remote login using curl on cakephp returns blackhole

I have a website which i need to implement a login form to other website , which is based on cakephp.
I don't want to change the current security settings on the Cakephp website.
The login is based on the auth component.
Therefore I've pulled the form from the cakephp site using Curl, to keep the token fields.
When i login, i've got a blackhole message 'The request has been black-holed'.
How can i fix that without lowering the security level?
What are the steps to debug this situation.
Thanks
$url = WEB_APP . '/users/login';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, '/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch);
$markup = strip_tags($result, '<form><input>');
echo $markup;
I've checked and the problem is that the cookie is missing , for sure.
I tried to use sendcookie on the cookie file i got from curl, and it's saved encoded.
Using setrawcookie returns false.
How can I set the cookie correctly?

cURL can't follow redirection

my curl function cannot follow the redirection of Facebook external link redirector, l.php and i have no idea what's wrong...
here is the code that i'm working on and i commented the lines that i've tried... and an example link (http://www.facebook.com/l.php?u=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DGvhFyNLK66A%26feature%3Dyoutu.be&h=xAQFD_3svAQFKxF5YrtqNQ5cL3lIQxo0uaC9PoB7qAvG7Yw&enc=AZPxNZ8P5q54FREC37UC_MP02pwh2DOmsI5bbFkoQm5VUPUlYeNzQASjarRjhTtcedRkmM3mDjK7J_r_P5pRpYhL)
function connect($u) {
$ch= curl_init();
curl_setopt($ch, CURLOPT_URL, $u);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_HEADER, true);
//curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
//curl_setopt($ch, CURLOPT_REFERER, 'spie');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_AUTOREFERER, true );
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_VERBOSE, true);
//curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
$source=curl_exec($ch);
curl_close($ch);
return $source;
}
thank you..
I first thought this was a redirect issue with cURL (safe mode enabled for instance). But it actually comes from how Facebook redirector works.
There is no Location: header, so curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); won't help you with it.
The Facebook link page actually redirects you using Javascript:
<script type="text/javascript">document.location.replace("http:\/\/www.youtube.com\/watch?v=GvhFyNLK66A&feature=youtu.be");</script>
cURL cannot analyse the content of the page nor execute javascript so this is exepcted behaviour. If you still want to do this, you'll need to parse the content of the page, grab the URL from the javascript, and issue an new cURL request to this URL.
Apparently only HTTP redirects are supported by cURL with the '--location' option.
Reference: https://everything.curl.dev/http/redirects#non-http-redirects

PHP Header redirection problem within page called by cURL

I have a site that uses cURL to access some pages, stores the returned results in variables, and then uses these variables within its own page. The script works well except where the target cURL page has a header('Location: ...') command inside it. It seems to just ignore this header command.
The cURL command is as follows...
//Load result page into variable so portions can be allocated to correct variables
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); # URL to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);
I've tried changing the CURLOPT_HEADER to 1 but it doesn't do anything.
So how can I allow script redirection within the target urls using cURL to grab the results? By the way, the pages work fine if accessed other than via cURL but iFrames are not an option in this instance.
If you want cURL to follow redirections add this:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
You'll want the options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS. See the manual.
try
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

Categories