Curl not saving all Cookies - php

I've done scraping for lots of sites, but one in particular isn't saving its cookies to my cookie file. Any ideas?
$cookie_file = "cookies/zapper.txt";
$ch = curl_init($url);
// Note: CURLOPT_TIMEOUT_MS is set after CURLOPT_TIMEOUT and overrides it,
// so this pair gives an 8.2-second total timeout, not 8200 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 8200);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 8200);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 8200);
// CURLOPT_COOKIESESSION starts a new cookie session, so session cookies
// already in the file are ignored
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
// $fields / $fields_string are built elsewhere from the form being posted
if ($fields) { curl_setopt($ch, CURLOPT_POST, count($fields)); }
if ($fields) { curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string); }
This is the first site I've scraped that doesn't respond to my cookie saves. All the others use the same code and work perfectly. I've even emulated the POST of their forms and faked the headers in case it was checking those.
The site I'm trying to mimic an add to cart process for is http://zapper.co.uk/

There's a possible solution in the user notes for curl_setopt on php.net: a workaround that reads the cookie content directly from the header output. It seems like a good alternative.
Also, you can get surprising results by modifying some of your curl_setopt options. Sometimes we use more options than we need.
I also recommend echoing the body returned by curl_exec() (it will print the page the way a browser would). Sometimes the live result content holds a detailed error that isn't present in the headers.
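As a rough sketch of that header-based workaround (the regex and variable names are mine, not from the original note): request the page with CURLOPT_HEADER enabled and pull the Set-Cookie lines out of the raw response.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // include the headers in the output
$response = curl_exec($ch);
curl_close($ch);
// Collect every Set-Cookie header into a name => value map
preg_match_all('/^Set-Cookie:\s*([^;\r\n]*)/mi', $response, $matches);
$cookies = array();
foreach ($matches[1] as $item) {
    parse_str($item, $cookie);
    $cookies = array_merge($cookies, $cookie);
}
var_dump($cookies);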

Related

Rapidgator API Direct Download Link Error

Guys, I am currently working on a file hosting premium link generator. Basically, it will be a website where you can get a premium link for uptobox, rapidgator, uploaded.net and other file hosts without purchasing a premium account. We purchase the accounts on behalf of the users and offer the service at a low price. While setting up the direct download link API for rapidgator, I was able to get the link, but then I got a "session is over" error. I was testing the API via a debugging tool, not via manual coding, and I am facing this problem.
I have been using the Rapidgator API reference from this site: https://gist.github.com/Chak10/f097b77c32a9ce83d05ef3574a30367d
I am doing the following with my debugging software and I get a success response, but when I open the returned URL in my browser it says Session Id Failed.
Here are the steps I am following:
Sending a POST request to https://rapidgator.net/api/user/login with username and password, which gives this output:
{"response":{"session_id":"g8a13f32hr4cbbbo54qdigrcb3","expire_date":1542688501,"traffic_left":"13178268723435"},"response_status":200,"response_details":null}
Now I am sending a GET request (I tried a POST request too, but the same thing happened) to this URL, with the session id and file URL embedded: https://rapidgator.net/api/file/download?sid=&url=
and I get this output:
{"response":{"url":"http:\/\/pr56.rapidgator.net\/\/?r=download\/index&session_id=uB9st0rVfhX2bNgPrFUri01a9i5xmxan"},"response_status":200,"response_details":null}
When I try to download the file from that URL through my browser, it says Invalid Session, and sometimes I get a too many open connections error.
Link of the error:- https://i.imgur.com/wcZ2Rh7.png
Success Response:- https://i.imgur.com/MqTsB8Q.png
Rapidgator needs its API to be hit three times with different URLs.
// Cookie jar file with a random name in the working directory
$cookie = $working_dir.rand();
// CURLOPT_HTTPHEADER only reads the array values, so a plain list works
$headers = array("Referer: https://rapidgator.net");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/user/login");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_POSTFIELDS, "username=email#domain.ext&password=myplaintextpassword");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result, true);
// Return the session id plus the cookie file path for the next request
return array($rapidgator_json['response']['session_id'], $cookie);
http://rapidgator.net/api/user/login (this is the initial login)
The link above gives you the session id that you need. The response is in JSON.
Now we need to request a download link that will let us download without having to log in through a human input form. So we use the API to request a download link, using the initial session id we got from the first URL.
$headers = array("Referer: http://rapidgator.net/api/user/login");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result,true);
return array($rapidgator_json['response']['url']);
Basically, we pass the session id Rapidgator gave us, assuming you have authenticated with a valid account, and include the source URL of the file you want: http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file
After that, Rapidgator returns a JSON response with a URL you can use to fetch the file. This allows you to use whatever download method you want,
as that session URL is only valid for a short period of time.
$rapidgator_json['response']['url']
All the code above is somewhat incomplete. Some extra checks on the JSON responses for possible errors/limits are recommended. I use functions on my end, but this is enough to show what you should be doing. The Rapidgator API returns other data that can be useful for determining whether you have gone over your daily quota, how long the session URL will last, and so on.
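As a minimal sketch of those extra checks (the helper name and the exception handling are mine, not from the original answer):
// Hypothetical helper: validate a Rapidgator API response before using it
function rapidgator_check($result) {
    $json = json_decode($result, true);
    if (!is_array($json)) {
        throw new Exception('Rapidgator returned invalid JSON: ' . $result);
    }
    // The API signals errors/limits via response_status and response_details
    if ($json['response_status'] != 200) {
        throw new Exception('Rapidgator API error ' . $json['response_status']
            . ': ' . $json['response_details']);
    }
    return $json['response'];
}
// Usage: $session_id = rapidgator_check($result)['session_id'];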

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all the redirects the website sends back: it resolves several of the redirect URLs relative to my server and my PHP script's path, instead of the website's server and the path the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read the ASP viewstate and eventvalidation fields
// (parseExtract() and the $regex... patterns are helpers defined elsewhere)
$viewstate = parseExtract($data, $regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there, this will give you additional info:
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off: CURLOPT_FOLLOWLOCATION
And do each request individually. You can get the redirect location from curl_getinfo($ch, CURLINFO_REDIRECT_URL).
Or you can set CURLOPT_MAXREDIRS to the number of redirects that succeed, then do a separate curl request for the problem redirect location:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if there is no curl error, extract the response header:
$data = curl_exec($ch);
if (curl_errno($ch)) {
    $data .= 'Retrieve Base Page Error: ' . curl_error($ch);
    echo $data;
}
else {
    // Split the headers off the body, then dump everything curl knows
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $data = substr($data, $skip);
    $info = var_export(curl_getinfo($ch), true);
    echo $responseHeader . $info . $data;
}
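Putting that together, here is a rough sketch (the variable names are mine) of following each redirect manually and resolving the occasional relative Location header against the site being scraped rather than the local script:
// Sketch: follow redirects by hand, resolving relative Location headers
// against the current URL instead of the local script's path
$url = $urlSecuredPage;
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
for ($i = 0; $i < 10; $i++) {              // cap the redirect chain
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($code < 300 || $code >= 400) {
        break;                             // not a redirect, we are done
    }
    // CURLINFO_REDIRECT_URL is usually already absolute; the fallback
    // covers a Location header that is only a path
    $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    if (strpos($next, 'http') !== 0) {
        $parts = parse_url($url);
        $next = $parts['scheme'] . '://' . $parts['host'] . $next;
    }
    $url = $next;
}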
A better way to scrape a webpage is to use two PHP packages: Guzzle + DomCrawler.
I made a lot of tests with this combination and came to the conclusion that it is the best choice.
Here you will find an example for your implementation.
Let me know if you have any problem! ;)
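For reference, a minimal sketch of that combination (assuming guzzlehttp/guzzle and symfony/dom-crawler are installed via Composer; the URL and selector are placeholders):
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

// Guzzle keeps cookies across redirects, much like a browser session
$client = new Client([
    'cookies' => true,
    'allow_redirects' => ['max' => 10],
]);
$response = $client->request('GET', 'https://www.theirsite.com/theirApp/ContentPages/content.aspx');

// DomCrawler makes it easy to pull values out of the returned HTML
$crawler = new Crawler((string) $response->getBody());
echo $crawler->filter('title')->text();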

PHP CURL Request Failing When Site Requires Cookies Enabled

I am trying to grab the meta data from a news article on the NY Times website, specifically http://www.nytimes.com/2014/06/25/us/politics/thad-cochran-chris-mcdaniel-mississippi-senate-primary.html
Whenever I try, however, I get redirects from the site because my "browser" does not accept cookies. I have enabled the curl options to save cookies and tried following the accepted answers in a few other StackOverflow questions (here, here, and here), and while those answers worked on those websites, they don't seem to work on the nytimes site.
My current php curl function looks like this:
function get_extra_meta_tags_curl($url) {
    $ckfile = tempnam("/public_html/commentarium/", "cookies.txt");
    // First request seeds the cookie file ($main_url is presumably defined elsewhere)
    $ch = curl_init($main_url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($ch);
    // Second request fetches the article itself, reusing the saved cookies
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($ch);
    curl_close($ch);
    echo $output;
}
The problem appears to be that when I request the URL, nytimes.com checks whether the browser accepts cookies. It checks a couple of times before redirecting to the login page with a REFUSE_COOKIE_ERROR. Instead of posting the full redirect list here, you can see it on my test page, along with the raw HTML that the final redirect returns and what my current get_extra_meta_tags_curl function returns under "CURL test".
Thanks for any help!
You are enabling automatic cookie handling in the wrong manner. CURLOPT_COOKIEJAR only enables saving (storing) cookies; you also need to enable loading cookies and sending them with each request, via the CURLOPT_COOKIEFILE option. Otherwise automatic cookie handling won't work and you will run into the "browser does not accept cookies" problem you mentioned.
So you have to set both the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options to the same value ($ckfile) on every curl request:
...
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
...
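A self-contained sketch of that advice (the user agent string and temp-file location are my own choices, not from the answer):
$url = 'http://www.nytimes.com/2014/06/25/us/politics/thad-cochran-chris-mcdaniel-mississippi-senate-primary.html';
$ckfile = tempnam(sys_get_temp_dir(), 'cookies');
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);   // save cookies here...
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);  // ...and send them back on each redirect
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // the cookie check bounces you around
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; PHP cURL)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);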

Remote login using curl on cakephp returns blackhole

I have a website on which I need to implement a login form for another website, which is based on CakePHP.
I don't want to change the current security settings on the CakePHP website.
The login is based on the Auth component.
Therefore I've pulled the form from the CakePHP site using curl, to keep the token fields.
When I log in, I get a blackhole message: 'The request has been black-holed'.
How can I fix that without lowering the security level?
What are the steps to debug this situation?
Thanks
$url = WEB_APP . '/users/login';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, '/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch);
$markup = strip_tags($result, '<form><input>');
echo $markup;
I've checked, and the problem is definitely that the cookie is missing.
I tried to use setcookie() with the cookie file I got from curl, but it's saved encoded.
Using setrawcookie() returns false.
How can I set the cookie correctly?
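For reference, here is a rough sketch of fetching the form and re-posting its hidden _Token fields with the same cookie jar, which is what the Security component expects (the field names follow CakePHP 2.x conventions and the credentials are placeholders; these are assumptions, not code from this question):
// Sketch: fetch the CakePHP login form, then re-post it with the same
// session cookie and its hidden _Token fields, so the Security component
// does not blackhole the request
$cookieFile = tempnam(sys_get_temp_dir(), 'cake');
$ch = curl_init(WEB_APP . '/users/login');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // keep the CakePHP session
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$form = curl_exec($ch);

// Pull every hidden input (including the _Token fields) out of the form
preg_match_all('/<input type="hidden" name="([^"]+)" value="([^"]*)"/', $form, $m);
$post = array_combine($m[1], $m[2]);
$post['data[User][username]'] = 'myuser';      // placeholder credentials
$post['data[User][password]'] = 'mypassword';

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
$result = curl_exec($ch);
curl_close($ch);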

Are PayPal cookies linked to an IP address?

I have been experimenting with curl for accessing the PayPal payment authorisation site using PHP.
e.g.
...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $nvp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec ($ch);
preg_match_all('/Set-Cookie: .*/', $res, $cookieMatches);
foreach ($cookieMatches[0] as $cookieMatch)
header($cookieMatch);
preg_match('/Location: .*/', $res, $locMatches);
header($locMatches[0]);
header('Vary: Accept-Encoding');
header('Strict-Transport-Security: max-age=500');
header('Transfer-Encoding: chunked');
header('Content-Type: text/html');
The principle is simply to reflect the original redirect back to the visitor's browser (I am sure there is a simpler way to do this). However, the response from PayPal seems to indicate some kind of cookie error.
My hunch is that the cookie has been linked to the originating machine in some way. Can anyone confirm this, or am I just missing something obvious?
cURL has built-in support for cookies (as you know), but it can be tricky. I couldn't get cookies to work until I set this option:
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
The third parameter is the name of the file storing the cookies, preferably in a temp folder. Maybe you should just try this approach. With this, the redirects work "automatically".
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');  // save the cookies
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt'); // use the cookies
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);        // follow where the Location header takes you; maybe you'll catch the issue
Since it works in a browser, it has to work using cURL, unless they are using JavaScript to set the cookies.
Even if they are tying cookies to an IP address, try starting the session from the beginning using curl, so that they associate the generated cookies with your server's IP address.
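Applied to the original code, a condensed sketch of that approach (replacing the manual Set-Cookie reflection with curl's cookie engine; the temp-file path is my own choice):
// Sketch: let curl's cookie engine handle the PayPal session end-to-end,
// instead of reflecting Set-Cookie headers back to the visitor's browser
$cookieFile = tempnam(sys_get_temp_dir(), 'paypal');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);               // $url / $nvp as in the question
curl_setopt($ch, CURLOPT_POSTFIELDS, $nvp);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store cookies from every hop
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // and send them back again
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow the redirect within curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$res = curl_exec($ch);
curl_close($ch);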
