Get the destination URL only without loading the contents with cURL - php

If I visit https://example.com/a/abc?name=jack, I get redirected to https://example.com/b/uuid-123. I want only the end URL, https://example.com/b/uuid-123. The content of that page is about 1 MB, but I am not interested in it; I only want the redirected URL. How can I get the redirected URL without also downloading the 1 MB of content, which wastes bandwidth and time?
I have seen a few questions about redirection on Stack Overflow, but nothing on how to avoid loading the content.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com/a/abc?name=jack');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$end_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo('End URL is ' . $end_url);

For clarity I'll add it as an answer as well.
You can tell cURL to retrieve only the headers by setting CURLOPT_NOBODY to true.
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
From these headers you can parse the Location header to get the redirected URL.
Edit for potential future readers: CURLOPT_HEADER will also need to be set to true; I had left this out as you already had it included in your code.
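Putting those two options together with the question's original snippet, a minimal sketch (the example.com URLs are the question's placeholders) might look like the following. One caveat: CURLOPT_NOBODY makes cURL send HEAD requests, and a few servers answer HEAD differently from GET, so test against your target site.

```php
<?php
// Sketch: fetch only headers (no body) and read the final URL after
// following redirects. The example.com URL is a placeholder.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com/a/abc?name=jack');
curl_setopt($ch, CURLOPT_NOBODY, true);          // send HEAD, skip the body
curl_setopt($ch, CURLOPT_HEADER, true);          // include headers in output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return instead of printing
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow 3xx responses
curl_exec($ch);
$end_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo 'End URL is ' . $end_url;
```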

Related

Detect URL redirect path php

I want to get the full redirect path of a URL.
Say source.com redirects to destination.com through multiple hops, like this:
http://www.source.com/ -> http://www.b.com/ -> http://www.c.com/ -> http://www.destination.com/
How do I get all the redirected URLs?
Using the code below I am only getting http://www.destination.com/. How do I detect the full redirect chain?
<?php
$url='windows.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
curl_setopt($ch, CURLOPT_HEADER, false); // no need to include the headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true); // get the resource without a body
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
curl_exec($ch);
// get the last used URL
$lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo $lastUrl;
?>
This code has another problem: it can't detect the target of YouTube's redirects.
Tested URL : https://www.youtube.com/redirect?redir_token=QUFFLUhqbkVxUFZUME9NbWF4RThxdFpGV3pmTTJEdFVWQXxBQ3Jtc0tubGJqU016TzJ6WnlfeUItX0ZmOUItUE1jRlZoZXhxMzNpQllpM0NLSk4ycnBLMGNidTFsX3N6WkU2X3RsUTRZb1lXQVp5SEZjbnU3eDFuZS1VU3dhdzg2QW9ZMTl1azFCZFZHcHRLdFF3dTM1MlRWdw%3D%3D&event=video_description&v=KEa2XWRGf_4&q=https%3A%2F%2Fwww.facebook.com%2Fabhiandniyu
My question is: how do I detect the full redirect chain for all types of redirects?
You're probably missing:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true); // note: this option has had no effect since PHP 5.1.3
Add it to your cURL config and it should work then.
Don't follow HTTP redirects: curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
And output HTTP headers while testing: curl_setopt($ch, CURLOPT_HEADER, true);
Then you can obtain the Location header from the received HTTP 301/302 response.
When there is more than one redirect, this has to run in a loop until an HTTP 200 is received; in this context, HTTP 200 means the final destination has been reached.
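The loop described above might be sketched like this. CURLINFO_REDIRECT_URL returns the already-resolved Location header when FOLLOWLOCATION is off; the function name and hop limit are my own. Note this only sees header-based redirects, so a page like the YouTube interstitial, which may not use an HTTP redirect at all, still won't show up in the chain.

```php
<?php
// Sketch: follow redirects one hop at a time and record the chain.
// Only 3xx + Location redirects are detected; JavaScript or
// meta-refresh "redirects" won't appear.
function redirectChain(string $url, int $maxHops = 10): array
{
    $chain = [$url];
    for ($i = 0; $i < $maxHops; $i++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);           // headers only
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);  // one hop per request
        curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // resolved Location
        curl_close($ch);
        if ($code < 300 || $code >= 400 || !$next) {
            break; // 200 (or an error): final destination reached
        }
        $chain[] = $next;
        $url = $next;
    }
    return $chain;
}

print_r(redirectChain('http://windows.com'));
```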

Rapidgator API Direct Download Link Error

I am currently working on a premium link generator for file hosts: a website where you can get a premium link for uptobox, rapidgator, uploaded.net and other file-host sites without purchasing a premium account. We purchase accounts on these sites on behalf of our users and offer the service at a low price. While setting up the Rapidgator direct-download-link API, I was able to get the link, but I kept getting a "session is over" error. I was calling the API via a debugging tool, not hand-written code, when I ran into this problem.
I have been using the Rapidgator API reference from this site: https://gist.github.com/Chak10/f097b77c32a9ce83d05ef3574a30367d
With my debugging software I get a success response, but when I open the returned URL in my browser it says the session id failed.
Here are the steps I am taking:
I send a POST request to https://rapidgator.net/api/user/login with my username and password, and get this output:
{"response":{"session_id":"g8a13f32hr4cbbbo54qdigrcb3","expire_date":1542688501,"traffic_left":"13178268723435"},"response_status":200,"response_details":null}
Then I send a GET request (I tried a POST request too, but the same thing happened) to this URL with the session id and file URL embedded: https://rapidgator.net/api/file/download?sid=&url=
and I get this output:
{"response":{"url":"http:\/\/pr56.rapidgator.net\/\/?r=download\/index&session_id=uB9st0rVfhX2bNgPrFUri01a9i5xmxan"},"response_status":200,"response_details":null}
When I try to download the file from that URL in my browser, it says "Invalid Session" and sometimes gives a "too many open connections" error.
Link of the error:- https://i.imgur.com/wcZ2Rh7.png
Success Response:- https://i.imgur.com/MqTsB8Q.png
Rapidgator needs its API to be hit three times with different URLs.
$cookie = $working_dir.rand();
$headers = array("header"=>"Referer: https://rapidgator.net");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/user/login");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_POSTFIELDS, "username=email@domain.ext&password=myplaintextpassword"); // replace with your credentials
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result,true);
return array($rapidgator_json['response']['session_id'],$cookie);
http://rapidgator.net/api/user/login (this is the initial login)
The link above gives you the session id that you need; the response is JSON.
Now we need to request a download link that lets us download without going through the human login form, so we use the API to request a download link, passing the initial session id we got from the first URL.
$headers = array("header"=>"Referer: http://rapidgator.net/api/user/login");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result,true);
return array($rapidgator_json['response']['url']);
Basically, we pass the session id Rapidgator gave us, assuming you have logged in with a valid account, and include the source URL of the file you want: http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file
After that, Rapidgator returns a JSON response with a URL you can use to obtain the file in question. This lets you use whatever download method you want, as that session URL is only valid for a short period of time.
$rapidgator_json['response']['url']
All the code above is somewhat incomplete. Some extra checks on the JSON responses for possible errors/limits are recommended. I use functions on my end, but this is enough to show what you should be doing. The Rapidgator API returns other data that can be useful for determining whether you have gone over your daily quota, how long the session URL will last, and so on.
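The extra response checking suggested above might be sketched like this. The field names follow the JSON bodies quoted in the question; the function name and error messages are my own.

```php
<?php
// Sketch: validate a Rapidgator API response before using it.
// Field names are taken from the JSON shown in the question.
function rapidgatorResponse(string $json): array
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded)) {
        throw new RuntimeException('Rapidgator returned invalid JSON');
    }
    if (($decoded['response_status'] ?? 0) !== 200) {
        $detail = $decoded['response_details'] ?? 'unknown error';
        throw new RuntimeException("Rapidgator error: $detail");
    }
    return $decoded['response'];
}

// The login response quoted in the question:
$login = '{"response":{"session_id":"g8a13f32hr4cbbbo54qdigrcb3",'
       . '"expire_date":1542688501,"traffic_left":"13178268723435"},'
       . '"response_status":200,"response_details":null}';
echo rapidgatorResponse($login)['session_id']; // → g8a13f32hr4cbbbo54qdigrcb3
```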

When using cURL, can the target page detect the exact url?

Lets say we have our own script here:
https://example.com/random_name/script.php
// Code inside the script.php
$url = 'https://foo.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
$data = curl_exec($ch);
As you can see, the script above fetches the content of this domain:
https://foo.com
I know foo.com can see example.com and its IP address!
But the question is: can foo.com (the page we're getting content from) detect, by any method, that the exact path /random_name/script.php is making the request? Does it depend on using TLS?
The receiving page (server/application) only has the information contained in the request. If the request (in your example, the one script.php sends) does not include the extra data (/random_name/script.php), the server receiving it cannot know it.
If you want the receiving end (foo.com) to know about it, you can set the Referer header:
curl_setopt($ch, CURLOPT_REFERER, "https://example.com/random_name/script.php");
That way foo.com can see that information in the Referer header.

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all the redirects the website sends back: several of the redirect URLs are being resolved relative to my server and my PHP script's path, instead of the website's server and the path its pages should be relative to. Is there any way to set a base path or server path in cURL, so my script can properly follow the relative redirects it encounters while scraping the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off: CURLOPT_FOLLOWLOCATION
And do each request individually. You can get the redirect location with curl_getinfo($ch, CURLINFO_REDIRECT_URL).
Or you can set CURLOPT_MAXREDIRS to the number of redirects that succeed, then make a separate cURL request for the problematic redirect location.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if there is no cURL error, extract the response header:
$data = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Retrieve Base Page Error: ' . curl_error($ch);
}
else {
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $data = substr($data, $skip);
    $info = var_export(curl_getinfo($ch), true);
    echo $responseHeader . $info . $data;
}
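For the manual approach this answer describes (FOLLOWLOCATION off, one request per hop), a relative Location value has to be resolved against the URL of the page that issued it. A minimal sketch, using only parse_url from the standard library (the function name is mine, and a full RFC 3986 resolver also handles query strings and dot segments):

```php
<?php
// Sketch: resolve a relative Location header against the URL that
// issued it, mirroring what FOLLOWLOCATION does internally.
function resolveRedirect(string $base, string $location): string
{
    // Already absolute? Return it unchanged.
    if (parse_url($location, PHP_URL_SCHEME) !== null) {
        return $location;
    }
    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');
    if ($location[0] === '/') {
        return $origin . $location; // root-relative path
    }
    // Path-relative: replace the last segment of the base path
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}

echo resolveRedirect(
    'https://www.theirserver.com/theirapp/mainForm/securepage.aspx',
    '/mainForm/login.aspx'
);
// → https://www.theirserver.com/mainForm/login.aspx
```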
A better way to scrape a web page is to use two PHP packages: Guzzle + DomCrawler.
I ran a lot of tests with this combination and came to the conclusion that it is the best choice.
Here you will find an example for your implementation.
Let me know if you have any problems! ;)

Trying to log into a site with the cURL extension of PHP

Basically, I'm trying to log into a site. I've got it logging in, but the site redirects to another part of the site, and upon doing so, it redirects my browser as well.
For example:
It successfully logs into http://example.com/login.php
But then my browser goes to http://mysite.com/site.php?page=loggedin
I just want it to return the contents of the page, not be redirected to it.
How would I do this?
As requested, here is my code
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, $loginURL);
//Some setopts
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_REFERER, $referrer);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
echo $output;
Figured it out: the webpage was echoing a meta refresh, and since I was echoing the output, my browser followed it.
I removed the echo $output; and it no longer does that.
I feel kind of dumb for not recognizing that in the beginning.
Thanks everyone.
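A meta-refresh "redirect" like the one described above lives in the HTML body, so cURL never follows it; if you do want to chase it, you have to parse it out of the fetched page yourself. A rough sketch (the regex and function name are mine, and a real HTML parser is more robust):

```php
<?php
// Sketch: detect an HTML meta-refresh "redirect" in a fetched page
// body, since cURL only follows Location headers, not meta tags.
function metaRefreshUrl(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*' .
               'content=["\'][^"\']*url=([^"\'>\s]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null; // no meta refresh found
}

$body = '<meta http-equiv="refresh" ' .
        'content="0; url=http://mysite.com/site.php?page=loggedin">';
echo metaRefreshUrl($body); // → http://mysite.com/site.php?page=loggedin
```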
With cURL you have to find the redirect and follow it yourself, then return that page's content. I'm not sure why your browser would be redirecting unless the login page returns some unusual header.
Set CURLOPT_FOLLOWLOCATION to false:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
This might help you.
