Let's say we have our own script here:
https://example.com/random_name/script.php
// Code inside the script.php
$url = 'https://foo.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
$data = curl_exec($ch);
As you can see, the script above fetches the content of this domain:
https://foo.com
I know foo.com can see example.com and its IP address.
But the question is: can foo.com (the page we're getting content from) also detect, by any method, that this exact part
/random_name/script.php is making the request? Does it depend on using TLS?
The page (server/application) can only see the information contained in the request. If the request (in your example, the one sent by script.php) does not include the extra data (/random_name/script.php), the page that receives the request will not have it. TLS does not change this: it only encrypts the request in transit and adds no information about the calling script.
If you want the receiving end (foo.com) to know about it, you can use the Referer header:
curl_setopt($ch, CURLOPT_REFERER, "https://example.com/random_name/script.php");
This way, foo.com can read that information from the Referer header.
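To make the point concrete, here is a minimal sketch of what a script on the receiving side (foo.com) could read about the caller. The function name describeCaller is made up for illustration; the array keys are the standard $_SERVER entries PHP fills in from the connection and the request headers.

```php
<?php
// Hypothetical helper: summarise what the receiving server can learn about a caller.
// $server is expected to look like PHP's $_SERVER superglobal.
function describeCaller(array $server): array
{
    return [
        // IP address of the machine that opened the connection (example.com's server)
        'ip'         => $server['REMOTE_ADDR'] ?? null,
        // Only present if the caller chose to send a Referer header
        'referer'    => $server['HTTP_REFERER'] ?? null,
        // Only present if the caller chose to send a User-Agent header
        'user_agent' => $server['HTTP_USER_AGENT'] ?? null,
    ];
}
```

On foo.com this would be called as describeCaller($_SERVER). Unless script.php sets CURLOPT_REFERER (or sends the path in some other header or parameter), 'referer' stays null and the path of the calling script is simply not in the request.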
Related
I want to get the full redirect path of a URL.
Let's say source.com redirects to destination.com after multiple redirects, like this:
http://www.source.com/ -> http://www.b.com/ -> http://www.c.com/ -> http://www.destination.com/
How do I get all the redirected URLs?
Using the code below I only get http://www.destination.com/. How do I detect the full redirect chain?
<?php
$url='windows.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
curl_setopt($ch, CURLOPT_HEADER, false); // no needs to pass the headers to the data stream
curl_setopt($ch, CURLOPT_NOBODY, true); // get the resource without a body
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
curl_exec($ch);
// get the last used URL
$lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
echo $lastUrl;
?>
This code has another problem: it can't detect the target of YouTube redirect links.
Tested URL : https://www.youtube.com/redirect?redir_token=QUFFLUhqbkVxUFZUME9NbWF4RThxdFpGV3pmTTJEdFVWQXxBQ3Jtc0tubGJqU016TzJ6WnlfeUItX0ZmOUItUE1jRlZoZXhxMzNpQllpM0NLSk4ycnBLMGNidTFsX3N6WkU2X3RsUTRZb1lXQVp5SEZjbnU3eDFuZS1VU3dhdzg2QW9ZMTl1azFCZFZHcHRLdFF3dTM1MlRWdw%3D%3D&event=video_description&v=KEa2XWRGf_4&q=https%3A%2F%2Fwww.facebook.com%2Fabhiandniyu
My question is: how do I detect the full redirect chain for all types of redirect requests?
You're probably missing:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
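Add them to your cURL config and it should work. Note that CURLOPT_BINARYTRANSFER has had no effect since PHP 5.1.3, so CURLOPT_RETURNTRANSFER is the one that matters.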
Don't follow HTTP redirects: curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
And output HTTP headers, while testing: curl_setopt($ch, CURLOPT_HEADER, true);
Then you can obtain the Location header from the received HTTP 302 response.
When there is more than one redirect, this has to run in a loop until a non-redirect status such as HTTP 200 has been received. In this context HTTP 200 means that the final destination has been reached.
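The loop described above could look roughly like this. It is only a sketch: fetchOnce and collectRedirectChain are names invented here, and the fetcher is passed in as a callable so the loop can be tried without touching the network. fetchOnce relies on CURLINFO_REDIRECT_URL, which reports the redirect target when CURLOPT_FOLLOWLOCATION is off.

```php
<?php
// Assumed helper: request one URL without following redirects.
// Returns [HTTP status code, redirect target or null].
function fetchOnce(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // stop at the first response
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not print anything
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers are enough here
    curl_exec($ch);
    $status   = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $location = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    return [$status, $location ?: null];
}

// Walk the redirects one hop at a time and remember every URL visited.
function collectRedirectChain(string $url, callable $fetch, int $maxHops = 10): array
{
    $chain = [$url];
    for ($hop = 0; $hop < $maxHops; $hop++) {
        [$status, $location] = $fetch($url);
        // Anything outside 3xx, or a 3xx without a target, ends the chain
        if ($status < 300 || $status >= 400 || $location === null) {
            break;
        }
        $url = $location;
        $chain[] = $url;
    }
    return $chain;
}
```

Usage would be something like print_r(collectRedirectChain('http://www.source.com/', 'fetchOnce'));. Keep in mind that this only sees HTTP redirects. A page that redirects through JavaScript or a meta refresh tag answers with HTTP 200, so the chain stops there.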
I am currently working on a premium link generator for file hosts: a website where you can get premium links for Uptobox, Rapidgator, Uploaded.net and other file hosting sites without purchasing a premium account. Basically, we purchase accounts on these sites on behalf of the users and offer the service at a low price. While setting up the API call for Rapidgator's direct download link, I was able to get the link, but then I was told the session is over. I am calling the API through a piece of software, not through hand-written code, and that is where I am facing this problem.
I got the Rapidgator API reference from this site: https://gist.github.com/Chak10/f097b77c32a9ce83d05ef3574a30367d
I do the following with my debugging software and get a success response, but when I open the returned URL in my browser it shows Session Id Failed.
Here are the steps I am following:
I send a POST request to https://rapidgator.net/api/user/login with the username and password, and I get this output:
{"response":{"session_id":"g8a13f32hr4cbbbo54qdigrcb3","expire_date":1542688501,"traffic_left":"13178268723435"},"response_status":200,"response_details":null}
Then I send a GET request (I tried a POST request too, but the same thing happened) to this URL, with the session ID and the file URL embedded in it: https://rapidgator.net/api/file/download?sid=&url=
and I get this output:
{"response":{"url":"http:\/\/pr56.rapidgator.net\/\/?r=download\/index&session_id=uB9st0rVfhX2bNgPrFUri01a9i5xmxan"},"response_status":200,"response_details":null}
When I try to download the file from that URL in my browser, it says Invalid Session, and sometimes I get a too many open connections error.
Link of the error:- https://i.imgur.com/wcZ2Rh7.png
Success Response:- https://i.imgur.com/MqTsB8Q.png
Rapidgator needs its api to be hit three times with different URLs.
$cookie = $working_dir.rand();
$headers = array("Referer: https://rapidgator.net");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/user/login");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_POSTFIELDS, "username=email#domain.ext&password=myplaintextpassword");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result,true);
return array($rapidgator_json['response']['session_id'],$cookie);
http://rapidgator.net/api/user/login (this is the initial login)
The link above gives you the session ID that you need. The response is in JSON.
Now we need to request a download link that lets us download without logging in through a human input form. So we use the API to request a download link with the initial session ID we got from the first URL.
$headers = array("Referer: http://rapidgator.net/api/user/login");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $working_dir.$rapidgator_cookie);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$result = curl_exec($ch);
curl_close ($ch);
$rapidgator_json = json_decode($result,true);
return array($rapidgator_json['response']['url']);
Basically, we pass the session ID Rapidgator gave us, assuming you logged in with a valid account. Then we include the source URL you obtained (the link to the file): http://rapidgator.net/api/file/download?sid=$rapidgator_session&url=$rapidgator_file
After that, Rapidgator returns a JSON response with a URL that you can use to obtain the file in question. This lets you use whatever download method you want, as that link is a session URL that is valid for a short period of time.
$rapidgator_json['response']['url']
All the code above is somewhat incomplete. Some extra checks on the JSON responses for possible errors/limits are recommended. I used functions on my end, but this is enough for you to see what you should be doing. The Rapidgator API has other data that can be useful in determining whether you have gone over your daily quota, how long the session URL is going to last, and so on.
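As for the extra checks mentioned above, a minimal sketch could look like this. The helper name extractResponseField is made up; the field names (response, response_status, response_details) are the ones visible in the JSON output quoted in the question, and treating any status other than 200 as a failure is an assumption on my part.

```php
<?php
// Hypothetical helper: pull one field out of a Rapidgator-style JSON reply,
// failing loudly when the reply is malformed or reports a non-200 status.
function extractResponseField(string $json, string $field): string
{
    $decoded = json_decode($json, true);
    if (!is_array($decoded)) {
        throw new RuntimeException('Response is not valid JSON');
    }

    // Assumption: response_status 200 means success, anything else is an error
    $status = $decoded['response_status'] ?? null;
    if ($status !== 200) {
        $details = json_encode($decoded['response_details'] ?? null);
        throw new RuntimeException(
            'API returned status ' . var_export($status, true) . ', details: ' . $details
        );
    }

    if (!isset($decoded['response'][$field])) {
        throw new RuntimeException("Field '$field' is missing from the response");
    }
    return (string) $decoded['response'][$field];
}
```

With this in place, the two snippets above would end with extractResponseField($result, 'session_id') and extractResponseField($result, 'url') instead of reading the array keys directly, so a failed login surfaces as an exception rather than as an empty session ID in the next request.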
I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all redirects that the website sends back: it is putting several of the redirect URLs as relative to my server and my PHP script's path, instead of the website's server and the path that the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping through the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off CURLOPT_FOLLOWLOCATION
and do each request individually. You can get the redirect location from curl_getinfo($ch, CURLINFO_REDIRECT_URL), or from the redirect_url key of the array that curl_getinfo($ch) returns.
Or you can set CURLOPT_MAXREDIRS to the number of successful redirects, then do a separate cURL request for the problem redirect location:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, if there is no cURL error, get the response header:
$data = curl_exec($ch);
$responseHeader = '';
$info = '';
if (curl_errno($ch)) {
    // Transfer failed: show the cURL error instead of a page body
    $data = 'Retrieve Base Page Error: ' . curl_error($ch);
}
else {
    // Needs CURLOPT_HEADER set to true, so that $data starts with the headers
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $data = substr($data, $skip);
    $info = var_export(curl_getinfo($ch), true);
}
echo $responseHeader . $info . $data;
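If you end up following the redirects yourself and want to control the base that a relative Location value is resolved against, a simplified resolver could look like the sketch below. The name resolveRedirect is made up, and this is not a full RFC 3986 implementation: it covers absolute URLs, scheme-relative URLs, root-relative paths and paths with . and .. segments, and it does not handle corner cases such as a query-only Location or a trailing .. segment.

```php
<?php
// Hypothetical helper: resolve a Location header value against the URL that was requested.
function resolveRedirect(string $base, string $location): string
{
    if ($location === '') {
        return $base;
    }
    // Already absolute (scheme://...): nothing to resolve
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    // Scheme-relative (//host/path): reuse the scheme of the base URL
    if (strpos($location, '//') === 0) {
        return $b['scheme'] . ':' . $location;
    }

    if ($location[0] === '/') {
        // Root-relative: replaces the whole path on the same host
        $path = $location;
    } else {
        // Relative to the "directory" of the requested page
        $basePath = $b['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $location;
    }

    // Keep query string and fragment out of the segment handling
    $cut = strcspn($path, '?#');
    $suffix = substr($path, $cut);

    $segments = [];
    foreach (explode('/', substr($path, 0, $cut)) as $segment) {
        if ($segment === '..') {
            if (count($segments) > 1) {
                array_pop($segments); // go up one level, never above the root
            }
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }
    return $origin . implode('/', $segments) . $suffix;
}
```

For the example in the question, resolveRedirect('https://www.theirserver.com/theirapp/ContentPages/content.aspx', '../mainForm/login.aspx') gives https://www.theirserver.com/theirapp/mainForm/login.aspx, which you can then pass to CURLOPT_URL for the next request.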
A better way to scrape a web page is to use two PHP packages: Guzzle and DomCrawler.
I ran a lot of tests with this combination and came to the conclusion that it is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problem! ;)
I have got this code:
public function get_thead_page($cookie=null) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $this->url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_COOKIEFILE,'');
if($cookie) curl_setopt($ch, CURLOPT_COOKIE, $cookie);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Now I don't want to supply the cookie value myself; I want the browser to handle it for me. I want the request to behave as if it had been sent by the browser.
So I want the cookie to be loaded with the request instead of providing a value...
There is this option:
curl_setopt($ch, CURLOPT_COOKIEFILE,'');
which asks for the cookie file location, but I don't want to specify the location. I want the request to be sent with a cookie loaded somehow, without specifying a path on the system.
Is there any solution?
The browser can't do that. CURLOPT_COOKIEFILE refers to a server-side file, which the browser has no access to.
You're the one who made this app, so it's up to you to choose the cookie's location when you create it.
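If the goal is only to avoid naming a file, one option is to keep a single cURL handle open for several requests. Passing an empty string to CURLOPT_COOKIEFILE turns on cURL's in-memory cookie handling without loading anything from disk, so cookies set by one response are sent with the next request on the same handle. The sketch below assumes a made-up wrapper class called CookieSession. It does not give you the visitor's browser cookies for the other site, only the cookies that site sets during your script's own requests.

```php
<?php
// Hypothetical wrapper: one cURL handle reused for several requests, so cookies
// received in earlier responses are sent along with later ones.
final class CookieSession
{
    private $handle;

    public function __construct()
    {
        $this->handle = curl_init();
        // Empty string: enable cookie handling in memory, load no file
        curl_setopt($this->handle, CURLOPT_COOKIEFILE, '');
        curl_setopt($this->handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($this->handle, CURLOPT_CONNECTTIMEOUT, 5);
    }

    // Fetch a page; returns the body, or false on failure (as curl_exec does)
    public function get(string $url)
    {
        curl_setopt($this->handle, CURLOPT_URL, $url);
        return curl_exec($this->handle);
    }

    // Cookies currently held in memory, in Netscape cookie-file line format
    public function cookies(): array
    {
        return curl_getinfo($this->handle, CURLINFO_COOKIELIST);
    }
}
```

The difference from get_thead_page() above is that the handle is not closed after each call: as written there, curl_close() throws the in-memory cookies away every time.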
I have a site that uses cURL to access some pages, stores the returned results in variables, and then uses these variables within its own page. The script works well except where the target cURL page has a header('Location: ...') command inside it. It seems to just ignore this header command.
The cURL command is as follows...
//Load result page into variable so portions can be allocated to correct variables
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); # URL to post to
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1 ); # return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);
$loaded_result = curl_exec( $ch ); # run!
curl_close($ch);
I've tried changing the CURLOPT_HEADER to 1 but it doesn't do anything.
So how can I allow script redirection within the target urls using cURL to grab the results? By the way, the pages work fine if accessed other than via cURL but iFrames are not an option in this instance.
If you want cURL to follow redirections add this:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
You'll want the options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS. See the manual.
Try:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
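Putting the suggestions together with the POST request from the question, a sketch could look like this. The function names and the default limit of 5 redirects are my own choices, not something from the question.

```php
<?php
// Hypothetical helper: the question's options plus redirect following, as one array.
function redirectFollowingPostOptions(string $url, array $postfields, int $maxRedirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,                          // return into a variable
        CURLOPT_HEADER         => false,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($postfields),
        CURLOPT_FOLLOWLOCATION => true,                          // obey header('Location: ...')
        CURLOPT_MAXREDIRS      => $maxRedirects,                 // guard against redirect loops
    ];
}

// Hypothetical wrapper that performs the request with those options.
function postFollowingRedirects(string $url, array $postfields)
{
    $ch = curl_init();
    curl_setopt_array($ch, redirectFollowingPostOptions($url, $postfields));
    return curl_exec($ch);
}
```

Then $loaded_result = postFollowingRedirects($url, ['field' => 'value']); replaces the block in the question. One thing to be aware of: by default libcurl switches to GET when it follows a 301, 302 or 303 response to a POST, which is usually what a page doing header('Location: ...') after handling a form expects. CURLOPT_POSTREDIR changes that behaviour if you need the POST repeated.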