I'm trying to get the contents of an SSL (HTTPS) ASPX page using cURL. Here's my code:
$ourFileName = "cookieFIle.txt";
$ourFileHandle = fopen($ourFileName, 'w') or die("can't open file");
fclose($ourFileHandle);
function curl_get($url) {
    $ch = curl_init();
    $options = array(
        CURLOPT_HEADER => 1,
        CURLOPT_URL => $url,
        CURLOPT_USERPWD => 'XXX:XXX',
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_HTTPAUTH => CURLAUTH_ANY,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        CURLOPT_HTTPHEADER => $header,
        CURLOPT_COOKIESESSION, true,
        CURLOPT_COOKIEFILE, "cookieFIle.txt",
        CURLOPT_COOKIEJAR, "cookieFIle.txt"
    );
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);
    return $return;
}
echo curl_get('https://somepage.com/intranet/loginprocess.aspx');
And whenever I run the code I receive this:
Header:
HTTP/1.1 401 Unauthorized
Content-Length: 1656
Content-Type: text/html
Server: Microsoft-IIS/6.0
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-UA-Compatible: IE=EmulateIE8
Date: Sat, 16 Nov 2013 19:05:18 GMT
Message:
You are not authorized to view this page
You do not have permission to view this directory or page using the credentials that you supplied because your Web browser is sending a WWW-Authenticate header field that the Web server is not configured to accept.
The login and password are 100% correct, and the URL is too. OpenSSL is installed on the Raspberry Pi I'm using, and cURL is enabled in php.ini.
The loginprocess.aspx redirects you to studenthome.aspx after authorisation is complete, but I think the problem is in the authorisation itself.
You are trying to connect with basic authentication, while the server is requesting integrated Windows authentication (NTLM).
Use the option CURLAUTH_NTLM.
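For example, swapping the auth option in your existing options array (a minimal sketch; the credentials placeholder is from your code):

CURLOPT_HTTPAUTH => CURLAUTH_NTLM, // force NTLM instead of CURLAUTH_ANY
CURLOPT_USERPWD => 'XXX:XXX',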
You should not set CURLOPT_COOKIESESSION to true. From the manual:
TRUE to mark this as a new cookie "session". It will force libcurl to ignore all cookies it is about to load that are "session cookies" from the previous session. By default, libcurl always stores and loads all cookies, independent if they are session cookies or not. Session cookies are cookies without expiry date and they are meant to be alive and existing for this "session" only.
Your code also has typos: the last three options use commas instead of =>. It should read like this:
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => "cookieFIle.txt",
CURLOPT_COOKIEJAR => "cookieFIle.txt"
);
Related
I want to get the content of this page with PHP cURL:
My cURL sample:
function curll($url, $headers = null) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($headers) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }
    // Empty string: advertise every encoding this cURL build supports.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0');
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);

    $response = curl_exec($ch);
    $res = array();
    $res['headerout'] = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    $res['rescode'] = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($response === false) {
        $res['content'] = $response;
        $res['error'] = array(curl_errno($ch), curl_error($ch));
        curl_close($ch);
        return $res;
    }

    // Split the raw response into headers and body.
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $res['headerin'] = substr($response, 0, $header_size);
    $res['content'] = substr($response, $header_size);
    curl_close($ch);
    return $res;
}
response:
array (size=4)
'headerout' => string 'GET /wallets HTTP/1.1
Host: www.cryptocompare.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: br
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Upgrade-Insecure-Requests: 1
' (length=327)
'rescode' => string '200' (length=3)
'content' => boolean false
'error' =>
array (size=2)
0 => int 23
1 => string 'Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.' (length=88)
The response encoding is br and the response content is false.
I am aware that using gzip or deflate as the encoding would get me content. However, the content I have in mind is only served with br encoding.
I read on this page that cURL 7.57.0 supports the Brotli compression capability. I currently have version 7.59.0 installed, but cURL still errors out when it receives content in br encoding.
Now I want to know: how can I get the content of a page with br encoding and decompress it with PHP cURL?
I had the exact same issue because one server was only able to return Brotli and the cURL bundled with my PHP didn't support it. I had to use a PHP extension: https://github.com/kjdev/php-ext-brotli
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'URL');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output_brized = curl_exec($ch);
// brotli_uncompress() is provided by the php-ext-brotli extension.
$output_ok = brotli_uncompress($output_brized);
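If your server only sends Brotli when the client advertises it, a workaround (my own sketch, not from the extension's docs) is to send the Accept-Encoding header yourself via CURLOPT_HTTPHEADER instead of CURLOPT_ENCODING, so libcurl does not try, and fail, to decode the response itself:

// Advertise br manually; libcurl returns the raw compressed bytes,
// which brotli_uncompress() can then decode.
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: br'));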
I checked and, with PHP 7.4.9 on Windows with bundled cURL version 7.70.0, setting the CURLOPT_ENCODING option to '' (like you did) forced the bundled cURL to do the request with one additional header, accept-encoding: deflate, gzip, which are the content encodings the bundled cURL can decode. If I omitted this option, there were just 2 headers: Host: www.google.com and accept: */*.
Indeed, searching the PHP source code (https://github.com/php/php-src/search?q=CURLOPT_ENCODING) for this CURLOPT_ENCODING option turns up nothing that sets a default value or changes the value from PHP. PHP passes the option value to cURL without altering it, so what I am observing is the default behavior of my bundled cURL version.
I then discovered that cURL supports Brotli from version 7.57.0 (https://github.com/curl/curl/blob/bf1571eb6ff24a8299da7da84408da31f0094f66/docs/libcurl/symbols-in-versions), released in November 2017 (https://github.com/curl/curl/blob/fd1ce3d4b085e7982975f29904faebf398f66ecd/docs/HISTORY.md), but it has to be compiled with the --with-brotli flag (https://github.com/curl/curl/blob/9325ab2cf98ceca3cf3985313587c94dc1325c81/configure.ac), which was probably not used for my PHP build.
Unfortunately, there is no curl_getopt() function to get the default value of an option. But phpinfo() gives valuable info: I got a BROTLI => No line, which confirms my version was not compiled with Brotli support. You may want to check your phpinfo() output to find out whether your bundled cURL should support Brotli. If it doesn't, use my solution. If it does, more investigation needs to be done to find out whether it's a bug or a misuse.
If you want to know what your cURL actually sent, you have to use a proxy like Charles or Fiddler, or use cURL's verbose mode.
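Enabling verbose mode looks roughly like this (a sketch; the temporary stream is my own addition):

// Log the full request/response trace to a temporary stream.
$trace = fopen('php://temp', 'w+');
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, $trace);
curl_exec($ch);
rewind($trace);
echo stream_get_contents($trace); // shows exactly what was sent and received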
Additionally, for the sake of completeness, the HTTP/1.1 spec (https://www.rfc-editor.org/rfc/rfc2616#page-102) says:
If an Accept-Encoding field is present in a request, and if the server cannot send a response which is acceptable according to the Accept-Encoding header, then the server SHOULD send an error response with the 406 (Not Acceptable) status code.

If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding.
So, if your PHP version behaved the same as mine, the website should have received an Accept-Encoding header not containing br, so it should NOT have replied with br content; instead, it should have replied with gzip or deflate content or, if it was not able to do so, with a 406 Not Acceptable instead of a 200.
If you are using Cloudflare, you can try disabling Brotli compression in the Cloudflare dashboard.
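A client-side alternative (my own suggestion, not part of this answer) is to advertise only the encodings your cURL build can actually decode, so a well-behaved server never picks br:

// Restrict Accept-Encoding to encodings libcurl can decompress here.
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');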
So if I'm browsing http://www.example.com/user1.jpg I see the user's picture.
But if I make a cURL request via PHP from my localhost webserver (so the same IP), it throws 401 Unauthorized.
I even tried to change the user agent and still no success.
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => 'http://example.com/user1.jpg',
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0'
));
$resp = curl_exec($curl);
echo $resp;
curl_close($curl);
What can be wrong?
I used the Fiddler tool to analyze the headers and saw 3 GET requests. The first two were 401 Unauthorized; the third was accepted without typing credentials (probably SSO is implemented).
It was using the NTLM authentication protocol, so running curl from the CLI with --ntlm -u username:password did the job for me.
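The PHP equivalent would look roughly like this (a sketch; username and password are placeholders):

// Authenticate with NTLM instead of basic auth.
curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_NTLM);
curl_setopt($curl, CURLOPT_USERPWD, 'username:password');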
The following cURL configuration works fine on my local machine using cURL 7.30.0:
$curl = curl_init();
curl_setopt_array($curl, array(
    // Just showing the noteworthy options here.
    CURLOPT_HTTPHEADER => array("Content-Type: application/x-www-form-urlencoded"),
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_0,
    CURLOPT_COOKIE => "foo=bar",
));
$response = curl_exec($curl);
curl_close($curl);
Excerpt of the debugging output:
> GET / HTTP/1.0
Host: example.com
Accept: */*
Cookie: foo=bar
Content-Type: application/x-www-form-urlencoded
Now I run the same code on a shared hosting environment with cURL 7.19.7 and get:
> GET / HTTP/1.1
Host: example.com
Accept: */*
Content-Type: application/x-www-form-urlencoded
Basically cURL is working 99% fine, but ignores the forced HTTP version and cookie string. Is the hosting company running a configuration that blocks these features? Is the cURL version they are running too old? What's going on here?
I found it: curl_setopt_array() quits processing options as soon as one option fails. I didn't know that. I should've checked the return value to make sure all was well.
In my case the culprit was option CURLOPT_FOLLOWLOCATION. It probably failed because the hosting provider is using safe mode, which disables the follow 301/302 feature.
$curl = curl_init();
$check = curl_setopt_array($curl, $options);
if (!$check) {
    die("Ye be warned: one of your options did not make it.");
}
$response = curl_exec($curl);
curl_close($curl);
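To pinpoint which option is the culprit, you can set the options one at a time and report the first failure (my own sketch, not part of the original fix):

// Apply options individually so the failing one can be identified.
foreach ($options as $option => $value) {
    if (!curl_setopt($curl, $option, $value)) {
        die("Setting option $option failed.");
    }
}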
I need to get data from a webpage on a server which uses the https protocol (i.e. https://site.com/page). Here's the PHP code I've been using:
$POSTData = array('');
$context = stream_context_create(array(
    'http' => array(
        //'ignore_errors' => true,
        'user_agent' => "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36",
        'header' => "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11.1 (KHTML, like Gecko) Version/3.0.3 Safari/522.12.1",
        'request_fulluri' => true,
        'method' => 'POST',
        'content' => http_build_query($POSTData),
    )
));
$pageHTML = file_get_contents("https://site.com/page", FALSE, $context);
echo $pageHTML;
However, this doesn't seem to work, giving out a file_get_contents warning with no information on the error. What might be the cause, and how do I work around it to connect to the server and get the page?
EDIT: Thanks to everyone who answered. My problem was that I was using an HTTP proxy, which I had removed from the code so that it wouldn't confuse you, as I thought it couldn't possibly have been the problem. To make the code load an HTTPS page via an HTTP proxy, I modified the stream_context_create() call like this:
stream_context_create(array(
'https' => array(
//...etc
Have a look at cURL if you haven't already. With cURL you can remotely access a webpage/API/file and have it downloaded to your server. The curl_setopt() function allows you to specify whether or not to verify the certificate of the remote server.
$file = fopen("some/file/directory/file.ext", "w");
$ch = curl_init("https://site.com/page");
curl_setopt($ch, CURLOPT_FILE, $file);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); //false to disable the cert check if needed
$data = curl_exec($ch);
curl_close($ch);
fclose($file);
Something like that will allow you to connect to an HTTPS server and then download the file that you want. If you know the server has a valid certificate (i.e. you aren't developing on a server that doesn't have a valid certificate) then you can leave out the curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); line, as cURL will attempt to verify the certificate by default.
cURL also has the curl_getinfo() function that will give you details about the most recently processed transfer that will help you debug the program.
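For example (a minimal sketch; curl_getinfo() must be called after curl_exec() and before curl_close()):

$info = curl_getinfo($ch);
echo $info['http_code'];  // HTTP status code of the transfer
echo $info['total_time']; // total transfer time in seconds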
For some reason I can't seem to get this particular web page's contents via cURL. I've managed to use cURL to get the contents of the "top level page" fine, but the same self-built quick cURL function doesn't seem to work for one of the linked sub-pages.
Top level page: http://www.deindeal.ch/
A sub page: http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/
My cURL function (in functions.php)
function curl_get($url) {
    $ch = curl_init();
    $header = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Accept-Language: en-us;q=0.8,en;q=0.6'
    );
    $options = array(
        CURLOPT_URL => $url,
        CURLOPT_HEADER => 0,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        CURLOPT_HTTPHEADER => $header
    );
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);
    return $return;
}
PHP file to get the contents (using echo for testing)
require "functions.php";
require "phpQuery.php";
echo curl_get('http://www.deindeal.ch/deals/hotel-walliserhof-zermatt-2-naechte-30/');
So far I've attempted the following to get this to work:
Ran the file both locally (XAMPP) and remotely (LAMP).
Added in the user agent and HTTP headers as recommended in "file_get_contents and CURL can't open a specific website"; before that, the function curl_get() contained all the options shown above except for CURLOPT_USERAGENT and CURLOPT_HTTPHEADER.
Is it possible for a website to completely block requests via cURL or other remote file-opening mechanisms, regardless of how much data is supplied to mimic a real browser request?
Also, is it possible to diagnose why my requests are coming back empty?
Any help answering the above two questions, or editing/making suggestions to get the file's contents, even through a method other than cURL, would be greatly appreciated ;).
Try adding:
CURLOPT_FOLLOWLOCATION => TRUE
to your options.
If you run a simple curl request from the command line (including a -i to see the response headers) then it is pretty easy to see:
$ curl -i 'http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/'
HTTP/1.1 302 FOUND
Date: Fri, 30 Dec 2011 02:42:54 GMT
Server: Apache/2.2.16 (Debian)
Vary: Accept-Language,Cookie,Accept-Encoding
Content-Language: de
Set-Cookie: csrftoken=d127d2de73fb3bd72e8986daeca86711; Domain=www.deindeal.ch; Max-Age=31449600; Path=/
Set-Cookie: generic_cookie=1; Path=/
Set-Cookie: sessionid=987b1a11224ecd0e009175470cf7317b; expires=Fri, 27-Jan-2012 02:42:54 GMT; Max-Age=2419200; Path=/
Location: http://www.deindeal.ch/welcome/?deal_slug=hotel-cristal-in-nuernberg-30
Content-Length: 0
Connection: close
Content-Type: text/html; charset=utf-8
As you can see, it returns a 302 with a Location header. If you hit that location directly, you will get the content you are looking for.
And to answer your two questions:
No, it is not possible to block requests from something like curl. If the consumer can talk HTTP then it can get to anything the browser can get to.
Diagnosing with an HTTP proxy could have been helpful for you. Wireshark, Fiddler, Charles, et al. should help you out in the future. Or, do like I did and make a request from the command line.
EDIT
Ah, I see what you are talking about now. So, when you go to that link for the first time you are redirected and a cookie (or cookies) is set. Once you have those cookies, your request goes through as intended.
So, you need to use a cookiejar, like in this example: http://icfun.blogspot.com/2009/04/php-how-to-use-cookie-jar-with-curl.html
So, you will need to make an initial request, save the cookies, and make your subsequent requests including the cookies after that.
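A minimal sketch of that flow, using the sub-page URL from the question (the jar path is a placeholder):

$cookieJar = '/tmp/cookies.txt'; // placeholder path

// First request: follow the redirect and let cURL save the cookies it receives.
$ch = curl_init('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // write cookies on close
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // read cookies on request
curl_exec($ch);
curl_close($ch);

// Subsequent request: the saved cookies are sent, so the content comes through.
$ch = curl_init('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
$content = curl_exec($ch);
curl_close($ch);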