Connect to an HTTPS server with PHP

I need to get data from a webpage on a server which uses the https protocol (i.e. https://site.com/page). Here's the PHP code I've been using:
$POSTData = array('');
$context = stream_context_create(array(
    'http' => array(
        //'ignore_errors' => true,
        'user_agent' => "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36",
        'header' => "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11.1 (KHTML, like Gecko) Version/3.0.3 Safari/522.12.1",
        'request_fulluri' => true,
        'method' => 'POST',
        'content' => http_build_query($POSTData),
    )
));
$pageHTML = file_get_contents("https://site.com/page", FALSE, $context);
echo $pageHTML;
However, this doesn't seem to work: it gives a Warning from file_get_contents with no information on the error. What might be the cause, and how do I work around it so I can connect to the server and get the page?
EDIT: Thanks to everyone who answered. My problem was that I was using an HTTP proxy, which I had removed from the code above so that it wouldn't confuse you, since I thought it couldn't possibly be the problem. To make the code load an HTTPS page via an HTTP proxy, I modified the stream_context_create call like this:
stream_context_create(array(
    'https' => array(
        //...etc
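A fuller sketch of that proxy-aware context might look like the following (the proxy address below is a placeholder):
$POSTData = array('');
$context = stream_context_create(array(
    'https' => array(  // 'https' wrapper key, since the target URL uses https:// (see the edit above)
        'proxy'           => 'tcp://proxy.example.com:8080',  // placeholder proxy address
        'request_fulluri' => true,
        'method'          => 'POST',
        'user_agent'      => "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36",
        'content'         => http_build_query($POSTData),
    )
));
$pageHTML = file_get_contents("https://site.com/page", false, $context);
echo $pageHTML;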

Have a look at cURL if you haven't already. With cURL you can remotely access a webpage/API/file and have it downloaded to your server. The curl_setopt() function allows you to specify whether or not to verify the certificate of the remote server:
$file = fopen("some/file/directory/file.ext", "w");
$ch = curl_init("https://site.com/page");
curl_setopt($ch, CURLOPT_FILE, $file);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); //false to disable the cert check if needed
$data = curl_exec($ch);
curl_close($ch);
fclose($file);
Something like that will allow you to connect to an HTTPS server and then download the file that you want. If you know the server has a valid certificate (i.e. you aren't developing on a server that doesn't have a valid certificate) then you can leave out the curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); line, as cURL will attempt to verify the certificate by default.
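If verification fails only because PHP cannot find a CA bundle (rather than because the certificate itself is invalid), a better option than disabling verification is to point cURL at a CA bundle explicitly. A minimal sketch, assuming a placeholder URL and bundle path:
// Sketch: keep peer verification enabled and supply a CA bundle explicitly.
$ch = curl_init("https://site.com/page");                 // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);           // verify the remote certificate (the default)
curl_setopt($ch, CURLOPT_CAINFO, "/path/to/cacert.pem");  // placeholder path to a CA bundle
$data = curl_exec($ch);
curl_close($ch);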
cURL also has the curl_getinfo() function that will give you details about the most recently processed transfer that will help you debug the program.
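For example, a minimal debugging sketch (the URL is a placeholder) might look like this:
$ch = curl_init("https://site.com/page");  // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
if ($data === false) {
    // curl_error() must be read before curl_close()
    echo "cURL error: " . curl_error($ch) . "\n";
} else {
    $info = curl_getinfo($ch);
    echo "HTTP status: " . $info['http_code'] . "\n";
    echo "Total time: " . $info['total_time'] . " s\n";
}
curl_close($ch);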

Related

PHP: How to get website with cURL and act like a real browser?

There's a specific website I want to get the source code from with PHP cURL.
Visiting this website with a browser from my computer works without any problems.
But when I want to access this website with my PHP script, the website recognizes that this is an automated request and shows an error message.
This is my PHP script:
<?php
$url = "https://www.example.com";
$user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.1 Safari/605.1.15";
$header = array('http' => array('user_agent' => $user_agent));
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
?>
The user agent is the same one I'm using in the browser. I'm using a local server with MAMP PRO, which means I'm using the same IP address for both the browser access and the PHP script access.
I already tried my PHP script with many different headers and options but nothing worked.
There must be something that makes a PHP script request look different from a browser request to the web server I'm trying to access. But what? Do you have an idea?
EDIT
I found out that it works with this cURL command:
curl 'https://www.example.com/' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'accept-language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7'
If I run this in e.g. the Terminal, it shows the correct source code.
I converted it to a PHP script as follows:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
$headers = array();
$headers[] = 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3';
$headers[] = 'Accept-Language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>
Unfortunately, this way it still shows the error message.
This means there must be something that makes a command-line request look different from a browser request to the web server I'm trying to access. But what is it?
There is no difference between a cURL request and the request that a browser makes, apart from the HTTP headers it sends and the fact that a browser also runs JavaScript on the client.
The only thing that identifies an HTTP client is its headers -- typically the user agent string -- and seeing as you have set the user agent to exactly the same as the browser, there must be other checks in place.
By default, cURL doesn't send an Accept header, whereas browsers request pages with this header to advertise their capabilities. I expect the web server is checking for something like this.
In Chrome Developer Tools, the Network tab lets you copy a whole request as a cURL command, including all the headers that were sent from Chrome, for testing in the terminal.
Try to match all the headers exactly from within your PHP, and I'm sure the web server will not be able to identify you as a script.
You should try to mimic a real browser by forging a "real" HTTP request. Add more headers than just the User-Agent, such as "Accept", "Accept-Language" and "Accept-Encoding". You probably also need to accept (and handle correctly) cookies.
If the targeted website uses JavaScript to detect a real browser, that is another challenge.
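As a rough sketch of those suggestions in PHP cURL (the URL and cookie-jar path are placeholders; the headers are simply copied from a real browser request):
$ch = curl_init('https://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');                    // accept any encoding cURL supports and decode it
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // placeholder path: store cookies after the request
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // placeholder path: send stored cookies on the next request
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
));
$result = curl_exec($ch);
curl_close($ch);
echo $result;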

php curl unauthorized but works manually

So if I'm browsing http://www.example.com/user1.jpg I see the user's picture.
But if I make a cURL request via PHP from my localhost web server (so the same IP), it throws 401 Unauthorized.
I even tried changing the user agent, still with no success.
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => 'http://example.com/user1.jpg',
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0'
));
$resp = curl_exec($curl);
echo $resp;
curl_close($curl);
What can be wrong?
I used the Fiddler tool to analyze the headers and saw 3 GET requests. The first two were 401 Unauthorized; the third was accepted without typing credentials (probably SSO is implemented).
It was using the NTLM authentication protocol, so running curl from the CLI with "--ntlm -u username:password" did the job for me.
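In PHP cURL, the equivalent is roughly the following sketch (URL and credentials are placeholders):
$curl = curl_init('http://example.com/user1.jpg');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_NTLM);       // use the NTLM authentication scheme
curl_setopt($curl, CURLOPT_USERPWD, 'username:password');  // placeholder credentials
$resp = curl_exec($curl);
curl_close($curl);
echo $resp;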

Is there a way to get round a 403 error with php file_get_contents?

I'm trying to get a specific webpage using PHP file_get_contents. When I view the page directly there is no problem, but when trying to grab it using PHP I get "failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden". There's a piece of data that I'm trying to extract from the page.
$ft = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000');
echo $ft;
I've read various pages here about using stream_context_create, mainly the user agent part:
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
But nothing works, and I now get a 400 error message. Unfortunately, it doesn't look like my server is configured to use cURL, so file_get_contents seems to be the only way for me to do this.
You need to add the User-Agent header to the actual header:
$context = stream_context_create(
    array(
        'http' => array(
            'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        ),
    )
);
You could also use the user_agent option:
$context = stream_context_create(
    array(
        'http' => array(
            'user_agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        ),
    )
);
Both above examples should work and you should now be able to get the contents using:
$content = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000', false, $context);
echo $content;
This could of course also be tested using curl from the command line. Notice that we are setting our own User-Agent header:
curl --verbose -H 'User-Agent: YourApplication/1.0' 'https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000'
It might also be worth knowing that the default User-Agent used by curl seems to be blocked, so if using curl you need to add your own using the -H flag.
Vesselfinder, the service you are making the request to, seems to deny automatic parsing of their data, as #ADyson said. Read the docs: https://www.vesselfinder.com/de/realtime-ais-data#rt-web-services
You may ask them for an API token; it may be a paid plan.
They have an official API. You need an API key.

file_get_contents, curl, wget fails with 403 response

I am trying to echo site data, and for 95% of sites file_get_contents and curl work just fine, but for a few sites it never works no matter what I try. I tried to define a proper user agent and changed SSL verify to false, but nothing worked.
Test site where it fails with Forbidden: https://norskbymiriams.dk/
wget is unable to copy SSL sites even though wget is compiled with SSL support (checked with wget -V).
I tried the code below; none of it worked for this particular site.
file_get_contents
$list_url = "https://norskbymiriams.dk/";
$html = file_get_contents($list_url);
echo $html;
curl
$handle=curl_init('https://norskbymiriams.dk');
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($handle);
echo $content;
Any help would be great.
Some websites analyse a request extremely thoroughly. If there is a single thing that makes that web server think you are a crawling bot, it might return 403.
I would try this:
Make a request from a browser, look at all the request headers, and place them in my curl request (to simulate a real browser).
my curl request would look like this:
curl 'https://norskbymiriams.dk/' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
  --compressed
Please try it; it works.
You can make a request in Chrome, for example, and use the Network tab in Developer Tools to inspect the page request. If you right-click on it, you will see "Copy as cURL".
Therefore test each header separately in your actual cURL request, see which is the missing link, then add it and continue your crawling.
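For reference, a sketch of how that command might translate to PHP cURL (this is not part of the original answer; --compressed maps to CURLOPT_ENCODING):
$handle = curl_init('https://norskbymiriams.dk/');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($handle, CURLOPT_ENCODING, '');  // equivalent of --compressed: accept gzip/deflate and decode automatically
curl_setopt($handle, CURLOPT_HTTPHEADER, array(
    'Upgrade-Insecure-Requests: 1',
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
));
$content = curl_exec($handle);
curl_close($handle);
echo $content;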

HTTP headers sent by PHP

I am using cURL and file_get_contents to find out the basic difference between a server request for a page and an (organic) browser request.
I am requesting a PHPINFO page both ways and found that it gives different output in each case.
For example, when I am using a browser the PHPINFO shows this:
_SERVER["HTTP_CACHE_CONTROL"] no-cache
This info is missing when I am requesting the same page through PHP.
My CURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/phpinfo.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0");
curl_setopt($ch, CURLOPT_INTERFACE, $testIP);
$output = curl_exec($ch);
curl_close($ch);
My file_get_contents:
$opts = array(
    'socket' => array('bindto' => 'xxx.xx.xx.xx:0'),
    'method' => 'GET',
    'user_agent ' => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0", // this doesn't work
    'header' => array('Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*\/*;q=0.8')
);
My goal:
To make a PHP request look identical to a browser request.
One possible way for the server to detect that you are PHP code and not a browser is to check your cookies. With PHP cURL, make one request to the server, then inject the cookie you receive into your next request.
Check here:
http://docstore.mik.ua/orelly/webprog/pcook/ch11_04.htm
Another way the server can tell that you are a robot (PHP code) is to check the Referer HTTP header.
You can learn more here:
http://en.wikipedia.org/wiki/HTTP_referer
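As a sketch of both suggestions applied to the file_get_contents attempt (not from the original answers): the HTTP options have to be nested under the 'http' wrapper key, and the 'user_agent ' key in the question has a trailing space, so PHP will not recognize it. The bindto address, URL and Referer value below are placeholders.
$opts = array(
    'socket' => array('bindto' => 'xxx.xx.xx.xx:0'),  // placeholder outgoing address
    'http' => array(
        'method'     => 'GET',
        'user_agent' => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0",
        'header'     => "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
                        "Referer: http://www.example.com/\r\n",  // placeholder referer
    ),
);
$context = stream_context_create($opts);
$output = file_get_contents("http://www.example.com/phpinfo.php", false, $context);
echo $output;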
