I am trying to echo site data. For 95% of sites, file_get_contents and cURL work just fine, but for a few sites nothing I try works. I tried defining a proper user agent and setting SSL verification to false, but nothing helped.
Test site where it fails with "Forbidden": https://norskbymiriams.dk/
wget is also unable to fetch SSL sites, even though it is compiled with SSL support (checked with wget -V).
I tried the following code; none of it worked for this particular site.
file_get_contents
$list_url = "https://norskbymiriams.dk/";
$html = file_get_contents($list_url);
echo $html;
curl
$handle=curl_init('https://norskbymiriams.dk');
curl_setopt($handle, CURLOPT_HEADER, true);
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36");
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$content = curl_exec($handle);
echo $content;
Any help would be great.
Some websites analyse requests very closely. If a single detail makes the web server think you are a crawling bot, it might return 403.
I would try this:
Make a request from a browser, look at all the request headers, and put them in my cURL request (to simulate a real browser).
My cURL request would look like this:
curl 'https://norskbymiriams.dk/' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' \
  --compressed
Please try it. It works.
You can make the request in Chrome, for example, and use the Network tab in Developer Tools to inspect the page request. If you right-click on it, you will see "Copy as cURL".
Therefore test each header separately in your actual cURL request, see which is the missing link, then add it and continue your crawling.
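The header-by-header test described above can be sketched in PHP. This is only an illustration, assuming the headers were copied from the browser's Network tab; headerVariants() and fetchStatus() are hypothetical helper names, not part of cURL.

```php
<?php
// Hypothetical helper: for each header, build a variant of the list with that
// one header left out, keyed by the header that was dropped.
function headerVariants(array $headers): array
{
    $variants = [];
    foreach ($headers as $i => $dropped) {
        $rest = $headers;
        unset($rest[$i]);
        $variants[$dropped] = array_values($rest);
    }
    return $variants;
}

// Hypothetical helper: request $url with the given headers and return the
// HTTP status code (0 if the transfer itself failed).
function fetchStatus(string $url, array $headers): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_ENCODING, '');   // same idea as curl --compressed
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status;
}

// Headers taken from the curl command above.
$browserHeaders = [
    'Upgrade-Insecure-Requests: 1',
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
];

// Usage (performs real requests, so it is left commented out):
// foreach (headerVariants($browserHeaders) as $dropped => $headers) {
//     echo fetchStatus('https://norskbymiriams.dk/', $headers), " without: $dropped\n";
// }
```

If the status changes from 200 to 403 when a particular header is dropped, that header is probably one the server checks.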
Related
There's a specific website I want to get the source code from with PHP cURL.
Visiting this website with a browser from my computer works without any problems.
But when I want to access this website with my PHP script, the website recognizes that this is an automated request and shows an error message.
This is my PHP script:
<?php
$url = "https://www.example.com";
$user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.1 Safari/605.1.15";
$header = array('http' => array('user_agent' => $user_agent));
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
?>
The user agent is the same I'm also using with the browser. I'm using a local server with MAMP PRO. This means I'm using the same IP address for both, browser access and PHP script access.
I already tried my PHP script with many different headers and options but nothing worked.
There must be something that makes access from a PHP script look different from browser access to the web server I want to reach. But what? Do you have an idea?
EDIT
I found out that it's working with this cURL:
curl 'https://www.example.com/' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'accept-language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7'
If I type this in e.g. the Terminal, it's showing the correct source code.
I converted it to a PHP script as follows:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
$headers = array();
$headers[] = 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3';
$headers[] = 'Accept-Language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>
Unfortunately, this way it's still showing the error message.
This means there must be something that makes command-line access look different from browser access to the web server I want to reach. But what is it?
There is no difference between a cURL request and the request that a browser makes, apart from the HTTP headers it sends and the fact that a browser runs JavaScript on the client.
The only thing that identifies an HTTP client is its headers -- typically the user agent string -- and seeing as you have set the user agent to exactly the same as the browser, there must be other checks in place.
By default, cURL sends only a generic Accept: */* header, whereas browsers send a detailed Accept header that describes what the browser can handle. I expect the web server is checking something like this.
Chrome Developer Tools lets you copy the whole request as a cURL command, including all the headers that Chrome sent, for testing in the terminal.
Try to match all the headers exactly from within your PHP, and I'm sure the web server will not be able to identify you as a script.
You should try to mimic a real browser by forging a "real" HTTP request. Add more headers than just User-Agent, such as "Accept", "Accept-Language" and "Accept-Encoding". You probably also need to accept (and correctly handle) cookies.
If the targeted website uses JavaScript to detect a real browser, that is another challenge.
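The advice in the answers above (send more than the User-Agent, accept compressed responses, keep cookies) might look like this in PHP. This is a minimal sketch: browserLikeHeaders() and fetchLikeBrowser() are hypothetical names, the header values are the ones from the question's working command, and the cookie file path is whatever writable path you choose.

```php
<?php
// Hypothetical helper: the request headers from the question's working
// command, as a list suitable for CURLOPT_HTTPHEADER.
function browserLikeHeaders(string $userAgent): array
{
    return [
        'User-Agent: ' . $userAgent,
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Language: de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    ];
}

// Hypothetical helper: returns the response body as a string, or false on failure.
function fetchLikeBrowser(string $url, string $userAgent, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTPHEADER     => browserLikeHeaders($userAgent),
        // Empty string: advertise every encoding cURL supports and decode the reply.
        CURLOPT_ENCODING       => '',
        // Read cookies from this file and write them back when the handle closes.
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

None of this helps if the site relies on JavaScript running in the client, since cURL does not execute scripts.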
I'm a novice PHP programmer. Last week I read about cURL, which caught my attention, so I started studying it. First I copied and pasted code posted on different blogs, and it runs fine, like my code below.
<?php
$handle=curl_init('http://www.google.co.kr/');
curl_setopt($handle, CURLOPT_VERBOSE, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
$content = curl_exec($handle);
echo $content;
?>
But why can't I cURL the website
http://www.todayhumor.co.kr/
when I am using the same code as above?
I'm looking forward to your responses. Thank you in advance.
After calling curl_exec($handle) you should close the session with curl_close($handle). Maybe you tried so many times and now it doesn't work anymore, because you have so many open sessions on your local server. I would add that line to your code, restart xampp and try again.
Edit:
The server rejects requests without a valid user agent. Add a user agent to your request: curl_setopt($handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'); This worked for me.
I've got a really odd problem and no idea how to debug it. Maybe some experienced developer can help me. I've the following code:
$url = 'https://home.mobile.de/home/ses.html?customerId=471445&json=true&_='.time();
echo $url;
$agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36';
// Initiate curl
$ch = curl_init();
// Activate debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// Disable SSL verification
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Return the response as a string; if false, curl_exec prints it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Set browser user agent
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
// Set the url
curl_setopt($ch, CURLOPT_URL,$url);
// Execute
$result=curl_exec($ch);
// Closing
curl_close($ch);
$php_object = json_decode($result);
var_dump($php_object);
I've put this code into a PHP file called playground.php. If I open playground.php in Chrome (I am using MAMP as a local server), everything works as expected. It also works as expected if I run "php playground.php" on the OS X command line, but for some reason it does not work if I run it inside the PhpStorm CLI.
Any idea what could be wrong and how I can debug this issue?
Many thanks in advance.
Thanks to LazyOne I was able to find out that a firewall rule was blocking the outgoing request. Many thanks!
I seem to be in a bit of a predicament. As far as I am aware, there have been no changes to PHP or Apache; however, code that has worked for almost 6 months just stopped working today at 2pm.
The code is:
function ls_record($prospectid,$campid){
$api_post = "method=NewProspect&prospect_id=".$prospectid."&campaign_id=".$campid;
$ch = curl_init();
curl_setopt($ch, CURLOPT_FRESH_CONNECT, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $api_post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://XXXXX/XXXXXX/store.php");
$x = print_r(curl_exec($ch), TRUE);
return $x;
}
It returns NULL. I tried using file_get_contents(), which also returns NULL. I checked the Apache error logs and see nothing... I need some help on this one.
Do you have access to the command line of the server? It could be that the destination has blocked you somehow.
If you have command line access, try this
wget http://XXXXX/XXXXXX/store.php
That should at least return something (if not headers)
Use curl_getinfo to check the status of your cURL request. It may be that the server you are trying to fetch content from requires a user agent; some sites check the user agent to block unwanted cURL access.
Below is the user agent I use to disguise my cURL request as a desktop Chrome browser.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36');
I faced the same problem on my server because of slow internet speed. The connection slowed down for a while, cURL took too long to execute, and it returned a timeout error. After a few minutes it was working fine again without any changes on the server.
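To see why a transfer fails instead of only getting an empty result, the cURL error number, error message and HTTP status can be inspected after curl_exec(). A sketch only: describeTransfer() and fetchWithDiagnostics() are hypothetical helper names, and the timeout values are arbitrary examples.

```php
<?php
// Hypothetical helper: turn the raw cURL outcome into a readable diagnosis.
function describeTransfer(int $errno, string $error, int $status): string
{
    if ($errno !== 0) {
        // Transfer-level failure: DNS, refused connection, timeout, TLS ...
        return "cURL error $errno: $error";
    }
    if ($status >= 400) {
        // The server answered but refused or failed, e.g. 403 for a blocked client.
        return "HTTP error $status";
    }
    return "OK ($status)";
}

// Hypothetical helper: returns [body or false, diagnosis string].
function fetchWithDiagnostics(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_CONNECTTIMEOUT => 10, // seconds to wait for the connection
        CURLOPT_TIMEOUT        => 30, // seconds allowed for the whole transfer
    ]);
    $body   = curl_exec($ch);
    $errno  = curl_errno($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return [$body, describeTransfer($errno, $error, $status)];
}
```

A timeout like the one described above would be reported as cURL error 28, while a server that blocks the client would typically show up as an HTTP error such as 403.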
I'm writing a cURL script, but how can I check if it's working and passing properly when it's visiting the website?
$ckfile = '/tmp/cookies.txt';
$useragent= "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0_1 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Mobile/7A400";
$ch = curl_init ("http://website.com");
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent); // set user agent
curl_setopt ($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$output = curl_exec ($ch);
curl_close($ch);
Just make a PHP page like this on your server and point your script at your own URL:
var_dump($_SERVER);
and check the HTTP_USER_AGENT string.
You can also get the same information from the Apache logs.
But I am pretty sure curl is setting the User-Agent string like it should ;-)
The Firefox extension LiveHTTPHeaders will help you see exactly what happens to the headers during a normal browsing session.
http://livehttpheaders.mozdev.org/
This will increase your understanding of how your target server responds, and even shows if it redirects your request internally.
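The test page suggested above can be narrowed down to just the request headers. A sketch: PHP exposes most request headers in $_SERVER under keys beginning with HTTP_; extractRequestHeaders() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: collect the HTTP_* entries of $_SERVER and convert
// keys such as HTTP_USER_AGENT back to header names such as User-Agent.
function extractRequestHeaders(array $server): array
{
    $headers = [];
    foreach ($server as $key => $value) {
        if (strpos($key, 'HTTP_') !== 0) {
            continue; // not a request header
        }
        $name = str_replace('_', ' ', strtolower(substr($key, 5)));
        $headers[str_replace(' ', '-', ucwords($name))] = $value;
    }
    return $headers;
}

// When served by a web server, print what the client actually sent.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: text/plain');
    foreach (extractRequestHeaders($_SERVER) as $name => $value) {
        echo "$name: $value\n";
    }
}
```

Pointing both the browser and the cURL script at this page makes it easy to compare the two sets of headers side by side.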