We have recently encountered Cloudflare blocking on some sites that we had no issues scraping in the past.
It is not an IP block (we tried from several IPs), and it is not tied to an account or any other kind of authentication. The site does not show users CAPTCHAs.
We created a PHP cURL request using the exact GET request with all headers, but we receive a 403 Forbidden error and it displays:
www.******.com used CloudFlare to restrict access
Forgive my ignorance, but how does Cloudflare detect this? There are no cookies involved (as it's the initial site request), and the user-agent and everything else are identical.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.******.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
$headers = array();
$headers[] = 'Host: www.******.com';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Sec-Ch-Ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"';
$headers[] = 'Sec-Ch-Ua-Mobile: ?0';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'Sec-Fetch-Site: none';
$headers[] = 'Sec-Fetch-Mode: navigate';
$headers[] = 'Sec-Fetch-User: ?1';
$headers[] = 'Sec-Fetch-Dest: document';
$headers[] = 'Accept-Encoding: gzip, deflate, br';
$headers[] = 'Accept-Language: en-US,en;q=0.9';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
Result:
HTTP/1.1 403 Forbidden
Any possible workaround?
Thank you
I think only Cloudflare developers can give you a correct answer, and they will not disclose their commercial secrets so easily.
Everything else is theory and speculation. As far as I know, in response to the first request to a site, Cloudflare serves a page with some JavaScript that profiles the browser, to prove that it is a stock browser used by a human and not curl.
If the browser is a stock one, then Cloudflare allows the requests to be served by the backend.
I think you can try Selenium; sometimes it doesn't trigger the Cloudflare blocker page with the CAPTCHA.
Related
I have a server with a third-party API installed, located at http://65.21.1.13:3000/. When I open it in a browser, I receive the answer "Service start!", meaning that the service is working. I also receive this answer successfully using Android Java or Visual C++ MFC.
But when I try to open this site using PHP (curl or file_get_contents), I receive an error. I tried adding headers, flags and so on, but my curl_exec always returns false. Is there a way to get a proper answer from the server using PHP? One of my curl attempts is below:
$url = 'http://65.21.1.13';
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PORT ,3000);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$headers = [
'X-Apple-Tz: 0',
'X-Apple-Store-Front: 143444,12',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate',
'Accept-Language: en-US,en;q=0.5',
'Cache-Control: no-cache',
'Content-Type: application/x-www-form-urlencoded; charset=utf-8',
'Host: www.example.com',
'Referer: http://www.example.com/index.php', //Your referrer address
'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:28.0) Gecko/20100101 Firefox/28.0',
'X-MicrosoftAjax: Delta=true'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
var_dump($result);
The answer was very simple: port 3000 was blocked by the firewall on the PHP machine. Sorry to bother you.
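When curl_exec() returns false, as in the question above, curl_errno() and curl_error() usually say why (connection refused, timeout and so on), which can point to a blocked port sooner. The following is only a minimal sketch: the function name fetch_with_diagnostics and the timeout value are assumptions made for illustration, not part of the original question.

```php
<?php
// Hypothetical helper (name and timeout are illustrative assumptions).
// Returns [body, null] on success or [null, error message] on failure.
function fetch_with_diagnostics(string $url, int $timeoutSeconds = 5): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Fail fast instead of hanging when a firewall silently drops packets.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_errno()/curl_error() describe the transport-level failure.
        return [null, sprintf('curl error %d: %s', curl_errno($ch), curl_error($ch))];
    }
    return [$body, null];
}
```

Calling it as `[$body, $error] = fetch_with_diagnostics('http://65.21.1.13:3000/');` and printing `$error` when `$body` is null should show whether the connection was refused or timed out.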
I am trying cURL, but curl_exec() returns unreadable text like the screenshot below.
I wrote the cURL request as shown below. How can I fix this issue?
$ch = curl_init("https://app.kajabi.com/login");
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Host: app.kajabi.com',
'Connection: keep-alive',
'Cache-Control: max-age=0',
'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
'sec-ch-ua-mobile: ?0',
'sec-ch-ua-platform: "Windows"',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site: none',
'Sec-Fetch-Mode: navigate',
'Sec-Fetch-User: ?1',
'Sec-Fetch-Dest: document',
'Accept-Encoding: gzip, deflate, br',
'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8',
'Cookie: _kjb_session=795006a5538f30410ce2f56bd813ddb0; __cf_bm=7iLyh_LWPmJjzo07YdEJQaE_RT0LPS2R6NL1Hp3Li6g-1649142817-0-Ae4i2Gq5QTr+PktvLBJEV8MHcgGTw5ADVHkedUa3JTcVLHEDTyE01Nw6qsZtmjs7Quu+phKNOlCtu/8Cxpdwxec=; __cfruid=531ca052551b47923660c7b1832af0f2ea867981-1649142817; _kjb_ua_components=41e11a8e3c73294e1d2e0f1813e1f86d'
));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (curl_errno($ch)) {
print "Error: " . curl_error($ch);
exit();
}
echo $response;
I tried putting the response into a file, and it appears that the response is in gzip format.
file_put_contents('temp.gz', $response);
I extracted the archive and found that it's an HTML document telling you that access is denied.
You can show the response directly in the output of your php script, though:
$decoded_response = gzdecode($response);
echo $decoded_response;
And maybe you should check whether the content is actually gzip before attempting to use gzdecode; see this thread: php curl, detect response is gzip or not
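One way to perform that check is to look for the two-byte gzip magic number (0x1f 0x8b) at the start of the body. This is a rough sketch rather than the linked thread's solution; the helper names looks_like_gzip and decode_if_gzip are made up for this example, and it assumes the zlib extension is available.

```php
<?php
// Hypothetical helper: true when $body starts with the gzip magic bytes.
function looks_like_gzip(string $body): bool
{
    return strncmp($body, "\x1f\x8b", 2) === 0;
}

// Hypothetical helper: decode only when the body really looks like gzip,
// otherwise return it unchanged.
function decode_if_gzip(string $body): string
{
    if (!looks_like_gzip($body)) {
        return $body;
    }
    $decoded = gzdecode($body);
    // gzdecode() returns false on corrupt input; fall back to the raw body.
    return $decoded === false ? $body : $decoded;
}
```

With these in place, `echo decode_if_gzip($response);` would print the HTML whether or not the server compressed it.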
Edit:
You can let PHP do the decoding automatically by setting CURLOPT_ENCODING to '':
<?php
$ch = curl_init("https://app.kajabi.com/login");
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Host: app.kajabi.com',
'Connection: keep-alive',
'Cache-Control: max-age=0',
'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
'sec-ch-ua-mobile: ?0',
'sec-ch-ua-platform: "Windows"',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site: none',
'Sec-Fetch-Mode: navigate',
'Sec-Fetch-User: ?1',
'Sec-Fetch-Dest: document',
'Accept-Encoding: gzip, deflate, br',
'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8',
'Cookie: _kjb_session=795006a5538f30410ce2f56bd813ddb0; __cf_bm=7iLyh_LWPmJjzo07YdEJQaE_RT0LPS2R6NL1Hp3Li6g-1649142817-0-Ae4i2Gq5QTr+PktvLBJEV8MHcgGTw5ADVHkedUa3JTcVLHEDTyE01Nw6qsZtmjs7Quu+phKNOlCtu/8Cxpdwxec=; __cfruid=531ca052551b47923660c7b1832af0f2ea867981-1649142817; _kjb_ua_components=41e11a8e3c73294e1d2e0f1813e1f86d'
));
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (curl_errno($ch)) {
print "Error: " . curl_error($ch);
exit();
}
echo $response;
?>
You are getting un-handled GZIP from Curl because you are manually setting the Accept-Encoding: header in your header array, rather than letting Curl handle it. Curl then gets an unexpectedly-encoded response and goes "I dunno, you deal with this".
You're telling the remote side "I want things handled this way" but you're not actually telling the local side.
Easy fix: remove the Accept-Encoding: header from your header array. Optionally, move those encoding specifications to the CURLOPT_ENCODING setting you added in your own answer, though I would say this is unnecessary, since an empty string already tells curl to advertise every encoding it supports.
Other headers that you should likely not be manually setting:
Host: unnecessary unless you need a value other than the hostname in the URL
Connection: client needs to be aware
Upgrade-Insecure-Requests: client needs to be aware, browser-specific
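As a rough illustration of the advice above, the header list copied from a browser can be filtered before it is handed to curl, leaving compression to CURLOPT_ENCODING. The function name strip_managed_headers and the exact set of dropped headers are assumptions for this sketch; adjust the list to your own case.

```php
<?php
// Hypothetical helper: removes headers that are better left to curl itself.
function strip_managed_headers(array $headers): array
{
    $managed = ['accept-encoding', 'host', 'connection', 'upgrade-insecure-requests'];
    return array_values(array_filter($headers, function (string $line) use ($managed): bool {
        // Header name is everything before the first colon, case-insensitive.
        $name = strtolower(trim(strstr($line, ':', true) ?: $line));
        return !in_array($name, $managed, true);
    }));
}

$headers = strip_managed_headers([
    'Host: app.kajabi.com',
    'Connection: keep-alive',
    'Upgrade-Insecure-Requests: 1',
    'Accept-Encoding: gzip, deflate, br',
    'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8',
]);

if (function_exists('curl_init')) {
    $ch = curl_init('https://app.kajabi.com/login');
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    // '' makes curl send its own Accept-Encoding and decode the response.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // curl_exec($ch) would perform the request at this point.
}
```

Only the Accept-Language line survives the filter here, so curl decides Host, Connection and Accept-Encoding on its own.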
I need to extract the DOM from an external website in PHP. I tried a test URL, but sometimes it shows a great many Chinese letters :) (more specifically, Unicode characters, I thought).
It's strange: if I use a different link, it works, but if I use the link below and run the PHP script, for example, three times, it stops working after the third try (the first and second time it shows the normal DOM structure).
URL: https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/
DOM after (roughly) the third run: https://i.stack.imgur.com/lnM1I.png
Code:
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile("https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/");
dd($doc->saveHTML());
Does anybody know what to do?
I guess it is because of the site's compression. You can extract the data by using good old curl:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
$headers = array();
$headers[] = 'Connection: keep-alive';
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Save-Data: on';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36';
$headers[] = 'Dnt: 1';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8';
// No manual Accept-Encoding header: CURLOPT_ENCODING above makes curl send it and decode the response.
$headers[] = 'Accept-Language: en-US;q=0.8,en;q=0.7,uk;q=0.6';
$headers[] = 'Cookie: nette-samesite=1; developers-ad=1;';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result);
dd($doc->saveHTML());
I have a REST API:
http://www.animemobile.com/service/v2/mobile2.php?episode_id=47272
I use curl to request it. On my PC with XAMPP, it works well and returns the correct results. These are the results from my PC with XAMPP:
[
{
"Title":"English Subbed",
"link":"\/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=14GwNjlMxuI8524DS56IUA&e=1495183034"
}
]
I use
/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=14GwNjlMxuI8524DS56IUA&e=1495183034
to create a link as:
http://st2.anime1.com/[HorribleSubs]%20Pascal-sensei%20-
%2001%20[720p]_af.mp4?st=14GwNjlMxuI8524DS56IUA&e=1495183034.
This link points to a video that can be played when requested from a browser (for now).
But when I use curl on my server, the request still succeeds but does not return the correct results. These are the results from my server:
[
{
"Title":"English Subbed",
"link":"\/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=ghDP4290fsBNdmfsSKCD=1495195645"
}
]
When I use
/[HorribleSubs] Pascal-sensei - 01 [720p]_af.mp4?
st=ghDP4290fsBNdmfsSKCD=1495195645
to create a link as:
http://st2.anime1.com/[HorribleSubs]%20Pascal-sensei%20-
%2001%20[720p]_af.mp4?%20st=ghDP4290fsBNdmfsSKCD=1495195645.
It doesn't play on my browser.
This is my curl:
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($c, CURLINFO_HEADER_OUT, true);
$headers = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding: gzip, deflate, sdch',
'Accept-Language: vi,en-US;q=0.8,en;q=0.6',
'Cache-Control: max-age=0',
'Connection: keep-alive',
'Cookie: __cfduid=d7bf11c717fbcd54ec9b259e301a966d71480412679',
'Host: www.animemobile.com',
'Upgrade-Insecure-Requests: 1',
'User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
];
curl_setopt($c, CURLOPT_HTTPHEADER, $headers);
$data = curl_exec($c);
What is the problem here? Please help me!
Edit 1: If you want to test the results, you need to request the REST API again, because the generated link is only valid for a limited time. The important point is that requesting the REST API from my PC returns correct results, but requesting it from the server returns wrong results, although they look very similar!
How can I force curl in PHP to download a page the way a browser does? The page I want to download is a price comparison site, e.g. http://www.ceneo.pl/22416171. It's public; anybody can access the site.
To check whether downloading with curl is even possible, I typed this on my Debian-based local server:
curl http://www.ceneo.pl/22416171
And it worked perfectly. But I need to use it on my virtual PHP/Apache server, so I need to do it in PHP.
When I try to download the page with PHP-based curl, it gives me nothing, unlike shell curl.
Why? How can I get the right content in PHP?
Tried:
<?php
$curl = curl_init('http://www.ceneo.pl/22416171');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_HEADER, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl,CURLOPT_HTTPHEADER,
array(
'User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: pl,en-US;q=0.7,en;q=0.3',
'Accept-Encoding: gzip, deflate',
'p3p: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT"',
'Vary: Accept-Encoding',
'Content-Type: text/html; charset=utf-8',
'Cache-Control: private'
));
$body = curl_exec($curl);
curl_close($curl);
echo $body;
?>
I also tried to use
<?php exec('curl http://www.ceneo.pl/22416171'); ?>
But it gave
curl: /usr/local/lib/libcurl.so.4: no version information available (required by curl)
Take a look at the documentation: http://www.php.net/manual/en/curl.examples.php
This is how you do it:
test.php
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.ceneo.pl/22416171");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
//set headers
curl_setopt($ch,CURLOPT_HTTPHEADER, array(
'User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: pl,en-US;q=0.7,en;q=0.3',
//'Accept-Encoding: gzip, deflate',
'p3p: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT"',
'Vary: Accept-Encoding',
'Content-Type: text/html; charset=utf-8',
'Cache-Control: private'
));
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
// debug
echo $output;
Demo of it working (only the html output from the site is retrieved):