I was able to crawl a newspaper website successfully before, but it fails today.
I can still access the site with Firefox; the problem only happens with curl. That means my IP is allowed and is not banned.
Here is the error shown by the website:
Please enable cookies.
Error 1010 Ray ID: 1a17d04d7c4f8888
Access denied
What happened?
The owner of this website (www1.hkej.com) has banned your access based
on your browser's signature (1a17d04d7c4f8888-ua45).
CloudFlare Ray ID: 1a17d04d7c4f8888 • Your IP: 2xx.1x.1xx.2xx •
Performance & security by CloudFlare
Here is my code, which worked before:
$cookieMain = "cookieHKEJ.txt"; // need two different cookie files since curl overwrites the old one when storing cookies; the cookie file is stored under the Apache folder
$cookieMobile = "cookieMobile.txt"; // same reason as above
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
// submit a login
function cLogin($url, $post, $agent, $cookiefile, $referer) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if the page forwards us somewhere else
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // spoof the user agent so the request looks like it comes from the user's browser
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); // use $cookiefile for STORING cookies
curl_setopt($ch, CURLOPT_POST, true); // tell curl that we are posting data
curl_setopt($ch, CURLOPT_POSTFIELDS, $post); // post the data in $post
curl_setopt($ch, CURLOPT_REFERER, $referer); // set the Referer header
$output = curl_exec($ch); // execute
curl_close($ch);
return $output;
}
$input = cDisplay("http://www1.hkej.com/dailynews/toc", $agent, $cookieMain);
echo $input;
How can I use curl to successfully pretend to be a browser? Did I miss some parameters?
As I said in the post, I can use Firefox to access the site and my IP is not banned.
In the end, I got it working after I changed the code from
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
to
$agent = $_SERVER['HTTP_USER_AGENT'];
Actually, I don't know why it started to fail yesterday whenever the string contained "User-Agent: ", when it was fine before.
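Presumably CURLOPT_USERAGENT expects only the header value, so prefixing the string with "User-Agent: " sends a malformed header. A minimal sketch of the two ways to set it (using the Firefox string from my original code; this is just an illustration, not the final code):
$ch = curl_init("http://www1.hkej.com/dailynews/toc");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Option 1: CURLOPT_USERAGENT takes only the value, without the "User-Agent: " prefix
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0");
// Option 2: or send the complete header line yourself instead
// curl_setopt($ch, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0"));
echo curl_exec($ch);
curl_close($ch);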
Thanks all anyway.
The site owner has used Cloudflare's security features to prevent you from crawling their website; your request was more than likely flagged as coming from a malicious bot. They will have done this based on your user agent and IP address.
Try changing your IP (if you are a home user, rebooting your router will sometimes get you a different IP address), try using a proxy, and try sending different headers with curl, as in the sketch below.
More importantly, they do not want people crawling their site and affecting their traffic, so you should really ask permission for this.
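A rough illustration of the proxy and extra-headers suggestions (the proxy address and header values below are made up for the example, not something known to work for this site):
$ch = curl_init("http://www1.hkej.com/dailynews/toc");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// route the request through a proxy (hypothetical address)
curl_setopt($ch, CURLOPT_PROXY, "203.0.113.10:8080");
// send browser-like headers along with the spoofed user agent
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0");
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-US,en;q=0.5"
));
$html = curl_exec($ch);
curl_close($ch);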
Related
I'm running into an issue with cURL while getting customer review data from Google (without the API). My cURL request was working just fine before, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document had moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above curl option is what tells curl to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. By adding the option you will obtain the HTML source of the final page Google redirects to, but that page contains scripts which, when executed, load the map and other content that is ultimately displayed in your browser. So if you need to fetch data that is subsequently loaded by JavaScript, you will not find it in the returned result; instead, you should look into using a tool like Selenium with PHP (you might take a look at this post).
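For illustration, a minimal version of your request with redirect following enabled, plus a cookie jar on the assumption that the consent flow sets cookies along the way (the cookie file name is arbitrary):
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // follow the 302 instead of stopping at it
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'google_cookies.txt');  // arbitrary cookie file name
curl_setopt($ch, CURLOPT_COOKIEFILE, 'google_cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
print_r($result);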
I had a simple parser for an external site that is required to confirm that a link the user submitted leads to an account this user owns (by parsing a link back to their profile from the linked page). It worked for a good long while with just this WordPress function:
function fetch_body_url($fetch_link){
$response = wp_remote_get($fetch_link, array('timeout' => 120));
return wp_remote_retrieve_body($response);
}
But then the website changed something in their Cloudflare protection, and now this results in Cloudflare's "Please wait..." page with no way to get past it.
The thing is, I don't even need this done automatically: if there were a captcha, the user could complete it. But it won't show anything other than the endlessly spinning "Checking your browser".
I googled a bunch of curl examples, and the best I could get so far is this:
<?php
$url='https://ficbook.net/authors/1000'; //random profile from requrested website
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, 'https://facebook.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
$response = curl_exec($ch);
curl_close($ch);
echo '<textarea>'.$response.'</textarea>';
?>
Yet it still returns the browser check screen. Adding a random free proxy doesn't seem to work either, or maybe I wasn't lucky enough to find a working one (or couldn't figure out how to plug it in correctly in this case). Is there any way around it? Or perhaps there is some other way to check whether a specific keyword/link is on the page?
OK, I've spent most of the day on this problem, and it seems I got it more or less sorted. Not exactly the way I expected, but hey, it works... sort of.
Instead of solving this on the server side, I ended up parsing it on my own PC (it has better uptime than my hosting's server anyway). It turns out there are plenty of ready-to-use open-source scrapers, including ones that know how to get past Cloudflare when it is being extra defensive for no good reason.
Solution for python dummies like myself:
Install Anaconda if you don't have python installed yet.
In cmd type pip install cloudscraper
Open Spyder (it comes along with Anaconda) and paste this:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://your-parse-target/").text)
Save it anywhere and hit the Run button to test. If it works, you'll see your data in the console window of the same app.
Replace print with whatever you're going to do with that data.
For my specific case I also had to install mysql-connector-python and enable remote access to the MySQL database (which my hosting had available for free all this time, huh?). So instead of directly verifying that the user owns the profile they entered, there is now a queue, which isn't perfect, but oh well, they'll have to wait.
First, the user's request is saved to MySQL. My local Python script checks that table every now and then to see if anything is in line to be verified, fetches the page's content, and saves it back to MySQL. Then the old PHP parser does its job like before, but reading from the MySQL fetch instead of the actual website, roughly as sketched below.
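For what it's worth, the PHP side of that queue looks roughly like this; the function name fetch_body_from_queue, the table fetched_pages and its columns are made up for illustration, your schema will differ:
function fetch_body_from_queue($fetch_link) {
    // read the page body the local Python script stored, instead of requesting the site directly
    $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare('SELECT html FROM fetched_pages WHERE url = ? ORDER BY id DESC LIMIT 1');
    $stmt->execute(array($fetch_link));
    $body = $stmt->fetchColumn();
    return ($body === false) ? false : $body; // false means the page is still waiting in the queue
}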
Perhaps there are better solutions that don't require resorting to a separate local parser, but maybe this will help someone running into a similar issue.
I need to retrieve and parse the text of public domain books, such as those found on gutenberg.org, with PHP.
To retrieve the content of most webpages I am able to use cURL requests to retrieve the HTML exactly as I would find it had I navigated to the URL in a browser.
Unfortunately on some pages, most importantly gutenberg.org pages, the websites display different content or send a redirect header.
For example, when attempting to load this target gutenberg.org page, a curl request gets redirected to this different but logically related gutenberg.org page. I am able to visit the target page with both cookies and JavaScript turned off in my browser.
Why is the curl request being redirected while a regular browser request to the same site is not?
Here is the code I use to retrieve the webpage:
$urlToScan = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";
if(!isset($userAgent)){
$userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";
}
$ch = curl_init();
$timeout = 15;
curl_setopt($ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch, CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
#curl_setopt($ch, CURLOPT_HEADER, 1); // return HTTP headers with response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_URL, $urlToScan);
$html = curl_exec($ch);
curl_close($ch);
if($html == null){
return false;
}
print $html;
The hint is probably in the URL: it says "welcome stranger". They are redirecting every first-time visitor to this page. Once you have visited the page, they will not redirect you anymore.
They don't seem to be saving a lot of stuff in your browser, but they do set a cookie with a session id. This is the most logical thing really: check whether there is a session.
What you need to do is connect with curl AND a cookie. You can use your browser's cookie for this, but in case it expires, you'd be better off doing the following:
request the page;
if the page is redirected, save the cookie (you now have a session);
request the page again with that cookie.
If all goes well, the second request will not be redirected, until the cookie/session expires and you start again. See the manual for how to work with cookies/cookie jars; a rough sketch follows.
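A minimal sketch of that approach with a cookie jar (the cookie file name is arbitrary; the URL is the one from the question):
$url = "http://www.gutenberg.org/cache/epub/34175/pg34175.txt";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, "gutenberg_cookies.txt");  // store the session cookie from the first request
curl_setopt($ch, CURLOPT_COOKIEFILE, "gutenberg_cookies.txt"); // send it back on later requests
curl_exec($ch);          // first request: gets redirected to the "welcome stranger" page and sets the cookie
$html = curl_exec($ch);  // second request on the same handle reuses the cookie and should reach the target
curl_close($ch);
print $html;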
The reason one could navigate to the target page in a browser without cookies or JavaScript, yet not with curl, was that the website was checking the referrer in the header. The page can be loaded without cookies by setting the appropriate Referer header:
curl_setopt($ch, CURLOPT_REFERER, "http://www.gutenberg.org/ebooks/34175?msg=welcome_stranger");
As pointed out by madshvero, the page can also, surprisingly, be loaded by simply excluding the user agent.
My question is: I'm using the following URL to send SMS to users from the user panel, and I'm using the file() function to execute it:
$url = "http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=$mobiil_no&sms=$msg&senderid=$sender_id";
$ret = file($url);
After executing this, when I print $ret, it gives me a status of true and generates a message id and sending id.
But the SMS is not getting delivered to the user.
When I execute the same URL in a browser, for example http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=98xxxxxx02&sms=Hi..&senderid=$sender_id, it gets delivered immediately.
Can anyone help me out? Thanks in advance.
It is possible that this SMS service needs to think a browser, and not a bot, is executing the request, or there is some "protection" we don't know about. Is there any documentation for this particular service? Is it intended to be used the way you're trying to use it?
You can try with CURL and see if the behaviour is still the same:
<?php
// create curl resource
$ch = curl_init();
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
// Fake real browser
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $ret contains the output string
$ret = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
Does it help?
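One more thing worth checking, purely a guess since I don't know this service's requirements: if $msg contains spaces or special characters, the query string needs to be URL-encoded, for example with http_build_query():
// build the URL so that the message and other parameters are properly URL-encoded
$params = array(
    'user'     => $uname,
    'password' => $pwd,
    'mobiles'  => $mobiil_no,
    'sms'      => $msg,
    'senderid' => $sender_id,
);
curl_setopt($ch, CURLOPT_URL, "http://sms.emefocus.com/sendsms.jsp?" . http_build_query($params));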
I have a partner that has created some content for me to scrape.
I can access the page with my browser, but when trying to use file_get_contents, I get a 403 Forbidden.
I've tried using stream_context_create, but that's not helping - it might be because I don't know what should go in there.
1) Is there any way for me to scrape the data?
2) If not, and if the partner is not allowed to configure the server to give me access, what can I do then?
The code I've tried using:
$opts = array(
'http'=>array(
'user_agent' => 'My company name',
'method'=>"GET",
'header'=> implode("\r\n", array(
'Content-type: text/plain;'
))
)
);
$context = stream_context_create($opts);
//Get header content
$_header = file_get_contents($partner_url,false, $context);
This is not a problem in your script; it's a feature of your partner's web server security.
It's hard to say exactly what is blocking you; most likely it's some sort of protection against scraping. If your partner has access to their web server's setup, that might help pinpoint it.
What you could do is "fake a web browser" by setting the User-Agent header so that your request imitates a standard web browser.
I would recommend cURL for this, and it will be easy to find good documentation on how to do it.
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
//set User Agent first
ini_set('user_agent','Mozilla/4.0 (compatible; MSIE 6.0)');
Also, if for some reason you're requesting an HTTP resource that actually lives on your own server, you can save yourself some trouble by just including the file via an absolute path.
Like: /home/sally/statusReport/myhtmlfile.html
instead of
https://example.org/myhtmlfile.html
I have two things in mind: if you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode(); and a URL can be used as a filename with this function if the fopen wrappers have been enabled.
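A small sketch combining those points (the URL and query value are placeholders):
// set the User-Agent used by file_get_contents() and the other fopen-wrapper functions
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0)');
// encode query values that may contain spaces or other special characters
$url  = 'https://example.org/search.php?q=' . urlencode('some words with spaces');
$html = file_get_contents($url); // works on a URL because the fopen wrappers are enabled
if ($html === false) {
    echo "Request failed";
}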