Trouble using PHP Curl + Proxies to Query Google - php

I'm having trouble with Google blocking my IPs when I query it for content matches. I've got 300 private IPs, and a desktop app that performs a similar function connects to Google with the same IPs without any trouble. Yet when I crank it up on my server using cURL, my IPs get temporarily blocked with "your machine may be sending automated queries, please try again in 30 secs". So there must be a footprint somewhere.
Anyhow, here's my code:
function file_get_contents_curl($url, $use_proxy = true) {
    global $proxies;
    App::import('Vendor', 'proxies');

    $proxies = $this->shuffle_assoc($proxies);
    $proxy_ip = $proxies[array_rand($proxies, 1)]; // pick a random proxy IP
    $proxy = $proxy_ip . ':60099';
    $loginpassw = 'myusername:mypassword'; // proxy login and password here

    $ch = curl_init();
    if ($use_proxy) {
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        //curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // constant, not the string 'HTTP'
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)');
    }
    curl_setopt($ch, CURLOPT_HEADER, 1);
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Return the data instead of printing it to the browser.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
And I've verified that the IP is being set and that I'm connecting through the proxy.
Anyone got any ideas?

Tried with SOCKS5 but no difference. The trouble with the Google API is that you only get 100 queries per day.

Both HTTP proxies and SOCKS proxies can be used; there is no difference when scraping Google results.
There are multiple possible reasons why you get detected:
1) Your proxies are of bad quality or shared (maybe without your knowledge)
2) Your proxies are in only one or two subnets / too sequential
3) You query Google too fast or too often
You should not query Google more than about 20 times per hour per IP; that's just a rough value that works and doesn't get punished by the search engine.
So you should implement a delay based on your proxy count.
But if reason 1) or 2) applies, then even that won't help; you'll need another IP solution.
Check out the Google rank scraper (http://google-rank-checker.squabbel.com/); it's a free PHP project for scraping Google and includes proper delay routines you could use in your own code.
The caching functions might also prove useful, as you don't want to query Google more often than required.
And not to forget:
If you get detected, make your script STOP automatically!
Carrying on just causes trouble; detection means you did something wrong.
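The delay suggestion above can be sketched as a small helper. This is an illustrative sketch, not code from the post; the function name and the 20-requests-per-hour default are assumptions based on the rough value given above.

```php
<?php
// Illustrative sketch: spread requests so no single proxy IP exceeds a
// per-hour budget. The 20/hour default mirrors the rough value above;
// the function name is hypothetical.
function delay_between_requests(int $proxyCount, int $perIpPerHour = 20): float
{
    // Requests the whole pool may make per hour, assuming even rotation...
    $poolPerHour = $proxyCount * $perIpPerHour;
    // ...turned into a minimum pause between consecutive requests.
    return 3600.0 / $poolPerHour;
}

// With the asker's 300 proxies: 3600 / (300 * 20) = 0.6 s between requests.
// usleep((int) (delay_between_requests(300) * 1e6));
```

With a single proxy this works out to a 3-minute pause per request, which matches the 20-per-hour ceiling.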

HTTP proxies don't guarantee your privacy. You may try using SOCKS.
But you'd be better off using the Google API instead.

Related

how to search data from other website using curl

Hi, how can I search data from another website using cURL and PHP? I want to search for an IMEI number on this website: https://www.example.com/xxx
This is what I have tried so far:
$imei = '013887009861498';
$cookie_file_path = "cookies/cookiejar.txt";
// Create/truncate the cookie file so cURL can write to it.
$fp = fopen($cookie_file_path, "w") or die("<BR><B>Unable to open cookie file $cookie_file_path for write!</B><BR>");
fclose($fp);
$url = "https://example.com/xxx";
$agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
// Note: most forms expect "name=value" pairs here, not a bare value.
curl_setopt($ch, CURLOPT_POSTFIELDS, $imei);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
(This is not a full answer, but it's too long for a comment; I won't work out all the small details for you.)
There are several different problems here. The first is how to do a POST request with PHP/cURL, of which you can find an example here.
Another problem is how to parse HTML in PHP; there are several options listed here. (I highly recommend the DOMDocument & DOMXPath combo.)
Another problem is how to get past CAPTCHA challenges in PHP. One solution is to use the deathbycaptcha API (which is a paid service, by the way); you can find an example of that here.
Another problem is that they're using three different CSRF-like tokens, called __VIEWSTATE, __EVENTVALIDATION, and hdnCaptchaInstance, all of which must be parsed out and submitted with the CAPTCHA answer. You also need to handle cookies, as the CSRF tokens and the CAPTCHA are tied to your cookie session (luckily, you can let cURL handle cookies automatically with CURLOPT_COOKIEFILE).
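The token-scraping step can be sketched with the DOMDocument & DOMXPath combo recommended above. The helper name and the sample HTML are made up for illustration; only the field names come from the answer.

```php
<?php
// Sketch using DOMDocument/DOMXPath to pull the hidden fields
// (__VIEWSTATE etc.) out of a fetched page. The helper name and the
// sample HTML below are made up for illustration.
function extract_hidden_fields(string $html, array $names): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // silence warnings from real-world HTML
    $xpath = new DOMXPath($doc);
    $out = [];
    foreach ($names as $name) {
        $node = $xpath->query("//input[@name='$name']")->item(0);
        // Missing field -> null, so the caller can detect layout changes.
        $out[$name] = $node ? $node->getAttribute('value') : null;
    }
    return $out;
}

// Example with a made-up form; real pages carry long opaque values.
$html = '<form><input type="hidden" name="__VIEWSTATE" value="abc"/>'
      . '<input type="hidden" name="__EVENTVALIDATION" value="def"/></form>';
$fields = extract_hidden_fields($html, ['__VIEWSTATE', '__EVENTVALIDATION']);
```

The extracted values would then be added to the POST body alongside the CAPTCHA answer.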

An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security

I have an app that uses cURL to scrape some elements of sites.
I've started receiving some errors that look like this:
"Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security."
Have you ever seen this?
If so, how can I get around it?
I checked two sites that do the same thing I do, and everything worked fine.
Regarding the cURL, this is what I use:
public function cURL_scraping($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect:'));
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    $response['str'] = curl_exec($curl);
    // Despite the key name, this stores the HTTP status code, not headers.
    $response['header'] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);
    return $response;
}
Well, I found the reason. I removed the user agent and it works. I guess the server was blocking that specific user agent.
It looks like the site you are scraping has set up detection and blocking of scraping. To check this, you can try to fetch the page from the same IP and/or with exactly the same headers as a normal browser.
If that is the case, you really should respect the site owner's wish not to be scraped. You could ask them, or experiment to find what level of scraping is acceptable to them. Did you read their robots.txt?
The block usually has a timeout, but it might be permanent. In that case you probably need to change IP address to try again.
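Since the accepted fix was tied to a specific User-Agent, one defensive pattern is to keep the request options inspectable and retry with a different agent when the server answers 406. build_curl_options() is a hypothetical helper, not part of the original code.

```php
<?php
// Hypothetical helper: build the option set for one attempt so it can be
// inspected without a network call. Passing null omits the User-Agent,
// which is what actually resolved the asker's 406.
function build_curl_options(string $url, ?string $userAgent): array
{
    $opts = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
    ];
    if ($userAgent !== null) {
        $opts[CURLOPT_USERAGENT] = $userAgent;
    }
    return $opts;
}

/*
// Sketch of the retry loop (network code, shown commented out):
foreach ([null, 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'] as $ua) {
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $ua));
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($code !== 406) {
        break; // Mod_Security accepted this User-Agent (or its absence)
    }
}
*/
```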
I got the same error and, just playing around, found an answer.
If you understand some basic Python, it will be easy to port the relevant code to the language you are working with.
I just added a header like this:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
And this works!

cURL retrieve only URL address

Using PHP and cURL, I'd like to check whether I can log in to a website with the provided user credentials. For that, I'm currently retrieving the entire page and then using a regex to filter for keywords that might indicate the login didn't work.
The URL itself contains the string "errormessage" if a wrong username/password has been entered. Is it possible to use cURL to get only the URL address, without the contents, to speed things up?
Here's my curl PHP code:
function curl_get_request($referer, $submit_url, $ch)
{
    global $cookie_path;
    // Sends a request via cURL with the settings below.
    $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
    curl_setopt($ch, CURLOPT_URL, $submit_url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_path);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_path);
    return curl_exec($ch);
}
Also, if somebody has a better idea on how to handle a problem like this, please let me know!
What you should do is check the URL each time there is a redirect. Most redirects are done with proper HTTP headers. If that is the case, see this answer:
PHP: cURL and keep track of all redirections
Basically, turn off automatic redirect following and check the HTTP status code for 301 or 302. If you get one of those, you can continue to follow the redirection if needed, or exit from there.
If instead, the redirection is happening client side, you will have to parse the page with a DOM parser.
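The manual-redirect approach described above can be sketched as follows. The login_failed_url() helper is hypothetical; it just encodes the asker's observation that failed logins land on a URL containing "errormessage".

```php
<?php
// Hypothetical check encoding the asker's observation: failed logins
// redirect to a URL containing "errormessage".
function login_failed_url(string $url): bool
{
    return stripos($url, 'errormessage') !== false;
}

/*
// Network part of the sketch (shown commented out): a body-less request
// with automatic redirects disabled, inspecting the Location target ourselves.
$ch = curl_init($submit_url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // skip the response body
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // inspect redirects manually
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
if (in_array(curl_getinfo($ch, CURLINFO_HTTP_CODE), [301, 302], true)) {
    $failed = login_failed_url(curl_getinfo($ch, CURLINFO_REDIRECT_URL));
}
curl_close($ch);
*/
```

Because CURLOPT_NOBODY skips the body entirely, this is considerably faster than downloading the page and running a regex over it.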

I have a cURL call that works on localhost but not on my live website, and I cannot figure out why

I have this link, http://www.bata.com.sg (this website actually exists), that works in my cURL code, which checks if a page exists.
It works in my localhost code, but it keeps failing on my live website.
I have tested with other domains like http://www.yahoo.com.sg; those work all the time on my localhost AND my live website.
I copied this code http://w-shadow.com/blog/2007/08/02/how-to-check-if-page-exists-with-curl/ word for word.
I don't understand why it fails with this particular URL.
My website is hosted with Site5.
I noticed that I keep getting false (boolean) from this line:
curl_exec($ch);
and curl_error() gives me: Couldn't resolve host 'www.bata.com.sg'
Please advise.
You need to talk to customer support at Site5 to figure out why their server cannot resolve www.bata.com.sg.
Until you get an answer from them, try the following code.
Key points:
It connects to the IP address www.bata.com.sg resolves to: 194.228.50.32
It then sends a Host: www.bata.com.sg header
In essence, it works the same way cURL would if it could resolve the address.
<?php
// this is the IP address that www.bata.com.sg resolves to
$server = '194.228.50.32';
$host = 'www.bata.com.sg';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $server);
/* set the user agent - might help, doesn't hurt */
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
/* try to follow redirects */
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
/* timeout after the specified number of seconds. assuming that this script runs
on a server, 20 seconds should be plenty of time to verify a valid URL. */
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
$headers = array();
$headers[] = "Host: $host";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_VERBOSE, true);
/* don't download the page, just the header (much faster in this case) */
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response = curl_exec($ch);
curl_close($ch);
var_dump($response);
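As a sketch of an alternative under the same assumption (the IP from the answer above is still valid): libcurl 7.21.3+ (PHP 5.5+) exposes CURLOPT_RESOLVE, which pins a hostname to an IP, so the URL, Host header, and any TLS handling stay normal instead of being hand-built.

```php
<?php
// Pin www.bata.com.sg to the IP from the answer, bypassing the broken DNS.
// The IP is taken from the answer above and may no longer be valid.
$ch = curl_init('http://www.bata.com.sg/');
curl_setopt($ch, CURLOPT_RESOLVE, ['www.bata.com.sg:80:194.228.50.32']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);      // headers only, like the original
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // fail fast if the IP is dead
$ok = curl_exec($ch);
curl_close($ch);
```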
I have figured out the reason.
Using this website, http://www.intodns.com/bata.com.sg, I was told that there is an issue with the nameservers of bata.com.sg.
Anyway, the answers above have been useful as well; I have learned something from them.
This might be a firewall issue; sometimes the hosting company restricts what the webserver can connect to.
Also, can you make sure cURL is present in phpinfo()? I think you didn't mention any error, so.
You could also give this a try:
file_get_contents('http://www.yahoo.com.sg');

How do I CURL www.google.com - it keeps redirecting me to .co.uk

I am using cURL to check for the existence of a URL (HEAD request), but when I test it with www.google.com it redirects me to www.google.co.uk, probably because my server is UK-based.
Is there a way to stop this from happening? I don't want to remove the CURLOPT_FOLLOWLOCATION option, as it is useful for 301 redirects etc.
Part of my code is below;
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($ch, CURLOPT_TIMEOUT, 4);
$output = curl_exec($ch);
// get data
$data = curl_getinfo($ch);
$data['url'] contains www.google.co.uk when I set $url to www.google.com.
You need to use cURL with a cookie that simulates the corresponding behavior in a browser.
When you visit google.com from England, it redirects you to google.co.uk; however, there is a link on that page titled "go to google.com" that lets you go back to google.com and stay there. It uses a cookie to remember your site preference.
For example, here are the cookies that I have after doing this (using Firefox):
Try accessing www.google.com/ncr ("no country redirect"); it'll avoid the redirect to the .co.uk (or any other national) page.
Another option is simply to use encrypted.google.com. That won't redirect.
A bit of a hack, but how about using an IP address? http://216.239.59.147/ http://66.102.7.104/
You could use www.google.co.uk directly; there is no difference. google.com/.net always redirects based on your location, but if you use a country TLD like .co.uk it will not redirect.
There is no way (known to me) to prevent the redirect when using .com or .net.
One way to stop Google from deciding what country you are in is to use a different IP address. Just get one of the many US proxy servers from the web and do something like this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_PROXY, "8.12.33.159");
curl_setopt($ch, CURLOPT_PROXYPORT, "80");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3");
curl_setopt($ch, CURLOPT_URL, $URI);
$results = curl_exec($ch);
curl_close($ch);
This way, Google will think you come from a US IP address and will not redirect you to a local Google.
You should turn off follow-location in cURL (set it to false) and you won't be redirected any more:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
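Combining the answers above: with FOLLOWLOCATION off, a 3xx whose target is just a country-specific variant of the same site (google.com to google.co.uk) can be treated as "URL exists" rather than followed. The two helpers below are hypothetical sketches of that check.

```php
<?php
// Hypothetical helper: drop a leading "www." and keep the first remaining
// host label, e.g. www.google.co.uk -> "google". Crude, but enough here.
function registrable_label(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $labels = explode('.', preg_replace('/^www\./i', '', $host));
    return $labels[0] !== '' ? $labels[0] : null;
}

// True when a redirect target is just a country-specific variant
// of the host that was originally requested.
function is_country_redirect(string $requested, string $target): bool
{
    $a = registrable_label($requested);
    return $a !== null && $a === registrable_label($target);
}
```

With cURL, the redirect target to feed into this check can be read via curl_getinfo($ch, CURLINFO_REDIRECT_URL) after a body-less request with FOLLOWLOCATION disabled.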
