Check proxy using PHP

I'm writing a web app that requires a lot of proxies to work.
I have a list of proxies, but I don't know which of them work or what type they are (SOCKS, HTTP, HTTPS).
Let's assume I have 5000 proxies in ip:port format.
What is the fastest way to check all of them?
I tried fsockopen, but it is quite slow.
Would pinging them first save time?

<?php
$proxies = file("proxies.txt");
$mc = curl_multi_init();

// Create one easy handle per proxy and attach it to the multi handle.
for ($thread_no = 0; $thread_no < count($proxies); $thread_no++) {
    $c[$thread_no] = curl_init();
    curl_setopt($c[$thread_no], CURLOPT_URL, "http://google.com");
    curl_setopt($c[$thread_no], CURLOPT_HEADER, 0);
    curl_setopt($c[$thread_no], CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c[$thread_no], CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($c[$thread_no], CURLOPT_TIMEOUT, 10);
    curl_setopt($c[$thread_no], CURLOPT_PROXY, trim($proxies[$thread_no]));
    curl_setopt($c[$thread_no], CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // 0 is CURLPROXY_HTTP
    curl_multi_add_handle($mc, $c[$thread_no]);
}

do {
    while (($execrun = curl_multi_exec($mc, $running)) == CURLM_CALL_MULTI_PERFORM);
    if ($execrun != CURLM_OK) {
        break;
    }
    // Collect every transfer that has finished so far.
    while ($done = curl_multi_info_read($mc)) {
        $info = curl_getinfo($done['handle']);
        // http://google.com answers a plain GET with a 301 redirect, so a 301
        // received through the proxy means the proxy relayed the request.
        if ($info['http_code'] == 301) {
            echo trim($proxies[array_search($done['handle'], $c)]) . "\r\n";
        }
        curl_multi_remove_handle($mc, $done['handle']);
    }
    // Wait for activity instead of busy-looping.
    if ($running) {
        curl_multi_select($mc, 1);
    }
} while ($running);
curl_multi_close($mc);
?>
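One caveat with the code above: it adds all 5000 handles to the multi handle at once, which can exhaust your file-descriptor limit. A rough sketch of one way around that, feeding the multi handle in batches (the batch size of 100 is an arbitrary choice):

<?php
// Check proxies in fixed-size batches to cap the number of open connections.
$proxies = array_map('trim', file('proxies.txt'));
foreach (array_chunk($proxies, 100) as $chunk) {
    $mc = curl_multi_init();
    $handles = array();
    foreach ($chunk as $proxy) {
        $ch = curl_init('http://google.com');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_multi_add_handle($mc, $ch);
        $handles[$proxy] = $ch;
    }
    do {
        curl_multi_exec($mc, $running);
        if ($running) {
            curl_multi_select($mc, 1); // wait instead of spinning
        }
    } while ($running);
    foreach ($handles as $proxy => $ch) {
        // Same success criterion as above: a 301 means the proxy answered.
        if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 301) {
            echo $proxy . "\r\n";
        }
        curl_multi_remove_handle($mc, $ch);
        curl_close($ch);
    }
    curl_multi_close($mc);
}
?>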

You could use cURL to check the proxies. A good article on this is given here.
Hope it helps.

The port usually gives you a good clue about the proxy type (a rough port-based guess is sketched below):
80, 8080 and 3128 are typically HTTP
1080 is typically SOCKS
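For illustration, a minimal PHP sketch of that heuristic; the port-to-type map is only the guess above, and any proxy still has to be probed to confirm what protocol it actually speaks:

<?php
// Guess a proxy's likely type from its port. A heuristic only.
function guessProxyType($proxy) {
    $port = (int) substr(strrchr($proxy, ':'), 1);
    $map = array(80 => 'http', 8080 => 'http', 3128 => 'http', 1080 => 'socks');
    return isset($map[$port]) ? $map[$port] : 'unknown';
}

echo guessProxyType('1.2.3.4:3128'); // http
echo guessProxyType('1.2.3.4:1080'); // socks
?>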
But let's be realistic: you seem to have a list of public proxies, and it's quite likely that none of them work anymore.
You can use curl, wget, lynx or similar in a script to test the proxies.
You can also try to sort your list into SOCKS and HTTP as well as you can and enter it into the Proxycollective.
That's a free project, but you need an invitation code or a 99-cent ticket to become a member.
Once you are a member you can upload your proxy lists and they will be tested; all working ones will be given back to you, sortable.
So if you don't want to program something of your own, that's maybe your best bet; invitation codes can sometimes be found in various forums.
But keep in mind what I said: with a list of 5000 random public proxies, I bet you'll hardly find more than 10 that still work. Public proxies only live a short time.

This proxy checker API could be exactly what you're looking for; you can easily check a proxy list with it.
If you want to develop it yourself, it is not difficult to write a small script that does the same thing as that API.
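For instance, a minimal sketch of such a check: fetch a plain-text IP-echo service through the proxy (api.ipify.org is used here as an example endpoint) and compare the answer with your own IP:

<?php
// Minimal single-proxy check against an IP-echo service.
function checkProxy($proxy, $myIp, $timeout = 10) {
    $ch = curl_init('http://api.ipify.org');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $ip = curl_exec($ch);
    curl_close($ch);
    if ($ip === false) {
        return 'dead';         // connection failed or timed out
    }
    if (trim($ip) === $myIp) {
        return 'not proxying'; // the request never went through the proxy
    }
    return 'working';          // a different IP answered, so the proxy relays traffic
}

$myIp = trim(file_get_contents('http://api.ipify.org'));
echo checkProxy('1.2.3.4:8080', $myIp); // hypothetical proxy address
?>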

Here is the code I use. You can modify it to meet your requirements:
function _check($url, $usecookie = false, $sock = "", $ref = "") {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0); // 0 or 2; 1 is no longer a valid value
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // The timeout comes from a form field in the original script.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, isset($_POST['timeoutpp']) ? (int) $_POST['timeoutpp'] : 10);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");
    if ($sock) {
        curl_setopt($ch, CURLOPT_PROXY, $sock);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    }
    if ($usecookie) {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $usecookie);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $usecookie);
    }
    if ($ref) {
        curl_setopt($ch, CURLOPT_REFERER, $ref);
    }
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
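Usage might look like this (the URL and the SOCKS endpoint are placeholders):

$html = _check('http://www.google.com', false, '127.0.0.1:1080');
if ($html !== false) {
    echo "Proxy works\n";
} else {
    echo "Proxy failed\n";
}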

There is a PHP tool on GitHub that can scan a proxy list with multiple threads. It will not be difficult to integrate it into your own project. The tool also analyzes the security level of the proxies it scans. Yes, I wrote it. :)
https://github.com/enseitankado/proxy-profiler

Related

Proxymillion IPs with cURL

I am using Proxymillion to scrape data from Google. I am using cURL but I am not getting the result, just the error Error 405 (Method Not Allowed)!!1
My code:
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number (note: cURL expects user:pass@host:port)
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
if (isset($proxies)) { // If the $proxies array contains items, then
    $proxy = $proxies[array_rand($proxies)]; // Select a random proxy from the array and assign to $proxy variable
}
$ch = curl_init();
if (isset($proxy)) { // If the $proxy variable is set, then
    curl_setopt($ch, CURLOPT_PROXY, $proxy); // Set CURLOPT_PROXY with proxy in $proxy variable
}
$url = "https://www.google.com.pk/?gws_rd=cr,ssl&ei=8kXQWNChIsTSvgSZ3J24DA#q=pakistan&*";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
// curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PUT"); // forces a PUT request; Google's search page only accepts GET, hence the 405
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$page = curl_exec($ch);
curl_close($ch);
$dom = new simple_html_dom();
$html = $dom->load($page);
$title = $html->find("title", 0);
echo $title->innertext;
If I guess right, you are looking for a budget solution for scraping Google; is that why you switched to the Proxymillion provider in the sample code you linked in the comments?
You cannot scrape with massively shared proxies (which is what that provider sells): Google will spot them either directly or within a few pages and block them.
Also, using "&ei=8kXQWNChIsTSvgSZ3J24DA" is not the best idea; that's not a default entry into Google and will probably link your scrape request to your browser (where that parameter originally came from).
If you are looking for a budget solution, you can consider using a scraping service (PHP source code here: http://scraping.services/?api&chapter=Source%20Code ); that's cheaper than private proxies in most cases and lets you scrape tens of thousands of keywords for a few USD.
Alternatively, if you want to continue down that route, I would suggest testing your Proxymillion performance with a simple script.
Use curl or lynx in a bash script (if you use Linux; on Windows you can do the same with MinGW/MSYS) and just make them access Google through the proxies. See if it works at all, or if you get blocked within a few pages. A rough PHP equivalent is sketched below.
But even if you succeed: any shared proxy provider will be unreliable in performance.
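Such a test script, as a sketch in PHP (it assumes a proxies.txt file in ip:port format):

<?php
// Request a Google search through each proxy and report the HTTP status.
// 200 means the proxy got through; 403/429 or a CAPTCHA page means Google blocked it.
foreach (array_map('trim', file('proxies.txt')) as $proxy) {
    $ch = curl_init('https://www.google.com/search?q=test');
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_exec($ch);
    echo $proxy . ' => ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
    curl_close($ch);
}
?>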

Bulk link checker in php

I want to check the links in my database, to see whether each link (through possible redirections) is still valid (e.g. status 200). The script below is what I currently use. The limitation is that beyond roughly 400 links, the server gives me a 500 - internal error. Unfortunately, I cannot review the server's logs for the reason; my assumption is that it's a timeout issue.
How can I make this script scalable, so that it will allow me to check more than the current ~400 links?
function urlValidator($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode != '200') {
        echo $url . " - " . $httpcode;
    }
}

// creation of $url_array
foreach ($url_array as $url) {
    if (!is_null($url)) {
        urlValidator($url);
    }
}
I did try adding flush() and/or ob_flush() to the code, but it didn't help (or I implemented it wrongly).
Any suggestions are more than welcome.
The default execution time of a PHP script is 30 seconds; after that it will time out.
You can increase this limit, for example:
ini_set('max_execution_time', 600); // 10 minutes
But to make it really scalable, I would store the current "link-check" status in a database, so that you can continue where you left off and have multiple instances call your script.
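A sketch of that approach, assuming a links table with id, url, last_status and checked_at columns (the table layout and credentials are made up for illustration):

<?php
// Each run checks one batch of unchecked links and records the result,
// so the script can be re-run, or run in parallel, until all links are done.
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$batch = $db->query(
    "SELECT id, url FROM links WHERE checked_at IS NULL LIMIT 100"
)->fetchAll(PDO::FETCH_ASSOC);

$update = $db->prepare(
    "UPDATE links SET last_status = ?, checked_at = NOW() WHERE id = ?"
);

foreach ($batch as $row) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_NOBODY, true);      // HEAD-style check, no body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $update->execute(array(curl_getinfo($ch, CURLINFO_HTTP_CODE), $row['id']));
    curl_close($ch);
}
?>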

How to use PHP CURL to bypass cross domain

I need PHP to submit parameters from one domain to another. JavaScript is not an option in my situation. I'm now trying to use cURL with PHP, but have not been successful in getting around the cross-domain restriction.
From domain_A, I have a page with the following PHP with CURL script:
<?php
// _iscurl() is presumably a user helper along the lines of:
// function _iscurl() { return function_exists('curl_init'); }
if (_iscurl()) {
    echo "<p>CURL is enabled</p>";
    $url = "http://domain_B/process.php?id=123&amt=100&jsonp=?";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $return = curl_exec($ch);
    curl_close($ch);
    echo "<p>Finished operations</p>";
} else {
    echo "CURL is disabled";
}
?>
I am not getting any results, so I assume the PHP cURL script is not succeeding. Any ideas on how to fix this?
Thanks
Well, it's a bit late, but I'm adding this answer for future readers who might face a similar issue. This issue sometimes arises when a PHP cURL request is sent from a domain hosted over HTTP to a domain hosted over HTTPS (HTTP over SSL).
Just add the snippet below before executing the request.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
Passing false for CURLOPT_RETURNTRANSFER means curl_exec() will not return the response. Make it true (or 1):
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

Why is CodeIgniter's Curl library slower than using Curl in plain PHP?

Recently I moved my cURL scraping code to CodeIgniter. I'm using the Curl CI library from http://philsturgeon.co.uk/code/codeigniter-curl. I put the scraping process in a controller, and then I found that it executes more slowly than the version I built in plain PHP.
It takes 12 seconds for CodeIgniter to output the result, whereas it takes only 6 seconds in plain PHP. Both include the parsing with the HTML DOM parser.
Here's my Curl code in CodeIgniter:
function curl($url, $postdata = false)
{
    $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
    $this->curl->create($url);
    $this->curl->ssl(false);
    $options = array(
        'URL'            => $url,
        'HEADER'         => 0,
        'AUTOREFERER'    => true,
        'FOLLOWLOCATION' => true,
        'TIMEOUT'        => 60,
        'RETURNTRANSFER' => 1,
        'USERAGENT'      => $agent,
        'COOKIEJAR'      => dirname(__FILE__) . "/cookie.txt",
        'COOKIEFILE'     => dirname(__FILE__) . "/cookie.txt",
    );
    if ($postdata) {
        $this->curl->post($postdata, $options);
    } else {
        $this->curl->options($options);
    }
    return $this->curl->execute();
}
The non-CodeIgniter (plain PHP) code:
function curl($url, $binary = false, $post = false, $cookie = false)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_REFERER, $url);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    if ($cookie) {
        $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
        curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
        curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
    }
    if ($binary) {
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    }
    if ($post) {
        $post_array_string1 = ''; // was previously used uninitialized
        foreach ($post as $key => $value) {
            $post_array_string1 .= $key . '=' . $value . '&';
        }
        $post_array_string1 = rtrim($post_array_string1, '&');
        // set the url, number of POST vars, POST data
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_array_string1);
    }
    return curl_exec($ch);
}
Does anyone know why the CodeIgniter cURL version is slower? Or could it be the simple_html_dom parser?
I'm not sure I know the exact answer to this, but I have a few observations about cURL and CI, as I use them extensively.
Check the state of DNS caches/queries.
I noticed a substantial speedup when code was uploaded to a hosted staging server from my dev desktop. It was traced to a DNS issue that was solved by rebooting a bastion host... You can sometimes check this by using IP addresses instead of hostnames.
Phil's 'library' is really just a wrapper.
All he's really done is map CI-style functions to the PHP cURL library. There's almost nothing else going on. I spent some time poking around (I forget why) and it was really unremarkable. That said, there may well be some general CI overhead - you might see what happens in another similar framework (Fuel, Kohana, Laravel, etc).
Check your reverse lookup.
Some APIs do reverse DNS checks as part of their security scanning. Sometimes hostnames or other headers are badly set in buried configs and can cause real headaches.
Use Chrome's Postman extension to debug REST APIs.
No comment, it's brilliant - https://github.com/a85/POSTMan-Chrome-Extension/wiki - and you get fine-grained control of the 'conversation'.
I would have to know more about the CI library and whether it does any extra work on the gathered data, but I would try naming your method something other than the library name. I have had issues where, with the Facebook library, calling it from a method named facebook caused problems: $this->curl could be ambiguous between the library and the method.
Also, try adding the debug profiler and see what it comes up with. Add this either in the constructor or the method:
$this->output->enable_profiler(TRUE);

Using CURL with Google

I want to use cURL on Google to see how many results it returns for a certain search.
I've tried this:
$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
$response = curl_exec($ch);
curl_close($ch);
But it just returns a 405 Method Not Allowed error from Google.
Any ideas?
Thanks
Use a GET request instead of a POST request. That is, get rid of
curl_setopt($ch, CURLOPT_POST, true);
Or even better, use their well defined search API instead of screen-scraping.
Scraping Google is a very easy thing to do. However, if you don't need more than the first 30 results, the search API is preferable (as others have suggested). Otherwise, here's some sample code. I've ripped this out of a couple of classes I'm using, so it might not be totally functional as is, but you should get the idea.
function queryToUrl($query, $start = null, $perPage = 100, $country = "US") {
    return "http://www.google.com/search?" . $this->_helpers->url->buildQuery(array(
        // Query
        "q" => urlencode($query),
        // Country (geolocation presumably)
        "gl" => $country,
        // Start offset
        "start" => $start,
        // Number of results per page
        "num" => $perPage
    ), true);
}

// Find the first 100 results for "pizza" in Canada
$ch = curl_init(queryToUrl("pizza", 0, 100, "CA"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $this->getUserAgent(/*$proxyIp*/));
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
$response = curl_exec($ch);
Note: $this->_helpers->url->buildQuery() is identical to http_build_query() except that it drops empty parameters.
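If you want the snippet above to be self-contained, a plain-function stand-in for that helper could look like this (a sketch only; the original class-based helper isn't shown in the answer, and its second argument is ignored here):

// Drop empty parameters, then build the query string.
function buildQuery(array $params) {
    $params = array_filter($params, function ($v) {
        return $v !== null && $v !== '';
    });
    return http_build_query($params);
}

echo buildQuery(array('q' => 'pizza', 'gl' => 'CA', 'start' => null, 'num' => 100));
// => q=pizza&gl=CA&num=100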
Use the Google Ajax API.
http://code.google.com/apis/ajaxsearch/
See this thread for how to get the number of results. While it refers to C# libraries, it might give you some pointers.
Before scraping data, please read https://support.google.com/websearch/answer/86640?rd=1
It is against Google's terms. Automated traffic includes:
Sending searches from a robot, computer program, automated service, or search scraper
Using software that sends searches to Google to see how a website or webpage ranks on Google
If you build your options as an array, set the method explicitly depending on whether you are posting data:
CURLOPT_CUSTOMREQUEST => ($post) ? "POST" : "GET"
