Proxymillion IPs with cURL - php

I am using Proxymillion proxies to scrape data from Google with cURL, but instead of a result I get the error: Error 405 (Method Not Allowed)!!1
My code:
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
$proxies[] = 'username:password#IP:port'; // Some proxies require user, password, IP and port number
if (isset($proxies)) { // If the $proxies array contains items, then
$proxy = $proxies[array_rand($proxies)]; // Select a random proxy from the array and assign to $proxy variable
}
$ch = curl_init();
if (isset($proxy)) { // If the $proxy variable is set, then
curl_setopt($ch, CURLOPT_PROXY, $proxy); // Set CURLOPT_PROXY with proxy in $proxy variable
}
$url="https://www.google.com.pk/?gws_rd=cr,ssl&ei=8kXQWNChIsTSvgSZ3J24DA#q=pakistan&*";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
// curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "PUT");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$page = curl_exec($ch);
curl_close($ch);
$dom = new simple_html_dom();
$html = $dom->load($page);
$title=$html->find("title",0);
echo $title->innertext;

If I guess right, you are looking for a budget solution for scraping Google, and that's why you swapped the Proxymillion provider into the sample code you linked in the comments?
You cannot scrape with massively shared proxies (which is what that provider offers); Google will spot them either immediately or within a few pages and block them.
Also, using "&ei=8kXQWNChIsTSvgSZ3J24DA" is not the best idea: that's not a default entry point into Google and will probably link your scrape request with your browser (which is where that parameter originally came from).
If you are looking for a budget solution, consider a scraping service (PHP source code here: http://scraping.services/?api&chapter=Source%20Code ); in most cases that's cheaper than private proxies and lets you scrape tens of thousands of keywords for a few USD.
Alternatively, if you want to continue down that route, I would suggest testing your Proxymillion proxies with a simple bash script.
Use curl or lynx in a bash script (if you use Linux; on Windows you can do the same with MinGW/msys) and just make them access Google through the proxies. See whether it works at all or whether you get blocked within a few pages.
But even if you succeed, any shared proxy provider will be unreliable in terms of performance.
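The proxy test suggested above can also be done from PHP. The following is only a sketch, not the answerer's code: splitProxy() and fetchViaProxy() are hypothetical helper names, and it assumes the proxies are written as user:password@IP:port. It sends a plain GET (no CURLOPT_POST, no CURLOPT_CUSTOMREQUEST), since forcing PUT onto a search URL is a likely source of the 405 response.

```php
<?php
// Hypothetical helper: split "user:pass@host:port" into the parts cURL expects.
// Assumption: credentials, if present, come before the last "@".
function splitProxy(string $proxy): array
{
    $at = strrpos($proxy, '@');
    if ($at === false) {
        return ['server' => $proxy, 'auth' => null];
    }
    return [
        'server' => substr($proxy, $at + 1),
        'auth'   => substr($proxy, 0, $at),
    ];
}

// Hypothetical helper: fetch $url with a plain GET through $proxy.
// Returns [HTTP status code, response body or false on failure].
function fetchViaProxy(string $url, string $proxy): array
{
    $parts = splitProxy($proxy);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    curl_setopt($ch, CURLOPT_PROXY, $parts['server']);
    if ($parts['auth'] !== null) {
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, $parts['auth']);
    }
    // Neither CURLOPT_POST nor CURLOPT_CUSTOMREQUEST is set, so cURL sends a GET.
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return [$status, $body];
}
```

Calling fetchViaProxy() in a loop over the proxy list and printing the status code for each proxy should show fairly quickly whether any of them get through or whether they are blocked after a few requests.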

Related

Log into website using curl failing

I am trying to log in to a remote site using curl (before doing some data scraping).
Using the following code I am producing a cookies.txt file that has the following:
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_www.xxx.com FALSE / TRUE 0 xxxv5 h_r4hXtn-gNAilZwhvHjYdE3Vr4HewhxtGrxja57LbW03-M9MLNqZSeiW7lQ2wRT9lZypNsAiX0gS0Ev1PrvNkGLmwL3B8ZmyOUMLYbTYbSW0y_aPGrIFlEp4skDzh0GJGIGtFHisCmQjEMlu0CJr0UEw2rCT9jbjzg0IyOnFYxNffaMPo229NZWV7HDfCK5M1_y6MPNvW_Kt-h4qTy8YmqGbfBwKxB-bulV78MSXU9ZWz_DVvdu6jXfPiHwCBDMV8FFBLaXm5rqYgNzvbsq8JLe1xkTPn1PNJhyizUa-hlwB6ev8HNwIwBpzs7406l6mL3VgyrDJpay6bHNoMtjh4fLwI7KapFANhFHfn57mg4
#HttpOnly_www.xxx.com FALSE / TRUE 0 ASP.NET_SessionId txakhdi15oeqxyfq53f44dts
When I log in to the website manually, the cookie names are the same, so I think the login request is being made (otherwise the cookies would not be created). But when I output
echo 'HELLO html1 = '.$html1;
I see the page telling me I have entered the wrong username and password.
Code as follows:
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
$username = 'xxx';
$password = 'xxx';
// echo 'STARTING';
//login form action url
$url="https://www.xxxx.com/Login";
$postinfo = "username=".$username."&password=".$password;
$cookie_file_path = "cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
//set the cookie the site has for certain features, this is optional
curl_setopt($ch, CURLOPT_COOKIE, "cookiename=0");
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS,5); // return into a variable
// curl_setopt($ch, CURLOPT_UPLOAD, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST" );
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
// set content length
$headers[] = 'Content-length: 0';
$headers[] = 'Transfer-Encoding: chunked';
curl_setopt($ch, CURLOPT_HTTPHEADER , $headers);
$html1 = curl_exec($ch);
echo 'HELLO html1 = '.$html1;
I cannot show the site for security reasons. ( which may be a killer)
Can anyone point me in the right direction?
first off, this won't work: ini_set('display_startup_errors', 1);
- the startup phase is already finished before the userland php code starts to run,
so this setting is set too late. it must be set in the php.ini config file. (not strictly true, but close enough, like on windows you can do crazy registry hacks to enable it, and you can set it with .user.ini files, etc, more info here http://php.net/manual/en/configuration.php )
second, obvious error here is that you don't urlencode $username and $password in $postinfo = "username=".$username."&password=".$password; -
if the username OR password contains any characters with special meanings in urlencoded format, you'll send the wrong credentials and won't get logged in (this includes &,=,#, spaces, and many other characters). fixed version would look like $postinfo = "username=".urlencode($username)."&password=".urlencode($password);
third, don't use CURLOPT_CUSTOMREQUEST for POST requests,
just use CURLOPT_POST.
fourth, your Content-length header is outright lying. the
correct length is actually 'Content-length: '.strlen($postinfo) - which with your code, is definitely not 0 -
but you shouldn't set this header at all, curl will do it for you
if you don't, and unlike you, curl won't mess up the code calculating
the size, so get rid of the entire line.
fifth, this code is also wrong:
$headers[] = 'Transfer-Encoding: chunked';
your curl code here is NOT using chunked transfers,
and if it were, curl would send that header automatically,
so get rid of it.
sixth, don't just call curl_setopt. if there's an
error setting any of your options, curl_setopt will return
bool(false), and you should watch out for such errors:
use curl_error to extract the error message and throw an exception
if such an error occurs, instead of what your code is doing right now,
which is silently ignoring any curl_setopt errors. use something like
function ecurl_setopt($ch, int $option, $value) {
    if (!curl_setopt($ch, $option, $value)) {
        throw new \RuntimeException('curl_setopt failed!: ' . curl_error($ch));
    }
}
if fixing all of these problems is not enough to log in, you're not giving us enough information to help you any further. what does the browser's http login request look like? or what is the login url?
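Put together, the fixes listed in this answer might look like the sketch below. It is an illustration, not a drop-in replacement: buildLoginBody() and postLogin() are hypothetical names, the field names username and password are taken from the question, and curl_setopt_array() is used here as one way to detect a failed option.

```php
<?php
// Encode the form fields; http_build_query() percent-encodes every value,
// so characters such as & = # and spaces no longer corrupt the body.
function buildLoginBody(string $username, string $password): string
{
    return http_build_query(['username' => $username, 'password' => $password]);
}

// Hypothetical helper: POST the login form and return the response body.
function postLogin(string $url, string $username, string $password, string $cookieFile): string
{
    $ch = curl_init();
    $ok = curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true, // no CURLOPT_CUSTOMREQUEST needed
        CURLOPT_POSTFIELDS     => buildLoginBody($username, $password),
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        // No Content-Length or Transfer-Encoding header: curl adds what it needs.
    ]);
    if (!$ok) {
        throw new RuntimeException('curl_setopt_array failed: ' . curl_error($ch));
    }
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('curl_exec failed: ' . curl_error($ch));
    }
    curl_close($ch);
    return $html;
}
```

For example, buildLoginBody('a&b', 'p=q r#') yields username=a%26b&password=p%3Dq+r%23, whereas plain string concatenation would send the & and = unescaped and the server would see different credentials.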
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
$username = 'xxx';
$password = 'xxx';
//login form action url
$url="https://www.xxxx.com/Login";
$postinfo = array("username"=>$username,"password"=>$password);
$cookie_file_path = "cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST,false);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
curl_setopt($ch,CURLOPT_COOKIEFILE,$cookie_file_path);
curl_setopt($ch,CURLOPT_COOKIEJAR,$cookie_file_path);
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
$html = curl_exec($ch);
echo $html;
The above code should work fine.
If there is still an issue, check the permissions on the cookie.txt file.
Also, if hidden form data needs to be sent along with the POST, you can find it using the Firefox Live HTTP Headers plugin.
It is not as simple as reading the HTML page using curl. You need to supply a POST value for the submit button. If there is any javascript that executes prior to the activation of ACTION script, then that has to be looked at as well.
Usually you get better results if you use Selenium. See http://www.seleniumhq.org/
EDIT1:
If the server is rejecting your post string try: curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query($data));

How to use PHP CURL to bypass cross domain

I need PHP to submit parameters from one domain to another. JavaScript is not an option in my situation. I'm now trying to use CURL with PHP, but have not been successful in getting past the cross-domain restriction.
From domain_A, I have a page with the following PHP with CURL script:
if (_iscurl()){
echo "<p>CURL is enabled</p>";
$url = "http://domain_B/process.php?id=123&amt=100&jsonp=?";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,10);
curl_setopt($ch, CURLOPT_USERAGENT , "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
curl_setopt($ch, CURLOPT_URL, $url );
$return = curl_exec($ch);
curl_close($ch);
echo "<p>Finished operations</p>";
}
else{
echo "CURL is disabled";
}
?>
I am not getting any results, so I am assuming that the PHP CURL script is not successful. Any ideas to fix this?
Thanks
Well, it's a bit late, but I'm adding this answer for future readers who might face a similar issue. This problem sometimes arises when a PHP curl request is sent from a domain hosted over http to a domain hosted over https (http over ssl).
Just add the code snippet below before executing curl.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
With CURLOPT_RETURNTRANSFER set to false, curl_exec() does not return the response body; it prints the response directly and returns only true or false. Set it to true (or 1):
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

PHP CURL - Problems storing and using cookies when scraping

I've been trying to write a script that retrieves Google Trends results for a given keyword. Please note I'm not trying to do anything malicious; I just want to be able to automate this process and run it a few times every day.
After investigating the Google trends page I discovered that the information is available using the following URL:
http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1
You can request that information multiple times with no issues from a browser, but if you try in "privacy mode", after 4 or 5 requests the following is displayed:
An error has been detected You have reached your quota limit. Please
try again later.
This makes me think that cookies are required. So I have written my script as follows:
$cookiefile = $siteurl . '/wp-content/plugins/' . basename(dirname(__FILE__)) . '/cookies.txt';
$url = 'http://www.google.com/trends/trendsReport?hl=en-GB&q=keyword&cmpt=q&content=1';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookiefile);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$x='error';
while (trim($x) != '' ){
$html=curl_exec($ch);
$x=curl_error($ch);
}
echo "test cookiefile contents = ".file_get_contents($cookiefile)."<br />";
echo $html;
However I just can't get anything written to my cookies file. So I keep on getting the error message. Can anyone see where I'm going wrong with this?
I'm pretty sure your cookie file should exist before you can use it with curl.
Try:
$h = fopen($cookiefile, "x+");
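A slightly more defensive way to prepare the cookie file is sketched below. ensureCookieFile() is a hypothetical helper, and the sketch rests on two assumptions about the question's code: that $siteurl holds a URL rather than a filesystem path, and that a local path was intended. Note also that fopen() with "x+" fails when the file already exists, and that cURL writes the cookie jar only when the handle is closed, which the question's code never does.

```php
<?php
// Hypothetical helper: make sure $path is a local, writable file that cURL
// can use for CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE.
function ensureCookieFile(string $path): bool
{
    // A value such as "http://example.com/..." is a URL, not a file cURL can write to.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $path)) {
        return false;
    }
    // Create the file if it is missing; touch() leaves an existing file intact.
    if (!file_exists($path) && !@touch($path)) {
        return false;
    }
    return is_writable($path);
}
```

In the question's script this could be called with a path such as dirname(__FILE__) . '/cookies.txt' before curl_init(), with curl_close($ch) added after the loop so that the jar is actually flushed to disk.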

Why CodeIgniter's Curl library slower than using Curl in plain PHP?

Recently I moved my scraping code with Curl to CodeIgniter. I'm using Curl CI library from http://philsturgeon.co.uk/code/codeigniter-curl. I put the scraping process in a controller and then I found the execution time of my scraping is slower than the one I built in plain PHP.
It took 12 seconds for CodeIgniter to output the result, whereas it only takes 6 seconds in plain PHP. Both are including the parsing process with the HTML DOM parser.
Here's my Curl code in CodeIgniter:
function curl($url, $postdata=false)
{
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$this->curl->create($url);
$this->curl->ssl(false);
$options = array(
'URL' => $url,
'HEADER' => 0,
'AUTOREFERER' => true,
'FOLLOWLOCATION' => true,
'TIMEOUT' => 60,
'RETURNTRANSFER' => 1,
'USERAGENT' => $agent,
'COOKIEJAR' => dirname(__FILE__) . "/cookie.txt",
'COOKIEFILE' => dirname(__FILE__) . "/cookie.txt",
);
if($postdata)
{
$this->curl->post($postdata, $options);
}
else
{
$this->curl->options($options);
}
return $this->curl->execute();
}
Non-CodeIgniter (plain PHP) code:
function curl($url ,$binary=false,$post=false,$cookie =false ){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Accepts all CAs
curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt ($ch, CURLOPT_URL, $url );
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
if($cookie){
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
}
if($binary)
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
if($post){
foreach($post as $key=>$value)
{
$post_array_string1 .= $key.'='.$value.'&';
}
$post_array_string1 = rtrim($post_array_string1,'&');
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_array_string1);
}
return curl_exec ($ch);
}
Does anyone know why the CodeIgniter Curl version is slower? Or could it be because of the simple_html_dom parser?
I'm not sure I know the exact answer for this, but I have a few observations about Curl & CI as I use it extensively.
Check for the state of DNS caches/queries.
I noticed a substantial speedup when code was uploaded to a hosted staging server from my dev desktop. It was traced to a DNS issue that was solved by rebooting a bastion host... You can sometimes check this by using IP addresses instead of hostnames.
Phil's 'library' is really just a wrapper.
All he's really done is map CI-style functions to the PHP Curl library. There's almost nothing else going on. I spent some time poking around (I forget why) and it was really unremarkable. That said, there may well be some general CI overhead - you might see what happens in another similar framework (Fuel, Kohana, Laravel, etc).
Check your reverse lookup.
Some API's do reverse DNS checks as part of their security scanning. Sometimes hostnames or other headers are badly set in buried configs and can cause real headaches.
Use Chrome's Postman extension to debug REST APIs.
No comment, it's brilliant - https://github.com/a85/POSTMan-Chrome-Extension/wiki and you have fine grained control of the 'conversation'.
I would have to know more about the CI library, and whether it does any extra work on the gathered data, but I would try renaming your method to something other than the library name. I have had issues with the Facebook library where calling it from a method named facebook caused problems. $this->curl could be ambiguous as to whether you mean the library or the method.
Also, try adding the debug profiler and see what it comes up with. Add this either in the construct or the method:
$this->output->enable_profiler(TRUE);

Check proxy using PHP

I'm writing a web app that requires a lot of proxies to work.
I also have a list of proxies, but I don't know which of them works and what type are they (socks, http, https).
Let's assume I have 5000 proxies in ip:port format.
What is the fastest way to check all of them?
I tried fsockopen, but it is quite slow.
Maybe pinging them first will save the time?
<?php
$proxies = file ("proxies.txt");
$mc = curl_multi_init ();
for ($thread_no = 0; $thread_no<count ($proxies); $thread_no++)
{
$c [$thread_no] = curl_init ();
curl_setopt ($c [$thread_no], CURLOPT_URL, "http://google.com");
curl_setopt ($c [$thread_no], CURLOPT_HEADER, 0);
curl_setopt ($c [$thread_no], CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($c [$thread_no], CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt ($c [$thread_no], CURLOPT_TIMEOUT, 10);
curl_setopt ($c [$thread_no], CURLOPT_PROXY, trim ($proxies [$thread_no]));
curl_setopt ($c [$thread_no], CURLOPT_PROXYTYPE, 0);
curl_multi_add_handle ($mc, $c [$thread_no]);
}
do {
while (($execrun = curl_multi_exec ($mc, $running)) == CURLM_CALL_MULTI_PERFORM);
if ($execrun != CURLM_OK) break;
while ($done = curl_multi_info_read ($mc))
{
$info = curl_getinfo ($done ['handle']);
if ($info ['http_code'] == 301) {
echo trim ($proxies [array_search ($done['handle'], $c)])."\r\n";
}
curl_multi_remove_handle ($mc, $done ['handle']);
}
} while ($running);
curl_multi_close ($mc);
?>
You could use cURL to check for proxies. Some good article is given here
Hope it helps
The port usually gives you a good clue about the proxy type.
80,8080,3128 is typically HTTP
1080 is typically SOCKS
But let's be realistic, you seem to have a list of public proxies. It's not unlikely that every single one is not working anymore.
You can use curl or wget or lynx in a script or similar to test the proxies.
You can also try to sort your list into SOCKS and HTTP as well as you can and enter it into the Proxycollective.
That's a free project but you need an invitation code or a 99cent ticket to become member.
Once you are member you can upload your proxy lists and they will be tested. All working ones will be given back to you sortable.
So if you don't want to program something on your own that's maybe your best bet, invitation codes can sometimes be found in various forums.
But keep in mind what I said: if you have a list of 5000 random proxies, I bet you'll hardly find more than 10 working ones in there anymore. Public proxies are short-lived.
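The port rule of thumb from this answer can be used to pre-sort an ip:port list before testing it. This is only a heuristic sketch: guessProxyTypeByPort() is a hypothetical name, and the port-to-type mapping is just the one stated above, so any result still needs a real connection test.

```php
<?php
// Hypothetical helper: guess the proxy type from the port of an "ip:port" entry.
// Returns 'http', 'socks', or 'unknown'. This is a guess, not a check.
function guessProxyTypeByPort(string $proxy): string
{
    $proxy = trim($proxy); // entries read with file() keep their line ending
    $pos = strrpos($proxy, ':');
    if ($pos === false) {
        return 'unknown';
    }
    $port = (int) substr($proxy, $pos + 1);
    if (in_array($port, [80, 8080, 3128], true)) {
        return 'http';
    }
    if ($port === 1080) {
        return 'socks';
    }
    return 'unknown';
}
```

Grouping the list this way lets a checker try the more likely proxy type first for each entry instead of testing every type against every proxy.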
This proxy checker API could be exactly what you're looking for. You can easily check a proxy list with that.
If you want to develop it yourself, it is not difficult to write a small script that does the same thing as that API.
Here is the code I use. You can modify it to meet your requirements:
function _check($url,$usecookie = false,$sock="",$ref="") { // $ref gets a default: a required parameter after optional ones is deprecated in PHP 8
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, False);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,$_POST['timeoutpp']);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");
if($sock){
curl_setopt($ch, CURLOPT_PROXY, $sock);
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
}
if ($usecookie){
curl_setopt($ch, CURLOPT_COOKIEJAR, $usecookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $usecookie);
}
if ($ref){
curl_setopt($ch, CURLOPT_REFERER,$ref);
}
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
$result=curl_exec ($ch);
curl_close($ch);
return $result;
}
There is a PHP tool on GitHub that can scan a proxy list using multiple threads. It should not be difficult to integrate it into your own project. The tool also analyzes the security level of the proxies it scans. Yes, I wrote it. :)
https://github.com/enseitankado/proxy-profiler
