Just one question. I have some cURL-based code that sends a request to a server; if the response is 'valid' it runs an SQL query, but if the response is 'busy' I need to change the proxy the script is using.
I'm making it this way:
$proxys = file('http_proxy.txt');
...then...
for($n = 0, $count = count($proxys); $n <= $count; $n++) {
...and to change the proxy I used something like this:
$proxy = $proxys[$n + 1];
but it doesn't work.
Any suggestions?
Regards.
For starters, file('http_proxy.txt') will retain the newlines from your file, so use the FILE_IGNORE_NEW_LINES flag to omit them. Then you can use break; to stop the loop after successfully using cURL with a proxy:
$proxys = file('http_proxy.txt', FILE_IGNORE_NEW_LINES);
foreach ($proxys as $proxy)
{
    $response = sendRequestTo($proxy);
    if ($response == 'valid')
    {
        performQuery($proxy);
        break;
    }
}
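If it helps, sendRequestTo() could be sketched roughly like this; the endpoint URL, the timeout and the 'busy' fallback are placeholders rather than your actual code:
function sendRequestTo($proxy)
{
    // Hypothetical endpoint; replace with the real server URL
    $ch = curl_init('http://example.com/endpoint');
    curl_setopt($ch, CURLOPT_PROXY, $proxy);         // route the request through this proxy
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // give up quickly on a dead proxy
    $response = curl_exec($ch);
    curl_close($ch);
    // Treat a failed transfer like a 'busy' answer so the loop moves to the next proxy
    return $response === false ? 'busy' : trim($response);
}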
If you are calling the PHP page from the web, you can pass the parameters as
../../somepage.php?myid=1&trackno=2&anotherparam=3
and then use $_REQUEST or $_GET to retrieve the information.
On the command line, you can use
$options = getopt("a:b:c:"); to get the options that are passed as arguments.
How can I make sure the same source works both on the web and on the command line?
Let's say your requests look like the following:
WEB: http://domain.com/somepage.php?myid=1&trackno=2&anotherparam=3
CLI: php /path/to/this/php/file/somepage.php 1 2 3
You can use the following PHP code:
<?php
if (!empty($_REQUEST)) {
    $myid = $_REQUEST["myid"];
    $trackno = $_REQUEST["trackno"];
    $anotherparam = $_REQUEST["anotherparam"];
} else if (!empty($argv)) {
    $myid = $argv[1];
    $trackno = $argv[2];
    $anotherparam = $argv[3];
} else {
    die("Invalid request!");
}
You already know how to handle web requests; you can refer here for more detail about $argv. Simply:
$argv[0] => script name (somepage.php),
$argv[1] => first param, ...,
$argv[n] => nth param
Edit:
To avoid depending on the order of the command-line arguments, you can use a naming convention like
php somepage.php myid_1 anotherparam_2 trackno_3
and handle it with the following:
foreach ($argv as $k => $v) {
    if ($k == 0) continue; // skip the script name
    $temp = explode("_", $v);
    ${$temp[0]} = $temp[1]; // variable variable: "myid_1" becomes $myid = 1
}
Simply, myid_3 becomes $myid = 3;
the variable names are carried in the values, so you don't need to know anything about their order.
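Alternatively, getopt() itself supports long options, which also removes the dependence on argument order. A minimal sketch, assuming the same parameter names as above and an invocation like php somepage.php --myid=1 --trackno=2 --anotherparam=3:
<?php
if (!empty($_REQUEST)) {
    $params = $_REQUEST; // web request
} else {
    // CLI request: the trailing ":" means the option requires a value
    $params = getopt("", array("myid:", "trackno:", "anotherparam:"));
}
$myid         = isset($params["myid"]) ? $params["myid"] : null;
$trackno      = isset($params["trackno"]) ? $params["trackno"] : null;
$anotherparam = isset($params["anotherparam"]) ? $params["anotherparam"] : null;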
I'm here again, learning more and more about PHP, but I still have some problems with my scenario. Most of it has been programmed and solved without problems, but I found an issue, and to understand it I need to explain it first:
I have a PHP script which can be invoked by any client. Its job is to receive a request and ping a proxy from a list which I define manually, to find out whether that proxy is available; if it is available, I proceed to retrieve a response using cURL with a POST. The logic is like this:
$proxyList = array('192.168.3.41:8013' => 0, '192.168.3.41:8023' => 0, '192.168.3.41:8033' => 0);
$errorCounter = 0;
foreach ($proxyList as $key => $value) {
    if (!isUrlAvailable($key)) { // It means it is NOT available, so I count errors
        $errorCounter++;
    } else { // It means it is AVAILABLE
        $result = callThisProxy($key);
    }
}
The function "isUrlAvailable" uses a $fsockopen to know if the proxy is available. If not, I make a POST with CURL as mentioned before, the function has callThisProxy() something like:
$ch = curl_init($proxyUrl);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,'xmlQuery='.$rawXml);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$info = curl_exec ($ch);
if($isDebug){echo 'Info in the moment: '.$info.'<br/>';}
curl_close ($ch);
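For context, isUrlAvailable() is roughly something like this (a simplified sketch; the 2-second timeout and the host:port split are just examples):
function isUrlAvailable($proxy)
{
    list($host, $port) = explode(':', $proxy);
    // Try to open a TCP connection to the proxy with a short timeout
    $fp = @fsockopen($host, (int)$port, $errno, $errstr, 2);
    if ($fp === false) {
        return false;
    }
    fclose($fp);
    return true;
}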
But we're testing some scenarios: what happens if I turn off the proxy between the availability check and the actual call? I mean:
foreach ($proxyList as $key => $value) {
    if (!isUrlAvailable($key)) { // It means it is NOT available, so I count errors
        $errorCounter++;
    } else { // It means it is AVAILABLE
        $result = callThisProxy($key); // What happens if I kill the proxy while the result is being processed?
    }
}
I tested it, and when I do that, $result comes back as an empty string ''. The problem is that I lose that request, and my goal is to retry it with the next $key, which is another proxy. I've been thinking of a do/while around the call, but I'm not sure whether that is the right approach or whether there's a better way, so I'd appreciate help with this issue. Thanks in advance for your time; any answer is welcome. Thanks.
Maybe something like:
$result = "";
while ($result == "")
{
foreach ($proxyList as $key => $value)
{
if (!isUrlAvailable($key))
{
$errorCounter++;
}
else
{
$result = callThisProxy($key);
}
}
}
// Now check $result, which should contain the first successful callThisProxy()
// result, or nothing if none of the keys worked.
You could just keep a list of proxies that you still need to try. When a proxy is unreachable or you get a valid response, remove it from the list of proxies to try. If you do not get a good response, keep it in the list and try it again later.
$proxiesToTry = array_keys($proxyList); // the proxies are the array keys
$i = 0;
while (count($proxiesToTry) != 0) {
    // reset to the beginning of the array (reindex, since unset() leaves gaps)
    if ($i >= count($proxiesToTry)) {
        $proxiesToTry = array_values($proxiesToTry);
        $i = 0;
    }
    $proxy = $proxiesToTry[$i];
    if (!isUrlAvailable($proxy)) { // It means it is NOT available, so I count errors
        $errorCounter++;
        unset($proxiesToTry[$i]);
    } else { // It means it is AVAILABLE
        $result = callThisProxy($proxy);
        if ($result != "") // If we got a response, remove it from the array of proxies to try.
            unset($proxiesToTry[$i]);
    }
    $i++;
}
NOTE: You will never break out of this loop if you don't ever get a valid response from some proxy.
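If hanging forever is a concern, a capped variant is straightforward. This is a sketch under the same assumptions as above ($proxyList, isUrlAvailable(), callThisProxy() and $errorCounter already defined); the number of passes is arbitrary:
$proxiesToTry = array_keys($proxyList);
$maxPasses = 3; // give up after this many full passes over the remaining proxies
$result = "";
for ($pass = 0; $pass < $maxPasses && count($proxiesToTry) != 0; $pass++) {
    foreach (array_keys($proxiesToTry) as $i) {
        $proxy = $proxiesToTry[$i];
        if (!isUrlAvailable($proxy)) {
            $errorCounter++;
            unset($proxiesToTry[$i]); // unreachable: drop it for good
        } else {
            $result = callThisProxy($proxy);
            if ($result != "") {
                break 2; // got a usable response, stop entirely
            }
        }
    }
}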
I don't know how to make this work.
There is an XML API server and I'm fetching content from it with cURL; that works fine. Now I have to poll the creditCardPreprocessors state. It has an 'in progress' state too, and PHP should wait until the process is finished. I have already tried sleep() and other ways, but I can't make it work. This is a simplified example variation of what I tried:
function process_state($xml){
    if($result = request($xml)){
        // It'll return NULL on bad state, for example
        return $result;
    }
    sleep(3);
    process_state($xml);
}
I know this can be an infinite loop, but I've tried adding a counter to exit when it reaches five; it won't exit, the server hangs up, I get 500 errors for minutes, and Apache becomes unreachable for that vhost.
EDIT:
Another example
$i = 0;
$card_state = false;
// We're going to assume request() returns NULL if the card state is still processing and TRUE if it's done
while(!$card_state && $i < 10){
    $i++;
    if($result = request('XML STUFF')){
        $card_state = $result;
        break;
    }
    sleep(2);
}
The recursive method you've defined could cause problems depending on the response timing you get back from the server. I think you'd want to use a while loop here. It keeps the requests serialized.
$returnable_responses = array('code1', 'code2', 'code3'); // the array of responses that you want the function to stop after receiving
$max_number_of_calls = 5; // or some number
$iterator = 0;
$result = NULL;
while(!in_array($result, $returnable_responses) && ($iterator < $max_number_of_calls)) {
    $result = request($xml);
    $iterator++;
}
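If the API needs breathing room between polls (as in the sleep() calls in the question), a pause inside the loop keeps the pacing; the 2-second delay here is just an assumption:
while (!in_array($result, $returnable_responses) && ($iterator < $max_number_of_calls)) {
    $result = request($xml);
    $iterator++;
    if (!in_array($result, $returnable_responses)) {
        sleep(2); // wait before polling again so the API isn't hammered
    }
}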
Hi everyone once again!
We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but rather the amount of operations done with each link after the request.
Even with multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
    Foreach ($links_to_be_scanned as $link)
        If (There's less than 10 scanners running)
            Launch_a_new_scanner($link)
            Remove the link from the $links_to_be_scanned array
            Push the link into the $links_on_queue array
        Endif;
And each scanner does (this should run in parallel):
    Create an object with the given link
    Send a curl request to the given link
    Create a dom and an Xdom with the response body
    Perform other operations over the response body
    Remove the link from the $links_on_queue array
    Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Since even using multi-curl, I would eventually have to wait in a regular foreach loop for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'   => 'http://stackoverflow.com',
    'so-php'    => 'http://stackoverflow.com/questions/tagged/php',
    'so-python' => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'   => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'   => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'    => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'     => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method is making all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.
You could try something like this. I haven't checked it, but you should get the idea:
$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);
    // set curl options here
    return $handle;
}

function Process($data) {
    global $request_pool;
    // do something with data
    array_push($request_pool, CreateHandle($some_new_url)); // $some_new_url: whatever new URL you extracted from $data
}

function RunMulti() {
    global $request_pool;
    $multi_handle = curl_multi_init();
    $active_request_pool = array();
    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary
    do {
        $waiting_request_count = count($request_pool);
        while (($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle, $request);
            $active_request_pool[(int)$request] = $request;
            $waiting_request_count--;
            $active_request_count++;
        }
        curl_multi_exec($multi_handle, $running);
        curl_multi_select($multi_handle);
        while ($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process', curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle, $curl_handle);
            curl_close($curl_handle);
            $active_request_count--;
        }
    } while ($active_request_count > 0 || $waiting_request_count > 0);
    curl_multi_close($multi_handle);
}
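A usage sketch under the assumptions of the snippet above: seed the pool with your initial links and let Process() re-seed it with newly discovered ones ($links_to_be_scanned is the array from the question):
// Seed the pool with the initial links to be scanned
foreach ($links_to_be_scanned as $link) {
    array_push($request_pool, CreateHandle($link));
}
// Process() keeps pushing newly discovered links into $request_pool,
// and RunMulti() keeps at most $active_request_max transfers in flight.
RunMulti();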
You should look for a more robust solution to your problem. RabbitMQ is a very good solution that I have used. There is also Gearman, but the choice is yours; I prefer RabbitMQ.
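To give an idea of the queue-based approach, here is a minimal producer sketch; it assumes the php-amqplib library, a local broker with default credentials, and a hypothetical 'links_to_scan' queue. Separate worker processes would consume that queue and do the actual scanning in parallel.
<?php
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('links_to_scan', false, true, false, false); // durable queue

// Publish every link as its own message; independent workers pick them up in parallel
foreach ($links_to_be_scanned as $link) {
    $channel->basic_publish(new AMQPMessage($link), '', 'links_to_scan');
}

$channel->close();
$connection->close();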
I will share the code which I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
And I do not use cURL here.
<?php
error_reporting(E_ALL);

$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');

set_time_limit(0);
ini_set('memory_limit', '512M');

function scan_page($home, $full_url, &$writer) {
    static $done = array();
    $done[] = $full_url;

    // Scan only internal links. Do not scan all the internet!))
    if (strpos($full_url, $home) === false) {
        return false;
    }
    $html = @file_get_contents($full_url);
    if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
        return false;
    }
    echo $full_url . '<br />';
    preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
    if (!empty($emails) && is_array($emails)) {
        foreach ($emails as $email_group) {
            if (is_array($email_group)) {
                foreach ($email_group as $email) {
                    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                        $writer->write($email);
                    }
                }
            }
        }
    }

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
    if (is_array($matches)) {
        foreach ($matches as $match) {
            if (!empty($match[2]) && is_scalar($match[2])) {
                $url = $match[2];
                if (!filter_var($url, FILTER_VALIDATE_URL)) {
                    $url = $home . $url;
                }
                if (!in_array($url, $done)) {
                    scan_page($home, $url, $writer);
                }
            }
        }
    }
}

class RWriter {
    private $_fh = null;
    private $_written = array();

    public function __construct($fname) {
        $this->_fh = fopen($fname, 'w+');
    }

    public function write($line) {
        if (in_array($line, $this->_written)) {
            return;
        }
        $this->_written[] = $line;
        echo $line . '<br />';
        fwrite($this->_fh, "{$line}\r\n");
    }

    public function __destruct() {
        fclose($this->_fh);
    }
}

scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);
I'm using a web service to send hundreds of HTTP POSTs. However, the service only allows 5 per second. I'm wondering if usleep() is the best way to do this. For example:
foreach ($JSONarray['DATABASE'] as $E)
{
    $aws = curl_init();
    //curl stuff
    curl_exec($aws);
    curl_close($aws);
    usleep(200000);
}
Now, this is untested, but it should give you an idea of what I would do (and perhaps this snippet even works as it is; who knows...):
// presets
$thissecond = time();
$cnt = 0;
foreach ($JSONarray['DATABASE'] as $E)
{
    while ($thissecond == time() && $cnt > 4) { // go into "waiting" when we're going too fast
        usleep(100000); // wait .1 second and ask again
    }
    if ($thissecond != time()) { // remember to reset the second and the count
        $thissecond = time();
        $cnt = 0;
    }
    // off with the payload
    $aws = curl_init();
    //curl stuff
    curl_exec($aws);
    curl_close($aws);
    // remember to count it all
    $cnt++;
}
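As a design note, the fixed usleep(200000) in the question only caps the rate if every request were instantaneous, since the request time itself adds to the interval; the counter above throttles against the wall clock instead. Another way to express the same idea is to enforce a minimum interval between sends with microtime(). A small sketch, where the 0.2-second interval matches the 5-per-second limit:
$minInterval = 0.2; // seconds between sends: 5 requests per second
$lastSent = 0.0;
foreach ($JSONarray['DATABASE'] as $E)
{
    $elapsed = microtime(true) - $lastSent;
    if ($elapsed < $minInterval) {
        usleep((int)(($minInterval - $elapsed) * 1000000)); // sleep only the remaining time
    }
    $lastSent = microtime(true);

    $aws = curl_init();
    //curl stuff
    curl_exec($aws);
    curl_close($aws);
}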