I noticed that some users are overloading my website by downloading many files at once (for example, 500 files at the same time) and opening a large number of pages in a short period. I want to show a captcha when I detect this kind of unexpected navigation.
I know how to implement the captcha, but I can't figure out the best approach to detecting traffic abuse in PHP.
A common approach is to use something like memcached to store the request counts on a per-minute basis. I have open-sourced a small class that achieves this: php-ratelimiter.
If you are interested in a more thorough explanation of why the requests need to be stored on a per-minute basis, check this post.
So, to sum it up, your code could end up looking like this:
if (!verifyCaptcha()) {
    $rateLimiter = new RateLimiter(new Memcache(), $_SERVER["REMOTE_ADDR"]);
    try {
        $rateLimiter->limitRequestsInMinutes(100, 5);
    } catch (RateExceededException $e) {
        displayCaptcha();
        exit;
    }
}
The code works on a per-minute basis, but you can quite easily adapt it to work on a per-30-second basis:
private function getKeys($halfminutes) {
    $keys = array();
    $now = time();
    for ($time = $now - $halfminutes * 30; $time <= $now; $time += 30) {
        $keys[] = $this->prefix . date("dHis", $time);
    }
    return $keys;
}
Introduction
A similar question has been answered before, Prevent PHP script from being flooded, but it might not be sufficient, for these reasons:
It uses $_SERVER["REMOTE_ADDR"], and users behind a shared connection have the same public IP address
There are many Firefox add-ons that allow users to route each request through a different proxy
Multiple Requests != Multiple Downloads
Preventing multiple requests is totally different from preventing multiple downloads. Why?
Imagine a file of 10MB that takes 1 min to download. If you limit users to, say, 100 requests per minute, you are still allowing each user to download
10MB * 100 per min
To fix this issue you can look at Download - max connections per user?.
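As a rough sketch of that idea (not code from the linked answer; the key name, limit and file path are only placeholders), you could keep a per-IP counter of active downloads in memcache and refuse new downloads above a limit:
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$key = 'active_downloads_' . $_SERVER['REMOTE_ADDR'];
$maxParallel = 3; // placeholder limit

// Initialise the counter if it does not exist yet, then claim a slot
$memcache->add($key, 0, 0, 3600);
$active = $memcache->increment($key);

if ($active > $maxParallel) {
    $memcache->decrement($key);
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many parallel downloads.');
}

// Release the slot when the script ends, even if the client aborts
register_shutdown_function(function () use ($memcache, $key) {
    $memcache->decrement($key);
});

readfile('/path/to/file.zip'); // stream the file through PHP (placeholder path)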
Multiple Request
Back to page access: you can use SimpleFlood, which extends Memcache to limit users per second. It uses cookies to work around the shared-connection issue and attempts to get the real IP address.
$flood = new SimpleFlood();
$flood->addserver("127.0.0.1"); // add memcache server
$flood->setLimit(2); // expect 1 request every 2 sec

try {
    $flood->check();
} catch ( Exception $e ) {
    sleep(2); // feel like wasting time
    // Display captcha
    // Write message to log
    printf("%s => %s %s", date("Y-m-d g:i:s"), $e->getMessage(), $e->getFile());
}
Please note that SimpleFlood::setLimit(float $limit) accepts floats, so you can have
$flood->setLimit(0.1); // expect 1 request every 0.1 sec
Class Used
class SimpleFlood extends \Memcache {
    private $ip;
    private $key;
    private $penalty = 0;
    private $limit = 100;
    private $mins = 1;
    private $salt = "I like TO dance A #### Lot";

    function check() {
        $this->parseValues();
        $runtime = floatval($this->get($this->key));
        $diff = microtime(true) - $runtime;
        if ($diff < $this->limit) {
            throw new Exception("Limit Exceeded By : $this->ip");
        }
        $this->set($this->key, microtime(true));
    }

    public function setLimit($limit) {
        $this->limit = $limit;
    }

    private function parseValues() {
        $this->ip = $this->getIPAddress();
        if (! $this->ip) {
            throw new Exception("Where the hell is the ip address");
        }
        if (isset($_COOKIE["xf"])) {
            $cookie = json_decode($_COOKIE["xf"]);
            if ($this->ip != $cookie->ip) {
                unset($_COOKIE["xf"]);
                setcookie("xf", null, time() - 3600);
                throw new Exception("Last IP did not match");
            }
            if ($cookie->hash != sha1($cookie->key . $this->salt)) {
                unset($_COOKIE["xf"]);
                setcookie("xf", null, time() - 3600);
                throw new Exception("Nice Faking cookie");
            }
            if (strpos($cookie->key, "floodIP") === 0) {
                // Once the cookie is trusted, switch from the IP-based key to a random one
                $cookie->key = "floodRand" . bin2hex(mcrypt_create_iv(50, MCRYPT_DEV_URANDOM));
            }
            $this->key = $cookie->key;
        } else {
            $this->key = "floodIP" . sha1($this->ip);
            $cookie = (object) array(
                "key" => $this->key,
                "ip" => $this->ip
            );
        }
        $cookie->hash = sha1($this->key . $this->salt);
        $cookie = json_encode($cookie);
        setcookie("xf", $cookie, time() + 3600); // expire in 1hr
    }

    private function getIPAddress() {
        foreach ( array(
            'HTTP_CLIENT_IP',
            'HTTP_X_FORWARDED_FOR',
            'HTTP_X_FORWARDED',
            'HTTP_X_CLUSTER_CLIENT_IP',
            'HTTP_FORWARDED_FOR',
            'HTTP_FORWARDED',
            'REMOTE_ADDR'
        ) as $key ) {
            if (array_key_exists($key, $_SERVER) === true) {
                foreach ( explode(',', $_SERVER[$key]) as $ip ) {
                    if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
                        return $ip;
                    }
                }
            }
        }
        return false;
    }
}
Conclusion
This is a basic proof of concept, and additional layers can be added to it, such as:
Setting a different limit for different URLs (a rough sketch follows below)
Adding support for penalties, where you block a user for a certain number of minutes or hours
Detecting Tor connections and applying a different limit to them
etc.
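For example, the first point could be handled by mapping URL prefixes to different limits before calling setLimit(); the prefixes and values below are only made up for illustration:
$limitsByPrefix = array(
    '/download' => 5.0, // at most 1 download request every 5 seconds
    '/search'   => 1.0, // at most 1 search per second
    '/'         => 0.5  // default: 2 page views per second
);

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$limit = $limitsByPrefix['/']; // fallback
foreach ($limitsByPrefix as $prefix => $seconds) {
    if (strpos($path, $prefix) === 0) {
        $limit = $seconds;
        break;
    }
}

$flood = new SimpleFlood();
$flood->addserver("127.0.0.1");
$flood->setLimit($limit);
try {
    $flood->check();
} catch (Exception $e) {
    // show the captcha / log, as in the example above
}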
I think you can use sessions in this case.
Initialize a session to store a timestamp (use microtime() for better resolution), then take a timestamp on each new page. The difference between the two can be used to analyze how frequently pages are being visited, and the captcha can be shown when the frequency is too high.
You can also run a counter of pages visited and use a 2D array to store the page and the timestamp. If the number of pages visited increases suddenly, you can check the timestamp differences.
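A minimal sketch of that session idea, with arbitrary thresholds and a placeholder for the captcha call:
session_start();

$now = microtime(true);
$minInterval = 0.5; // flag anything faster than one request every 0.5 s
$maxStrikes  = 5;   // show the captcha after 5 too-fast requests in a row

if (!isset($_SESSION['last_request'])) {
    $_SESSION['last_request'] = $now;
    $_SESSION['strikes'] = 0;
}

if (($now - $_SESSION['last_request']) < $minInterval) {
    $_SESSION['strikes']++;
} else {
    $_SESSION['strikes'] = 0; // back to normal speed, reset the counter
}
$_SESSION['last_request'] = $now;

if ($_SESSION['strikes'] >= $maxStrikes) {
    // displayCaptcha(); exit;
}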
Related
I know this can be done in MySQL, but I want the IP to be stored in PHP or a text file, and it's kind of hard for me because I don't quite understand it.
$_SESSION is used for logging, but how can the IP be stored and banned for 24 hours after an HTML button is clicked?
Many thanks
Something like this should do the trick.
We write the visitor's IP and the time the ban expires to a text file. We then load this file and check whether they are in it and still banned.
function ban()
{
    $ip = $_SERVER['REMOTE_ADDR'];
    file_put_contents(
        'bans.txt',
        sprintf("%s;%s\n", $ip, (new DateTime())->add(new DateInterval('PT24H'))->getTimestamp()),
        FILE_APPEND | LOCK_EX);
}

function checkBan()
{
    $ip = $_SERVER['REMOTE_ADDR'];
    $bans = file('bans.txt');
    foreach ($bans as $ban) {
        $banExpiry = (int) explode(';', $ban)[1];
        if ($banExpiry < time()) {
            continue;
        }
        $bannedIp = explode(';', $ban)[0];
        if ($bannedIp === $ip) {
            return true;
        }
    }
    return false;
}
Hi everyone once again!
We need some help developing and implementing multi-curl functionality into our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're going to have to create a $links_being_scanned array or some kind of queue.
I'm really not sure how to approach this problem, to be honest; if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but the amount of operations done with each link after the request.
Even with multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So, rethinking it, we would have to do something like this:
While (There's links to be scanned)
    Foreach ($Link_to_scann as $link)
        If (There's less than 10 scanners running)
            Launch_a_new_scanner($link)
            Remove the link from $links_to_be_scanned array
            Push the link into $links_on_queue array
        Endif;
And each scanner does (this should be run in parallel):
    Create an object with the given link
    Send a curl request to the given link
    Create a dom and an Xdom with the response body
    Perform other operations over the response body
    Remove the link from the $links_on_queue array
    Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process, since even using multi-curl I would eventually be waiting in a regular foreach loop for the other operations to finish.
I assume I would have to approach this using fsockopen or pcntl_fork.
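For illustration, a minimal sketch of that pcntl_fork() approach could look like the following (scan_link() and the URLs are placeholders for the curl + DOM work; the pcntl extension is required and this only works from the CLI):
$linksToBeScanned = array('http://example.com/a', 'http://example.com/b');
$maxWorkers = 10;
$children = array();

foreach ($linksToBeScanned as $link) {
    // Wait for a free slot before forking another scanner
    while (count($children) >= $maxWorkers) {
        $pid = pcntl_wait($status);
        unset($children[$pid]);
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        die('Could not fork');
    } elseif ($pid === 0) {
        // Child process: scan one link, then exit
        scan_link($link); // placeholder for the curl + DOM operations
        exit(0);
    } else {
        // Parent process: remember the child and continue the loop
        $children[$pid] = $link;
    }
}

// Wait for the remaining scanners to finish
while (count($children) > 0) {
    $pid = pcntl_wait($status);
    unset($children[$pid]);
}
Note that forked children do not share memory with the parent, so newly discovered links would have to be passed back through a pipe, a database or a queue rather than by pushing onto the parent's array.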
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php

use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'   => 'http://stackoverflow.com',
    'so-php'    => 'http://stackoverflow.com/questions/tagged/php',
    'so-python' => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'   => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'   => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'    => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'     => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method is making all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open up new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.
You could try something like this. I haven't checked it, but you should get the idea:
$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);
    // set curl options here
    return $handle;
}

function Process($data) {
    global $request_pool;
    // do something with data
    array_push($request_pool, CreateHandle($some_new_url));
}

function RunMulti() {
    global $request_pool;

    $multi_handle = curl_multi_init();
    $active_request_pool = array();

    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary

    do {
        $waiting_request_count = count($request_pool);
        while (($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle, $request);
            $active_request_pool[(int) $request] = $request;

            $waiting_request_count--;
            $active_request_count++;
        }

        curl_multi_exec($multi_handle, $running);
        curl_multi_select($multi_handle);

        while ($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process', curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle, $curl_handle);
            curl_close($curl_handle);
            $active_request_count--;
        }
    } while ($active_request_count > 0 || $waiting_request_count > 0);

    curl_multi_close($multi_handle);
}
You should look at a more robust solution to your problem. RabbitMQ
is a very good solution that I have used. There is also Gearman, but the choice is yours.
I prefer RabbitMQ.
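As a rough illustration with the php-amqplib library (the queue name and connection details are only examples, and $links_to_be_scanned is the array from the question), publishing each discovered link to a queue for separate worker processes could look like this:
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Durable queue so the links survive a broker restart
$channel->queue_declare('links_to_scan', false, true, false, false);

foreach ($links_to_be_scanned as $link) {
    $msg = new AMQPMessage($link, array('delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT));
    $channel->basic_publish($msg, '', 'links_to_scan');
}

$channel->close();
$connection->close();
A pool of worker scripts would then consume from that queue with basic_consume(), each doing the curl and DOM work for one link at a time, which gives you the parallelism without managing forks yourself.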
I will share with you the code I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
And I do not use cURL here.
<?php

error_reporting(E_ALL);

$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');

set_time_limit(0);
ini_set('memory_limit', '512M');

function scan_page($home, $full_url, &$writer) {
    static $done = array();
    $done[] = $full_url;

    // Scan only internal links. Do not scan all the internet!))
    if (strpos($full_url, $home) === false) {
        return false;
    }

    $html = @file_get_contents($full_url);
    if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
        return false;
    }

    echo $full_url . '<br />';

    preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
    if (!empty($emails) && is_array($emails)) {
        foreach ($emails as $email_group) {
            if (is_array($email_group)) {
                foreach ($email_group as $email) {
                    if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                        $writer->write($email);
                    }
                }
            }
        }
    }

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
    if (is_array($matches)) {
        foreach ($matches as $match) {
            if (!empty($match[2]) && is_scalar($match[2])) {
                $url = $match[2];
                if (!filter_var($url, FILTER_VALIDATE_URL)) {
                    $url = $home . $url;
                }
                if (!in_array($url, $done)) {
                    scan_page($home, $url, $writer);
                }
            }
        }
    }
}

class RWriter {
    private $_fh = null;
    private $_written = array();

    public function __construct($fname) {
        $this->_fh = fopen($fname, 'w+');
    }

    public function write($line) {
        if (in_array($line, $this->_written)) {
            return;
        }
        $this->_written[] = $line;
        echo $line . '<br />';
        fwrite($this->_fh, "{$line}\r\n");
    }

    public function __destruct() {
        fclose($this->_fh);
    }
}

scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);
We serve several high-load sites, and so far we have come up with this code for caching remote banners; it supports some basic local banner rotation. It should be fairly easy for everyone to understand. What I would like to hear from you are possible improvements or suggestions on this code.
Here it is...
$cachetime = 6 * 60 * 60; // 6 hours
$bannercache = $_SERVER['DOCUMENT_ROOT']."/banner-".$bpos.".txt";

// Serve from the cache if it is younger than $cachetime
if (file_exists($bannercache) && (time() - $cachetime < filemtime($bannercache))) {
    // if it's ok don't update from remote
} else {
    // if cache is old, update from remote
    $bannercachecontent = @file_get_contents('ADSERVER.com/showad.php?category='.$adcat.'&dimensions='.$dimensions);
    if ($bannercachecontent === FALSE) {
        // on error, just update local time, so that it's not pulled again in case of remote mysql overload
        $fb = @fopen($bannercache, 'a+');
        fwrite($fb, "\n<!-- Changed date on ".date('jS F Y H:i', filemtime($bannercache))."-->\n");
        fclose($fb);
    } else {
        // if it's ok, save new local file
        $fb = @fopen($bannercache, 'w');
        fwrite($fb, $bannercachecontent);
        fwrite($fb, "\n<!-- Cached ".date('jS F Y H:i', filemtime($bannercache))."-->\n");
        fclose($fb);
    }
}

$fhm = file_get_contents($bannercache);
$fhmpos = strpos($fhm, '-----#####-----'); // check if it needs to be exploded for rotation
if ($fhmpos === false) {
    echo $fhm;
} else {
    $fhmpicks = explode("-----#####-----", $fhm);
    foreach ($fhmpicks as $fhmkey => $fhmvalue) {
        if (trim($fhmpicks[$fhmkey]) == '') {
            unset($fhmpicks[$fhmkey]);
        }
    }
    $fhmpick = array_rand($fhmpicks, 1);
    echo $fhmpicks[$fhmpick]; // show only one banner
}
Don't let your clients update your banners. You will always have one user who has to do this, and that isn't necessary.
Instead: let your clients always load the local image, and run a background process (cron, with a manual override if you want to pull the image NOW) that fetches the right image(s).
The code would be fairly simple:
your client does: $yourImage = $_SERVER['DOCUMENT_ROOT']."/banner-".$bpos.".txt";
and your update script can just cURL the right image.
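A minimal sketch of such a cron-driven update script, reusing the ADSERVER.com URL and banner file naming from the question (the values are placeholders and error handling is kept short):
// update_banner.php - run from cron, e.g. every 6 hours:
//   0 */6 * * * php /path/to/update_banner.php
$bpos = 1; // banner position to refresh
$adcat = 'default';
$dimensions = '468x60';

$target = $_SERVER['DOCUMENT_ROOT'] . "/banner-" . $bpos . ".txt";
$source = 'http://ADSERVER.com/showad.php?category=' . $adcat . '&dimensions=' . $dimensions;

$ch = curl_init($source);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$banner = curl_exec($ch);
curl_close($ch);

// Only overwrite the local copy when the remote fetch actually succeeded,
// so clients keep serving the last good banner on failure
if ($banner !== false && $banner !== '') {
    file_put_contents($target, $banner, LOCK_EX);
}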
I'm having some difficulties getting the PHP libevent extension to break out of a loop on a timeout. Here's what I've got so far based on the demos on the PHP.net docs pages:
// From here: http://www.php.net/manual/en/libevent.examples.php
function print_line($fd, $events, $arg) {
    static $max_requests = 0;
    $max_requests++;

    printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));

    if ($max_requests == 10) {
        // exit loop after 10 events
        echo " [EXIT]\n";
        event_base_loopexit($arg[1]);
    }
}

// create base and event
$base = event_base_new();
$event = event_new();

getTimer(); // Initialise timer

$fd = STDIN;
event_set($event, $fd, EV_READ | EV_PERSIST, "print_line", array($event, $base));
event_base_set($event, $base);
event_add($event, 2000000);
event_base_loop($base);

// extract flags from bitmask
function getEventFlags($ebm) {
    $expFlags = array('EV_TIMEOUT', 'EV_SIGNAL', 'EV_READ', 'EV_WRITE', 'EV_PERSIST');
    $ret = array();
    foreach ($expFlags as $exf) {
        if ($ebm & constant($exf)) {
            $ret[] = $exf;
        }
    }
    return $ret;
}

// Used to track time!
function getTimer() {
    static $ts;
    if (is_null($ts)) {
        $ts = microtime(true);
        return "Timer initialised";
    }
    $newts = microtime(true);
    $r = sprintf("Delta: %.3f", $newts - $ts);
    $ts = $newts;
    return $r;
}
I can see that the timeout value passed to event_add() affects the events passed to print_line(): if the events are more than 2 seconds apart I get an EV_TIMEOUT instead of an EV_READ. What I want, however, is for libevent to call print_line() as soon as the timeout is reached, rather than waiting for the next event in order to give me the timeout.
I've tried using event_base_loopexit($base, 2000000), but this causes the event loop to exit immediately without blocking for events. I've also tried passing EV_TIMEOUT to event_set(); this seems to have no effect at all.
Has anyone managed to get this working before? I know that the event_buffer_* functions work with timeouts, but I want to use the standard event_base functions. One of the PECL bugs talks about event_timer_* functions, and these functions do exist on my system, however they're not documented at all.
The problem is the fgets() in:
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), fgets($fd));
This blocks processing and waits for data from STDIN (but there is none on a timeout).
Change it to something like this:
$text = '';
if ($events & EV_READ) {
    $text = fgets($fd);
}
printf("Received event: %s after %s\n%s", implode(getEventFlags($events)), getTimer(), $text);
Will there be a performance hit when I add this to my .htaccess file:
HOWTO stop automated spam-bots using .htaccess
Or should I add it to my PHP file instead?
Or leave it out completely, because spammers might fake their user agent anyway?
Would it also make sense to prevent users from accessing your website via a proxy server? I know that this might also block people who didn't come to your website with bad intentions. But what are some of the reasons people would visit a website via a proxy server, other than spam, or the website being blocked in their country?
Will there be a performance hit when I add this to my .htaccess file?
Possibly, if you have thousands or tens of thousands of user agent strings to match against. Apache has to check this rule on every request.
Or should I add it to my PHP file instead?
No. Apache's parsing of .htaccess will still be quicker than a PHP process; for PHP, Apache has to start a PHP interpreter for every request.
Or leave it out completely, because spammers might fake their user agent anyway?
Probably, yes. It is very likely that most malicious spam bots will fake a standard user agent.
But what are some of the reasons people would visit a website via a proxy server, other than spam, or the website being blocked in their country?
There are a lot of legitimate uses for a proxy server. One is mobile clients that use some sort of prefetching to save mobile traffic. There are also some ISPs who force their clients to use their proxy servers. In my opinion, locking out users who use a proxy server is not a wise move.
The bottom line is probably that these things are not worth worrying about unless you have a lot of traffic going to waste because of malicious activity.
I personally would focus more on securing the basics of the website, like forms, code, and open ports, than on blocking. A visit counts anyway! ;)
...what's wrong with setting up domain.com/bottrap, disallowing access to it through robots.txt, capturing the naughty bot, putting its IP in a .txt array, and denying it access with a 403 header forever?
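A rough sketch of that bot-trap idea, with the file names made up for the example:
// bottrap.php - never linked visibly and disallowed in robots.txt:
//   User-agent: *
//   Disallow: /bottrap.php
// Any client that requests it anyway gets its IP appended to the blocklist
file_put_contents(__DIR__ . '/badbots.txt', $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);
header('HTTP/1.1 403 Forbidden');
exit;
And at the top of the regular pages:
// Deny blocklisted IPs with a 403 forever
$badBots = @file(__DIR__ . '/badbots.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($badBots !== false && in_array($_SERVER['REMOTE_ADDR'], $badBots, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}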
PHP Limit/Block Website requests for Spiders/Bots/Clients etc.
Here I have written a PHP function which can block unwanted requests to reduce your website traffic. Good for spiders, bots and annoying clients.
CLIENT/Bots Blocker
DEMO: http://szczepan.info/9-webdesign/php/1-php-limit-block-website-requests-for-spiders-bots-clients-etc.html
CODE:
/* Function which can block unwanted requests
 * @return boolean/array status
 */
function requestBlocker()
{
    /*
        Version 1.0 11 Jan 2013
        Author: Szczepan K
        http://www.szczepan.info
        me[#] szczepan [dot] info

        ###Description###
        A PHP function which can block unwanted requests to reduce your website traffic.
        Good for spiders, bots and annoying clients.
    */

    $dir = 'requestBlocker/'; ## Create & set directory writeable!!!!

    $rules = array(
        # You can add multiple rules in an array like this one here
        # Notice that large "sec definitions" (like 60*60*60) will blow up your client file
        array(
            // if >5 requests in 5 seconds then block client for 15 seconds
            'requests' => 5,       // 5 requests
            'sek' => 5,            // 5 requests in 5 seconds
            'blockTime' => 15      // block client for 15 seconds
        ),
        array(
            // if >10 requests in 30 seconds then block client for 20 seconds
            'requests' => 10,      // 10 requests
            'sek' => 30,           // 10 requests in 30 seconds
            'blockTime' => 20      // block client for 20 seconds
        ),
        array(
            // if >200 requests in 1 hour then block client for 10 minutes
            'requests' => 200,     // 200 requests
            'sek' => 60 * 60,      // 200 requests in 1 hour
            'blockTime' => 60 * 10 // block client for 10 minutes
        )
    );

    $time = time();
    $blockIt = array();
    $user = array();

    # Set a unique name for each client file
    $user[] = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'IP_unknown';
    $user[] = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $user[] = strtolower(gethostbyaddr($user[0]));

    # Notice that I use files because bots do not accept sessions
    $botFile = $dir . substr($user[0], 0, 8) . '_' . substr(md5(join('', $user)), 0, 5) . '.txt';

    if (file_exists($botFile)) {
        $file = file_get_contents($botFile);
        $client = unserialize($file);
    } else {
        $client = array();
        $client['time'][$time] = 0;
    }

    # Set/unset block time for blocked clients
    if (isset($client['block'])) {
        foreach ($client['block'] as $ruleNr => $timestampPast) {
            $left = $time - $timestampPast;
            if ($left > $rules[$ruleNr]['blockTime']) {
                unset($client['block'][$ruleNr]);
                continue;
            }
            $blockIt[] = 'Block active for Rule: ' . $ruleNr . ' - unlock in ' . ($rules[$ruleNr]['blockTime'] - $left) . ' Sec.';
        }
        if (!empty($blockIt)) {
            return $blockIt;
        }
    }

    # Log/count each access
    if (!isset($client['time'][$time])) {
        $client['time'][$time] = 1;
    } else {
        $client['time'][$time]++;
    }

    # Check the rules for this client
    $min = array(0);
    foreach ($rules as $ruleNr => $v) {
        $i = 0;
        $tr = false;
        $sum[$ruleNr] = 0;
        $requests = $v['requests'];
        $sek = $v['sek'];
        foreach ($client['time'] as $timestampPast => $count) {
            if (($time - $timestampPast) < $sek) {
                $sum[$ruleNr] += $count;
                if ($tr == false) {
                    # Remember the first timestamp index still needed for this rule
                    $min[] = $i;
                    unset($min[0]);
                    $tr = true;
                }
            }
            $i++;
        }
        if ($sum[$ruleNr] > $requests) {
            $blockIt[] = 'Limit : ' . $ruleNr . '=' . $requests . ' requests in ' . $sek . ' seconds!';
            $client['block'][$ruleNr] = $time;
        }
    }
    $min = min($min) - 1;

    # Drop timestamps that are no longer needed by any rule from the client file
    $i = 0;
    foreach ($client['time'] as $k => $v) {
        if ($i < $min) {
            unset($client['time'][$k]);
        }
        $i++;
    }

    $file = file_put_contents($botFile, serialize($client));

    return $blockIt;
}
if ($t = requestBlocker()) {
    echo 'dont pass here!';
    print_r($t);
} else {
    echo "go on!";
}