Will there be a performance hit when I add this to my .htaccess file:
HOWTO stop automated spam-bots using .htaccess
or should I add it to my PHP file instead?
or leave it out completely? Because spammers might fake their useragent anyway?
Would it also make sense to prevent users from accessing your website via a proxy server? I know that this might also block people from accessing your website who didn't come here with bad intentions. But, what are some of the reasons why people would visit a website via a proxy server, other than spam, or when a website is blocked in their country?
Will there be a performance hit when I add this to my .htaccess file?
Possibly, if you have thousands or tens of thousands of user agent strings to match against. Apache has to check this rule on every request.
or should I add it to my PHP file instead?
No. Apache's parsing of .htaccess will still be quicker than handing the request to PHP, because for PHP Apache has to start a PHP interpreter process for every request.
or leave it out completely? Because spammers might fake their useragent anyway?
Probably yes. It is very likely that most malicious spam bots will be faking a standard user agent.
But, what are some of the reasons why people would visit a website via a proxy server, other than spam, or when a website is blocked in their country?
There are a lot of legitimate uses for a proxy server. One is mobile clients that use some sort of prefetching to save mobile traffic. There are also some ISPs that force their clients to use their proxy servers. In my opinion, locking out users who use a proxy server is not a wise move.
The bottom line is probably that these things are not worth worrying about unless you have a lot of traffic going to waste because of malicious activities.
I personally would focus more on securing the basics of the website, like forms, code, and open ports, rather than on blocking. A visit counts anyway! ;)
...what's wrong with setting up domain.com/bottrap, disallowing access to it through robots.txt, capturing the naughty bot, putting its IP in a .txt array, and denying it access with a 403 header forever? A minimal sketch of this is below.
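To make the idea concrete, here is a minimal sketch of such a bot trap, assuming robots.txt contains "Disallow: /bottrap.php". The file names, the flat-file ban list and the 403 handling are my own assumptions, not a tested recipe:
<?php
// bottrap.php - disallowed in robots.txt and never linked for human visitors.
// Any client that ignores robots.txt and requests this URL gets its IP logged.
$banFile = __DIR__ . '/banned_ips.txt';
file_put_contents($banFile, $_SERVER['REMOTE_ADDR'] . PHP_EOL, FILE_APPEND | LOCK_EX);
header('HTTP/1.1 403 Forbidden');
exit;

// In a shared include at the top of normal pages, keep trapped IPs out:
// $banned = file_exists($banFile) ? file($banFile, FILE_IGNORE_NEW_LINES) : array();
// if (in_array($_SERVER['REMOTE_ADDR'], $banned, true)) {
//     header('HTTP/1.1 403 Forbidden');
//     exit;
// }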
PHP Limit/Block Website requests for Spiders/Bots/Clients etc.
Here I have written a PHP function which can block unwanted requests to reduce your website traffic. Good for spiders, bots, and annoying clients.
CLIENT/Bots Blocker
DEMO: http://szczepan.info/9-webdesign/php/1-php-limit-block-website-requests-for-spiders-bots-clients-etc.html
CODE:
/* Function which can Block unwanted Requests
* @return boolean/array status
*/
function requestBlocker()
{
/*
Version 1.0 11 Jan 2013
Author: Szczepan K
http://www.szczepan.info
me[#] szczepan [dot] info
###Description###
A PHP function which can Block unwanted Requests to reduce your Website-Traffic.
Good for Spiders, Bots and annoying Clients.
*/
$dir = 'requestBlocker/'; ## Create this directory and make it writable!
$rules = array(
#You can add multiple rules in an array like this one here
#Notice that large "sec definitions" (like 60*60*60) will blow up your client File
array(
//if >5 requests in 5 Seconds then Block client 15 Seconds
'requests' => 5, //5 requests
'sek' => 5, //5 requests in 5 Seconds
'blockTime' => 15 // Block client 15 Seconds
),
array(
//if >10 requests in 30 Seconds then Block client 20 Seconds
'requests' => 10, //10 requests
'sek' => 30, //10 requests in 30 Seconds
'blockTime' => 20 // Block client 20 Seconds
),
array(
//if >200 requests in 1 Hour then Block client 10 Minutes
'requests' => 200, //200 requests
'sek' => 60 * 60, //200 requests in 1 Hour
'blockTime' => 60 * 10 // Block client 10 Minutes
)
);
$time = time();
$blockIt = array();
$user = array();
#Set Unique Name for each Client-File
$user[] = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'IP_unknown';
$user[] = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$user[] = strtolower(gethostbyaddr($user[0]));
# Notice that I use files because bots do not accept sessions
$botFile = $dir . substr($user[0], 0, 8) . '_' . substr(md5(join('', $user)), 0, 5) . '.txt';
if (file_exists($botFile)) {
$file = file_get_contents($botFile);
$client = unserialize($file);
} else {
$client = array();
$client['time'][$time] = 0;
}
# Set/Unset Blocktime for blocked Clients
if (isset($client['block'])) {
foreach ($client['block'] as $ruleNr => $timestampPast) {
$left = $time - $timestampPast;
if (($left) > $rules[$ruleNr]['blockTime']) {
unset($client['block'][$ruleNr]);
continue;
}
$blockIt[] = 'Block active for Rule: ' . $ruleNr . ' - unlock in ' . ($rules[$ruleNr]['blockTime'] - $left) . ' Sec.';
}
if (!empty($blockIt)) {
return $blockIt;
}
}
# log/count each access
if (!isset($client['time'][$time])) {
$client['time'][$time] = 1;
} else {
$client['time'][$time]++;
}
#check the Rules for Client
$min = array(
0
);
foreach ($rules as $ruleNr => $v) {
$i = 0;
$tr = false;
$sum[$ruleNr] = 0;
$requests = $v['requests'];
$sek = $v['sek'];
foreach ($client['time'] as $timestampPast => $count) {
if (($time - $timestampPast) < $sek) {
$sum[$ruleNr] += $count;
if ($tr == false) {
#remember the index of the oldest timestamp still needed by a rule
$min[] = $i;
unset($min[0]);
$tr = true;
}
}
$i++;
}
if ($sum[$ruleNr] > $requests) {
$blockIt[] = 'Limit : ' . $ruleNr . '=' . $requests . ' requests in ' . $sek . ' seconds!';
$client['block'][$ruleNr] = $time;
}
}
$min = min($min) - 1;
#drop timestamps that are no longer needed by any rule
$i = 0;
foreach ($client['time'] as $k => $v) {
if ($i < $min) {
unset($client['time'][$k]);
}
$i++;
}
$file = file_put_contents($botFile, serialize($client));
return $blockIt;
}
if ($t = requestBlocker()) {
echo "don't pass here!";
print_R($t);
} else {
echo "go on!";
}
I'm looking for good PHP code for banning some spammers' IPs. My server gives me a 500 error if I use .htaccess.
This will do the work
$getip = $_SERVER["REMOTE_ADDR"];
$_banned_ip = false; // make sure the flag exists even when no rule matches
$banned_ip = array();
$banned_ip[] = '194.9.94.*';
$banned_ip[] = '77.105.2.*';
foreach($banned_ip as $banned)
{
$blacked=str_replace('*', '', $banned);
$len=strlen($blacked);
if ($getip==$blacked || substr($getip, 0, $len)==$blacked)
{
$_banned_ip=true;
}
}
if($_banned_ip==true){
echo 'THIS IP IS BANNED!';
exit;
}
The simplest way would be to have a database that keeps a list of the banned ip addresses, if you want to do it on the PHP end rather than directly in the server.
for($i = 0; $i < count($listOfIps); $i++) {
if($listOfIps[$i] == filteredIP($_SERVER['REMOTE_ADDR'])) { //filteredIP is not a native function; it represents however you want to filter the IP addresses which are sent to you
$banned = true;
}
}
if($banned):
//redirect user or kill script
else:
//render page
endif;
However, there may be better solutions depending on the page or application specifics, but this is the best solution I can think of based on your question. A database-backed variant is sketched below.
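If the ban list does live in a database, the loop can be replaced by a single indexed lookup. A minimal sketch with PDO; the connection details, table and column names are assumptions:
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('SELECT 1 FROM banned_ips WHERE ip = ? LIMIT 1'); // banned_ips.ip should be indexed
$stmt->execute(array($_SERVER['REMOTE_ADDR']));
if ($stmt->fetchColumn()) {
    header('HTTP/1.1 403 Forbidden');
    exit('THIS IP IS BANNED!');
}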
I'm developing an API to let my users access to files stored on another server.
Let's call my two servers, server 1 and server 2!
server 1 is the server where I'm hosting my web site, and
server 2 is the server where I'm storing my files.
My site is basically a JavaScript-based one, so I will be using JavaScript to post data to the API when a user needs to access files stored on server 2.
When a user requests access to files, the data will be posted to the API URL via JavaScript. The API is written in PHP. Using that PHP script (API) on server 1, I will make another request to server 2 asking for the files, so there will be another PHP script (API) on server 2.
I need to know how I should do this authentication between the two servers, as server 2 has no access to the user details on server 1.
I hope to do it like this; I can use the method which is used by most payment gateways.
When the API on server 2 receives a request with some unique data about the user, it posts that data back over SSL to the server 1 API, which matches it against the user data in the database, and then posts the result back over SSL to server 2, so server 2 knows the file request is genuine.
In this case, what kind of user data/credentials should the server 1 API post to server 2, and what should the server 2 API post back to server 1? And which user data should be matched against the data in the database: user ID, session, cookies, IP, timestamp, etc.?
Any clear and described answer would be nice! Thanks.
I would go with this:
user initiates an action, JavaScript asks Server 1 (via Ajax) for a file on Server 2
Server 1 creates URL using hash_hmac with data: file, user ID, user secret
when clicking that URL (server2.com/?file=FILE&user_id=ID&hash=SHA_1_HASH) server 2 asks server 1 for validation (sends file, user_id and hash)
server 1 does the validation, sends response to server 2
server 2 pushes file or sends 403 HTTP response
This way, server 2 only needs to consume the API of server 1; server 1 has all the logic.
Pseudocode for hash and url creation:
// getHash($userId, $file) method
$user = getUser($userId);
$hash = hash_hmac('sha1', $userId . $file, $user->getSecret());
// getUrl($userId, $file) method
return sprintf('http://server2.com/get-file?file=%s&user_id=%s&hash=%s',
urlencode($file),
$userId,
$security->getHash($userId, $file)
);
Pseudocode for validation:
$hash = $security->getHash($_GET['id'], $_GET['file']);
if ($hash === $_GET['hash']) {
// All is good
}
Edit: the getHash() method accepts a user ID and a file (ID or string, whatever suits your needs). With that data it produces a hash, using the hash_hmac function. For the secret parameter of hash_hmac, the user's "secret key" is used. That key would be stored together with the user's data in the db table. It would be generated with mt_rand or even something stronger, such as reading /dev/random or using something like https://stackoverflow.com/a/16478556/691850. A sketch of generating such a key is below.
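As a rough illustration of that last point, the per-user secret could be generated once and stored alongside the user row. The users table/column names, the $pdo connection, $userId, and the use of openssl_random_pseudo_bytes() (instead of mt_rand or /dev/random) are my assumptions:
// Generate a per-user secret once and store it with the user (hypothetical users.secret column).
$secret = bin2hex(openssl_random_pseudo_bytes(32)); // 64 hex characters
$stmt   = $pdo->prepare('UPDATE users SET secret = ? WHERE id = ?');
$stmt->execute(array($secret, $userId));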
A word of advice, use mod_xsendfile on server 2 (if it is Apache) to push files.
Introduction
You can use two simple methods:
Authentication Token
Signed Request
You can also combine both of them by using the token for authentication and the signature to verify the integrity of the message sent.
Authentication Token
If you are going to match any identification against the database, perhaps you can consider creating an authentication token rather than using user ID, session, cookies, IP, timestamp, etc., as suggested.
Create a random token and save to Database
$token = bin2hex(mcrypt_create_iv(64, MCRYPT_DEV_URANDOM));
This can be easily generated
It is guaranteed to be more difficult to guess than a password
It can easily be deleted if compromised and another key generated
Signed Request
The concept is simple: each file uploaded must carry a specific signature created using a randomly generated key, just like the token, for each specific user.
This can easily be implemented with HMAC via the hash_hmac_file function.
Combine Both Authentication & Signed Request
Here is a simple proof of concept.
Server 1
/**
* This should be stored securely
* Only known to User
* Unique to each User
* Eg : mcrypt_create_iv(32, MCRYPT_DEV_URANDOM);
*/
$key = "d767d183315656d90cce5c8a316c596c971246fbc48d70f06f94177f6b5d7174";
$token = "3380cb5229d4737ebe8e92c1c2a90542e46ce288901da80fe8d8c456bace2a9e";
$url = "http://server 2/run.php";
// Start File Upload Manager
$request = new FileManager($key, $token);
// Send Multiple Files
$responce = $request->send($url, [
"file1" => __DIR__ . "/a.png",
"file2" => __DIR__ . "/b.css"
]);
// Decode Responce
$json = json_decode($responce->data, true);
// Output Information
foreach($json as $file) {
printf("%s - %s \n", $file['name'], $file['msg']);
}
Output
temp\14-a.png - OK
temp\14-b.css - OK
Server 2
// Where to store the files
$tmpDir = __DIR__ . "/temp";
try {
$file = new FileManager($key, $token);
echo json_encode($file->recive($tmpDir), 128);
} catch (Exception $e) {
echo json_encode([
[
"name" => "Execption",
"msg" => $e->getMessage(),
"status" => 0
]
], 128);
}
Class Used
class FileManager {
private $key;
function __construct($key, $token) {
$this->key = $key;
$this->token = $token;
}
function send($url, $files) {
$post = [];
// Convert to array format
$files = is_array($files) ? $files : [
$files
];
// Build Post Request
foreach($files as $name => $file) {
$file = realpath($file);
if (!is_file($file) || !is_readable($file)) {
throw new InvalidArgumentException("Invalid File");
}
// Add File
$post[$name] = "#" . $file;
// Sign File
$post[$name . "-sign"] = $this->sign($file);
}
// Start Curl ;
$ch = curl_init($url);
$options = [
CURLOPT_HTTPHEADER => [
"X-TOKEN:" . $this->token
],
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_POST => count($post),
CURLOPT_POSTFIELDS => $post
];
curl_setopt_array($ch, $options);
// Get Responce
$responce = [
"data" => curl_exec($ch),
"error" => curl_error($ch),
"error" => curl_errno($ch),
"info" => curl_getinfo($ch)
];
curl_close($ch);
return (object) $responce;
}
function recive($dir) {
if (! isset($_SERVER['HTTP_X_TOKEN'])) {
throw new ErrorException("Missing Security Token");
}
if ($_SERVER['HTTP_X_TOKEN'] !== $this->token) {
throw new ErrorException("Invalid Security Token");
}
if (empty($_FILES)) {
throw new ErrorException("File was not uploaded");
}
$responce = [];
foreach($_FILES as $name => $file) {
$responce[$name]['status'] = 0;
// check if file is uploaded
if ($file['error'] == UPLOAD_ERR_OK) {
// Check for signature
if (isset($_POST[$name . '-sign']) && $_POST[$name . '-sign'] === $this->sign($file['tmp_name'])) {
$path = $dir . DIRECTORY_SEPARATOR . $file['name'];
$x = 0;
while(file_exists($path)) {
$x ++;
$path = $dir . DIRECTORY_SEPARATOR . $x . "-" . $file['name'];
}
// Move File to temp folder
move_uploaded_file($file['tmp_name'], $path);
$responce[$name]['name'] = $path;
$responce[$name]['sign'] = $_POST[$name . '-sign'];
$responce[$name]['status'] = 1;
$responce[$name]['msg'] = "OK";
} else {
$responce[$name]['msg'] = sprintf("Invalid File Signature");
}
} else {
$responce[$name]['msg'] = sprintf("Upload Error : %s" . $file['error']);
}
}
return $responce;
}
private function sign($file) {
return hash_hmac_file("sha256", $file, $this->key);
}
}
Other things to consider
For better security you can consider the following:
IP Lock down
File Size Limit
File Type Validation
Public-Key Cryptography
Date-based (rotating) token generation
Conclusion
The sample class can be extended in many ways, and rather than using URLs you can consider a proper JSON-RPC solution.
A long enough, single-use, short-lived, randomly generated key should suffice in this case.
Client requests for a file to Server 1
Server 1 confirms login information and generates a long single-use key and sends it to the user. Server 1 keeps track of this key and matches it with an actual file on Server 2.
Client sends a request to Server 2 along with the key
Server 2 contacts Server 1 and submits the key
Server 1 returns a file path if the key is valid. The key is invalidated (destroyed).
Server 2 sends the file to the client
Server 1 invalidates the key after say 30 seconds, even if it didn't receive a confirmation request from Server 2. Your front-end should account for this case and retry the process a couple of times before returning an error.
I do not think there is a point in sending cookie/session information along; this information can be brute-forced just like the random key.
A 1024-bit long key sounds more than reasonable. That much entropy can be obtained with a string of fewer than 200 alphanumeric characters. A minimal sketch of this flow is below.
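A minimal sketch of the Server 1 side of that flow, assuming a PDO connection in $pdo, the requested path in $filePath, and a hypothetical download_keys table; the names and the 30-second expiry handling are illustrative, not a definitive implementation:
// Server 1: issue a single-use key tied to a file path, valid for 30 seconds.
$key  = bin2hex(openssl_random_pseudo_bytes(64)); // 128 hex characters
$stmt = $pdo->prepare('INSERT INTO download_keys (download_key, file_path, expires_at) VALUES (?, ?, ?)');
$stmt->execute(array($key, $filePath, time() + 30));

// Server 1: validation endpoint called by Server 2 - return the path once, then destroy the key.
$stmt = $pdo->prepare('SELECT file_path FROM download_keys WHERE download_key = ? AND expires_at > ?');
$stmt->execute(array($_POST['key'], time()));
$path = $stmt->fetchColumn();
$pdo->prepare('DELETE FROM download_keys WHERE download_key = ?')->execute(array($_POST['key']));
echo $path ? $path : ''; // an empty response means the key is invalid or expired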
For the absolute best security you would need some communication from server 2 to server 1, to double-check whether the request is valid. Although this communication could be minimal, it is still communication and thus slows down the process.
If you could live with a marginally less secure solution, I would suggest the following.
Server 1 requestfile.php:
<?php
//check login
if (!$loggedon) {
die('You need to be logged on');
}
$dataKey = array();
$uniqueKey = 'fgsdjk%^347JH$#^%&5ghjksc'; //choose whatever you want.
//check file
$file = isset($_GET['file']) ? $_GET['file'] : '';
if (empty($file)) {
die('Invalid request');
}
//add user data to create a reasonably unique fingerprint.
//It will most likely be the same for people in the same office with the same browser; that's mainly where the security drop comes from.
//I double check if all variables are set just to be sure. Most of these will never be missing.
if (isset($_SERVER['HTTP_USER_AGENT'])) {
$dataKey[] = $_SERVER['HTTP_USER_AGENT'];
}
if (isset($_SERVER['REMOTE_ADDR'])) {
$dataKey[] = $_SERVER['REMOTE_ADDR'];
}
if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT_LANGUAGE'];
}
if (isset($_SERVER['HTTP_ACCEPT_ENCODING'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT_ENCODING'];
}
if (isset($_SERVER['HTTP_ACCEPT'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT'];
}
//also add the unique key
$dataKey[] = $uniqueKey;
//add the file
$dataKey[] = $file;
//add a timestamp. Since the requests will be at different times, don't use the exact second
//make sure its added last
$dataKey[] = date('YmdHi');
//create a hash
$hash = md5(implode('-', $dataKey));
//send to server 2
header('Location: https://server2.com/download.php?file='.urlencode($file).'&key='.$hash);
?>
On server 2 you will do almost the same.
<?php
$valid = false;
$dataKey = array();
$uniqueKey = 'fgsdjk%^347JH$#^%&5ghjksc'; //same as on server one
//check file
$file = isset($_GET['file']) ? $_GET['file'] : '';
if (empty($file)) {
die('Invalid request');
}
//check key
$key = isset($_GET['key']) ? $_GET['key'] : '';
if (empty($key)) {
die('Invalid request');
}
//add user data to create a reasonably unique fingerprint.
if (isset($_SERVER['HTTP_USER_AGENT'])) {
$dataKey[] = $_SERVER['HTTP_USER_AGENT'];
}
if (isset($_SERVER['REMOTE_ADDR'])) {
$dataKey[] = $_SERVER['REMOTE_ADDR'];
}
if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT_LANGUAGE'];
}
if (isset($_SERVER['HTTP_ACCEPT_ENCODING'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT_ENCODING'];
}
if (isset($_SERVER['HTTP_ACCEPT'])) {
$dataKey[] = $_SERVER['HTTP_ACCEPT'];
}
//also add the unique key
$dataKey[] = $uniqueKey;
//add the file
$dataKey[] = $file;
//add a timestamp. Since the requests will be at different times, don't use the exact second
//keep the request time in a variable
$time = time();
$dataKey[] = date('YmdHi', $time);
//create a hash
$hash = md5(implode('-', $dataKey));
if ($hash == $key) {
$valid = true;
} else {
//perhaps the request to server one was made at 2013-06-26 14:59 and the request to server 2 came in at 2013-06-26 15:00
//It would still fail when the requests to server 1 and 2 are more than one minute apart, but I think that's an acceptable margin. You could always adjust for more margin though.
//drop the current time
$requesttime = array_pop($dataKey);
//go back one minute
$time -= 60;
//add the time again
$dataKey[] = date('YmdHi', $time);
//create a hash
$hash = md5(implode('-', $dataKey));
if ($hash == $key) {
$valid = true;
}
}
if ($valid!==true) {
die('Invalid request');
}
//all is ok. Put the code to download the file here
?>
You can restrict access to server 2 so that only server 1 is able to send requests to it. You can do this by whitelisting the IP of server 1 on the server side or by using an .htaccess file. In PHP you can do it by checking the IP the request came from and validating it against server 1's IP (a sketch follows below).
You can also write an algorithm which generates a unique number. Using that algorithm, generate a number on server 1 and send it to server 2 in the request. On server 2, check whether that number was generated by the algorithm; if yes, the request is valid.
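A minimal sketch of the whitelist check on server 2; the address shown is only a placeholder for server 1's real IP:
// server 2: accept requests from server 1 only (203.0.113.10 is a placeholder address).
$allowed = array('203.0.113.10');
if (!in_array($_SERVER['REMOTE_ADDR'], $allowed, true)) {
    header('HTTP/1.1 403 Forbidden');
    exit('Requests are only accepted from server 1');
}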
I'd go with simple symmetric encryption, where server 1 encodes the date and the authenticated user using a key known only to server 1 and server 2, and sends it to the client, who can't read it but can pass it to server 2 as a sort of ticket to authenticate himself. The date is important so that no client can reuse the same "ticket" over time. But at least one of the servers must know which user has access to which files, so unless you use dedicated folders or access groups you must keep the user and file info together. A sketch of such a ticket is below.
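A rough sketch of such an encrypted ticket using openssl_encrypt(); the shared key, the payload fields ($userId and $file as known on server 1) and the 60-second window are assumptions, not the answer's exact scheme:
// Server 1: build the ticket that the client forwards to server 2.
$sharedKey = 'key known only to server 1 and server 2'; // placeholder
$iv        = openssl_random_pseudo_bytes(16);
$payload   = json_encode(array('user' => $userId, 'file' => $file, 'time' => time()));
$ticket    = base64_encode($iv . openssl_encrypt($payload, 'aes-256-cbc', $sharedKey, OPENSSL_RAW_DATA, $iv));

// Server 2: decode the ticket and reject it if it cannot be read or is too old.
$raw     = base64_decode($_GET['ticket']);
$iv      = substr($raw, 0, 16);
$payload = json_decode(openssl_decrypt(substr($raw, 16), 'aes-256-cbc', $sharedKey, OPENSSL_RAW_DATA, $iv), true);
if (!$payload || time() - $payload['time'] > 60) {
    die('Invalid or expired ticket');
}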
I noticed that some users overload my website by downloading multiple files (for example, 500 files at the same time) and opening many pages in a short time. I want to show a captcha if unexpected navigation by a user is detected.
I know how to implement a captcha, but I can't figure out the best approach to detect traffic abuse using PHP.
A common approach is to use something like memcached to store the requests on a per-minute basis. I have open-sourced a small class that achieves this: php-ratelimiter
If you are interested in a more thorough explanation of why the requests need to be stored on a minute basis, check this post.
So to sum it up, your code could end up looking like this:
if (!verifyCaptcha()) {
$rateLimiter = new RateLimiter(new Memcache(), $_SERVER["REMOTE_ADDR"]);
try {
$rateLimiter->limitRequestsInMinutes(100, 5);
} catch (RateExceededException $e) {
displayCaptcha();
exit;
}
}
Actually, the code works on a per-minute basis, but you can quite easily adapt it to a per-30-seconds basis:
private function getKeys($halfminutes) {
$keys = array();
$now = time();
for ($time = $now - $halfminutes * 30; $time <= $now; $time += 30) {
$keys[] = $this->prefix . date("dHis", $time);
}
return $keys;
}
Introduction
A similar question has been answered before (Prevent PHP script from being flooded), but it might not be sufficient, for these reasons:
It uses $_SERVER["REMOTE_ADDR"], and some shared connections have the same public IP address
There are many Firefox add-ons that allow users to use a different proxy for each request
Multiple Request != Multiple Download
Preventing multiple requests is totally different from preventing multiple downloads. Why?
Let's imagine a file of 10 MB that takes 1 min to download. If you limit users to, say, 100 requests per minute, it means you are giving each user access to download
10MB * 100 per min
To fix this issue you can look at Download - max connections per user? (a rough sketch of one approach follows below).
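For illustration only, one crude way to cap simultaneous downloads per IP is with marker files; this is not the linked answer's method, and the limit of 3 and the temp-dir paths are arbitrary assumptions:
// Refuse a new download when this IP already has 3 running.
$ip      = preg_replace('/[^0-9a-fA-F.:]/', '', $_SERVER['REMOTE_ADDR']);
$markers = glob(sys_get_temp_dir() . "/dl_{$ip}_*") ?: array();
if (count($markers) >= 3) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many parallel downloads');
}
$marker = sys_get_temp_dir() . "/dl_{$ip}_" . uniqid();
touch($marker);
register_shutdown_function(function () use ($marker) {
    @unlink($marker); // free the slot when this download script ends
});
// ... stream the requested file to the client here ...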
Multiple Request
Back to page access: you can use SimpleFlood, which extends Memcache to limit users per second. It uses cookies to resolve the shared-connection issue and attempts to get the real IP address.
$flood = new SimpleFlood();
$flood->addserver("127.0.0.1"); // add memcache server
$flood->setLimit(2); // expect 1 request every 2 sec
try {
$flood->check();
} catch ( Exception $e ) {
sleep(2); // Feel like wasting time
// Display Captcha
// Write Message to Log
printf("%s => %s %s", date("Y-m-d g:i:s"), $e->getMessage(), $e->getFile());
}
Please note that SimpleFlood::setLimit(float $float); accepts floats so you can have
$flood->setLimit(0.1); // expect 1 request every 0.1 sec
Class Used
class SimpleFlood extends \Memcache {
private $ip;
private $key;
private $penalty = 0;
private $limit = 100;
private $mins = 1;
private $salt = "I like TO dance A #### Lot";
function check() {
$this->parseValues();
$runtime = floatval($this->get($this->key));
$diff = microtime(true) - $runtime;
if ($diff < $this->limit) {
throw new Exception("Limit Exceeded By : $this->ip");
}
$this->set($this->key, microtime(true));
}
public function setLimit($limit) {
$this->limit = $limit;
}
private function parseValues() {
$this->ip = $this->getIPAddress();
if (! $this->ip) {
throw new Exception("Where the hell is the ip address");
}
if (isset($_COOKIE["xf"])) {
$cookie = json_decode($_COOKIE["xf"]);
if ($this->ip != $cookie->ip) {
unset($_COOKIE["xf"]);
setcookie("xf", null, time() - 3600);
throw new Exception("Last IP did not match");
}
if ($cookie->hash != sha1($cookie->key . $this->salt)) {
unset($_COOKIE["xf"]);
setcookie("xf", null, time() - 3600);
throw new Exception("Nice Faking cookie");
}
if (strpos($cookie->key, "floodIP") === 0) {
$cookie->key = "floodRand" . bin2hex(mcrypt_create_iv(50, MCRYPT_DEV_URANDOM));
}
$this->key = $cookie->key;
} else {
$this->key = "floodIP" . sha1($this->ip);
$cookie = (object) array(
"key" => $this->key,
"ip" => $this->ip
);
}
$cookie->hash = sha1($this->key . $this->salt);
$cookie = json_encode($cookie);
setcookie("xf", $cookie, time() + 3600); // expire in 1hr
}
private function getIPAddress() {
foreach ( array(
'HTTP_CLIENT_IP',
'HTTP_X_FORWARDED_FOR',
'HTTP_X_FORWARDED',
'HTTP_X_CLUSTER_CLIENT_IP',
'HTTP_FORWARDED_FOR',
'HTTP_FORWARDED',
'REMOTE_ADDR'
) as $key ) {
if (array_key_exists($key, $_SERVER) === true) {
foreach ( explode(',', $_SERVER[$key]) as $ip ) {
if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
return $ip;
}
}
}
}
return false;
}
}
Conclusion
This is a basic proof of concept, and additional layers can be added to it, such as:
Set different limits for different URLs
Add support for penalties where you block the user for a certain number of minutes or hours
Detection and different limits for Tor connections
etc
I think you can use sessions in this case.
Initialize a session to store a timestamp (use microtime for better results), and then get the timestamp of the new page. The difference can be used to analyze the frequency of pages being visited, and a captcha can be shown.
You can also run a counter of pages being visited and use a 2D array to store the page and timestamp. If the number of pages being visited increases suddenly, you can check the timestamp difference. A rough sketch is below.
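A rough sketch of that session-based idea; the 30-second window, the threshold of 20 pages and the showCaptcha() helper are all assumptions:
session_start();
$now = microtime(true);
if (!isset($_SESSION['hits'])) {
    $_SESSION['hits'] = array();
}
$_SESSION['hits'][] = $now;
// keep only the hits from the last 30 seconds
$_SESSION['hits'] = array_values(array_filter($_SESSION['hits'], function ($t) use ($now) {
    return ($now - $t) < 30;
}));
if (count($_SESSION['hits']) > 20) { // more than 20 pages in 30 seconds looks suspicious
    showCaptcha(); // hypothetical helper that renders your captcha
    exit;
}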
Hi everyone once again!
We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're going to have to create a $links_being_scanned array or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the multi-curl itself is not the tricky part, but the amount of operations done with each link after the request.
Even after the multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
Foreach ($Link_to_scann as $link)
If (There's less than 10 scanners running)
Launch_a_new_scanner($link)
Remove the link from $links_to_be_scanned array
Push the link into $links_on_queue array
Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process? (A rough sketch of that idea follows below.)
Since even when using multi-curl, I would eventually have to wait, looping in a regular foreach structure, for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
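A rough sketch of the pcntl_fork() idea mentioned above, kept to at most 10 scanners at once. scan_link() is a hypothetical helper for the per-link cURL/DOM work, and note that forked children cannot push newly found links into the parent's array directly; those would have to come back through a file, database or queue:
// Hedged sketch: fixed-size pool of scanner processes with pcntl_fork().
$maxWorkers = 10;
$children   = array();
while (!empty($links_to_be_scanned)) {
    while (count($children) >= $maxWorkers) { // wait for a free slot
        $pid = pcntl_wait($status);
        unset($children[$pid]);
    }
    $link = array_shift($links_to_be_scanned);
    $pid  = pcntl_fork();
    if ($pid === -1) {
        die('Could not fork');
    } elseif ($pid === 0) {
        scan_link($link); // child: scan one link, then exit
        exit(0);
    }
    $children[$pid] = $link; // parent: remember the child and continue
}
while (!empty($children)) { // wait for the remaining children
    $pid = pcntl_wait($status);
    unset($children[$pid]);
}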
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;
require dirname(__DIR__) . '/autoload.php';
$client = new Client;
// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);
$requests = array(
'so-home' => 'http://stackoverflow.com',
'so-php' => 'http://stackoverflow.com/questions/tagged/php',
'so-python' => 'http://stackoverflow.com/questions/tagged/python',
'so-http' => 'http://stackoverflow.com/questions/tagged/http',
'so-html' => 'http://stackoverflow.com/questions/tagged/html',
'so-css' => 'http://stackoverflow.com/questions/tagged/css',
'so-js' => 'http://stackoverflow.com/questions/tagged/javascript'
);
$onResponse = function($requestKey, Response $r) {
echo $requestKey, ' :: ', $r->getStatus();
};
$onError = function($requestKey, Exception $e) {
echo $requestKey, ' :: ', $e->getMessage();
};
$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method is making all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open up new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets become available.
You could try something like this; I haven't checked it, but you should get the idea:
$request_pool = array();
function CreateHandle($url) {
$handle = curl_init($url);
// set curl options here
return $handle;
}
function Process($data) {
global $request_pool;
// do something with data
array_push($request_pool , CreateHandle($some_new_url));
}
function RunMulti() {
global $request_pool;
$multi_handle = curl_multi_init();
$active_request_pool = array();
$running = 0;
$active_request_count = 0;
$active_request_max = 10; // adjust as necessary
do {
$waiting_request_count = count($request_pool);
while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
$request = array_shift($request_pool);
curl_multi_add_handle($multi_handle , $request);
$active_request_pool[(int)$request] = $request;
$waiting_request_count--;
$active_request_count++;
}
curl_multi_exec($multi_handle , $running);
curl_multi_select($multi_handle);
while($info = curl_multi_info_read($multi_handle)) {
$curl_handle = $info['handle'];
call_user_func('Process' , curl_multi_getcontent($curl_handle));
curl_multi_remove_handle($multi_handle , $curl_handle);
curl_close($curl_handle);
$active_request_count--;
}
} while($active_request_count > 0 || $waiting_request_count > 0);
curl_multi_close($multi_handle);
}
You should look for a more robust solution to your problem. RabbitMQ
is a very good solution that I have used. There is also Gearman, but the choice is yours.
I prefer RabbitMQ.
I will share with you the code which I have used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
And I do not use CURL here.
<?php
error_reporting(E_ALL);
$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');
function scan_page($home, $full_url, &$writer) {
static $done = array();
$done[] = $full_url;
// Scan only internal links. Do not scan all the internet!))
if (strpos($full_url, $home) === false) {
return false;
}
$html = @file_get_contents($full_url);
if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
return false;
}
echo $full_url . '<br />';
preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
if (!empty($emails) && is_array($emails)) {
foreach ($emails as $email_group) {
if (is_array($email_group)) {
foreach ($email_group as $email) {
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
$writer->write($email);
}
}
}
}
}
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
if (is_array($matches)) {
foreach($matches as $match) {
if (!empty($match[2]) && is_scalar($match[2])) {
$url = $match[2];
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$url = $home . $url;
}
if (!in_array($url, $done)) {
scan_page($home, $url, $writer);
}
}
}
}
}
class RWriter {
private $_fh = null;
private $_written = array();
public function __construct($fname) {
$this->_fh = fopen($fname, 'w+');
}
public function write($line) {
if (in_array($line, $this->_written)) {
return;
}
$this->_written[] = $line;
echo $line . '<br />';
fwrite($this->_fh, "{$line}\r\n");
}
public function __destruct() {
fclose($this->_fh);
}
}
scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);
We serve several high-load sites, and so far we have come up with this code for caching remote banners, with support for some basic local banner rotation. It should be fairly easy to understand; I would like to hear from you about possible improvements or suggestions on this code.
Here it is...
$cachetime = 6 * 60 * 60; // 6 hours
$bannercache = $_SERVER['DOCUMENT_ROOT']."/banner-".$bpos.".txt";
// Serve from the cache if it is younger than $cachetime
if (file_exists($bannercache) && (time() - $cachetime
< filemtime($bannercache)))
{
// if it's ok don't update from remote
} else {
// if cache is old, update from remote
$bannercachecontent = @file_get_contents('http://ADSERVER.com/showad.php?category='.$adcat.'&dimensions='.$dimensions);
if ($bannercachecontent === FALSE) {
// on error, just update local time, so that it's not pulled again in case of remote mysql overload
$fb = @fopen($bannercache, 'a+');
fwrite($fb, "\n<!-- Changed date on ".date('jS F Y H:i')."-->\n");
fclose($fb);
} else {
// if it's ok, save new local file
$fb = @fopen($bannercache, 'w');
fwrite($fb, $bannercachecontent);
fwrite($fb, "\n<!-- Cached ".date('jS F Y H:i')."-->\n");
fclose($fb);
}
}
$fhm = file_get_contents($bannercache);
$fhmpos = strpos($fhm, '-----#####-----'); // check if it needs to be exploded for rotation
if ($fhmpos === false) {
echo $fhm;
} else {
$fhmpicks = explode("-----#####-----", $fhm);
foreach ($fhmpicks as $fhmkey => $fhmvalue)
{
if (trim($fhmpicks[$fhmkey]) == '')
{
unset($fhmpicks[$fhmkey]);
}
}
$fhmpick = array_rand($fhmpicks,1);
echo $fhmpicks[$fhmpick]; // show only one banner
}
Don't let your clients update your banners. There will always be one visitor whose request has to do this work, and that isn't necessary:
Instead, let your clients always load the local image, and run a background process (cron, with a manual override if you want to pull the image NOW) that fetches the right image(s).
The code would be fairly simple:
your client does: $yourImage = $_SERVER['DOCUMENT_ROOT']."/banner-".$bpos.".txt";
and your update script can just cURL the right image; a sketch of that script follows below.
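A minimal sketch of such an update script, run from cron (for example every 6 hours); the ad-server URL, the banner positions and the file paths are assumptions taken from the question's code, not a tested implementation:
<?php
// update_banners.php - run from cron, e.g.:  0 */6 * * * php /path/to/update_banners.php
$positions = array(1, 2, 3); // whatever banner positions ($bpos) you actually use
foreach ($positions as $bpos) {
    $ch = curl_init('http://ADSERVER.com/showad.php?position=' . $bpos);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $banner = curl_exec($ch);
    curl_close($ch);
    if ($banner !== false && $banner !== '') {
        // clients keep reading the old file until a fresh copy has been fetched successfully
        file_put_contents(__DIR__ . '/banner-' . $bpos . '.txt', $banner, LOCK_EX);
    }
}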