How can I filter out hits from webcrawlers, bots and other non-human visitors?
I use maxmind.com to look up the city for each IP. It is not exactly cheap if I have to pay for ALL hits, including webcrawlers, robots, etc.
There are two general ways to detect robots and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder.
Polite
These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often you are crawled. Politeness is ensured through a robots.txt file, in which you specify which bots, if any, should be allowed to crawl your website and how often it may be crawled. This assumes that the robot you're dealing with is polite.
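For example, a robots.txt file at the root of your site along these lines tells compliant crawlers to stay out of a (placeholder) private directory and to slow down; note that Crawl-delay is non-standard and only honored by some crawlers:

User-agent: *
Disallow: /private/
Crawl-delay: 10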
Aggressive
Another way to keep bots off your site is to get aggressive.
User Agent
Some aggressive behavior includes (as previously mentioned by other users) the filtering of user-agent strings. This is probably the simplest, but also the least reliable, way to detect whether it's a user or not. A lot of bots tend to spoof user agents, and some do it for legitimate reasons (e.g. they only want to crawl mobile content), while others simply don't want to be identified as bots. Even worse, some bots spoof legitimate/polite bot agents, such as the user agents of Google, Microsoft, Lycos and other crawlers which are generally considered polite. Relying on the user agent can be helpful, but not by itself.
There are more aggressive ways to deal with robots that spoof user agents AND don't abide by your robots.txt file:
Bot Trap
I like to think of this as a "Venus Fly Trap," and it basically punishes any bot that wants to play tricks with you.
A bot trap is probably the most effective way to find bots that don't adhere to your robots.txt file without actually impairing the usability of your website. Creating a bot trap ensures that only bots are captured, not real users. The basic way to do it is to set up a directory which you specifically mark as off limits in your robots.txt file, so any robot that is polite will not fall into the trap. The second thing you do is place a "hidden" link from your website to the bot trap directory (this ensures that real users will never go there, since real users never click on invisible links). Finally, you ban any IP address that goes to the bot trap directory.
Here are some instructions on how to achieve this:
Create a bot trap (or in your case: a PHP bot trap).
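As a rough illustration only, a PHP bot trap page could look something like the sketch below; the /bot-trap/ location, the blacklist.txt flat file and the 403 response are all assumptions, and a real implementation would want file locking or a proper database:

<?php
// /bot-trap/index.php - hypothetical trap page; /bot-trap/ is disallowed in robots.txt,
// so any client requesting it has ignored your politeness settings. Record its IP.
$blacklist = __DIR__ . '/blacklist.txt';          // placeholder storage location
$ip = $_SERVER['REMOTE_ADDR'];

$known = file_exists($blacklist) ? file($blacklist, FILE_IGNORE_NEW_LINES) : array();
if (!in_array($ip, $known)) {
    file_put_contents($blacklist, $ip . PHP_EOL, FILE_APPEND);   // no locking; sketch only
}

http_response_code(403);
echo 'Nothing to see here.';

The matching robots.txt entry would be Disallow: /bot-trap/, and the hidden link pointing at that directory goes on your regular pages.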
Note: of course, some bots are smart enough to read your robots.txt file, see all the directories which you've marked as "off limits" and STILL ignore your politeness settings (such as crawl rate and allowed bots). Those bots will probably not fall into your bot trap despite the fact that they are not polite.
Violent
I think this is actually too aggressive for the general audience (and general use), so if there are any kids under the age of 18, then please take them to another room!
You can make the bot trap "violent" by simply not specifying a robots.txt file. In this situation ANY BOT that crawls the hidden links will probably end up in the bot trap and you can ban all bots, period!
The reason this is not recommended is that you may actually want some bots to crawl your website (such as Google, Microsoft or other bots for site indexing). Allowing your website to be politely crawled by the bots from Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it shows up when people search for it on their favorite search engine.
Self Destructive
Yet another way to limit what bots can crawl on your website is to serve CAPTCHAs or other challenges which a bot cannot solve. This comes at the expense of your users, and I would think that anything which makes your website less usable (such as a CAPTCHA) is "self destructive." This, of course, will not actually block the bot from repeatedly trying to crawl your website; it will simply make your website very uninteresting to them. There are ways to "get around" CAPTCHAs, but they're difficult to implement, so I'm not going to delve into this too much.
Conclusion
For your purposes, probably the best way to deal with bots is to employ a combination of the above mentioned strategies:
Filter user agents.
Set up a bot trap (the violent one).
Catch all the bots that go into the violent bot trap and simply blacklist their IPs (but don't block them). This way you will still get the "benefits" of being crawled by bots, but you will not have to pay to check the IP addresses that are blacklisted for walking into your bot trap (a minimal sketch of this check follows the list).
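To tie this back to the MaxMind cost problem: gate the paid lookup behind both checks. The sketch below assumes the hypothetical blacklist.txt file from the bot trap above and uses a deliberately small user-agent pattern; the actual MaxMind call is left as a placeholder:

<?php
// Sketch: only pay for a geo lookup when the hit does not look like a bot.
function ip_is_blacklisted($ip) {
    $list = file_exists('blacklist.txt') ? file('blacklist.txt', FILE_IGNORE_NEW_LINES) : array();
    return in_array($ip, $list);
}

function looks_like_bot_agent($ua) {
    // Tiny illustrative pattern; extend it with what you see in your own logs.
    return (bool) preg_match('/bot|crawl|spider|slurp/i', $ua);
}

$ip = $_SERVER['REMOTE_ADDR'];
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (!ip_is_blacklisted($ip) && !looks_like_bot_agent($ua)) {
    // Only now perform the paid MaxMind city lookup for this IP.
    // $city = lookup_city_via_maxmind($ip);   // placeholder for your existing call
}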
You can check USER_AGENT, something like:
function crawlerDetect($USER_AGENT)
{
    // Each entry: array(substring to look for, human-readable crawler name)
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Case-insensitive substring match against the supplied user agent
    foreach ($crawlers as $c)
    {
        if (stristr($USER_AGENT, $c[0]))
        {
            return $c[1];
        }
    }
    return false;
}
// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
The user agent ($_SERVER['HTTP_USER_AGENT']) often identifies whether the connecting agent is a browser or a robot. Review logs/analytics for the user agents of crawlers that visit your site. Filter accordingly.
Take note that the user agent is a header supplied by the client application. As such it can be pretty much anything and shouldn't be trusted 100%. Plan accordingly.
Checking the User-Agent will protect you from legitimate bots like Google and Yahoo.
However, if you're also being hit with spam bots, then chances are User-Agent comparison won't protect you, since those bots typically forge a common User-Agent string anyway. In that instance, you would need to employ more sophisticated measures. If user input is required, a simple image verification scheme like reCAPTCHA will work.
If you're looking to filter out all page hits from a bot, unfortunately, there's no 100% reliable way to do this if the bot is forging its credentials. This is just an annoying fact of life on the internet that web admins have to put up with.
I found this package, it's actively being developed and I'm quite liking it so far:
https://github.com/JayBizzle/Crawler-Detect
It's as simple as this:

use Jaybizzle\CrawlerDetect\CrawlerDetect;

$CrawlerDetect = new CrawlerDetect;

// Check the user agent of the current 'visitor'
if ($CrawlerDetect->isCrawler()) {
    // true if crawler user agent detected
}

// Pass a user agent as a string
if ($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
    // true if crawler user agent detected
}

// Output the name of the bot that matched (if any)
echo $CrawlerDetect->getMatches();
useragentstring.com serves a list that you can use to analyze the user agent string:

// Ask useragentstring.com to classify the current user agent (note: a remote HTTP call on every hit)
$api_request = "http://www.useragentstring.com/?uas=" . urlencode($_SERVER['HTTP_USER_AGENT']) . "&getJSON=all";
$ua = json_decode(file_get_contents($api_request), true);
if ($ua["agent_type"] == "Crawler") die();
Related
So far I am able to detect robots from a list of user agent strings by matching these strings against known user agents, but I was wondering what other methods there are for doing this in PHP, as I am catching fewer bots than expected with this approach.
I am also looking to find out how to detect whether a browser or robot is spoofing another browser via its user agent string.
Any advice is appreciated.
EDIT: This has to be done using a log file with lines as follows:
129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"
This means I can't check user behaviour aside from access times.
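Since you only have the log file to work with, one approach is to parse each line, pull out the IP, timestamp and user agent, and combine a user-agent keyword check with an access-rate check per IP. The sketch below is only illustrative: the file name, the regex and the "more than 10 hits in 10 seconds" threshold are all assumptions:

<?php
// Sketch: flag likely bots from an Apache combined-format access log.
$logFile = 'access.log';   // placeholder path
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"/';

$hits = array();   // ip => array of unix timestamps
$bots = array();   // ip => reason it was flagged

foreach (file($logFile, FILE_IGNORE_NEW_LINES) as $line) {
    if (!preg_match($pattern, $line, $m)) {
        continue;
    }
    list(, $ip, $time, $agent) = $m;

    // 1. Obvious bot keywords in the user agent
    if (preg_match('/bot|crawl|spider|slurp/i', $agent)) {
        $bots[$ip] = 'user agent';
        continue;
    }

    // 2. Access-time heuristic: many requests in a short window looks automated
    $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $time);
    if ($dt === false) {
        continue;
    }
    $now = $dt->getTimestamp();
    $hits[$ip][] = $now;
    $recent = array_filter($hits[$ip], function ($t) use ($now) {
        return $now - $t <= 10;            // last 10 seconds
    });
    if (count($recent) > 10) {             // arbitrary threshold
        $bots[$ip] = 'request rate';
    }
}

print_r($bots);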
In addition to filtering key words in the user agent string, I have had luck with putting a hidden honeypot link on all pages:
<a style="display:none" href="autocatch.php">A</a>
Then in "autocatch.php" record the session (or IP address) as a bot. This link is invisible to users but it's hidden characteristic would hopefully not be realized by bots. Taking the style attribute out and putting it into a CSS file might help even more.
Because, as previously stated, you can spoof user-agents & IP, these cannot be used for reliable bot detection.
I work for a security company and our bot detection algorithm looks something like this:
Step 1 - Gathering data:
a. Cross-check the user-agent against the IP (both need to be right).
b. Check the header parameters (what is missing, what the order is, etc.).
c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, visit rates, etc.).
Step 2 - Classification:
By cross-verifying the data, the bot is classified as "Good", "Bad" or "Suspicious".
Step 3 - Active Challenges:
Suspicious bots undergo the following challenges:
a. JS challenge (can it activate JS?)
b. Cookie challenge (can it accept cookies? see the sketch after this list)
c. If still not conclusive -> CAPTCHA
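As a very rough sketch of the JS/cookie challenge idea, you can set a cookie from JavaScript and check for it on the next request; a client that never sends it back either has no JS, rejects cookies, or is a simple bot. The cookie name and the reload flow below are assumptions, and a real version would cap the number of reload attempts:

<?php
// Sketch of a combined JS + cookie challenge ('js_ok' is an arbitrary cookie name).
if (isset($_COOKIE['js_ok'])) {
    // The client executed our JavaScript and returned the cookie: treat it as a browser.
    echo 'Challenge passed.';
} else {
    // Serve a tiny page whose only job is to set the cookie via JS and reload.
    echo '<html><body>
          <script>document.cookie = "js_ok=1; path=/"; location.reload();</script>
          <noscript>Please enable JavaScript and cookies.</noscript>
          </body></html>';
}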
This filtering mechanism is VERY effective, but I don't really think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot DB need to be constantly updated by a security team).
We offer some "do it yourself" tools in the form of Botopedia.org, our directory that can be used for IP/user-agent cross-verification, but for a truly efficient solution you will have to rely on specialized services.
There are several free bot monitoring solutions, including our own, and most will use the same strategy I've described above (or something similar).
GL
Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Often this will include checking for /robots.txt requests and hits that don't load images. Another trick is to ask the client whether it has JavaScript, since most bots won't mark it as enabled.
However, beware: you may well accidentally flag some visitors who are genuinely human.
No, user agents can be spoofed so they are not to be trusted.
In addition to checking for JavaScript or image/CSS loads, you can also measure page-load speed, as bots will usually crawl your site a lot faster than any human visitor would jump around. But this only works for small sites; on popular sites, many visitors behind a shared external IP address (a large corporation or university campus) might hit your site at bot-like rates.
I suppose you could also look at the order in which pages are loaded, since bots crawl in a first-come-first-crawled order whereas human users usually don't fit that pattern, but that's a bit more complicated to track.
Your question specifically relates to detection using the user agent string. As many have mentioned this can be spoofed.
To understand what is possible in spoofing, and to see how difficult it is to detect, you are probably best advised to learn the art yourself in PHP using cURL.
In essence, using cURL almost everything that can be sent in a browser (client) request can be spoofed, with the notable exception of the IP; but even here a determined spoofer will hide behind a proxy server to prevent you from detecting their IP.
It goes without saying that using the same parameters on every request will let a spoofer be detected, but rotating different parameters will make it very difficult, if not impossible, to pick out spoofers from genuine traffic logs.
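To see how little these request parameters can be trusted, here is a sketch of what a spoofer could do with PHP's cURL extension; the target URL, the proxy line and the copied browser User-Agent are all placeholders:

<?php
// Sketch: a request whose User-Agent and Referer are entirely fabricated.
$ch = curl_init('http://example.com/page-to-fetch');          // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36');
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');  // fake referrer
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');           // accept cookies...
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');          // ...and send them back
// curl_setopt($ch, CURLOPT_PROXY, 'proxy.example.com:8080'); // hide the real IP as well
$html = curl_exec($ch);
curl_close($ch);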
I have a website where I provide email encryption to users, and I'm trying to figure out if there's a way to detect whether a user is a human or a bot.
I've been digging into $_SESSION in PHP, but it's easy to bypass. I'm also not interested in captcha, user agent or login solutions. Any idea what I need?
There are other questions very similar to this one on SO, but I couldn't find any straight answer...
Any help will be very welcome, thank you all!
This is a hard problem, and no solution I know of is going to be 100% perfect from a bot-defending and usability perspective. If your attacker is really determined to use a bot on your site, they probably will be able to. If you take things far enough to make it impractical for a computer program to access anything on your site, it's likely no human will want to either, but you can strike a good balance.
My point of view on this is partially as a web developer, but more so from the other side of things, having written numerous web crawler programs for clients all over the world. Not all bots have malicious intent; they can be used for anything from automating form submissions to populating databases of doctors' office addresses or analyzing stock market data. If your site is well designed from a usability standpoint, there should be no need for a bot that "makes things easier" for a user, but there are cases where there are special needs you can't plan for.
Of course there are those who do have malicious intent, which you definitely want to protect your site against as well as possible. There is virtually no site that can't be automated in some way. Most sites are not difficult at all, but here are a few ideas off the top of my head, from other answers or comments on this page, and from my experience writing (non-malicious) bots.
Types of bots
First I should mention that there are two different categories I would put bots into:
General purpose crawlers, indexers, or bots
Special purpose bots, made specifically for your site to perform some task
Usually a general-purpose bot is going to be something like a search engine's indexer, or possibly some hacker's script that looks for a form to submit, uses a dictionary attack to search for a vulnerable URL, or something like this. They can also attack "engine sites", such as Wordpress blogs. If your site is properly secured with good passwords and the like, these aren't usually going to pose much of a risk to you (unless you do use Wordpress, in which case you have to keep up with the latest versions and security updates).
Special purpose "personalized" bots are the kind I've written. A bot made specifically for your site can be made to act very much like a human user of your site, including inserting time delays between form submissions, setting cookies, and so on, so they can be hard to detect. For the most part this is the kind I'm talking about in the rest of this answer.
Captchas
Captchas are probably the most common approach to making sure a user is humanoid, and generally they are difficult to automatically get around. However, if you simply require the captcha as a one-time thing when the user creates an account, for example, it's easy for a human to get past it and then give their shiny new account credentials to a bot to automate usage of the system.
I remember a few years ago reading about a pretty elaborate system to "automate" breaking captchas on a popular gaming site: a separate site was set up that loaded captchas from the gaming site, and presented them to users, where they were essentially crowd-sourced. Users on the second site would get some sort of reward for each correct captcha, and the owners of the site were able to automate tasks on the gaming site using their crowd-sourced captcha data.
Generally the use of a good captcha system will pretty well guarantee one thing: somewhere there is a human who typed the captcha text. What happens before and after that depends on how often you require captcha verification, and how determined the person making a bot is.
Cell-phone / credit-card verification
If you don't want to use Captchas, this type of verification is probably going to be pretty effective against all but the most determined bot-writer. While (just as with the captcha) it won't prevent an already-verified user from creating and using a bot, you can verify that a human being created the account, and if abused block that phone number/credit-card from being used to create another account.
Sites like Facebook and Craigslist have started using cell-phone verification to prevent spamming from bots. For example, in order to create apps on Facebook, you have to have a phone number on record, confirmed via text message or an automated phone call. Unless your attacker has access to a whole lot of active phone numbers, this could be an effective way to verify that a human created the account and that he only creates a limited number of accounts (one for most people).
Credit cards can also be used to confirm that a human is performing an action and limit the number of accounts a single human can create.
Other [less-effective] solutions
Log analysis
Analyzing your request logs will often reveal bots doing the same actions repeatedly, or sometimes using dictionary attacks to look for holes in your site's configuration. So logs will tell you after-the-fact whether a request was made by a bot or a human. This may or may not be useful to you, but if the requests were made on a cell-phone or credit-card verified account, you can lock the account associated with the offending requests to prevent further abuse.
Math/other questions
Math problems or other questions can be answered by a quick Google or Wolfram Alpha search, which can be automated by a bot. Some questions will be harder than others, but the big search companies are working against you here, making their engines better at understanding questions like this, and in turn making this a less viable option for verifying that a user is human.
Hidden form fields
Some sites employ a mechanism where parameters such as the coordinates of the mouse when they clicked the "submit" button are added to the form submission via javascript. These are extremely easy to fake in most cases, but if you see in your logs a whole bunch of requests using the same coordinates, it's likely they are a bot (although a smart bot could easily give different coordinates with each request).
Javascript Cookies
Since most bots don't load or execute javascript, cookies set using javascript instead of a set-cookie HTTP header will make life slightly more difficult for most would-be bot makers. But not so hard as to prevent the bot from manually setting the cookie as well, once the developer figures out how to generate the same value the javascript generates.
IP address
An IP address alone isn't going to tell you if a user is a human. Some sites use IP addresses to try to detect bots though, and it's true that a simple bot might show up as a bunch of requests from the same IP. But IP addresses are cheap, and with Amazon's EC2 service or similar cloud services, you can spawn a server and use it as a proxy. Or spawn 10 or 100 and use them all as proxies.
UserAgent string
This is so easy to manipulate in a crawler that you can't count on it to mark a bot that's trying not to be detected. It's easy to set the UserAgent to the same string one of the major browsers sends, and may even rotate between several different browsers.
Complicated markup
The most difficult site I ever wrote a bot for consisted of frames within frames within frames....about 10 layers deep, on each page, where each frame's src was the same base controller page, but had different parameters as to which actions to perform. The order of the actions was important, so it was tough to keep straight everything that was going on, but eventually (after a week or so) my bot worked, so while this might deter some bot makers, it won't be useful against all. And will probably make your site about a gazillion times harder to maintain.
Disclaimer & Conclusion
Not all bots are "bad". Most of the crawlers/bots I have made were for users who wanted to automate some process on the site, such as data entry, that was too tedious to do manually. So make tedious tasks easy! Or, provide an API for your users. Probably one of the easiest ways to discourage someone from writing a bot for your site is to provide API access. If you provide an API, it's a lot less likely someone will go to the effort to create a crawler for it. And you could use API keys to control how heavily someone uses it.
For the purpose of preventing spammers, some combination of captchas and account verification through cell numbers or credit cards is probably going to be the most effective approach. Add some logging analysis to identify and disable any malicious personalized bots, and you should be in pretty good shape.
My favorite way is presenting the "user" with a picture of a cat or a dog and asking, "Is this a cat or a dog?" No human ever gets that wrong; the computer gets it right perhaps 60% of the time (so you have to run it several times). There's a project that will give you bunches of pictures of cats and dogs -- plus, all the animals are available for adoption so if the user likes the pet, he can have it.
It's a Microsoft corporate project, which puts me in a state of cognitive dissonance, as if I found out that Harry Reid likes zydeco music or that George Bush smokes pot. Oh, wait...
I've seen/used a simple arithmetic problem with written numbers, e.g.:
Please answer the following question to prove you are human:
"What is two plus four?"
and similar simple questions which require reading:
"What is man's best friend?"
You can supply an endless stream of such questions, in case the person attempting access is unfamiliar with a particular subject, and they are accessible to all readers, etc.
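A minimal implementation could keep the expected answer in the session and check it on submit; the two questions below are just examples, and a real version would accept variants like "six" as well as "6":

<?php
// Sketch: a simple written-question challenge stored in the session.
session_start();

$questions = array(
    'What is two plus four?'     => '6',
    "What is man's best friend?" => 'dog',
);

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $expected = isset($_SESSION['answer']) ? $_SESSION['answer'] : null;
    $given    = isset($_POST['answer']) ? strtolower(trim($_POST['answer'])) : '';
    echo ($expected !== null && $given === $expected) ? 'Looks human.' : 'Please try again.';
} else {
    $q = array_rand($questions);               // pick a random question
    $_SESSION['answer'] = strtolower($questions[$q]);
    echo '<form method="post">' . htmlspecialchars($q)
       . ' <input name="answer"> <input type="submit" value="Go"></form>';
}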
There's a reason why companies use captchas or logins. As ugly of a solution as captchas are, they're currently the best (most accurate, least disruptive to users) way of weeding out bots. If a login solution doesn't work for you, I'm afraid the only realistic solution is a captcha.
If users will be filling in a form, honeypot fields are simple to implement, and can be reasonably effective, but nothing is perfect. Create one or more hidden fields in the form, and if they contain anything when the form is posted, reject the form. Spambots will usually attempt to fill in everything.
You do need to be aware of accessibility. Hidden fields probably won't be filled in by those using a standard browser (where the field is not visible), but those using screen readers may be presented with the field. Be sure to label it correctly so that these users do not fill it in. Perhaps with something like "Please help us to prevent spam by leaving this field empty". Also, if you do reject the form, be sure to reject it with helpful error messages, just in case it has been filled in by a human.
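A sketch of the honeypot-field idea; the field name website_url, the CSS class and the label wording are arbitrary, and the label follows the accessibility advice above:

<?php
// Sketch: reject submissions where the hidden honeypot field was filled in.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (!empty($_POST['website_url'])) {   // humans never see this field, so it stays empty
        exit('Sorry, your submission could not be processed.');
    }
    // ... handle the legitimate submission here ...
}
?>
<form method="post">
    <!-- Hidden with CSS, e.g. .hp { position: absolute; left: -9999px; } -->
    <p class="hp">
        <label for="website_url">Please help us prevent spam by leaving this field empty</label>
        <input type="text" id="website_url" name="website_url" autocomplete="off">
    </p>
    <p><label>Your comment <textarea name="comment"></textarea></label></p>
    <p><input type="submit" value="Send"></p>
</form>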
I suggest getting the Growmap Anti Spambot Wordpress plugin and seeing what code you can borrow from it or just using the same technique. I've found this plugin to be very effective for curtailing automated spam on my WordPress sites and I've started adapting the same technique for my ASP.NET sites.
The only thing it doesn't deal with are human cut-and-paste spammers.
I created a download script which makes users wait five seconds before the download starts automatically, and it also counts downloads. It's very simple. Now I need to find a way to block bots, because I want the download count to be as realistic as possible, meaning I want it to count only users actually downloading, not bots. Is there a bot list somewhere, or just a way to do what I need? Thanks.
Normal "bots" aren't able to run javascript, so they can't wait (download it).
You can add capcha if you're afraid there are bots with knowledge of "javascript"
Well-behaved robots should respect robots.txt, which allows you to instruct robots how they are allowed to crawl your website.
You cannot reliably block non-well-behaved robots (short of attempts at human detection such as CAPTCHAs, as others have suggested). Even though many robots set a special user agent (you can see examples here), a robot can set its user agent to anything it wants.
Use a captcha. I would suggest you use reCAPTCHA.
There are various methods that you could use to get rid of a bot, but they'll also filter out some real users:
Only allow clients that send an acceptable User-Agent string.
Only allow clients that have JavaScript enabled.
Only allow clients that have cookies enabled.
Only allow clients that uncheck a checkbox that says, "I'm a robot."
Only allow clients that don't fill out a honeypot text input.
Have a CAPTCHA (this is used by webmasters who hate their users and have no respect for them; only suggested for sadists and jerks)
You can pick and choose, or combine them to create your own flavor of robot discrimination.
Honeypot fields and timestamp analysis.
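The timestamp part can be as simple as remembering when the form was rendered and rejecting submissions that come back implausibly fast; the 3-second threshold in this sketch is an arbitrary assumption:

<?php
// Sketch: reject forms submitted faster than a human could plausibly fill them in.
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $rendered = isset($_SESSION['form_rendered_at']) ? $_SESSION['form_rendered_at'] : 0;
    if (time() - $rendered < 3) {               // arbitrary minimum fill-in time
        exit('Form submitted too quickly.');
    }
    // ... process the submission ...
} else {
    $_SESSION['form_rendered_at'] = time();     // remember when the form was shown
    // ... render the form (including any honeypot field) ...
}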
Time goes by, but there is still no perfect solution...
Does anyone have a bright idea for differentiating a bot from a human-loaded web page?
Is the state of the art still loading a long list of well-known SE bots and parsing the USER AGENT?
Testing has to be done before the page is loaded! No GIFs or captchas!
If possible, I would try a honeypot approach for this one. It will be invisible to most users and will discourage many bots, though not the ones determined to get through, since they could implement special code for your site that simply skips the honeypot field once they figure out your game. But that would take a lot more attention from the bot's owners than is probably worth it for most; there will be tons of other sites accepting spam without any additional effort on their part.
One thing that gets skipped over from time to time: it is important to let the bot think that everything went fine. No error messages, no denial pages; just reload the page as you would for any other user, except skip adding the bot's content to the site. This way there are no red flags that can be picked up in the bot's logs and acted upon by the owner; it will take much more scrutiny for them to figure out that you are disallowing the comments.
Without a challenge (like CAPTCHA), you're just shooting in the dark. User agent can trivially be set to any arbitrary string.
What the others have said is true to an extent... if a bot-maker wants you to think a bot is a genuine user, there's no way to avoid that. But many of the popular search engines do identify themselves. There's a list here (http://www.jafsoft.com/searchengines/webbots.html) among other places. You could load these into a database and search for them there. I seem to remember that it's against Google's user agreement to make custom pages for their bots though.
The user agent is set by the client and thus can be manipulated. A malicious bot thus certainly would not send you an I-Am-MalBot user agent, but would call itself some version of IE. So using the user agent to prevent spam or something similar is pointless.
So, what do you want to do? What's your final goal? If we knew that, we could be better help.
The creators of SO should know why they use captchas to prevent bots from editing content. The reason is that there is actually no way to be sure that a client is not a bot, and I think there never will be.
I code web crawlers myself, for different purposes, and I use a web browser's UserAgent.
As far as I know, you cannot distinguish bots from humans if a bot is using a legit UserAgent, like:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.11 (KHTML, like Gecko) Chrome/9.0.570.1 Safari/534.11
The only thing I can think of is JavaScript. Most custom web bots (like those that I code) can't execute JavaScript code, because that's a browser's job. But if the bot is linked to or driving a web browser (like Firefox), then it will go undetected.
I'm sure I'm going to get downvoted for this, but I had to post it:
Constructive
In any case, captchas are the best way right now to protect against bots, short of approving all user-submitted content.
-- Edit --
I just noticed your P.S., and I'm not sure of any way to diagnose a bot without interacting with it. Your best bet in this case might be to catch the bots as early as possible and implement a one-month IP restriction, after which time the bot should give up if you constantly return HTTP 404 to it. Bots are often run from a server and don't change their IP, so this should work as a mediocre approach.
I would suggest using Akismet, a spam prevention plugin, rather than any sort of captcha or CSS trick, because it is excellent at catching spam without ruining the user experience.
Honest bots, such as search engines, will typically access your robots.txt. From that you can learn their user agent string and add it to your bot list.
Clearly this doesn't help with malicious bots which are pretending to be human, but for some applications this could be good enough if all you want to do is filter search engine bots out of your logs (for example).
I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful: if a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a "not contained in" match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
The easiest way is to check whether their user agent includes 'bot' or 'spider'. Most do.
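In PHP that check is a one-liner; extend the pattern as new crawlers show up in your logs:

<?php
// Sketch: crude keyword match on the user agent string.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isBot = (bool) preg_match('/bot|spider|crawl|slurp/i', $ua);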
EDIT (10 years later): As Lukas said in the comment box, almost all crawlers today support JavaScript, so I've removed the paragraph that stated that if the site was JS-based, most bots would be automatically stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: Amazing work by eSniff, who has the above list "in a form that can be queried and parsed more easily: robotstxt.org/db/all.txt. Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.
Hope it helps!
Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers, at least the content type and cache control, but write an empty image out).
Some bots parse JS, but certainly no one loads CSS images. One pitfall, as with JS, is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (unexceptional) case that the more advanced bots (Google, Yahoo, etc.) may crawl them in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
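A sketch of such a camouflaged counter; the stats.log destination is a placeholder, and the base64 string is a standard 1x1 transparent GIF:

<?php
// counter.php - sketch: referenced from a stylesheet, e.g. background-image: url(/counter.php)
// Log the hit, then serve an empty image with the right headers.
file_put_contents(
    'stats.log',
    date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' '
        . (isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-') . PHP_EOL,
    FILE_APPEND
);

header('Content-Type: image/gif');
header('Cache-Control: no-cache, no-store, must-revalidate');   // make every page view hit us
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');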
I use the following for my stats/counter app:
<?php
function is_bot($user_agent) {
return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
}
//example usage
if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>
I removed a link to the original code source, because it now redirects to a food app.
Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a JavaScript focus event.
If the focus event fires, the page was almost certainly loaded by a human being.
Edit: it's true, people with Javascript turned off will not show up as humans, but that's not a large percentage of web users.
Edit2: Current bots can also execute Javascript, at least Google can.
I currently use AWStats and Webalizer to monitor my log files for Apache 2, and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as it is an open source project.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang
Rather than trying to maintain an impossibly long list of spider user agents, we look for things that suggest human behaviour. The principal one of these is that we split our session count into two figures: the number of single-page sessions and the number of multi-page sessions. We drop a session cookie and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - the referrer is Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their bot], and that behaviour looks a lot like a human!)
Of course this is not infallible, and in particular if you have lots of people who arrive and "click off", it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).
Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day, we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
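A sketch of the cookie-dropping part of this approach; the cookie names, the one-year lifetime and the classification variable are assumptions (and random_bytes() needs PHP 7; use uniqid() on older versions):

<?php
// Sketch: drop a session cookie and a persistent "Machine ID" cookie,
// then classify the hit for single- vs multi-page session counting.
session_start();                                   // PHP's own session cookie

$returning = isset($_COOKIE['machine_id']);
if (!$returning) {
    // First visit from this browser: issue a persistent machine ID (one year).
    setcookie('machine_id', bin2hex(random_bytes(16)), time() + 365 * 24 * 3600, '/');
}

$_SESSION['pages_viewed'] = isset($_SESSION['pages_viewed']) ? $_SESSION['pages_viewed'] + 1 : 1;

// A returning machine ID, or more than one page in this session, counts as multi-page.
$multiPage = $returning || $_SESSION['pages_viewed'] > 1;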
Record mouse movement and scrolling using JavaScript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.
Prerequisite: the referrer is set.
Apache level:
LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*) /b.gif [L]
SetEnv human_session 0
# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1
SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog
In the web page, embed a /human/$hashkey_of_current_url.gif.
If it is a bot, it is unlikely to have the referrer set (this is a grey area).
If the page is hit directly using the browser address bar, it will not be included.
At the end of each day, /human-access_log should contain all the referrers which actually are human page views.
To play it safe, the hash of the referrer from the Apache log should tally with the image name.
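On the PHP side, embedding the beacon could look like the sketch below; the secret salt and the md5 choice are assumptions, and the same salt lets your log-processing step verify that the image name tallies with the referrer:

<?php
// Sketch: embed the /human/<hash>.gif beacon in every page.
$secret  = 'some-private-salt';                    // assumption: shared with the log processor
$current = (isset($_SERVER['HTTPS']) ? 'https' : 'http')
         . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$hash    = md5($secret . $current);

echo '<img src="/human/' . $hash . '.gif" width="1" height="1" alt="">';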
Nowadays we have all kinds of headless browsers: Chrome, Firefox or whatever else will execute any JS you have on your site. So JS-based detection won't work.
I think the most confident way would be to track behavior on the site. If I were writing a bot and wanted to bypass the checks, I would mimic scroll, mouse move, hover, browser-history and similar events just with headless Chrome. To take it to the next level, even if headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes and build my own binaries that leave no trace.
I think this may be the closest thing to real detection of human versus bot, with no action required from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure of the techniques behind this, but I believe Google did a good job by analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.
While it's an extra HTTP request, it would not detect a quickly bounced visitor, so that's something to keep in mind.
Have a 1x1 GIF in your pages that you keep track of. If it is loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.
=? Sorry, misunderstood. You may try another option I have set up on my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to create your bot list dynamically.
Original answer follows (getting negative ratings!)
The only reliable way to tell bots from humans is CAPTCHAs (http://en.wikipedia.org/wiki/Captcha). You can use reCAPTCHA (http://recaptcha.net/) if it suits you.
You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.
A programmatic solution just won't do: see what happens when PARRY Encounters the DOCTOR.
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and the DOCTOR as a stereotypical psychotherapist.
Here's some more background