Identify Crawlers From User Agent - php

I want to track all visitors (OS, browser, and other details) to my site. For that I am saving the user agent, URLs, and other essential data in a database. Later, when a cron job runs, the user agent is analyzed to extract the browser and OS. But I want to identify crawlers, as they cannot be considered visitors. Is there any way to identify crawlers from the user agent?
Do the user agents of crawlers follow any common patterns?

You can identify them by User-Agent or IP (subnet).
The first method isn't reliable, because anyone can identify as any crawler just by modifying the User-Agent.
The second method is more reliable, because the source IP of an established connection is much harder to fake than a header.
These are two of the many lists on the web: http://www.user-agents.org/ (See the legend: R = Robot, crawler, spider) - http://www.robotstxt.org/db.html
Another one: http://www.karavadra.net/blog/2010/list-of-crawlers-bots-and-their-ip-addresses/
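One way to do the IP check without maintaining an IP list is a forward-confirmed reverse DNS lookup, which is what Google documents for verifying Googlebot. The following is only a sketch: the function name is_verified_googlebot() is made up for illustration, the two resolver parameters exist only so the logic can be exercised without network access, and gethostbynamel() only covers IPv4.

```php
<?php
/**
 * Forward-confirmed reverse DNS check for an IP that claims to be Googlebot.
 * $reverse and $forward are injectable (hypothetical parameters for testing);
 * by default they are PHP's gethostbyaddr() and gethostbynamel().
 */
function is_verified_googlebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged on failure.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }

    // Step 2: the host name must sit under one of the crawler's domains.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: the forward lookup of that host must lead back to the same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

DNS lookups are slow, so this is better suited to the later cron analysis than to the live request.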

Using User-Agent strings for anything important is unreliable and a bad idea.
Any malicious crawlers will probably send the UA string of a popular browser. Proper search engine crawlers will always send a recognisable UA string, but there's nothing to stop me configuring my web browser to pretend to be one of those crawlers.
If you must do this, see get_browser() and the crawler element of the value it returns.

The Web Robots Page includes a list of known crawlers/robots that includes user agent patterns that may be used to identify known bots that are well behaved (and listed in the database).
But as DaveR said, it is difficult to stop someone who ignores the rules, and not every crawler is in the robotstxt.org database.

Related

How to detect browser spoofing and robots from a user agent string in php

So far I am able to detect robots from a list of user agent string by matching these strings to known user agents, but I was wondering what other methods there are to do this using php as I am retrieving fewer bots than expected using this method.
I am also looking to find out how to detect if a browser or robot is spoofing another browser using a user agent string.
Any advice is appreciated.
EDIT: This has to be done using a log file with lines as follows:
129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"
This means I can't check user behaviour aside from access times.
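For reference, a log line in that format can be split into its fields with a single regular expression. This is a rough sketch that assumes the Apache "combined" layout shown above and does not handle escaped quotes inside the quoted fields; the function name parse_combined_log_line() is made up for illustration.

```php
<?php
/**
 * Parse one line of an Apache "combined" access log into named fields.
 * Returns null when the line does not match the expected layout.
 */
function parse_combined_log_line(string $line): ?array
{
    // ip ident user [time] "request" status bytes "referer" "user agent"
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return [
        'ip'         => $m[1],
        'time'       => $m[2],
        'request'    => $m[3],
        'status'     => (int) $m[4],
        'bytes'      => $m[5],
        'referer'    => $m[6],
        'user_agent' => $m[7],
    ];
}
```

Once the IP, time, and user agent are separate fields, the per-IP checks discussed in the answers can run over the whole log.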
In addition to filtering key words in the user agent string, I have had luck with putting a hidden honeypot link on all pages:
<a style="display:none" href="autocatch.php">A</a>
Then in "autocatch.php" record the session (or IP address) as a bot. This link is invisible to users, but hopefully bots will not notice that it is hidden. Taking the style attribute out and putting it into a CSS file might help even more.
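A minimal sketch of what "autocatch.php" could do, assuming a flat text file is an acceptable store. The helper names (load_bot_ips, record_bot_ip, is_recorded_bot) and the file path are hypothetical, not part of any library.

```php
<?php
/** Read the recorded bot IPs, one per line; a missing file means no bots yet. */
function load_bot_ips(string $file): array
{
    return is_file($file) ? (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: []) : [];
}

/** Record an IP as a bot. Returns false for malformed input or a failed write. */
function record_bot_ip(string $ip, string $file): bool
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    if (in_array($ip, load_bot_ips($file), true)) {
        return true; // already recorded, nothing to write
    }
    return file_put_contents($file, $ip . "\n", FILE_APPEND | LOCK_EX) !== false;
}

/** Check whether an IP has previously followed the hidden link. */
function is_recorded_bot(string $ip, string $file): bool
{
    return in_array($ip, load_bot_ips($file), true);
}

// In autocatch.php itself (path is an assumption):
// record_bot_ip($_SERVER['REMOTE_ADDR'], __DIR__ . '/bots.txt');
```

Other pages can then call is_recorded_bot() before counting a hit as a visitor.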
Because, as previously stated, you can spoof user-agents & IP, these cannot be used for reliable bot detection.
I work for a security company and our bot detection algorithm looks something like this:
Step 1 - Gathering data:
a. Cross-Check user-agent vs IP. (both need to be right)
b. Check Header parameters (what is missing, what is the order and etc...)
c. Check behavior (early access and compliance to robots.txt, general behavior, number of pages visited, visit rates and etc)
Step 2 - Classification:
By cross verifying the data, the bot is classified as "Good", "Bad" or "Suspicious"
Step 3 - Active Challenges:
Suspicious bots undergo the following challenges:
a. JS Challenge (can it activate JS?)
b. Cookie Challenge (can it accept cookies?)
c. If still not conclusive -> CAPTCHA
This filtering mechanism is VERY effective, but I don't really think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot DB need to be constantly updated by a security team).
We offer some sort of "do it yourself" tools in form of Botopedia.org, our directory that can be used for IP/User-name cross-verification, but for truly efficient solution you will have to rely on specialized services.
There are several free bot monitoring solutions, including our own and most will use the same strategy I've described above (or similar).
GL
Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Often times this will include checking for /robots.txt and not loading images. Another trick is to ask the client if they have javascript since most bots won't mark it as enabled.
However, beware: you may well accidentally flag some visitors who are genuinely people.
No, user agents can be spoofed so they are not to be trusted.
In addition to checking for Javascript or image/CSS loads, you can also measure page-load speed, as bots will usually crawl your site a lot faster than any human visitor would jump around. But this only works for small sites: on popular sites, many visitors behind a shared external IP address (a large corporation or university campus) might hit your site at bot-like rates.
I suppose you could also measure the order in which they load pages, as bots would crawl in a first-come, first-crawled order whereas human users would usually not fit that pattern, but that's a bit more complicated to track.
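The request-rate idea can be sketched as a sliding window over the access times of each IP. The function name find_fast_clients() and both thresholds are assumptions for illustration; suitable values depend entirely on the site, and shared IPs will produce false positives as noted above.

```php
<?php
/**
 * Return the IPs that made more than $maxHits requests within any
 * window of $windowSeconds. $hits is a list of [ip, unixTimestamp] pairs.
 */
function find_fast_clients(array $hits, int $maxHits, int $windowSeconds): array
{
    // Group the timestamps by IP address.
    $byIp = [];
    foreach ($hits as [$ip, $ts]) {
        $byIp[$ip][] = $ts;
    }

    $flagged = [];
    foreach ($byIp as $ip => $times) {
        sort($times);
        $start = 0;
        foreach ($times as $end => $t) {
            // Move the left edge forward until the window spans at most $windowSeconds.
            while ($t - $times[$start] > $windowSeconds) {
                $start++;
            }
            if ($end - $start + 1 > $maxHits) {
                $flagged[] = $ip;
                break;
            }
        }
    }
    return $flagged;
}
```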
Your question specifically relates to detection using the user agent string. As many have mentioned this can be spoofed.
To understand what is possible in spoofing, and to see how difficult it is to detect, you are probably best advised to learn the art in PHP using cURL.
In essence using cURL almost everything that can be sent in a browser(client) request can be spoofed with the notable exception of the IP, but even here a determined spoofer will also hide themselves behind a proxy server to eliminate your detecting their IP.
It goes without saying that using the same parameters each time a request is made will enable a spoofer to be detected, but rotating with different parameters will make it very difficult, if not impossible to detect any spoofers amongst genuine traffic logs.

How can I track outgoing link clicks without tracking bots?

I have a few thoughts on this but I can see problems with both. I don't need 100% accurate data. An 80% solution that allows me to make generalizations about the most popular domains I'm routing users to is fine.
Option 1 - Use PHP. Route links through a file track.php that makes sure the referring page is from my domain before tracking the click. This page then routes the user to the final intended URL. Obviously bots could spoof this. Do many? I could also check the user agent. Again, I KNOW many bots spoof this.
Option 2 - Use JavaScript. Execute a JavaScript on click function that writes the click to the database and then directs the user to the final URL.
Both of these methods feel like they may cause problems with crawlers following my outgoing links. What is the most effective method for tracking these outgoing clicks?
The most effective method for tracking outgoing links (it's used by Facebook, Twitter, and almost every search engine) is a "track.php" type file.
Detecting bots can be considered a separate problem, and the methods are covered fairly well by these questions: http://duckduckgo.com/?q=how+to+detect+http+bots+site%3Astackoverflow.com But doing a simple string search for "bot" in the User-Agent will probably get you close to your 80%* (and watching for hits to /robots.txt will, depending on the type of bot you're dealing with, get you 95%*).
*: a semi-educated guess, based on zero concrete data
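As a rough sketch of such a "track.php", the decision logic can be kept in one function that says where to redirect and whether the click should be counted. The function name, the parameters, and the simple "bot" substring test are assumptions for illustration; the actual database write is left out.

```php
<?php
/**
 * Decide what track.php should do with a click.
 * Returns ['redirect' => string|null, 'count' => bool].
 */
function handle_outgoing_click(string $target, string $userAgent, string $referer, string $ownHost): array
{
    // Only redirect to absolute http(s) URLs; reject javascript:, data:, relative paths, etc.
    $scheme = strtolower((string) parse_url($target, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return ['redirect' => null, 'count' => false];
    }

    // Count the click only if it came from our own pages and the UA does not say "bot".
    $fromOwnSite  = strcasecmp((string) parse_url($referer, PHP_URL_HOST), $ownHost) === 0;
    $looksLikeBot = $userAgent === '' || stripos($userAgent, 'bot') !== false;

    return ['redirect' => $target, 'count' => $fromOwnSite && !$looksLikeBot];
}

// In track.php (sketch):
// $r = handle_outgoing_click($_GET['url'] ?? '', $_SERVER['HTTP_USER_AGENT'] ?? '',
//                            $_SERVER['HTTP_REFERER'] ?? '', 'www.example.com');
// if ($r['redirect'] !== null) { header('Location: ' . $r['redirect'], true, 302); }
```

Bots are still redirected in this sketch; they are just left out of the count.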
Well, Google analytics and Piwik use Javascript for that.
Since most bots don't run JS, you'll mostly have humans. On the other hand, humans can disable JS too (but honestly, that's rarely the case).
Facebook, DeviantArt, WLM, etc. use server-side scripts to track. I don't know how they filter bots, but a nice robots.txt with one or two filters should be good enough to get 80%, I guess.

How to identify web-crawler?

How can I filter out hits from web crawlers etc., i.e. hits which are not human?
I use maxmind.com to request the city from the IP.. It is not quite cheap if I have to pay for ALL hits including webcrawlers, robots etc.
There are two general ways to detect robots and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder.
Polite
These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often you are crawled. Politeness is ensured through robots.txt file in which you specify which bots, if any, should be allowed to crawl your website and how often your website can be crawled. This assumes that the robot you're dealing with is polite.
Aggressive
Another way to keep bots off your site is to get aggressive.
User Agent
Some aggressive behavior includes (as previously mentioned by other users) the filtering of user-agent strings. This is probably the simplest, but also the least reliable way to detect if it's a user or not. A lot of bots tend to spoof user agents and some do it for legitimate reasons (i.e. they only want to crawl mobile content), while others simply don't want to be identified as bots. Even worse, some bots spoof legitimate/polite bot agents, such as the user agents of google, microsoft, lycos and other crawlers which are generally considered polite. Relying on the user agent can be helpful, but not by itself.
There are more aggressive ways to deal with robots that spoof user agents AND don't abide by your robots.txt file:
Bot Trap
I like to think of this as a "Venus Fly Trap," and it basically punishes any bot that wants to play tricks with you.
A bot trap is probably the most effective way to find bots that don't adhere to your robots.txt file without actually impairing the usability of your website. Creating a bot trap ensures that only bots are captured and not real users. The basic way to do it is to setup a directory which you specifically mark as off limits in your robots.txt file, so any robot that is polite will not fall into the trap. The second thing you do is to place a "hidden" link from your website to the bot trap directory (this ensures that real users will never go there, since real users never click on invisible links). Finally, you ban any IP address that goes to the bot trap directory.
Here are some instructions on how to achieve this:
Create a bot trap (or in your case: a PHP bot trap).
Note: of course, some bots are smart enough to read your robots.txt file, see all the directories which you've marked as "off limits" and STILL ignore your politeness settings (such as crawl rate and allowed bots). Those bots will probably not fall into your bot trap despite the fact that they are not polite.
Violent
I think this is actually too aggressive for the general audience (and general use), so if there are any kids under the age of 18, then please take them to another room!
You can make the bot trap "violent" by simply not specifying a robots.txt file. In this situation ANY BOT that crawls the hidden links will probably end up in the bot trap and you can ban all bots, period!
The reason this is not recommended is that you may actually want some bots to crawl your website (such as Google, Microsoft or other bots for site indexing). Allowing your website to be politely crawled by the bots from Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it shows up when people search for it on their favorite search engine.
Self Destructive
Yet another way to limit what bots can crawl on your website is to serve CAPTCHAs or other challenges which a bot cannot solve. This comes at the expense of your users, and I would think that anything which makes your website less usable (such as a CAPTCHA) is "self destructive." This, of course, will not actually block the bot from repeatedly trying to crawl your website; it will simply make your website very uninteresting to them. There are ways to "get around" the CAPTCHAs, but they're difficult to implement so I'm not going to delve into this too much.
Conclusion
For your purposes, probably the best way to deal with bots is to employ a combination of the above mentioned strategies:
Filter user agents.
Setup a bot trap (the violent one).
Catch all the bots that go into the violent bot trap and simply black-list their IPs (but don't block them). This way you will still get the "benefits" of being crawled by bots, but you will not have to pay to check the IP addresses that are black-listed due to going to your bot trap.
You can check USER_AGENT, something like:
function crawlerDetect($USER_AGENT)
{
    // Each entry is array(substring to look for, readable crawler name).
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );
    // Case-insensitive substring match; the first hit wins.
    foreach ($crawlers as $c)
    {
        if (stristr($USER_AGENT, $c[0]))
        {
            return($c[1]);
        }
    }
    return false;
}
// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
The user agent ($_SERVER['HTTP_USER_AGENT']) often identifies whether the connecting agent is a browser or a robot. Review logs/analytics for the user agents of crawlers that visit your site. Filter accordingly.
Take note that the user agent is a header supplied by the client application. As such it can be pretty much anything and shouldn't be trusted 100%. Plan accordingly.
Checking the User-Agent will protect you from legitimate bots like Google and Yahoo.
However, if you're also being hit with spam bots, then chances are User-Agent comparison won't protect you, since those bots typically forge a common User-Agent string anyway. In that instance, you would need to employ more sophisticated measures. If user input is required, a simple image verification scheme like reCAPTCHA will work.
If you're looking to filter out all page hits from a bot, unfortunately, there's no 100% reliable way to do this if the bot is forging its credentials. This is just an annoying fact of life on the internet that web admins have to put up with.
I found this package, it's actively being developed and I'm quite liking it so far:
https://github.com/JayBizzle/Crawler-Detect
It's simple as this:
use Jaybizzle\CrawlerDetect\CrawlerDetect;
$CrawlerDetect = new CrawlerDetect;
// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
// true if crawler user agent detected
}
// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
// true if crawler user agent detected
}
// Output the name of the bot that matched (if any)
echo $CrawlerDetect->getMatches();
useragentstring.com serves a list that you can use to analyze the user agent string:
$api_request="http://www.useragentstring.com/?uas=".urlencode($_SERVER['HTTP_USER_AGENT'])."&getJSON=all";
$ua=json_decode(file_get_contents($api_request), true);
if($ua["agent_type"]=="Crawler") die();

How to determine real user are browsing my site or just crawling or else in PHP

I want to know whether a user is actually looking at my site (I know it's just loaded by the browser and displayed to a human, not necessarily a human looking at it).
I know two methods that will work.
Javascript.
If the page was loaded by a browser, it will run the JS code automatically, unless the browser forbids it. Then use AJAX to call back to the server.
A 1×1 transparent image in the HTML.
Use img to call back to the server.
Does anyone know the pitfalls of these methods, or any better method?
Also, I don't know how to detect a 0×0 or 1×1 iframe, which would defeat the above methods.
A bot can access a browser, e.g. http://browsershots.org
The bot can request that 1x1 image.
In short, there is no real way to tell. Best you could do is use a CAPTCHA, but then it degrades the experience for humans.
Just use a CAPTCHA where required (user sign up, etc).
I want to know whether a user is actually looking at my site (I know it's just loaded by the browser and displayed to a human, not necessarily a human looking at it).
The image way seems better, as Javascript might be turned off by normal users as well. Robots generally don't load images, so this should indeed work. Nonetheless, if you're just looking to filter a known set of robots (say Google and Yahoo), you can simply check for the HTTP User Agent header, as those robots will actually identify themselves as being a robot.
You can create a Google Webmasters account. It tells you how to configure your site for bots, and also shows how a robot will read your website.
I agree with others here, this is really tough. Generally, nice crawlers will identify themselves as crawlers, so using the User-Agent is a pretty good way to filter out those guys. A good source for user agent strings can be found at http://www.useragentstring.com. I've used Chris Schuld's PHP script (http://chrisschuld.com/projects/browser-php-detecting-a-users-browser-from-php/) to good effect in the past.
You can also filter these guys at the server level using the Apache config or .htaccess file, but I've found that to be a losing battle keeping up with it.
However, if you watch your server logs you'll see lots of suspect activity with valid (browser) user-agents or funky user-agents so this will only work so far. You can play the blacklist/whitelist IP game, but that will get old fast.
Lots of crawlers do load images (i.e. Google image search), so I don't think that will work all the time.
Very few crawlers will have Javascript engines, so that is probably a good way to differentiate them. And let's face it, how many users actually turn off Javascript these days? I've seen the stats on that, but I think those stats are very skewed by the sheer number of crawlers/bots out there that don't identify themselves. However, a caveat is that I have seen that the Google bot does run Javascript now.
So, bottom line, it's tough. I'd go with a hybrid strategy for sure: if you filter using user-agent, images, IP and Javascript I'm sure you'll get most bots, but expect some to get through despite that.
Another idea, you could always use a known Javascript browser quirk to test if the reported user-agent (if its a browser) is really actually that browser?
"Nice" robots like those from google or yahoo will usually respect a robots.txt file. Filtering by useragent might also help.
But in the end - if someone wants to gain automated access it will be very hard to prevent that; you should be sure it is worth the effort.
Inspect the User-Agent header of the http request.
Crawlers should set this to anything but a known browser.
Here are the Googlebot headers: http://code.google.com/intl/nl-NL/web/controlcrawlindex/docs/crawlers.html
In php you can get the user-agent with :
$Uagent=$_SERVER['HTTP_USER_AGENT'];
Then you just compare it with the known headers
As a tip, preg_match() could be handy to do this all in a few lines of code.
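For instance, a small check along these lines would catch crawlers that identify themselves. The keyword list is only an illustrative assumption and far from complete, and the function name is made up.

```php
<?php
/**
 * Return true when the user agent contains a keyword that
 * self-identifying crawlers commonly use (case-insensitive).
 */
function is_self_identified_crawler(string $userAgent): bool
{
    // Illustrative keyword list only; extend it from your own logs.
    return preg_match('/bot|crawl|spider|slurp/i', $userAgent) === 1;
}

// Usage in a request handler:
// $Uagent = $_SERVER['HTTP_USER_AGENT'] ?? '';
// if (is_self_identified_crawler($Uagent)) { /* skip the stats */ }
```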

Tell bots apart from human visitors for stats?

I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the User-Agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build (probably an in-memory) table of loads by IP and do a "not contained in" match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
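The "not contained in" match can be sketched with plain arrays. Here the input is assumed to be a list of [ip, path] pairs taken from the access log; the function name and the paths passed in are hypothetical. It also assumes the dummy CSS is served with caching disabled, since a browser that has it cached would otherwise look like a bot.

```php
<?php
/**
 * Return the IPs that look like bots according to three signals:
 * they fetched robots.txt, followed the invisible link ($trapPath),
 * or loaded the real CSS without loading the robots.txt-excluded dummy CSS.
 */
function find_bot_ips(array $requests, string $realCss, string $dummyCss, string $trapPath): array
{
    $loadedReal = [];
    $loadedDummy = [];
    $bots = [];
    foreach ($requests as [$ip, $path]) {
        if ($path === $realCss) {
            $loadedReal[$ip] = true;
        }
        if ($path === $dummyCss) {
            $loadedDummy[$ip] = true;
        }
        if ($path === $trapPath || $path === '/robots.txt') {
            $bots[$ip] = true;
        }
    }
    // Real CSS without the dummy CSS: the client obeyed robots.txt, which browsers do not do.
    $bots += array_diff_key($loadedReal, $loadedDummy);
    return array_keys($bots);
}
```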
The easiest way is to check whether their user agent includes 'bot' or 'spider'. Most do.
EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support javascript so I've removed the paragraph that stated that if the site was JS based most bots would be auto-stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: Amazing work done by eSniff has the above list here "in a form that can be queried and parsed easier. robotstxt.org/db/all.txt Each new Bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use" like you can read in his comment.
Hope it helps!
Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers -at least the content type and cache control-, but write an empty image out).
Some bots parse JS, but hardly any load CSS images. One pitfall, as with JS, is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (unexceptional) case that the more advanced bots (Google, Yahoo, etc) may crawl them in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
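A sketch of such a stats script follows, assuming it is referenced from the stylesheet as a background image. The function name and the exact cache headers are illustrative choices; the body is a commonly used 1x1 transparent GIF, and the logging call is left as a placeholder.

```php
<?php
/**
 * Build the response for a stats script that poses as a CSS background image:
 * an image content type, headers that discourage caching, and a tiny GIF body.
 */
function tracking_pixel_response(): array
{
    return [
        'headers' => [
            'Content-Type: image/gif',
            'Cache-Control: no-store, no-cache, must-revalidate, max-age=0',
        ],
        // 1x1 transparent GIF, base64-encoded.
        'body' => base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'),
    ];
}

// In the stats script itself (sketch):
// record_hit(...);  // hypothetical logging function, not defined here
// $response = tracking_pixel_response();
// foreach ($response['headers'] as $h) { header($h); }
// echo $response['body'];
```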
I use the following for my stats/counter app:
<?php
function is_bot($user_agent) {
return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap 
engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
}
//example usage
if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>
I removed a link to the original code source, because it now redirects to a food app.
Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a Javascript focus event.
If the focus event fires, the page was almost certainly loaded by a human being.
Edit: it's true, people with Javascript turned off will not show up as humans, but that's not a large percentage of web users.
Edit2: Current bots can also execute Javascript, at least Google can.
I currently use AWStats and Webalizer to monitor my log files for Apache2 and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as they are open source projects.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang
Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. Principal among these is that we split our session count into two figures: the number of single-page sessions and the number of multi-page sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor: referrer is Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their Bot], and that behaviour looks a lot like a human!)
Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).
Taking the data from one of our clients we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page session per day we then have a damn-near-linear rate of 4 multi-page-sessions per order placed / two session per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
Record mouse movement and scrolling using Javascript. You can tell from the recorded data whether it's a human or a bot, unless the bot is really, really sophisticated and mimics human mouse movements.
Prerequisite - referrer is set
apache level:
LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*) /b.gif [L]
SetEnv human_session 0
# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1
SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog
In the web page, embed /human/$hashkey_of_current_url.gif.
If it is a bot, it is unlikely to have the referrer set (this is a grey area).
If the image is hit directly from the browser address bar, it will not be included.
At the end of each day, /human-access_log should contain all the referrers that were actual human page views.
To play safe, the hash of the referrer from the Apache log should tally with the image name.
Now we have all kinds of headless browsers (Chrome, Firefox, and others) that will execute whatever JS you have on your site. So any JS-based detection won't work.
I think the most confident way would be to track behavior on the site. If I were writing a bot and wanted to bypass checks, I would mimic scroll, mouse move, hover, browser history, and similar events with headless Chrome. To take it to the next level, even if headless Chrome adds some hints about "headless" mode to the request, I could fork the Chrome repo, make changes, and build my own binaries that leave no trace.
I think this may be the closest answer to real detection if it's human or not by no action from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure about the techniques behind this, but I believe Google did a good job by analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.
While it's an extra HTTP request, it would not detect a quickly bounced visitor, so that's something to keep in mind.
Have a 1x1 gif in your pages that you keep track of. If it's loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.
Sorry, misunderstood. You may try another option I have set up on my site: create a non-linked web page with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to create your bot list dynamically.
Original answer follows (getting negative ratings!)
The only reliable way to tell bots from humans are [CAPTCHAS][1]. You can use [reCAPTCHA][2] if it suits you.
[1]: http://en.wikipedia.org/wiki/Captcha
[2]: http://recaptcha.net/
You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
I'm surprised no one has recommended implementing a Turing test. Just have a chat box with human on the other end.
A programmatic solution just won't do: See what happens when PARRY Encounters the DOCTOR
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.
Here's some more background
