Top techniques to avoid 'data scraping' from a website database - php

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?

While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc. - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a rough sketch of this appears after this list).
Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot meta tags. This would stop conforming spiders. For instance, this will prevent Google from indexing you:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
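As a rough illustration of the first option, here is a minimal rate-limiting sketch in PHP with MySQL. The request_log table, the connection details and the per-minute limit are all made up, so treat it as a starting point rather than a drop-in solution:
<?php
// Minimal per-IP rate limit sketch. Assumes a hypothetical table:
//   CREATE TABLE request_log (ip VARCHAR(45), requested_at DATETIME, KEY (ip, requested_at));
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$ip = $_SERVER['REMOTE_ADDR'];
$maxPerMinute = 60; // tune to your real traffic

$pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (?, NOW())')->execute([$ip]);

$stmt = $pdo->prepare('SELECT COUNT(*) FROM request_log WHERE ip = ? AND requested_at > NOW() - INTERVAL 1 MINUTE');
$stmt->execute([$ip]);

if ($stmt->fetchColumn() > $maxPerMinute) {
    http_response_code(429); // Too Many Requests
    exit('You are requesting pages too quickly. Please slow down.');
}
?>
The same counter can feed the account/IP shutdown mentioned above.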

If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.

There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but would prevent simple page scraping.
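A minimal sketch of the second idea, assuming a page.php shell and a data.php AJAX endpoint (both names, and the 10-second window, are just placeholders):
<?php
// page.php (sketch) -- issue a short-lived token when the page shell is rendered
session_start();
$_SESSION['data_token'] = bin2hex(random_bytes(16));
$_SESSION['data_token_ts'] = time();
// echo the token into the page (e.g. a data- attribute) so the JavaScript can send it back
?>

<?php
// data.php (sketch) -- the AJAX endpoint rejects missing, wrong or expired tokens
session_start();
$token = $_GET['token'] ?? '';
$fresh = isset($_SESSION['data_token_ts']) && (time() - $_SESSION['data_token_ts'] <= 10);

if (!$fresh || !hash_equals($_SESSION['data_token'] ?? '', $token)) {
    http_response_code(403);
    exit;
}
// ...query MySQL here and echo the records as JSON...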

Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.

Force a reCAPTCHA every 10 page loads for each unique IP.

There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.

Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruous things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.

I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.

Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.

There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique, adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions, but it's still not a real solution, and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.

What about creating something akin to a bulletin board's troll protection... If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
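If you go that way, the substitution itself can be very small once a request has been flagged; a rough sketch (the is_flagged_scraper() check and the phone column are hypothetical):
<?php
// Sketch: once a client has been flagged as a scraper, perturb the phone number before output.
// is_flagged_scraper() is a hypothetical check (e.g. based on the rate limiting discussed above).
function poison_phone($phone) {
    return substr($phone, 0, -2) . rand(10, 99); // swap the last two digits for random ones
}

$phone = $row['phone']; // value fetched from MySQL
if (is_flagged_scraper($_SERVER['REMOTE_ADDR'])) {
    $phone = poison_phone($phone);
}
echo htmlspecialchars($phone);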

Normally, to screen-scrape a decent amount of data, one has to make hundreds, thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?

Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
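A session-based sketch of that idea in PHP; the window, the limit, the backoff cap and show_captcha_page() are all assumptions you would replace with your own values and CAPTCHA handling:
<?php
// Sketch: count page loads per session; past the threshold, add a growing delay and demand a CAPTCHA.
session_start();

$window = 60;  // x seconds
$limit  = 20;  // n page loads allowed per window

if (!isset($_SESSION['hits']) || time() - $_SESSION['window_start'] > $window) {
    $_SESSION['hits'] = 0;
    $_SESSION['window_start'] = time();
}
$_SESSION['hits']++;

if ($_SESSION['hits'] > $limit && empty($_SESSION['captcha_passed'])) {
    sleep(min(30, 2 ** ($_SESSION['hits'] - $limit - 1))); // 1s, 2s, 4s... capped at 30s
    show_captcha_page(); // hypothetical: renders the CAPTCHA and sets $_SESSION['captcha_passed'] on success
    exit;
}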

My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your pagerank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.

Related

How to block a spider if it's disobeying the rules of robots.txt

Is there any way to block crawler/spider search bots if they're not obeying the rules written in the robots.txt file? If yes, where can I find more info about it?
I would prefer some .htaccess rule, if not then PHP.
There are ways to prevent most bots from spidering your site.
Aside from filtering by user agent and known IP addresses, you should also implement behaviour-driven blocking. That means: if it acts like a crawler, block it.
You can find multiple lists of search engine bots here. But most of the big players obey the robots.txt.
So the other, rather big part is blocking based on the bot's behaviour. Things get less complicated if you are using a framework like Laravel or Symfony, because you can easily set a filter to be executed before every page load. If not, you'd have to implement a function which is called before every page load.
Now there are some things to consider. A spider usually crawls as fast as it can, so you could use the session to measure the time between page loads and the number of page loads in a given time span. If some amount X is exceeded, the client is blocked.
Sadly, this approach relies on the bot handling sessions/cookies correctly, which may not always be the case.
Another, or an additional, approach would be to measure the number of page loads from a given IP address. This is risky because there may also be a huge number of users sharing the same IP address, so this may exclude humans.
A third approach I can think of is to use some kind of honeypot. Create a link that leads to a specific page. That link has to be visible to computers, but not to humans. Hide it away with some CSS. If someone or something accesses the page using the hidden link, you can be (close to) sure it is a program. But be aware, there are browser addons which preload every link they can find, so you cannot rely totally on this.
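A minimal honeypot sketch along those lines (trap.php, the blocked_ips table and the CSS used to hide the link are all made up; also disallow the trap URL in robots.txt so well-behaved crawlers don't fall in):
<!-- In your page template: a link humans never see -->
<a href="/trap.php" style="position:absolute;left:-9999px" rel="nofollow">&nbsp;</a>

<?php
// trap.php (sketch) -- anything that requests this is almost certainly a program.
// INSERT IGNORE assumes a unique index on the hypothetical blocked_ips.ip column.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->prepare('INSERT IGNORE INTO blocked_ips (ip) VALUES (?)')->execute([$_SERVER['REMOTE_ADDR']]);
http_response_code(403);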
Depending on the nature of your site, one last approach would be to hide the complete site behind a captcha. This is a harsh measure in terms of usability, so decide carefully whether it applies to your use case.
Then there are techniques like using Flash or complicated JavaScript that most bots do not understand, but it's disgusting and I don't want to talk about it. ^^
Finally, I will now come to a conclusion.
With a well-written robots.txt, most robots will leave you alone. In addition to that, you should combine all or some of the approaches mentioned beforehand to catch the bad guys.
After all, as long as your site is publicly available, you can never evade a custom-made bot tailored specifically for your site. If a browser can parse it, a robot can do so as well.
For a more useful answer I would need to know what you are trying to hide and why.

Algorithms used for catching robots

What kind of algorithms do websites, including Stack Exchange, use to catch robots?
What makes them fail at times and present human-verification to normal users?
For web-applications and websites running on PHP, what would you recommend in order to stop robots and bot attacks and even content stealing?
Thank you.
Check out http://www.captcha.net/ for good and easy human-verification tools.
Preventing content stealing will be really difficult as you want the information to be available to your visitors.
Do not disable right click; it will only annoy your users and not stop content thieves in any way.
You won't be able to keep out all bots, but you will be able to implement layers of security that will each stop a part of the bots.
A few hints and tips;
Use CAPTCHAs for human verification, but don't use too many of them as they will tire users.
You could do e-mail verification with a Captcha and require a login for your content (if it doesn't scare away too many users). Or consider giving some part of the content for free and require registration for the full content.
Check for pieces of your content on other sites regularly (through Google, possibly automated with the Google API) and sue / DMCA notice if they blatantly stole (not quoted!) your content.
Limit the speed at which individual clients can make requests to your site. Bots will scrape often and quickly. Requesting content more than once a second is already a lot for human users. There are server tools that can accomplish this, eg. check out http://www.modsecurity.org/
I am sure there are more layers of security that can be thought of, but these come to mind directly.
I ran across an interesting article from Princeton University that presents nice ideas for automatic robot detection. The idea is quite simple. Humans behave differently than machines, and an automated access usually does things differently than a human.
The article presents some basic checks that can be done over the course of a few requests. You spend a few requests gathering information about how the client is browsing and after some time you take all your variables and make an assertion. Things to include are:
Mouse movement: a robot will most likely not use a mouse and therefore will not generate mouse movement events in the browser. You can prepare a javascript function, say "onBodyMouseMove()" and call it whenever the mouse moves over the entire area of page's body. If this function is called, count +1 in a session counter.
Javascript: some robots will not take the time to run javascript (i.e. curl, wget, axel, and other command line tools), since they are mostly sending specific requests that return useful output. You can prepare a function that is called after a page is loaded and count +1 in a session counter.
Invisible links: crawler robots are sucking machines that don't care about the content of a website. They are designed to click on all possible links and suck all the contents to a mirror location. You can insert invisible links somewhere in your webpage -- for example, a few nbsp; space characters at the bottom of the page surrounded by an anchor tag. Humans will never see this link, but if you get a request on it, count +1 in a session counter.
CSS, images, and other visual components: robots will most likely ignore CSS and images, because they are not interested in rendering the webpage for viewing. You can hide a link inside a URL that ends in *.css or *.jpg (you can use Apache rewrites or servlet mappings for Java). If these specific links are accessed, it's most likely a browser loading CSS and JPG for viewing.
NOTE: *.css, *.js, *.jpg, etc. are usually loaded only once per page in a session. You need to append a unique counter at the end for the browser to reload these links every time the page is requested.
Once you gather all that information in your session over the course of a few requests, you can make an assertion. For example, if you don't see any javascript, CSS or mouse move activity, you can assume it's a bot. It's up to you to take these counters into consideration according to your needs, so you can program it based on these variables any way you want. If you decide some client is a robot, you can force them to solve a captcha before continuing with further requests.
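To make the bookkeeping concrete, here is a rough PHP sketch of the session counters and the final assertion. The signal.php endpoint, the signal names and the decision rule are all assumptions; the client-side hooks described above would simply ping this endpoint:
<?php
// signal.php (sketch) -- the onBodyMouseMove() hook, the onload JS call, the invisible link and the
// rewritten *.css/*.jpg URLs each ping this with ?what=...; we just bump a session counter.
session_start();
$allowed = ['mousemove', 'js', 'hiddenlink', 'css'];
$what = $_GET['what'] ?? '';
if (in_array($what, $allowed, true)) {
    $_SESSION['signals'][$what] = ($_SESSION['signals'][$what] ?? 0) + 1;
}

// Later, after a few requests, make the assertion:
function looks_like_bot(array $signals) {
    $noHumanSignals = empty($signals['js']) && empty($signals['css']) && empty($signals['mousemove']);
    $hitHiddenLink  = !empty($signals['hiddenlink']);
    return $noHumanSignals || $hitHiddenLink;
}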
Just a note: Tablets will usually not create any mouse move events. So I'm still trying to figure out how to deal with them. Suggestions are welcome :)

Detect if user is human without captcha or useragent

I have a website where I'm providing email encryption to users, and I'm trying to figure out if there's a way to detect whether a user is human or a bot.
I've been digging into $_SESSION in PHP, but it's easy to bypass. I'm also not interested in captcha, user agent or login solutions. Any idea of what I need?
There are other questions very similar to this one in SO but I couldn't find any straight answer...
Any help will be very welcome, thank you all!
This is a hard problem, and no solution I know of is going to be 100% perfect from a bot-defending and usability perspective. If your attacker is really determined to use a bot on your site, they probably will be able to. If you take things far enough to make it impractical for a computer program to access anything on your site, it's likely no human will want to either, but you can strike a good balance.
My point of view on this is partially as a web developer, but more so from the other side of things, having written numerous web crawler programs for clients all over the world. Not all bots have malicious intent, and can be used for things from automating form submissions to populating databases of doctors office addresses or analyzing stock market data. If your site is well designed from a usability standpoint, there should be no need for a bot that "makes things easier" for a user, but there are cases where there are special needs you can't plan for.
Of course there are those who do have malicious intent, which you definitely want to protect your site against as well as possible. There is virtually no site that can't be automated in some way. Most sites are not difficult at all, but here are a few ideas off the top of my head, from other answers or comments on this page, and from my experience writing (non-malicious) bots.
Types of bots
First I should mention that there are two different categories I would put bots into:
General purpose crawlers, indexers, or bots
Special purpose bots, made specifically for your site to perform some task
Usually a general-purpose bot is going to be something like a search engine's indexer, or possibly some hacker's script that looks for a form to submit, uses a dictionary attack to search for a vulnerable URL, or something like this. They can also attack "engine sites", such as Wordpress blogs. If your site is properly secured with good passwords and the like, these aren't usually going to pose much of a risk to you (unless you do use Wordpress, in which case you have to keep up with the latest versions and security updates).
Special purpose "personalized" bots are the kind I've written. A bot made specifically for your site can be made to act very much like a human user of your site, including inserting time delays between form submissions, setting cookies, and so on, so they can be hard to detect. For the most part this is the kind I'm talking about in the rest of this answer.
Captchas
Captchas are probably the most common approach to making sure a user is humanoid, and generally they are difficult to automatically get around. However, if you simply require the captcha as a one-time thing when the user creates an account, for example, it's easy for a human to get past it and then give their shiny new account credentials to a bot to automate usage of the system.
I remember a few years ago reading about a pretty elaborate system to "automate" breaking captchas on a popular gaming site: a separate site was set up that loaded captchas from the gaming site, and presented them to users, where they were essentially crowd-sourced. Users on the second site would get some sort of reward for each correct captcha, and the owners of the site were able to automate tasks on the gaming site using their crowd-sourced captcha data.
Generally the use of a good captcha system will pretty well guarantee one thing: somewhere there is a human who typed the captcha text. What happens before and after that depends on how often you require captcha verification, and how determined the person making a bot is.
Cell-phone / credit-card verification
If you don't want to use Captchas, this type of verification is probably going to be pretty effective against all but the most determined bot-writer. While (just as with the captcha) it won't prevent an already-verified user from creating and using a bot, you can verify that a human being created the account, and if abused block that phone number/credit-card from being used to create another account.
Sites like Facebook and Craigslist have started using cell-phone verification to prevent spamming from bots. For example, in order to create apps on Facebook, you have to have a phone number on record, confirmed via text message or an automated phone call. Unless your attacker has access to a whole lot of active phone numbers, this could be an effective way to verify that a human created the account and that he only creates a limited number of accounts (one for most people).
Credit cards can also be used to confirm that a human is performing an action and limit the number of accounts a single human can create.
Other [less-effective] solutions
Log analysis
Analyzing your request logs will often reveal bots doing the same actions repeatedly, or sometimes using dictionary attacks to look for holes in your site's configuration. So logs will tell you after-the-fact whether a request was made by a bot or a human. This may or may not be useful to you, but if the requests were made on a cell-phone or credit-card verified account, you can lock the account associated with the offending requests to prevent further abuse.
Math/other questions
Math problems or other questions can be answered by a quick google or wolfram alpha search, which can be automated by a bot. Some questions will be harder than others, but the big search companies are working against you here, making their engines better at understanding questions like this, and in turn making this a less viable option for verifying that a user is human.
Hidden form fields
Some sites employ a mechanism where parameters such as the coordinates of the mouse when they clicked the "submit" button are added to the form submission via javascript. These are extremely easy to fake in most cases, but if you see in your logs a whole bunch of requests using the same coordinates, it's likely they are a bot (although a smart bot could easily give different coordinates with each request).
Javascript Cookies
Since most bots don't load or execute javascript, cookies set using javascript instead of a set-cookie HTTP header will make life slightly more difficult for most would-be bot makers. But not so hard as to prevent the bot from manually setting the cookie as well, once the developer figures out how to generate the same value the javascript generates.
IP address
An IP address alone isn't going to tell you if a user is a human. Some sites use IP addresses to try to detect bots though, and it's true that a simple bot might show up as a bunch of requests from the same IP. But IP addresses are cheap, and with Amazon's EC2 service or similar cloud services, you can spawn a server and use it as a proxy. Or spawn 10 or 100 and use them all as proxies.
UserAgent string
This is so easy to manipulate in a crawler that you can't count on it to mark a bot that's trying not to be detected. It's easy to set the UserAgent to the same string one of the major browsers sends, and may even rotate between several different browsers.
Complicated markup
The most difficult site I ever wrote a bot for consisted of frames within frames within frames....about 10 layers deep, on each page, where each frame's src was the same base controller page, but had different parameters as to which actions to perform. The order of the actions was important, so it was tough to keep straight everything that was going on, but eventually (after a week or so) my bot worked, so while this might deter some bot makers, it won't be useful against all. And will probably make your site about a gazillion times harder to maintain.
Disclaimer & Conclusion
Not all bots are "bad". Most of the crawlers/bots I have made were for users who wanted to automate some process on the site, such as data entry, that was too tedious to do manually. So make tedious tasks easy! Or, provide an API for your users. Probably one of the easiest ways to discourage someone from writing a bot for your site is to provide API access. If you provide an API, it's a lot less likely someone will go to the effort to create a crawler for it. And you could use API keys to control how heavily someone uses it.
For the purpose of preventing spammers, some combination of captchas and account verification through cell numbers or credit cards is probably going to be the most effective approach. Add some logging analysis to identify and disable any malicious personalized bots, and you should be in pretty good shape.
My favorite way is presenting the "user" with a picture of a cat or a dog and asking, "Is this a cat or a dog?" No human ever gets that wrong; the computer gets it right perhaps 60% of the time (so you have to run it several times). There's a project that will give you bunches of pictures of cats and dogs -- plus, all the animals are available for adoption so if the user likes the pet, he can have it.
It's a Microsoft corporate project, which puts me in a state of cognitive dissonance, as if I found out that Harry Reid likes zydeco music or that George Bush smokes pot. Oh, wait...
I've seen/used a simple arithmetic problem with written numbers, i.e.:
Please answer the following question to prove you are human:
"What is two plus four?"
and similar simple questions which require reading:
"What is man's best friend?"
you can supply an endless stream of questions, should the person attempting access be unfamiliar with the subject matter, and it is accessible to all readers, etc.
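A small PHP sketch of that approach; the question list, field name and wording are only examples:
<?php
// Sketch: pick a random written question, remember its acceptable answers in the session,
// and check the answer on submit.
session_start();
$questions = [
    'What is two plus four?'     => ['6', 'six'],
    "What is man's best friend?" => ['dog'],
];

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $given = strtolower(trim($_POST['human_answer'] ?? ''));
    if (!in_array($given, $_SESSION['expected_answers'] ?? [], true)) {
        exit('Please answer the question correctly to prove you are human.');
    }
    // ...process the rest of the form...
} else {
    $q = array_rand($questions); // array_rand returns a random key
    $_SESSION['expected_answers'] = $questions[$q];
    echo '<label>' . htmlspecialchars($q) . ' <input name="human_answer"></label>';
}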
There's a reason why companies use captchas or logins. As ugly of a solution as captchas are, they're currently the best (most accurate, least disruptive to users) way of weeding out bots. If a login solution doesn't work for you, I'm afraid the only realistic solution is a captcha.
If users will be filling in a form, honeypot fields are simple to implement, and can be reasonably effective, but nothing is perfect. Create one or more hidden fields in the form, and if they contain anything when the form is posted, reject the form. Spambots will usually attempt to fill in everything.
You do need to be aware of accessibility. Hidden fields probably won't be filled in by those using a standard browser (where the field is not visible), but those using screen readers may be presented with the field. Be sure to label it correctly so that these users do not fill it in. Perhaps with something like "Please help us to prevent spam by leaving this field empty". Also, if you do reject the form, be sure to reject it with helpful error messages, just in case it has been filled in by a human.
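A minimal sketch of such a honeypot field, with the accessibility label suggested above (the field name hp_website and the off-screen CSS are arbitrary choices):
<!-- In the form: an extra field hidden from sighted users but labelled for screen readers -->
<div style="position:absolute;left:-9999px">
    <label for="hp_website">Please help us prevent spam by leaving this field empty</label>
    <input type="text" id="hp_website" name="hp_website" tabindex="-1" autocomplete="off">
</div>

<?php
// On the server: anything in the honeypot field means the post was almost certainly automated.
if (!empty($_POST['hp_website'])) {
    http_response_code(400);
    exit('Sorry, your submission could not be processed. Please leave the spam-trap field empty.');
}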
I suggest getting the Growmap Anti Spambot Wordpress plugin and seeing what code you can borrow from it or just using the same technique. I've found this plugin to be very effective for curtailing automated spam on my WordPress sites and I've started adapting the same technique for my ASP.NET sites.
The only thing it doesn't deal with are human cut-and-paste spammers.

How to protect website from bulk scraping /downloading? [duplicate]

This question already has answers here:
Top techniques to avoid 'data scraping' from a website database
(14 answers)
Closed 5 years ago.
I have a LAMP server where I run a website, which I want to protect against bulk scraping/downloading. I know that there is no perfect solution for this and that the attacker will always find a way. But I would like to have at least some "protection" which makes stealing the data harder than having nothing at all.
This website has circa 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to be able to analyze incoming HTTP requests online, and if there is suspicious activity (e.g. tens of requests in one minute from one IP) automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to limit script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with #symcbean: try to avoid using IPs, and instead use a session, a cookie, or another method to track individuals. Otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
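For reference, a minimal session-based leaky bucket sketch in PHP; the capacity and leak rate are guesses to tune for your own traffic:
<?php
// Leaky bucket sketch, tracked per session as suggested above.
session_start();

$capacity = 30;   // bucket size: burst allowance
$leakRate = 0.5;  // requests "drained" per second

$now     = microtime(true);
$level   = $_SESSION['bucket_level']   ?? 0.0;
$lastHit = $_SESSION['bucket_updated'] ?? $now;

// drain the bucket for the time that has passed, then add this request
$level = max(0.0, $level - ($now - $lastHit) * $leakRate) + 1.0;

$_SESSION['bucket_level']   = $level;
$_SESSION['bucket_updated'] = $now;

if ($level > $capacity) {
    http_response_code(429);
    exit('Too many requests.');
}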
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
This list, and rate limiting, will stop simple script kiddies but anyone with even moderate scripting experience will easily be able to get around you. Combating scrapers on your own is a futile effort but my opinion is biased because I am a cofounder of Distil Networks which offers anti-scraping protection as a service.
Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users / without providing a mechanism for DoSing your site? Like spam prevention, the best solution is to use several approaches and maintain scores of badness.
You've already mentioned looking at the rate of requests - but bear in mind that increasingly users will be connecting from NAT networks - e.g. IPv6 pops. A better approach is to check per session - you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too. Tracking 404 rates. Road blocks (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks - but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is when the thresholds are triggered start substituting content. Or do the same when you get repeated external hosts appearing in your referer headers.
Do not tar pit connections unless you've got a lot of resources server-side!
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 not found).
Of course you need to set this up to allow search engines to read your content (assuming you want that) and also be aware that if you have any flash content, the referrer is never set, so you can't use this method.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also just enable it for images which makes it a bit harder for them to be scraped from the site.
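A bare-bones sketch of that referrer check for an image script (the domain, paths and 404 fallback are placeholders, and the search-engine and Flash caveats above still apply):
<?php
// image.php (sketch) -- serve the file only when the referrer points back at our own site
$referer = $_SERVER['HTTP_REFERER'] ?? '';

if (stripos($referer, 'http://www.example.com/') !== 0) {
    header('HTTP/1.1 404 Not Found'); // alternate content for missing/foreign referrers
    exit;
}

header('Content-Type: image/jpeg');
readfile(__DIR__ . '/images/' . basename($_GET['img'] ?? 'default.jpg'));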
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in apache. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files and if there were a certain number of failed login attempts made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
The first would be: if you have information that other people want, give it to them in a controlled way, say, an API.
The second would be to try and copy Google: if you scrape the results of Google a LOT (and I mean a few hundred times a second), it will notice it and force you to a CAPTCHA.
I'd say that if a site is visited 10 times a second, it's probably a bot. So give it a CAPTCHA to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try and stop it.
You could use a counter (DB or Session) and redirect the page if the limit is triggered.
<?php
// Sketch: count page loads per session and redirect once the limit is triggered
session_start();
$limit = 100; // page loads allowed per session
$_SESSION['count'] = ($_SESSION['count'] ?? 0) + 1;
if ($_SESSION['count'] > $limit) {
    header('Location: /rate-limited.php');
    exit;
}
I think dynamically blocking IPs using an IP blocker will help more.

Tell bots apart from human visitors for stats?

I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just want to exclude as many as I can from my stats. I know that parsing the user agent is an option, but maintaining the patterns to parse for is a lot of work. My question is whether there is any project or service that does that already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build (probably an in-memory) table of loads by IP and do a "not contained in" match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.
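A sketch of the CSS cross-check part in PHP, assuming both stylesheets are served through scripts and a hypothetical css_loads table; the real/dummy naming and the SQL are illustrative only:
<?php
// Both stylesheets are linked from every page, but dummy.css is disallowed in robots.txt.
// Browsers fetch both; a polite bot that loads CSS at all skips the dummy one.
// real-css.php and dummy-css.php would each call this with the right $which.
function record_css_load(PDO $pdo, $which) {
    $pdo->prepare('INSERT INTO css_loads (ip, which, seen_at) VALUES (?, ?, NOW())')
        ->execute([$_SERVER['REMOTE_ADDR'], $which]);
    header('Content-Type: text/css');
    echo "/* tracked */";
}

// For the stats run: IPs that loaded the real CSS but never the dummy one are (well-behaved) bots.
function likely_bot_ips(PDO $pdo) {
    $sql = "SELECT DISTINCT ip FROM css_loads WHERE which = 'real'
            AND ip NOT IN (SELECT ip FROM css_loads WHERE which = 'dummy')";
    return $pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN);
}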
The easiest way is to check if their user agent includes 'bot' or 'spider'. Most do.
EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support javascript so I've removed the paragraph that stated that if the site was JS based most bots would be auto-stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: Amazing work done by eSniff has the above list "in a form that can be queried and parsed more easily: robotstxt.org/db/all.txt. Each new bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use", as you can read in his comment.
Hope it helps!
Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers - at least the content type and cache control - but write an empty image out).
Some bots parse JS, but certainly none load CSS images. One pitfall - as with JS - is that you will exclude text-based browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly fewer CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (unexceptional) case that the more advanced bots (Google, Yahoo, etc) may crawl them in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).
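A rough sketch of such a script; the file-based logging, the cache headers and the counter.gif.php name are just one way to do it:
<?php
// counter.gif.php (sketch) -- a stats hit disguised as a CSS background image.
// The stylesheet would reference it, e.g.:  body { background-image: url(/counter.gif.php); }

// log the hit (file-based here for brevity; a MySQL insert works the same way)
file_put_contents(__DIR__ . '/hits.log',
    date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' ' . ($_SERVER['HTTP_USER_AGENT'] ?? '') . "\n",
    FILE_APPEND | LOCK_EX);

// send the right headers and a transparent 1x1 GIF body
header('Content-Type: image/gif');
header('Cache-Control: no-store, no-cache, must-revalidate'); // so repeat page views are re-requested
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');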
I use the following for my stats/counter app:
<?php
function is_bot($user_agent) {
return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
}
//example usage
if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>
I removed a link to the original code source, because it now redirects to a food app.
Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a JavaScript focus event.
If the focus event fires, the page was almost certainly loaded by a human being.
Edit: it's true, people with Javascript turned off will not show up as humans, but that's not a large percentage of web users.
Edit2: Current bots can also execute Javascript, at least Google can.
I currently use AWStats and Webalizer to monitor my log files for Apache 2, and so far they have been doing a pretty good job of it. If you would like, you can have a look at their source code, as it is an open source project.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang
Rather than trying to maintain an impossibly long list of spider User Agents, we look for things that suggest human behaviour. The principal one of these is that we split our session count into two figures: the number of single-page sessions, and the number of multi-page sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - the referrer is Google, for example (although I believe that the MS Search bot masquerades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their bot], and that behaviour looks a lot like a human!)
Of course this is not infallible, and in particular if you have lots of people who arrive and "click off" it's not going to be a good statistic for you, nor if you have a predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session cookies enabled).
Taking the data from one of our clients, we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page sessions per day, we then have a damn-near-linear rate of 4 multi-page sessions per order placed / two sessions per basket. I have no real idea what the other 1,000 multi-page sessions per day are!
Record mouse movement and scrolling using javascript. You can tell from the recorded data whether it's a human or a bot - unless the bot is really, really sophisticated and mimics human mouse movements.
Prerequisite - referrer is set
apache level:
LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*) /b.gif [L]
SetEnv human_session 0
# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1
SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog
In the web page, embed a /human/$hashkey_of_current_url.gif.
If it is a bot, it is unlikely to have the referrer set (this is a grey area).
If the page is hit directly using the browser address bar, it will not be included.
At the end of each day, /human-access_log should contain all the referrers which are actually human page views.
To play it safe, the hash of the referrer from the Apache log should tally with the image name.
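For completeness, the page-side embed could look something like this; the sha1-of-URI hash is only an assumption and just has to match whatever you later compare against the referrer in the log:
<?php
// Sketch: one tracked gif per URL, matching the /human/(.*) rewrite above
$hash = sha1($_SERVER['REQUEST_URI']);
echo '<img src="/human/' . $hash . '.gif" width="1" height="1" alt="">';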
Now we have all kinds of headless browsers - Chrome, Firefox or whatever else - that will execute whatever JS you have on your site. So any JS-based detections won't work.
I think the most reliable way would be to track behavior on the site. If I were to write a bot and wanted to bypass checks, I would mimic scroll, mouse move, hover, browser history etc. events just with headless Chrome. To take it to the next level, even if headless Chrome adds some hints about "headless" mode into the request, I could fork the Chrome repo, make changes and build my own binaries that leave no trace.
I think this may be the closest answer to real detection of whether it's a human or not, with no action required from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure of the techniques behind this, but I believe Google did a good job of analyzing billions of requests with their ML algorithms to detect whether the behavior is human-ish or bot-ish.
While it's an extra HTTP request, it would not detect a quickly bounced visitor, so that's something to keep in mind.
Have a 1x1 gif in your pages that you keep track of. If it is loaded, then it's likely to be a browser. If it's not loaded, it's likely to be a script.
Sorry, I misunderstood. You may try another option I have set up at my site: create a non-linked webpage with a hard/strange name and log visits to this page separately. Most if not all of the visitors to this page will be bots; that way you'll be able to create your bot list dynamically.
Original answer follows (getting negative ratings!)
The only reliable way to tell bots from humans is CAPTCHAs (http://en.wikipedia.org/wiki/Captcha). You can use reCAPTCHA (http://recaptcha.net/) if it suits you.
You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.
I'm surprised no one has recommended implementing a Turing test. Just have a chat box with a human on the other end.
A programmatic solution just won't do: see what happens when PARRY Encounters the DOCTOR.
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70s, to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.
Here's some more background
