How to block bots in a download script

How to block bots in a download script - php

I created a download script which makes users wait five seconds before the downlaod starts automatically and also counts downloads. It's very simple. Now I need to find a way to block bots because I want the downlod count to be as realistic as possible, meaning I want it to count only users actually downloading and not bots. Is there a bots list somewhere or just a way to do what I have to? Thanks.

Normal "bots" aren't able to run javascript, so they can't wait (download it).
You can add capcha if you're afraid there are bots with knowledge of "javascript"

Well-behaved robots should respect robots.txt, which allows you to instruct robots how they are allowed to crawl your website.
You cannot reliably block non-well-behaved robots (sort of attempts at human detection such as captcha, as others have suggested). Even though many robots set a special user agent (you can see examples here), a robot can set the user agent to anything it wants to.

Use a captcha. I would suggest you to use Recaptcha.

There are various methods that you could use to get rid of a bot, but they'll also filter out some real users:
Only allow clients that send an acceptable User-Agent string.
Only allow clients that have JavaScript enabled.
Only allow clients that have cookies enabled.
Only allow clients that uncheck a checkbox that says, "I'm a robot."
Only allow clients that don't fill out a honeypot text input.
Have a CAPTCHA (this is used by webmasters who hate their users and have no respect for them; only suggested for sadists and jerks)
You can pick and choose, or combine them to create your own flavor of robot discrimination.

Honeypot fields and timestamp analysis.

Related

How can I track outgoing link clicks without tracking bots?

I have a few thoughts on this but I can see problems with both. I don't need 100% accurate data. An 80% solution that allows me to make generalizations about the most popular domains I'm routing users to is fine.
Option 1 - Use PHP. Route links through a file track.php that makes sure the referring page is from my domain before tracking the click. This page then routes the user to the final intended URL. Obviously bots could spoof this. Do many? I could also check the user agent. Again, I KNOW many bots spoof this.
Option 2 - Use JavaScript. Execute a JavaScript on click function that writes the click to the database and then directs the user to the final URL.
Both of these methods feel like they may cause problems with crawlers following my outgoing links. What is the most effective method for tracking these outgoing clicks?

The most effective method for tracking outgoing links (it's used by Facebook, Twitter, and almost every search engine) is a "track.php" type file.
Detecting bots can be considered a separate problem, and the methods are covered fairly well by these questions: http://duckduckgo.com/?q=how+to+detect+http+bots+site%3Astackoverflow.com But doing a simple string search for "bot" in the User-Agent will probably get you close to your 80%* (and watching for hits to /robots.txt will, depending on the type of bot you're dealing with, get you 95%*).
*: a semi-educated guess, based on zero concrete data

Well, Google analytics and Piwik use Javascript for that.
Since bots can't use JS, you'll only have humans. In the other way, humans can disable JS too (but sincerely, that's rarely the case)
Facebook, Deviantart, WLM, etc use server side script to track. I don't know how they filter bots but a nice robots.txt with one or two filter and that should be good enough to get 80% I guess.

PHP: Differentiate between a human user and a bot / other

I want to, using PHP, differentiate between an actual person and a bot. I currently track page views and they are massively inflated due to bots crawling my pages so I want to only record real people. It doesn't matter if its not 100% accurate I just want a nice simple way to do it via PHP.
To be clear, this is not for analytics's per se; it is so that I can track what images are being served daily so I can produce a "top images of the day" sort of script.

You should be checking the user agent string, most well behaved search bots will report themselves as such.
Google's spider for example.

First, the obvious: check the user agent.
I use another trick that works pretty good. I map robots.txt to a PHP file and log the IP into the database. Then when logging user activity, I make sure they aren't from one of those logged IPs. If the user authenticates via the login system then I track them regardless.
Of course neither solution guarantees any accuracy, but for general logging, it has been sufficient for my purposes.

I'm not sure that PHP is the best solution for this kind of problem.
You can read How to block bad bots and How to block spambots, ban spybots, and tell unwanted robots to go to hell to see more solutions about blocking bots but this time with apache.
Apache will act faster a require less CPU to do this sort of task than a php program.

Preventing Spam

How do you stop bots on a page which is accessible to registered users only? 90% page is accessed by real users and 10% are bot.
I do not want to put captcha or verification method on the page because I know that my users wont like this and they lazy also.
Please share your ideas
Edit
I want to make this question more clear
Registration page has captcha
My site allows users to submit contents in other words its UGC site. Spammers copy other users content and put them on my site so blocking them via askimet is not possible.
Possible Solution
Just got one thing in my mind.
When user click on submit button server will generate a random number (using javascript) which will be then used in hidden field for verification.
Do you think this solution is practically applicable?

One trick I like to use is to add a hidden input field to my forms that a real user would never see or change, but that a bot would blindly fill out.
Something like
<input name="spam_stopper" value="DO NOT CHANGE THIS" style="display:none;"/>
and then, in your form handling code, make sure the value of spam_stopper is "DO NOT CHANGE THIS".
A smart bot may ignore display:none, but that's not too likely - many do ignore <input type="hidden"> though, so I wouldn't use that...

Given you have excluded captcha (which isn't 100% bulletproof), you need to check what your users type and allow or forbid their postings.
This task isn't going to be an easy one, so I would suggest to turn your attention to ready-made solutions such as Akismet.

Since these bots don't follow robots.txt, you can always block them with an .htaccess, but it's lot of work (need to maintain the block list) since bots/spammers often change IPs. You also risk to block genuine users.
You can see Block Bad Bots for an example.
It can be useful but it's often too much work to block all of them VS let's say a CAPTCHA or similar system.

Firstly, do you do human-verification on sign-up? That's the first step you should take to prevent spam on your site. Captchas are very effective, and even if you don't want to make users answer a captcha each time they post on the site, having them fill one out to create an account is perfectly reasonable. It only takes 2-3 seconds, and they only need to do it once.
If you're not willing to do that, you're going to have to put up with spam so long as your site is indexed in search engines.

Prevent not sort out the spam

Yes, CAPTCHAs are not user-friendly. There are a few techniques that you can use to prevent spams without using CAPTCHAs which some of them have been already mentioned by others:
Smarter Server-side Validation: This is specific to the form but for example in contact us form you can filter lengthy messages or messages including a lot of URLs. Or if you expect to get an email you can ping the domain.
Blacklist Mechanism: flag spammers by IP or phrases in a blacklist database. If you're using PHP a simple library like Guard can be helpful
Honeypots: This is already mentioned in the accepted answer
Time-based Protection: To check time to post a request is more than X seconds
Score-based Google reCAPTCHA v3: This version is totally re-designed compared to the previous one and detect spams behind the scene.
I've written a post recently and you can find more in depth there.

Tell bots apart from human visitors for stats?

I am looking to roll my own simple web stats script.
The only major obstacle on the road, as far as I can see, is telling human visitors apart from bots. I would like to have a solution for that which I don't need to maintain on a regular basis (i.e. I don't want to update text files with bot-related User-agents).
Is there any open service that does that, like Akismet does for spam?
Or is there a PHP project that is dedicated to recognizing spiders and bots and provides frequent updates?
To clarify: I'm not looking to block bots. I do not need 100% watertight results. I just
want to exclude as many as I can from my stats. In
know that parsing the user-Agent is an
option but maintaining the patterns to
parse for is a lot of work. My
question is whether there is any
project or service that does that
already.
Bounty: I thought I'd push this as a reference question on the topic. The best / most original / most technically viable contribution will receive the bounty amount.

Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't image anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will be helpful.
Some have already mentioned Javascript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a very sneaky way that I can't see as a user. If that gets followed, we've got a bot.
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page to our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build (probably an in-memory) table of loads by IP and do an not contained in match, but that should be a really solid tell.
So, to use all this: maintain a database table of bots by ip address, possibly with timestamp limitations. Add anything that follows your invisible link, add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using this to do a quick stats analysis and see how strongly those methods appear to be working for identifying things we know are bots.

The easiest way is to check if their useragent includes 'bot' or 'spider' in. Most do.

EDIT (10y later): As Lukas said in the comment box, almost all crawlers today support javascript so I've removed the paragraph that stated that if the site was JS based most bots would be auto-stripped out.
You can follow a bot list and add their user-agent to the filtering list.
Take a look at this bot list.
This user-agent list is also pretty good. Just strip out all the B's and you're set.
EDIT: Amazing work done by eSniff has the above list here "in a form that can be queried and parsed easier. robotstxt.org/db/all.txt Each new Bot is defined by a robot-id:XXX. You should be able to download it once a week and parse it into something your script can use" like you can read in his comment.
Hope it helps!

Consider a PHP stats script which is camouflaged as a CSS background image (give the right response headers -at least the content type and cache control-, but write an empty image out).
Some bots parses JS, but certainly no one loads CSS images. One pitfall -as with JS- is that you will exclude textbased browsers with this, but that's less than 1% of the world wide web population. Also, there are certainly less CSS-disabled clients than JS-disabled clients (mobiles!).
To make it more solid for the (unexceptional) case that the more advanced bots (Google, Yahoo, etc) may crawl them in the future, disallow the path to the CSS image in robots.txt (which the better bots will respect anyway).

I use the following for my stats/counter app:
<?php
function is_bot($user_agent) {
return preg_match('/(abot|dbot|ebot|hbot|kbot|lbot|mbot|nbot|obot|pbot|rbot|sbot|tbot|vbot|ybot|zbot|bot\.|bot\/|_bot|\.bot|\/bot|\-bot|\:bot|\(bot|crawl|slurp|spider|seek|accoona|acoon|adressendeutschland|ah\-ha\.com|ahoy|altavista|ananzi|anthill|appie|arachnophilia|arale|araneo|aranha|architext|aretha|arks|asterias|atlocal|atn|atomz|augurfind|backrub|bannana_bot|baypup|bdfetch|big brother|biglotron|bjaaland|blackwidow|blaiz|blog|blo\.|bloodhound|boitho|booch|bradley|butterfly|calif|cassandra|ccubee|cfetch|charlotte|churl|cienciaficcion|cmc|collective|comagent|combine|computingsite|csci|curl|cusco|daumoa|deepindex|delorie|depspid|deweb|die blinde kuh|digger|ditto|dmoz|docomo|download express|dtaagent|dwcp|ebiness|ebingbong|e\-collector|ejupiter|emacs\-w3 search engine|esther|evliya celebi|ezresult|falcon|felix ide|ferret|fetchrover|fido|findlinks|fireball|fish search|fouineur|funnelweb|gazz|gcreep|genieknows|getterroboplus|geturl|glx|goforit|golem|grabber|grapnel|gralon|griffon|gromit|grub|gulliver|hamahakki|harvest|havindex|helix|heritrix|hku www octopus|homerweb|htdig|html index|html_analyzer|htmlgobble|hubater|hyper\-decontextualizer|ia_archiver|ibm_planetwide|ichiro|iconsurf|iltrovatore|image\.kapsi\.net|imagelock|incywincy|indexer|infobee|informant|ingrid|inktomisearch\.com|inspector web|intelliagent|internet shinchakubin|ip3000|iron33|israeli\-search|ivia|jack|jakarta|javabee|jetbot|jumpstation|katipo|kdd\-explorer|kilroy|knowledge|kototoi|kretrieve|labelgrabber|lachesis|larbin|legs|libwww|linkalarm|link validator|linkscan|lockon|lwp|lycos|magpie|mantraagent|mapoftheinternet|marvin\/|mattie|mediafox|mediapartners|mercator|merzscope|microsoft url control|minirank|miva|mj12|mnogosearch|moget|monster|moose|motor|multitext|muncher|muscatferret|mwd\.search|myweb|najdi|nameprotect|nationaldirectory|nazilla|ncsa beta|nec\-meshexplorer|nederland\.zoek|netcarta webmap engine|netmechanic|netresearchserver|netscoop|newscan\-online|nhse|nokia6682\/|nomad|noyona|nutch|nzexplorer|objectssearch|occam|omni|open text|openfind|openintelligencedata|orb search|osis\-project|pack rat|pageboy|pagebull|page_verifier|panscient|parasite|partnersite|patric|pear\.|pegasus|peregrinator|pgp key agent|phantom|phpdig|picosearch|piltdownman|pimptrain|pinpoint|pioneer|piranha|plumtreewebaccessor|pogodak|poirot|pompos|poppelsdorf|poppi|popular iconoclast|psycheclone|publisher|python|rambler|raven search|roach|road runner|roadhouse|robbie|robofox|robozilla|rules|salty|sbider|scooter|scoutjet|scrubby|search\.|searchprocess|semanticdiscovery|senrigan|sg\-scout|shai\'hulud|shark|shopwiki|sidewinder|sift|silk|simmany|site searcher|site valet|sitetech\-rover|skymob\.com|sleek|smartwit|sna\-|snappy|snooper|sohu|speedfind|sphere|sphider|spinner|spyder|steeler\/|suke|suntek|supersnooper|surfnomore|sven|sygol|szukacz|tach black widow|tarantula|templeton|\/teoma|t\-h\-u\-n\-d\-e\-r\-s\-t\-o\-n\-e|theophrastus|titan|titin|tkwww|toutatis|t\-rex|tutorgig|twiceler|twisted|ucsd|udmsearch|url check|updated|vagabondo|valkyrie|verticrawl|victoria|vision\-search|volcano|voyager\/|voyager\-hc|w3c_validator|w3m2|w3mir|walker|wallpaper|wanderer|wauuu|wavefire|web core|web hopper|web wombat|webbandit|webcatcher|webcopy|webfoot|weblayers|weblinker|weblog monitor|webmirror|webmonkey|webquest|webreaper|websitepulse|websnarf|webstolperer|webvac|webwalk|webwatch|webwombat|webzinger|wget|whizbang|whowhere|wild ferret|worldlight|wwwc|wwwster|xenu|xget|xift|xirq|yandex|yanga|yeti|yodao|zao\/|zippp|zyborg|\.\.\.\.)/i', $user_agent);
}
//example usage
if (! is_bot($_SERVER["HTTP_USER_AGENT"])) echo "it's a human hit!";
?>
I removed a link to the original code source, because it now redirects to a food app.

Checking the user-agent will alert you to the honest bots, but not the spammers.
To tell which requests are made by dishonest bots, your best bet (based on this guy's interesting study) is to catch a Javascript focus event .
If the focus event fires, the page was almost certainly loaded by a human being.
Edit: it's true, people with Javascript turned off will not show up as humans, but that's not a large percentage of web users.
Edit2: Current bots can also execute Javascript, at least Google can.

I currently use AWstats and Webalizer to monitor my log files for Apasce2 and so far they have been doing a pretty good job of it. If you would like you can have a look at their source code as it is an open source project.
You can get the source at http://awstats.sourceforge.net or alternatively look at the FAQ http://awstats.sourceforge.net/docs/awstats_faq.html
Hope that helps,
RayQuang

Rather than trying to maintain an impossibly-long list of spider User Agents we look for things that suggest human behaviour. Principle of these is that we split our Session Count into two figures: the number of single-page-sessions, and the number of multi-page-sessions. We drop a session cookie, and use that to determine multi-page sessions. We also drop a persistent "Machine ID" cookie; a returning user (Machine ID cookie found) is treated as a multi-page session even if they only view one page in that session. You may have other characteristics that imply a "human" visitor - referrer is Google, for example (although I believe that the MS Search bot mascarades as a standard UserAgent referred with a realistic keyword to check that the site doesn't show different content [to that given to their Bot], and that behaviour looks a lot like a human!)
Of course this is not infalible, and in particular if you have lots of people who arrive and "click off" its not going to be a good statistic for you, nor if you have predominance of people with cookies turned off (in our case they won't be able to use our [shopping cart] site without session-cookies enabled).
Taking the data from one of our clients we find that the daily single-session count is all over the place - an order of magnitude different from day to day; however, if we subtract 1,000 from the multi-page session per day we then have a damn-near-linear rate of 4 multi-page-sessions per order placed / two session per basket. I have no real idea what the other 1,000 multi-page sessions per day are!

Record mouse movement and scrolling using javascript. You can tell from the recorded data wether it's a human or a bot. Unless the bot is really really sophisticated and mimics human mouse movements.

Prerequisite - referrer is set
apache level:
LogFormat "%U %{Referer}i %{%Y-%m-%d %H:%M:%S}t" human_log
RewriteRule ^/human/(.*) /b.gif [L]
SetEnv human_session 0
# using referrer
SetEnvIf Referer "^http://yoursite.com/" human_log_session=1
SetEnvIf Request_URI "^/human/(.*).gif$" human_dolog=1
SetEnvIf human_log_session 0 !human_dolog
CustomLog logs/human-access_log human_log env=human_dolog
In web-page, embed a /human/$hashkey_of_current_url.gif.
If is a bot, is unlikely have referrer set (this is a grey area).
If hit directly using browser address bar, it will not included.
At the end of each day, /human-access_log should contains all the referrer which actually is human page-view.
To play safe, hash of the referrer from apache log should tally with the image name

Now we have all kind of headless browsers. Chrome, Firefox or else that will execute whatever JS you have on your site. So any JS-based detections won't work.
I think the most confident way would be to track behavior on site. If I would write a bot and would like to by-pass checks, I would mimic scroll, mouse move, hover, browser history etc. events just with headless chrome. To turn it to the next level, even if headless chrome adds some hints about "headless" mode into the request, I could fork chrome repo, make changes and build my own binaries that will leave no track.
I think this may be the closest answer to real detection if it's human or not by no action from the visitor:
https://developers.google.com/recaptcha/docs/invisible
I'm not sure techniques behind this but I believe Google did a good job by analyzing billions of requests with their ML algorithms to detect if the behavior is human-ish or bot-ish.
while it's an extra HTTP request, it would not detect quickly bounced visitor so that's something to keep in mind.

Have a 1x1 gif in your pages that you keep track of. If loaded then its likely to be a browser. If it's not loaded it's likely to be a script.

=? Sorry, misunderstood. You may try another option I have set up at my site: create a non-linked webpage with a hard/strange name and log apart visits to this page. Most if not all of the visitor to this page will be bots, that way you'll be able to create your bot list dynamically.
Original answer follows (getting negative ratings!)
The only reliable way to tell bots
from humans are [CAPTCHAS][1]. You can
use [reCAPTCHA][2] if it suits you.
[1]:
http://en.wikipedia.org/wiki/Captcha
[2]: http://recaptcha.net/

You could exclude all requests that come from a User Agent that also requests robots.txt. All well behaved bots will make such a request, but the bad bots will escape detection.
You'd also have problems with false positives - as a human, it's not very often that I read a robots.txt in my browser, but I certainly can. To avoid these incorrectly showing up as bots, you could whitelist some common browser User Agents, and consider them to always be human. But this would just turn into maintaining a list of User Agents for browsers instead of one for bots.
So, this did-they-request-robots.txt approach certainly won't give 100% watertight results, but it may provide some heuristics to feed into a complete solution.

I'm surprised no one has recommended implementing a Turing test. Just have a chat box with human on the other end.
A programatic solution just won't do: See what happens when PARRY Encounters the DOCTOR
These two 'characters' are both "chatter" bots that were written in the course of AI research in the '70: to see how long they could fool a real person into thinking they were also a person. The PARRY character was modeled as a paranoid schizophrenic and THE DOCTOR as a stereotypical psychotherapist.
Here's some more background

How do I stop bots from incrementing my file download counter in PHP?

When a user clicks a link to download a file on my website, they go to this PHP file which increments a download counter for that file and then header()-redirects them to the actual file. I suspect that bots are following the download link, however, so the number of downloads is inaccurate.
How do I let bots know that they shouldn't follow the link?
Is there a way to detect most bots?
Is there a better way to count the number of downloads a file gets?

robots.txt: http://www.robotstxt.org/robotstxt.html
Not all bots respect it, but most do. If you really want to prevent access via bots, make the link to it a POST instead of a GET. Bots will not follow POST urls. (I.E., use a small form that posts back to the site that takes you to the URL in question.)

I would think Godeke's robots.txt answer would be sufficient. If you absolutely cannot have the bots up your counter, then I would recommend using the robots file in conjunction with not not incrementing the clicks with some common robot user agents.
Neither way is perfect., but the mixture of the two is probably a little more strict. If is was me, I would probably just stick to the robots file though, since it is easy and probably the most effective solution.

Godeke is right, robots.txt is the first thing to do to keep the bots from downloading.
Regarding the counting, this is really a web analytics problem. Are you not keeping your www access logs and running them through an analytics program like Webalizer or AWStats (or fancy alternatives like Webtrends or Urchin)? To me that's the way to go for collecting this sort of info, because it's easy and there's no PHP, redirect or other performance hit when the user's downloading the file. You're just using the Apache logs that you're keeping anyway. (And grep -c will give you the quick 'n' dirty count on a particular file or wildcard pattern.)
You can configure your stats software to ignore hits by bots, or specific user agents and other criteria (and if you change your criteria later on, you just reprocess the old log data). Of course, this does require you have all your old logs, so if you've been tossing them with something like logrotate you'll have to start out without any historical data.

You can also detect malicious bots, which wouldn't respect robots.txt using http://www.bad-behavior.ioerror.us/.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.