Hi, I am at the prototype stage. For the scenario below, I want to share my plan with you and ask whether it makes sense, or whether there is a better way to achieve my requirement that you would recommend. Thanks and regards.
SCENARIO
A spam bot can haunt my forms (mail, comment, article sharing). At the verification stage, I can detect a spammer by known methods (captcha, time limit, secret question, hidden form element, etc.), but what if the spam bot tries and tries continuously? It won't be able to pass validation and execute the form's purpose, but it will keep consuming bandwidth.
MY REQUIREMENT
Not only prevent the form's purpose from being executed, but also prevent continuous bandwidth consumption.
MY PLAN
Using sessions, count the number of attempts from a specific IP within a limited time. If the number of attempts is greater than n in x minutes, either redirect the visitor to a completely different URL or ban the visitor's IP at runtime with PHP code.
MY QUESTIONS
1) Is it logical to redirect the spammer to a completely different URL? If so, do you know of any web page that welcomes spammer IPs in order to add them to its blacklists? On the other hand, I am aware that this activity may not be ethical, in which case I should not apply this kind of redirection.
2) Is it possible to ban an IP at PHP runtime in my verification pages?
1) It seems illogical to me. For example, imagine a brute-force bot trying to get through your captcha. Most probably it will not even run in a web browser. If you send redirect headers, it might simply ignore them and never load the target page. If you really need to save bandwidth, you could just serve a blank page (for example, by calling exit()).
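A minimal sketch of that idea, building on the session-based attempt counter from the question (the threshold and window values are just examples):

<?php
session_start();
$maxAttempts = 5;      // example threshold ("n")
$window      = 600;    // example window in seconds ("x minutes")
if (!isset($_SESSION['first_attempt']) || time() - $_SESSION['first_attempt'] > $window) {
    $_SESSION['first_attempt'] = time();   // start a fresh counting window
    $_SESSION['attempts'] = 0;
}
if (++$_SESSION['attempts'] > $maxAttempts) {
    http_response_code(429);   // 429 Too Many Requests; the body stays empty
    exit;                      // blank response: minimal bandwidth spent on the bot
}
// ...normal captcha / form validation continues here...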
2) It is possible to ban an IP at any time. You always have the $_SERVER['REMOTE_ADDR'] variable, and if you decide to ban the IP, you can simply add it to a database (or a file). Then, whenever somebody needs to be verified, you query the database for their IP.
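For example, with a simple banned_ips table (the table, its columns, and the $pdo connection are assumptions for this sketch; INSERT IGNORE is MySQL-flavoured):

<?php
$ip = $_SERVER['REMOTE_ADDR'];

// Refuse banned visitors before doing any real work
$check = $pdo->prepare('SELECT 1 FROM banned_ips WHERE ip = ?');
$check->execute([$ip]);
if ($check->fetchColumn()) {
    exit;   // banned: serve nothing
}

// ...later, once this request has been judged to be a spammer...
$ban = $pdo->prepare('INSERT IGNORE INTO banned_ips (ip, banned_at) VALUES (?, NOW())');
$ban->execute([$ip]);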
Related
I'm creating a web application where users will vote for some candidates by clicking thumbs up or thumbs down, and these users won't have any account on the site.
What is the best technique to use? Is it necessary to use a captcha for extra protection from spam?
Vote counts are expected to be in the millions. The subject is not very critical; as long as I get around 95% accuracy, that would be fine. Thanks.
You can combine these two methods:
Add a cookie to prevent multiple votes from the same machine
Log IP addresses and prevent voting more than a set number of times from the same address (for example, 5 times in the same hour).
This will make it possible for multiple persons to vote from the same network but still prevent excessive cheating.
You may also make it harder to build a voting bot by adding some hidden form field with a token that must be included in the vote, and/or use Ajax for the voting. I know it's relatively easy to build a bot anyway but most cheaters aren't that smart.
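A rough sketch of the hidden-token idea, assuming the token is kept in the session (the field name "vote_token" is made up for illustration):

<?php
session_start();

// When rendering the form: embed a one-time token in a hidden field
if ($_SERVER['REQUEST_METHOD'] === 'GET') {
    $_SESSION['vote_token'] = bin2hex(random_bytes(16));
    echo '<input type="hidden" name="vote_token" value="' . $_SESSION['vote_token'] . '">';
}

// When handling the vote: reject requests that don't echo the token back
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (!isset($_POST['vote_token'], $_SESSION['vote_token'])
        || !hash_equals($_SESSION['vote_token'], $_POST['vote_token'])) {
        http_response_code(400);
        exit('Invalid vote.');
    }
    unset($_SESSION['vote_token']);   // single use
    // ...count the vote here...
}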
Cookies and session IDs will help, although both can be lost when the browser is closed (if the user has set it to delete them). Still, they will give you some degree of accuracy (e.g. lazy voters won't bother to close and reopen their browsers).
Using IP addresses would also work, but as @Michael Dillon said, people on the same IP address (behind the same router) will not all be able to vote.
You have several options, some or all of which you can use.
You can record the IP and then check against it, but an IP isn't indicative of a specific person, only of a computer, and sometimes not even of a single computer.
You can also write a cookie to the user's browser, but a user can switch to a different browser, machine, etc.
Within a user's session you could create a session variable, although if you are expecting very high traffic this may not be the best option, and it also only prevents re-voting within the same session.
If you are contemplating a captcha, you may as well ask the user to supply an email address, and then you are at least limited to one vote per email address. However, even then you cannot be guaranteed valid email addresses.
You can ask for their phone number when they want to vote, send them a one-time password, and use that as verification.
Some may also vote from other numbers, but I think this is the most accurate way.
I have a LAMP server where I run a website that I want to protect against bulk scraping / downloading. I know there is no perfect solution and that an attacker will always find a way, but I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.
The website has about 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to analyze incoming HTTP requests in real time and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not looking for a bullet-proof solution, just a way to stop script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answers aren't quite complete, so I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid keying on IPs and instead use a session, a cookie, or another method to track individuals; otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
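For reference, a minimal leaky-bucket sketch keyed on the session rather than the IP (the capacity and leak rate are invented values):

<?php
session_start();

$capacity = 30;     // example: maximum burst of requests allowed
$leakRate = 0.5;    // example: the bucket drains 0.5 requests per second

$now    = microtime(true);
$bucket = $_SESSION['bucket'] ?? ['level' => 0.0, 'last' => $now];

// Leak: lower the level according to how much time has passed since the last request
$bucket['level'] = max(0.0, $bucket['level'] - ($now - $bucket['last']) * $leakRate);
$bucket['last']  = $now;

// Each request adds one unit; overflowing the bucket means the client is going too fast
$bucket['level'] += 1.0;
$_SESSION['bucket'] = $bucket;

if ($bucket['level'] > $capacity) {
    http_response_code(429);   // Too Many Requests
    exit;
}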
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
Such lists, combined with rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will be able to get around them. Combating scrapers on your own is a largely futile effort, though my opinion is biased: I am a cofounder of Distil Networks, which offers anti-scraping protection as a service.
Sorry, but I'm not aware of any off-the-shelf anti-leeching code that does a good job.
How do you limit access without placing burdens on legitimate users, or without providing a mechanism for DoSing your site? As with spam prevention, the best solution is to use several approaches and maintain a badness score.
You've already mentioned looking at the rate of requests, but bear in mind that more and more users will be connecting from NAT networks (e.g. IPv6 PoPs). A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too, as is tracking 404 rates. Add road blocks (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered, or to do the same when repeated external hosts appear in your referer headers.
Do not tar-pit connections unless you've got a lot of resources server-side!
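A toy sketch of the badness-score idea from this answer, kept in the session (the signal weights, the threshold and the captcha URL are invented for illustration):

<?php
session_start();

$score = $_SESSION['badness'] ?? 0;

// Accumulate points for suspicious signals (weights are arbitrary examples)
if (empty($_SERVER['HTTP_REFERER'])) {
    $score += 1;    // no referer on a deep content page
}
if (stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'curl') !== false) {
    $score += 3;    // obviously scripted client
}
// (a custom 404 handler could likewise add points for every miss)

$_SESSION['badness'] = $score;

// Road block: above the threshold, send the client to a captcha page instead of content
if ($score >= 10) {
    header('Location: /captcha.php');
    exit;
}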
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
Of course you need to set this up to allow search engines to read your content (assuming you want that), and be aware that if you have any Flash content, the referrer is never set, so you can't use this method there.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also enable it only for images, which makes it a bit harder for them to be scraped from the site.
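A minimal PHP version of that referrer check (example.com is a placeholder; remember the header is easily spoofed or simply absent):

<?php
// Serve alternate content when the Referer header is missing or points elsewhere
$referer = $_SERVER['HTTP_REFERER'] ?? '';
$host    = parse_url($referer, PHP_URL_HOST);

if (!$host || !preg_match('/(^|\.)example\.com$/i', $host)) {
    http_response_code(404);
    exit('Not Found');
}
// ...serve the real content...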
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files; if a certain number of failed login attempts were made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
First: if you have information that other people want, give it to them in a controlled way, say through an API.
Second: try to copy Google. If you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you to a captcha.
I'd say that if a visitor hits the site 10 times a second, it's probably a bot, so give it a captcha to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try to stop it.
You could use a counter (DB or Session) and redirect the page if the limit is triggered.
<?php // Count hits from the same IP within the session and redirect past the limit
session_start();
$limit = 5;                                          // example threshold
if (($_SESSION['ip'] ?? '') !== $_SERVER['REMOTE_ADDR']) {
    $_SESSION['ip'] = $_SERVER['REMOTE_ADDR'];       // new IP/session: reset the counter
    $_SESSION['count'] = 0;
}
if (++$_SESSION['count'] > $limit) {
    header('Location: /blocked.php'); exit;          // example redirect target
}
I think dynamic blocking of IPs using an IP blocker would work better.
I am about to write a voting method for my site. I want a method to stop people voting for the same thing twice. So far my thoughts have been:
Drop a cookie once the vote is complete (susceptible to multi-browser gaming)
Log IP address per vote (this will fail in proxy / corporate environments)
Force logins
My site is not account based as such, although it aggregates Twitter data, so there is scope for using Twitter OAuth as a means of identification.
What existing systems exist and how do they do this?
The best thing would be to disallow anonymous voting. If the user is forced to log in you can save the userid with each vote and make sure that he/she only votes once.
The cookie approach is very fragile since cookies can be deleted easily. The IP address approach has the shortcoming you yourself describe.
One step towards a user auth system, but without all of the complications:
Get the user to enter their email address and confirm their vote. You would not eradicate gaming, but you would make it harder: a gamer would have to register another email address for every extra vote.
Might be worth the extra step.
Let us know what you end up going for.
If you want to go with cookies after all, use an evercookie.
evercookie is a javascript API available that produces
extremely persistent cookies in a browser. Its goal
is to identify a client even after they've removed standard
cookies, Flash cookies (Local Shared Objects or LSOs), and
others.
evercookie accomplishes this by storing the cookie data in
several types of storage mechanisms that are available on
the local browser. Additionally, if evercookie has found the
user has removed any of the types of cookies in question, it
recreates them using each mechanism available.
Multi-browser cheating won't be affected, of course.
What type of gaming do you want to protect yourself against? Someone creating a couple of bots and bombing you with thousands (millions) of requests? Or someone with nothing better to do trying to cast 10-20 votes?
Yes, I know: both. But which one is your main concern here?
Using a CAPTCHA together with email-based voting (send a link to the email address to validate the vote) might work well against bots. But a human can more or less easily exploit the email system (as I commented in another answer and post here again):
I own a custom domain and I can have any email I want within it.
Another example: if your email is myuser@gmail.com, you could use myuser+1@gmail.com, myuser+2@gmail.com, etc. (the plus sign and the text after it are ignored and the mail is still delivered to your account). You can also insert dots into your username (my.user@gmail.com). (This only works on Gmail addresses!)
To protect against humans, I don't know evercookie, but it might be a good choice. Using OAuth integration with Twitter, FB and other networks might also work well.
Also, remember: requiring an email address in order to vote will scare many people off! You will get far fewer votes!
Another option is to limit the number of votes your system accepts from each IP per minute (or hour, or whatever). To protect against distributed attacks, also limit the total number of votes your system accepts within a timeframe.
Different approach, just to provide an alternative:
Assuming most people know how to behave or just can't be bothered to misbehave, just retroactively clean the votes. This would also keep voting unobtrusive for the voters.
So, set cookies, log every vote and afterwards (or on a time interval?) go through the results and remove duplicates based on the cookie values, IP/UserAgent combinations etc.
I'd assume that not actively blocking multiple votes from the same person keeps the use of highly technical circumvention methods to a minimum, and the results are easy to clean.
As a downside, you probably can't show the actual vote counts live in the user interface, or eyebrows will be raised when a bunch of votes just happen to go missing.
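As a rough illustration of the retroactive cleanup, assuming votes are logged with a cookie ID, IP and user agent (the table, its columns and the MySQL-style multi-table DELETE are assumptions):

<?php
// Keep only the earliest vote per cookie ID and per (IP, user agent) pair.
// Assumes a PDO connection $pdo and a table votes(id, candidate, cookie_id, ip, user_agent, created_at).
$pdo->exec("
    DELETE v FROM votes v
    JOIN votes earlier
      ON earlier.id < v.id
     AND (earlier.cookie_id = v.cookie_id
          OR (earlier.ip = v.ip AND earlier.user_agent = v.user_agent))
");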
Although I probably wouldn't do this myself, look at these cookies; they are pretty hard to get rid of:
http://samy.pl/evercookie/
A different way I approached this problem and fought voting fraud was to require an email address: a person could still vote, but the vote wouldn't count until they clicked a link in the email. This was easier than full-on registration but still very effective in eliminating most of the fraudulent votes.
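A sketch of that flow, assuming a votes table with a confirmation token; the table, URLs, and the use of PHP's mail() are illustrative only:

<?php
// vote.php -- record the vote as unconfirmed and email a confirmation link.
// Assumes a PDO connection $pdo and a table votes(id, candidate, email, token, confirmed).
$token = bin2hex(random_bytes(16));
$stmt  = $pdo->prepare('INSERT INTO votes (candidate, email, token, confirmed) VALUES (?, ?, ?, 0)');
$stmt->execute([$_POST['candidate'], $_POST['email'], $token]);
mail($_POST['email'], 'Confirm your vote',
     "Click to confirm your vote: https://example.com/confirm.php?token=$token");

// confirm.php -- the vote only counts once the emailed link is clicked.
$stmt = $pdo->prepare('UPDATE votes SET confirmed = 1 WHERE token = ? AND confirmed = 0');
$stmt->execute([$_GET['token']]);
echo $stmt->rowCount() ? 'Vote confirmed, thanks!' : 'Invalid or already-used link.';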
If you don't want to force users to log in, consider the evercookie, but require JavaScript to be enabled for the vote to be recorded!
The evercookie is trivial to block because it is JavaScript based. An attacker would likely not use a browser at all; with curl he could generate thousands of requests. However, such tools usually have poor JavaScript support.
Mail is even easier to cheat: if you run your own mail server, you can accept all addresses, so you have a practically unlimited pool of addresses to use.
I'm coding a sweepstakes entry form in PHP where the user submits some information in a form and it is stored in a database.
I would like to find a way to restrict this form to one submission per person, either by dropping a cookie or by IP address. What would be the best way to approach this?
I'm building it on CodeIgniter, if that makes any difference.
Simple answer: log the IP in the same row as the stored information. If you use a cookie, a bot or user can easily remove it, destroying your protection scheme. So simply log the IP address and query it for uniqueness before accepting a new submission.
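A minimal plain-PDO sketch of that check (the entries table and its columns are invented; the same logic maps onto CodeIgniter's query builder):

<?php
// Reject a second entry from the same IP before inserting the submission.
$ip = $_SERVER['REMOTE_ADDR'];

$stmt = $pdo->prepare('SELECT COUNT(*) FROM entries WHERE ip = ?');
$stmt->execute([$ip]);
if ($stmt->fetchColumn() > 0) {
    exit('You have already entered the sweepstakes.');
}

$ins = $pdo->prepare('INSERT INTO entries (name, email, ip, created_at) VALUES (?, ?, ?, NOW())');
$ins->execute([$_POST['name'], $_POST['email'], $ip]);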
They both have their own downsides, tbh. Cookies are easy to forge and easy to remove, which allows multiple votes. Restricting by IP is better, but IP addresses can be shared within networks and can also be proxied to avoid detection. Your best bet is to rely on something like an email address and force the user to click an emailed link to confirm, though admittedly even that isn't great.
There are several methods you can use to mitigate casual cheating. In my view, you should not expect to stop a determined cheater without a more formal validation process (credit card authorization, etc.).
The easiest approach is to ask for a residential address to send the goods to when they win :)
First and foremost, deny the cheater any feedback channel that would let them tell whether their submission was accepted or rejected. If there is a slight delay for accepted entries, make sure you add a fake delay with some jitter, so they can't tell whether their scheme for thwarting your anti-cheating method worked, or even whether you have any anti-cheating methods at all. Detecting bulk submissions by a cheater is much easier when they don't feel they need to be creative.
IP address, as you mentioned. Perhaps use GeoIP, whois, etc. to get distributions over time with respect to area.
User agent and system fingerprinting: there is a huge amount of information you can get from the browser that may or may not be unique. Browser type, version, operating system, screen resolution, color depth, installed fonts, plugins (Flash, PDF, Java, etc.) and their version numbers, language, and the browser's local time (log the client clock skew). See the sketch after this list.
Use of cookies, perhaps hiding references to an innocent-sounding domain in an included JavaScript file you also control. This can be used to correlate the manual deletion of the obvious cookies with the hidden ones. It's less well known that cookies can also be stored in the separate databases of other plugins the user may have, such as Flash Player; these are NOT removed when the browser's cookies are deleted.
Use of images with cache headers. The first time a user visits the site, display an image after their entry is submitted. If they've already filled out the form and submit again, the image will be cached, and you can use the absence of the image request to infer that the submitted entries are a result of cheating.
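As a crude server-side illustration of the fingerprinting point above: client-side signals such as screen resolution and installed fonts would need JavaScript, so only header-based signals are shown, and the salt is a placeholder.

<?php
// Build a rough fingerprint from request headers and store it with the entry.
// These signals are easy to spoof, so treat the hash as one more correlation
// signal, not as proof of identity.
$signals = [
    $_SERVER['HTTP_USER_AGENT']      ?? '',
    $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '',
    $_SERVER['HTTP_ACCEPT_ENCODING'] ?? '',
    $_SERVER['HTTP_ACCEPT']          ?? '',
];
$fingerprint = hash('sha256', implode('|', $signals) . '|some-site-salt');
// Store $fingerprint alongside the IP and timestamp for later correlation.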
Why not do both? Drop a cookie on the user's machine, and also keep an IP address field in the database. That way, if they have changing IP addresses (due to certain ISP configurations), the cookie can catch it, while the database field is more robust and serves as a backup if people don't allow cookies. These solutions will not be 100% foolproof, because a person with a changing IP address who doesn't allow cookies could still slip through. I would check that cookies are enabled to get around this: try to set a cookie and read it back; if you can, you're good to go, otherwise prompt the user to allow cookies.
Best of luck
To add to the others, you could require a login/signup to vote.
As stated by others, cookies are easy to fake or delete. The client IP seen for a single user can change even mid-session, and there may be thousands of users sharing the same client address.
Email addresses are harder to forge, and you can add a verification stage to the process; it's information you need to capture anyway. But do keep track of the user agent and client address each submission originates from and is verified from; then you can make a smart determination about the winner instead of trying to check every submission.
C.
I'm creating a contact form for my company and I want to make it as spam-proof as possible. I've created a honey pot + session checking, but I also want to make it so that it's only possible to submit the form once every x minutes. In other words, banning the IP from using the form for x amount of time.
What is the best solution to do this?
I can think of a few but none of them seem ideal.
Store the user's IP in a database every time the form is submitted, along with a timestamp. When a user submits the form, first check the database to see if they already submitted within the timeframe.
Some problems could arise from large networks where users share the same IP, though. It depends on the target audience, really.
Database. Store the IPs in there and timestamp them.
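A quick sketch of the database approach, assuming a form_submissions table and a PDO connection (the names and the 10-minute window are examples):

<?php
// Reject a submission if this IP already submitted within the last $window seconds.
$window = 10 * 60;                     // "x minutes" from the question -- example value
$ip     = $_SERVER['REMOTE_ADDR'];
$cutoff = date('Y-m-d H:i:s', time() - $window);

$stmt = $pdo->prepare('SELECT COUNT(*) FROM form_submissions WHERE ip = ? AND submitted_at > ?');
$stmt->execute([$ip, $cutoff]);

if ($stmt->fetchColumn() > 0) {
    exit('Please wait a while before submitting the form again.');
}

$pdo->prepare('INSERT INTO form_submissions (ip, submitted_at) VALUES (?, ?)')
    ->execute([$ip, date('Y-m-d H:i:s')]);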
A nice approach I've seen being used on some blogs is to use JavaScript to protect against bots. Like in the onsubmit() event change the method of a form from GET to POST. You can do other magic too. Bots are very inept at executing JavaScript so you can use that to your advantage.
On the other hand - this might hurt the 0.0000001% of users that don't have JavaScript enabled. Well, your choice really. :)
If you don't mind restricting the form to cookie-enabled browsers (eliminating some "browsers" aka bots I assume), you could do something like this:
When the form page loads, it checks for a session variable with a timestamp. If none is found, it creates one and redirects to the same page, but with a GET parameter specifying "action=start" or something. On that second load, if you see $_GET['action'] == 'start', you check for the session variable again; if you still don't find it, you can redirect elsewhere saying cookies are required.
Now you can check the timestamp and do something else if it's been too soon.
This method at least allows multiple users on the same IP: if you're dealing with a large group of people behind a firewall, you don't have to block the whole group.
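A small sketch of that session check; the "action=start" parameter comes from the answer above, while the messages and the 10-minute window are placeholders:

<?php
session_start();

if (!isset($_GET['action'])) {
    // First load: stamp the session, then bounce back to prove the cookie stuck.
    $_SESSION['stamp'] = $_SESSION['stamp'] ?? 0;    // 0 means "no submission yet"
    header('Location: ?action=start');
    exit;
}

if (!isset($_SESSION['stamp'])) {
    // We redirected, but the session variable is gone: cookies are disabled.
    exit('Cookies are required to use this form.');
}

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    if (time() - $_SESSION['stamp'] < 600) {         // too soon since the last submission
        exit('Please wait a few minutes before submitting again.');
    }
    $_SESSION['stamp'] = time();
    // ...process the form here...
}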
The database approach is probably your best bet, because it doesn't require the user to allow anything; it just logs their data. The only issue is that they could be masking their IP or hitting the form from multiple places. I'd try cross-referencing the IP in their session/cookie with the database. If the same person is hitting your site really fast from the same IP address it'll be obvious, but if you create a user ID as well, you can see whether they're rapidly switching IP addresses.
It also wouldn't hurt to have some kind of cron script (or at least a tool written and on standby) ready to clean up any mess that does manage to get through. For my site I'm writing one to flag exactly identical submissions from multiple IPs within a very small timespan (within 10 seconds).
At the very least you could write some queries to show questionable submissions to the comment form.