How to protect a website from bulk scraping / downloading? [duplicate] - php

This question already has answers here:
Top techniques to avoid 'data scraping' from a website database
(14 answers)
Closed 5 years ago.
I have a LAMP server where I run a website, which I want to protect against bulk scraping / downloading. I know that there is no perfect solution for this and that an attacker will always find a way. But I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.
The website has about 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to analyze incoming HTTP requests online and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to keep script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.

Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid keying on IPs and instead use a session, a cookie, or another method to track individuals. Otherwise you risk lumping together groups of users who share an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm, sketched below.
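A minimal leaky-bucket sketch in PHP, assuming session-based tracking; the $capacity and $leakRate values are made-up illustrations, not recommendations:
<?php
session_start();

$capacity = 20;     // burst of requests the bucket can hold
$leakRate = 0.5;    // requests drained per second

$now    = microtime(true);
$bucket = $_SESSION['bucket'] ?? ['level' => 0, 'last' => $now];

// Drain the bucket according to the time elapsed since the previous request.
$bucket['level'] = max(0, $bucket['level'] - ($now - $bucket['last']) * $leakRate);
$bucket['last']  = $now;

// Each request adds one unit; an overflowing bucket means the client is too fast.
$bucket['level'] += 1;
$_SESSION['bucket'] = $bucket;

if ($bucket['level'] > $capacity) {
    http_response_code(429);
    exit('Too many requests');
}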
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
This list, and rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around it. Combating scrapers on your own is a futile effort, but my opinion is biased because I am a cofounder of Distil Networks, which offers anti-scraping protection as a service.

Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users / without providing a mechanism for DOSing your site? Like spam prevention, the best solution is to use several approaches and maintain a score of badness.
You've already mentioned looking at the rate of requests - but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session - you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too, as is tracking 404 rates. Road blocks help (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks - but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
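A rough sketch of that per-session scoring, where each signal only nudges the score and crossing a threshold triggers a road block rather than an outright ban (the weights, the threshold and the /captcha.php page are all illustrative assumptions):
<?php
session_start();

$score = $_SESSION['badness'] ?? 0;

// Each suspicious signal adds to the score instead of blocking outright.
if (empty($_SERVER['HTTP_REFERER'])) {
    $score += 1;                                    // no referer at all
}
if (stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'bot') !== false) {
    $score += 2;                                    // self-declared bot user agent
}
if (($_SESSION['recent_404s'] ?? 0) > 5) {
    $score += 3;                                    // probing for URLs that don't exist
}

$_SESSION['badness'] = $score;

if ($score > 10) {                                  // road block, not a permanent ban
    header('Location: /captcha.php');
    exit;
}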
Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered. Or do the same when you see repeated external hosts appearing in your referer headers.
Do not tar pit connections unless you've got a lot of resources server-side!

Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
Of course you need to set this up to allow search engines to read your content (assuming you want that), and also be aware that if you have any Flash content, the referrer is never set, so you can't use this method.
It also means that any deep links into your site won't work - but maybe you want that anyway?
You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.
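A minimal sketch of that referrer check, with example.com standing in for your own domain (a placeholder):
<?php
$referer = $_SERVER['HTTP_REFERER'] ?? '';
$host    = parse_url($referer, PHP_URL_HOST);

// Empty or foreign referers get the alternate content; the search-engine and
// deep-link caveats from the answer above still apply.
if ($host !== 'example.com' && $host !== 'www.example.com') {
    http_response_code(404);
    exit('Not Found');
}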

Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache itself. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files; if a certain number of failed login attempts were made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
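The same log-watching idea could be adapted to an Apache access log. A rough sketch, assuming the default combined log format, an illustrative threshold, and that the script runs from cron as a user permitted to call iptables:
<?php
// Count hits per IP in the tail of the access log and block the heaviest offenders.
$lines     = array_slice(file('/var/log/apache2/access.log'), -5000);
$threshold = 600;                                   // made-up cut-off for this window
$hits      = [];

foreach ($lines as $line) {
    $ip = strtok($line, ' ');                       // first field of the combined log format
    $hits[$ip] = ($hits[$ip] ?? 0) + 1;
}

foreach ($hits as $ip => $count) {
    if ($count > $threshold && filter_var($ip, FILTER_VALIDATE_IP)) {
        exec('iptables -A INPUT -s ' . escapeshellarg($ip) . ' -j DROP');
    }
}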

If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.

I would advise one of two things.
First, if you have information that other people want, give it to them in a controlled way, say, through an API.
Second, try to copy Google: if you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you through a captcha.
I'd say that if a site is visited 10 times a second, it's probably a bot, so give it a captcha to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try and stop it.

You could use a counter (DB or Session) and redirect the page if the limit is triggered.
<?php
// Runnable version of the pseudocode: the session stands in for the ip/sess check,
// counting requests per client and redirecting once the limit is hit.
session_start();
$limit = 100;                                   // illustrative limit
$_SESSION['count'] = ($_SESSION['count'] ?? 0) + 1;
if ($_SESSION['count'] > $limit) {
    header('Location: /blocked.php');           // hypothetical landing page
    exit;
}
I think dynamically blocking IPs with an IP blocker would work even better.

Related

How to prevent excessive site visits (suspected screen scraping) from hackers?

I have a website that was hacked once to have its database stolen. I think it was done by an automated process that simply accessed the visible website using a series of searches, in the style of 'give me all things beginning with AA', then 'with AB', then 'with AC', and so on. The reality is a little more complicated than this, but that illustrates the principle of the attack. I found the thief and am now taking steps against them, but I want to prevent more of this in the future.
I thought there must be some ready-made PHP (which I use) scripts out there - something that, for instance, recorded the IP addresses of the last (say) 50 visitors, tracked the frequency of their requests over the last (say) 5 minutes, and banned them for (say) 24 hours if they exceeded a certain threshold of requests. However, to my amazement I can find no such class, library or example of code intended for this purpose anywhere online.
Am I missing a trick, or is there a solution here - like the one I imagine, or maybe an even simpler and more effective safeguard?
Thanks.
There are no silver bullets. If you are trying to brainstorm some possible workarounds and solutions, there are none that are particularly easy, but here are some things to consider:
Most screen scrapers will be using curl to do their dirty work. There is some discussion, such as here on SO, about whether trying to block based on User-Agent (or the lack thereof) is a good way to prevent screen scrapes. Ultimately, if it helps at all it is probably a good idea (and Google does it to prevent websites from screen scraping them). Because User-Agent spoofing is possible, this measure can be overcome fairly easily.
Log user requests. If you notice an outlier that is far beyond your average number of user requests (it is up to you to determine what is unacceptable), you can serve them an HTTP 500 error until they revert back to an acceptable range.
Check the number of broken links attempted. If a request to a broken link is served, add it to a log. A few of these should be fine, but someone fishing for data - looking for AA, AB, AC and so on - should be easy to spot. When that occurs, start to serve HTTP 500 errors for all of your pages for a set amount of time. You can do this by serving all of your page requests through a front controller, or by creating a custom 404 (file not found) page and redirecting requests there. The 404 page can log them for you (see the sketch after this list).
Set alerts when there is a sudden change in statistics. This is not to shut anyone down; it is just to get you to investigate. The last thing you want to do is shut someone down by accident, because to them it will just seem like the website is down. If you set up a script to send you an e-mail when there has been a sudden change in usage patterns, before you shut someone down, it can help you adjust your decision making appropriately.
These are all fairly broad concepts, and there are plenty of other solutions or tweaks on this that can work. In order to do it successfully you will need to monitor your own web patterns to determine a safe fix. It is not a small undertaking to craft such a solution (at least not well).
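A minimal custom 404 page along those lines; the apcu_* calls assume the APCu extension is installed and the threshold is illustrative:
<?php
// 404.php - set as the ErrorDocument so every broken-link request lands here.
$ip  = $_SERVER['REMOTE_ADDR'];
$key = "miss_$ip";

if (!apcu_exists($key)) {
    apcu_store($key, 0, 600);                       // start a 10-minute window for this IP
}
$misses = apcu_inc($key);

if ($misses > 20) {                                 // looks like 'AA', 'AB', 'AC'... fishing
    http_response_code(500);                        // go dark for them, as suggested above
    exit;
}
http_response_code(404);
echo 'Page not found';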
A Caveat
This is important: security is always going to be counterbalanced by usability. If you do it right you won't be sacrificing too much security and your users will never run into these issues. Extensive testing is essential, and because of the nature of websites, with downtime being so costly, perform extensive testing whenever you introduce a new security measure, before bringing it live. Otherwise you will have a group of very unhappy people to deal with and a potential mass loss of users. And in the end, screen scraping is probably a better thing to deal with than angry users.
Another caveat
This could interfere with SEO for your web page, as search engines like Google employ screen scraping to keep their records up to date. Again, the note on balance applies. I am sure there is a fix that could be figured out here, but it would stray too far from the original question to look into it.
If you're using Apache, I'd look into mod_evasive:
http://www.zdziarski.com/blog/?page_id=442
mod_evasive is an evasive maneuvers module for Apache to provide
evasive action in the event of an HTTP DoS or DDoS attack or brute
force attack. It is also designed to be a detection and network
management tool, and can be easily configured to talk to ipchains,
firewalls, routers, and etcetera. mod_evasive presently reports abuses
via email and syslog facilities.
...
"Detection is performed by creating an internal dynamic hash table of
IP Addresses and URIs, and denying any single IP address from any of
the following:
Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)"

Efficiently limit number of hits per minute (block web scraping or copy-bots) in PHP

I am faced with the problem of bots copying all the content off my webpage (which I try to update quite often).
I try to ban them, or obfuscate code to make it more difficult to copy. However, they find some way to overcome these limitations.
I'd like to try to limit the number of hits per minute (or per X amount of time, not necessarily minutes), and use a captcha to let people get past those limits. Something like: if you've requested more than 10 pages in the last 5 minutes, you need to prove you are human using a captcha. That way, a legitimate user will be able to continue surfing the site.
I'd like to do it only on the content pages (to do it more efficiently). I had thought of Memcached, but since I don't own the server, I can't use it. If I were using servlets I'd use a HashMap or similar, but since I use PHP, I am still trying to think of a solution.
I don't see MySQL (or databases) as a solution, since I can have many hits per second, and I would have to keep deleting old requests after a few minutes, creating a lot of unnecessary and inefficient traffic.
Any ideas?
A summary:
If I get too many hits per minute on a section of the webpage, I'd like to limit it efficiently using a captcha, in PHP. Something like: if you've requested more than 10 pages in the last 5 minutes, you need to prove you are human using a captcha.
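Since MySQL and Memcached are ruled out, one in-process option is the APCu extension (assuming it is available on the host). A sketch of the counter described above, with /captcha.php as a hypothetical challenge page:
<?php
session_start();

// Per-IP hit counter kept in APCu: more than 10 content pages in 5 minutes triggers the captcha.
$key = 'hits_' . $_SERVER['REMOTE_ADDR'];
if (!apcu_exists($key)) {
    apcu_store($key, 0, 300);                       // 5-minute window per IP
}
if (apcu_inc($key) > 10 && empty($_SESSION['captcha_passed'])) {
    header('Location: /captcha.php');               // hypothetical page that sets $_SESSION['captcha_passed']
    exit;
}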
Your question kind of goes against the spirit of the internet.
Everyone copies/borrows from everyone
Every search engine has a copy of everything else on the web
I would guess the problem you're having is that these bots are stealing your traffic? If so, I'd suggest you try implementing an API allowing them to use your content legitimately.
This way you can control access, and crucially you can ask for a linkback to your site in return for using your content. This way your site should be number 1 for the content. You don't even really need an API to implement this policy.
If you insist on restricting user access you have the following choices:
Use a javascript solution and load the content into the page using Ajax. Even this is not going to fool the best bots.
Put all your content behind a username/password system.
Block offending IPs - it's a maintenance nightmare and you'll never have a guarantee but it'll probably help.
The problem is - if you want your content to be found by Google AND restricted to other bots you're asking the impossible.
Your best option is to create an API and control people copying your stuff rather than trying to prevent it.

PHP: Differentiate between a human user and a bot / other

I want to, using PHP, differentiate between an actual person and a bot. I currently track page views and they are massively inflated due to bots crawling my pages, so I want to only record real people. It doesn't matter if it's not 100% accurate; I just want a nice simple way to do it via PHP.
To be clear, this is not for analytics per se; it is so that I can track which images are being served daily so I can produce a "top images of the day" sort of script.
You should be checking the user agent string; most well-behaved search bots will report themselves as such.
Google's spider for example.
First, the obvious: check the user agent.
I use another trick that works pretty well. I map robots.txt to a PHP file and log the IP into the database. Then when logging user activity, I make sure the visitor isn't from one of those logged IPs. If the user authenticates via the login system, I track them regardless.
Of course neither solution guarantees any accuracy, but for general logging, it has been sufficient for my purposes.
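A sketch of that robots.txt trick; the rewrite of robots.txt to this script and the SQLite storage are assumptions, since the answer doesn't show its own setup:
<?php
// robots.php - served in place of robots.txt (e.g. via a RewriteRule), so mostly bots fetch it.
header('Content-Type: text/plain');

$pdo = new PDO('sqlite:/var/data/bots.sqlite');
$pdo->exec('CREATE TABLE IF NOT EXISTS bot_ips (ip TEXT PRIMARY KEY, seen INTEGER)');
$stmt = $pdo->prepare('INSERT OR REPLACE INTO bot_ips (ip, seen) VALUES (?, ?)');
$stmt->execute([$_SERVER['REMOTE_ADDR'], time()]);

// Still behave like a normal robots.txt for the crawler.
echo "User-agent: *\nDisallow:\n";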
I'm not sure that PHP is the best solution for this kind of problem.
You can read How to block bad bots and How to block spambots, ban spybots, and tell unwanted robots to go to hell to see more solutions about blocking bots but this time with apache.
Apache will act faster and require less CPU for this sort of task than a PHP program.

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers. It isn't foolproof, of course (see the GD sketch at the end of this answer).
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
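A minimal sketch of the image-encoding idea from the list above, using PHP's GD extension (assuming GD is installed; the sizing and font are arbitrary):
<?php
// Render a sensitive value (e.g. a phone number) as a PNG instead of text.
function valueAsImage(string $value): void
{
    $img = imagecreatetruecolor(10 * max(1, strlen($value)), 20);
    $bg  = imagecolorallocate($img, 255, 255, 255);
    $fg  = imagecolorallocate($img, 0, 0, 0);
    imagefilledrectangle($img, 0, 0, imagesx($img) - 1, imagesy($img) - 1, $bg);
    imagestring($img, 4, 2, 2, $value, $fg);        // GD built-in font, drawn top-left
    header('Content-Type: image/png');
    imagepng($img);
    imagedestroy($img);
}

// In practice the value would be looked up server-side by record id, never taken from the query string.
valueAsImage('555-0100');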
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (a sketch follows below).
Try using Flash or Silverlight for your frontend.
While this can't stop someone who is really determined, it would make things more difficult. If you're loading your data through services, you can always use a secure connection to prevent man-in-the-middle scraping.
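A sketch of the session-hash handshake from the second point; page.php and data.php are hypothetical names for the shell and the AJAX endpoint:
<?php
// page.php - emit the shell plus a short-lived token the AJAX call must echo back.
session_start();
$_SESSION['ajax_token']   = bin2hex(random_bytes(16));
$_SESSION['ajax_expires'] = time() + 10;            // valid for roughly 10 seconds
echo '<div id="content" data-token="' . $_SESSION['ajax_token'] . '"></div>';

<?php
// data.php - only answer if the token matches and has not expired.
session_start();
$ok = isset($_GET['token'], $_SESSION['ajax_token'])
    && hash_equals($_SESSION['ajax_token'], $_GET['token'])
    && time() <= ($_SESSION['ajax_expires'] ?? 0);
if (!$ok) {
    http_response_code(403);
    exit;
}
echo json_encode(['rows' => ['...data fetched from the database...']]);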
force a reCAPTCHA every 10 page loads for each unique IP
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.
He's asking for two incongruous things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
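A quick sketch of that kind of check: default command-line user agents plus a header real browsers always send (the patterns are illustrative, not exhaustive):
<?php
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

// curl, wget and similar tools announce themselves unless the caller overrides the UA,
// and they don't send Accept-Language the way real browsers do.
$looksAutomated = preg_match('/curl|wget|libwww|python-requests/i', $ua)
    || $ua === ''
    || empty($_SERVER['HTTP_ACCEPT_LANGUAGE']);

if ($looksAutomated) {
    http_response_code(403);
    exit('Forbidden');
}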
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection? If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you could then start to present garbage data, like changing a couple of digits of the phone numbers or adding silly names to name fields.
Turn this off for Google IPs!
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
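A sketch of the exponentially growing delay, keyed on a per-session hit count (the starting point, base delay and cap are made-up numbers; the count would be reset when a CAPTCHA is solved):
<?php
session_start();

$hits = $_SESSION['hits'] = ($_SESSION['hits'] ?? 0) + 1;

if ($hits > 20) {                                   // normal readers rarely get this far
    // The delay doubles with every extra page, capped so requests don't hang forever.
    $delay = min(30, 0.1 * (2 ** ($hits - 20)));    // seconds
    usleep((int) ($delay * 1000000));
}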
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.

Will <insert popular website here> restrict me from accessing their website if I request it too many times?

I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look for about 17,000 values on the browse function of blogger and (anonymously) save certain ones if they fit the right criteria.
I've been running the spider (written in PHP) and it works fine, but I don't want to have my IP blacklisted or anything like that. Does anyone have any knowledge on enterprise sites and the restrictions they have on things like this?
Furthermore, if there are restrictions in place, is there anything I can do to circumvent them? At the moment all I can think of to help the problem slightly is; adding a random delay between calls to the site (between 0 and 5 seconds) or running the script through random proxies to disguise the requests.
Having to do things like the methods above makes me feel as if I'm doing the wrong thing. I would be annoyed if they were to block me for whatever reason, because blogger.com is owned by Google and their main product is a web spider. Albeit their spider does not send its requests to just one website.
It's likely they have some kind of restriction, and yes there are ways to circumvent them (bot farms and using random proxies for example) but it is likely that none of them would be exactly legal, nor very feasible technically :)
If you are accessing blogger, can't you log in using an API key and query the data directly, anyway? It would be more reliable and less trouble-prone than scraping their page, which may be prohibited anyway, and lead to trouble once the number of requests is big enough that they start to care. Google is very generous with the amount of traffic they allow per API key.
If all else fails, why not write an E-Mail to them. Google have a reputation of being friendly towards academic projects and they might well grant you more traffic if needed.
Since you are writing a spider, make sure it reads the robots.txt file and acts accordingly. Also, one of the conventions of HTTP is not to make more than 2 concurrent requests to the same server. Don't worry, Google's servers are really powerful. If you only read pages one at a time, they probably won't even notice. If you inject a 1-second interval, it will be completely harmless.
On the other hand, using a botnet or other distributed approach is considered harmful behavior, because it looks like DDOS attack. You really shouldn't be thinking in that direction.
If you want to know for sure, write an eMail to blogger.com and ask them.
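A sketch of that kind of polite fetching loop; the robots.txt handling is deliberately crude (it ignores per-agent sections) and $urlsToVisit is a placeholder for whatever list the spider builds:
<?php
// Read the Disallow lines once, then fetch pages one at a time with a random pause.
$robots     = file_get_contents('https://www.blogger.com/robots.txt');
$disallowed = [];
foreach (explode("\n", (string) $robots) as $line) {
    if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
        $disallowed[] = $m[1];
    }
}

$urlsToVisit = [/* built elsewhere by the spider */];

foreach ($urlsToVisit as $url) {
    $path    = parse_url($url, PHP_URL_PATH) ?? '/';
    $blocked = false;
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) { $blocked = true; break; }
    }
    if ($blocked) {
        continue;
    }
    $html = file_get_contents($url);                // one request at a time, never in parallel
    // ... extract and save the values the project needs ...
    sleep(rand(1, 5));                              // random delay, as suggested in the question
}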
You could request it through Tor; you would have a different IP each time, at a performance cost.
