I have a website that has been hacked once to have its database stolen. I think it was done by an automated process that simply accessed the visible website using a series of searches, in the style of 'give me all things beginning with AA', then 'with AB', then 'with AC' and so on. The reality is a little more complicated than this, but that illustrates the principle of the attack. I found the thief and am now taking steps against them, but I want to prevent more attacks like this in the future.
I thought there must be some ready-made PHP scripts out there (PHP is what I use). Something that, for instance, recorded the IP addresses of the last (say) 50 visitors and tracked the frequency of their requests over the last (say) 5 minutes, banning them for (say) 24 hours if they exceeded a certain threshold of requests. However, to my amazement I can find no such class, library or example of code intended for this purpose anywhere online.
Am I missing a trick, or is there a solution here - like the one I imagine, or maybe an even simpler and more effective safeguard?
Thanks.
There are no silver bullets. If you are trying to brainstorm possible workarounds and solutions, none of them are particularly easy, but here are some things to consider:
Most screen scrapers will be using curl to do their dirty work. There is some discussion, such as here on SO, about whether trying to block based on User-Agent (or lack thereof) is a good way to prevent screen scrapes. Ultimately, if it helps at all, it is probably a good idea (and Google does it to prevent websites from screen scraping them). However, because User-Agent spoofing is possible, this measure can be overcome fairly easily.
Log user requests. If you notice an outlier that is far beyond your average number of user requests (it is up to you to determine what is unacceptable), you can serve them an HTTP 500 error until they return to an acceptable range.
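One way to sketch that in PHP (a minimal illustration, not production code; the temp-file storage and the 100-requests-per-5-minutes limit are arbitrary choices):

<?php
// Minimal per-IP request throttle (sketch). Assumes a writable temp dir
// and purely illustrative limits: 100 requests per 5-minute window.
$ip     = $_SERVER['REMOTE_ADDR'];
$window = 300;   // seconds
$limit  = 100;   // requests allowed per window
$file   = sys_get_temp_dir() . '/req_' . md5($ip);

$hits = is_file($file) ? (json_decode(file_get_contents($file), true) ?: []) : [];
$now  = time();
$hits = array_values(array_filter($hits, fn($t) => $t > $now - $window));
$hits[] = $now;
file_put_contents($file, json_encode($hits), LOCK_EX);

if (count($hits) > $limit) {
    http_response_code(500);
    exit('Temporarily unavailable.');
}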
Check the number of broken links attempted. If a request for a broken link is served, add it to a log. A few of these should be fine, but it should be fairly easy to spot someone who is fishing for data (if they are requesting AA, AB, AC, etc.). When that occurs, start serving HTTP 500 errors for all of your pages for a set amount of time. You can do this by serving all of your page requests through a Front Controller, or by creating a custom 404 (file not found) page and redirecting requests there. The 404 page can log them for you.
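A custom 404 page along those lines might look roughly like this (a sketch; the file names and the ten-miss threshold are made up for illustration):

<?php
// Custom 404 page (sketch): log misses per IP and flag heavy probers.
$ip       = $_SERVER['REMOTE_ADDR'];
$missFile = __DIR__ . '/logs/404_' . md5($ip);
$misses   = is_file($missFile) ? (int) file_get_contents($missFile) : 0;
file_put_contents($missFile, ++$misses, LOCK_EX);

// Illustrative threshold: ten misses flags the IP for the front controller.
if ($misses > 10) {
    touch(__DIR__ . '/logs/blocked_' . md5($ip));
}

http_response_code(404);
echo 'Page not found.';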
Set up alerts when there is a sudden change in statistics. This is not to shut anyone down; it is just to get you to investigate. The last thing you want to do is shut someone down by accident, because to them it will just seem like the website is down. If you set up a script that e-mails you when there has been a sudden change in usage patterns, you can review the situation before deciding whether to shut anyone down.
These are all fairly broad concepts, and there are plenty of other solutions or tweaks on them that can work. To do it successfully you will need to monitor your own traffic patterns in order to determine safe thresholds. Crafting such a solution well is not a small undertaking.
A Caveat
This is important: security is always going to be counterbalanced by usability. If you do it right you won't be sacrificing too much usability and your users will never run into these issues. Extensive testing is essential; because of the nature of websites, downtime is costly, so test extensively whenever you introduce a new security measure, before bringing it live. Otherwise, you will have a group of very unhappy people to deal with and a potential loss of users en masse. And in the end, screen scraping is probably a better thing to deal with than angry users.
Another Caveat
This could interfere with SEO for your web page, as search engines like Google employ screen scraping to keep their records up to date. Again, the note on balance applies. I am sure there is a fix here that can be figured out, but it would stray too far from the original question to look into it.
If you're using Apache, I'd look into mod_evasive:
http://www.zdziarski.com/blog/?page_id=442
mod_evasive is an evasive maneuvers module for Apache to provide evasive action in the event of an HTTP DoS or DDoS attack or brute force attack. It is also designed to be a detection and network management tool, and can be easily configured to talk to ipchains, firewalls, routers, and etcetera. mod_evasive presently reports abuses via email and syslog facilities.
...
Detection is performed by creating an internal dynamic hash table of IP Addresses and URIs, and denying any single IP address from any of the following:
Requesting the same page more than a few times per second
Making more than 50 concurrent requests on the same child per second
Making any requests while temporarily blacklisted (on a blocking list)
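A typical configuration looks something like the following (the numbers are only illustrative, and the directive names and module identifier can vary slightly between mod_evasive versions and packages, so check the documentation that ships with yours):

<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    DOSPageCount        5       # same page more than 5 times...
    DOSPageInterval     1       # ...within 1 second
    DOSSiteCount        50      # more than 50 requests to the site...
    DOSSiteInterval     1       # ...within 1 second
    DOSBlockingPeriod   600     # keep blocking for 10 minutes
    DOSEmailNotify      admin@example.com
</IfModule>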
Related
Is there any way to block crawlers/spider bots if they're not obeying the rules written in the robots.txt file? If so, where can I find more info about it?
I would prefer some .htaccess rule; if not, then PHP.
There are ways to prevent most bots from spidering your site.
Aside from filtering by user agent and known IP addresses, you should also implement behaviour-driven blocking. That means: if it acts like a crawler, block it.
You can find multiple lists of search engine bots here. But most of the big players obey the robots.txt.
So the other, rather big part is blocking based on the bot's behaviour. Things get less complicated when you are using a framework like Laravel or Symfony, because you can easily set a filter to be executed before every page load. If not, you'd have to implement a function which is called before every page load.
Now there are some things to consider. A spider usually crawls as fast as it can. So you could use the session to measure the time between page loads and the number of page loads in a given time span. If some amount X is exceeded, the client is blocked.
Sadly, this approach relies on the bot handling sessions/cookies correctly, which may not always be the case.
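A minimal sketch of that session-based check (the 30-requests-per-minute limit is invented for illustration):

<?php
// Session-based crawl-speed check (sketch). Blocks clients that request
// more than 30 pages in any rolling 60-second window.
session_start();
$now = time();
$_SESSION['hits'] = array_filter($_SESSION['hits'] ?? [], fn($t) => $t > $now - 60);
$_SESSION['hits'][] = $now;

if (count($_SESSION['hits']) > 30) {
    http_response_code(429);
    exit('Too many requests - slow down.');
}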
Another, or an additional, approach would be to measure the number of page loads from a given IP address. This is dangerous because there may also be a large number of legitimate users sharing the same IP address, so this may lock out humans.
A third approach I can think of is to use some kind of honeypot. Create a link that leads to a specific page. That link has to be visible to computers, but not to humans: hide it away with some CSS. If someone or something accesses the page via the hidden link, you can be (close to) sure it is a program. But be aware, there are browser addons which preload every link they can find, so you cannot rely totally on this.
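A minimal honeypot sketch (the /trap.php name and the log path are placeholders). The hidden link, placed somewhere in your template:

<a href="/trap.php" style="display:none" rel="nofollow">archive</a>

And the trap page itself simply records whoever follows it:

<?php
// trap.php (sketch): anything that follows the hidden link gets logged.
file_put_contents(
    __DIR__ . '/logs/honeypot.log',
    date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . PHP_EOL,
    FILE_APPEND | LOCK_EX
);
http_response_code(404); // pretend the page does not exist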
Depending on the nature of your site, one last approach would be to hide the complete site behind a captcha. This is a harsh measure in terms of usability, so decide carefully whether it applies to your use case.
Then there are techniques like using Flash or complicated JavaScript that most bots do not understand, but it's disgusting and I don't want to talk about it. ^^
Finally, I will now come to a conclusion.
By using a well written robots.txt most robots will leave you alone. In addition to that, you should combine all or some of the approaches mentioned beforehand to get the bad guys.
After all, as long as your site is publicly available, you can never evade a custom-made bot tailored specifically for your site. If a browser can parse it, a robot can as well.
For a more useful answer I would need to know what you are trying to hide and why.
I have a LAMP server where I run a website which I want to protect against bulk scraping/downloading. I know that there is no perfect solution for this and that an attacker will always find a way, but I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.
This website has approximately 5000 subpages with valuable text data and a couple of pictures on each page. I would like to be able to analyse incoming HTTP requests online and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to limit script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or some other method to track individuals. Otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
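A rough sketch of a leaky bucket limiter keyed on the session (the capacity and leak rate below are arbitrary illustrative numbers):

<?php
// Leaky bucket rate limiter (sketch), keyed on the PHP session.
session_start();

$capacity = 20;    // bucket size: maximum burst of requests
$leakRate = 0.5;   // requests "leaking" out per second
$now      = microtime(true);

$level    = $_SESSION['bucket_level'] ?? 0.0;
$lastSeen = $_SESSION['bucket_time'] ?? $now;

// Drain the bucket according to how much time has passed.
$level = max(0.0, $level - ($now - $lastSeen) * $leakRate);

if ($level + 1 > $capacity) {
    http_response_code(429);   // bucket overflow: reject the request
    exit('Rate limit exceeded.');
}

$_SESSION['bucket_level'] = $level + 1;
$_SESSION['bucket_time']  = $now;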
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
This list, and rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around it. Combating scrapers on your own is a futile effort, but my opinion is biased because I am a co-founder of Distil Networks, which offers anti-scraping protection as a service.
Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users or providing a mechanism for DOSing your site? As with spam prevention, the best solution is to use several approaches and maintain a badness score.
You've already mentioned looking at the rate of requests, but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session, and log them in with no username/password. Checking the referer (and that the referer really does point to a current content item) is a good idea too, as is tracking 404 rates. Add road blocks (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
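A sketch of what such a scoring mechanism could look like (all weights and the threshold are invented, and /captcha.php is a hypothetical road-block page):

<?php
// Badness-score sketch: several weak signals add up to one decision.
session_start();
$score = $_SESSION['badness'] ?? 0;

// Missing or empty referer on a deep page.
if (empty($_SERVER['HTTP_REFERER'])) {
    $score += 1;
}
// Suspicious or missing user agent (weak signal, never decisive alone).
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if ($ua === '' || stripos($ua, 'curl') !== false || stripos($ua, 'wget') !== false) {
    $score += 2;
}
// A 404-tracking handler could also add to the score, e.g. $score += 3;

$_SESSION['badness'] = $score;

if ($score > 10) {
    header('Location: /captcha.php'); // road block: captcha or login
    exit;
}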
Another approach, rather than interrupting the flow, is when the thresholds are triggered start substituting content. Or do the same when you get repeated external hosts appearing in your referer headers.
Do not tar pit connections unless you've got a lot of resource serverside!
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (ie the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 not found).
Of course you need to set this up to allow search engines to read your content (assuming you want that) and also be aware that if you have any flash content, the referrer is never set, so you can't use this method.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also just enable it for images which makes it a bit harder for them to be scraped from the site.
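A minimal sketch of that referrer check (example.com is a placeholder; bear in mind the Referer header is trivially forged, and this also blocks direct visits and bookmarks, as noted above):

<?php
// Referrer check (sketch): serve normal content only when the visitor
// arrived via a link on our own domain. "example.com" is a placeholder.
$referer = $_SERVER['HTTP_REFERER'] ?? '';
$host    = parse_url($referer, PHP_URL_HOST);

if ($host !== 'example.com' && $host !== 'www.example.com') {
    http_response_code(404);   // or serve alternate/limited content
    exit('Not found.');
}
// ...render the normal page here...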
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
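In context, that deny rule might look like this (Apache 2.2-style syntax; newer Apache 2.4 setups would express it with the Require directives instead, and the directory path is a placeholder):

<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=downloader
</Directory>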
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache alone. I had a similar problem with ssh and saslauth, so I wrote a script to monitor the log files, and if there were a certain number of failed login attempts made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
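Applied to web traffic, that same idea could be sketched as a cron-run PHP script (the log path, the threshold and the use of iptables are assumptions for illustration, and the script would need the privileges to change the firewall):

<?php
// Cron-run sketch: count recent requests per IP in the access log and
// block heavy hitters with iptables. Paths and thresholds are illustrative.
$log       = '/var/log/apache2/access.log';
$threshold = 600;   // requests seen in the tail of the log
$counts    = [];

foreach (array_slice(file($log) ?: [], -5000) as $line) {
    $ip = strtok($line, ' ');          // first field of a combined-format log
    $counts[$ip] = ($counts[$ip] ?? 0) + 1;
}

foreach ($counts as $ip => $n) {
    if ($n > $threshold && filter_var($ip, FILTER_VALIDATE_IP)) {
        exec('iptables -I INPUT -s ' . escapeshellarg($ip) . ' -j DROP');
    }
}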
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
The first would be: if you have information that other people want, give it to them in a controlled way, say through an API.
The second would be to copy Google: if you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you to solve a captcha.
I'd say that if a site is visited 10 times a second, it's probably a bot. So give it a captcha to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try and stop it.
You could use a counter (DB or Session) and redirect the page if the limit is triggered.
<?php
// Session-based request counter (sketch): redirect once the limit is hit.
session_start();
$limit = 100; // illustrative threshold
$_SESSION['count'] = ($_SESSION['count'] ?? 0) + 1;
if ($_SESSION['count'] > $limit) {
    header('Location: /limit-reached.php'); // placeholder destination
    exit;
}
I think dynamically blocking IPs using an IP blocker will work better.
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course (see the GD sketch below).
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
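To illustrate the image-encoding option mentioned above, here is a rough sketch using PHP's GD extension (the value, font, sizes and colours are arbitrary):

<?php
// text_as_image.php (sketch): render a sensitive value as a PNG so that
// plain text parsers cannot read it. Requires the GD extension.
$value = 'example@example.com'; // the value you want to hide from parsers

$img = imagecreatetruecolor(10 * strlen($value), 20);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, imagesx($img), imagesy($img), $bg);
imagestring($img, 5, 2, 2, $value, $fg);

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);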
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (see the sketch after this list).
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
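A sketch of the AJAX-token idea from the list above (the file names and the 10-second lifetime are placeholders). The page shell issues a short-lived token:

<?php
// page.php (sketch): the page shell stores a short-lived token in the session.
session_start();
$_SESSION['ajax_token']   = bin2hex(random_bytes(16));
$_SESSION['ajax_expires'] = time() + 10;   // valid for 10 seconds
echo '<div id="data" data-token="' . htmlspecialchars($_SESSION['ajax_token']) . '"></div>';

JavaScript on the page would then POST that token to the data endpoint, which verifies it:

<?php
// data.php (sketch): the AJAX endpoint only answers with a valid token.
session_start();
$ok = isset($_POST['token'], $_SESSION['ajax_token'])
    && hash_equals($_SESSION['ajax_token'], $_POST['token'])
    && time() <= ($_SESSION['ajax_expires'] ?? 0);
if (!$ok) {
    http_response_code(403);
    exit;
}
echo json_encode(['rows' => []]); // replace with the real data lookup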
force a reCAPTCHA every 10 page loads for each unique IP
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
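A sketch of that combination (the thresholds, delays and /captcha.php page are all invented; note that sleeping in PHP ties up a worker per request, which is the tar-pit trade-off mentioned in an earlier answer):

<?php
// Sketch: exponentially growing delay, then a CAPTCHA, for fast clients.
session_start();
$now = time();
$_SESSION['loads'] = array_filter($_SESSION['loads'] ?? [], fn($t) => $t > $now - 60);
$_SESSION['loads'][] = $now;
$n = count($_SESSION['loads']);

if ($n > 30) {                        // roughly tens of pages per minute
    header('Location: /captcha.php'); // hypothetical CAPTCHA page
    exit;
}
if ($n > 10) {
    // delay starts at 0.1 s and doubles per extra request, capped at 5 s
    usleep((int) (min(5, 0.1 * 2 ** ($n - 10)) * 1000000));
}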
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your pagerank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
How can you prevent "referral scams"?
For example, in a wordpress-based site of mine, I suddenly noticed that someone clicked a link from some site I had never heard of. When I followed the link, there was obviously not a link to MY site. The site was selling products, in this case books. All comments followed a similar speech pattern, and the website URL for each owner of these "comments" was the amazon.com link to the product.
It was obviously a scam, so I quickly backed away from the website.
Is there any way to prevent these forged referrals via PHP?
Some way of telling if they are automated or do not come from a reputable source?
I am afraid the answer is that you can't. There is no way to control what referrer people send to you.
You can reduce it by doing as Chris suggested. But as a rule anyone who uses a bot to deliberately create this type of spam will change the User-Agent string. Heck I do it to prevent the stupid firewall I am behind from preventing me using Firefox, because hey we know how safe IE is.
So using that technique will only stop a very small percentage.
The important thing to remember is that anyone can fake anything sent to your server: form values, HTTP headers, cookies, even IP addresses. So don't trust any of it, and don't worry about it.
Not the answer you wanted, but unfortunately the only real one. If you really must, you could take the referrer, scrape that page, and ignore the referral if no link to your site is found. But that's a lot of work, and it misses JavaScript-created links (from ads etc.).
Sometimes you get a bad referrer simply from a broken browser or scraping software or even a search bot.
Depending on how much control you have over the server, you might find it useful to install mod_security (Apache module). mod_security acts as a firewall for Apache, allowing you to block requests that match (or do not match) a set of criteria (including user agent, referring site, etc.).
Here is a blog post that has information on using mod_security to deal with referral spam:
http://atomicplayboy.net/blog/2005/01/30/an-introduction-to-mod-security/
There are ways to prevent this; even 12 years later this continues to happen. Bizarrely, this was a bona-fide tactic to improve rankings for some time. People would install MediaWiki two months before launch and then delete it at launch. The downside was that the site would appear, to the educated, to have been compromised. But the educated did not click links in spam.
Moderate your comments: do not just let them be posted, but review every one. This was "Newsgroup 100" back in the day.
Don't allow comments at all. This will hurt your character and your reputation, as something you host may differ from accepted wisdom.
Install a plugin to help with moderation. Tune it.
But yes, you need a MODERATOR and/or an APPROVER. A daily task with a queue.
I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look for about 17,000 values on the browse function of blogger and (anonymously) save certain ones if they fit the right criteria.
I've been running the spider (written in PHP) and it works fine, but I don't want to have my IP blacklisted or anything like that. Does anyone have any knowledge on enterprise sites and the restrictions they have on things like this?
Furthermore, if there are restrictions in place, is there anything I can do to circumvent them? At the moment all I can think of to help the problem slightly is adding a random delay between calls to the site (between 0 and 5 seconds), or running the script through random proxies to disguise the requests.
Having to resort to methods like those above makes me feel as if I'm doing the wrong thing. I would be annoyed if they were to block me for whatever reason, because blogger.com is owned by Google and their main product is a web spider. Albeit, their spider does not send its requests to just one website.
It's likely they have some kind of restriction, and yes there are ways to circumvent them (bot farms and using random proxies for example) but it is likely that none of them would be exactly legal, nor very feasible technically :)
If you are accessing blogger, can't you log in using an API key and query the data directly anyway? It would be more reliable and less trouble-prone than scraping their pages, which may be prohibited anyway and could lead to trouble once the number of requests is big enough that they start to care. Google is very generous with the amount of traffic it allows per API key.
If all else fails, why not write an e-mail to them? Google has a reputation for being friendly towards academic projects, and they might well grant you more traffic if needed.
Since you are writing a spider, make sure it reads the robots.txt file and acts accordingly. Also, one of the rules of HTTP is not to have more than 2 concurrent requests to the same server. Don't worry, Google's servers are really powerful. If you only read pages one at a time, they probably won't even notice. If you inject a 1-second interval, it will be completely harmless.
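A polite-crawling sketch along those lines (the paths are placeholders and the robots.txt handling is deliberately crude; a proper robots.txt parser library would be better for real use):

<?php
// Polite crawling sketch: respect robots.txt Disallow rules (crudely)
// and pause between requests. The listing paths below are placeholders.
$base   = 'https://www.blogger.com';
$robots = @file_get_contents($base . '/robots.txt') ?: '';

// Very crude: collect every Disallow prefix, regardless of user-agent block.
preg_match_all('/^Disallow:\s*(\S+)/mi', $robots, $m);
$disallowed = $m[1];

function allowed(string $path, array $disallowed): bool {
    foreach ($disallowed as $prefix) {
        if ($prefix !== '' && strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

foreach (['/some/listing?page=1', '/some/listing?page=2'] as $path) { // placeholders
    if (!allowed($path, $disallowed)) {
        continue;
    }
    $html = @file_get_contents($base . $path);
    // ...extract and save the values you need here...
    sleep(1);   // one request at a time, with a pause between them
}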
On the other hand, using a botnet or other distributed approach is considered harmful behaviour, because it looks like a DDoS attack. You really shouldn't be thinking in that direction.
If you want to know for sure, write an e-mail to blogger.com and ask them.
You could request it through TOR; you would have a different IP each time, at a performance cost.