For the past few hours, our website has been receiving multiple requests from various IPs every second (maybe 4 or 5 requests per second).
The website's usual traffic is about 3 to 5 requests per minute.
The requests are very random, for example:
/gtalczp/197zbcylgxpoaj-26228e-dtmlnaibx/
/109/jxwhezsivr/10445_xwvpfdyzhea.cgi
/nouyaku.html
/index.php/43e3133-pmuwbfgoedakvxs/
/keyword_list/s_index=L
The site's indexing in Google is now all in Japanese characters and messed up.
I have tried blocking the IPs (via .htaccess) that make all these random requests, but every time a new IP makes a new request. How can I stop all of these requests? Can I use an .htaccess rule that allows only the links that actually exist on the site?
EDIT: Our site is running the latest version of WordPress, with custom-built features. If this was some kind of hack, how can I find the infected files/database tables?
EDIT 2: These look like legitimate Google bots, but why are they trying to access these random links which don't exist?
This traffic is coming from automated security scanners. They scan blocks of IP ranges used by AWS, DigitalOcean, etc., looking for known security bugs on the web server.
Can you stop it? Sort of.
One quick way to catch the low-hanging fruit is to put a /password.txt at the root of the web server. Every scanner on this planet will scan for that. Block any IP that accesses it; you can use Fail2Ban for this.
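A minimal sketch of that honeypot-plus-Fail2Ban idea (the filter name, log path and ban time below are assumptions to adapt to your server):

    # /etc/fail2ban/filter.d/scanner-honeypot.conf
    # Match any request for the /password.txt honeypot in a combined-format access log.
    [Definition]
    failregex = ^<HOST> .* "(GET|POST|HEAD) /password\.txt
    ignoreregex =

    # /etc/fail2ban/jail.local
    [scanner-honeypot]
    # one hit on the honeypot is enough; ban the IP for 24 hours
    enabled  = true
    port     = http,https
    filter   = scanner-honeypot
    logpath  = /var/log/apache2/access.log
    maxretry = 1
    bantime  = 86400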
You can also rate-limit access to your web server. If a client is requesting pages very quickly it is likely a scanner, in which case you can ban its IP. But it could also be a search-engine spider, in which case banning it will likely hurt your SEO.
Requests for slugs containing Japanese keywords like nouyaku, combined with Google indexing your pages in Japanese, might well indicate the Japanese Keyword Hack. This Google article provides an explanation and some general fixes and preventive measures: https://developers.google.com/web/fundamentals/security/hacked/fixing_the_japanese_keyword_hack
Fixing WordPress hacks is already covered elsewhere: you will find numerous questions and answers about this on Stack Overflow or via Google.
.htaccess: Google's article advises replacing your .htaccess. A useful start would be adding and tweaking Jeff Starr's 6G "Firewall" or 7G (beta) code.
The rate of requests is DDoS-like, so it makes sense to cater for this at the same time (e.g. mod_evasive, Fail2ban, and ModSecurity); google "protecting Apache from DDoS attacks".
DDoS, brute force and WordPress: stopping dodgy requests before any PHP/WordPress code or SQL runs will massively reduce server load. If there is no need for the public to log in to WordPress, then use .htaccess to password-protect wp-login.php and maybe also the wp-admin folder (this may cause problems on some sites).
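A minimal sketch of that .htaccess protection (the .htpasswd path is an assumption; create the file with the htpasswd utility, e.g. htpasswd -c /home/example/.htpasswd adminuser):

    # In the site's root .htaccess: require an extra password for wp-login.php
    <Files "wp-login.php">
        AuthType Basic
        AuthName "Restricted"
        # assumed path; keep the .htpasswd file outside the web root
        AuthUserFile /home/example/.htpasswd
        Require valid-user
    </Files>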
Sometimes we don't have the APIs we would like, and this is one of those cases.
I want to extract certain information from a website, so I was considering programmatically making cURL requests to hundreds of pages within the site using a cron job on my server.
I would then cache the responses and fire the requests again after one or more days.
Could that potentially be considered some kind of attack by the server, which would see hundreds of calls to certain pages in a very short period of time from the same IP?
Let's say 500 cURL requests?
What would you recommend? Perhaps making use of the sleep command between cURL calls to reduce the frequency of those requests?
There are a lot of situations where your scripts could end up getting blocked by the website's firewall. One of the best steps you can take is to contact the site owner and let them know what you want to do, to see if this is allowed. If that's not possible, read their Terms of Service and see if it's strictly prohibited.
If time is not of the essence when making these calls, then yes, you can definitely use the sleep command to delay the time between each request, and I would recommend it if you find you need to make fewer requests per second.
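For instance, a rough cron-driven shell sketch of that idea (the urls.txt input file, cache directory and 5-second delay are assumptions):

    #!/bin/sh
    # fetch-pages.sh - run from cron once a day; fetches each URL with a pause in between.
    # urls.txt contains one URL per line (hypothetical input file).
    mkdir -p cache
    while read -r url; do
        curl -sS -o "cache/$(echo "$url" | md5sum | cut -d' ' -f1).html" "$url"
        # wait 5 seconds between requests to keep the load on the remote site low
        sleep 5
    done < urls.txt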
You could definitely do this. However you should keep a few things in mind:
Most competent sites will have a clause in their Terms of Service which prohibits the use of the site in any way other than through the interface provided.
If the site sees what you are doing and notices a detrimental effect on their network, they will block your IP (our organization ran into this issue often enough that it warranted developing a program that logs IPs and the rate at which they access content; if an IP attempts to access more than x pages in y seconds, we ban it for z minutes). However, you might be able to avoid this by using the sleep command as you mentioned.
If you require information that is loaded into the page dynamically via JavaScript after the markup has been rendered, the response you receive from your cURL request will not include this information. For cases such as these there are programs such as iMacros which allow you to write scripts in your browser to carry out actions programmatically as if you were actually using the browser.
As mentioned by @RyanCady, the best solution may be to reach out to the owner of the site, explain what you are doing, and see if they can accommodate your requirement.
Situation:
For a web shop, I want to build paged product lists (and filters on these lists) using Elasticsearch. I want to entirely bypass the PHP/MySQL server on which the application runs and communicate with Elasticsearch directly from the customer's browser through AJAX calls. Advantages are:
A large portion of the load on the PHP/MySQL server will be handled by the ES cluster instead
CDN opportunities (scaling!)
Problem:
This approach would take a massive load off our backend server but creates a few new issues. Anonymous users will generate lots of requests, and we need some control over those:
Traffic control:
How do we defend against malicious users making lots of calls and scanning/downloading our entire product catalogue that way (e.g. competitors scraping pricing information)?
How can I block IPs that have been identified (somehow) as behaving badly?
Access control:
How to make sure the frontend can only make the queries we want to allow?
How to make sure customers only see a selection of the result fields and can't get any data out of ES that's not intended for them?
It's essential not to have a single machine somewhere taking care of all this, because that would just recreate a single point responsible for handling everything. I want to take real advantage of the ES cluster without any middleware that has to deal with the scaling issue as well.
We don't want to be fully dependent on a third party; we're looking for a solution that has some flexibility regarding the partners we work with (e.g. switching between Elastic and AWS).
Possible solutions or partial solutions:
I've been looking at a few 'Elasticsearch as a service' options but I'm not confident about their quality or even if I can solve the issues mentioned with them:
www.elastic.co/found: their premium solution has a 'Shield' service which does not seem to cover all of the cases mentioned above (only IP blocking, as far as I can tell), but there is a custom plugin (https://github.com/floragunncom/search-guard) that can filter result fields and provides a way to do user management etc. This seems like a reasonable option, but it is expensive and ties the application to the 'Found' product; we should be able to switch partners should the need arise.
Amazon's AWS Elasticsearch service has basic IAM support, and it's possible to put CloudFront in front of it, but it does not provide the kind of access control described above.
Installing a separate L7 application filtering solution for detecting scrapers etc.
Question:
Is there anyone out there who has this type of approach working and found a good setup that tackles all of these issues?
The first thing I would recommend is to restrict access to your Elasticsearch instance behind a security group and only allow the application server's IP address access on ports 22, 80, 9200 and 9300 (9200 and 9300 being the ports used by Elasticsearch).
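For example, with the AWS CLI (the security group ID and the application server IP 203.0.113.10 below are placeholders):

    # Allow only the application server to reach the Elasticsearch HTTP and
    # transport ports on the instance's security group.
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 9200 --cidr 203.0.113.10/32
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 9300 --cidr 203.0.113.10/32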
As for protecting against scraping, there is no absolute solution. However, if your aim is simply to limit the load these scrapers put on your application server and ES instance, you can take a look at https://github.com/davedevelopment/stiphle, which is aimed at rate-limiting users. The example they use on their page limits clients to 5 requests a second, which would seem very reasonable for the average user and could be lowered even further if need be to make scraping a time-consuming effort.
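As an illustration of that 5-requests-per-second idea (this is not Stiphle's own API; the sketch assumes the APCu extension as a shared counter):

    <?php
    // Allow at most 5 requests per second per client IP before replying 429.
    $ip     = $_SERVER['REMOTE_ADDR'];
    $bucket = 'rate:' . $ip . ':' . time();   // one counter per IP per second
    apcu_add($bucket, 0, 2);                  // create the counter with a 2-second TTL
    if (apcu_inc($bucket) > 5) {
        http_response_code(429);              // Too Many Requests
        exit('Rate limit exceeded, slow down.');
    }
    // ...normal request handling continues here...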
I have been facing a weird situation for a while now and need some guidance regarding this.
Problem:
For the last two days we have been experiencing a very slow website compared to when we launched the server. We thought it was a temporary issue, but now it has gone dead slow and a page takes at least 3 minutes to load. I also found that CPU utilization has somehow reached 100%, and I believe crawling might be causing this.
We are using a third party for SEO, Google dynamic remarketing and advertising of our Magento website. I firmly believe these services need to crawl my website so it gets indexed in the search engine.
I have seen that Google and Bing crawl our website regularly (you may call them Googlebot and Bingbot), and suddenly there has been a large spike.
Have a look at the screenshot:
https://www.dropbox.com/s/2c4u04rhtbi99j0/Screenshot%202015-11-14%2014.16.41.png?dl=0
The largest spike was caused by Bing and Google at the same time, and the smaller ones appear to be only Googlebot.
So I just had a few quick questions regarding this:
Do you think that if a bot's IP is not whitelisted, we will have a problem with SEO, Google advertising and dynamic remarketing, because that IP will not be allowed to crawl our website?
Is this spam, or is it the bots crawling our store that is increasing the store's response time, which can impact search-engine ranking and conversions on our store?
Would a larger AWS instance type help us solve our CPU-usage problem?
Note: We are already using m3.large instance type.
Is this spam, or is it the bots crawling our store that is increasing the store's response time, which can impact search-engine ranking and conversions on our store?
Bots and crawlers can cause a sustained traffic and resource spike for a single Magento server, regardless of what's in place to speed up Magento's performance: Magento's default caching, nginx or Apache settings, installed extensions, etc.
Would a larger AWS instance type help us solve our CPU-usage problem? Note: We are already using an m3.large instance type.
Absolutely. A burstable t2.large instance can be more cost-effective and could better handle traffic spikes like those caused by bots, so long as you have a semi-predictable traffic pattern. With higher traffic during the day and lower traffic overnight, for example, the instance gains CPU credits that it can use to burst above its baseline capacity. See this for a thorough explanation:
https://aws.amazon.com/blogs/aws/low-cost-burstable-ec2-instances/
The biggest help I saw was having a properly configured robots.txt for Magento. It makes sure crawlers are directed to the right places, so your server only has to serve the pages it needs to. This post is a great place to start:
https://magento.stackexchange.com/questions/14891/how-do-i-configure-robots-txt-in-magento
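As a rough illustration (the paths below are common Magento examples, not a definitive list; adjust them to your store), such a robots.txt keeps bots out of search, checkout and account URLs and asks polite crawlers to slow down:

    User-agent: *
    # keep crawlers out of endlessly parameterised, non-indexable pages
    Disallow: /catalogsearch/
    Disallow: /checkout/
    Disallow: /customer/
    Disallow: /wishlist/
    # honoured by Bing and others; Googlebot ignores it (use Search Console instead)
    Crawl-delay: 10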
In Google's and Bing's webmaster tools, once you verify your domain, you can change the crawl rate if necessary.
You can also implement referral spam blocking with nginx; see:
https://github.com/Stevie-Ray/referrer-spam-blocker
Background
Legitimate spiders are great. They're part of the web, and I'm happy for them to access my site.
Unauthorised spiders which scrape my site are bad, and I want rid of them.
I have a PHP application that monitors my website's access logs. Every time a user with a suspect user agent hits the site, the system checks the access log for entries from the same IP address and makes a judgement about its behaviour. If it's not a human and I have not authorised it, it gets logged and I may (or may not) take action such as blocking.
The way it works is that every time a page loads, this process of checking the access file happens. I only check suspect user agents, so the number of checks is kept to a minimum.
Question
What I want to do is check every single visit that hits the site (i.e. check the last 50 lines of the access file to see if any relate to that visit's IP). But that means every child process my web server handles will want to open the one single access log file. This sounds like a resource and I/O-blocking nightmare.
Is there a way I can 'tail' the access.log file into some sort of central memory that all the web processes can read at the same time (or very quickly at least)? Perhaps loading it into Memcache or similar. But how would I do that in real time, so that the last 500 lines of the access.log file are loaded into memory continuously (keeping only 500 lines and expunging as it goes, not an ever-increasing number)?
So in simple terms: is there a PHP, Linux or 'other' way of buffering an ever-growing file (i.e. an nginx log file) into memory so that other processes can access the information concurrently (or at least faster than all of them reading the file off the hard drive)?
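One rough sketch of the kind of thing I mean, assuming a Memcached server on localhost and the php-memcached extension: a long-running process pipes tail -F of the log into a PHP script, which keeps only the newest 500 lines under a single key that the web processes can read.

    #!/usr/bin/env php
    <?php
    // log-buffer.php
    // Run as: tail -n 500 -F /var/log/nginx/access.log | php log-buffer.php
    // Keeps the newest 500 access-log lines in Memcached under one key.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $lines = [];
    while (($line = fgets(STDIN)) !== false) {
        $lines[] = rtrim($line, "\n");
        if (count($lines) > 500) {
            array_shift($lines);              // drop the oldest line
        }
        $mc->set('access_log_tail', $lines);  // web processes read this key
    }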
It is important to know that a well-written service will always be able to mimic a browser's behaviour, unless you do some very weird stuff that will influence the user experience of legitimate visitors.
However, there are a few measures to deal even with sophisticated scrapers:
0. Forget about …
… referrer and UA strings. Those are easy to forge, and some legitimate users don't have a common one. You will get lots of false positives/negatives and not gain much.
1. Throttle
Web servers like Apache or nginx have core or add-on features to throttle the request rate for certain requests. For example, you could allow the download of one *.html page every two seconds, but not limit assets like JS/CSS. (Keep in mind that you should also notify legitimate bots of the delays via robots.txt.)
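For example, a rough nginx sketch of that kind of throttling (the zone name, rates and paths are assumptions): HTML page requests are limited to roughly one every two seconds per IP, while static assets are left alone.

    # in the http {} block: a shared 10 MB zone keyed by client IP,
    # allowing on average 1 request every 2 seconds (30 per minute)
    limit_req_zone $binary_remote_addr zone=pages:10m rate=30r/m;

    server {
        root /var/www/html;

        location / {
            # allow a small burst, then reject further requests (503 by default)
            limit_req zone=pages burst=10 nodelay;
            # ... normal page handling ...
        }

        location ~* \.(css|js|png|jpg|gif|svg)$ {
            # static assets are not rate limited
        }
    }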
2. Fail2ban
Fail2ban does something similar to what you want to do: it scans log files for malicious requests and blocks the offending IPs. It works great against malware bots, and it should be possible to configure it to deal with scrapers (at least the less clever ones).
--
These are the ones that specifically answer your question, but there are a couple more, which you could consider:
3. Modify contents
This is actually a real fun one: From time to time, we make minor (automated) modifications of the HTML pages and of the JSON feeds, which force the scrapers to adapt their parsers. The fun part is when we see outdated data on their websites for a couple of days until they catch up. Then we modify it again.
4. Restrict: Captchas and Logins
Apart from the throttling on the web server level, we count the requests per IP address per hour. If it's more than a certain number (which should be enough for a legitimate user), each request to the API requires solving a captcha.
Other APIs require authentication, so they won't even get into those areas.
5. Abuse notifications
If you get regular visits from a certain IP address or subnet, you can do a WHOIS lookup for the network provider from which they are running their bots. Usually they have abuse contacts, and usually those contacts are very eager to hear about policy violations, because the last thing they want is to end up on blacklists (which we will submit them to if they don't cooperate).
Also, if you see advertising on the scraper's website, you should notify the advertising networks of the fact that they're being used in the context of stolen material.
6. IP bans
Quite obviously, you can block a single IP address. We go even further and block entire data centers like those of AWS, Azure, etc. Lists of IP ranges are available on the web for all of those services.
Of course, if there are partner services legitimately accessing your site from a data-center, you must whitelist them.
By the way, we don't do this in the web server but at the firewall level (iptables).
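For example (the CIDR range 192.0.2.0/24 below is a placeholder; substitute the published ranges of the data centers you want to block):

    # drop web traffic from one (placeholder) data-center range
    iptables -A INPUT -p tcp -m multiport --dports 80,443 -s 192.0.2.0/24 -j DROP

    # for many ranges, an ipset is more efficient than one rule per CIDR
    ipset create datacenters hash:net
    ipset add datacenters 192.0.2.0/24
    iptables -A INPUT -p tcp -m multiport --dports 80,443 \
        -m set --match-set datacenters src -j DROP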
7. Legal measures
If you think that the scraper might be afraid of legal actions from your side, you should not hesitate to contact them and make clear that they infringe on your copyright and terms of usage, and they may become subject to legal actions.
8. Conclusion
After all, fighting scrapers is a “fight against windmills”, and it may take a lot of effort. You will not be able to prevent all of it, but you should concentrate on the ones that harm you, e.g. by wasting your resources or making money that should belong to you.
Good luck!
I was thinking about web-security and then this thought popped into my head.
Say that there's this jerk who hates me and knows how to program. I am managing a nice website/blog with a considerable amount of traffic. Then that jerk creates a program that automatically requests my website over and over again.
So if I am hosting my website with a shared hosting provider, then obviously my website will stop responding.
This type of attack may not be common, but if someone attempts something like that on my website, I must do something about it. I don't think popular CMSs like WordPress or Drupal do anything about this type of attack.
My assumption is:
If a user makes more than x requests (let's say 50) in one minute, block that user (stop responding).
My questions are:
Is my assumption OK? If not, what should I do about it?
Do websites like Google, Facebook, YouTube, etc. do something about this type of attack?
What you are describing is a DoS (Denial of Service) attack, where one system keeps sending packets to your web server until it becomes unresponsive.
You mentioned a single jerk; what if that same jerk had many friends? Then you have a DDoS (Distributed DoS) attack, and that really can't be prevented.
A quick fix from the Apache docs for DoS, but not for DDoS:
All network servers can be subject to denial of service attacks that
attempt to prevent responses to clients by tying up the resources of
the server. It is not possible to prevent such attacks entirely, but
you can do certain things to mitigate the problems that they create.
Often the most effective anti-DoS tool will be a firewall or other
operating-system configurations. For example, most firewalls can be
configured to restrict the number of simultaneous connections from any
individual IP address or network, thus preventing a range of simple
attacks. Of course this is no help against Distributed Denial of
Service attacks (DDoS).
Source
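As a concrete, hedged illustration of the firewall approach the Apache docs mention, an iptables rule limiting simultaneous connections per source IP could look like this (the limit of 20 is an arbitrary example):

    # reject new HTTP connections from any single IP that already has
    # more than 20 connections open to port 80
    iptables -A INPUT -p tcp --syn --dport 80 \
        -m connlimit --connlimit-above 20 -j REJECT --reject-with tcp-reset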
The issue is partly one of rejecting bad traffic, and partly one of improving the performance of your own code.
Being hit with excess traffic by malicious intent is called a Denial of Service attack. The idea is to hit the site with traffic to the point that the server can't cope with the load, stops responding, and thus no-one can get through and the site goes off-line.
But you can also be hit with too much traffic simply because your site becomes popular. This can easily happen overnight and without warning, for example if someone posts a link to your site on another popular site. This traffic might actually be genuine and wanted (hundreds of extra sales! yay!), but can have the same effect on your server if you're not prepared for it.
As others have said, it is important to configure your web server to cope with high traffic volumes; I'll let the other answers speak for themselves on this, and it is an important point, but there are things you can do in your own code to improve things too.
One of the main reasons that a server fails to cope with increased load is because of the processing time taken by the request.
Your web server will only have the ability to handle a certain number of requests at once, but the key word here is "simultaneous", and the key to reducing the number of simultaneous requests is to reduce the time it takes for your program to run.
Imagine your server can handle ten simultaneous requests, and your page takes one second to load.
If you get up to ten requests per second, everything will work seamlessly, because the server can cope with it. But if you go just slightly over that, then the eleventh request will either fail or have to wait until the other ten have finished. It will then run, but will eat into the next second's ten requests. By the time ten seconds have gone by, you're a whole second down on your response time, and it keeps getting worse as long as the requests keep pouring in at the same level. It doesn't take long for the server to get overwhelmed, even when it's only a fraction over its capacity.
Now imagine the same page could be optimised to take less time, let's say half a second. Your same server can now cope with 20 requests per second, simply because the PHP code is quicker. But it will also be easier for it to recover from excess traffic levels. And because the PHP code takes less time to run, there is less chance of any two given requests being simultaneous anyway.
In short, the server's capacity to cope with high traffic volumes increases enormously as you reduce the time taken to process a request.
So this is the key to a site surviving a surge of high traffic: Make it run faster.
Caching: CMSs like Drupal and WordPress have caching built in. Make sure it's enabled. For even better performance, consider a server-level cache system like Varnish. For a CMS-type system where you don't change the page content much, this is the single biggest thing you can do to improve your performance.
Optimise your code: while you can't be expected to fix performance issues in third-party software like Drupal, you can analyse the performance of your own code, if you have any. Custom Drupal modules, maybe? Use a profiler tool to find your bottlenecks. Very often, this kind of analysis can reveal that a single bottleneck is responsible for 90% of the page load time. Don't bother with optimising the small stuff, but if you can find and fix one or two big bottlenecks like this, it can have a dramatic effect.
Hope that helps.
These types of attacks are called (D)DoS ((Distributed) Denial of Service) attacks and are usually mitigated by the web server hosting your PHP application. Since Apache is used the most, I found an article you might find interesting: http://www.linuxforu.com/2011/04/securing-apache-part-8-dos-ddos-attacks/.
The article states that Apache has multiple modules available that were specifically created to mitigate (D)DoS attacks. These still need to be installed and configured to match your needs.
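For instance, mod_evasive (one of the modules typically mentioned in this context) is configured with a handful of directives; the thresholds below are illustrative examples, not recommendations:

    <IfModule mod_evasive20.c>
        # block an IP that requests the same page more than 5 times in 1 second
        DOSPageCount        5
        DOSPageInterval     1
        # or that makes more than 100 requests to the whole site in 1 second
        DOSSiteCount        100
        DOSSiteInterval     1
        # how long (in seconds) the offending IP stays blocked
        DOSBlockingPeriod   60
        DOSEmailNotify      admin@example.com
    </IfModule>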
I do believe that Facebook, Google, etc. have their own similar implementations to prevent DoS attacks. I know for a fact that Google Search uses a CAPTCHA if a lot of search requests are coming from the same network.
The reason it is not wise to prevent DoS within a PHP script is that the PHP interpreter still needs to be started whenever a request is made, which causes a lot of overhead. By using the web server for this you will have less overhead.
EDIT:
As stated in another answer, it is also possible to prevent common DoS attacks by configuring the server's firewall. Checking for attacks with firewall rules happens before the web server is hit, so there is even less overhead. Furthermore, you can detect attacks on other ports as well (such as port scans). I believe a combination of the two works best, as they complement each other.
In my opinion, the best way to prevent DoS is to set up the firewall at the lowest level: at the entry point of the server. By setting up some firewall rules with iptables, you can drop packets from senders that are hitting your server too hard.
It will be more efficient than going through PHP and Apache, since they need to use a lot (relatively speaking) of processing to do the checking, and they may still bog down your website even if you do detect your attacker(s).
You can check on this topic for more information: https://serverfault.com/questions/410604/iptables-rules-to-counter-the-most-common-dos-attacks
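A hedged sketch of such an iptables rate limit, matching the '50 requests per minute' assumption from the question (the numbers and port are examples):

    # drop new connections to port 80 from any source IP that opens
    # more than 50 new connections per minute
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m hashlimit --hashlimit-name http --hashlimit-mode srcip \
        --hashlimit-above 50/minute --hashlimit-burst 50 -j DROP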