Concerning security: should I check, on every page, how many variables are sent and how large they are, and block GET requests when the page doesn't need them, for example?
I mean, someone might send very large text many times over as GET variables to overload my server.
Is that possible? What can I do about it?
With a GET request you can't send a huge amount of data (Apache has a default limit of about 8,000 characters for the request line; check browser limitations here). And if you don't use the $_GET parameters anywhere, there will be almost no impact on the server. What matters here is requests per second, and a normal user will not generate many requests.
If you are looking for security holes, start with restrictions on executing uploaded files (like PHP code in image.jpg) and other insecure file access, then XSS attacks, weak password generation, and so on.
There was a big problem with how POST/GET values were handled in most languages, including PHP, that could result in DoS attacks via specially crafted requests. It was first discussed in this talk (slides are available here).
You can also read about it here and here. The main idea is that POST/GET are arrays, and arrays are stored using hash tables. An attacker could cause a DoS by deliberately creating collisions (keys that share the same hash value), which forces a lot of extra computation.
But this isn't something that should be handled at the application level, so as a PHP coder you don't have to worry about it. The problem described above is an issue with how PHP handles hash tables, but you can also mitigate it by limiting the size of POST/GET requests in your PHP configuration.
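For example, a hedged php.ini sketch along those lines (the values are illustrative, not recommendations):
; Cap the request body size and the number of input variables PHP will parse
post_max_size  = 2M
max_input_vars = 1000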
If you are worried about DDoS, that also shouldn't be handled by your application code, but externally, e.g. by a firewall.
My answer somewhat links your question to your comment:
No, I'm worried about hackers
Security-wise, I think the first thing you should check and optimize is the site structure. The problem you mentioned is very specific, and checking for it may help to a certain degree, but it probably won't be an attacker's primary angle.
You could always limit the size of GET requests (the default is somewhere around 8 KB for most servers) in the server configuration. You may also create a custom 414 page explaining the reason for the shorter request length.
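In Apache, for instance, a sketch of such a limit might look like this (the 4094 value and the 414 page path are placeholders):
# Limit the request line (URL plus query string) and each header field
LimitRequestLine      4094
LimitRequestFieldSize 4094
ErrorDocument 414 /errors/414.html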
All in all, if it's security that you're aiming for, I'd start off elsewhere (the broader picture) and then slowly tackle my way until I hit the core.
This question is not about protecting against SQL injection attacks. That question has been answered many times on StackOverflow and I have implemented the techniques. This is about stopping the attempts.
Recently my site has been hit with huge numbers of injection attacks. Right now, I trap them and return a static page.
Here's what my URL looks like:
/products/product.php?id=1
This is what an attack looks like:
/products/product.php?id=-3000%27%20IN%20BOOLEAN%20MODE%29%20UNION%20ALL%20SELECT%2035%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C%27qopjq%27%7C%7C%27ijiJvkyBhO%27%7C%7C%27qhwnq%27%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35%2C35--%20
I know for sure that this isn’t just a bad link or fat-fingered typing so I don't want to send them to an overview page. I also don’t want to use any resources on my site delivering static pages.
I'm considering just letting the page die with die(). Is there anything wrong with this approach? Or is there an HTTP status code that I can set with PHP that would be more appropriate?
Edit:
Based on a couple of comments below, I looked up how to return 'page not found'. This Stack Overflow answer by icktoofay suggests using a 404 and then the die(); - the bot thinks that there isn’t a page and might even go away, and no more resources are used to display a page not found message.
header("HTTP/1.0 404 Not Found");
die();
Filtering out likely injection attempts is what mod_security is for.
It can take quite a bit of work to configure it to recognize legitimate requests for your app.
Another common method is to block IP addresses of malicious clients when you detect them.
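For example, once an offending address has been identified, a minimal Apache 2.4 sketch for blocking it (the address below is a placeholder) might be:
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
</RequireAll>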
You can attempt to stop this traffic from reaching your server with hardware. Most devices that do packet inspection can be of use. I use an F5 for this purpose (among others). The F5 has a scripting language of its own called iRules which affords great control and customization.
The post has been unblocked, so I thought I'd share what I've been doing to reduce attacks from the same IP address. I still get a half dozen a day, but they usually only try once or twice from each IP address.
Note: In order to return the 404 error message, all of this must come before any HTML is sent. I’m using PHP and redirect all errors to an error file.
<?php
require_once('mysql_database.inc');

// I'm using a database, so mysql_real_escape_string works.
// I don't use any special characters in my productID, but injection attacks do. This helps trap them.
$productID = htmlspecialchars( (isset($_GET['id']) ? mysql_real_escape_string($_GET['id']) : '55') );

// Product IDs are all numeric, so it's an invalid request if it isn't a number.
if ( !is_numeric($productID) ) {
    $url = $_SERVER['REQUEST_URI'];  // Track which page is under attack.
    $ref = $_SERVER['HTTP_REFERER']; // I display the referrer just in case I have a bad link on one of my pages.
    $ip  = $_SERVER['REMOTE_ADDR'];  // See if they are coming from the same place each time.

    // Strip spaces just in case they typed the URL and have an extra space in it.
    $productID = preg_replace('/[\s]+/', '', $productID);

    if ( !is_numeric($productID) ) {
        error_log("Still a long string in products.php after replacement: URL is $url and IP is $ip & ref is $ref");
        header("HTTP/1.0 404 Not Found");
        die();
    }
}
I also have lots of pages where I display different content depending on the category that is picked. In those cases I have a series of if statements, like if ($cat == 'Speech') { }. There is no database lookup, so there's no chance of SQL injection, but I still want to stop the attacks and not waste bandwidth displaying a default page to a bot. Usually the category is a short word, so I modify the is_numeric conditional above to check the string length instead, e.g. if ( strlen($cat) > 10 ). Since most of the attempts have more than 10 characters in them, this works quite well; a short sketch of that check follows below.
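A minimal sketch of that length check, assuming the category arrives as a GET parameter named cat:
<?php
$cat = isset($_GET['cat']) ? $_GET['cat'] : '';
// Categories are short words, so anything longer than 10 characters is treated as an attack.
if ( strlen($cat) > 10 ) {
    header("HTTP/1.0 404 Not Found");
    die();
}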
A very good question, +1 from me, and the answer is not simple.
PHP does not provide a way to maintain data across different pages and different sessions, so you can't limit access by IP address unless you store access details somewhere.
If you don't want to use a database connection for this, you can of course use the filesystem. I'm sure you already know how to do this, but you can see an example here:
DL's Script Archives
http://www.digi-dl.com/
(click on "HomeGrown PHP Scripts", then on "IP/networking", then
on "View Source" for the "IP Blocker with Time Limit" section)
The best option used to be "mod_throttle". Using that, you could restrict each IP address to one access per five seconds by adding this directive to your Apache config file:
<IfModule mod_throttle.c>
ThrottlePolicy Request 1 5
</IfModule>
But there's some bad news. The author of mod_throttle has abandoned the product:
"Snert's Apache modules currently CLOSED to the public
until further notice. Questions as to why or requests
for archives are ignored."
Another Apache module, mod_limitipconn, is used more often nowadays. It doesn't let you impose arbitrary restrictions (such as "no more than ten requests in each fifteen seconds"); all you can do is limit each IP address to a certain number of concurrent connections. Many webmasters advocate it as a good way to fight bot spam, but it does seem less flexible than mod_throttle.
You need different versions of mod_limitipconn depending which version of Apache you're running:
mod_limitipconn.c - for Apache 1.3
http://dominia.org/djao/limitipconn.html
mod_limitipconn.c - Apache 2.0 port
http://dominia.org/djao/limitipconn2.html
Finally, if your Apache server is hosted on a Linux machine, there's a solution you can use which doesn't involve recompiling the kernel. Instead, it uses the "iptables" firewall rules. This method is rather elegant, and is flexible enough to impose constraints such as "no more than three connections from this IP in one minute". Here's how it's done:
Linux Noob forums - SSH Rate Limit per IP
http://www.linux-noob.com/forums/index.php?showtopic=1829
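As a rough, untested sketch of that kind of rule (port 80 and the three-connections-per-minute threshold mirror the example above; adjust to taste):
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --set --name HTTP
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m recent --update --seconds 60 --hitcount 4 --name HTTP -j DROP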
I realize that none of these options will be ideal, but they illustrate what is possible. Perhaps using a local database will end up being best after all? In any case, bear in mind that simply limiting the rate of requests, or limiting the bandwidth, doesn't solve the problem of bots. They may take longer, but they'll eventually drain just as many resources as they would if they were not slowed down. It's necessary to actually reject their HTTP requests, not simply delay them or spread them out.
Good luck in the escalating battle between content and spam!
I am currently faced with an issue, and am trying to explore the security risks involved in the following scenario.
Website A has the following code:
<img src="http://www.websiteb.com/loadimage.php?path=http://www.websitea.com/images/logo.png" />
Website B:
The following comment is an example of what will happen in loadimage.php (I do not require the code for this page):
/* Use cURL to load the image from $_GET['path'] and output it to the page. */
Do you believe there could be any security risks associated with Website B being exploitable somehow?
Thanks
Yes - you're opening yourself up to abuse (assuming you're writing the cURL function). Others can create spurious links and use your code to request pages from other sites, to deliver attacks, or to try to distribute malicious content (e.g. they host a virus, point your website's script at it to deliver that virus, and your website gets a bad name).
But you can mitigate it in the following ways (pick and choose, depending on your situation):
If possible, remove the domain name from the path; if you know all the images come from one domain, drop it from the parameter and add it back in the PHP. This restricts abuse, since requests are limited purely to your domain.
If you have a selection of domains, then instead verify that the domain in the URL matches what you expect - again, to restrict free rein over what gets downloaded.
If you can strip all parameters from the image URL (if you know you'll never need them), then remove them too. Or if you can match a particular pattern of parameters, strip all the others. This limits the potential for abuse a bit.
Validate that it's actually an image once you've pulled it in (see the sketch after this list).
Track downloads from a particular IP address. If they exceed an expected amount, then stop delivering more. You'll need to know what an expected amount is.
If you deliver both the HTML and the image download, you can deliver only the files you're expecting to deliver for that page. Basically, if you get a request for the HTML page, you know which images will be requested subsequently. Log them against the requesting IP and the requesting agent, and allow delivery for 60 minutes. If you're not expecting a request (i.e. no match with IP / agent), don't deliver. (Note: normally you can't rely on IP or agent, since both can be forged, but for these purposes it's fine.)
Track by cookies. Similar to the above, but use a cookie to narrow down the browser as opposed to tracking by IP and agent.
Also similar to the above, you can create a unique id for each file (e.g. "?path=avnd73q4nsdfyq347dfh") and store in a database which image you're going to deliver for that unique_id. Unique_ids expire after a while.
As a final measure, change the name of the script periodically - overlap the old and new names for a bit, then retire the old script.
I hope that gives an idea of what you can do. Choose according to what you can; a rough sketch combining a couple of these checks follows below.
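This is only a sketch of the first few points, assuming a single allowed domain and PHP 5.4+ for the image check; the allowed host is borrowed from the question and everything else is a placeholder:
<?php
// loadimage.php (sketch): whitelist the host, fetch with cURL, confirm it's an image.
$allowedHost = 'www.websitea.com';                 // assumed single allowed domain
$path = isset($_GET['path']) ? $_GET['path'] : '';
$host = parse_url($path, PHP_URL_HOST);

if ($host !== $allowedHost) {
    header("HTTP/1.0 403 Forbidden");
    exit;
}

$ch = curl_init($path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);   // don't follow redirects off the whitelist
$data = curl_exec($ch);
curl_close($ch);

// Validate that the payload really is an image before echoing it out.
$info = ($data !== false) ? getimagesizefromstring($data) : false;
if ($info === false) {
    header("HTTP/1.0 404 Not Found");
    exit;
}

header('Content-Type: ' . $info['mime']);
echo $data;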
It can be used to proxy attacks to another site.
I have a LAMP server where I run a website that I want to protect against bulk scraping / downloading. I know that there is no perfect solution for this and that an attacker will always find a way, but I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.
This website has about 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to be able to analyze incoming HTTP requests in real time, and if there is suspicious activity (e.g. tens of requests in one minute from one IP) automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to stop script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or another method to track individuals. Otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting, which is essentially what you are describing ("tens of requests in one minute from one IP"), is the leaky bucket algorithm.
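A minimal PHP sketch of a leaky bucket keyed by session, assuming the APCu extension is available as a shared store (the rate and burst values are placeholders):
<?php
// Leaky bucket: each request adds a "drop"; the bucket drains at a fixed rate.
function allow_request($clientKey, $ratePerSec = 1.0, $burst = 10.0)
{
    $key   = 'bucket_' . $clientKey;
    $now   = microtime(true);
    $state = apcu_fetch($key);
    if ($state === false) {
        $state = array('level' => 0.0, 'last' => $now);
    }

    // Drain according to how much time has passed since the last request.
    $state['level'] = max(0.0, $state['level'] - ($now - $state['last']) * $ratePerSec);
    $state['last']  = $now;

    if ($state['level'] + 1.0 > $burst) {
        apcu_store($key, $state, 3600);
        return false;                   // bucket is full: throttle this client
    }

    $state['level'] += 1.0;             // count this request
    apcu_store($key, $state, 3600);
    return true;
}

session_start();
if (!allow_request(session_id())) {
    header('HTTP/1.1 429 Too Many Requests');
    exit;
}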
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
This list, and rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around them. Combating scrapers on your own is a futile effort, though my opinion is biased because I am a cofounder of Distil Networks, which offers anti-scraping protection as a service.
Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users or providing a mechanism for DoSing your site? Like spam prevention, the best solution is to use several approaches and maintain a badness score.
You've already mentioned looking at the rate of requests, but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session: you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referrer (and that the referrer really does point to the current content item) is a good idea too, as is tracking 404 rates. Add road blocks (when the score exceeds a threshold, redirect to a CAPTCHA or require a login). Checking the user agent can be indicative of attacks, but it should be part of the scoring mechanism, not a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered. Do the same when you see repeated external hosts appearing in your referrer headers.
Do not tar-pit connections unless you've got a lot of resources server-side!
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (i.e. the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 Not Found).
Of course you need to set this up so that search engines can still read your content (assuming you want that), and be aware that if you have any Flash content, the referrer is never set, so you can't use this method there.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.
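A hedged Apache sketch of that image-only referrer check, using the 2.2-style syntax that appears elsewhere in this thread (the domain and extensions are placeholders):
SetEnvIfNoCase Referer "^https?://(www\.)?example\.com/" local_ref=1
<FilesMatch "\.(gif|jpe?g|png)$">
    Order Allow,Deny
    Allow from env=local_ref
</FilesMatch>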
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
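For completeness, a sketch of where that deny might sit (the directory path is a placeholder):
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=downloader
</Directory>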
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache alone. I had a similar problem with SSH and saslauth, so I wrote a script to monitor the log files; if a certain number of failed login attempts were made within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
First: if you have information that other people want, give it to them in a controlled way, say, via an API.
Second: try to copy Google. If you scrape Google's results a lot (and I mean a few hundred times a second), it will notice and force you through a CAPTCHA.
I'd say that if a site is visited 10 times a second, it's probably a bot, so give it a CAPTCHA to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try to stop it.
You could use a counter (DB or Session) and redirect the page if the limit is triggered.
<?php
// The pseudocode made concrete: count requests for the current session and redirect over the limit.
session_start();
$limit = 100;                                // assumed threshold
$_SESSION['hits'] = (isset($_SESSION['hits']) ? $_SESSION['hits'] : 0) + 1;
if ($_SESSION['hits'] > $limit) {
    header('Location: /limit-reached.php');  // assumed redirect target
    exit;
}
I think dynamically blocking IPs with an IP blocker will work better.
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from being able to make a copy of the data in the database, yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libraries in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all of them); see the sketch after this list.
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but would prevent simple page scraping.
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
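As a minimal GD sketch of the first point above, rendering a single value as a PNG (the value and the built-in font are placeholders):
<?php
$value = '555-0173';                      // placeholder data to render
$font  = 5;                               // built-in GD font
$img   = imagecreatetruecolor(imagefontwidth($font) * strlen($value) + 10, imagefontheight($font) + 10);
$bg    = imagecolorallocate($img, 255, 255, 255);
$fg    = imagecolorallocate($img, 0, 0, 0);
imagefill($img, 0, 0, $bg);
imagestring($img, $font, 5, 5, $value, $fg);
header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);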
force a reCAPTCHA every 10 page loads for each unique IP
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
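A rough PHP sketch of that growing delay (the thresholds and the 5-second cap are assumptions; the CAPTCHA step itself is left out):
<?php
session_start();
$n = $_SESSION['loads'] = (isset($_SESSION['loads']) ? $_SESSION['loads'] : 0) + 1;
if ($n > 30) {                                         // assumed "suspicious" page-load count
    $delay = min(5000000, 100000 * pow(2, $n - 30));   // microseconds, doubling each load
    usleep((int) $delay);
}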
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do is just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.