I am currently faced with an issue, and am trying to explore the security risks involved in the following scenario.
Website A has the following code:
<img src="http://www.websiteb.com/loadimage.php?path=http://www.websitea.com/images/logo.png" />
Website B:
The following comment is an example of what will happen in loadimage.php (I do not require the code for this page):
/* Use cURL to load the image from $_GET['path'] and output it to the page. */
Do you believe there could be any security risks associated with Website B being exploitable somehow?
Thanks
Yes - you're opening yourself up to abuse (assuming you're writing the cURL function). Others can create spurious links and use your code to request pages from other servers, to deliver attacks, or to distribute malicious content (e.g. they host a virus, use your website's script to deliver it, and your website gets a bad name).
But you can mitigate it in the following ways (pick and choose, depending on your situation):
If possible, remove the domain name from the path; if you know all the images come from your own domain, strip it from the parameter and prepend it in the PHP. This prevents abuse by restricting requests purely to your domain.
If you have a selection of domains, then instead verify that the domain in the URL matches what you expect - again, to restrict free rein over what gets downloaded.
If you know you'll never need them, strip all parameters from the image URL. Or, if you can match a particular pattern of parameters, strip all the others. This limits the attack potential a bit.
Validate that it's actually an image once you've pulled it in.
Track downloads from a particular IP address. If they exceed an expected amount, then stop delivering more. You'll need to know what an expected amount is.
If you deliver both the HTML and the image download, you can deliver only the files you're expecting to deliver for that page. Basically, if you get a request for the HTML page, you know which images will be requested subsequently. Log them against the requesting IP and user agent, and allow delivery for 60 minutes. If you're not expecting a request (i.e. no match with IP / agent), don't deliver. (Note: normally you can't rely on IP or agent as they can both be forged, but for these purposes it's fine.)
Track by cookies. Similar to above, but use a cookie to narrow down the browser as opposed to tracking by IP and agent.
Also similar to above, you can create a unique id for each file (e.g. "?path=avnd73q4nsdfyq347dfh") and store in a database which image you're going to deliver for that unique id. Unique ids expire after a while.
Final measure: change the name of the script periodically - let the old and new names overlap for a bit, then retire the old script.
I hope that gives an idea of what you can do. Choose according to your situation; a minimal sketch combining a couple of these checks follows.
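For illustration only, here is a rough sketch of what a hardened loadimage.php could look like, combining the domain restriction and image validation points above. The whitelist pattern, fixed domain and error handling are assumptions, not the asker's actual code:
<?php
// loadimage.php - hypothetical hardened version (sketch only)
$path = $_GET['path'] ?? '';

// Only a path is accepted; the domain is fixed server-side (point 1 above)
if (!preg_match('#^/images/[A-Za-z0-9_\-]+\.(png|jpe?g|gif)$#', $path)) {
    http_response_code(404);
    exit;
}

$url = 'http://www.websitea.com' . $path;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // don't let redirects escape the restriction
$data = curl_exec($ch);
curl_close($ch);

// Validate that what came back really is an image (point 4 above)
$info = $data !== false ? getimagesizefromstring($data) : false;
if ($info === false) {
    http_response_code(404);
    exit;
}

header('Content-Type: ' . $info['mime']);
echo $data;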
It can be used to proxy attacks to another site.
I have a webserver, and certain users have been retrieving my images using an automated script. I wish to redirect them to an error page or give them an invalid image, but only if it's a cURL request.
My image resides at http://example.com/images/AIDd232320233.png. Is there some way I can route it with .htaccess to my controller's index function, where I can check whether it's an authentic request?
And my other question: how can I check browser headers to distinguish the most likely authentic requests from ones made with cURL?
Unfortunately, the short answer is 'no.'
cURL provides all of the necessary options to "spoof" any browser. More specifically, browsers identify themselves via specific header information, and cURL provides all the tools to set header data in whatever manner you choose. So, directly distinguishing two requests from one another is not possible.*
*Without more information. Common methods to determine whether a live human is initiating the traffic are to set cookies during previous steps (to ensure the request is a natural byproduct of a user being on your website), or to use a Captcha and a cookie (to validate that someone can pass a test).
The simplest is to set a cookie, which will really only ensure that bad programmers don't get through, or programmers who don't want to spend the time to tailor their scraper to your site.
The more tried and true approach is a Captcha, as it requires the user to interact to prove they have blood in their veins.
If the image is not a "download" but more of a piece of a greater whole (say, just an image on your site), a Captcha could be used to validate a human before giving them access to the site as a whole. Or if it is a download, it would be presented before unlocking the download.
Unfortunately, Captchas are "a pain," both to set up and for the end user. They don't make a whole lot of sense for general-purpose access; they're a little overboard.
For general-purpose stuff, you can really only throttle IPs, set download limits and the like. And even then, there's nothing you can do if the requests are distributed. Them's the breaks, really...
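If you do go the cookie route described above, a rough sketch might look like the following. The script name, cookie name, storage path and the .htaccess routing it assumes are all placeholders, not anything from the question:
<?php
// serveimage.php - hypothetical gate script (sketch only).
// Assumes an .htaccess RewriteRule routes /images/*.png here as ?file=...,
// and that a normal page on the site previously set a "seen_site" cookie.

$file = basename($_GET['file'] ?? '');          // strip any directory components
$path = __DIR__ . '/protected_images/' . $file; // hypothetical storage location

if (!isset($_COOKIE['seen_site']) || !is_file($path)) {
    // No cookie (or unknown file): refuse instead of serving the real image
    http_response_code(403);
    exit;
}

header('Content-Type: image/png');
readfile($path);
As the answer says, this only filters out lazy scrapers; anyone can replay the cookie with cURL.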
I use a text-to-speech synthesis platform (probably written in Java) on my server.
While this application is running on my server, users can get audio as a URL to a WAV file using an embedded HTML <audio> tag, as follows:
<audio controls>
<source src="http://myserver.com:59125/process?INPUT_TEXT=Hello%20world" type="audio/wav">
</audio>
In the above 'src' attribute, 'process' requests the synthesis of some text using local port 59125.
My concern is that I might start seeing performance issues and out-of-memory errors, which would cause the TTS synthesis platform (but not the website) to crash every few days, apparently triggered by one or more entities abusing it as some sort of web service for their own applications.
I wish to secure the URL requests so that a third party couldn't use my text-to-speech server for audio clips not related to my website.
How can I secure this URL service?
I take it this URL is embedded in a public website, so any random public user needs to be able to access this URL to download the file. This makes it virtually impossible to secure as is.
The biggest problem is that you're publicly exposing a service that anyone can use to do something useful. I.e., just by requesting a URL that I construct, I can get your server to do work for me (turn my text into speech). The core problem here is that the input text is fully configurable by the end user.
To take away any incentive for a random person to use your server, you need to take away the ability for anyone to convert arbitrary text. If you want to be the only one in charge of which input texts are allowed, you'll have to either whitelist and validate the input, or identify inputs by id. E.g., instead of
http://myserver.com:59125/process?INPUT_TEXT=Hello%20world
your URLs look more like:
http://myserver.com:59125/process?input_id=42
42 is mapped to "Hello world" on the server; unknown ids won't be served.
Alternatively, again, validate and whitelist:
GET http://myserver.com:59125/process?INPUT_TEXT=Foo%20bar
404 Not Found
Speech for "Foo bar" does not exist.
For either approach, you'll need some sort of proxy in-between instead of directly exposing your TTS engine to the world. This proxy can also cache the resulting file to avoid repeatedly converting the same input again and again.
The end result would work like this:
GET http://myserver.com/tts?input=Hello%20world
myserver.com validates input, returns 403 or 404 for invalid input
myserver.com proxies a request to localhost:59125?INPUT_TEXT=Hello%20World if not already cached
myserver.com caches the result
myserver.com serves the result
This can be accomplished in any number of ways using any number of different web servers and/or CGI programs which perform the necessary validation (step 2) and possibly the caching (step 4).
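As one illustration (not the answerer's actual setup), a small PHP front controller could implement that flow; the whitelist file, cache directory and parameter name below are assumptions:
<?php
// tts.php - hypothetical proxy in front of the TTS engine (sketch only)
$input = $_GET['input'] ?? '';

// Step 2: validate against a whitelist of allowed phrases (assumed one per line)
$allowed = file('allowed_phrases.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
if (!in_array($input, $allowed, true)) {
    http_response_code(404);
    exit('Speech for that input does not exist.');
}

// Steps 3-4: proxy to the local TTS engine and cache the result
$cacheFile = __DIR__ . '/cache/' . sha1($input) . '.wav';
if (!is_file($cacheFile)) {
    $wav = file_get_contents('http://localhost:59125/process?INPUT_TEXT=' . rawurlencode($input));
    if ($wav === false) {
        http_response_code(502);
        exit;
    }
    file_put_contents($cacheFile, $wav);
}

// Step 5: serve the cached audio
header('Content-Type: audio/wav');
readfile($cacheFile);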
This depends on what server you are using. Possible methods are:
Authentication: use a username and password combination or ask for an SSL client certificate; these can be provided via cURL when one web service requests another.
IP whitelist: allow only specific IPs to access this server.
IP whitelist example in Apache:
Order deny,allow
Deny from all
# the server itself
Allow from 127.0.0.1
# maybe some additional internal network IP
Allow from 192.168.1.14
# or another machine in the local network
Allow from 192.168.1.36
# or some machine somewhere else on the web
Allow from 93.184.216.34
Your best bet is to use feeela's answer above to limit usage of the TTS platform to a given web server (this is where users request the audio from and where your security logic should be implemented).
After that you need to write a "proxy" script that receives a token generated on the fly by the page hosting the audio tag, using a logic/method of your choice (you can use the session, other user data and a salt), and checks its validity. If the token is valid it should call the TTS engine and return the audio; otherwise generate an error, a redirect, or whatever you want.
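For example (purely a sketch: the secret, the HMAC-based token scheme and the parameter names are my own assumptions, not part of the answer), the page could embed a keyed hash of the text and the proxy could verify it before calling the engine:
<?php
// Shared secret known only to the server (assumption)
const TTS_SECRET = 'change-me';

// When rendering the page that contains the <audio> tag:
function tts_url(string $text): string {
    $token = hash_hmac('sha256', $text, TTS_SECRET);
    return '/tts.php?text=' . rawurlencode($text) . '&token=' . $token;
}

// In the proxy script (tts.php): verify the token before doing any work
$text  = $_GET['text']  ?? '';
$token = $_GET['token'] ?? '';
if (!hash_equals(hash_hmac('sha256', $text, TTS_SECRET), $token)) {
    http_response_code(403);   // token doesn't match: refuse to call the TTS engine
    exit;
}
header('Content-Type: audio/wav');
// Requires allow_url_fopen; alternatively use cURL here
echo file_get_contents('http://localhost:59125/process?INPUT_TEXT=' . rawurlencode($text));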
It depends what you mean by "securing it".
Maybe you want it to only be accessible to certain users? In that case, you have an easy answer: issue each user with login credentials that they need to enter when they visit the site, and pass those credentials through to the API. Anyone without valid credentials will be unable to use the API. Job done.
Or maybe you want it to work for anyone, but only to be used from specific sites? This is more difficult, because any kind of authentication key you have would need to be within the site's Javascript code, and thus visible to someone wanting to copy it. There isn't a foolproof solution, but the best solution I can suggest is to link each API key to the URL of the site that owns it. Then use the HTTP referrer header to check that calls made using a given API key are being called from the correct site. HTTP requests can be spoofed, including the referrer header, so this isn't foolproof, but will prevent most unauthorised use -- someone would have to go a fair distance out of their way to get around it (they'd probably have to set up a proxy server that forwarded your API requests and spoofed the headers). This is unlikely to happen unless your API is an incredibly valuable asset, but if you are worried about that, then you could make it harder for them by having the API keys change frequently and randomly.
But whatever else you do, the very first thing you need to do to secure it is to switch to HTTPS rather than HTTP.
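As a rough illustration of that key-plus-referrer idea (the key list and header handling here are assumptions, and remember the Referer header can be spoofed):
<?php
// Hypothetical map of API keys to the site allowed to use each one
$keys = [
    'key-for-site-a' => 'sitea.example.com',
    'key-for-site-b' => 'siteb.example.com',
];

$key      = $_GET['api_key'] ?? '';
$referrer = parse_url($_SERVER['HTTP_REFERER'] ?? '', PHP_URL_HOST);

// Reject calls whose Referer host doesn't match the site that owns the key
if (!isset($keys[$key]) || $referrer !== $keys[$key]) {
    http_response_code(403);
    exit;
}
// ...otherwise handle the API request as normal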
I'm not talking about extracting text or downloading a single web page.
But I see people downloading whole web sites. For example, there is a directory called "example" and it isn't even linked on the web site - how do I know it's there? How do I download "all" pages of a website? And how do I protect against this?
For example, there is "directory listing" in Apache; how do I get a list of the directories under the root if there is an index file already?
This question is not language-specific. I would be happy with just a link explaining the techniques involved, or a detailed answer.
Ok, so to answer your questions one by one: how do you know that a 'hidden' (unlinked) directory is on the site? Well, you don't, but you can check the most common directory names and see whether they return HTTP 200 or 404... With a couple of threads you will be able to check even thousands a minute. That being said, you should always consider the number of requests you are making relative to the specific website and the amount of traffic it handles, because for small to mid-sized websites this could cause connectivity issues or even a short DoS, which of course is undesirable. You can also use search engines to search for unlinked content; it may have been discovered by the search engine by accident, there might have been a link to it from another site, etc. (for instance, google site:targetsite.com will list all the indexed pages).
How you download all pages of a website has already been answered: essentially you go to the base link, parse the HTML for links, images and other content that points to on-site content, and follow them. Further, you deconstruct links into their directories and check for indexes. You will also brute-force common directory and file names.
You really can't effectively protect against bots unless you limit the user experience. For instance, you could limit the number of requests per minute, but if you have an AJAX site a normal user will also produce a large number of requests, so that really isn't the way to go. You can check the user agent and whitelist only 'regular' browsers; however, most scraping scripts will identify themselves as regular browsers, so that won't help you much either. Lastly, you can blacklist IPs; however, that is not very effective - there are plenty of proxies, onion routing and other ways to change your IP.
You will get a directory listing only if a) it is not forbidden in the server config and b) there isn't a default index file present (on Apache, index.html or index.php by default).
In practical terms it is a good idea not to make things easier for the scraper, so make sure your website's search function is properly sanitized etc. (it doesn't return all records on an empty query, and it filters the % sign if you are using MySQL's LIKE syntax...). And of course use a CAPTCHA if appropriate, but it must be properly implemented - not a simple "what is 2 + 2" or a couple of letters in a common font on a plain background.
Another protection against scraping might be using referer checks to allow access to certain parts of the website; however, it is better to just forbid access to any parts of the website you don't want public on the server side (using .htaccess, for example).
Lastly, in my experience scrapers will only have basic JS parsing capabilities, so implementing some kind of check in JavaScript could work; however, here again you'd also be excluding all visitors with JS switched off (or with NoScript or a similar browser plugin) or with an outdated browser.
To fully "download" a site you need a web crawler, that in addition to follow the urls also saves their content. The application should be able to :
Parse the "root" url
Identify all the links to other pages in the same domain
Access and download those and all the ones contained in these child pages
Remember which links have already been parsed, in order to avoid loops
A search for "web crawler" should provide you with plenty of examples.
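To make those steps concrete, here is a very naive single-domain crawler sketch; error handling, politeness delays, robots.txt handling and relative-link resolution are all left out, and the start URL is a placeholder:
<?php
// Naive single-domain crawler (sketch only)
$start = 'http://example.com/';
$host  = parse_url($start, PHP_URL_HOST);
$queue = [$start];
$seen  = [];                                              // remember visited URLs to avoid loops

while ($url = array_shift($queue)) {
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    $html = @file_get_contents($url);                     // 1. access the page
    if ($html === false) {
        continue;
    }
    file_put_contents('saved_' . sha1($url) . '.html', $html);   // 3. save its content

    // 2. identify links to other pages in the same domain
    preg_match_all('#href="([^"]+)"#i', $html, $matches);
    foreach ($matches[1] as $link) {
        if (strpos($link, '/') === 0) {
            $link = 'http://' . $host . $link;            // resolve root-relative links
        }
        if (parse_url($link, PHP_URL_HOST) === $host) {
            $queue[] = $link;                             // 4. follow same-domain links only
        }
    }
}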
I don't know of countermeasures you could adopt to avoid this: in most cases you WANT bots to crawl your website, since that's how search engines learn about your site.
I suppose you could look at traffic logs, and if you identify (by IP address) some repeat offenders, you could blacklist them to prevent access to the server.
I have a LAMP server where I run a website which I want to protect against bulk scraping / downloading. I know that there is no perfect solution for this and that an attacker will always find a way. But I would like to have at least some "protection" that makes stealing the data harder, rather than having nothing at all.
This website has circa 5000 subpages with valuable text data and a couple of pictures on each page. I would like to be able to analyze incoming HTTP requests on the fly and, if there is suspicious activity (e.g. tens of requests in one minute from one IP), automatically blacklist that IP address from further access to the site.
I fully realize that what I am asking for has many flaws, but I am not really looking for a bullet-proof solution, just a way to stop script kiddies from "playing" with easily scraped data.
Thank you for your on-topic answers and possible solution ideas.
Although this is a pretty old post, I think the answer isn't quite complete and I thought it worthwhile to add my two cents. First, I agree with @symcbean: try to avoid using IPs and instead use a session, a cookie, or another method to track individuals. Otherwise you risk lumping together groups of users sharing an IP. The most common method for rate limiting (which is essentially what you are describing with "tens of requests in one minute from one IP") is the leaky bucket algorithm.
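As a rough per-session sketch of a leaky bucket (the capacity and leak rate below are arbitrary assumptions):
<?php
// Leaky bucket rate limiter keyed on the PHP session (sketch only)
session_start();

$capacity = 30;    // bucket size: allows short bursts (assumption)
$leakRate = 0.5;   // requests "drained" per second (assumption)
$now      = microtime(true);

$level = $_SESSION['bucket_level'] ?? 0.0;
$last  = $_SESSION['bucket_time']  ?? $now;

// Drain the bucket for the time elapsed since the last request
$level = max(0.0, $level - ($now - $last) * $leakRate);

if ($level + 1 > $capacity) {
    http_response_code(429);   // Too Many Requests
    exit;
}

$_SESSION['bucket_level'] = $level + 1;   // this request adds one unit
$_SESSION['bucket_time']  = $now;
// ...continue serving the page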
Other ways to combat web scrapers are:
Captchas
Make your code hard to interpret, and change it up frequently. This makes scripts harder to maintain.
Download IP lists of known spammers, proxy servers, TOR exit nodes, etc. This is going to be a lengthy list, but it's a great place to start. You may also want to block all Amazon EC2 IPs.
This list, plus rate limiting, will stop simple script kiddies, but anyone with even moderate scripting experience will easily be able to get around you. Combating scrapers on your own is a futile effort, though my opinion is biased because I am a co-founder of Distil Networks, which offers anti-scraping protection as a service.
Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.
How do you limit access without placing burdens on legitimate users or providing a mechanism for DoSing your site? As with spam prevention, the best solution is to use several approaches and maintain a score of badness.
You've already mentioned looking at the rate of requests, but bear in mind that increasingly users will be connecting from NAT networks, e.g. IPv6 PoPs. A better approach is to check per session. You don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too. Track 404 rates. Use road blocks (when the score exceeds a threshold, redirect to a captcha or require a login). Checking the user agent can be indicative of attacks, but it should be used as part of the scoring mechanism, not as a yes/no criterion for blocking.
Another approach, rather than interrupting the flow, is to start substituting content when the thresholds are triggered. Or do the same when you see repeated external hosts appearing in your referer headers.
Do not tarpit connections unless you've got a lot of resources server-side!
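A minimal sketch of that kind of per-session badness score (the individual signals, weights and threshold are made up for illustration):
<?php
// Accumulate a per-session "badness" score from several weak signals (sketch only)
session_start();

$score = $_SESSION['badness'] ?? 0;

// Assumed weights; tune against real traffic
if (empty($_SERVER['HTTP_REFERER'])) {
    $score += 1;                                      // no referer at all
}
if (stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'curl') !== false) {
    $score += 2;                                      // suspicious user agent
}
if (($_SESSION['requests'] = ($_SESSION['requests'] ?? 0) + 1) > 100) {
    $score += 1;                                      // unusually many requests this session
}

$_SESSION['badness'] = $score;

if ($score > 5) {
    header('Location: /captcha.php');                 // road block, not a hard ban
    exit;
}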
Referrer checking is one very simple technique that works well against automated attacks. You serve content normally if the referrer is your own domain (ie the user has reached the page by clicking a link on your own site), but if the referrer is not set, you can serve alternate content (such as a 404 not found).
Of course you need to set this up to allow search engines to read your content (assuming you want that) and also be aware that if you have any flash content, the referrer is never set, so you can't use this method.
Also it means that any deep links into your site won't work - but maybe you want that anyway?
You could also enable it just for images, which makes it a bit harder for them to be scraped from the site.
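In PHP, such a check for an image could be as simple as this sketch (the domain, file path and fallback response are placeholders):
<?php
// image.php - serve an image only when the referrer is our own domain (sketch only)
$referrer = parse_url($_SERVER['HTTP_REFERER'] ?? '', PHP_URL_HOST);

if ($referrer !== 'www.example.com') {
    http_response_code(404);            // alternate content for hotlinkers and scrapers
    exit;
}

header('Content-Type: image/png');
readfile(__DIR__ . '/images/photo.png');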
Something that I've employed on some of my websites is to block known User-Agents of downloaders or archivers. You can find a list of them here: http://www.user-agents.org/ (unfortunately, not easy to sort by Type: D). In the host's setup, I enumerate the ones that I don't want with something like this:
SetEnvIf User-Agent ^Wget/[0-9\.]* downloader
Then I can do a Deny from env=downloader in the appropriate place. Of course, changing user-agents isn't difficult, but at least it's a bit of a deterrent if going through my logs is any indication.
If you want to filter by requests per minute or something along those lines, I don't think there's a way to do that in Apache. I had a similar problem with SSH and saslauth, so I wrote a script to monitor the log files, and if there were a certain number of failed login attempts within a certain amount of time, it appended an iptables rule that blocked that IP from accessing those ports.
If you don't mind using an API, you can try our https://ip-api.io
It aggregates several databases of known IP addresses of proxies, TOR nodes and spammers.
I would advise one of two things.
The first would be: if you have information that other people want, give it to them in a controlled way, say, via an API.
The second would be to try and copy Google: if you scrape Google's results A LOT (and I mean a few hundred times a second), it will notice and force you through a Captcha.
I'd say that if a site is visited 10 times a second, it's probably a bot. So give it a Captcha to be sure.
If a bot crawls your website slower than 10 times a second, I see no reason to try and stop it.
You could use a counter (DB or Session) and redirect the page if the limit is triggered.
/* Count requests per session (the session stands in for the IP/session pair in the idea above) */
session_start();
$limit = 100;                                         // allowed requests per session
$_SESSION['count'] = ($_SESSION['count'] ?? 0) + 1;
if ($_SESSION['count'] > $limit) {
    header('Location: /limit-reached.php');
    exit;
}
I think dynamically blocking IPs using an IP blocker will work better.
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text (see the sketch after this list), which would defeat most text parsers - though it isn't foolproof, of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robots meta tags. These would stop conforming spiders. This one will prevent Google from indexing you, for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
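For the image-encoding idea above, a minimal sketch using PHP's GD extension (the value, dimensions and script name are placeholders); the page would then reference it with an <img> tag instead of printing the value:
<?php
// value-as-image.php - render a sensitive value as a PNG instead of text (sketch only)
$value = '555-867-5309';                     // in practice, fetched from the database

$img = imagecreatetruecolor(160, 24);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, 159, 23, $bg);
imagestring($img, 4, 4, 4, $value, $fg);     // built-in GD font, no TTF required

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);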
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (see the sketch below).
Try using Flash or Silverlight for your frontend.
While this can't stop someone who's really determined, it would make things more difficult. If you're loading your data through services, you can always use a secure connection to prevent man-in-the-middle scraping.
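A rough sketch of the time-limited hash for the AJAX approach above; the secret, the 10-second lifetime and the parameter names are assumptions:
<?php
// Sketch of both halves of the AJAX-hash idea (normally two separate scripts)
session_start();
const PAGE_SECRET = 'change-me';   // server-side secret (assumption)

// In the page-shell script: issue a hash tied to the session and the current time
$issued = time();
$hash   = hash_hmac('sha256', (string) $issued, PAGE_SECRET);
$_SESSION['ajax_issued'] = $issued;
// echo $hash into the page so the JavaScript sends it back as ?token=...

// In the AJAX endpoint: verify the hash and its age before returning data
$sent   = $_GET['token'] ?? '';
$issued = $_SESSION['ajax_issued'] ?? 0;
$valid  = hash_equals(hash_hmac('sha256', (string) $issued, PAGE_SECRET), $sent);

if (!$valid || time() - $issued > 10) {    // 10-second lifetime, as in the answer
    http_response_code(403);
    exit;
}
echo json_encode(['rows' => ['...data from the database...']]);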
force a reCAPTCHA every 10 page loads for each unique IP
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incompatible things, and maybe having a discussion about his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection? If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
Normally, to screen-scrape a decent amount, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
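A sketch of that exponentially growing delay (the window, free quota and cap are arbitrary assumptions):
<?php
// Slow down clients that load many pages within a short window (sketch only)
session_start();

$window = 60;                                  // seconds (assumption)
$now    = time();

// Keep only the timestamps of page loads within the window
$loads = array_filter($_SESSION['loads'] ?? [], fn($t) => $now - $t < $window);
$loads[] = $now;
$_SESSION['loads'] = $loads;

$excess = count($loads) - 10;                  // first 10 loads per minute are "free"
if ($excess > 0) {
    sleep(min(2 ** ($excess - 1), 30));        // 1s, 2s, 4s... capped at 30s
}
// ...then serve the page (or show a CAPTCHA instead once the delay gets long)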
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do is just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.