How to prevent someone from scraping my website data? - php

I am using PHP to write my server side code for my website. What is the best way to prevent someone from scraping my data?
For example, if someone uses file_get_contents() to fetch my pages, or loads my login form in an iframe element, or captures the data entered in the login form -
how can I prevent such things?
I am using PHP 5.4.7, MySQL, HTML and CSS.

I think that being a web developer these days is terrifying, and there may be a temptation to go into "overkill" when it comes to web security. As the other answers have mentioned, it is impossible to stop automated scraping entirely, and it shouldn't worry you if you follow these guidelines:
It is great that you are considering website security. Never change.
Never send anything from the server you don't want the user to see. If the user is not authorised to see it, don't send it. Don't "hide" important bits and pieces in jQuery.data() or data-attributes. Don't squirrel things away in obfuscated JavaScript. Don't use techniques to hide data on the page until the user logs in, etc, etc.
Everything - everything - is visible if it leaves the server.
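As a minimal sketch of that principle (the page, the permission name and the current_user_can() helper are made up for illustration), do the check on the server and simply never emit the markup, rather than hiding it client-side:

<?php
// report.php - hypothetical page: only render the sensitive block if the
// server-side session says the logged-in user is allowed to see it.
session_start();

function current_user_can($permission)
{
    // Hypothetical helper: check the permission against the logged-in user.
    return isset($_SESSION['permissions'])
        && in_array($permission, $_SESSION['permissions'], true);
}

if (current_user_can('view_financials')) {
    // The markup only exists in the response when the user is authorised.
    echo '<section id="financials">...sensitive figures...</section>';
}
// No hidden div, no data-attribute, no obfuscated JS fallback:
// if it never leaves the server, it cannot be scraped from the page source.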
If you have content you want to protect from "content farm" scraping, use email-verified user registration (including some form of good CAPTCHA, such as reCAPTCHA, to confound most of the bots).
Protect your server! As best you can, make sure you don't leave any common exploits open. Read this -> http://owasp.org/index.php/Category:How_To <- Yes. All of it ;)
Prevent direct access to your files. The more traditional approach is defined('_SOMECONSTANT') or die('No peeking, hacker!'); at the top of your PHP document. If the file is not accessed through the proper channels, nothing important will be sent from the server.
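A minimal sketch of that pattern (the file names are only for illustration): the public entry script defines the constant, and every include refuses to run when requested directly.

<?php
// index.php - the only file meant to be requested directly.
define('_SOMECONSTANT', 1);
require __DIR__ . '/includes/process.php';

<?php
// includes/process.php - bails out unless it was pulled in by index.php.
defined('_SOMECONSTANT') or die('No peeking, hacker!');
// ...the real processing code goes here...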
You can also meddle with your .htaccess, or go further and lock things down at the web-server configuration level.
Are you perhaps worried about cross site scripting (XSS)?
If you are worried about data being intercepted when the user enters login information, you can implement two-step verification (as Facebook does) or use SSL/TLS.
It really all boils down to what your site will do. If it is a run of the mill site, cover the basics in the bullet points and hope for the best ;) If it is something sensitive like a banking site... well... don't do a banking site just yet :P
Just as an aside: I never touch credit card numbers and such. Any website I develop politely hands that off, via API, to a company with insurance and fleets of staff dedicated to security (not just little old me and my shattered nerves).

No, there is no way to make sure of this. You can implement some JavaScript functions that try to prevent it, but if the client simply deactivates JS (or a scraper just ignores it), you can't prevent it.

It is really hard to prevent this. I have found a similar discussion here. That will answer most of your queries, but if you want even stronger protection then sophisticated programs and services like Scrapesentry and Distil would be needed.

Using JavaScript or PHP you can only reduce data scraping; you cannot stop it.
The browser has to be able to read the HTML, so any user can view your page source and copy it. You can disable key events, but that still won't stop scraping.

Related

AJAX-Driven Site

Well, I have a completely AJAX-driven site. I inserted jQuery code that affects all forms and queries site-wide. It has reached the point where, even though I want to change it, I accept the idea of a website that utilizes a single function to process all queries (search, links, profile, etc.).
How do you accommodate speed and security on such a platform? My PHP files can be accessed directly from their location's link, which is a threat. Help me; as well as AJAX, I need validation and protection for '777' folders.
Before you read my answer, also read this (as an answer to a comment on your question): Possible to view PHP code of a website?
Don't put speed and security in the same box. A website can be secure and fast at the same time.
I would secure a folder with 777 access (why not 755?) by putting an empty 0 KB index.html file in it (yes, even if it contains a bunch of .php files!) and adding an .htaccess with a deny from all restriction, so the folder can be accessed internally but not from 'outside'.
Then I would NEVER send sensitive data through the requests in its raw form, but rather a client-side hash (using an algorithm like MD5 or SHA-1) to compare and validate on the server side. So don't ever, ever send sensitive data in its pure state over the wire.
Need more security? Use HTTPS.
Regarding a "single function" that drives the JS client end of your site: if it is well formatted, the browser doesn't care whether it's one function or hundreds; good code is readable code. Performance-wise, there are lots of suggestions on the web for speeding up your code.
To add to Tim's really good comment/tip: you can still open your console in Firebug (the Net panel) and inspect every single piece of information that is sent from your page to the server (and vice versa!) and act accordingly.

Securing a php contact form

I have made a simple PHP contact form following this tutorial:
http://www.catswhocode.com/blog/how-to-create-a-built-in-contact-form-for-your-wordpress-theme
The big problem is that this form processing is not safe; I have heard people can use it to send spam and/or hack my server.
What are the basic steps needed to make this form more secure?
PS: I don't want to use reCAPTCHA if it can be avoided...
Edit: I need suggestions on which PHP functions are used to filter and ensure that the form is submitted "the right way" and not altered and/or used to hack my site or to send email to other people (using the site to send spam). Do I just need to use stripslashes(), or is there a better way?
One way: If you're not a huge site, it's not likely anyone is going to figure this out/take the time to.
You could use some tricky JS to handle tokens on click: your server issues token IDs for clickable/focusable elements on the page during the backend render phase and logs them in a database or data file. Then, when users click around and submit, you compare the IDs sent via the onclick() handler. You could also apply some heuristics to determine whether the history of clicks is reasonably paced - that is, even if someone scripted the hijacking of the token IDs and auto-submitted, you could check whether the time between click events looks automated (a minimal sketch follows the next step). Signed up for a Twitter account lately? They use passive human detection that, while not 100% foolproof, is slower and more difficult to break. Somebody would REALLY have to want to hack/spam your site.
Important step 2: strip out or URL-encode strange characters if you think they will break your page. Common ones that break things are ", ', and :.
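A minimal sketch of the token/timing idea from the first step, assuming PHP sessions and a reasonably recent PHP version for random_bytes() and hash_equals() (the field names and the 3-second threshold are arbitrary): issue a token and a timestamp when the form is rendered, then reject submissions whose token doesn't match or which come back implausibly fast.

<?php
session_start();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    // Handle the submission: the token must match and the timing must be plausible.
    $tokenOk = isset($_POST['token'], $_SESSION['form_token'])
            && hash_equals($_SESSION['form_token'], $_POST['token']);
    $issued  = isset($_SESSION['form_issued']) ? $_SESSION['form_issued'] : 0;

    if (!$tokenOk || (microtime(true) - $issued) < 3.0) {  // faster than any human fills a form
        http_response_code(400);
        exit('Submission rejected.');
    }

    // Step 2 above: escape anything that is echoed back into the page.
    $message = htmlspecialchars($_POST['message'], ENT_QUOTES, 'UTF-8');
    echo 'Thanks! Your message: ' . $message;
} else {
    // Render the form, issuing a fresh token and timestamp.
    $_SESSION['form_token']  = bin2hex(random_bytes(16));
    $_SESSION['form_issued'] = microtime(true);
    echo '<form method="post">'
       . '<input type="hidden" name="token" value="' . $_SESSION['form_token'] . '">'
       . '<textarea name="message"></textarea>'
       . '<button type="submit">Send</button>'
       . '</form>';
}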
Another Way: http://areyouahuman.com/
As long as you are using encrypted methods, verifying humanity without a crappy CAPTCHA is possible. I mean, don't ignore your headers either; these are complementary measures.
The key is to have enough complexity to make for an NP-Complete problem. http://en.wikipedia.org/wiki/NP-complete
When the day comes that AI can solve multiple complex human problems on its own, we will have other things to worry about than request tampering.
http://louisville.academia.edu/RomanYampolskiy/Papers/1467394/AI-Complete_AI-Hard_or_AI-Easy_Classification_of_Problems_in_Artificial
Another company doing interesting research is http://www.vouchsafe.com/play-games - they actually use games designed so that playing the RTT (reverse Turing test) trains it to be solvable only by humans!
Here's a great article on NP-Hard problems. I can see a huge possibility here: http://www.i-programmer.info/news/112-theory/3896-classic-nintendo-games-are-np-hard.html

Security of Flex for payment website

So, it's been about 3 years since I wrote and went live with my company's main internet-facing website. Originally written in PHP, I've since just been making minor changes here and there to progress the site as we've needed to.
I've wanted to rewrite it from the ground up in the last year or so and now, we want to add some major features so this is a perfect time.
The website in question is as close to a banking website as you'd get (without being a bank; sorry for the obscurity, but the less info I can give out, the better).
For the rewrite, I want to separate the presentation layer from the processing layer as much as I can. I want the end user to be stuck in a box and not be able to get out so to speak
(this is all because of PCI compliance, being pen tested every 3 months, etc...)
So, being probed every 3 months has increasingly made me nervous. We haven't failed yet and there hasn't been a breach yet, but I want to make sure I continue to pass (as much as I can, anyway).
So, I'm considering rewriting the presentation layer in Adobe Flex and doing all the processing in PHP (effectively, IMO, separating presentation from processing) - I would do all my normal form validation in Flex (as opposed to JavaScript or PHP) and do my reads and writes to the DB via PHP.
My questions are:
I know Flash has something like 99% market penetration - do people find this to be true? Has anyone with a Flash-based site seen visitors who couldn't access it?
Flash in general has come under a lot of attack over security and the like - I know this. I would use a SWF encryptor, disable debugging (which I got snagged on once in a different application), continue to use HTTPS, and any other means I can think of.
At the end of the day, everyone knows that if someone wants the data badly enough, they're going to find a way in; I just want to make it as difficult for them as I can.
Any thoughts are appreciated.
-Mario
There are always people who, for one reason or another, don't install the Flash plugin. Bear in mind that these are distinctly in the minority. Realize also that some people still refuse to enable Javascript. The question you have to ask yourself is whether this small group is enough to get you to move off of some newer technologies.
If the answer to that is yes, you will have to resort to vanilla HTML form processing, sending everything to the server for validation, etc.
If the answer is no, don't be afraid to use Flex. It works fine with https protocol, and is as secure as you want. That said, I wouldn't use it for username/password validation on the client; that information should always be encrypted and sent to a secure server. But validation of other types of field (phone number, etc.) shouldn't be a problem.
There are definitely people who don't have Flash installed and yes, there are people who have JavaScript disabled. But no matter whether you develop for the common denominator which is plain HTML forms or if you go high end, e.g. Flex or AJAX, never ever rely on the client to validate the inputs. It's a good first step, but everything that comes from the client, be it Flash or Ajax or Silverlight or whatever, could be forged.

Security precautions and techniques for a User-submitted Code Demo Area

Maybe this isn't really feasible. But basically, I've been developing a snippet-sharing website and I would like it to have a 'live demo area'.
For example, you're browsing some snippets and click the Demo button. A new window pops up which executes the web code.
I understand there are a gazillion security risks involved in doing this - XSS, tags, nasty malware/drive by downloads, pr0n, etc. etc. etc.
The community would be able to flag submissions that are blatantly naughty but obviously some would go undetected (and, in many cases, someone would have to fall victim to discover whatever nasty thing was submitted).
So I need to know:
What should I do - security wise - to make sure that users can submit code, but that nothing malicious can be run - or executed offsite, etc?
For your information my site is powered by PHP using CodeIgniter.
Jack
As Frank pointed out, if you want to maintain a high level of security use a whitelist technique. This of course comes with a price (might be too restrictive, hard to implement).
The alternative route is to develop a blacklist technique, i.e. only allow code that hasn't triggered any alarm bells. This is easier, because you have to specify fewer things, but it will not catch new exploits.
There is plenty of information available on the web on both techniques.
Relying on CodeIgniter's security functions (XSS filtering etc.) will not get you very far, as most of the snippets will not be allowed through.
Whatever you do you have to remember this:
Do not think malicious code will aim only to harm your website's visitors. It may as well aim to compromise your server via your parser/code inspector. For example, let's say Alice uploads snippet foo. Alice intentionally crafts the snippet so that your parser will flag it as malicious due to an XSS exploit. Let's say your parser also updates a database with the malicious snippet for further investigation. Alice knows this. Along with the XSS exploit, Alice has injected some SQL code into the snippet, so that when you INSERT the snippet into the database it will do all sorts of bad stuff.
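To make that last point concrete: the INSERT itself should never trust the snippet, flagged or not. A minimal sketch using PDO prepared statements (the DSN, credentials and the snippets table are placeholders for this example):

<?php
// Store a submitted snippet for review without ever interpolating it into SQL.
$pdo = new PDO('mysql:host=localhost;dbname=demo', 'user', 'secret', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));

$stmt = $pdo->prepare(
    'INSERT INTO snippets (user_id, body, flagged) VALUES (:user_id, :body, :flagged)'
);
$stmt->execute(array(
    ':user_id' => (int) $userId,   // whoever is logged in
    ':body'    => $snippetBody,    // bound as data, so Alice's SQL stays inert text
    ':flagged' => 1,
));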
If you are really paranoid, you could have an isolated server whose sole responsibility would be to inspect code snippets. So in the worst-case scenario only that low-risk server would be compromised, and you would have (hopefully) enough time to fix/audit the situation.
Hope this helps.
You cannot whitelist or blacklist PHP, it just doesn't work. If you write up a list of commands that I can use, or stop me from using malicious functions, what is to stop me from writing:
$a = 'mai';
$a .= 'l';    // $a now holds the string 'mail'
$a('somebody@important.com', 'You suck', 'A dodgy message sent from your server');
You cannot whitelist or blacklist PHP.
For your information my site is powered by PHP using CodeIgniter
Sorry Jack, if you think that is in the least bit relevant you're a very long way from understanding any valid answer to the question - let alone being able to distinguish the invalid ones.
Any sandbox you create which will prevent someone from attacking your machine or your customers will be so restrictive that your clients will not be able to do much more than 'print'.
You'd need to run a CLI version of Suhosin in a custom chroot jail - and maintaining separate environments for every script would be totally impractical.
C.
Assuming you are only allowing JavaScript code, you should do the following -
Purchase a throw-away domain name that is not identifiable with your domain
Serve the user-entered code in an iframe that is hosted from the throw-away domain
This is essentially what iGoogle does. It prevents XSS because you are using a different domain. The only loophole I am aware of is that evil code can change the location of the webpage.
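A rough sketch of that layout (both domains and the load_snippet() lookup are hypothetical): the main site only embeds an iframe pointing at the throw-away domain, and a tiny script on that domain does nothing except emit the stored snippet.

<?php
// On the main site: embed the demo, never inline it into your own pages.
$snippetId = (int) $_GET['id'];
echo '<iframe src="https://untrusted-demos.example/demo.php?id=' . $snippetId . '"'
   . ' width="100%" height="400"></iframe>';

<?php
// demo.php on the throw-away domain: serve the raw snippet and nothing else.
$snippet = load_snippet((int) $_GET['id']);   // hypothetical lookup against shared storage
header('Content-Type: text/html; charset=utf-8');
echo $snippet;   // runs under the throw-away origin, so your site's cookies and DOM stay out of reach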
If you intend to share snippets of server-side code, then it is a different ballgame. For Java/JSP snippets, you could use the JVM's internal security classes to run the code in a sandbox. You should find a lot of information on this if you Google it. I would like to think this is what Google uses in App Engine (I am not sure, though).
Anything other than Java, I am not sure how to protect. .NET perhaps has a similar concept, but I doubt you could sandbox PHP code snippets in a similar manner.

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection, there is nothing to prevent anyone from viewing all the records as HTML and running some sort of script to parse this data back into another database. Even if I were to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile them into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publically available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers - it isn't foolproof, of course (a minimal PHP sketch appears after these options).
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
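To illustrate the image option above (a sketch only; the value, size and font are arbitrary), PHP's GD extension can render a sensitive field such as a phone number as a PNG instead of text:

<?php
// phone-image.php - render a single value as an image using GD.
$value = '555-0123';                          // would normally be looked up from the database

$img   = imagecreatetruecolor(120, 20);
$white = imagecolorallocate($img, 255, 255, 255);
$black = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, 119, 19, $white);
imagestring($img, 4, 5, 2, $value, $black);   // built-in GD font, drawn at (5, 2)

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);

The page would then reference it with something like <img src="phone-image.php?id=42">, so text parsers only ever see an image URL.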
There are different levels of deterrence and the first option is probably the least intrusive.
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but would prevent simple page scraping.
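A minimal sketch of option 2 (the endpoint names and the render_records() helper are made up, and random_bytes()/hash_equals() assume a reasonably recent PHP version): the page shell stores a hash and a timestamp in the session, and the AJAX endpoint refuses to answer unless the hash matches and is less than ten seconds old.

<?php
// page.php - the shell: issue a short-lived hash for the AJAX call.
session_start();
$_SESSION['ajax_hash']   = bin2hex(random_bytes(16));
$_SESSION['ajax_issued'] = time();
echo '<div id="data"></div>'
   . '<script>fetch("data.php?h=' . $_SESSION['ajax_hash'] . '")'
   . '.then(function (r) { return r.text(); })'
   . '.then(function (t) { document.getElementById("data").innerHTML = t; });</script>';

<?php
// data.php - only answer if the hash is valid and less than 10 seconds old.
session_start();
$fresh = isset($_SESSION['ajax_issued']) && (time() - $_SESSION['ajax_issued'] <= 10);
$match = isset($_GET['h'], $_SESSION['ajax_hash'])
      && hash_equals($_SESSION['ajax_hash'], $_GET['h']);

if (!$fresh || !$match) {
    http_response_code(403);
    exit;
}
unset($_SESSION['ajax_hash']);   // single use
echo render_records();           // hypothetical function that builds the HTML rows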
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible yet not scrapable.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
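A crude sketch of such a check (the header is trivially forged, so treat it as a nudge rather than a guarantee):

<?php
// Reject requests whose User-Agent is empty or looks like a common command-line client.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if ($ua === '' || preg_match('/\b(curl|wget|python-requests|libwww)\b/i', $ua)) {
    http_response_code(403);
    exit('Automated clients are not allowed.');
}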
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to a bulletin board's troll protection... If a scrape is detected (perhaps a certain number of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for google IPs!
Normally, to screen-scrape a decent amount one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
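A minimal sketch of that throttle, using the session as a per-visitor page-load log (the thresholds and the show_captcha() helper are placeholders):

<?php
// throttle.php - include at the top of each data page.
session_start();

$window  = 60;   // look at the last 60 seconds
$now     = time();
$history = isset($_SESSION['loads']) ? $_SESSION['loads'] : array();

// Keep only the loads that fall inside the window, then record this one.
$history = array_values(array_filter($history, function ($t) use ($now, $window) {
    return ($now - $t) < $window;
}));
$history[] = $now;
$_SESSION['loads'] = $history;

$count = count($history);
if ($count > 30) {
    show_captcha();                          // placeholder: render a CAPTCHA and exit
} elseif ($count > 10) {
    sleep(min(30, pow(2, $count - 10)));     // exponentially growing delay, capped at 30 s
}

Note that a scraper which simply discards cookies would dodge a purely session-based counter, so the same bookkeeping is usually also done per IP address in a shared store.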
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link back to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
