Right now I'm working on a service that handles reviews/recommendations of local restaurants overlaid on Google Maps. Basically Yelp, but restricted to a certain niche. Anyhow, since I don't want to have to load every location and review at once, I'm finally getting into using jQuery and AJAX calls.
The question I have is: How do I prevent other people from 'scraping' data from my ajax scripts on the server?
The main map/location info functionality needs to be public, in that users should not have to log in to use the application, so it may simply boil down to making it difficult to scrape. I'm hoping that one of you AJAX veterans out there can point me in the direction of a better idea, or some 'best practices' docs that I haven't been able to find yet.
So far all I've been able to come up with is:
The user-facing scripts open a short-lived session on the server and the AJAX calls will not run without an active session.
Send some sort of access key along with the application code and require that in all of the AJAX calls. But I'm not sure how best to implement this in a way that's not trivially easy to get around.
You can't completely protect your AJAX web services. Even if you mangle your data and obfuscate your source code, it is trivial to just fire up a packet sniffer or debugging proxy, figure it out, and scrape from it.
What I would do is exactly what you propose... only users with an active session on the site can make calls. Then from there, throttle requests.
Even a busy normal user won't make more than a handful of requests per minute. You can analyze your logs to figure out what a good number would be. Even if you limited your service to 20 calls per minute, that kind of limitation makes it fairly useless for folks that want to duplicate all of your content.
Don't limit just on session data either... keep an eye on IP addresses. It's entirely possible to fire off a request and get a new session at any time. Periodically check your logs to see if anything is getting through, and adjust your strategy accordingly.
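As a rough illustration of throttling on both session and IP, here is a minimal sliding-window sketch in Python. The class and function names are made up for this example, the limits are placeholders you would tune from your own logs, and a real deployment would keep the counters in shared storage rather than in process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit=20, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Forget calls that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


def allow_request(session_limiter, ip_limiter, session_id, ip_address):
    """Admit a request only if both the session and the IP are under their limits."""
    # Both counters are updated on every attempt, so a client that keeps
    # opening fresh sessions is still caught by the per-IP limit.
    ip_ok = ip_limiter.allow(ip_address)
    session_ok = session_limiter.allow(session_id)
    return ip_ok and session_ok
```

The per-IP limit would normally be set higher than the per-session one, since several legitimate users can share an address.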
Finally, regularly search for your content. Google is a great tool for finding copyright infringers. If you use specific data, such as GPS coordinates, you can actually watermark the coordinates with a specific value in the noise area of the coordinate.
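One way the coordinate watermark could look, as a sketch only: keep five decimal places (roughly a metre of precision) and put a keyed digit in the sixth. The secret, the function names and the choice of the sixth decimal are all assumptions made for this example, not an established scheme.

```python
import hashlib
import hmac
from decimal import Decimal, ROUND_DOWN

SECRET = b"replace-with-a-private-key"  # hypothetical secret, keep it server-side
STEP = Decimal("0.000001")              # the sixth decimal place


def _mark_digit(record_id, axis):
    """Derive a digit 0-9 from the secret, the record id and the axis name."""
    message = f"{record_id}:{axis}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).digest()[0] % 10


def watermark(value, record_id, axis):
    """Truncate to 5 decimals and write the keyed digit into the 6th."""
    exact = Decimal(str(value))
    truncated = exact.quantize(Decimal("0.00001"), rounding=ROUND_DOWN)
    sign = -1 if exact < 0 else 1
    return truncated + sign * _mark_digit(record_id, axis) * STEP


def is_watermarked(value, record_id, axis):
    """Check whether the 6th decimal digit matches the keyed digit."""
    scaled = abs(Decimal(str(value))).quantize(STEP, rounding=ROUND_DOWN)
    return int(scaled * 10**6) % 10 == _mark_digit(record_id, axis)
```

A single matching digit proves little (one chance in ten), but a copied data set in which nearly every coordinate matches is hard to explain as coincidence.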
From what I hear, you want to protect the JavaScript side of the service. This is not possible, as JavaScript is delivered to the browser as source code that anyone can read (effectively open source, albeit not public domain).
Google offers a tool called the Closure Compiler, which can compact a script by removing whitespace and comments. It can also obfuscate the script by replacing variable and function names with short, meaningless ones. It is configurable, so you can tell it how aggressive to be. From what I can tell, Google uses it for their own websites (evident from viewing the source of their pages).
Related
I need some advice on website design.
Let's take Twitter as an example for my question. Let's say I am building Twitter. On home_page.php, I need both data about tweets (tweet ID, who tweeted, tweet time, etc.) and data about the user (user ID, username, user profile pic).
To display all this, I have two options in mind:
1) Making separate php files like tweets.php and userDetails.php. By using AJAX queries, I can get the data on the home_page.php.
2) Adding all the php code (connecting to db, fetching data ) in the home_page.php itself.
With option one, I need to make many HTTP requests, which (I think) will put load on the network, so it might slow down the website.
On the other hand, that option gives me a defined REST API, which will be good for adding more features in the future.
Please give me some advice on picking the best. Also I am still a learner, so if there are more options of implementing this, please share.
With option 1 you're reliant on JavaScript, which doesn't follow progressive enhancement or graceful degradation; if a user doesn't have JS they will see zero content, which is obviously bad.
Split your code into manageable PHP files to make it easier to read, and require them all in one main PHP file; this won't take any extra HTTP requests because all the includes are done server side and one page is sent back.
You can add additional JavaScript to grab more "tweets" like Twitter does, but don't make the main functionality rely on JavaScript.
Don't think of PHP applications as a collection of PHP files that map to different URLs. A single PHP file should handle all your requests and include functionality as needed.
In network programming, it's usually good to minimize the number of network requests, because each request introduces an overhead beyond the time it takes for the raw data to be transmitted (due to protocol-specific information being transmitted and the time it takes to establish a connection for example).
Don't rely on JavaScript. JavaScript can be used for usability enhancements, but must not be used to provide essential functionality of your application.
Adding to Kiee's answer:
It can also depend on the size of your content. If your tweets and user info are very large, the response from the single PHP file will take considerable time to prepare and deliver. In that case you should go for a "minimal viable response" (e.g. the last 10 tweets + the 10 most popular users, or similar).
But what you definitely will have to do: create an API to bring your page to life. No matter which approach you will use...
I developed a PHP application whose main purpose is to fetch data from a database. I want to prevent all records from being fetched by machine requests (I mean requests made by something non-human, such as cURL; you generally prevent such requests with a CAPTCHA).
How can I let only search engines grab my data, and no one else, without noticeably hurting usability?
related: Preventing non-human generated requests
To open your question, I clicked the link and my browser made a request to the Stack Overflow server and asked for this page. That's the same thing cURL does... except it can't handle JavaScript. But then again, I didn't parse the JavaScript myself; my browser did. It was, again, a program.
What I really need to emphasize is that there is virtually no way to prevent a machine from faking user activity.
But here are some tricks if you are interested. Personally I prefer methods that don't involve the human directly.
Add a captcha challenge to pages.
If your target audience is mostly modern people with modern browsers, use some Ajax page loading. This will keep out most low-end scrapers, but not all. Google can process some Ajax requests. See hashbangs.
Log IP addresses of the users and look for guys with several thousands of hits in a small time.
Add some flood control to the site. You can disallow a form submission (for example) from being processed more than once a minute.
Add tokens to the form and validate them. This will at least make the crawling a two-step process.
And make your site fetch only a little data from the database per request. For example, if your application is a calendar, you can disallow all requests to show dates in a range longer than a year.
You can't block bots by their user agent. cURL and other programs can send any user agent the user chooses when making the request.
You can adjust how Googlebot should behave in Google Webmaster Central. Try to match it with your flood control mechanism.
And remember, Google advises you not to depend on its user agent.
What kind of algorithm do websites, including Stack Exchange, use to catch robots?
What makes them fail at times and present human-verification to normal users?
For web-applications and websites running on PHP, what would you recommend in order to stop robots and bot attacks and even content stealing?
Thank you.
Check out http://www.captcha.net/ for good and easy human-verification tools.
Preventing content stealing will be really difficult as you want the information to be available to your visitors.
Do not disable right click; it will only annoy your users and will not stop content thieves in any way.
You won't be able to keep out all bots, but you will be able to implement layers of security that will each stop a part of the bots.
A few hints and tips;
Use CAPTCHAs for human verification, but don't use too many of them as they will tire users.
You could do e-mail verification with a Captcha and require a login for your content (if it doesn't scare away too many users). Or consider giving some part of the content for free and require registration for the full content.
Check for pieces of your content on other sites regularly (through Google, possibly automated with the Google API) and send a DMCA notice or sue if they blatantly copied (not quoted!) your content.
Limit the speed at which individual clients can make requests to your site. Bots will scrape often and quickly. Requesting content more than once a second is already a lot for human users. There are server tools that can accomplish this, e.g. check out http://www.modsecurity.org/
I am sure there are more layers of security that can be thought of, but these come to mind directly.
I ran across an interesting article from Princeton University that presents nice ideas for automatic robot detection. The idea is quite simple. Humans behave differently than machines, and an automated access usually does things differently than a human.
The article presents some basic checks that can be done over the course of a few requests. You spend a few requests gathering information about how the client is browsing and after some time you take all your variables and make an assertion. Things to include are:
Mouse movement: a robot will most likely not use a mouse and therefore will not generate mouse movement events in the browser. You can prepare a javascript function, say "onBodyMouseMove()" and call it whenever the mouse moves over the entire area of page's body. If this function is called, count +1 in a session counter.
JavaScript: some robots will not take the time to run JavaScript (e.g. curl, wget, axel, and other command line tools), since they are mostly sending specific requests that return useful output. You can prepare a function that is called after a page is loaded and count +1 in a session counter.
Invisible links: crawler robots are sucking machines that don't care about the content of a website. They are designed to click on all possible links and suck all the contents to a mirror location. You can insert invisible links somewhere in your webpage -- for example, a few &nbsp; space characters at the bottom of the page surrounded by an anchor tag. Humans will never see this link, so if you get a request for it, count +1 in a session counter.
CSS, images, and other visual components: robots will most likely ignore CSS and images, because they are not interested in rendering the webpage for viewing. You can hide a counting handler behind a URL that ends in *.css or *.jpg (you can use Apache rewrites or servlet mappings for Java). If these specific links are accessed, it's most likely a browser loading CSS and JPG for viewing.
NOTE: *.css, *.js, *.jpg, etc. are usually loaded only once per page in a session. You need to append a unique counter at the end of the URL for the browser to reload these links every time the page is requested.
Once you gather all that information in your session over the course of a few requests, you can make an assertion. For example, if you don't see any JavaScript, CSS or mouse move activity you can assume it's a bot. It's up to you how to weigh these counters according to your needs, so you can program the decision based on these variables any way you want. If you decide some client is a robot, you can force it to solve a CAPTCHA before continuing with further requests.
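To make the "gather counters, then decide" idea concrete, here is a small Python sketch. The field names, the threshold of five requests and the decision rule are my own assumptions for illustration; the article may weigh the signals differently.

```python
from dataclasses import dataclass


@dataclass
class SessionSignals:
    """Per-session counters; every name here is hypothetical."""
    requests: int = 0
    mouse_moves: int = 0       # calls to the mouse-move handler
    js_pings: int = 0          # calls made by the on-load JavaScript function
    asset_loads: int = 0       # hits on the tracked *.css / *.jpg URLs
    hidden_link_hits: int = 0  # hits on the invisible link


def looks_like_bot(signals, min_requests=5):
    """Return True or False once decided, or None while still gathering data."""
    # Following an invisible link is treated as conclusive on its own.
    if signals.hidden_link_hits > 0:
        return True
    # Too few requests so far to make an assertion.
    if signals.requests < min_requests:
        return None
    human_evidence = sum(
        1
        for count in (signals.mouse_moves, signals.js_pings, signals.asset_loads)
        if count > 0
    )
    # No JavaScript, no assets and no mouse activity: assume a robot.
    return human_evidence == 0
```

Because a single positive signal is enough to pass, a client that runs JavaScript and loads assets but never moves a mouse is not flagged by this rule.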
Just a note: Tablets will usually not create any mouse move events. So I'm still trying to figure out how to deal with them. Suggestions are welcome :)
I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robots meta tags. This stops conforming spiders, but note that a tag like the following will also prevent Google from indexing you, for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (e.g. the "view all" page).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra hoop someone would have to jump through to get the data, but it would prevent simple page scraping.
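A sketch of the time-limited hash, written in Python for brevity (the same idea carries over to PHP). It assumes a server-side secret and signs the session id together with the issue time; the names and the token format are invented for this example.

```python
import hashlib
import hmac
import time

SERVER_SECRET = b"replace-with-a-private-key"  # hypothetical, never sent to the client


def _sign(session_id, issued):
    message = f"{session_id}:{issued}".encode()
    return hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()


def issue_token(session_id, now=None):
    """Create the token embedded in the page shell when it is served."""
    issued = int(time.time() if now is None else now)
    return f"{issued}:{_sign(session_id, issued)}"


def verify_token(session_id, token, max_age=10, now=None):
    """Accept the AJAX call only if the token is genuine and still fresh."""
    current = time.time() if now is None else now
    try:
        issued_text, signature = token.split(":", 1)
        issued = int(issued_text)
    except ValueError:
        return False  # malformed token
    expected = _sign(session_id, issued)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    return 0 <= current - issued <= max_age
```

A scraper can still load the shell first and replay the token within the window, so this only raises the effort, exactly as described above.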
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
Force a reCAPTCHA every 10 page loads for each unique IP.
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!
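Roughly, the garbage-data idea could be sketched like this in Python. How a client gets flagged and how a trusted crawler is recognised are left out; the `flagged` and `trusted_crawler` inputs, the record layout and the function names are all assumptions for the example.

```python
import random

DIGITS = "0123456789"


def scramble_phone(phone, seed, digits_to_change=2):
    """Change a couple of digits, the same way every time for a given seed."""
    rng = random.Random(seed)
    chars = list(phone)
    positions = [i for i, c in enumerate(chars) if c in DIGITS]
    for i in rng.sample(positions, min(digits_to_change, len(positions))):
        # Shift by 1..9 so the digit is guaranteed to differ from the original.
        chars[i] = str((int(chars[i]) + rng.randint(1, 9)) % 10)
    return "".join(chars)


def present_record(record, flagged, trusted_crawler=False):
    """Return the real record, or a subtly wrong copy for a suspected scraper."""
    if not flagged or trusted_crawler:
        return record
    poisoned = dict(record)  # never modify the stored record itself
    poisoned["phone"] = scramble_phone(record["phone"], seed=record["id"])
    return poisoned
```

Seeding by record id keeps the garbage consistent between requests, so a scraper cannot spot it by fetching the same record twice.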
Normally to screen-scrape a decent amount one has to make hundreds, thousands (and more) requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
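For illustration, the two mechanisms might be sketched as follows in Python. The thresholds (10 free loads per minute, a CAPTCHA at 30 loads in 60 seconds, a 60 second cap) are placeholder values, not recommendations, and the function names are hypothetical.

```python
def delay_seconds(loads_in_last_minute, free_loads=10, base=0.5, cap=60.0):
    """Delay that doubles for every page load beyond `free_loads` per minute."""
    excess = loads_in_last_minute - free_loads
    if excess <= 0:
        return 0.0
    # Clamp the exponent so very large counts cannot overflow a float.
    exponent = min(excess - 1, 32)
    return min(cap, base * 2 ** exponent)


def needs_captcha(load_times, now, n=30, x=60.0):
    """True once a client has made at least n page loads in the last x seconds."""
    recent = [t for t in load_times if now - t <= x]
    return len(recent) >= n
```

With these defaults the eleventh load in a minute waits 0.5 s, the twelfth 1 s, the fourteenth 4 s, and anything past the eighteenth hits the 60 s cap.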
My suggestion would be that copying your content this way is likely illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
I have been working on a site that makes some pretty big use of AJAX and dynamic JavaScript on the front end and it's time to start stress testing. But how do you properly stress test something that requires clicking several links on the front end? One way I was able to easily hit every page of the site quickly and repeatedly was to point a Google Mini at it. But that's not going to click links and then navigate modal windows and things like that.
Edit - I should point out that the site is done in PHP5 and the JavaScript library used is jQuery. Not sure if this would make any difference but felt it might be useful to know.
JMeter is great at this. You may record your sessions and tweak them to your liking.
So-called 'ajax load testing' is a recurring subject on this site, and is often confused. So let's get it straight: There is really no difference between load testing a normal web page and load testing with ajax. It all boils down to discrete requests; they just happen to not be full page refreshes.
One thing to keep in mind is there is a distinct difference between load testing the server processing the requests (a load test) and the performance on screen of the UI components being updated (how well your javascript performs.)
Simple load test example:
initial page load
login
navigate?
5-10 'ajax' requests (or whatever may fit your application usage pattern)
logout
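The scenario above can be sketched as a tiny Python harness. All paths are invented, and the `send` callable stands in for whatever HTTP client you use; this is a sketch of the structure, not a replacement for a real load testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def build_scenario(ajax_calls=5):
    """Page load, login, navigate, some 'ajax' requests, logout (hypothetical paths)."""
    steps = [("GET", "/"), ("POST", "/login"), ("GET", "/map")]
    steps += [("GET", f"/ajax/locations?page={i}") for i in range(ajax_calls)]
    steps.append(("POST", "/logout"))
    return steps


def run_user(send, steps):
    """Play the scenario once and record how long each request took."""
    timings = []
    for method, path in steps:
        start = time.perf_counter()
        send(method, path)
        timings.append((path, time.perf_counter() - start))
    return timings


def run_load_test(send, users=10, ajax_calls=5):
    """Run the scenario for `users` simulated users in parallel threads."""
    steps = build_scenario(ajax_calls)
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(run_user, send, steps) for _ in range(users)]
        return [future.result() for future in futures]
```

Each entry in the result is one user's list of (path, seconds) pairs, which you can aggregate per path to see which requests slow down under load.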
There are load testing tools that can support AJAX. For example, WebLoad
http://www.radview.com/solutions/ajax-load-testing.aspx
What you really want to stress test is the server's ability to handle the AJAX requests. Use a load tool that looks at the requests while "recording" the test, and then tune as appropriate. I have only used the VS Test Edition one, so I can't point you to a low-cost one.
I disagree with Nathan and Freddy to some degree. They are correct that "AJAX testing" is really no different in that HTTP requests are made. But it's not that simple. See my article on Ajaxian.com on Why Load Testing Ajax is Hard.
JMeter, Pylot, and The Grinder are all great tools for generating HTTP requests (I personally recommend Pylot). But at their core, they don't act as a browser and process JavaScript, meaning all they do is replay the traffic they saw at record time. If those AJAX requests were unique to that session, they may not be suitable/correct to replay in large volumes.
The fact is that as more logic is pushed down into the browser, it becomes much more difficult (if not impossible) to properly simulate the traffic using traditional load testing tools.
In my article, I give a simple example of how difficult it becomes to test something like Google's home page when you want to query thousands of different search terms (an important goal during load testing). To do it with JMeter/Pylot/Grinder you effectively end up rewriting parts of the AJAX code (in your case with jQuery) in the native language of the tool.
It gets even more complex if your goal is to measure the response time as perceived by the user (which is arguably the most important thing at the end of the day). For really complex applications that use Comet/"Reverse Ajax" (a technique that keeps open sockets for long periods of time), traditional load tools don't work at all.
My company, BrowserMob, provides a load testing service that uses Firefox browsers, powered by Selenium, to drive hundreds or thousands of real browsers, allowing you to measure and time the performance of visual elements as seen in the browser. We also support traditional virtual users (blind HTTP traffic) and a simulated browser (via HtmlUnit).
All that said, usually a mix of a service like BrowserMob plus traditional load testing is the right approach. That is, real browsers are great for a full-fidelity load test, but they will never be as economical as "virtual users", since they require 10-100X more RAM and CPU. See my recent blog post on whether to simulate or not to simulate virtual users.
Hope that helps!
You could use something like OpenSTA.
This allows a session with a web site to be recorded and then played back via a relatively simple script language.
You can also easily test web services and write your own scripts.
It allows you to put scripts together in a test in any way you want and configure the number of iterations, the number of users in each iteration, the ramp up time to introduce each new user and the delay between each iteration. Tests can also be scheduled in the future.
It's open source and free.
It produces a number of reports which can be saved to a spreadsheet. We then use a pivot table to easily analyse and graph the results.