the context: while content and copy protection is practically a moot point, and I've heard every argument there is about right-context menu disable, the state of the art is to secure a page with JavaScript measures, and then somehow create a Page wipe if the security is bypassed by disabling JavaScript.
Another way to say this is "make the content depend on continuous availability of JavaScript". My target is a page wipe initiated within 1 second of JavaScript being disabled.
The methods I'm exploring are PHP for WordPress...and I'm considering initiating a process on page load which wipes the page unless a JavaScript token is present to stop the process. I also thought of a method which starts a process server side to rewrite the page content if err status can be proven to show JS disabled.
I agree with #alessioalex. What you're asking for is not possible.
The only possible way to continuously monitor anything on a web page is with Javascript. And the only way to change the page content after page load is with Javascript. And if I've got Firebug installed (or the developer tools for any other browser), I can see the entire content of your site, and even alter your Javascript while it's running. I don't even need to turn Javascript off to disable your protection feature.
There is a very limited amount of protection you can get from the basic "disable context menu" function. It isn't secure, but it does protect you against people who don't know what they're doing or who can't be bothered to waste their time getting past it.
But there is zero additional protection to be had from doing anything more than that in a browser environment, because all your code and graphics are available and it is trivial for a user with even limited knowledge to get at it.
If someone knows enough to try switching off Javascript (and if they're determined enough to get at your code to do that) then you've already lost the battle.
You say you've heard every argument about copy protection on the web, but if you're asking this question you clearly haven't understood those arguments.
Bottom line: If copy protection is that important to you, then a web page is the wrong environment for your content.
Probably not something you are looking for, since the workaround for this technique would be to use IE, but in the future one could use CSS Transitions to schedule "disappearance" of the content and reset it with JS periodically :)
Related
What kind of algorithm do websites, including stackexchange use to catch robots?
What makes them fail at times and present human-verification to normal users?
For web-applications and websites running on PHP, what would you recommend in order to stop robots and bot attacks and even content stealing?
Thank you.
Check out http://www.captcha.net/ for good and easy human-verification tools.
Preventing content stealing will be really difficult as you want the information to be available to your visitors.
Do not disable right click, it will only annoy your users and not stop content thiefs in any way.
You won't be able to keep out all bots, but you will be able to implement layers of security that will each stop a part of the bots.
A few hints and tips;
Use Captcha's for human verification, but don't use too many of them as they will tire users.
You could do e-mail verification with a Captcha and require a login for your content (if it doesn't scare away too many users). Or consider giving some part of the content for free and require registration for the full content.
Check for pieces of your content on other sites regularly (through Google, possibly automated with the Google API) and sue / DMCA notice if they blatantly stole (not quoted!) your content.
Limit the speed at which individual clients can make requests to your site. Bots will scrape often and quickly. Requesting content more than once a second is already a lot for human users. There are server tools that can accomplish this, eg. check out http://www.modsecurity.org/
I am sure there are more layers of security that can be thought of, but these come to mind directly.
I ran across an interesting article from Princeton University that presents nice ideas for automatic robot detection. The idea is quite simple. Humans behave differently than machines, and an automated access usually does things differently than a human.
The article presents some basic checks that can be done over the course of a few requests. You spend a few requests gathering information about how the client is browsing and after some time you take all your variables and make an assertion. Things to include are:
Mouse movement: a robot will most likely not use a mouse and therefore will not generate mouse movement events in the browser. You can prepare a javascript function, say "onBodyMouseMove()" and call it whenever the mouse moves over the entire area of page's body. If this function is called, count +1 in a session counter.
Javascript: some robots will not take the time to run javascript (i.e. curl, wget, axel, and other command line tools), since they are mostly sending specific requests that return useful output. You can prepare a function that is called after a page is loaded and count +1 in a session counter.
Invisble links: crawler robots are sucking machines that don't care about the content of a website. They are designed to click on all possible links and suck all the contents to a mirror location. You can insert invisible links somewhere in your webpage -- for example, a few nbsp; space characters at the bottom of the page surrounded by an anchor tag. Humans will not ever see this link, but you get a request on it, count +1 in a session counter.
CSS, images, and other visual components: robots will most likely ignore CSS and images, because they are not interested in rendering the webpage for viewing. You can hide a link to inside an URL that ends in *.css or *.jpg (you can use Apache rewrites or servlet mappings for Java). If these specific links are accessed, it's most likely a browser loading CSS and JPG for viewing.
NOTE: *.css, *.js, *.jpg, etc are usually loaded only once per page in a session. You need to append a unique counter at the end for the browser to reload these links everytime the page is requested.
Once you gather all that information in your session over the course of a few requests, you can make an assertion. For example, if you don't see any javascript, css or mouse move activity you can assume it's a bot. It's up to you to take these counters into consideration according to your needs.. so you can program it based on these variables any way you want. If you decide some client is a robot, you can force him to solve some captcha before continuing with further requests.
Just a note: Tablets will usually not create any mouse move events. So I'm still trying to figure out how to deal with them. Suggestions are welcome :)
I've done web scraping before but it was never this complex. I want to grab course information from a school website. However all the course information is displayed in a web scraper's nightmare.
First off, when you click the "Schedule of Classes" url, it directs you through several other pages first (I believe to set cookies and check other crap).
Then it finally loads a page with an iframe that apparently only likes to load when it's loaded from within the institution's webpage (ie arizona.edu).
From there the form submissions have to be made via buttons that don't actually reload the page but merely submit a AJAX query and I think it just manipulates the iframe.
This query is particularly hard for me to replicate. I've been using PHP and curl to simulate a browser visiting the initial page, gather's the proper cookies and such. But I think I have a problem with the headers that my curl function is sending because it never lets me execute any sort of query after the initial "search form" loads.
Any help would be awesome...
http://www.arizona.edu/students/registering-classes -> "Schedule of Classes"
Or just here:
http://schedule.arizona.edu/
If you need to scrape a site with heavy JS / AJAX usage - you need something more powerful than php ;)
First - it must be full browser with capability to execute JS, and second - there must be some api for auto-browsing.
Assuming that you are a kid (who else would need to parse a school) - try Firefox with iMacros. If you are more seasoned veteran - look towards Selenium.
I used to scrap a lot of pages with JS, iframes and all kinds of that stuff. I used PhantomJS as a headless browser, that later I wrapped with PhantomCurl wrapper. The wrapper is a python script that can be run from command line or imported as a module
Are you sure you are allowed to scrape the site?
If yes, then they could just give you a simple REST api?
In rare case when they would allow you to get to the data, but would not provide API, my advice would be to install some software to record your HTTP interaction with web site, maybe wireshark, or some HTTP proxy, but it is important that you get all details of http requests recorded. After you have that, analyze it, and try to replay it up to the latest bit.
Among possible chores, it might be that at some point in time server sends you generated javascript, that needs to be executed by the client browser in order to get to the next step. In this case you would need to figure how to parse received javascript, and figure out how to move next.
An also good idea would be not to fire all your http requests in burst mode, put put some random delays so that it appears to the server more "human" like.
But in the end you need to figure out if all this is worth the trouble? Since almost any road block to scraping can be worked around, but it can get quite involved and time consuming.
A couple of years ago, before I knew about Stack Overflow, I was working in an office with a lot of competition between the programmers. There, I had to code a web page in PHP with Drupal, that needed to get data from another site by RSS. What happened was that there was no way to get the data beforehand: the data depended on the content of the page which itself was dynamic, so the page stopped loading for a couple of seconds while PHP went to get the RSS data. That was bad. The page depended on a couple of parameters out of a huge list. So fetching all possible combinations in davance was out of the questions. It was some sort of search page, that included the results of a sister site, I think.
The first thing I did to improve that was to set up a caching system. When the page was loaded, it launched a Javascript method that saved the RSS data back into the database for this specific page, using AJAX. That meant that if the same page was requested again, the old data would be sent immediately. and the AJAX script would get the cache updated with the new data, if needed. The Javascript pretty much opened a hidden page on the site with a GET instruction that matched the current page's parameters. It's only a couple of days later that I realised that I could have cached the data without the AJAx. (Trust me, it's easier to spot in hindsight.) But that's not the issue I'm asking about.
But I was told not to do any caching at all. I was told that my AJAX page "exposed the API". That a malicious user could hit the hidden page again and again to do a Denial of Service attack. I thought my AJAX was a temporary solution anyway, but that caching was needed. But mostly: wasn't the DoS argument true of ANY page on the site? Did the fact that my hidden page did not appear in the menus and returned no content make it worse?
As I said, there was a lot of competition between programmers, so the people around me, who were unanimous, might have been right, or they may have tried to stop me from doing something that was bad because they were not the ones doing it. (It happened a lot.) But I'm still curious. I was fully aware that my AJAX thing was a hack. I wanted to change that system as soon as I found something better, but I thought that no caching at all was even worse. Which was true? Doesn't, by that logic, ALL AJAX expose the API? If we look past the fact that my AJAX was an ugly hack, was it really that dangerous?
I'll admit again and again that it was an ugly, temporary fix, but my question is about having a "hidden" page that returns no content that makes the server do something. How horrible is that?
both sides are right. Yes, it does "expose" the api, but ajax requests can only access publicly accessible documents/scripts in the first place, so yes, all ajax requests "expose" their target script in the same way. DoS attacks are not script specific, they are server specific, so one can perform a DoS using anything pointing to the server, not just this script your ajax calls. I would tell your buddies their argument is weak and grasping at straws, and don't be jealous :P
If I read your post correctly, it seems as if the AJAX requested version of the page would know to invalidate the cache each time?
If that's the case, then I suppose your co-worker might have been saying that the hidden page would be susceptible to a DDOS attack in a way that the full pageload wasn't. I.E. The full pageload would get a cached version on each pageload after the first, where as the AJAX version would get fresh content each time. If that's the case, then s/he's right.
By "expose the API", your co-worker was saying that you were exposing the URL of a page that was doing work that should be done in the background. The outside world should not know about a URL whose sole purpose is to do some heavy lifting task. As you even said, you found a backend solution that didn't require the user's browser knowing about your worker process at all.
Yes, having no cache at all when the page relies on heavy content is worse than having an ajax version of the page do the caching, but I think the warning from your coworker was that no page, EVEN if it's AJAX, should have the power to break the cache in a way you didn't expect or intend.
The only way this would be a problem is said "hidden page that returns no content that makes the server do something" had different authentication scheme or permissioning from the rest of the pages, or if what it made the back-end do would be inordinately heavy compared to any other page on the site that posted something.
When is it appropriate to use AJAX?
what are the pros and cons of using AJAX?
In response to my last question: some people seemed very adamant that I should only use AJAX if the situation was appropriate:
Should I add AJAX logic to my PHP classes/scripts?
In response to Chad Birch's answer:
Yes, I'm referring to when developing a "standard" site that would employ AJAX for its benefits, and wouldn't be crippled by its application. Using AJAX in a way that would kill search rankings would not be acceptable. So if "keeping the site intact" requires more work, than that would be a "con".
It's a pretty large subject, but you should be using AJAX to enhance the user experience, without making the site totally dependent on it. Remember that search engines and some other visitors won't be able to execute the AJAX, so if you rely on it to load your content, that will not work in your favor.
For example, you might think that it would be nice to have users visit your blog, and then have the page dynamically load the newest article(s) with AJAX once they're already there. However, when Google tries to index your blog, it's just going to get the blank site.
A good search term to find resources related to this subject is "progressive enhancement". There's plenty of good stuff out there, spend some time following the links around. Here's one to start you off:
http://www.alistapart.com/articles/progressiveenhancementwithjavascript/
When you are only updating part of a page or perhaps performing an action that doesn't update the page at all AJAX can be a very good tool. It's much more lightweight than an entire page refresh for something like this. Conversely, if your entire page reloads or you change to a different view, you really should just link (or post) to the new page rather than download it via AJAX and replace the entire contents.
One downside to using AJAX is that it requires javascript to be working OR you to construct your view in such a way that the UI still works without it. This is more complicated than doing it just via normal links/posts.
AJAX is usually used to perform an HTTP request while the page is already loaded (without loading another page).
The most common use is to update part of the view. Note that this does not include refreshing the whole view since you could just navigate to a new page.
Another common use is to submit forms. In all cases, but especially for forms, it is important to have good ways of handling browsers that do not have javascript or where it is disabled.
I think the advantage of using ajax technologies isn't only for creating better user-experiences, the ability to make server calls for only specific data is a huge performance benefit.
Imagine having a huge bandwidth eater site (like stackoverflow), most of the navigation done by users is done through page reloads, and data that is continuously sent over HTTP.
Of course caching and other techniques help this bandwidth over-head problem, but personally I think that sending huge chunks of HTML everytime is really a waste.
Cons are SEO (which doesn't work with highly based ajax sites) and people that have JavaScript disabled.
When your application (or your users) demand a richer user experience than a traditional webpage is able to provide.
Ajax gives you two big things:
Responsiveness - you can update only parts of a web page at a time if need be (saving the time to re-load a page). It also makes it easier to page data that is presented in a table for instance.
User Experience - This goes along with responsiveness. With AJAX you can add animations, cooler popups and special effects to give your web pages a newer, cleaner and cooler look and feel. If no one thinks this is important then look to the iPhone. User Experience draws people into an application and make them want to use it, one of the key steps in ensuring an application's success.
For a good case study, look at this site. AJAX effects like animating your new Answer when posted, popups to tell you you can't do certain things and hints that new answers have been posted since you started your own answer are all part of drawing people into this site and making it successful.
Javascript should always just be an addition to the functionality of your website. You should be able to use and navigate the site without any Javascript involved. You can use Javascript as an addition to existing functionality, for example to avoid full-page reloads. This is an important factor for accessibility. Javascript should never be used as the only possibility to reach or complete a request on your site.
As AJAX makes use of Javascript, the same applies here.
Ajax is primarily used when you want to reload part of a page without reposting all the information to the server.
Cons:
More complicated than doing a normal post (working with different browsers, writing server side code to hadle partial postbacks)
Introduces potential security vulnerabilities (
You are introducing additional code that interacts with the server. This can be a problem on both the client and server.
On the client, you need ways of sending and receiving responses. It's another way of interacting with the browser which means there is another point of entry that has to be guarded. Executing arbritary code, posting data to a non-intended source etc. There are several exploits for Ajax apps that have been plugged over time, but there will always be more.
)
Pros:
It looks flashier to end users
Allows a lot of information to be displayed on the page without having to load all at the same time
Page is more interactive.
I have a doubt: -
Is there any standard/convention that when should I use "Smarty templating" and when should I use Javascript Ajax calls to produce the content? I can use Ajax/Javascript calls to produce the content dynamically.
My application uses both Ajax and Smarty, but I want to set a rule for developers
You should only use AJAX calls to load dynamic data that is not known at the moment the page is loaded. For example, when you click the "comments" link for a given question/answer on Stack Overflow, an AJAX call is made to dynamically load the data. This is a result of the user clicking the comments link, not a result of the page loading. You don't know that you should show those comments at the time the page is loaded, so it is appropriate to make an AJAX call in this case.
You should use templating to show any data that is known at the moment the page is loaded. It makes it easier to deal with people that have Javascript disabled (I know, not a lot), and it provides a clear separation of logic from presentation. Another important benefit of using templating is the fact that is can significantly decrease the number of HTTP requests made from the client's browser.
This is especially important in the mobile browsing world where latency, not bandwidth, is your biggest obstacle. In mobile Safari, for example, a single HTTP request to a Smarty-templated page will load significantly faster than a request to load a Javascript-templated page that makes five or six additional HTTP requests. This is especially true when using EDGE, 3G, and other non-wifi mobile data services. In fact, this is so important that it is the first guideline in Yahoo's Best Practices for Speeding Up Your Website.
Ideally, you should also gracefully degrade functionality when Javascript is disabled. A good example is an auto-completing search box. It's really cool to have suggested search terms magically appear as you type, but if you turn off Javascript, you still have a functional search box. That's a classic example of a good degradation in service. Stack Overflow generally does a great job providing a solid non-Javascript experience. One place it falls short is in the comments. When Javascript is disabled, only the most popular comments are displayed, and posting new comments is disabled.
Unless absolutely necessary, you should think of Javascript as a bonus feature that might not be enabled, not as something that should be used to construct critical pieces of your website. There are obviously exceptions (some things just can't be done without Javascript). You'll notice, for example, that Stack Overflow is very usable with Javascript turned off. You won't get the real-time updates when new answers are posted, or fancy real-time Markdown previews, but the core functionality is still there. All the "heavy lifting" is done with HTML and CSS. Javascript is just icing (admittedly very good icing) on the cake. This is kind of a side note, but it's important enough to mention.
It probably depends on what sort of work you are doing in your templates. Personally, I hate doing a lot of heavy style/layout stuff strictly in Javascript. If you can load the bulk of your layout via. Smarty and just change specific bits of data (just the data, not the markup/style, if possible) that might be a good place to start with for standardizing within your own developer team.
Use templates for server-side generation and DHTML/AJAX for anything after the original page load (not using a refresh). Even then, the server response for the AJAX call itself can be assembled with a template, which may work best for any non-trivial content.