How would you protect a database of links from being scraped? - php

I have a large database of links, which are all sorted in specific ways and are attached to other information, which is valuable (to some people).
Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If that count is greater than x, it redirects you to a captcha page.
That all works fine and dandy, but the site has been getting really popular (as well as getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the times I have to hit up PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting up PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent this, while still not relying on PHP to do the check before spitting out the link?

It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.
Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).
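A minimal sketch of the in-memory throttle using the memcached extension, assuming a Memcached server on localhost:11211; the key scheme and the threshold of 50 are made-up stand-ins for the question's "x":

```php
<?php
// Sketch: IP throttling kept in Memcached instead of the database.
// The 300-second window mirrors the question's 5-minute check.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'throttle:' . $ip;
$maxRequests = 50; // the "x" from the question

// add() is atomic and only succeeds if the key is absent, so it safely
// seeds the counter; otherwise increment() bumps the existing count.
// Note this gives a fixed window starting at the first request.
if ($memcached->add($key, 1, 300)) {
    $count = 1;
} else {
    $count = $memcached->increment($key);
}

if ($count > $maxRequests) {
    header('Location: /captcha.php');
    exit;
}
// ...continue with the normal link lookup; the DB never sees the throttle.
```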

You could do the IP throttling at the web server level. Maybe a module exists for your web server, or, as an example, using Apache you can write your own RewriteMap and have it consult a daemon program so you can do more complex things. Have the daemon program query a memory database; it will be fast.
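As a sketch of that idea using Apache's prg: map type, which feeds one lookup key per line to a long-running program on stdin and reads one answer line back on stdout; the map name, file path, and threshold here are all hypothetical:

```php
#!/usr/bin/php
<?php
// Sketch of a RewriteMap "prg:" daemon. Assumed Apache config:
//   RewriteMap throttle "prg:/usr/local/bin/throttle.php"
//   RewriteCond ${throttle:%{REMOTE_ADDR}} ^block$
//   RewriteRule ^link\.php$ /captcha.php [R,L]
// Apache writes one key (here, the client IP) per line and expects
// exactly one answer line per key, flushed immediately.
$counts = []; // naive in-process counts; a real daemon would expire
              // these on a time window, or consult Memcached instead

while (($line = fgets(STDIN)) !== false) {
    $ip = trim($line);
    $counts[$ip] = ($counts[$ip] ?? 0) + 1;
    fwrite(STDOUT, $counts[$ip] > 50 ? "block\n" : "ok\n");
    fflush(STDOUT); // unflushed output would deadlock Apache's lookup
}
```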

Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than an hour, etc.
If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.
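For illustration, a sketch of both pieces of that advice; the request_log table and its columns are hypothetical stand-ins for the question's logging table:

```php
<?php
// Sketch: index the columns the throttle query filters on, and prune
// old rows from a cron job so the log table stays small.
$pdo = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');

// One-time: a composite index so "requests from this IP in the last
// 5 minutes" becomes an index range scan, not a full table scan.
$pdo->exec('CREATE INDEX idx_ip_time ON request_log (ip, requested_at)');

// Scheduled cleanup: delete entries older than an hour.
$pdo->exec("DELETE FROM request_log
            WHERE requested_at < NOW() - INTERVAL 1 HOUR");
```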

Anything you do on the client side can't be protected, so why not just use AJAX?
Have an onClick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page. Because the request and the answer are both small, it will work fast enough for what you need. Just make sure the function you call checks the timestamp: it is easy to make a script that calls that function many times to steal your links.
You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast, and the client doesn't even know it's not pure JS.
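A sketch of what such an endpoint might look like, with the throttle check kept server-side so calling it directly gains a scraper nothing; requestsInLastFiveMinutes() is a hypothetical helper (e.g. the Memcached counter above), and the links table is assumed:

```php
<?php
// getlink.php — AJAX endpoint: re-check the per-IP throttle, then
// return just the link as JSON for the onClick handler to fill in.
header('Content-Type: application/json');

if (requestsInLastFiveMinutes($_SERVER['REMOTE_ADDR']) > 50) {
    http_response_code(429); // client JS can redirect to the captcha
    echo json_encode(['error' => 'captcha']);
    exit;
}

$id   = (int) $_GET['id'];
$stmt = $pdo->prepare('SELECT url FROM links WHERE id = ?'); // $pdo assumed
$stmt->execute([$id]);
echo json_encode(['url' => $stmt->fetchColumn()]);
```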

Most scrapers just analyze static HTML, so encode your links and then decode them dynamically in the client's web browser with JavaScript.
Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.
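As one possible flavour of this, a sketch of a PHP template using base64 as the (easily reversible) encoding; any encoding you can undo in JavaScript works just as well:

```php
<?php
// Sketch: emit each link base64-encoded in a data attribute so the URL
// never appears in the static HTML a naive scraper sees.
$url = 'http://example.com/some/link'; // fetched from the DB in reality
?>
<a href="#" class="enc" data-href="<?= base64_encode($url) ?>">download</a>
<script>
// atob() reverses PHP's base64_encode() for ASCII URLs.
document.querySelectorAll('a.enc').forEach(function (a) {
    a.href = atob(a.dataset.href);
});
</script>
```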

Related

Many HTTP requests (API) Vs Everything in single php

I need some advice on website design.
Let's take Twitter as an example for my question. Let's say I am making Twitter. Now on home_page.php I need both data about tweets (tweet id, who tweeted, tweet time, etc.) and data about the user (userId, username, user profile pic).
Now to display all this, I have two options in mind:
1) Making separate PHP files like tweets.php and userDetails.php. By using AJAX queries, I can get the data onto home_page.php.
2) Adding all the PHP code (connecting to the DB, fetching data) in home_page.php itself.
With option one, I need to make many HTTP requests, which (I think) will put load on the network, so it might slow down the website.
But with option one, I will also have a defined REST API, which will be good for adding more features in the future.
Please give me some advice on picking the best. Also, I am still a learner, so if there are more options for implementing this, please share.
In option 1 you're reliant on JavaScript, which doesn't follow progressive enhancement or graceful degradation; if a user doesn't have JS they will see zero content, which is obviously bad.
Split your code into manageable PHP files to make it easier to read, and require them all in one main PHP file; this won't take any extra HTTP requests because all the includes are done server side and one page is sent back.
You can add additional JavaScript to grab more "tweets" like Twitter does, but don't make the main functionality rely on JavaScript.
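A minimal sketch of that layout; all file and function names here are hypothetical, and the browser still makes just one request:

```php
<?php
// home_page.php — sketch of the split-and-require layout.
session_start();

require 'includes/db.php';          // creates $pdo
require 'includes/tweets.php';      // defines getTweets()
require 'includes/userDetails.php'; // defines getUserDetails()

$tweets = getTweets($pdo);
$user   = getUserDetails($pdo, $_SESSION['userId'] ?? 0);

require 'templates/home.php'; // renders $tweets and $user into HTML
```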
Don't think of PHP applications as a collection of PHP files that map to different URLs. A single PHP file should handle all your requests and include functionality as needed.
In network programming, it's usually good to minimize the number of network requests, because each request introduces an overhead beyond the time it takes for the raw data to be transmitted (due to protocol-specific information being transmitted and the time it takes to establish a connection for example).
Don't rely on JavaScript. JavaScript can be used for usability enhancements, but must not be used to provide essential functionality of your application.
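For instance, a bare-bones front controller might look like this sketch; the route and file names are made up:

```php
<?php
// index.php — every URL is routed through this one file, which pulls
// in whatever the request needs.
$route = $_GET['page'] ?? 'home';

switch ($route) {
    case 'home':
        require 'pages/home.php';
        break;
    case 'tweets':
        require 'api/tweets.php'; // can double as the JSON endpoint
        break;
    default:
        http_response_code(404);
        echo 'Not found';
}
```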
Adding to Kiee's answer:
It can also depend on the size of your content. If your tweets and user info are very large, the response from the single PHP file will take considerable time to prepare and deliver. In that case you should go for a "minimal viable response" (i.e. the last 10 tweets + the 10 most popular users, or similar).
But one thing you definitely will have to do is create an API to bring your page to life, no matter which approach you use...

Push data to page without checking periodically for it?

Is there any way you can push data to a page rather than checking for it periodically?
Obviously you can check for it periodically with AJAX, but is there any way you can force the page to reload when a PHP script is executed?
Theoretically you can improve an AJAX request's speed by having a table just for tracking when the AJAX function is supposed to execute (update a value in the table when the AJAX function should retrieve new data from the database). But this still requires a sizable amount of memory and a MySQL connection, as well as some waiting time while the query executes, even when there isn't an update and you don't want to run the AJAX function that retrieves database data.
Is there any way to either make this more efficient than querying a database and checking the table that stores the 'if updated' data, OR tell the AJAX function to execute from another page?
I guess node.js or HTML5 WebSocket could be a viable solution as well?
Or you could store the 'if updated' data in a text file? Any suggestions are welcome.
You're basically talking about notifying the client (i.e. browser) of server-side events. It really comes down to two things:
What web server are you using? (are you limited to a particular language?)
What browsers do you need to support?
Your best option is using WebSockets to do the job; anything besides WebSockets is a hack. Still, many "hacks" work just fine, and I suggest you try Comet or AJAX long-polling.
There's a project called Atmosphere (and many more) that provide you with a solution suited towards the web server you are using and then will automatically pick the best option depending on the user's browser.
If you aren't limited by browsers and can pick your web stack, then I suggest using Socket.IO + Node.js. It's just my preference right now; WebSockets is still in its infancy and things are going to get interesting once it develops more. Sometimes my entire application isn't suited for Node.js, so I'll just offload the data operation to it alone.
Good luck.
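For reference, a minimal long-polling sketch in PHP: the server holds the request open until new data appears (or a timeout passes), so the client gets the update the moment it happens rather than on a fixed polling schedule. hasNewData() and getNewData() are hypothetical helpers around whatever store holds your events:

```php
<?php
// poll.php — the client calls this in a loop; each call blocks until
// there is something new or 30 seconds have passed.
set_time_limit(40);
$since = (int) ($_GET['since'] ?? 0);

$deadline = time() + 30; // give up after 30 s; the client reconnects
while (time() < $deadline) {
    if (hasNewData($since)) {
        header('Content-Type: application/json');
        echo json_encode(getNewData($since));
        exit;
    }
    usleep(250000); // re-check four times a second
}
http_response_code(204); // no new data; the client polls again at once
```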
Another possibility: if you can store the data in a simple format in a file, update the file with the data and let the web server check its timestamp.
Then the browser can poll, making HEAD requests, which will check the update times on the file to see if it needs an updated copy.
This avoids making a DB call for anything that doesn't change the data, at the expense of keeping file system copies of important resources. It might be a good trade-off, though, if you can do this for active data and roll the files off after some time. You will need to ensure that the file gets updated on any call that changes the data.
It shares the synchronization risks of any systems with multiple copies of the same data, but it might be worth investigating if the enhanced responsiveness is worth the risks.
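A sketch of the server side of this, assuming the data lives in a hypothetical data/updates.json and that the web server (as Apache does) strips the body from HEAD responses:

```php
<?php
// updates.php — serve the data file with a Last-Modified header and
// honour If-Modified-Since, so a polling HEAD request costs no DB work.
$file  = 'data/updates.json';
$mtime = filemtime($file);

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');

$since = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
if ($since !== '' && strtotime($since) >= $mtime) {
    http_response_code(304); // unchanged; the client keeps its copy
    exit;
}

header('Content-Type: application/json');
readfile($file);
```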
There was once a technology called "server push" that kept a Web server process sitting there waiting for more output from your script and forwarding it on to the client when it appeared. This was the hot new technology of 1995 and, while you can probably still do it, nobody does because it's a freakishly terrible idea.
So yeah, you can, but when you get there you'll most likely wish you hadn't.
Well, you can (or soon will be able to) with HTML5 WebSockets.
This page has some great info about this technology:
http://www.html5rocks.com/en/tutorials/websockets/basics/

How can I display currently online users, without a database? Or at least very efficiently. Preferably in PHP

This issue has been quite the brain teaser for me for a little while. Apologies if I write quite a lot, I just want to be clear on what I've already tried etc.
I will explain the idea of my problem as simply as possible, as the complexities are pretty irrelevant.
We may have up to 80-90 users on the site at any one time. They will likely all be accessing the same page, that I will call result.php. They will be accessing different results however via a get variable for the ID (result.php?ID=456). It is likely that less than 3 or 4 users will be on an individual record at any one time, and there are upwards of 10000 records.
I need to know, with less than a 20-25 second margin of error (this is very important), who is on that particular ID on that page, and update the page accordingly. Removing their name once they are no longer on the page, once again as soon as possible.
At the moment, I am using a jQuery script which calls a php file, reading from a database of "Currently Accessing" usernames who are accessing this particular ID, and only if the date at which they accessed it is within the last 25 seconds. The file will also remove all entries older than 5 minutes, to keep the table tidy.
This was alright with 20 or 30 users, but now that load has more than doubled, I am noticing this is a particularly slow method.
What other methods are available to me? Has anyone had any experience in a similar situation?
Everything we use at the moment is coded in PHP with a little jQuery. We are running on a server managed offsite by a hosting company, if that matters.
I have come across something called Comet or a Comet Server which sounds like it could potentially be of assistance, but it also sounds extremely complicated for my purposes and far beyond my understanding at the moment.
Look into WebSockets for a realtime socket connection. You could use WebSockets to push out updates in real time (instead of polling) to ensure changes in the 'currently online users' list are sent within milliseconds.
What you want is an in-memory cache with a service layer that maintains the state of activity on the site. Using memcached might be a good starting point. Your pseudo-code would be something like:
On page access, make a call to CurrentUserService
CurrentUserService takes as a parameter the page you're accessing and who you are.
Each time you call it, it removes whatever you were accessing before from the cache.
Then it adds what you're currently accessing.
Then it compiles a list of who else is accessing the same thing based on the current state in the cache.
It returns this list, which your page processes and displays.
If you record when someone accesses a page, you can set a timeout for when the service stops 'counting' them as accessing the page.
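A rough sketch of that service with the memcached extension; the key names and 300-second cache TTL are assumptions, the 25-second freshness window comes from the question, and a production version would need CAS (check-and-set) to avoid lost updates when two requests modify the same viewer list at once:

```php
<?php
// Sketch of the CurrentUserService pseudo-code above.
function currentUsers(Memcached $mc, string $pageId, string $user): array
{
    $now = time();

    // Remove whatever this user was accessing before.
    $prevKey = "user-loc:$user";
    if ($prev = $mc->get($prevKey)) {
        $viewers = $mc->get("viewers:$prev") ?: [];
        unset($viewers[$user]);
        $mc->set("viewers:$prev", $viewers, 300);
    }

    // Record what they're accessing now.
    $mc->set($prevKey, $pageId, 300);
    $viewers = $mc->get("viewers:$pageId") ?: [];
    $viewers[$user] = $now;
    $mc->set("viewers:$pageId", $viewers, 300);

    // Only count viewers seen within the last 25 seconds.
    return array_keys(array_filter(
        $viewers,
        fn ($seen) => $now - $seen <= 25
    ));
}
```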

Guidance on the number of http calls; is too much AJAX a problem?

I'm developing a website where everything on the front end is written in JavaScript and communication with the server is done through JSON. I'm hesitating about this: is asking for every single piece of data with its own HTTP request OK, or is it completely unacceptable? (After all, many web developers combine multiple image requests into CSS sprites.)
Can you give me a hint please?
Thanks
It really depends upon the overall server load and bandwidth use.
If your site is very low traffic and is under no CPU or bandwidth burden, write your application in whatever manner is (a) most maintainable and (b) least likely to introduce bugs.
Of course, if the latency involved in making thirty HTTP requests for data is too awful, your users will hate you :) even if your server is very lightly loaded. Thirty times even 30 milliseconds equals an unhappy experience. So it depends very much on how much data each client will need to render each page or action.
If your application starts to suffer from too many HTTP connections, then you should look at bundling together the data that is always used together -- it wouldn't make sense to send your entire database to every client on every connection :) -- so try to hit the 'lowest hanging fruit' first, and combine the data together that is always used together, to reduce extra connections.
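As a trivial sketch of that bundling, one endpoint returning everything the page always needs together; the helper functions and the payload shape are hypothetical:

```php
<?php
// bootstrap.php — one request replaces three separate AJAX calls for
// data that is always used together on page load.
header('Content-Type: application/json');

echo json_encode([
    'user'          => getUserDetails($pdo, $userId),   // assumed helpers
    'notifications' => getNotifications($pdo, $userId),
    'settings'      => getSettings($pdo, $userId),
]);
```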
If you can request multiple related things at once, do it.
But there's no real reason against sending multiple HTTP requests - that's how AJAX apps usually work. ;)
The reason for using sprites instead of single small images is to reduce loading times since only one file has to be loaded instead of tons of small files at once - or at a certain time when it'd be desirable to have the image already available to be displayed.
My personal philosophy is:
The initial page load should be AJAX-free.
The page should operate without JavaScript well enough for the user to do all basic tasks.
With JavaScript, use AJAX in response to user actions and to replace full page reloads with targeted AJAX calls. After that, use as many AJAX calls as seem reasonable.

Best way to send PHP script usage statistics to an external script

I'm writing an application in PHP which I plan to sell copies of. If you need to know, it's used to allow users to download files using expiring download links.
I would like to know whenever one of my sold applications generates a download.
What would be the best way to send a notice to my php application on my server, which simply tells it "Hey, one of your scripts has done something", and what would be the best way to keep a count of the number of "hits" my server gets of this nature? A database record, or a flat text file?
I ask because I want to display a running count of the total number of downloads on my homepage, sort of like:
"Responsible for X downloads so far!"
A pure PHP solution is ideal, but I suppose an AJAX call would be OK too. The simpler the better, since all I am really doing is a simple $var++, only on a larger scale, right?
Anyone care to point me in the right direction?
Whether by JavaScript or PHP, you need to set up a URL on your server that other scripts can call. That URL should then point to a script that increments a counter. I'd put a number in a database and increment it; or, if you wanted to be more detailed, you could easily break this down by month/client, etc.
If you go with calling the URL from PHP, take care to ensure that the call doesn't block execution, i.e. if your site goes down, the script you sell shouldn't sit waiting for your server to respond. You can work around this in various ways; I'd do it by registering a shutdown function.
Alternatives that don't have this problem are loading the URL with JavaScript, or as an image (but both will likely be slightly less accurate). I would go with the image myself, as you'll get marginally better browser support.
Also, remember that unless you compile the code with something like Zend Guard, anyone can remove the remote call and prevent your counter incrementing!
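Putting those pieces together, a sketch of the PHP-side ping: fired from a shutdown function so the customer's download finishes first, with a short socket timeout so a dead stats server can't hang the script; the host and path are placeholders:

```php
<?php
// Phone-home ping, registered to run after the main script finishes.
register_shutdown_function(function () {
    $fp = @fsockopen('stats.example.com', 80, $errno, $errstr, 2); // 2 s timeout
    if (!$fp) {
        return; // stats server unreachable; fail silently
    }
    fwrite($fp, "GET /hit.php?app=downloader HTTP/1.1\r\n"
              . "Host: stats.example.com\r\n"
              . "Connection: Close\r\n\r\n");
    fclose($fp); // don't wait to read the response
});
```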
Yeah, something URL-callable is what you want. SOAP is probably the easiest way to go about this.
http://php.net/soap
Benlumley has a lot of valid points regarding this solution.
Also, if you want to offload computation to the user's browser (making web-service calls from PHP has a bandwidth/CPU cost that might annoy the people who buy your app), then AJAX might be a better solution.
