Best way to send PHP script usage statistics to an external script - php

I'm writing an application in PHP which I plan to sell copies of. If you need to know, it's used to allow users to download files using expiring download links.
I would like to know whenever one of my sold applications generates a download.
What would be the best way to send a notice to my php application on my server, which simply tells it "Hey, one of your scripts has done something", and what would be the best way to keep a count of the number of "hits" my server gets of this nature? A database record, or a flat text file?
I ask because I want to display a running count of the total number of downloads on my homepage, sort of like:
"Responsible for X downloads so far!"
A pure PHP solution is ideal, but I suppose an AJAX call would be OK too. The simpler the better, since all I am really doing is a simple $var++, only on a larger scale, right?
Anyone care to point me in the right direction?

Whether by javascript or php, you need to set a url up on your server that other scripts can call. That URL should then point to a script that increments a counter. I'd put a number in a database and increment it, or if you wanted to be more detailed, you could easily break this down by month/client etc.
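For illustration, here is a minimal sketch of what that counter URL might look like, assuming a MySQL table download_hits with a unique month column and a hits column (all names and connection details here are placeholders):

<?php
// hit.php - sketch of the counter endpoint; table, columns, and credentials are assumptions
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

// One row per month makes it easy to break the numbers down later;
// ON DUPLICATE KEY relies on `month` being a unique key
$stmt = $pdo->prepare(
    'INSERT INTO download_hits (month, hits) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE hits = hits + 1'
);
$stmt->execute([date('Y-m')]);

echo 'OK';

The homepage counter is then just SELECT SUM(hits) FROM download_hits.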
If you go with calling the URL from PHP, take care to ensure that the URL doesn't block the execution - ie: if your site goes down, the script you sell doesn't sit waiting for your server to respond. You can work around this in various ways - I'd do it by registering a shutdown function.
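One possible way to do that, sketched under the assumption that the stats URL is https://stats.example.com/hit.php (a placeholder):

<?php
// Ping the stats server from a shutdown function so a slow or unreachable stats
// server never holds up the customer's own script. URL and timeout are placeholders.
register_shutdown_function(function () {
    $context = stream_context_create([
        'http' => ['timeout' => 2],   // give up quickly if the stats server is down
    ]);
    // Errors are deliberately suppressed - the ping is best-effort only
    @file_get_contents('https://stats.example.com/hit.php', false, $context);
});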
Alternatives that don't have this problem are loading the url with javascript, or as an image (but they will both likely be slightly less accurate) - I would go with image myself, as you'll get marginally better browser support.
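The image variant only needs the sold script to emit something like <img src="https://stats.example.com/pixel.php" width="1" height="1">, with pixel.php incrementing the counter and returning a tiny image. A rough sketch (counter file and image path are assumptions):

<?php
// pixel.php - sketch of a tracking-pixel endpoint using a flat-file counter
$file = __DIR__ . '/hits.txt';

$fp = fopen($file, 'c+');              // create the file if it doesn't exist yet
if ($fp && flock($fp, LOCK_EX)) {
    $count = (int) stream_get_contents($fp);
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) ($count + 1));
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Serve a pre-made transparent 1x1 GIF so the <img> tag renders harmlessly
header('Content-Type: image/gif');
readfile(__DIR__ . '/1x1.gif');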
Also, remember that unless you compile the code with something like Zend Guard, anyone can remove the remote call and prevent your counter incrementing!

Yeah, something URL-callable is what you want. SOAP is probably the easiest way to go about this.
http://php.net/soap
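A non-WSDL SOAP call from the sold script can be as small as this sketch (the endpoint, namespace URI, method name, and argument are all placeholders):

<?php
// Sketch of a non-WSDL SOAP call to the stats server
$client = new SoapClient(null, [
    'location' => 'https://stats.example.com/soap-server.php',
    'uri'      => 'urn:downloadstats',
]);
$client->__soapCall('recordDownload', ['licence-key-123']);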
Benlumley has a lot of valid points regarding this solution.
Also, if you want to offload the work to the user's browser (making web-service calls might annoy the people who buy your app because of the bandwidth/CPU cost), then AJAX might be a better solution.

Efficiently limit number of hits per minute (block web scraping or copy-bots) in PHP

I am faced with the problem of bots copying all the content off my webpage (which I try to update quite often).
I try to ban them, or obfuscate code to make it more difficult to copy. However, they find some way to overcome these limitations.
I'd like to try to limit the number of hits per minute (or per some time period X, not necessarily minutes), but use a Captcha to overcome those limits. Something like: if you've requested more than 10 pages in the last 5 minutes, you need to prove you are human using a Captcha. Then, if the user is a legitimate user, they'll be able to continue surfing the web.
I'd like to do it only on the content pages (to do it more efficiently). I had thought of memcached, but since I don't own the server, I can't use it. If I were using servlets I'd use a HashMap or similar, but since I use PHP, I am still trying to think of a solution.
I don't see MySQL (or databases) as a solution, since I can get many hits per second, and I would have to keep deleting requests older than a few minutes, creating a lot of unnecessary and inefficient traffic.
Any ideas?
A summary:
If I get too many hits per minute in a section of the webpage, I'd like to limit it using Captcha efficiently, in PHP. Something like if you've requested more than 10 pages in the last 5 minutes, you need to prove you are human using a Captcha.
Your question kind of goes against the spirit of the internet.
Everyone copies/borrows from everyone
Every search engine has a copy of everything else on the web
I would guess the problem you're having is that these bots are stealing your traffic? If so, I'd suggest you try implementing an API allowing them to use your content legitimately.
This way you can control access, and crucially you can ask for a linkback to your site in return for using your content. This way your site should be number 1 for the content. You don't even really need an API to implement this policy.
If you insist on restricting user access you have the following choices:
Use a javascript solution and load the content into the page using Ajax. Even this is not going to fool the best bots.
Put all your content behind a username/password system.
Block offending IPs - it's a maintenance nightmare and you'll never have a guarantee but it'll probably help.
The problem is - if you want your content to be found by Google AND restricted to other bots you're asking the impossible.
Your best option is create an API and control people copying your stuff rather than trying to prevent it.

Execute linux command from web interface (just like in Django CMS demo generator)

I need to create a webpage that will generate a demo similar to https://django-cms.org/en/demo/.
To generate a demo, you just click "Get your demo!"
I don't mind about the language, PHP or anything, as long as it's free and open source.
When the button is clicked, it will run /var/testing/makesite.sh.
Inside makesite.sh, there is code like urlnya="$thegeneratedurl".
If we run echo http://$urlnya/, it will show the full URL, like http://site1045.demosite.com/.
After the demo website has been generated, I need it to display the link.
There are some PHP examples that achieve this with shell_exec, but I'm scared it's not really safe, and I don't know how to show progress and return the demo URL like the Django CMS site does.
Well, in Python you have subprocess and envoy to execute GNU/Linux commands. You can also use fabric to do this.
In order to achieve this you might need to learn how to use virtualenv, and a unique port number has to be created for each application.
This deserves a separate, long write-up. I cannot write the whole source code, though it is interesting.
For the URL part, I recommend you use a random subdomain or a similar idea. You might need to store a list of all previously generated values in a DB, or check the currently running demo sites, to avoid a clash.
References
http://docs.python.org/library/subprocess.html
https://github.com/kennethreitz/envoy
http://pypi.python.org/pypi/virtualenv
http://docs.fabfile.org/en/1.4.2/index.html
The possible issue with shell_exec() has nothing to do with merely using it. The risky thing is letting your user specify the string that's passed to it - that string could occasionally include all manner of weird attempts to "fool" your system and hack into it. But you don't need to do that. As long as you construct the string that's passed to it in your own code, there should be no issue.
The string should be exactly the same thing you'd type at a Linux shell prompt. Depending on the details of what the script wants, the string that starts the script might simply be something like "/usr/local/bin/makesite.sh". Or it might also contain some parameters, like "/usr/local/bin/makesite.sh --ownername clientsname". If there are parameters, substitute them in yourself in the code you write, rather than asking the user to substitute them in for you - that way the security risk is minimal.
The result of the "echo $urlnya" in makesite.sh {and all the other output of the shell_exec() command too} will be handed to you by shell_exec() as a chunk of text to do whatever you wish with. Your code can parse it, use bits of it in your own web page, track bits of it internally, extract some sort of unique ID, and so forth. You may for example wish to place a hyperlink to that URL on the web page you're producing behind a button labelled something like {See Created Web Page}.
For the progress bar, get a widget or library that provides the functionality (but see a couple of paragraphs further on). The ways to do it are a little weird, and cross-browser issues can be substantial, so a progress bar is one of those things where making use of somebody else's encapsulation and testing of the functionality is a really good idea. I believe a library is available from Yahoo!, and I believe jQuery includes the functionality.
The local browser/client will over and over manipulate the progress bar however it chooses for a few seconds, then "re-sync" with the server so it's displaying accurate information. One sometimes sees for example moving stripes; that movement is probably purely local and purely a "guess". But since the page will "re-sync" with the server in a few seconds to readjust its length (or even stop the stripes altogether if something has gone wrong), that should be sufficient.
Displaying a progress bar is only part of the problem though. The bigger part of the problem is what to display. Something on the system needs to be able to say things like "I'm 55% done". But how (or even if) makesite.sh does that I don't know. I don't know of any capabilities built in to Linux to help produce such information. You may need to run the command several times yourself to see how long it takes and what the milestones are, then create some tracker program of your own that checks for those milestones yourself. It may be more trouble than it's worth. You may wish to instead create something much simpler, say for example just some nearly-brain-dead text like "roughly estimated time to setup completion 2 more minutes" or "setup failed, please try again".

Collecting and Processing data with PHP (Twitter Streaming API)

After reading through all of the Twitter streaming API and Phirehose PHP documentation, I've come across something I have yet to do: collect and process data separately.
The logic behind it, if I understand correctly, is to prevent a logjam at the processing phase that would back up the collecting process. I've seen examples before, but they basically write straight to a MySQL database right after collection, which seems to go against what Twitter recommends you do.
What I'd like some advice/help on is the best way to handle this, and how. It seems that people recommend writing all the data directly to a text file and then parsing/processing it with a separate function. But with this method, I'd assume it could be a memory hog.
Here's the catch: it's all going to be running as a daemon/background process. So does anyone have any experience with solving a problem like this, or more specifically with the Twitter Phirehose library? Thanks!
Some notes:
The connection will be through a socket, so my guess is that the file will constantly be appended to? Not sure if anyone has any feedback on that.
The phirehose library comes with an example of how to do this. See:
Collect: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-collect.php
Consume: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-consume.php
This uses a flat file, which is very scalable and fast, ie: Your average hard disk can write sequentially at 40MB/s+ and scales linearly (ie: unlike a database, it doesn't slow down as it gets bigger).
You don't need any database functionality to consume a stream (ie: you just want the next tweet, there's no "querying" involved).
If you rotate the file fairly often, you will get near-realtime performance (if desired).
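That isn't the library's literal code, but the consume side boils down to something like this sketch (the directory and file-name pattern are placeholders; the linked examples handle locking and rotation more carefully):

<?php
// Sketch of the "consume" half: process queue files the collector has rotated out.
foreach (glob('/tmp/tweet-queue/*.queue') as $queueFile) {
    $fp = fopen($queueFile, 'r');
    if (!$fp || !flock($fp, LOCK_EX | LOCK_NB)) {
        continue;                              // the collector may still be writing this one
    }

    while (($line = fgets($fp)) !== false) {   // one JSON-encoded status per line
        $tweet = json_decode($line, true);
        if ($tweet !== null) {
            // ...do the slow processing here: DB inserts, analysis, etc...
        }
    }

    flock($fp, LOCK_UN);
    fclose($fp);
    unlink($queueFile);                        // finished with this chunk of the queue
}

Because the file is read one line at a time, memory use stays flat no matter how big a queue chunk gets, which addresses the memory-hog worry in the question.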

How would you protect a database of links from being scraped?

I have a large database of links, which are all sorted in specific ways and are attached to other information, which is valuable (to some people).
Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page.
That all works fine and dandy, but the site has been getting really popular (as well as being DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the number of times I have to hit PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent this, while still not relying on PHP to do the check before spitting out the link?
It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.
Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
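A rough sketch of that idea with the memcached extension (server address, key prefix, window, and limit are placeholders):

<?php
// In-memory request throttling, keeping the hot per-IP counters out of MySQL
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'req:' . $_SERVER['REMOTE_ADDR'];
$mc->add($key, 0, 300);                  // start a counter with a 5-minute expiry if missing
$requests = $mc->increment($key);

if ($requests !== false && $requests > 100) {
    header('Location: /captcha.php');    // the same captcha page the question already uses
    exit;
}
// ...otherwise fall through to the normal link lookup...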
As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).
You could do the ip throttling at the web server level. Maybe a module exists for your webserver, or as an example, using apache you can write your own rewritemap and have it consult a daemon program so you can do more complex things. Have the daemon program query a memory database. It will be fast.
Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.
Anything you do on the client side can't be protected. Why not just use AJAX?
Have an onClick event that calls an AJAX function that returns just the link and fills it into a DIV on your page. Because the request and the answer are small, it will work fast enough for what you need. Just make sure the function you call checks the timestamp; it is easy to make a script that calls that function many times to steal your links.
You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast; the client doesn't even know it's not pure JS.
Most scrapers just analyze static HTML so encode your links and then decode them dynamically in the client's web browser with JavaScript.
Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?
While there's nothing to stop a determined person from scraping publically available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some resemblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course (a rough GD sketch follows this answer).
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robot metatags. This would stop conforming spiders. This will prevent Google from indexing you for instance:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
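As an illustration of the encode-as-images item above, GD (bundled with most PHP builds) can render a single value to a PNG in a few lines; the font number, padding, and example value are arbitrary:

<?php
// value.php - sketch: render a sensitive value as an image instead of text
function valueAsImage($text)
{
    $width  = imagefontwidth(5) * strlen($text) + 10;
    $height = imagefontheight(5) + 10;

    $img = imagecreate($width, $height);
    imagecolorallocate($img, 255, 255, 255);    // first allocation = white background
    $ink = imagecolorallocate($img, 0, 0, 0);   // black text
    imagestring($img, 5, 5, 5, $text, $ink);

    header('Content-Type: image/png');
    imagepng($img);
    imagedestroy($img);
}

// e.g. the page emits <img src="value.php?id=123"> and this script looks the value up
valueAsImage('555-0172');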
If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.
There are a few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (i.e. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping (see the sketch below).
Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.
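A sketch of the time-limited hash check from the second point (the 10-second window comes from the answer; the file names and token storage are placeholders):

<?php
// page.php (excerpt) - when rendering the page shell, store a short-lived token in the session
session_start();
$_SESSION['ajax_token']    = bin2hex(random_bytes(16));
$_SESSION['ajax_token_at'] = time();
// ...embed $_SESSION['ajax_token'] in the page, e.g. in a data attribute the AJAX call reads...

<?php
// data.php (excerpt) - the AJAX endpoint refuses to return data unless the token is fresh
session_start();
$sent = isset($_GET['token']) ? $_GET['token'] : '';
$ok   = isset($_SESSION['ajax_token'])
     && hash_equals($_SESSION['ajax_token'], $sent)
     && (time() - $_SESSION['ajax_token_at']) <= 10;

if (!$ok) {
    http_response_code(403);    // token missing, wrong, or older than 10 seconds
    exit;
}
// ...otherwise fetch the records and echo them for the page to insert into the DOM...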
force a reCAPTCHA every 10 page loads for each unique IP
There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.
Take your hands away from the keyboard and ask your client why he wants the data to be visible but not able to be scraped.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.
There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.
What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for google IPs!
Normally, to screen-scrape a decent amount, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?
Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
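A minimal session-based sketch of that combination (the allowance, window, and cap are made-up numbers):

<?php
session_start();

$now = time();
if (!isset($_SESSION['window_start']) || $now - $_SESSION['window_start'] > 60) {
    $_SESSION['window_start'] = $now;            // start a new one-minute window
    $_SESSION['window_hits']  = 0;
}
$_SESSION['window_hits']++;

$over = $_SESSION['window_hits'] - 10;           // free allowance of 10 pages per window
if ($over > 0) {
    usleep((int) min(5000000, 100000 * pow(2, $over)));   // delay doubles per page, capped at 5s
}
if ($_SESSION['window_hits'] > 30 && empty($_SESSION['is_human'])) {
    header('Location: /captcha.php');            // captcha.php would set $_SESSION['is_human']
    exit;
}

A scraper that discards cookies gets a fresh session each time, so keying the counters by IP in something like memcached (as in the earlier question) is more robust.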
My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would just be to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
