delaying script to slow leechers - php

I am developing an image bank site that will hold royalty-free images for download. I want to slow down anyone using a bot or who is downloading too often, so I have a daily file limit and have incorporated a variable sleep into the script that delivers the files. I do that by writing the completion time of the last download to a database, then checking the elapsed time when the next download begins. If that is less than N seconds, I delay the download by M seconds, doubling M on successive infractions. That works fine until the script hits the server's execution time limit.
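Roughly, the delivery script does this (a simplified sketch only; the table, column, and helper names are placeholders, not my real code):

$pdo    = new PDO('mysql:host=localhost;dbname=imagebank', 'user', 'pass');
$userId = currentUserId();                    // however the downloader is identified

$stmt = $pdo->prepare('SELECT last_download_at, penalty FROM download_log WHERE user_id = ?');
$stmt->execute(array($userId));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

$minGap  = 10;                                // the "N seconds" allowed between downloads
$elapsed = time() - (int) $row['last_download_at'];

if ($elapsed < $minGap) {
    sleep((int) $row['penalty']);             // this sleep counts toward max_execution_time
    $pdo->prepare('UPDATE download_log SET penalty = penalty * 2 WHERE user_id = ?')
        ->execute(array($userId));
}

// ...stream the file, then record the completion time...
$pdo->prepare('UPDATE download_log SET last_download_at = ? WHERE user_id = ?')
    ->execute(array(time(), $userId));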
My hosting company confirms that sleep time counts towards execution time.
Am I being over-cautious at the development stage?
Any suggestions about how to detect and slow down users who are abusing the site without using php sleep?

I don't think you're being over-cautious, but I do think that this is a bad way to be cautious. If sleep time counts toward execution time, aren't you paying for that? It probably also counts toward CPU usage and a bunch of other cost factors too. Additionally, slowly choking off service doesn't give your user any indication that they are doing something wrong, it just makes your service seem slow.
You'd probably be better off serving a friendly message-image letting the person know what's going on so they can modify their behavior (this is particularly good given that some people might trigger it by accident while performing completely innocent activities). If you end up serving your message-image more than five or ten times, then it's definitely a script, so just stop answering their requests entirely.
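A rough sketch of that flow (the infraction counter helper and the image path are made up):

// Sketch: serve an explanatory image for the first few violations, then stop responding.
$infractions = getInfractionCount($userId);   // hypothetical helper backed by your database

if ($infractions > 10) {
    http_response_code(429);                  // "Too Many Requests" - stop serving entirely
    exit;
}

if ($infractions > 0) {
    header('Content-Type: image/jpeg');
    readfile('/var/www/static/slow-down-notice.jpg');   // hypothetical notice image
    exit;
}

// ...otherwise serve the requested file normally...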

Why don't you simply make the user aware of what he/she is doing "wrong" and display an error?
This way, the user will know what is going on and might decide to correct the behavior. With random delays, I would suspect something was wrong with your server and might just look for a competing offering that works more reliably.

Use a div with a time counter and implement the delay mechanism in JavaScript (example: www.rapidshare.com). If sleep time is counted as execution time, you have a pretty high chance of crossing the execution-time limit.

If any one delay is much longer than the script execution timeout, you might want to block that user entirely for some period of time (24 hours?).
How are you deciding exactly who is aggressively downloading? The IP address is not 100% reliable, as you might have a number of people behind NAT that all appear to come from the same IP address.

Related

Prevent PHP script using up all resources while it runs?

I have a daily cron job which takes about 5 minutes to run (it does some data gathering and then various database updates). It works fine, but the problem is that, during those 5 minutes, the site is completely unresponsive to any requests, HTTP or otherwise.
It would appear that the cron job script takes up all the resources while it runs. I couldn't find anything in the PHP docs to help me out here - how can I make the script know to only use up, say, 50% of available resources? I'd much rather have it run for 10 minutes and have the site available to users during that time, than have it run for 5 minutes and have user complaints about downtime every single day.
I'm sure I could come up with a way to configure the server itself to make this happen, but I would much prefer if there was a built-in approach in PHP to resolving this issue. Is there?
Alternatively, as plan B, we could redirect all user requests to a static downtime page while the script is running (as opposed to what's happening now, which is the page loading indefinitely or eventually timing out).
A normal script can't hog 100% of resources; resources get split across processes. It could slow everything down intensely, but not lock up all resources (without doing some funky stuff). You could get a hint by running top in your command line and seeing which process takes up the most.
That leads me to conclude that something is locking all further processes. As Arkascha comments, there is a fair chance that your database is getting locked. This answer explains which table type you should use; if you do not have it set to InnoDB, you probably want that, at least for the tables that get locked.
It could also be disk I/O if you write huge files. Try to split the work into smaller reads/writes, or move some of the info (e.g. files containing lists) into your database (assuming it has room to spare).
It could also be CPU. To fix that, you need to make your code more efficient. Recheck your code, see if you do heavy operations, and try to make those smaller. Normally you want code to be as fast as possible; now you want it to be as lightweight as possible, which changes the way you write it.
If it still locks up, it's time to debug. Turn off a large part of your code and check whether the locking still happens. Continue turning code back on until you notice the locking again, then fix that part. Try to figure out what is costing you so much; only a few scripts require intense resources, so now is the time to optimize. One option might be splitting the job into two (or more) steps: run one cron that prepares/sanitizes the data and another that processes it. These don't have to run synchronously; there can be a few minutes between them.
If that is not an option, benchmark your code and improve it as much as you can. If you have a heavy query, it might improve if you select only IDs in the heavy query and use a second query just to fetch the data. If you can, use your database to filter, sort, and manage data; don't do that in PHP.
One thing I have also implemented is a sleep every N actions.
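Something along these lines, where the batch size of 500 is arbitrary and processRecord() stands in for the real per-record work:

$i = 0;
foreach ($records as $record) {
    processRecord($record);      // placeholder for the real work done per record
    if (++$i % 500 === 0) {
        sleep(1);                // pause briefly every 500 records
    }
}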
If your script really is that extreme, another solution could be moving it to a time when few or no visitors are on your site. Even if you remove the bottleneck, nobody likes a slow website.
And there is always the option of increasing your hardware.
You don't mention which resource is your bottleneck: CPU, memory, or disk I/O.
However, if it is CPU or memory, you can do something like this in your script:
http://php.net/manual/en/function.sys-getloadavg.php
http://php.net/manual/en/function.memory-get-usage.php
$yourlimit = 100000000;                      // ~100 MB memory ceiling
$load = sys_getloadavg();                    // [1 min, 5 min, 15 min] load averages
if ($load[0] > 0.80 || memory_get_usage() > $yourlimit) {
    sleep(5);                                // back off while the machine is busy
}
Another thing to try would be to set your process priority in your script.
Lowering the priority (a positive nice value) does not normally require superuser rights; only raising it does, so this should be fine for a cron job.
http://php.net/manual/en/function.proc-nice.php
proc_nice(50); // a positive increment lowers the priority; the resulting nice value is capped at 19 on most systems
I did a quick test of both and they work like a charm. Thanks for asking; I have a cron job like that as well and will implement this. It looks like proc_nice alone will do fine.
My test code:
proc_nice(50);
$yourlimit = 100000000;
$x = 0;
while (1) {
    $x = $x + 1;
    $load = sys_getloadavg();
    if ($load[0] > 0.80 || memory_get_usage() > $yourlimit) {
        sleep(5);
    }
    echo $x . "\n";
}
It really depends on your environment.
If you are on a Unix base, there are built-in tools to limit the CPU/priority of a given process.
You can limit the whole server or PHP alone, which is probably not what you are looking for.
What you can do first is separate your task into its own process.
There is popen() for that, but I found it much easier to make the process an executable script. Let's name it hugetask for the example.
#!/usr/bin/php
<?php
// Huge task here
Then to call from the command line (or cron):
nice -n 15 ./hugetask
This adjusts the scheduling: it lowers the priority of the task relative to others, and the system will do the job.
You can as well call it from your php directly:
exec("nice -n 15 ./hugetask &");
Usage: nice [OPTION] [COMMAND [ARG]...] Run COMMAND with an adjusted
niceness, which affects process scheduling. With no COMMAND, print the
current niceness. Niceness values range from
-20 (most favorable to the process) to 19 (least favorable to the process).
To create a cpu limit, see the tool cpulimit which has more options.
That said, usually I just put some usleep() calls in my scripts to slow them down and avoid creating a funnel of data. This is fine if your script uses loops. If you slow your task down so it runs in, say, 30 minutes, there won't be many issues.
See also proc_nice http://php.net/manual/en/function.proc-nice.php
proc_nice() changes the priority of the current process by the amount
specified in increment. A positive increment will lower the priority
of the current process, whereas a negative increment will raise the
priority.
And sys_getloadavg() can also help. It returns an array of the system load over the last 1, 5, and 15 minutes.
It can be used as a test condition before launching the huge task.
Or you can log the average to find the best time of day to launch the huge task. The results can be surprising!
print_r(sys_getloadavg());
http://php.net/manual/en/function.sys-getloadavg.php
You could try to delay execution using sleep(). Just make your script pause between the various updates to your database.
sleep(60); // stop execution for 60 seconds
Whether this helps depends a lot on the kind of processing your script does; it may or may not be useful in your case. It's worth a try, so you could:
Split your queries
Do the updates in steps with a sleep in between
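For example (the chunk size and the update helper are placeholders):

// Run the updates in chunks, pausing between chunks so other queries get a turn.
foreach (array_chunk($userIds, 1000) as $chunk) {
    runUpdatesFor($chunk);       // hypothetical helper doing the actual updates
    sleep(5);                    // give the database a chance to serve other requests
}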
References
Using sleep for cron process
I could not describe it better than the quote in the above answer:
Maybe you're walking the database of 9,000,000 book titles and updating about 10% of them. That process has to run in the middle of the day, but there are so many updates to be done that running your batch program drags the database server down to a crawl for other users.
So modify the batch process to submit, say, 1000 updates, then sleep for 5 seconds to give the database server a chance to finish processing any requests from other users that have backed up.
Sleep and server resources
sleep resources depend on OS
adding sleep to alleviate server resources
To minimize your memory usage, you should probably process heavy and lengthy operations in batches. If you query the database using an ORM like Doctrine, you can easily use its existing batch-processing functions:
http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/batch-processing.html
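Roughly following the pattern from those docs (the entity and method names here are invented; newer Doctrine versions use toIterable() instead of iterate()):

$batchSize = 100;
$i = 0;
$query = $entityManager->createQuery('SELECT u FROM Entities\User u');

foreach ($query->iterate() as $row) {
    $user = $row[0];
    $user->setLastChecked(new \DateTime());   // made-up upkeep action
    if ((++$i % $batchSize) === 0) {
        $entityManager->flush();              // push pending changes to the database
        $entityManager->clear();              // detach processed entities to free memory
    }
}
$entityManager->flush();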
It's hard to tell what exactly the issue may be without having a look at your code (cron script). But to confirm that the issue is caused by the cron job you can run the script manually and check website responsiveness. If you notice the site being down when running the cron job then we would have to have a look at your script in order to come up with a solution.
Many loops in your cron script might consume a lot of CPU resources.
To prevent that and reduce CPU usage simply put some delays in your script, for example:
while ($long_time_condition) {
    // Do something here
    usleep(100000);   // pause for 0.1 seconds on each iteration
}
Basically, you are giving the processor some time to do something else.
Also, you can use the proc_nice() function to change the process priority, for example proc_nice(20); // very low priority. Look at this question.
If you want to find the bottlenecks in your code you can try to use Xdebug profiler.
Just set it up in your dev environment, start the cron manually, and then profile any page. You can also profile the cron script itself with php -d xdebug.profiler_enable=On script.php; look at this question.
If you suspect that the database is your bottleneck, then import a reasonably large dataset (or the entire database) into your local database and repeat the steps, logging and inspecting all the queries.
Alternatively, if possible, set up Xdebug on a staging server that is as close as possible to production and profile the page during cron execution.

Time syncing vs high latency

I'm doing an auction script and time syncing between visitors and the server is necessary (when will the auction end). Every time a user bids, auction end time is extended for a few seconds. My problem is that several users are complaining about their timers skipping (some seconds) and figured out that it is because of a high latency connection.
My current algorithm has a javascript function that runs every second, getting time left for the auction through ajax requests. Is there a better way to approach this, especially for high latency users, to prevent the timer skipping problem?
Adaptive intervals
First of all, I would suggest that you decrease the amount of polling. I don't know about your server implementation, but the current setup will create a lot of requests once you have a couple of users.
I would suggest that you adjust the polling interval depending on how much time is left. If there are two hours left until the end of an auction, we might not really care if the additional seconds are only fetched from the server every minute, right? You could do it like this
pollingInterval = secondsLeft / 100
The interval is shorter and the result is more accurate towards the end of the auction.
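On the server side that could look roughly like this (the endpoint name, JSON shape, and getAuctionEndTime() helper are all assumptions); the client would read nextPollMs and schedule its next request accordingly:

// time_left.php - hypothetical endpoint returning the seconds remaining plus a
// suggested polling interval that shrinks as the auction end approaches.
$secondsLeft = max(0, getAuctionEndTime() - time());   // getAuctionEndTime(): assumed DB lookup

// Poll roughly 100 times over the remaining window, clamped between 1 s and 60 s.
$nextPollMs = (int) min(60000, max(1000, ($secondsLeft / 100) * 1000));

header('Content-Type: application/json');
echo json_encode(array('secondsLeft' => $secondsLeft, 'nextPollMs' => $nextPollMs));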
Server Sent Events
For the last minute or so, when you want a high accuracy, regular polling at short intervals is not the best solution, as discussed in the comments. Long polling is an option, but you should also look into HTML5 Server Sent Events, which is like a native browser implementation of long polling. There's a good introduction and comparison to Websockets. Browser support is already pretty good, there's a polyfill for unsupported browsers which falls back to...polling.
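On the PHP side, an SSE endpoint is just a long-running script that keeps the response open. A sketch only; getSecondsLeft() is a made-up helper, and output buffering or FastCGI settings may interfere with flushing:

// sse.php - hypothetical endpoint pushing the remaining time once per second.
header('Content-Type: text/event-stream');
header('Cache-Control: no-cache');

while (true) {
    $secondsLeft = getSecondsLeft();          // assumed helper (e.g. a DB lookup)
    echo 'data: ' . json_encode(array('secondsLeft' => $secondsLeft)) . "\n\n";
    @ob_flush();
    flush();
    if ($secondsLeft <= 0 || connection_aborted()) {
        break;
    }
    sleep(1);
}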
Have you looked into long polling? You could use a jQuery/JavaScript countdown clock and then just change the countdown time whenever a new bid is placed. That should cut your AJAX calls drastically.
javascript function that runs every second
This is the old way to do what you want.
I think you need to use web-sockets to ensure real-time delivery for all users.
If you want to save time, you can use one of the available WebSocket servers instead of building one yourself.
I prefer Real-Time Pusher
It's easy, and you can use it for free with a limited number of users; you can upgrade for more.
www.pusher.com
It also has good API documentation to help you implement what you want quickly and easily.
For any help with Pusher or WebSockets, feel free to ask.

Should I be using message queuing for this?

I have a PHP application that currently has 5k users and will keep increasing for the foreseeable future. Once a week I run a script that:
fetches all the users from the database
loops through the users, and performs some upkeep for each one (this includes adding new DB records)
The last time this script ran, it only processed 1,400 users before dying due to a 30-second maximum execution time error. One solution I thought of was to have the main script still fetch all the users, but instead of performing the upkeep process itself, it would make an asynchronous cURL call (one for each user) to a new script that will perform the upkeep for that particular user.
My concern here is that 5k+ cURL calls could bring down the server. Is this something that could be remedied by using a messaging queue instead of cURL calls? I have no experience using one, but from what I've read it seems like this might help. If so, which message queuing system would you recommend?
Some background info:
this is a Symfony project, using Doctrine as my ORM and MySQL as my DB
the server is a Windows machine, and I'm using Windows' task scheduler and wget to run this script automatically once per week.
Any advice and help is greatly appreciated.
If it's possible, I would make a scheduled task (cron job) that would run more often and use LIMIT 100 (or some other number) to process a limited number of users at a time.
A few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but more than 30 seconds would be a start.
Track Upkeep against Users
Maybe add a field to each user, last_check, and have that field set to the date/time of the last successful upkeep action performed against that user.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set up a cron job and process, say, 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes and forms a strong foundation for any further maintenance tasks you might be looking at down the track.
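A minimal sketch pulling those ideas together, assuming a users table with a last_check column (the names and the upkeep helper are placeholders):

// Pick the 100 users that have gone longest without upkeep, process them,
// and stamp last_check so the next run picks up the next batch.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$ids = $pdo->query('SELECT id FROM users ORDER BY last_check ASC LIMIT 100')
           ->fetchAll(PDO::FETCH_COLUMN);

foreach ($ids as $userId) {
    performUpkeep($userId);                   // hypothetical helper with the real logic
    $pdo->prepare('UPDATE users SET last_check = NOW() WHERE id = ?')
        ->execute(array($userId));
}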
Why not still use the cURL idea, but instead of processing only one user per call, send a batch of users to each call by splitting them into groups of 1,000 or so?
Have you considered changing your logic to commit changes as you process each user? It sounds like you may be running a single transaction to process all users, which may not be necessary.
How about just increasing the execution time limit of PHP?
Also, looking into whether you can make your upkeep procedure faster can help. Depending on what exactly you are doing, you could also spread it out a bit: do a few users at a time rather than everyone at once. But that depends on what you're doing, of course.

Cron job for big data

I'm working on a social network like FriendFeed. When a user adds his feed links, I use a cron job to parse each user's feed. Is this feasible with a large number of users, like parsing 10,000 links each hour, or will that cause problems? If it isn't possible, what do FriendFeed or RSS readers use to do that?
You might consider adding some information about your hardware to your question, this makes a big difference for someone looking to advise you on how easily your implementation will scale.
If you end up parsing millions of links, one big cron job is going to become problematic. I am assuming you are doing the following (if not, you probably should):
Realizing when users subscribe to the same feed, to avoid fetching it twice.
When fetching a new feed, check for the existence of a site map that tells you how often the feed is likely to change, re-visit that value on a sensible interval
Checking system load and memory usage to know when to 'back off' and go to sleep for a while.
This reduces the amount of sweat that an hourly cron would produce.
If you are harvesting millions of feeds, you'll probably want to distribute that work, something you might want to keep in mind while you're still designing your database.
Again, please update your question with details on the hardware you are using and how big your solution needs to scale. Nothing scales 'infinitely', so please be realistic :)
I don't have quite enough information to judge whether this design is good or not, but to answer the basic question: unless you are doing some very intensive processing on the 10k feeds, that should be trivial for an hourly cron job to handle.
More information on how you process the feeds, and in particular how the process scales with respect to number of users who have feeds and number of feeds per user, would be useful in giving you further advice.
Your limiting factor will be the network access to these 10,000 feeds. You could process the feeds serially and likely do 10,000 in an hour (you'd need to average about 350ms latency).
Of course you'd want to have more than one process doing the work simultaneously to speed things up.
Whatever solution you select, if you meet success (which I hope you do), you will have performance issues.
As the founder of FriendFeed has said many times, the only way to pick the best solution is to profile/measure. With numbers, the choice will be obvious.
So: build a test architecture close to your expected (i.e. realistic) situation a few months out, and profile/measure.
You might want to consider checking out IronWorker for big data jobs like this. It's made for it and since it's a service you don't need to deal with servers or scale. It has scheduling built in so you would schedule a worker task to run each hour and that task can then queue up 10,000 other jobs and run them all in parallel.

Processing many rss/xml feeds in a cron file without overloading server

I have a cron that for the time being runs once every 20 minutes, but ultimately will run once a minute. This cron will process potentially hundreds of functions that grab an XML file remotely, and process it and perform its tasks. Problem is, due to speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server [c] overlapping and not completing its task for that minute before it runs again (would that error out?)
Unfortunately, caching isn't an option, as the data changes in near real time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the processing script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
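A rough sketch of that split (the paths and helper names are invented):

// fetch.php - runs on its own schedule and writes each feed to a timestamped file.
foreach (getFeedUrls() as $i => $url) {       // assumed helper returning the feed URLs
    $xml = @file_get_contents($url);
    if ($xml !== false) {
        file_put_contents('/var/spool/feeds/feed-' . $i . '-' . time() . '.xml', $xml);
    }
}

// process.php - runs independently and works through whatever files are waiting.
foreach (glob('/var/spool/feeds/feed-*.xml') as $file) {
    processFeedFile($file);                   // assumed helper containing the parsing logic
    unlink($file);                            // or move it to an archive directory
}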
Have a stack that you keep all the jobs on, and a handful of threads whose job it is to:
Pop a job off the stack
Check whether you need to refresh the XML file at all (check ETags, expiry headers, etc.)
Grab the XML if need be (this is the bit that could take time, hence spreading the load over threads). This should time out if it takes too long and flag the fact that it did to someone, as you might have a site that's down, a dodgy RSS generator, or whatever.
Then process it.
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.).
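For the ETag part, a conditional GET with cURL might look roughly like this (loading and saving the previous ETag is assumed to happen elsewhere):

// Sketch: only download the feed body if it changed since the last fetch.
$ch = curl_init($feedUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);        // don't let one slow site hold everything up
if (!empty($previousEtag)) {                  // assumed to be loaded from storage
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('If-None-Match: ' . $previousEtag));
}

$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code === 304) {
    // Not modified - nothing to do for this feed this round.
} elseif ($code === 200) {
    // New content: process $body and store the new ETag
    // (capturing response headers, e.g. via CURLOPT_HEADERFUNCTION, is omitted here).
}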
One tip: don't expect any of the feeds to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with those servers and creates a lot of work for yours. Do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing to ensure you are not unnecessarily grabbing feeds before they change. <ttl> holds the update period, so if a feed has <ttl>60</ttl> it should only be updated every 60 minutes.
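Reading it with SimpleXML after a successful fetch, and using it to skip the feed next time, could look roughly like this (the two storage helpers are made up, and 60 minutes is assumed when <ttl> is missing):

// After fetching: remember when we fetched and what <ttl> the feed declared.
$rss        = simplexml_load_string($xmlBody);
$ttlMinutes = isset($rss->channel->ttl) ? (int) $rss->channel->ttl : 60;
saveFeedMeta($feedUrl, time(), $ttlMinutes);  // hypothetical helper persisting the metadata

// On the next run, before fetching:
list($lastFetchedAt, $ttlMinutes) = loadFeedMeta($feedUrl);   // hypothetical helper
if (time() - $lastFetchedAt < $ttlMinutes * 60) {
    return;                                   // the feed says it won't have changed yet
}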
