We are starting to have a big problem with our site. Some of our users are using auto-refreshers and macro programs to take advantage of certain parts of our site, and now it's beginning to take some serious effect. Our site lags most of the day due to this and we need to find out which of our users are performing these tasks so that we can punish them directly. We are using PHP with this project.
I can use any help with this problem. The site lags so badly at times, it's difficult to keep it running.
Parse your web server daemon's access log and calculate the interval between requests for each visitor IP. If they are very regular (i.e. every five seconds +/- 0.25 seconds), flag them.
It is very difficult to stop this type of behaviour if your users are determined, but you can curb the problem. Firstly look for automated user behaviour in your logs, refreshes/actions/requests within a fixed intervals, often these patterns are very obvious because a human could not behave in such a manner, due to speed or activity period.
Use caching or forward proxy like Squid or Varnish.
Caching the costly parts of page generation will make the site run faster. You don't need to display real time information?
Add cache headers (e.g. "Cache-Control: public,max-age=60") and set up a forward proxy like Squid or Varnish. This will make the site run faster most of the time without you having to add caching or optimize your code.
I'd say the real problem is not your users, but your code. The two methods above will help you deal with the situation in the short run. For long-term solution, you should refactor and optimize your code. Reloaders are invisible for properly designed sites. They're so rare that you can't be having many clients.
Here's a simple way to test that your site is up to bar with "reloaders". Open your site in a browser like Firefox. Press and hold F5 for a minute (or less). Release F5; if the site shows up immediately, you've fine; if you need to reboot your server to make it responsive again, you're vulnerable to reload DOS. If you can't handle multiple concurrent requests, your site can be taken down by anyone, not just hardcore users with reload applications.
Related
If I have a loop with a lot of curl executions happening, will that slow down the server that is running that process? I realize that when this process runs, and I open a new tab to access some other page on the website, it doesn't load until this curl process that's happening finishes, is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
$content = curl_exec($ch);
... do random stuff...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this might change things a bit but I actually want this process to run using WordPress cron. If this is running as a WordPress "cron", would it hinder the page performance of the WordPress site? So in essence, if the process is running, and people try to access the site, will they be lagged up?
The curl requests are not asynchronous so using curl like that, any code after that loop will have to wait to execute until after the curl requests have each finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
Not likely unless you use a strictly 1-thread server for development. Different requests are eg in Apache handled by workers (which depending on your exact setup can be either threads or separate processes) and all these workers run independently.
The effect you're seeing is caused by your browser and not by the server. It is suggested in rfc 2616 that a client only opens a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy.
btw, the standard usage of capitalized keywords like here SHOULD and SHOULD NOT is explained in rfc 2119
and that's what eg Firefox and probably other browsers also use as their defaults. By opening more tabs you quickly exhaust these parallel open channels, and that's what causes the wait.
EDIT: but after reading #earl3s 'reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing curl requests. But at the cost of creating more than one simultaneous link to the system(s) you're querying... And that's where rfc2616's recommendation comes back into play: unless the backend systems delivering the content are under your control you should think twice before paralleling your curl requests, as each page hit on your system will hit the backend system with n simultaneous hits...
EDIT2: to answer OP's clarification: no (for the same reason I explained in the first paragraph - the "cron" job will be running in another worker than those serving your users), and if you don't overdo it, ie, don't go wild on parallel threads, you can even mildly parallelize the outgoing requests. But the latter more to be a good neighbour than because of fear to met down your own server.
I just tested it and it looks like the multi curl process running on WP's "cron" made no noticeable negative impact on the site's performance. I was able to load multiple other pages with no terrible lag on the site while the site was running the multi curl process. So looks like it's okay. And I also made sure that there is locking so that this process doesn't get scheduled multiple times. And besides, this process will only run once a day in U.S. low-peak hours. Thanks.
My hosting says that apache connections limit is 30. I don't whether its enough or not for an average site with 100 visitors per day. I want to know what are the things I should adapt for this limit while coding the site. Mostly I 'll use php sessions and little ajax . I want to know if there any precautions and recommended practices (if any) to avoid hitting this limit.
Thank you.
Since you will be using AJAX, I can't stress this enough...Do not long poll with Apache! It will hold your connections open and effectively perform a DOS(Denial of Service) on your own site.
Other than that, minimize the time it takes between when Apache receives a request to when it outputs and closes. The big blinking neon sign here is to use caching. Whether it is file based caching or something like Memcached or APC, this can drastically reduce the time Apache holds a connection open.
Taken by itself, the statement "apache connections limit is 30" doesn't actually mean much -- Apache configuration can be fairly involved and there are a lot of numbers/parameters. But if we assume that what this really means is 'MaxClients is 30', then what you need to know is that you have a limit of 30 simultaneous connections. However, connection 31 isn't rejected -- it should just be queued until there's a thread available to respond to the request. There's a lot of specifics according to the config, etc, but I doubt you need to worry much.
This means there are 30 possible concurrent connections possible, if you have 100 visitors per day, it's very unlikely to have about a third at the same time.
As you are growing with your site I'd recommend you another server/hoster.
But as if you don't make long running persistent connections and high frequent AJAX call all the time, this should be enough.
Connection limit is most probably simultaneous requests. So if you're only at the development stage, that is fine. But as for once it has launched, that is a different story. If your expected traffic is only about 100 visitors a day, then you will most probably be fine. I would however recommend to change your VPS host if it is anything over that, as if the server is turning away visitors, then it is not good for business.
But in all honesty you're better off developing locally for now to save your bandwidth for actual visitors, as from your description you don't seem to be using anything that requires a live site.
As all my requests goes through an index script, I tried to time the respond time of all my requests.
Its simply the difference between the start time (start of the script) and end time (end of the script).
As I cache my data on memcached and user are all served using memcached.
I mostly get less than a second respond time but at times there's wierd spike of more than a seconds. the worse case can go up to 200+ seconds.
I was wondering if mobile users had a slow connection, does that reflect on my respond time?
I am serving primary mobile users.
Thanks!
No, it's the runtime of your script. It does not count the latency to the user, that's something the underlying web server is worrying about. Something in your script just takes very long. I recommend you profile your script to find what that is. Xdebug is a good way to do so.
If you're measuring in PHP (which it sounds like you are), thats the time it takes for the page to be generated on the server side, not the time it takes to be downloaded.
Drop timers in throughout the page, and try and break it down to a section that is causing the huge delay of 200+ seconds.
You could even add a small script that will email you details of how long each section took to load if it doesn't happen often enough to see it yourself.
It could be that the script cannot finish because a client downloads the results very-very slowly. If you don't use a front-end server like nginx, the first thing to do is to try it.
Someone already mentioned xdebug, but normally you would not want to run xdebug in production. I would suggest using xhprof to profile pages on development/staging/production. You can turn on xhprof conditionally, which makes it really easy to run on production.
Is there a standard solution to scale up a website which runs on PHP + Apache web server ?
As in I get a traffic of about 100,000 requests/day as of now. 6 months down the line I expect it to grow to 200,000 requests/day. The first cut solution which comes to my mind is deploying more Apache web servers with mod_php, but something seems so wrong about it.
Any ideas ?
Try these two options first before adding new servers. They may allow you to stick with one server, but your results may vary.
For speeding the site up when you are hit with many concurrent users, look into installing the APC PECL extension (http://us2.php.net/manual/en/book.apc.php). APC will allow you to cache the compiled version of your scripts, saving the step of the PHP interpreter running each time a script is executed.
Also, if you are experiencing heavy load on the database server, look into installing memcached and caching database results for a certain time period, if possible (http://us2.php.net/manual/en/book.memcache.php).
Finally, if you do decide to get a separate server, look into possibly getting a dedicated SQL box. This, of course, assumes that your application is a database heavy application, as web apps are these days. Segregating SQL into a separate box allows it to take advantage of all of the resources on that box, with more cache and processing power. It could be the way to go.
i don't have any experience with scaling realy large websites, but i don't think you'll need so scale to different servers in this case. i have a browsergame with 40.000-60.000 requests per day, some cronjobs doing a lot of stuff every 5 minutes and a teamspeak-server on a small server (40 $ / month) and havn't got any performance problems till now.
20.000 requests / day is only one every fifth second, sounds like one box should be able to deal with that just fine? If not I'd first have a look at bottlenecks in your code. Redundant database calls? Double-looping database calls rather than simple joins? Are you caching anything?
How to scale after this is totally dependent on your application, how/where do you keep session state and so forth, general advice has limited applicability.
if you like it then you should have put a cache on it
That question may appear strange.
But every time I made PHP projects in the past, I encountered this sort of bad experience:
Scripts cancel running after 10 seconds. This results in very bad database inconsistencies (bad example for an deleting loop: User is about to delete an photo album. Album object gets deleted from database, and then half way down of deleting the photos the script gets killed right where it is, and 10.000 photos are left with no reference).
It's not transaction-safe. I've never found a way to do something securely, to ensure it's done. If script gets killed, it gets killed. Right in the middle of a loop. It gets just killed. That never happened on tomcat with java. Java runs and runs and runs, if it takes long.
Lot's of newsletter-scripts try to come around that problem by splitting the job up into a lot of packages, i.e. sending 100 at a time, then relading the page (oh man, really stupid), doing the next one, and so on. Most often something hangs or script will take longer than 10 seconds, and your platform is crippled up.
But then, I hear that very big projects use PHP like studivz (the german facebook clone, actually the biggest german website). So there is a tiny light of hope that this bad behavior just comes from unprofessional hosting companies who just kill php scripts because their servers are so bad. What's the truth about this? Can it be configured in such a way, that scripts never get killed because they take a little longer?
Is PHP suitable for very large projects?
Whenever I see a question like that, I get a bit uneasy. What does very large mean? What may be large to you, may be small to me or vice versa. And that is even assuming that we use the same metric. Are you measuring time to build the project, complete life-cycle of the project, money that are involved, number of people using it, number of developers to build/maintain it, etc. etc.
That said, the problems you're describing sounds like you don't know your technology good enough. That would be a problem for you regardless of which technology you picked. For example, use database transactions to ensure atomicity. And use asynchronous offline jobs to process long running tasks (Such as dispatching a mailing list).
A lot if the bad behaviour is covered in good frameworks like the Zend Framework.
Anything that takes longer the 10 seconds is really messed up but you can always raise the execution time with http://de3.php.net/set_time_limit
A lot of big sites are writen in PHP: Facebook, Wikipedia, StudiVZ, Digg.com etc.. a lot of the things you are talking about are just configuration things maybe you should look into that?
Are you looking for set_time_limit() and ignore_user_abort()?
Performance is not a feature you can just throw in after most of the site is done.
You have to design the site for heavy load.
If a database task is normally involving 10K rows, you should be prepared not just the execution time issues, but other maintenance questions.
Worst case: make a consistency tool to check and fix those errors.
Better: instead of phisically delete the images, just flag them and let background services to take care of the expensive maneuvers.
Best: you can utilize a job queue service and add this job to the queue.
If you do need to do transactions in php, you can just do:
mysql_query("BEGIN");
/// do your queries here
mysql_query("COMMIT");
The commit command will just complete the transaction.
If any errors occur, you can just rollback with:
mysql_query("ROLLBACK");
Edit: Note this will only work if you are using a database that supports transactions, such as InnoDB
You can configure how much time is allowed for executing a script, either in the php.ini setting or via ini_set/set_time_limit
Instead of studivz (the German Facebook clone), you could look at the actual Facebook which is entirely PHP. Or Digg. Or many Yahoo sites. Or many, many others.
ignore_user_abort is probably what you're looking for, but you could also add another layer in terms of scheduled maintenance jobs. They basically run on a specified interval and do various things to make sure your data/filesystem are in a state that you want... deleting old/unlinked files is just one of many things you can do.
For these large loops like deleting photo albums or sending 1000's of emails your looking for ignore_user_abort and set_time_limit.
Something like this:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
for(i=0;i<10000;i++)
costly_very_important_operation();
Be carefull however that this could potentially run the script forever:
ignore_user_abort(true); //users leaves webpage will not kill script
set_time_limit(0); //script can take as long as it wants
while(true)
do_something();
That script will never die, unless you restart your server.
Therefore it is best to never set the time_limit the 0.
Technically no programming language is transaction safe, it's the database that needs to be transaction safe. So if the script/code running dies or disconnects, for whatever reason, the transaction will be rolled back.
Putting queries in a loop is a very bad idea unless it is specifically design to be running in batches and breaking a much larger set into smaller pieces. Adjusting PHP timers and limits is generally a stop gap solution, you are still dependent on the client browser if using the web to kick off a script.
If I have a long process that needs to be kicked off by a browser, I "disconnect" the process from the browser and web server so control is returned to the user while the script runs. PHP scripts run from the command line can run for hours if you want. You can then use AJAX, or reload the page, to check on the progress of the long running script.
There are security concern with this code, but to "disconnect" a process from PHP running under something like Apache:
exec("nohup /usr/bin/php -f /path/to/script.php > /dev/null 2>&1 &");
But that really has nothing to do with PHP being suitable for large projects or being transaction safe. PHP can be used for large projects, but since by default there is no code that remains "resident" between hits, it can get slow if not designed right. Also, since there is no namespace support, you want to plan ahead if you have a large development team.
It's fine for a Java based system to take a few minutes to startup, initialize and load all the default objects. But this is unacceptable with PHP. PHP will take more planning for larger systems. The question is, when does the time saved in using PHP get wasted by the additional planning time required for a large system?
The reason you most likely experienced bad database consistencies in the past is because you were using the MyISAM engine for mysql (which DOES NOT support transactions). Use InnoDB instead, it supports transactions and performs row level locking.
Or use postgreSQL.
Many, many software sites are made in PHP. However, you will not hear about millions of web pages made in PHP that do not exist anymore because they were abandoned. Those pages may have burned all company money for dealing with PHP mess, or maybe they bankrupted because their soft was so crappy that customer did not want it… PHP seems good at the startup, but it does not scale very well. Yes, there are many huge web sites made in PHP, but they are rather exceptions, than a norm.