Since about 2 months ago, I've experienced my website slows down (timeout problem), however I've made some checking on the server and my settings and behavior is this (LAMP, VPS 6GB Ram - 4 cpu cores, but I'm not expert on Linux or apache):
When it suddenly starts to hang, I've check the browser network behavior and I've found that images takes up to 24-30 seconds to load (small images from 4K to 180K), some of them fail to load. It also happens to .css files sometimes (10 seconds to load). During this period, only 1 core is used and RAM stays at 1.4GB tops. The server is hosting a website based on CMS (Joomla - SSL - gzip set).
Check browser network tab here
I have apache MPM as prefork with these settings:
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 500
StartServers 5
MinSpareServers 5
MaxSpareServers 10
ServerLimit 100
MaxClients 100
MaxRequestsPerChild 3000
I have mod_security enabled, but there isn't any suspicious behavior. I have also server-status enable, and I'm not sure but it doesn't look very loaded (most of process in K and W). The access log shows the usual behavior and no error logs found.
The Database is MariaDB, no hanged queries during this periods and nothing in slow query logs.
The thing is, even if I restart the apache service, the website still hangs. So I tried restarting the server (shutdown -r) and when the server and services are up again, it also hangs. Sometimes when I'm not monitoring the website, it comes back to normal after 20 minutes, but sometimes takes even 3 hours. The problem is that it's a production server and it's not always that happens. Sometimes it happens 2 days in a row, then after 3 or 4 days, sometimes happens twice in the same day.
Any idea what could be happening here? I'm out of clues right now. Thanks in advance
I suspect strongly its the server host. I once had the same exact problem and first thing I did was to get a copy of the website and ran it on wamp server on local machine. That way I was able to clear the confusion of whether its a server or CMS issue.
I posted the case on webmasters.stackexchange.com but no luck, they closed the question. As last resource, a couple of weeks ago I finally decided to move the entire website to a different server with the same characteristics, and boom!, problem solved. So bottom line, it seems the problem was a strange problem with the server itself.
Related
I have fastCGI with PHP
However, server runs fine for several hours and then suddenly I have php error log full of:
mod_fcgid: can't apply process slot for /var/www/php-bin/website/php
At the same time, there is no spike in activity on the web, CPUs are not spiking, all seems normal based on the server usage.
If I restart apache, all is running ok again for several hours and then situation repeats.
I have tried to set higher values to fcgid settings:
FcgidMaxProcesses 18000
FcgidMaxProcessesPerClass 3800
but the problem stil persist.
What is interesting, I also have the second server with totally different setup and SW versions (FastCGI as PHP module), but the same problem sometimes (not so frequently) occurs there as well.
So I am wondering, can this problem be caused by some PHP script? On both servers, there are some PHP libraries that are the same.
In generall, how to track down what is causing this? On debug server, this problem is non-existing and on production, I cannot blindly change settings and reset server over and over again.
we have nginx/1.6.2 running with php5-fpm (5.6) on a debian 8 system.
In the past days we got higher load than usual due to more users hitting our servers. With most visitors coming in the evening hours between 6pm and midnight.
Since a couple of days, two different servers runnning the above setup showed very slow response rates for several hours. In Munin, we saw, that there were suddenly hundreds of nginx connections in "writing" state were there were previously only about 20 at a time.
We do not get any errors other than timed out connections on remote hosts when trying to access those servers. All logs I saw were just normal.
The problem can be fixed with a restart of php5-fpm.
My question now is: why do suddenly hundreds of processes claim they are writing? Is there some known issue or maybe config setting we missed which could cause this?
Here is the complete list of symptoms we see:
Instead of < 20 very fast active connections /s we see up to 100 to 900 connections in writing state (all nginx connections hit php5-fpm, static content is not served by these servers) Avg. script runtime for the php scripts is 80ms.
Problem occurs only if total amount of nginx requests /s goes above 300 /s, It then drops from ~350 to ~250 req/s but these 250 show up to 900 "writing" connections
Many of these connections eventually time out and give no correct result
There are no errors in our logs
The eth / database traffic as well as CPU load correspond to the lower level of 250req/s to which the total drops, so there is no "writing" happening afaik.
For the setup:
as stated above. We use the build-in opcode cache of Zend, the APCu for some user variable cache, one of the servers runs a memcache instance (which works fine throughout the problem) and the other is running a Redis version, which also runs fine while the problem occurs.
Can anyone shed some light to what the problem might be?
Thanks!
We found the problem: APCu seems to be unstable with PHP 5.6.
Details:
debian 8
nginx/1.6.2
PHP 5.6.14-0+deb8u1
APCu 4.0.7 (Revision: 328290, 126M shm_size)
we used xhprof to profile requests when the server was slow (see question) and noticed, that APCu took > 100ms per read/write operation. Clearing the APCu variables did not help. All other parts of the code had normal speed.
We completely disabled our use of APCu and the system has been stable since.
So it seems, that this APCu version is unstable under load with PHP 5.6. At least for us.
We had the same problem, and the reason for that was that the data in Redis was more than the "maxmemory" so redis was unable to write any more data. I could login with redis-cli but couldn't set a value, if you are having this issue, you could login to redis using redis-cli and try to set something, if the redis memory is full you'll get an error.
I am running a PHP application (Laravel and MySQL) on a Ubuntu VPS with nginx and php5-fpm installed (both with default settings). I soon experienced some totally random 502 errors, apparently due to php5-fpm which timed out and lost connection to nginx every now and then.
I was desperately looking for a solution on SO and any other resource I could find, but the error persisted: The webserver didn't respond about 40 times over 2 days, with a "downtime" of about 2 mins each. I changed the workers in php5-fpm, the maximum execution time... nothing. The server only showed very low CPU and RAM usage.
I eventually killed the VPS and set up a new one from scratch - with the same result. But instead of showing 502 errors, the request simply takes about 40 secs of constant loading without any content or error displayed. And about 2 mins later, once I hit reload the page loads instantly.
The only thing left I could think of was changing php5-fpm. What I did. I tried using hhvm. But again the same result of constant loading.
I seriously don't know what to do anymore... did anyone of you run into the same problem before?
Cheers
With the help of slow logs I found the issue, it was an external service (GeoJSON request) that randomly slowed down the page and therefore caused the error.
My Drupal 6 site has been running smoothly for years but recently has experienced intermittent periods of extreme slowness (10-60 sec page loads). Several hours of slowness followed by hours of normal (4-6 sec) page loads. The page always loads with no error, just sometimes takes forever.
My setup:
Windows Server 2003
Apache/2.2.15 (Win32) Jrun/4.0
PHP 5
MySql 5.1
Drupal 6
ColdFusion 9
Vmware virtual environment
DMZ behind a corporate firewall
Traffic: 1-3 hits/sec peak
Troubleshooting
No applicable errors in apache error log
No errors in drupal event log
Drupal devel module shows 242 queries in 366.23 milliseconds,page execution time 2069.62 ms. (So it looks like queries and php scripts are not the problem)
NO unusually high CPU, memory, or disk IO
Cold fusion apps, and other static pages outside of drupal also load slow
webpagetest.org test shows very high time-to-first-byte
The problem seems to be with Apache responding to requests, but previously I've only seen this behavior under 100% cpu load. Judging solely by resource monitoring, it looks as though very little is going on.
Here is the kicker - roughly half of the site's access comes from our LAN, but if I disable the firewall rule and block access from outside of our network, internal (LAN) access (1000+ devices) is speedy. But as soon as outside access is restored the site is crippled.
Apache config? Crawlers/bots? Attackers? I'm at the end of my rope, where should I be looking to determine where the problem lies?
------Edit:-----
Attached is a waterfall chart from webpagetest.org showing a 15 second load time. I've seen times as high as several minutes. And again, the server runs fine much of the time. The green areas indicate that the browser has sent a request and is waiting to recieve the first byte of data back from the server. This is certainly a back-end delay, but it is puzzling that the CPU is barely used during this slowness.
(Not enough rep to post an image, see https://webmasters.stackexchange.com/questions/54658/apache-very-high-page-load-time
------Edit------
On the Apache side of things - Is this possibly a ThreadsPerChild issue?
After much research, I may have found the solution. If I'm correct, it was an apache config problem. Specifically, the "ThreadsPerChild" directive. See... http://httpd.apache.org/docs/2.2/platform/windows.html
Because Apache for Windows is multithreaded, it does not use a
separate process for each request, as Apache can on Unix. Instead
there are usually only two Apache processes running: a parent process,
and a child which handles the requests. Within the child process each
request is handled by a separate thread.
ThreadsPerChild: This directive is new. It tells the server how many
threads it should use. This is the maximum number of connections the
server can handle at once, so be sure to set this number high enough
for your site if you get a lot of hits. The recommended default is
ThreadsPerChild 150, but this must be adjusted to reflect the greatest
anticipated number of simultaneous connections to accept.
Turns out, this directive was not set at all in my config and thus defaulted to 64. I confirmed this by viewing the number of threads for the second httpd.exe process in task manager. When the server was hitting more than 64 connections, the excess requests were simply having to wait for a thread to open up. I added ThreadsPerChild 150 in my httpd.conf.
Additionally, I enabled the apache status module
http://httpd.apache.org/docs/2.2/mod/mod_status.html
...which, among other things, allows one to see the total number of active request on the server at any given moment. Right away, I could see spikes of up to 80 active request. Time will tell, but I'm confident that this will resolve my issue. So far, 30 hours without a hiccup.
Apache is too bulk and clumsy for "1-3 hits/sec avg".
Once I have similar problem with much lighter (almost static-html, no DB) site, and similar hits/second.
No errors, no high network/CPU/memory/disk loads. Apache on WinXP.
I inserted nginx before Apache for static files and it started working like a charm.
Caching. The solution it caching.
Drupal (in common with most other large CMS platforms) has a tendency toward this kind of thing due to its nature -- every page is built on the fly, constructed from a whole stack of database tables and code modules. The more you've got in there, the slower it will be, but even fairly simple pages can become horribly slow if your site gets a bit of traffic.
Drupal has a page cache mechanism built-in which will cut your load dramatically. As long as your pages are static (ie no dynamic content) then you can simply switch on caching and watch the performance go right back up.
If you have dynamic content, you can still enable caching for the static parts of the page. It is a bit more complex (and beyond the scope of this answer), but it is worth the effort.
If that's still not enough, a server-based caching solution such as Varnish will definitely help.
We try to implement long-polling based notification service in our company's ERP. Similar to Facebook notifications.
Technologies used:
PHP with timeout set to 60 seconds and 1 second sleep in each iteration of loop.
jQuery for AJAX handling.
Apache as web server.
After nearly month of coding, we went to production. Few minutes after deployment we had to rollback everything. It turned out that our server (8 cores) couldn't handle long requests from 20 employees, using ~5 browser tabs each.
For example: User opened 3 tabs with our ERP, with one long-polling AJAX on each tab. Opening 4th tab is impossible - it hangs until one of previous 3 is killed (and therefore AJAX is stopped).
'Apache limitations', we thought. So we went googling. I found some info about Apache's MPM modules and configs, so I gave it a try. Our server use prefork MPM, as apachectl -l shown us. So I changed few lines in config to look something like this:
<IfModule mpm_prefork_module>
StartServers 1
MinSpareServers 16
MaxSpareServers 32
ServerLimit 50%
MaxClients 150
MaxClients 50%
MaxRequestsPerChild 0
</IfModule>
Funny thing is, it works on my local machine with similar config. On server, it looks like Apache ignores config, because with MinSpareServers set to 16, it lauches 8 after restart. Whe have no idea what to do.
Passerby in first comment of previous post gave me good direction to check out if we hit max browser connections to one server.
As it turns out, each browser has those limit and you can't change them (as far as I know).
We made a workaround to make it work.
Let's assume that I was getting AJAX data from
http://domain.com/ajax
To avoid hitting max browser connections, each long-polling AJAX connects to random subdomain, like:
http://31289.domain.com/ajax
http://43289.domain.com/ajax
and so on. There's a wildcard on a DNS server pointing from *.domain.com to domain.com, and subdomain is unique random number, generated by JS on each tab.
For more information, check out this thread.
There's been also some problems with AJAX Same Origin Security, but we managed to work it out, using appropriate headers on both JS and PHP sides.
If you want to know more about headers, check it out here on StackOverflow, and here on Mozilla Developer's page. Thanks!
I have successfully implemented a LAMP setup with long polling. Two things to keep in mind, the php internal execution clock for linux is not altered or incremented by the 'usleep()' function. Therefore, setting the maximum execution time would only be needed for rare edge cases where obtaining the data takes longer than normal, or possibly for a windows setup. In addition, with long polling bare in mind that once you go over 20+ seconds, you are vulnerable to having browser timeouts occur.
Secondly, you will need to verify that your sessions aren't locking up (if sessions are being used).
Apache really shouldn't have any issue with what you are looking to do. Though, I will admit that webservers like nginx or an ajax-specific webserver may handle the concurrent connections better. If you could post your code for the ajax handler, we might be able to figure out where the problem is.
Utilizing subdomains, or as other threads have suggested -- multiple webservers on separate ports, remember that you may encounter JavaScript domain security issues.
I say, don't change apache config until you encounter an issue and have exhausted all other options; be careful with the PHP sessions, and make sure AJAX is waiting for a response, before sending another request ;)