Nginx php-fpm clogs up with writing connections under high load

We have nginx/1.6.2 running with php5-fpm (5.6) on a Debian 8 system.
Over the past few days we have had higher load than usual because more users are hitting our servers, with most visitors coming in the evening between 6pm and midnight.
For the past couple of days, two different servers running the above setup have shown very slow response times for several hours at a time. In Munin we saw that there were suddenly hundreds of nginx connections in the "writing" state, where previously there were only about 20 at a time.
We do not get any errors other than timed-out connections on remote hosts trying to access those servers. All the logs I looked at were normal.
The problem can be fixed with a restart of php5-fpm.
My question now is: why do hundreds of connections suddenly report that they are writing? Is there a known issue, or a config setting we missed, that could cause this?
Here is the complete list of symptoms we see:
Instead of fewer than 20 very fast active connections at a time, we see 100 to 900 connections in the writing state (all nginx connections hit php5-fpm; static content is not served by these servers). The average runtime of the PHP scripts is 80 ms.
The problem occurs only when the total rate of nginx requests goes above 300 req/s. It then drops from ~350 to ~250 req/s, but those 250 req/s show up to 900 "writing" connections.
Many of these connections eventually time out and return no valid result
There are no errors in our logs
The eth / database traffic as well as CPU load correspond to the lower level of 250req/s to which the total drops, so there is no "writing" happening afaik.
For the setup:
As stated above. We use Zend's built-in opcode cache and APCu for some user variable caching. One of the servers runs a memcache instance (which works fine throughout the problem) and the other runs Redis, which also runs fine while the problem occurs.
Can anyone shed some light on what the problem might be?
Thanks!

We found the problem: APCu seems to be unstable with PHP 5.6.
Details:
debian 8
nginx/1.6.2
PHP 5.6.14-0+deb8u1
APCu 4.0.7 (Revision: 328290, 126M shm_size)
We used xhprof to profile requests while the server was slow (see question) and noticed that APCu took more than 100 ms per read/write operation. Clearing the APCu variables did not help. All other parts of the code ran at normal speed.
We completely disabled our use of APCu and the system has been stable since.
So it seems that this APCu version is unstable under load with PHP 5.6, at least for us.
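
The numbers in the question are roughly what Little's law (requests in flight = arrival rate x time per request) predicts once each request becomes slow. The sketch below is only arithmetic on the figures quoted in the question; the function names are made up for illustration. It also assumes that nginx counts a connection as "writing" for the whole time it waits on php-fpm, which is how the stub_status counter generally behaves but is not something measured here.

```python
def in_flight(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average requests in progress = arrival rate x time in system."""
    return requests_per_second * seconds_per_request


def implied_latency(in_flight_requests: float, requests_per_second: float) -> float:
    """Little's law rearranged: average time per request = in flight / throughput."""
    return in_flight_requests / requests_per_second


# Healthy state from the question: ~300 req/s at ~80 ms per script.
normal = in_flight(300, 0.080)

# Degraded state from the question: up to 900 "writing" connections at ~250 req/s.
degraded = implied_latency(900, 250)

print(f"normal: about {normal:.0f} requests in flight")
print(f"degraded: about {degraded:.1f} s per request on average")
```

So roughly 20 active connections is what 80 ms scripts should produce at that rate, and 900 connections at 250 req/s means requests are taking seconds rather than milliseconds. A few APCu operations at more than 100 ms each, with requests queuing behind busy php-fpm workers, would be enough to get there.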

We had the same problem. In our case the cause was that the data in Redis had grown beyond the "maxmemory" limit, so Redis was unable to write any more data. I could log in with redis-cli but couldn't set a value. If you are having this issue, log in to Redis using redis-cli and try to set something; if the Redis memory is full, you'll get an error.
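
A rough way to automate that check is to compare `used_memory` against `maxmemory` in the text printed by `redis-cli INFO memory`. This is a minimal sketch: it assumes the output contains both fields (older Redis versions may not report `maxmemory` there), the sample text is invented for illustration, and the function names are hypothetical.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from redis-cli INFO output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def memory_is_full(info_text: str) -> bool:
    """True when a maxmemory limit is set and used_memory has reached it."""
    fields = parse_info(info_text)
    used = int(fields.get("used_memory", 0))
    limit = int(fields.get("maxmemory", 0))
    # maxmemory:0 means no limit is configured, so memory can never be "full".
    return limit > 0 and used >= limit


# Invented sample; in practice, paste the output of `redis-cli INFO memory`.
SAMPLE = """# Memory
used_memory:1073741900
maxmemory:1073741824
maxmemory_policy:noeviction
"""

print(memory_is_full(SAMPLE))
```

Whether writes actually fail at the limit depends on the eviction policy. With `noeviction`, write commands are rejected once the limit is reached, which matches the behaviour described above.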

Related

webpage request with 10 pictures and some text freezes on nginx and php-fpm and disconnects other services?

Is it possible for a page request that triggers a problem in the server or in PHP to freeze the machine and even disconnect unrelated SSH sessions?
I am running a simple webpage (10 pictures and some text) on a dockerized environment with separate reverse proxy, a web server, a database (nginx, php-fpm and postgresql).
The whole system was up without a restart for a year or so, without problems. For about a month now I have had a new issue with page/system freezes. When I visit my webpage it locks up from time to time (sometimes opening 1 instance is enough, other times I need to open up to 20) and needs about 30 seconds to start reacting again. The strange thing is that if I am connected to the server over SSH in parallel, it sometimes (not always) also disconnects my terminal. That is why I believed it has something to do with the system (but I can't find anything there, so I am trying a different perspective here).
server (only remote access available):
Debian GNU/Linux 9.4 (stretch)
Kernel: 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux
68GB RAM, 8 cores, 2x4 TB HDDs and 1TB SSD
1 GBit-Uplink
I have monitoring installed and there does not seem to be any high workload on the IOs, network, CPU, or other during the lock up (I am not monitoring php stats though). I also have the same setup running on a local test server (different hardware and Kernel 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux) and that server has no freezing issues, so again an argument against the issue being with the dockerized environment or my page code.
I have done so far on the hardware side:
1.) SMART diagnostics - without any obvious issues (the "backup disk (not the one the servers are saved on)" has for some time: 191 G-Sense_Error_Rate 0x0032 001 001 000 , but the provider ran a separate test some time ago and said that the disk has no issue, and that the G-Sense_Error_Rate has little informational value anyhow)
2.) atop ( htop and iotop are live and SSH disconnects, thus I can't watch it as the problem occurs) over a 1s interval and 300 samples (thus 5mins), where I was able to produce multiple freezes, but there were no obvious load issues (granted this is the first time I am looking at those things! - but there was also no high level line coloring that atop does automatically)
3.) I have also a dockerized monitoring stack running (the freeze occurs with it running and with it being disabled, so it should not come from here either) where I can view the dockers separately and they also do not show anything alarming
4.) restarted the whole server - issue continues
5.) memtester-d 55 of 65 RAM without issues
6.) no problems in syslog
7.) ping the server, while producing the error and the ping is quick with 27ms, but when the server hangs, I lose 1 ping in about 10 (in those 30-40s, then ping is perfect again). But I cannot figure out, why that is
Where else could I look?
Any suggestions are highly appreciated!
Thanks!
Strange that this has only started to happen within the last few months and was fine previously.
Are you pulling down the latest images for nginx, postgres, etc.? Maybe it's a problem with the version of the images, and you could try pinning a specific release.

Server getting slower over time

I am using a dedicated server for my PHP application. The server slows down day by day; after a reboot everything goes back to normal. I cache my JSON results as files and serve them to clients. When everything is normal the response time is about 50 ms, but when the server slows down the response time goes up to 17 seconds or more.
This issue affects the whole server; I can't even log in with SSH when this happens.
I don't have enough knowledge about servers.
How can I track this problem?
System Up for 6 Days now, and slowing down started again -
Here are my results
# lsof | wc -l
34255
# free
              total        used        free      shared  buff/cache   available
Mem:       32641048     1826832     6598216      232780    24216000    29805868
Swap:      16760828           0    16760828
My server has 32 GB RAM and an 8-core CPU, and runs CentOS 7.
I run a Laravel application with 500 unique users daily.
I restarted the MySQL, httpd and nginx services and cleared the memory cache; nothing changed. Only a server reboot helps.
Static files are served normally, but files served by the PHP application and other HTTP responses are very slow and getting slower day by day.
Logging in with SSH is getting slower too. I use Plesk as the control panel and it is also getting slower.
In other words, this problem affects not only my application but the whole server.

nginx & php5-fpm respond extremely slow with Laravel application

I am running a PHP application (Laravel and MySQL) on a Ubuntu VPS with nginx and php5-fpm installed (both with default settings). I soon experienced some totally random 502 errors, apparently due to php5-fpm which timed out and lost connection to nginx every now and then.
I was desperately looking for a solution on SO and any other resource I could find, but the error persisted: The webserver didn't respond about 40 times over 2 days, with a "downtime" of about 2 mins each. I changed the workers in php5-fpm, the maximum execution time... nothing. The server only showed very low CPU and RAM usage.
I eventually killed the VPS and set up a new one from scratch - with the same result. But instead of showing 502 errors, the request simply takes about 40 secs of constant loading without any content or error displayed. And about 2 mins later, once I hit reload the page loads instantly.
The only thing left I could think of was replacing php5-fpm, which I did: I tried using hhvm. But again the same result of constant loading.
I seriously don't know what to do anymore... did anyone of you run into the same problem before?
Cheers
With the help of slow logs I found the issue, it was an external service (GeoJSON request) that randomly slowed down the page and therefore caused the error.

Process running php.ini Using Up Excessive Memory In CentOS

Bear with me here.
I'm seeing some php.ini processes (?) running or processes touching php.ini that are using up to 80% or more of the CPU and i have no idea what would cause this. All database processing is offloaded on a separate VPS and the whole service is supported by a CDN. I've provided a screenshot of "top -cM"
Setup:
MediaTemple DV level 2 application server (the server we are looking at in the images), 8 cores, 2GB RAM
Mediatemple VE level 1 database server
Cloudflare CDN
CentOS 6.5
NginX
MySQL 5.4, etc.
EDIT
I'm seeing about 120K pageviews a day here, with a substantial number of concurrent connections
Where do i start looking to find what is causing this?
Thanks in advance

Apache VERY high page load time

My Drupal 6 site has been running smoothly for years but recently has experienced intermittent periods of extreme slowness (10-60 sec page loads). Several hours of slowness followed by hours of normal (4-6 sec) page loads. The page always loads with no error, just sometimes takes forever.
My setup:
Windows Server 2003
Apache/2.2.15 (Win32) Jrun/4.0
PHP 5
MySql 5.1
Drupal 6
ColdFusion 9
Vmware virtual environment
DMZ behind a corporate firewall
Traffic: 1-3 hits/sec peak
Troubleshooting
No applicable errors in apache error log
No errors in drupal event log
Drupal devel module shows 242 queries in 366.23 milliseconds,page execution time 2069.62 ms. (So it looks like queries and php scripts are not the problem)
NO unusually high CPU, memory, or disk IO
Cold fusion apps, and other static pages outside of drupal also load slow
webpagetest.org test shows very high time-to-first-byte
The problem seems to be with Apache responding to requests, but previously I've only seen this behavior under 100% cpu load. Judging solely by resource monitoring, it looks as though very little is going on.
Here is the kicker - roughly half of the site's access comes from our LAN, but if I disable the firewall rule and block access from outside of our network, internal (LAN) access (1000+ devices) is speedy. But as soon as outside access is restored the site is crippled.
Apache config? Crawlers/bots? Attackers? I'm at the end of my rope, where should I be looking to determine where the problem lies?
------Edit:-----
Attached is a waterfall chart from webpagetest.org showing a 15 second load time. I've seen times as high as several minutes. And again, the server runs fine much of the time. The green areas indicate that the browser has sent a request and is waiting to receive the first byte of data back from the server. This is certainly a back-end delay, but it is puzzling that the CPU is barely used during this slowness.
(Not enough rep to post an image, see https://webmasters.stackexchange.com/questions/54658/apache-very-high-page-load-time)
------Edit------
On the Apache side of things - Is this possibly a ThreadsPerChild issue?
After much research, I may have found the solution. If I'm correct, it was an apache config problem. Specifically, the "ThreadsPerChild" directive. See... http://httpd.apache.org/docs/2.2/platform/windows.html
Because Apache for Windows is multithreaded, it does not use a
separate process for each request, as Apache can on Unix. Instead
there are usually only two Apache processes running: a parent process,
and a child which handles the requests. Within the child process each
request is handled by a separate thread.
ThreadsPerChild: This directive is new. It tells the server how many
threads it should use. This is the maximum number of connections the
server can handle at once, so be sure to set this number high enough
for your site if you get a lot of hits. The recommended default is
ThreadsPerChild 150, but this must be adjusted to reflect the greatest
anticipated number of simultaneous connections to accept.
Turns out, this directive was not set at all in my config and thus defaulted to 64. I confirmed this by viewing the number of threads for the second httpd.exe process in task manager. When the server was hitting more than 64 connections, the excess requests were simply having to wait for a thread to open up. I added ThreadsPerChild 150 in my httpd.conf.
Additionally, I enabled the apache status module
http://httpd.apache.org/docs/2.2/mod/mod_status.html
...which, among other things, allows one to see the total number of active requests on the server at any given moment. Right away, I could see spikes of up to 80 active requests. Time will tell, but I'm confident that this will resolve my issue. So far, 30 hours without a hiccup.
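
For anyone who wants to watch this over time, the machine-readable status page (`server-status?auto`) can be parsed with a few lines of script. The sketch below assumes the page reports a `BusyWorkers:` line, as mod_status does on Apache 2.2. The sample text is invented to match the spike of 80 described above, and the function names are hypothetical.

```python
def busy_workers(status_text: str) -> int:
    """Extract the BusyWorkers count from mod_status '?auto' output."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "BusyWorkers":
            return int(value.strip())
    raise ValueError("no BusyWorkers line found")


def would_wait(concurrent_requests: int, threads_per_child: int) -> int:
    """Requests that cannot get a thread immediately under a given ThreadsPerChild."""
    return max(0, concurrent_requests - threads_per_child)


# Invented sample resembling server-status?auto output during a spike.
SAMPLE = """BusyWorkers: 80
IdleWorkers: 70
"""

busy = busy_workers(SAMPLE)
print("waiting with ThreadsPerChild 64: ", would_wait(busy, 64))
print("waiting with ThreadsPerChild 150:", would_wait(busy, 150))
```

With the old default of 64 threads, a spike of 80 simultaneous requests leaves 16 of them waiting for a free thread; with 150 threads none wait.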
Apache is too bulky and clumsy for "1-3 hits/sec avg".
I once had a similar problem with a much lighter site (almost static HTML, no DB) and similar hits per second.
No errors, no high network/CPU/memory/disk loads. Apache on WinXP.
I put nginx in front of Apache for static files and it started working like a charm.
Caching. The solution is caching.
Drupal (in common with most other large CMS platforms) has a tendency toward this kind of thing due to its nature -- every page is built on the fly, constructed from a whole stack of database tables and code modules. The more you've got in there, the slower it will be, but even fairly simple pages can become horribly slow if your site gets a bit of traffic.
Drupal has a page cache mechanism built-in which will cut your load dramatically. As long as your pages are static (ie no dynamic content) then you can simply switch on caching and watch the performance go right back up.
If you have dynamic content, you can still enable caching for the static parts of the page. It is a bit more complex (and beyond the scope of this answer), but it is worth the effort.
If that's still not enough, a server-based caching solution such as Varnish will definitely help.
