Nginx scaling and bottleneck identification on an EC2 cluster - php

I am developing a large application and I have to load test it. It is an EC2-based cluster with one High-CPU Extra Large instance for the application, which runs PHP / Nginx.
This application reads data from a Redis server holding some 5k - 10k key-value pairs, builds the response, logs the data to a MongoDB server, and replies back to the client.
Whenever I send a request to the app server, it does all its computations in about 20 - 25 ms, which is awesome.
I am now trying to do some load testing, so I run a PHP-based app on my laptop that sends many thousands of requests to the server over 20 - 30 seconds. During this load period, whenever I open the app URL in the browser, it replies back with an execution time of around 25 - 35 ms, which is again fine, so I am sure that Redis and Mongo are not the bottleneck. But during the load it takes about 25 seconds to get a response back.
The High-CPU Extra Large instance has 8 GB of RAM and 8 cores.
Also, during the load test, the top command shows about 4 - 6 php_cgi processes consuming some 15 - 20% of CPU.
I have 50 nginx worker processes and 1024 worker connections.
What could be the issue causing the bottleneck ?
If this doesn't work out, I am seriously considering moving to a whole Java application with an embedded web server and an embedded cache.
UPDATE - increased PHP_FCGI_CHILDREN to 8 and it halved the response time during load.

50 worker processes is too many; you need only one worker process per CPU core. More worker processes cause extra inter-process context switching, which wastes a lot of time.
What you can do now:
1. Set worker processes to the minimum (one worker per CPU core, e.g. 4 worker processes if you have 4 CPU units), but worker connections to the maximum (10240, for example).
2. Tune the TCP stack via sysctl. You can hit stack limits if you have many connections.
3. Get statistics from the nginx stub_status module (you can use munin + nginx; it's easy to set up and gives you enough information about system status).
4. Check the nginx error.log and the system messages log for errors.
5. Tune nginx (decrease connection timeouts and maximum query size).
I hope that helps you.
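As a sketch of steps 1 and 3 above (the worker count and the status location are illustrative values for the 8-core instance from the question, not a tested configuration):

```nginx
# /etc/nginx/nginx.conf (fragment) -- illustrative values
worker_processes  8;               # one worker per core

events {
    worker_connections  10240;     # raise the per-worker connection cap
}

http {
    server {
        # expose basic connection statistics for munin or other monitoring
        location /nginx_status {
            stub_status on;
            allow 127.0.0.1;
            deny all;
        }
    }
}
```

For step 2, the sysctl keys usually involved are net.core.somaxconn, net.ipv4.tcp_max_syn_backlog, and net.ipv4.ip_local_port_range; good values depend on the workload.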

Related

PHP Cassandra random busy I/O threads, request timed out

I have a 3-node Cassandra v3.11.4 cluster. Replication factor = 3 and around 70 GB of data on each node.
Node hardware: m5.2xlarge (8 vCPU, 32 GB RAM, 500GB SSD)
Some YAML values:
num_tokens: 256
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
endpoint_snitch: SimpleSnitch
PHP connection from load balanced compute engines:
$cluster = Cassandra::cluster()
    ->withPort(PORT)
    ->withDefaultConsistency(Cassandra::CONSISTENCY_LOCAL_QUORUM)
    ->withContactPoints(HOST_VAL)
    ->withIOThreads(5)
    ->withCredentials(CASS_USER, CASS_PASS)
    ->build();
$session = $cluster->connect(KEYSPACE);
$statement = $session->prepare($query);
$stmt = $session->execute($statement, ['arguments' => $bindParams]);
The Cassandra service runs smoothly most of the time, but for 5-10 minutes every 5-6 hours the PHP operations start failing with errors:
Cassandra\Exception\RuntimeException: All connections on all I/O threads are busy
Cassandra\Exception\RuntimeException: All hosts in current policy attempted and were either unavailable or failed
Cassandra\Exception\TimeoutException: Request timed out
I am guessing the issue is with the PHP connections either stalling the Cassandra nodes or generating too many connections.
Please point me to where to look for possible causes, or tell me if there is any SHOW PROCESSLIST-like command to monitor current connections, as in MySQL.
Those errors indicate that the nodes are getting overloaded and become unresponsive leading to the TimeoutException -- the replicas don't respond to the coordinator within the request timeout.
When nodes are busy, the requests queue up waiting to be served. There comes a point when the queues reach the maximum size and any new requests from the client are no longer queued.
Check for long GCs which are indicative of nodes being overloaded. Correlate those times with the amount of read/write traffic from your application (you'll get these metrics from your app monitoring).
My guess is that your application doesn't have a smooth amount of traffic but instead is peaking every few hours. It is during app traffic peaks that the cluster gets overloaded. If this is the case, you need to size your cluster to cope with the peak traffic by adding more nodes. Cheers!
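On the monitoring question: Cassandra has no SHOW PROCESSLIST, but nodetool gives the closest server-side view. These are standard nodetool subcommands, run on each node (the log path below is the common default and may differ on your install):

```
nodetool tpstats       # thread pool stats: look for pending/blocked reads and writes
nodetool netstats      # connection and streaming activity per node
nodetool tablestats    # per-table latencies and SSTable counts
grep -i "GCInspector" /var/log/cassandra/system.log   # long GC pauses
```

Sustained non-zero "Pending" or "Blocked" counts in tpstats during the 5-10 minute windows would confirm the overload theory above.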

Mosquitto MQTT Disk Read Spike

I've got an issue with one of our MQTT brokers running Mosquitto: there is a consistent increase in disk usage starting at a seemingly random time, until it reaches an equilibrium of about 50 MB/s read (shown in the graph below).
This also causes a secondary issue with our PHP server (Laravel, PHP 7.4 running php-mqtt/client v1.1.0): once a connection issue occurs, PHP CPU spikes to 100% until the broker is restarted and then the PHP server is restarted (from testing it must be in that order, else the problem persists). MQTT messages are broadcast on all Laravel jobs as part of a queue monitoring system. Note that during these PHP issues, connections to the broker slowly increase; I assume this is just the broker holding them open but being too slow to process them.
At a guess it might be something like QoS >0 messages stacking up on the broker and causing excessive disk usage, though I can't confirm this and would expect high disk writes rather than reads. I've checked for network bottlenecks and other potential causes but have found nothing so far.
I'm currently at a loss as to what could cause this, as 99% of the time the broker operates normally and nothing we know of would trigger this consistent increase in disk I/O. The issue also occurred several weeks ago, but at a different time of day.
Broker is running on a GCE Instance with the following specs:
CPU: 2 vCPUs (E2 : Intel Broadwell)
Memory: 4 GB
Disk: 250GB (Max Throughput: 70MB/s)
OS: Container-Optimized OS, 89
Broker disk monitoring
Normal throughput is what is seen before 1:30am
Broker network traffic
Network traffic is within the norms for this time of day. Dropped between 5:30 and 6:30 due to us trying to fix the issue.
Broker CPU
Never seems to reach 100% but could be due to high frequency spikes that the monitoring is averaging.
PHP Server CPU
Note that the CPU does not spike until several hours after the broker shows signs of a problem.
Mosquitto Version
Official Docker Container: eclipse-mosquitto:1.6.14
Mosquitto Config
listener 1883 0.0.0.0
protocol mqtt
listener 9001 0.0.0.0
protocol websockets
allow_anonymous false
password_file /mosquitto/config/mosquitto.passwd
acl_file /mosquitto/config/mosquitto.acl
Normal Traffic
Messages: ~1000 per minute throughput on the broker
PHP server only publishes short-lived messages with the following config:
QoS: 1
Retain: 0
Timeout: 1
Dump of some $SYS topics as of now:
$SYS/broker/version: "mosquitto version 1.6.14"
$SYS/broker/uptime: "23529 seconds"
$SYS/broker/messages/stored: "93" // Remained stable since monitoring
$SYS/broker/messages/received: "895858"
$SYS/broker/messages/sent: "1405546"
$SYS/broker/retained messages/count: "93" // Remained stable since monitoring
$SYS/broker/clients/total: "122078"
$SYS/broker/clients/inactive: "121984"
$SYS/broker/clients/disconnected: "121984"
$SYS/broker/clients/active: "94"
$SYS/broker/clients/connected: "94"
$SYS/broker/clients/maximum: "122078"
$SYS/broker/subscriptions/count: "393"
I'm in the process of adding monitoring to these topics, to gather data if the issue occurs again.
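One hedged observation on the $SYS dump above: 122,078 total clients with only 94 connected means roughly 122k disconnected persistent sessions, for which the broker must queue QoS 1/2 messages; that queueing is one candidate for the disk activity. A sketch of the standard mosquitto.conf options that bound this (values below are illustrative, not recommendations for this broker):

```
# mosquitto.conf (fragment) -- illustrative limits on persistent sessions
persistent_client_expiration 1d   # expire persistent sessions disconnected for a day
max_queued_messages 100           # cap queued QoS 1/2 messages per client
```

If clients connect with random IDs and clean_session=false, the session count grows without bound; fixing the client connect flags would address the cause rather than the symptom.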

PHP-FPM performance tuning - bursts of traffic

I have a web application written in Laravel / PHP that is in the early stages and generally serves about 500 - 600 reqs/min. We use MariaDB and Redis for caching, and everything is on AWS.
For events we want to promote on our platform, we send out a push notification (mobile platform) to all users, which results in a roughly 2-minute traffic burst that takes us to 3.5k reqs/min.
At our current server scale, this completely bogs down the application servers' CPUs, which usually operate at around 10%. The databases and Redis clusters seem fine during this burst.
Looking at the logs, it seems all PHP-FPM worker pool processes get occupied and begin queuing up requests from the Nginx upstream.
We currently have:
three m4.large servers (2 cores, 8gb RAM each)
dynamic PHP-FPM process management, with a max of 120 child processes (servers) on each box
My questions:
1) Should we increase the FPM pool? It seems that, memory-wise, we're probably nearing our limit.
2) Should we decrease the FPM pool? It seems possible that we're spinning up so many processes that the CPU is getting bogged down and is unable to really complete any of them, so we might get better results with fewer.
3) Should we simply use larger boxes with more RAM and CPU, which will allow us to add more FPM workers?
4) Is there any FPM performance tuning that we should be considering? We use Opcache, however, should we switch to static process management for FPM to cut down on the overhead of processes spinning up and down?
There are too many child processes relative to the number of cores.
First, you need to know the server status at normal and burst time.
1) Check the number of php-fpm processes.
ps -ef | grep 'php-fpm: pool' | wc -l
2) Check the load average. With 2 cores, a load of 2 or more means that work is starting to queue up.
top
htop
glances
3) Depending on the service, we start to adjust from twice the number of cores.
; Example
;pm.max_children = 120 ; normal) pool 5, load 0.1 / burst) pool 120, load 5 **Bad**
;pm.max_children = 4 ; normal) pool 4, load 0.1 / burst) pool 4, load 1
pm.max_children = 8 ; normal) pool 6, load 0.1 / burst) pool 8, load 2 **Good**
load 2 = maximum performance on 2 cores
It is more accurate to test the web server with a load similar to the actual load, using Apache Bench (ab):
ab -c100 -n10000 http://example.com
Time taken for tests: 60.344 seconds
Requests per second: 165.72 [#/sec] (mean)
100% 880 (longest request)
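The sizing rule above (start from about twice the core count, bounded by how many children fit in RAM) can be sketched as a quick calculation. The 60 MB per-child size and 2 GB system reserve below are assumptions for illustration, not measurements from this setup:

```python
def fpm_max_children(total_ram_mb, reserved_mb, per_child_mb, cores, per_core_factor=2):
    """Suggest pm.max_children: the smaller of the RAM-based and CPU-based caps."""
    ram_cap = (total_ram_mb - reserved_mb) // per_child_mb  # children that fit in RAM
    cpu_cap = cores * per_core_factor                       # the answer's rule of thumb
    return int(min(ram_cap, cpu_cap))

# m4.large from the question: 8 GB RAM, 2 cores, assuming ~60 MB per child
print(fpm_max_children(8192, 2048, 60, 2))  # -> 4
```

On these boxes the CPU cap, not memory, is the binding limit, which supports option 2) in the question (a much smaller pool) over adding more children.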

PHP processes load and stuck taking all CPUs

I have a PHP script that runs a social network and similar features.
Normally, there isn't any problem, my server is a VPS with:
2.4 GHz CPU
4 Cores
8 GB of RAM
150GB SSD
CentOS 7.1 with cPanel.
The problem is that normally the server can sustain around 30 concurrent users at a CPU load of 30-40%. But sometimes, I don't know why, the load goes really high, to 98-100%, all the time. Even if users log out and there are just 3-4 people on the website, the server load remains at 98-100% until I restart the server.
So I noticed, using the top command via SSH, that a PHP process gets created with the owner of the webspace (created via cPanel) as the user and PHP as the command. The load for this process is from 20% to 27%.
More of these PHP processes get created as time passes.
For example, after 30 minutes there is another PHP process with the same characteristics as the first, and together they take 50-60% of the CPU load. As more time passes, more processes get created, up to a maximum of 4 such processes. (Is that because my CPU has 4 cores?)
If I kill these processes via kill [pid], within 1-2 minutes the server goes back to 3% even with 10-15 concurrent users.
What is the problem? Is it strictly PHP-file related, or something else? I even added events on the website to check WHAT these (apparently useless) PHP processes do when they start, because if I kill them, the website continues to work very well!
What could be the problem?
There is a screen of CPU usage:
Thank you all.
If a process is making a lot of I/O operations, like database calls, it can considerably increase the CPU load. In your case you are sure which process is causing the high load. Since the load increases over time, you should carefully look at the PHP script for memory leaks, piled-up sessions, and nested loops with I/O tucked in between, and try to isolate the reason. Good luck.
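Before killing one of those processes, it can be worth seeing what it is actually doing. These are generic Linux diagnostics, not specific to this application (substitute the PID from top):

```
strace -p <PID> -c     # summarize the syscalls the process is spending time in
lsof -p <PID>          # open files and sockets (database connections, log files)
mysql -e "SHOW FULL PROCESSLIST"   # if MySQL backs the site, check for stuck queries
```

A process spinning in a loop shows almost no syscalls under strace, while one stuck on I/O shows repeated read/write or poll calls; that distinction narrows down where to look in the PHP code.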

Apache Server Slots, Memory, KeepAlive, PHP and SSL - how to speed up

On a Debian web server (VPS) with a good CPU, 6 GB RAM, and a fast backbone Internet connection, I run a PHP application. PHP runs in "prefork" mode (incl. the APC opcode cache), because whenever you search for PHP and the MPM worker, there are abundant warnings regarding thread safety. The PHP application is quite large, so each server process requires about 20 to 30 MB of RAM. Sensitive data is processed by the application; therefore, all connections to the Apache server are SSL-encrypted.
Typically, the application shows no or few images (about 1-3 files incl. CSS and JS per request) and the users send a new request about every minute (30 sec. to 4 minutes, depending on the user).
Recently, this application faced a big storm of user requests (it was planned, not a DoS: about 2,500 concurrent users). While the CPU did fine (<50% use), my server quickly ran out of slots. The point is that, in prefork mode, each slot requires memory, and the 6 GB are just enough for a "MaxClients" of about 200 slots.
Problem 1: According to Apache server-status, most slots were occupied "..reading..", sometimes for 10 seconds and more, while PHP processing takes 0.1 to 2 seconds. Little data is sent by the users, so I guess this is actually the SSL handshake. This, of course, occupies lots of slots (I also enabled and configured mod_reqtimeout to drop very slow clients and, following http://unhandledexpression.com/2013/01/25/5-easy-tips-to-accelerate-ssl/, used SSLHonorCipherOrder to prefer faster ciphers; the SSLCertificateChainFile is also transmitted).
Problem 2: If I enable KeepAlive (only 1 or 2 seconds) to reduce the SSL overhead, slots are kept open and therefore occupied twice as long as the PHP processing would require.
Problem 3: If I actually wanted to serve 2,500 users and use KeepAlive to speed up SSL, I would require 2,500 slots. However, I won't have a machine with 32 GB RAM.
With enough users on the server to test its limits, I was stuck at about 110 requests per second, with about 50% CPU load on a quad-core system (max. 400%). Fewer req/sec if I (re-)enabled KeepAlive. 110 req/sec on a modern web server: this seems ridiculous! I cannot believe that this is actually all that Apache, PHP and SSL can perform.
Is there a major flaw in my thinking? Am I hitting a basic limitation of the prefork mode? Did I ignore the obvious? Is SSL actually such a performance-eater? Thanks for any hints!
I'm the author of that article about SSL performance. I don't think the handshake is responsible for the 8+ seconds on reads. You can get useful information by using http://www.webpagetest.org/ ; the handshake is done by the time a request is marked as "connected".
My guess would be that the slow processing of the PHP app with a lot of concurrent users can make some users wait a lot more.
Here are some ideas to get better performance:
I don't think the KeepAlive would be a good idea if each client does a request every minute.
You could enable SSL session tickets to reduce the handshake overhead.
MPM-Worker works fine for a lot of different setups, so I encourage you to try it out.
Caching will probably not help you if the clients receive a different response every time.
You should test PHP-FPM; that could speed up the PHP code.
Also, test APC to cache precompiled PHP code.
I don't know anything about the architecture of your app, but you could defer sending the results: get the data from the client, send an immediate answer ("processing data..." or something like that), process the data in a background process, then on the next request, send the calculated answer.
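A sketch of the Apache directives behind the session-resumption and KeepAlive suggestions above. The values are illustrative starting points, and the SSLSessionTickets directive requires Apache 2.4.11+ (on older versions, only the SSLSessionCache lines apply):

```
# Apache config fragment -- illustrative values
SSLSessionCache        shmcb:/var/run/apache2/ssl_scache(512000)
SSLSessionCacheTimeout 300     # resumed handshakes skip the expensive key exchange
SSLSessionTickets      on      # Apache 2.4.11+ only

KeepAlive              On
KeepAliveTimeout       2       # short, so idle clients do not hold slots open
MaxKeepAliveRequests   100
```

With resumption working, repeat visitors within the cache timeout pay only an abbreviated handshake, which directly targets the "..reading.." slots described in Problem 1.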
