I have a 3-node Cassandra 3.11.4 cluster with replication factor 3 and around 70 GB of data on each node.
Node hardware: m5.2xlarge (8 vCPU, 32 GB RAM, 500GB SSD)
Some YAML values:
num_tokens: 256
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
endpoint_snitch: SimpleSnitch
PHP connection from load-balanced compute instances:
$cluster = Cassandra::cluster()
    ->withPort(PORT)->withDefaultConsistency(Cassandra::CONSISTENCY_LOCAL_QUORUM)
    ->withContactPoints(HOST_VAL)->withIOThreads(5)
    ->withCredentials(CASS_USER, CASS_PASS)->build();
$session = $cluster->connect(KEYSPACE);
$statement = $session->prepare($query);
$stmt = $session->execute($statement, ['arguments' => $bindParams]);
The Cassandra service runs smoothly most of the time, but for 5-10 minutes every 5-6 hours the PHP operations start failing with errors such as:
Cassandra\Exception\RuntimeException: All connections on all I/O threads are busy
Cassandra\Exception\RuntimeException: All hosts in current policy attempted and were either unavailable or failed
Cassandra\Exception\TimeoutException: Request timed out
My guess is that the PHP connections are either stalling the Cassandra nodes or creating too many connections.
Where should I look for possible causes, and is there a SHOW PROCESSLIST-like command to monitor current connections, as in MySQL?
Those errors indicate that the nodes are getting overloaded and becoming unresponsive, which leads to the TimeoutException: the replicas don't respond to the coordinator within the request timeout.
When nodes are busy, requests queue up waiting to be served. At some point the queues reach their maximum size and new requests from the client are no longer accepted.
Check for long GC pauses, which are indicative of overloaded nodes. Correlate those times with the amount of read/write traffic from your application (you'll get these metrics from your app monitoring).
My guess is that your application traffic isn't smooth but peaks every few hours, and it is during those traffic peaks that the cluster gets overloaded. If that is the case, you need to size your cluster to cope with the peak traffic by adding more nodes. Cheers!
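Not a substitute for sizing the cluster, but for reference the driver's builder also exposes knobs that affect how quickly the client runs out of in-flight request slots when a node slows down. A rough sketch, assuming a recent DataStax PHP driver; the values are illustrative only, and the constants are the same placeholders as in the question:

<?php
// Illustrative client-side settings; PORT, HOST_VAL, CASS_USER, CASS_PASS are
// the same placeholders used in the question, and the numbers are examples.
$cluster = Cassandra::cluster()
    ->withContactPoints(HOST_VAL)
    ->withPort(PORT)
    ->withCredentials(CASS_USER, CASS_PASS)
    ->withDefaultConsistency(Cassandra::CONSISTENCY_LOCAL_QUORUM)
    ->withIOThreads(8)               // more I/O threads than the default
    ->withConnectionsPerHost(2, 8)   // core/max connections per host (per I/O thread)
    ->withConnectTimeout(5.0)        // seconds
    ->withRequestTimeout(12.0)       // fail fast instead of letting requests pile up
    ->withTokenAwareRouting(true)    // send requests straight to a replica
    ->build();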
Related
I've got an issue with one of our MQTT brokers running Mosquitto: at a seemingly random time, disk I/O starts climbing steadily until it reaches an equilibrium of about 50 MB/s of reads (shown in the graph below).
This also causes a secondary issue with our PHP server (Laravel, PHP 7.4, running php-mqtt/client v1.1.0): once a connection issue occurs, PHP CPU spikes to 100% until the broker is restarted and then the PHP server is restarted (from testing it must be in that order, otherwise the problem persists). MQTT messages are broadcast from all Laravel jobs as part of a queue-monitoring system. Note that during these PHP issues, connections to the broker slowly increase; I assume this is just the broker holding them open but being too slow to process them.
At a guess it might be something like QoS > 0 messages stacking up on the broker and causing the excessive disk usage, though I can't confirm this and would expect high disk writes rather than reads. I've checked for network bottlenecks and other potential causes but have found nothing so far.
We're currently at a loss as to what could cause this, since 99% of the time the broker operates normally and we know of nothing that would trigger this sustained increase in disk I/O. The issue also occurred several weeks ago, but at a different time of day.
Broker is running on a GCE Instance with the following specs:
CPU: 2 vCPUs (E2 : Intel Broadwell)
Memory: 4 GB
Disk: 250GB (Max Throughput: 70MB/s)
OS: Container-Optimized OS, 89
Broker disk monitoring
Normal throughput is what is seen before 1:30am
Broker network traffic
Network traffic is within the norms for this time of day. Dropped between 5:30 and 6:30 due to us trying to fix the issue.
Broker CPU
Never seems to reach 100% but could be due to high frequency spikes that the monitoring is averaging.
PHP Server CPU
Note that the CPU does not spike until several hours after the broker shows signs of a problem.
Mosquitto Version
Official Docker Container: eclipse-mosquitto:1.6.14
Mosquitto Config
listener 1883 0.0.0.0
protocol mqtt
listener 9001 0.0.0.0
protocol websockets
allow_anonymous false
password_file /mosquitto/config/mosquitto.passwd
acl_file /mosquitto/config/mosquitto.acl
Normal Traffic
Messages: ~1000 per minute throughput on the broker
The PHP server only publishes short-lived messages, with the following config (a rough php-mqtt/client equivalent is sketched after the list):
QoS: 1
Retain: 0
Timeout: 1
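Presumably the publish path looks roughly like the sketch below; this is only an illustration, with placeholder host, credentials and topic, and it assumes a recent 1.x release of php-mqtt/client (method names may differ slightly in v1.1.0):

<?php
// Rough equivalent of the publish settings above; host, credentials and topic
// are placeholders, and Composer autoloading (Laravel) is assumed.
use PhpMqtt\Client\ConnectionSettings;
use PhpMqtt\Client\MqttClient;

require __DIR__ . '/vendor/autoload.php';

$settings = (new ConnectionSettings())
    ->setUsername('laravel')
    ->setPassword('secret')
    ->setConnectTimeout(1);                      // the "Timeout: 1" above

$client = new MqttClient('broker.example.com', 1883, 'laravel-job-' . uniqid());
$client->connect($settings, true);
$client->publish(
    'jobs/status',                               // placeholder topic
    json_encode(['job' => 'example']),
    MqttClient::QOS_AT_LEAST_ONCE,               // QoS: 1
    false                                        // Retain: 0
);
$client->disconnect();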
Dump of some $SYS topics as of now:
$SYS/broker/version: "mosquitto version 1.6.14"
$SYS/broker/uptime: "23529 seconds"
$SYS/broker/messages/stored: "93" // Remained stable since monitoring
$SYS/broker/messages/received: "895858"
$SYS/broker/messages/sent: "1405546"
$SYS/broker/retained messages/count: "93" // Remained stable since monitoring
$SYS/broker/clients/total: "122078"
$SYS/broker/clients/inactive: "121984"
$SYS/broker/clients/disconnected: "121984"
$SYS/broker/clients/active: "94"
$SYS/broker/clients/connected: "94"
$SYS/broker/clients/maximum: "122078"
$SYS/broker/subscriptions/count: "393"
I'm in the process of adding monitoring for these, to gather data if the issue occurs again.
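For what it's worth, a minimal $SYS watcher with php-mqtt/client might look roughly like this (host, credentials and client ID are placeholders, the ACL must allow reading $SYS/#, and method names assume a recent 1.x release):

<?php
// Hypothetical $SYS monitor; logs every $SYS update so it can be correlated
// with the disk I/O graphs. Host, credentials and client ID are placeholders.
use PhpMqtt\Client\ConnectionSettings;
use PhpMqtt\Client\MqttClient;

require __DIR__ . '/vendor/autoload.php';

$settings = (new ConnectionSettings())
    ->setUsername('monitor')
    ->setPassword('secret')
    ->setConnectTimeout(3);

$client = new MqttClient('broker.example.com', 1883, 'sys-monitor');
$client->connect($settings, true);

$client->subscribe('$SYS/#', function (string $topic, string $message) {
    printf("[%s] %s = %s\n", date('c'), $topic, $message);
}, MqttClient::QOS_AT_MOST_ONCE);

$client->loop(true);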
The use case:
8 servers with 300 concurrent php-fpm child processes each, producing records to Apache Kafka.
Each request produces 1 Kafka record, about 1,000 records per second in total.
Why do we need so many connections?
We have a web API that gets 60K calls per minute. Those requests do many things and are processed by thousands of php-fpm web workers (unfortunately). As part of the request handling, we produce events to Kafka.
The problem:
I cannot find a way to persist connections between php-fpm web requests, which seems very inefficient to me and might hit Kafka's limits (will it?).
The result is 1,000 producer connections per second being established, each sending one single record and being closed right after.
I read here https://www.reddit.com/r/PHP/comments/648zrk/kafka_php_71_library/ that php-rdkafka is efficient, but I don't know whether it can solve this issue.
I thought that OPcache might be handy for reusing connections, but I cannot find a way to do it.
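For reference, a minimal php-rdkafka producer looks roughly like the sketch below (assuming php-rdkafka 3.x or newer; the broker address and topic name are placeholders). On its own it still opens a fresh connection per php-fpm request, which is exactly the pattern in question:

<?php
// Minimal php-rdkafka producer sketch; 'kafka:9092' and 'api-events' are placeholders.
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'kafka:9092');

$producer = new RdKafka\Producer($conf);
$topic    = $producer->newTopic('api-events');

// One record per web request, as described above.
$topic->produce(RD_KAFKA_PARTITION_UA, 0, json_encode(['event' => 'example']));

$producer->poll(0);        // serve delivery callbacks
$producer->flush(10000);   // wait up to 10 s for delivery before the request ends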
The question
Is establishing and closing 1,000 connections per second fast and cheap, or is a proxy with cheap connectivity that reuses a Kafka connection a must for such a use case?
We are using https://github.com/php-amqplib/php-amqplib to consume messages from RabbitMQ. We've got several consumers that run happily with no issues. Today we discovered that one of our consumer processes consumes around 7% of the host's CPU when idle (no messages in the queue), whereas the rest consume about 1% each.
On top of that, when switching this process on and off we see large changes in the CPU utilization of our DB (an AWS RDS Postgres instance). With 3 consumer processes running, our DB is at >30% CPU utilization all the time (even when there's nothing in the queue).
We've got a standard Symfony configuration and our consumers are run using app/console rabbitmq:consumer -w consumer_name. The consumer in question has nothing special about it as far as we can tell. We are completely out of clues here, so any help will be much appreciated.
More details:
When we turn on the consumer, we can see the same set of queries running a huge number of times on the DB (200,001 times in the space of 10 minutes). There are no unacked messages in the queue. The consumer otherwise processes messages correctly. The query is a SELECT that would normally be run as part of the consumer's logic.
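For comparison, an idle php-amqplib consumer that blocks in $channel->wait() should sit near 0% CPU and run no SQL while the queue is empty. A minimal sketch of such a loop, assuming a recent php-amqplib release, with placeholder host, credentials and queue name:

<?php
// Minimal blocking consumer; while the queue is empty, wait() blocks on the
// socket, so the process does essentially nothing between messages.
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

require __DIR__ . '/vendor/autoload.php';

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel    = $connection->channel();

$channel->basic_consume('my_queue', '', false, false, false, false, function (AMQPMessage $msg) {
    // ... run the SELECT and the rest of the consumer logic here ...
    $msg->ack();
});

while (count($channel->callbacks)) {
    $channel->wait();   // blocks until a message (or heartbeat) arrives
}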
On a Debian web server (VPS) with a good CPU, 6 GB RAM, and a fast backbone Internet connection, I run a PHP application. PHP runs under Apache's prefork MPM (incl. APC opcode cache), because whenever you search for PHP and the worker MPM there are abundant warnings regarding thread safety. The PHP application is quite large, so each server process requires about 20 to 30 MB of RAM. The application processes sensitive data, therefore all connections to the Apache server are SSL-encrypted.
Typically, the application serves no or few images (about 1-3 files incl. CSS and JS per request) and each user sends a new request about every minute (30 seconds to 4 minutes, depending on the user).
Recently, this application faced a big storm of user requests (that was planned, not a DoS: about 2,500 concurrent users). While the CPU did fine (<50% use), my server quickly ran out of slots. The point is that in prefork mode each slot requires memory, and 6 GB is just enough for a MaxClients of about 200 slots (6 GB / ~30 MB per process ≈ 200).
Problem 1: According to Apache server-status, most slots were occupied "..reading..", sometimes for 10 seconds or more, while PHP processing takes 0.1 to 2 seconds. Little data is sent by the users, so I guess that this is actually the SSL handshake. This, of course, occupies lots of slots. (I also enabled and configured mod_reqtimeout to drop very slow clients and, following http://unhandledexpression.com/2013/01/25/5-easy-tips-to-accelerate-ssl/, used SSLHonorCipherOrder to prefer faster ciphers; the SSLCertificateChainFile is also transmitted.)
Problem 2: If I enable KeepAlive (only 1 or 2 seconds) to reduce the SSL overhead, slots are kept open and therefore occupied roughly twice as long as the PHP processing alone would require.
Problem 3: If I actually wanted to serve 2,500 users and use KeepAlive to speed up SSL, I would require 2,500 slots. However, I won't have a machine with 32 GB of RAM.
With enough users on the server to test its limits, I was stuck at about 110 requests per second, at about 50% CPU load on a quad-core system (max. 400%), and fewer req/sec if I (re-)enabled KeepAlive. 110 req/sec on a modern web server seems ridiculous! I cannot believe that this is actually all that Apache, PHP and SSL can deliver.
Is there a major flaw in my thinking? Am I hitting a basic limitation of the prefork mode? Did I overlook the obvious? Is SSL actually such a performance-eater? Thanks for any hints!
I'm the author of that article about SSL performance. I don't think the handshake is responsible for the 8+ seconds spent reading. You can get useful information by using http://www.webpagetest.org/. The handshake is done by the time a request is marked as "connected".
My guess would be that, with a lot of concurrent users, the slow processing of the PHP app makes some users wait much longer.
Here are some ideas to get better performance:
I don't think KeepAlive would be a good idea if each client only makes a request every minute.
You could enable SSL session tickets to reduce the handshake overhead.
MPM-Worker works fine for a lot of different setups, so I encourage you to try it out.
Caching will probably not help you if the clients receive a different response every time.
You should test PHP-FPM; that could speed up the PHP processing.
Also test APC, to cache precompiled PHP code.
I don't know anything about the architecture of your app, but you could defer sending the results: get the data from the client, send an immediate answer ("processing data..." or something like that), process the data in a background process, then on the next request, send the calculated answer.
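That last idea is easy to prototype under PHP-FPM, since fastcgi_finish_request() lets you flush a response and keep working in the same process. A minimal sketch (the endpoint, the stand-in heavyComputation() and the result file are all placeholders, not your app's actual architecture):

<?php
// Sketch of the deferred-response idea under PHP-FPM (fastcgi_finish_request()
// is not available under mod_php). Names and paths below are placeholders.

function heavyComputation(string $payload): array
{
    sleep(5);   // stands in for the real processing
    return ['input_bytes' => strlen($payload), 'done_at' => date('c')];
}

$payload = file_get_contents('php://input');

http_response_code(202);
header('Content-Type: application/json');
echo json_encode(['status' => 'processing']);

// Flush the response to the client now; this worker keeps running afterwards.
fastcgi_finish_request();

$result = heavyComputation($payload);
file_put_contents('/tmp/deferred-result.json', json_encode($result));
// The client's next request can read the stored result and return it.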
I am developing a big application and I have to load test it. It is an EC2-based cluster with one High-CPU Extra Large instance for the application, which runs PHP / Nginx.
This application is responsible for reading data from a Redis server which holds some 5k-10k key-value pairs; it then builds the response, logs the data to a MongoDB server and replies to the client.
Whenever I send a request to the app server, it does all its computations in about 20-25 ms, which is awesome.
I am now trying to do some load testing: I run a PHP-based app on my laptop that sends many thousands of requests to the server over 20-30 seconds. During this load period, whenever I open the app URL in the browser, the page reports an execution time of around 25-35 ms, which is again cool, so I am sure that Redis and Mongo are not the bottleneck. But during the load it takes about 25 seconds to get the response back.
The High-CPU Extra Large instance has 8 GB of RAM and 8 cores.
Also, during the load test, the top command shows about 4 - 6 php_cgi processes consuming some 15 - 20% of CPU.
I have 50 worker processes on nginx and 1024 worker connections.
What could be causing the bottleneck?
If this doesn't work out, I am seriously considering moving to a full Java application with an embedded web server and an embedded cache.
UPDATE: increasing PHP_FCGI_CHILDREN to 8 halved the response time under load.
50 worker processes is too many; you only need one worker process per CPU core. Using more worker processes causes extra switching between processes, which wastes a lot of time.
What you can do now:
1. Set worker_processes to the minimum (one worker per CPU, e.g. 4 worker processes if you have 4 CPU cores), but set worker_connections to the maximum (10240, for example).
2. Tune the TCP stack via sysctl; you can hit stack limits when you have many connections.
3. Get statistics from the nginx stub_status module (you can use Munin + nginx; it's easy to set up and gives you enough information about system status). A small PHP sketch for polling it is shown after this list.
4. Check nginx's error.log and the system message log for errors.
5. Tune nginx itself (decrease connection timeouts and the maximum request size).
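For point 3, a small polling script might look like the sketch below; it assumes stub_status has been enabled at a hypothetical /nginx_status location (location /nginx_status { stub_status on; }):

<?php
// Hypothetical stub_status poller; the URL is a placeholder and depends on
// where stub_status is exposed in your nginx config.
function nginxStatus(string $url = 'http://127.0.0.1/nginx_status'): array
{
    $raw = file_get_contents($url);
    if ($raw === false) {
        throw new RuntimeException("Could not read $url");
    }

    preg_match('/Active connections:\s+(\d+)/', $raw, $active);
    preg_match('/Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)/', $raw, $rww);

    return [
        'active'  => (int) $active[1],
        'reading' => (int) $rww[1],
        'writing' => (int) $rww[2],
        'waiting' => (int) $rww[3],
    ];
}

print_r(nginxStatus());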
I hope that helps you.