We are using https://github.com/php-amqplib/php-amqplib to consume messages from RabbitMQ. We've got several consumers running happily with no issues. Today we discovered that one of our consumer processes consumes around 7% of the host's CPU when idle (no messages in the queue), whereas the rest consume about 1% each.
On top of that, when switching this process on and off we see large changes in the CPU utilization of our DB (an AWS RDS Postgres instance). With 3 consumer processes running, our DB sits at >30% CPU utilization all the time, even when there's nothing in the queue.
We've got a standard Symfony configuration, and our consumers are run using app/console rabbitmq:consumer -w consumer_name. The consumer in question has nothing special about it as far as we can tell. We are completely out of clues, so any help will be much appreciated.
More details:
When we turn the consumer on, we see the same set of queries running a huge number of times on the DB (200,001 times in the space of 10 minutes). There are no unacked messages in the queue, and the consumer otherwise processes messages correctly. The query is a SELECT that would normally run as part of the consumer's logic.
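For reference, here is a minimal sketch of the standard php-amqplib consume loop we believe the bundle runs under the hood (connection details and queue name are illustrative). If the idle loop were polling instead of blocking in wait(), and running its per-message logic on every iteration, that would explain both the idle CPU and the repeated SELECT:

require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

$channel->basic_consume('consumer_queue', '', false, false, false, false, function ($msg) {
    // The per-message logic (including the SELECT) belongs here, so it
    // only runs when a message is actually delivered.
    $msg->delivery_info['channel']->basic_ack($msg->delivery_info['delivery_tag']);
});

// wait() blocks inside the socket read until a frame arrives, so an idle
// consumer should sit near 0% CPU and issue no queries.
while (count($channel->callbacks)) {
    $channel->wait();
}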
I've got an issue with one of our MQTT brokers running Mosquitto: at a seemingly random time, disk usage increases steadily until it reaches an equilibrium of about 50 MB/s read (shown in the graph below).
This also causes a secondary issue with our PHP server (Laravel, PHP 7.4, running php-mqtt/client v1.1.0): once a connection issue occurs, PHP CPU spikes to 100% until the broker is restarted and then the PHP server is restarted (from testing it must be in that order, otherwise the problem persists). MQTT messages are broadcast from all Laravel jobs as part of a queue monitoring system. Of note: during these PHP issues, connections to the broker slowly increase; I assume this is just the broker holding them open but being too slow to process them.
At a guess it might be something like QoS >0 messages stacking up on the broker and causing excessive disk usage, though I can't confirm this, and I would expect high disk writes rather than reads. I've checked for network bottlenecks and other potential causes, but have found nothing so far.
I'm currently at a loss as to what could cause this issue: 99% of the time the broker operates normally, and nothing we know of would trigger this sustained increase in disk I/O. The issue also occurred several weeks ago, but at a different time of day.
Broker is running on a GCE Instance with the following specs:
CPU: 2 vCPUs (E2 : Intel Broadwell)
Memory: 4 GB
Disk: 250GB (Max Throughput: 70MB/s)
OS: Container-Optimized OS 89
Broker disk monitoring
Normal throughput is what is seen before 1:30am
Broker network traffic
Network traffic is within the norms for this time of day; it dropped between 5:30 and 6:30 because we were trying to fix the issue.
Broker CPU
It never seems to reach 100%, but that could be due to high-frequency spikes that the monitoring averages out.
PHP Server CPU
Note that the CPU does not spike until several hours after the broker shows signs of a problem.
Mosquitto Version
Official Docker Container: eclipse-mosquitto:1.6.14
Mosquitto Config
listener 1883 0.0.0.0
protocol mqtt
listener 9001 0.0.0.0
protocol websockets
allow_anonymous false
password_file /mosquitto/config/mosquitto.passwd
acl_file /mosquitto/config/mosquitto.acl
Normal Traffic
Messages: ~1000 per minute throughput on the broker
PHP server only publishes short-lived messages with the following config:
QoS: 1
Retain: 0
Timeout: 1
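For reference, the publish path boils down to roughly this (host, topic, and payload are illustrative; authentication via ConnectionSettings omitted for brevity):

require 'vendor/autoload.php';

use PhpMqtt\Client\MqttClient;

$client = new MqttClient('broker.example.com', 1883, 'laravel-job-monitor');
$client->connect();

// QoS 1, retain = false, matching the settings listed above.
$client->publish('jobs/status', json_encode(['job' => 'example', 'state' => 'done']), 1, false);

$client->disconnect();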
Dump of some $SYS topics as of now:
$SYS/broker/version: "mosquitto version 1.6.14"
$SYS/broker/uptime: "23529 seconds"
$SYS/broker/messages/stored: "93" // Remained stable since monitoring
$SYS/broker/messages/received: "895858"
$SYS/broker/messages/sent: "1405546"
$SYS/broker/retained messages/count: "93" // Remained stable since monitoring
$SYS/broker/clients/total: "122078"
$SYS/broker/clients/inactive: "121984"
$SYS/broker/clients/disconnected: "121984"
$SYS/broker/clients/active: "94"
$SYS/broker/clients/connected: "94"
$SYS/broker/clients/maximum: "122078"
$SYS/broker/subscriptions/count: "393"
I'm in the process of adding monitoring to these topics, to gather data if the issue occurs again.
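The rough plan for that monitoring, using the same php-mqtt/client library (host, log path, and client ID are illustrative; authentication omitted for brevity):

require 'vendor/autoload.php';

use PhpMqtt\Client\MqttClient;

$client = new MqttClient('broker.example.com', 1883, 'sys-monitor');
$client->connect();

// Log every $SYS update with a timestamp so counters can be correlated
// with the disk I/O graphs later.
$client->subscribe('$SYS/#', function ($topic, $message) {
    file_put_contents(
        '/var/log/mosquitto-sys.log',
        date('c') . " {$topic}: {$message}\n",
        FILE_APPEND
    );
}, 0);

$client->loop(true); // block and dispatch incoming messages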
I'm using Laravel 5.5 and I'm trying to set up some fast queue processing, but I've been running into one roadblock after another.
This site is an employer/employee matching service. When an employer posts a job position, the system needs to run through all the employees and calculate a number of variables to determine how well each one matches the job. We have this all figured out, but it takes a long time to process one at a time when there are thousands of employees in the system. So I set it up to write a couple of tables: the first simply records the position ID and its status; the second lists each employee ID, the position ID, and the status of processing that employee. This takes only a few seconds to write, and then allows the user to move on in the application.
Then I have another server set up to run a cron every minute that checks for new entries in the first table. When one is found, it marks it as started, then grabs all the employees and starts a queued Laravel job for each one. The job I have defined does submit to the queue properly, and running queue:work does in fact process the job properly. This is all tested.
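For concreteness, the cron step amounts to roughly this (model and job names changed; the chunking just keeps memory flat with thousands of employees):

$position = Position::where('status', 'new')->first();

if ($position) {
    $position->update(['status' => 'started']);

    Employee::chunk(500, function ($employees) use ($position) {
        foreach ($employees as $employee) {
            // One queued job per employee/position pair.
            MatchEmployee::dispatch($employee->id, $position->id);
        }
    });
}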
However, the problem I'm running into is that I've tried database (MySQL), Redis, and SQS for the queue, and they are all very slow. I was originally using this same server to run queue:work (using Supervisor and attempting to run up to 300 processes), but then I created 3 clones that don't run the cron and only run Supervisor (100 processes per clone), and killed Supervisor on the first server. With the database driver it would process OK, though running through 10k queued jobs would take hours, but with SQS and Redis I'm getting a ton of failures. The scripts are taking too long, or something. I checked the CPUs on the clones running the workers and they are barely hitting 40%, so I'm not over-taxing the servers.
I was just reading about Horizon and I'm not sure if it would help the situation. I keep trying to find information about how to properly setup a queue processing system with Laravel and just keep running into more questions than answers.
Is anyone familiar with this stuff, and do you have any advice on how to set this up correctly so that it's very fast and failure-free (assuming my code has no bugs)?
UPDATE: Following advice from another post, I figured I'd share a few more details:
I'm using Forge as the setup tool, with AWS EC2 servers with 2 GB of RAM.
Each of the three clones has the following worker configuration:
command=php /home/forge/default/artisan queue:work sqs --sleep=10 --daemon --quiet --timeout=30 --tries=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=forge
numprocs=100
stdout_logfile=/home/forge/.forge/worker-149257.log
The database is on Amazon RDS.
I'm curious whether the Laravel cache will work with the queue system. There are elements of the queued script that are common to every run, so perhaps if I cached that data up front it might save some time. But I'm not convinced it will be a huge improvement.
If we ignore the actual logic processed by each job, and consider the overhead of running jobs alone, Laravel's queueing system can easily handle 10,000 jobs per hour, if not several times that, in the environment described in the question—especially with a Redis backend.
For a typical queue setup, 100 queue worker processes per box seems extremely high. Unless these jobs spend a significant amount of time in a waiting state—such as jobs that make requests to web services across a network and use only a few milliseconds processing the response—the large number of processes running concurrently will actually diminish performance. We won't gain much by running more than one worker per processor core. Additional workers create overhead because the operating system must divide and schedule compute time between all the competing processes.
I checked the CPUs on the clones running the workers and they are barely hitting 40% so I'm not over-taxing the servers.
Without knowing the project, I can suggest that it's possible that these jobs do spend some of their time waiting for something. You may need to tune the number of workers to find the sweet spot between idle time and overcrowding.
With database it would process OK, though to run through 10k queued jobs would take hours, but with SQS and Redis I'm getting a ton of failures.
I'll try to update this answer if you add the error messages and any other related information to the question.
I'm curious whether the Laravel cache will work with the queue system. There are elements of the queued script that are common to every run, so perhaps if I cached that data up front it might save some time.
We can certainly use the cache API when executing jobs in the queue. Any performance improvement we see depends on the cost of reproducing the data for each job that we could store in the cache. I can't say for sure how much time caching would save because I'm not familiar with the project, but you could profile sections of the code in the job to find expensive operations.
Alternatively, we could cache reusable data in memory. When we initialize a queue worker using artisan queue:work, Laravel starts a PHP process and boots the application once for all of the jobs that the worker executes. This is different from the application lifecycle for a typical PHP web app wherein the application reboots for every request and disposes state at the end of each request. Because every job executes in the same process, we can create an object that caches shared job data in the process memory, perhaps by binding a singleton into the IoC container, which the jobs can read much faster than even a Redis cache store because we avoid the overhead needed to fetch the data from the cache backend.
Of course, this also means that we need to make sure that our jobs don't leak memory, even if we don't cache data as described above.
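As a rough sketch of that idea (the class name and the cached data are hypothetical), a service provider can bind a singleton that computes shared data once per worker process:

use Illuminate\Support\ServiceProvider;

// Hypothetical holder for data shared by every job in the batch.
class SharedMatchData
{
    public $employees;

    public function __construct()
    {
        // Expensive, job-independent setup: runs once per worker process.
        $this->employees = Employee::all(['id', 'skills'])->keyBy('id');
    }
}

class AppServiceProvider extends ServiceProvider
{
    public function register()
    {
        // A singleton lives for the lifetime of the worker process, so
        // every job the worker executes reuses the same instance.
        $this->app->singleton(SharedMatchData::class);
    }
}

// In a job, type-hint the dependency and Laravel injects the shared
// instance:
//     public function handle(SharedMatchData $shared) { ... }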
I was just reading about Horizon and I'm not sure if it would help the situation.
Horizon provides a monitoring service that may help to track down problems with this setup. It may also improve efficiency a bit if the application uses other queues that Horizon can distribute work between when idle, but the question doesn't seem to indicate that this is the case.
Each of the three clones has the following worker configuration:
command=php /home/forge/default/artisan queue:work sqs --sleep=10 --daemon --quiet --timeout=30 --tries=3
(Sidenote: for Laravel 5.3 and later, the --daemon option is deprecated, and the queue:work command runs in daemon mode by default.)
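Putting this together, a more conservative starting point for each clone might look like the following (the numprocs and sleep values are assumptions to tune against measured idle time, not prescriptions):

command=php /home/forge/default/artisan queue:work sqs --sleep=3 --quiet --timeout=30 --tries=3
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=forge
; start near one worker per core and adjust from there
numprocs=4
stdout_logfile=/home/forge/.forge/worker-149257.log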
I have a Laravel application which sends data to SQS on nearly every request. However, every so often, one of these requests takes several seconds to execute. Attached is a stack trace from New Relic. It seems that the tick() method (within cURL) gets called many times, and the seconds just pile up. It also seems to be making several attempts to connect to the same endpoint, though these are AWS services, so I can't imagine they'd be unresponsive this often.
Any idea why this might occur?
My code is hosted on AWS, on two m4.large instances behind an ELB. In general, the application operates at a fairly low throughput of roughly 50-100 requests per minute.
Stack trace: https://ibb.co/f05gLk
Additional thought: given that these instances are in a private subnet, is it possible that the long request times to SQS endpoints are a DNS-related issue?
SQS pushes are sometimes slow (>50 ms), especially if your payload is big; I noticed pushes taking around 80 ms for a fairly small payload (~200 KB). I solved this by shifting the push to Redis, with a batched push from Redis to SQS.
I did not spend time investigating why the pushes are slow.
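A rough sketch of the buffering approach, assuming the AWS SDK for PHP and the phpredis extension (queue URL, list key, and the backoff interval are illustrative; SendMessageBatch accepts at most 10 entries):

require 'vendor/autoload.php';

$redis = new Redis();
$redis->connect('127.0.0.1');

// Hot path (inside the web request): a local list push takes well under
// a millisecond, versus ~80 ms for an inline SQS call.
$payload = ['event' => 'example'];
$redis->rPush('sqs:buffer', json_encode($payload));

// Background worker: drain the buffer in batches of up to 10 messages
// per SQS round-trip.
$sqs = new Aws\Sqs\SqsClient(['region' => 'us-east-1', 'version' => '2012-11-05']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/example-queue';

while (true) {
    $entries = [];
    for ($i = 0; $i < 10; $i++) {
        $body = $redis->lPop('sqs:buffer');
        if ($body === false) {
            break;
        }
        $entries[] = ['Id' => (string) $i, 'MessageBody' => $body];
    }

    if ($entries) {
        $sqs->sendMessageBatch(['QueueUrl' => $queueUrl, 'Entries' => $entries]);
    } else {
        usleep(100000); // buffer empty; back off briefly
    }
}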
I'm trying to figure out what is causing my system to open a large number of PHP processes. This issue has occurred 3 times over the last 2 weeks, and it can crash our application if it goes undetected for several hours: once it opens 300 database connections, it prevents anything further from connecting.
The application is based on CakePHP 2.x and runs across multiple EC2 instances, which share an RDS database.
The primary identifier that something is going wrong is high number of database connections, as shown by this graph:
We have CloudWatch monitoring set up to notify us in Slack when average connections go above 40 for more than 5 minutes (normally connections don't go much above 10).
Looking at New Relic, I can also see that the number of PHP processes steadily increased by 1 per minute. This is on our operations server, which just handles background processing and tasks, and does not handle any web traffic.
Over the same period, the graphs on the web servers appear normal.
Looking at New Relic's information on long-running processes, nothing suggests that any PHP process ran for 20+ minutes. However, these processes were killed manually, which may be why they're not visible in New Relic; I believe it may not record processes that are killed.
While this issue has now occurred 3 times, I'm still unsure what is causing the problem or how to debug what a particular running PHP process is doing.
The last time this happened, I could see all the PHP processes running, and could see that they had been running for some time, but I had no idea what they were doing or how to find out, and to prevent the database from becoming overloaded I had to kill them all.
Are there any tools, or other information I'm overlooking, that might help me determine which particular process is causing this issue?
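One idea I'm considering for next time: have each long-running worker install a signal handler that dumps a backtrace on demand, so a stuck process can be inspected before killing it (sketch below; requires the pcntl extension, CLI only, and the log path is illustrative). Then kill -USR1 <pid> shows exactly where a process is.

// Install once at the top of any long-running CLI script or shell task.
pcntl_async_signals(true); // PHP 7.1+; use declare(ticks=1) on older versions

pcntl_signal(SIGUSR1, function () {
    $trace = (new Exception('SIGUSR1 backtrace'))->getTraceAsString();
    file_put_contents(
        '/var/log/php-worker-traces.log',
        date('c') . ' PID ' . getmypid() . "\n" . $trace . "\n\n",
        FILE_APPEND
    );
});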
I'm trying to create a distributed system of PHP daemons (managed via Upstart) running SWF deciders and activities, to replace a lot of our cron jobs and some processes that would benefit from running asynchronously in the background.
However, there are a few things I'm not sure about:
What's a good way to upgrade these scripts while they're running, potentially on more than one server?
How can I ensure that any running activities finish before upgrading the scripts and restarting the daemon?
I have to stick with PHP due to the codebase, but that doesn't exclude a bit of other "wrapping" scripting if needed.
In the worst case, you can never guarantee that an activity worker won't pick up an activity before you kill it.
You should turn the problem around: SWF activities are supposed to be idempotent, i.e., they give the same result even if run multiple times for the same input. If you have long-running activities (which I assume you do), use heartbeats to let SWF periodically know that your activities are alive and well (if you have short activities, the low activity timeouts themselves should suffice). Now, when a deployment comes along and kills an activity worker on one machine, SWF will schedule the killed activities for processing on another machine (because the heartbeat timeout or activity timeout expired).
If you build your activities with heartbeats (for long-running activities, and small timeouts for quick activities), you never need to worry about deployments or machine failures, because any time an activity worker goes down for whatever reason, SWF will schedule the task to a different worker.
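For reference, heartbeating from a long-running activity with the AWS SDK for PHP looks roughly like this (the region, the work loop, and processItem() are illustrative; $taskToken comes from the PollForActivityTask response that handed this worker the activity):

require 'vendor/autoload.php';

$swf = new Aws\Swf\SwfClient(['region' => 'us-east-1', 'version' => '2012-01-25']);

foreach ($workItems as $item) {
    $result = $swf->recordActivityTaskHeartbeat(['taskToken' => $taskToken]);

    if ($result['cancelRequested']) {
        // SWF (or a deployment) asked us to stop: exit cleanly and let
        // another worker pick the task up. Idempotency makes this safe.
        break;
    }

    processItem($item); // hypothetical unit of work
}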
Along these lines, the best way to deploy is a staggered deployment: deploy to one section of your hosts at a time and, based on their health, proceed to more sections until all your hosts are upgraded. This gives SWF the room to reschedule activities killed by deployments, and helps prevent quickly detectable deployment bugs from spreading to the rest of your system.