I have deployed a Laravel 4.2-based project on AWS.
I am facing an issue: when there is load on my site, RDS hits 100% CPU utilization.
For example, the last time we had such a load there were 422 DB connections.
The strange part is that at the same time Write IOPS and Read IOPS are normal, and the slow query log shows nothing suspicious.
I need a direction on how to debug this issue; I want to know what is causing the high CPU utilization.
Thanks
I am currently running a project on Docker, and I am trying to communicate with the database via an API. They are both deployed on Docker.
The problem is, when I make a call to the API (via Postman or the front-end) I SOMETIMES get a 500 error.
When I look at the logs (docker-compose logs -f) in the container folder, I get the following message:
test-api_php | [2021-03-18 10:09:04] application.ERROR: SQLSTATE[HY000] [1049] Unknown database 'test-api'-#0 /app/config/....
The thing is, it happens about 50% of the time; the other times it works as intended. I've noticed that if I make a request and wait a few seconds, the next request almost always goes through, but if I make multiple requests consecutively, it's very unlikely to go through. I've given Docker as many resources as possible: 8 CPUs, 12 GB RAM, 2 GB swap, 60 GB disk image size.
My colleagues also have the same issue when they try it on their machines, so it's not machine-specific.
We have built the backend of a mobile app in Laravel and MySQL. The application is hosted on AWS EC2 and uses an RDS MySQL database.
We are stress testing the app using JMeter. When we send up to 1000 API requests from JMeter, it seems to work fine. However, when we send more than roughly 1000 requests in parallel, JMeter starts getting internal server errors (500) in response to many requests, and the percentage of 500 errors increases as we increase the number of requests.
Normally, we would expect that if we increase the number of requests, they would be queued and responses would slow down once the server runs out of resources. We also monitored the resources on the server, and they never reached even 50% of what is available.
Is there any timeout setting or other setting I could tweak so that we don't get internal server errors before reaching 80% of resource usage?
Regards
Syed
500 is the externally visible symptom of some sort of failure in the server delivering your API. You should look at that server's error log to see the details of the failure.
If you are using PHP scripts to deliver the API, your MySQL (RDS) server may be running out of connections. Here's how that might happen.
A PHP-driven web server under heavy load runs a lot of PHP instances. Each PHP instance opens one or more connections to the MySQL server. When (number of PHP instances) × (connections per instance) gets too high, the MySQL server starts refusing further connections.
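One quick way to check whether this is what is happening is to compare MySQL's connection limit with the number of connections actually in use while JMeter is running. These are standard MySQL status queries, so they work against RDS as well:

SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';

If Max_used_connections has reached max_connections during a test run, connection exhaustion is almost certainly part of the picture.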
Here's what you need to do: restrict the number of PHP instances your web server is allowed to run at a time. When you restrict that number, incoming requests will queue up (in the TCP connection queue of your OS's network stack). Then, whenever an instance becomes available, it will serve the next item in the queue.
Apache has a MaxRequestWorkers parameter, with an extremely large default value of 256. Try setting it much lower, for example 32, and see whether your problem changes.
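As a sketch (assuming Apache 2.4 with the prefork MPM; where the directive lives depends on your distribution), the setting looks like this:

<IfModule mpm_prefork_module>
    MaxRequestWorkers 32
</IfModule>

If you serve PHP through php-fpm instead of mod_php, the equivalent knob is pm.max_children in the pool configuration.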
If you shrink the number of request workers, you may paradoxically improve high-load performance: serializing many requests often yields better throughput than trying to run them all in parallel.
The same goes for the number of active connections to your MySQL server. It obviously depends on the nature of your queries, but generally speaking, fewer concurrent queries means better performance. So you won't solve a real-world problem by adding MySQL connections.
You should be aware that the kind of load imposed by server-hammering tools like JMeter is not representative of real-world load. 1000 simultaneous JMeter operations without failure is a very good result. If your load-testing setup is robust and powerful, you will always be able to bring your server system to its knees, so deciding when to stop is an important part of a load-testing plan. If this were my system, I would stop at 1000 for now.
For your app to be robust in the field, you should probably program it to respond to a 500 status by waiting a random amount of time and trying again.
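A minimal sketch of that client-side retry in PHP, assuming the API is called over HTTP with cURL (the function name and retry limits here are made up for illustration):

function callApiWithRetry($url, $maxAttempts = 3) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($status > 0 && $status < 500) {
            return $body;   // success, or a client error that retrying won't fix
        }
        // wait a random, growing amount of time before trying again
        usleep(random_int(200000, 1000000 * $attempt));
    }
    return null;            // give up after $maxAttempts failed attempts
}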
I am running a set of CRON jobs (every hour) to extract the latest data from one DB and write it into CSVs using PHP.
Recently I noticed something unusual on my EC2 server. A CSV was generated with the header only, even though there was data to export. All my logging around the process showed the data being extracted, including the count of extracted records. The only issue I found was that CPU utilization was at 100% during this scenario. Later, everything went back to normal once CPU utilization did.
Then, four days later, a CSV was generated with the data twice: only one header, but the same set of data repeated twice in the file. My logging showed the correct count this time as well. Again, the only issue found was that CPU utilization climbed to 100% during that period.
Is there any connection between EC2 CPU utilization and this process, maybe something memory-related? Has anyone faced similar issues, even on a different cloud?
Please advise.
Thanks
If a job takes more than one hour to run (because of high CPU utilization, for example), another instance of the job will start while the first is still running, and you will likely get duplicated results in the CSV file. So you should prevent a CRON job from starting if one is already running, for example with a lock as sketched below. More information can be found here and here.
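A common way to do that is to wrap the cron command in flock, so a second run exits immediately if the previous one still holds the lock (the lock file and script paths below are placeholders):

0 * * * * /usr/bin/flock -n /tmp/export-csv.lock /usr/bin/php /path/to/export.php

The same effect can be achieved inside the PHP script itself by calling flock() on a lock file at startup and exiting if the lock cannot be acquired.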
I have been using Elasticsearch for an e-commerce site for quite some time - not only for search, but also to retrieve product data (/index/type/{id}) and avoid SQL queries.
Generally this works really well and most requests are answered within 1-3 ms. But some requests take 100-250 ms - just for a GET request like /index/type/{id}, where no actual searching is done and which normally takes 1-2 ms. It seems to me that something must be wrong if such a response takes more than 100 ms, because the server has a lot of RAM and a fast 6-core CPU, the data is stored on very fast SSDs, there are only 150'000 entries (about 300 MB in Elasticsearch), and there is almost no load. Elasticsearch has 5 GB of RAM, and there is enough spare RAM for Lucene to cache all entries all the time. Requests are made over a local network with a dedicated switch. The index has only one shard and I am running Elasticsearch 2.3.
I am making the requests from PHP. I have already tried using Nginx as a reverse proxy for Elasticsearch, but this did not solve anything - it happens with and without Nginx in between.
Edit: Slow requests happen about 1% of the time (relative to the total number of requests). I can also reproduce it by simply making 1000 requests from PHP to /index/type/{id} in Elasticsearch - about 1% will always be really slow, even when using the same ID, like /index/type/55 (as long as the ID exists). This also means there is no "cache effect": after the first request Elasticsearch should have the data "ready", yet the number of slow requests is the same no matter which IDs I request or whether I request the same ID over and over.
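For reference, the reproduction is essentially this kind of timing loop (a rough sketch; the host, index, type and ID are placeholders for my real values):

$slow = 0;
for ($i = 0; $i < 1000; $i++) {
    $start = microtime(true);

    $ch = curl_init('http://es-host:9200/index/type/55');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);

    $ms = (microtime(true) - $start) * 1000;
    if ($ms > 100) {
        $slow++;   // count requests slower than 100 ms
    }
}
echo "$slow of 1000 GET requests took longer than 100 ms\n";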
Edit 2: I have looked at the stats of my nodes with Marvel & Kibana, and nothing there indicates a slowdown: 20-40% of the JVM heap is used, and there is almost no latency (between 0.1 ms and 0.5 ms). This confirms that there are more than enough resources, and I see no correlations or hints pointing to the cause of the slow requests.
After a lot of testing, these are my definitive results:
The larger the response from Elasticsearch, the more likely slow requests become. Many small responses are MUCH less likely to be exceptionally slow than one large response.
Bombarding Elasticsearch with simple GET requests in parallel reduces the likelihood of slow responses.
When running a simple search for one keyword over and over again, Elasticsearch reports in the response that it "took" 2-3 ms, even when the response takes 200 ms to reach my application. But here too: the larger the response, the higher the chance of a slow one. A 1 KB response is never slow when I run request loops, a 2.5 KB response is a little slow (30 ms) in very few instances, and a 10 KB response always has up to 1% of slow requests, taking up to 200 ms.
I have considered that it might be a network "problem", especially since Elasticsearch thinks it is fast even when it is slow. But that would be a strange root cause, because my setup is so standard (Debian Jessie). Also, keep-alive connections and TCP_NODELAY do nothing to improve this.
Does anybody know how to find the root cause, and what could possibly be happening here?
I finally found the reason for the measurable slow responses: it was the network driver, or maybe even the hardware implementation on the network card.
When running the tests from the Elasticsearch node itself, the slow responses disappeared. I also noticed that the older servers (8 years old, compared to the only-2-year-old newer ones) had no slow responses when the tests were run on them. This indicated that the requesting server was at fault, not the responding ES server, and it also indicated that the network itself was fine, because only the "new" servers had this problem.
I went down the rabbit hole of TCP/network settings and found ethtool, which shows the network configuration and also allows changing it. I learned there is something called "offloading", where a lot of network operations are offloaded to the network card (especially splitting requests and responses into segments), and tried the following command to disable all offloading:
ethtool -K eth1 tx off rx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off rxhash off
Afterwards my request-1000-identical-searches-from-ES test was as fast as expected - no slow requests anymore. My network card (Intel® 82574L dual-port GbE on a SuperMicro X9SRL-F, running the e1000e driver) seems to do something in hardware that slows down responses, or holds them back, or whatever. The older servers run the tg3 driver - offloading is enabled on them as well (according to ethtool), but it does not cause these delayed responses. Disabling offloading had no noticeable effect on CPU load, which is probably to be expected with any modern CPU.
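For anyone trying this: the current offload state of an interface can be inspected with the lowercase -k option, and the -K change above does not survive a reboot, so it has to be reapplied at boot time (for example from a network hook script):

ethtool -k eth1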
With the new settings I was able to lower the share of slow pages caused by slow Elasticsearch responses to 0.07%, where before it was about 1%. I also noticed that using Nginx as a reverse proxy for Elasticsearch caused some slow responses of its own, although not many - usually about 3-5 responses out of every 150'000 were above 50 ms. Without Nginx, querying Elasticsearch directly, I am now unable to reproduce any slow requests at all, even at a large scale.
UPDATE 11/2017
After updating to Debian Stretch and running the server with kernel 4.9, all remaining "slow requests" disappeared. So this problem seems to be at least partly rooted in older Linux kernels.
I'm using Redis (2.6.8) with php-fpm and the phpredis driver, and I have some trouble with Redis latency. Under certain load the first request to Redis from our application takes about 1-1.5 s, and redis-cli --latency shows the same latency.
I've already checked the latency guide:
- we run Redis on the same host, over Unix sockets
- the slowlog has no entries longer than 5 ms
- we don't use AOF
- Redis uses about 3.5 GB of the 16 GB of memory available (I suppose that's not too much)
- our system is not swapping
- there is no other process doing disk I/O
I'm using persistent connections, and the number of connected clients varies from 5 to 25 (sometimes spiking to 60-80).
Here is the graph.
It looks like the problems start when there are 20 or more simultaneously connected clients.
Can you help me figure out where the problem is?
Update
I investigated the problem, and it looked like Redis was, for some reason, not getting enough processor time to operate properly.
I thoroughly checked the communication between php-fpm and Redis with the help of a network sniffer. Redis received the request over TCP but sent the answer back only after one and a half seconds. That clearly indicated the problem was inside Redis: it cannot process that many requests under the given conditions (possibly processor starvation, as the processor was only 50% loaded for the system as a whole).
The problem was resolved by moving Redis to another server that was nearly idle. I suppose we could have played with the Linux scheduler to make it work on the same server, but we have not done that yet.
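For what it's worth, one scheduler-level experiment along those lines would be to pin redis-server to a core of its own and keep php-fpm off that core, e.g. with CPU affinity. This is only a sketch of the idea, not something we actually tried (the core number and config path are placeholders):

taskset -c 3 redis-server /etc/redis/redis.conf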
Bear in mind that Redis is single-threaded. If the operations that you're doing err on the processor-intensive side, your requests could be blocking on each other. For instance, if you're doing HVALS against hashes with very large values, you're going to make all of your clients wait while you pull out all that data and copy it to the output buffer.
Part of what you need to do here (regardless of whether this is the issue) is to look at all the commands you're using and determine the complexity of each one. If you're running a bunch of O(N) commands against very large amounts of data, it's not impossible that you're simply doing too much at a time.
TL;DR Nobody here can debug this issue with real certainty without knowing which commands you're using and what your data looks like. But you can look up the time complexity of each command you're using and make sure it's reasonable.
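As a concrete illustration of the HVALS point: with a Redis recent enough to have HSCAN (2.8+) and phpredis, a single huge HVALS/HGETALL can be replaced by an incremental scan, so no single call blocks the server for long. A sketch, with a made-up key name and a placeholder socket path for the question's Unix-socket setup:

$redis = new Redis();
$redis->connect('/var/run/redis/redis.sock');          // Unix socket, as in the question
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY); // keep scanning until the cursor is exhausted

$iterator = null;
while ($fields = $redis->hScan('product:attributes', $iterator, '*', 100)) {
    foreach ($fields as $field => $value) {
        // process each field/value pair in small batches
    }
}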
I ran across this while researching an issue I'm working on, but thought it might help here:
https://groups.google.com/forum/#!topic/redis-db/uZaXHZUl0NA
If you read through the thread, there is some interesting info.