Find causes for slow Elasticsearch responses - PHP

I have been using elasticsearch for an e-commerce site for quite some time - not only for search, but also to retrieve product data (/index/type/{id}) to avoid SQL queries.
Generally this works really well and most requests are answered within 1-3ms. But there are some requests which take 100-250ms - just for a GET request like /index/type/{id}, where no actual searching is done and which normally takes 1-2ms. It seems to me that something must be wrong if such a response takes more than 100ms: the server has plenty of RAM and a fast 6-core CPU, the data is stored on very fast SSDs, there are only 150'000 entries (about 300MB in Elasticsearch), and there is almost no load. Elasticsearch has 5GB of RAM, and there is enough spare RAM for Lucene to cache all entries all the time. Requests go through a local network with a dedicated switch. The index has only one shard and I am running Elasticsearch 2.3.
I am doing the requests in PHP. I have already tried using Nginx as a reverse proxy for Elasticsearch, but this did not solve anything - the slow requests happen with and without Nginx in between.
Edit: Slow requests happen about 1% of the time (relative to the total number of requests). I can also reproduce it by just making 1000 requests from PHP to /index/type/{id} in Elasticsearch - about 1% will always be really slow, even when requesting the same ID, e.g. /index/type/55 (as long as the ID exists). This also means there is no "cache effect": after the first request Elasticsearch should have the data "ready", yet the number of slow requests is the same no matter which IDs I request or whether I request the same ID over and over.
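For illustration, this is roughly the kind of loop I used to reproduce it (a minimal sketch - host, port and ID are placeholders for my actual setup):

    $ch = curl_init('http://es-host:9200/index/type/55');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    for ($i = 0; $i < 1000; $i++) {
        $start = microtime(true);
        curl_exec($ch);  // reusing the handle keeps the TCP connection alive
        $ms = (microtime(true) - $start) * 1000;
        if ($ms > 50) {
            echo "request $i took " . round($ms) . "ms\n";  // ~1% end up here
        }
    }
    curl_close($ch);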
Edit 2: I have looked at the stats of my nodes with Marvel & Kibana, and nothing there indicates a slowdown: 20-40% of JVM heap memory is used, and there is almost no latency (between 0.1ms and 0.5ms). This confirms there are more than enough resources, and I see no correlations or hints pointing to the cause of the slow requests.
After a lot of testing, these are now my definitive results:
The larger the response from Elasticsearch, the more likely a slow request becomes. Many small responses have a MUCH smaller chance of being exceptionally slow than one large response.
Bombarding Elasticsearch with simple GET requests in parallel reduces the likelihood of slow responses.
When using a simple search for one keyword over and over again, Elasticsearch reports in the response that it "took" 2-3ms, even when the response takes 200ms to reach my application. And here too, the larger the response, the higher the chance of a slow response: a 1KB response is never slow when I run loops of requests, a 2.5KB response is only slightly slow (30ms) in very few instances, and a 10KB response always has up to 1% of slow requests, taking up to 200ms.
I have taken into account that it might be a network "problem", especially since Elasticsearch thinks it is fast even when it is slow. But that would be a strange root cause, because my setup is so standard (Debian Jessie). Also, neither keep-alive connections nor TCP_NODELAY do anything to improve this.
Does anybody know how to find the root cause, and what could possibly be happening?

I finally found the reason for the measurable slow responses: it was the network driver, or maybe even the hardware implementation on the network card.
When running the tests from the node itself, the slow responses disappeared. I also noticed that the older servers (8 years old, compared to the only-2-years-old newer ones) had no slow responses when running the tests on them. This indicated the requesting server was at fault, not the responding ES server - and that the network itself was fine, because only the "new" servers had this problem.
I went down the rabbit hole of TCP/network settings and found ethtool, which shows the network configuration and also allows changing it. I learned there is something called "offloading", where a lot of network operations (especially splitting requests and responses into segments) are offloaded to the network card, and tried the following command to disable all offloading:
ethtool -K eth1 tx off rx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off rxhash off
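(For reference: the lowercase -k option, as opposed to the uppercase -K used above, only shows the current offload settings without changing anything, so you can verify the state before and after:)

    ethtool -k eth1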
Afterwards, my test of 1000 identical searches against Elasticsearch was as fast as expected - no slow requests anymore. My network card (Intel® 82574L Dual port GbE LAN on a SuperMicro X9SRL-F, running the e1000e driver) seems to do something in hardware which slows down or holds back responses. The older servers run the tg3 driver - offloading is enabled on them too (according to ethtool), but there it does not cause these delayed responses. Disabling offloading had no noticeable effect on CPU load, which is probably to be expected with any modern CPU.
With the new settings I was able to lower the number of slow pages caused by slow Elasticsearch responses to 0.07%, where before it was about 1%. I also noticed that using Nginx as a reverse proxy for Elasticsearch caused some slow responses of its own, even though they were few - usually about 3-5 responses out of every 150'000 were above 50ms. Without Nginx, querying Elasticsearch directly, I am now unable to reproduce any slow requests at all, even at large scale.
UPDATE 11/2017
After updating to Debian Stretch and running the server with kernel 4.9, all remaining "slow requests" disappeared. So this problem seems to be at least partly rooted in older Linux kernels.

Related

AWS EC2: Internal server error when huge number of api calls at the same time from jmeter

We have made the backend of a mobile app in Laravel and MySQL. The application is hosted on AWS EC2 and uses an RDS MySQL database.
We are stress testing the app using JMeter. When we send up to 1000 API requests from JMeter, it seems to work fine. However, when we send more than (roughly) 1000 requests in parallel, JMeter starts getting internal server errors (500) as the response for many requests, and the percentage of 500 errors increases as we increase the number of requests.
Normally we would expect that if we increase the number of requests, they should be queued and responses should slow down if the server runs out of resources. We also monitored the resources on the server, and they never reached even 50% of what is available.
Is there any timeout setting or other setting that I could tweak so that we don't get internal server errors before reaching 80% resource usage?
A 500 is the externally visible symptom of some sort of failure in the server delivering your API. You should look at that server's error log to see the details of the failure.
If you are using PHP scripts to deliver the API, your MySQL (RDS) server may be running out of connections. Here's how that might happen.
A PHP-driven web server under heavy load runs a lot of PHP instances. Each PHP instance opens one or more connections to the MySQL server. When (php instances) × (connections per instance) gets too large, the MySQL server starts refusing further connections.
Here's what you need to do: restrict the number of PHP instances your web server is allowed to use at a time. When you restrict that number, incoming requests will queue up (in the TCP listen queue of your OS's network stack). Then, when an instance becomes available, it will serve the next item in the queue.
Apache has a MaxRequestWorkers parameter, with an extremely large default value of 256. Try setting it much lower, for example 32, and see whether your problem changes.
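For example, with the prefork MPM the relevant bit of the Apache configuration might look like this sketch (the value 32 is just the starting point suggested above, not a recommendation for every setup):

    <IfModule mpm_prefork_module>
        # Cap simultaneous PHP-running workers; excess requests wait in the
        # OS TCP listen queue instead of each opening more MySQL connections.
        MaxRequestWorkers 32
    </IfModule>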
If you can shrink the number of request workers, you may paradoxically improve high-load performance: serializing many requests often yields better throughput than trying to run them all in parallel.
The same goes for the number of active connections to your MySQL server. It obviously depends on the nature of your queries, but generally speaking fewer concurrent queries means better performance. So you won't solve a real-world problem by adding MySQL connections.
You should be aware that the kind of load imposed by server-hammering tools like JMeter is not representative of real-world load. 1000 simultaneous JMeter operations without failure is a very good result. If your load-testing setup is robust and powerful enough, you will always be able to bring your server system to its knees, so deciding when to stop is an important part of a load-testing plan. If this were my system, I would stop at 1000 for now.
For your app to be robust in the field, you should probably program it to respond to a 500 status by waiting a random amount of time and trying again.
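A rough sketch of that retry logic in PHP (the function name and delay bounds are made up for illustration):

    function callApiWithRetry($url, $maxRetries = 3) {
        for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $body = curl_exec($ch);
            $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);
            if ($status !== 500) {
                return $body;  // success, or a non-retryable error
            }
            // wait a random, growing amount of time before trying again
            usleep(mt_rand(100, 500) * 1000 * ($attempt + 1));
        }
        return false;  // still failing after all retries
    }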

Apache server slow when high HTTP API call

I am running an HTTP API which should be called more than 30,000 times per minute simultaneously.
Currently I can call it 1,200 times per minute: all requests complete and get a response immediately.
But if I call it 12,000 times per minute simultaneously, it takes 10 minutes to complete all the requests, and during those 10 minutes I cannot browse any webpage on the server - it is very slow.
I am running CentOS 7
Server Specification
Intel® Xeon® E5-1650 v3 Hexa-Core Haswell,
RAM 256 GB DDR4 ECC RAM,
Hard Drive: 2 x 480 GB SSD (Software RAID 1),
Connection 1 Gbit/s
API: a simple PHP script that echoes the timestamp:
echo time();
I checked with the top command; there is no load on the server.
Please help me with this.
Thanks
Sounds like a congestion problem.
It doesn't matter how quick your script/page handling is: if the next request comes in within the execution time of the previous one, it is going to use resources (CPU, RAM, disk, network traffic and connections) and make everything running in parallel slower.
There are multiple things you could do, but you need to figure out what exactly the problem is for your setup and decide if the measure produces the desired result.
If the core problem is that resources get hogged by parallel processes, you could lower the connection limits so more connections go into wait mode, which keeps more resources available for actually handing out pages instead of congesting everything even more.
Take a look at this:
http://oxpedia.org/wiki/index.php?title=Tune_apache2_for_more_concurrent_connections
If the server accepts connections quicker than it can handle them, you are going to have a problem whichever setting you change; it should start dropping connections at some point. If you cram French baguettes down its throat quicker than it can chew, it is going to suffocate either way.
If the system gets overwhelmed on the network side of things (transfer speed limit, the maximum possible number of concurrent connections for the OS, etc.), then you should consider using a load balancer. Only after the load balancer confirms the server has the capacity to actually take care of the page request will it send the user on.
This usually works well when you do any kind of processing which slows down page loading (server-side code execution, large volumes of data, etc.).
Optimise performance
There are many ways to execute PHP code on a web server, and I assume you use Apache. I am no expert, but there are modes like CGI and FastCGI, for example, which can greatly enhance execution speed, and tweaking the settings connected to these can also show you what is happening. It could, for example, be that you use too few PHP threads to handle that number of concurrent connections.
Have a look at something like this for example
http://blog.layershift.com/which-php-mode-apache-vs-cgi-vs-fastcgi/
There is no 'best fit for all' solution here. To fix it, you need to figure out what the bottleneck for the server is, and act accordingly.
12,000 calls per minute == 200 calls per second.
You could limit your test case to a multiple of those 200 and increase/decrease it while changing settings. Your goal is to dish out that number of requests in as short a time as possible, thus ensuring the congestion never occurs.
That said: consequences.
When you implement changes to optimise the maximum number of page loads, you are inadvertently going to introduce other conditions. For example, if maximum RAM usage by Apache is the problem, then upping that limit will ensure better performance, but heightens the chance that the OS runs out of memory when other processes also want to claim more memory.
Adding a load balancer adds another possible layer of failure and possible slowdowns. Yes, you prevent congestion, but is it worth the slowdown caused by the rerouting?
Upping performance will increase the load on the system, making it possible to accept more concurrent connections - so somewhere along the line a different bottleneck will pop up. High traffic on any process could always end in said process crashing. Apache is a very well-built web server, so in theory it should protect you against that problem; however, tweaking settings wrongly could still cause crashes.
So experiment with care and test before you use it live.

Website under huge traffic PHP + MySQL

Let me show what problem I'm dealing with:
website powered by Apache 2.2 + PHP 5.x + MySQL 5.1.x
peak traffic = 2,000 unique visitors/min = 5-8k pageviews/min
normal traffic = 2,000 unique visitors/day
website works well while under normal traffic
website lags while under peak traffic
my server's CPU load is pretty high under peak traffic (because of the MySQL/PHP processes), so my website is lagging.
Normal state: server responds in 0.1-0.4 sec/pageview. The PHP code is optimized to fetch and process all data from the database and output the HTML within this time (call it server response time).
Peak traffic state: server responds in 2-5 sec/pageview - a longer response time than I'm happy with. I don't want my visitors to wait that long for a requested page.
What I'm doing now: my current way of dealing with this problem is a local cache system. I write a local cache file (stored on disk) with the SQL results, valid for about 10 minutes - so I don't have to run the same SQL query with every request.
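The cache itself is nothing fancy - conceptually it works like this sketch (paths and function names are illustrative, not my actual code):

    function cachedQuery(PDO $pdo, $sql) {
        $file = '/var/cache/site/' . md5($sql) . '.cache';
        if (is_file($file) && filemtime($file) > time() - 600) {  // 600s = 10 min
            return unserialize(file_get_contents($file));
        }
        $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
        file_put_contents($file, serialize($rows), LOCK_EX);
        return $rows;
    }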
My website is http://www.lechaton.cz/.
Is there any better way to deal with peak traffic or to optimize CPU utilization?
Thanks all for your time and advice!
I've been working through the whole weekend, testing, testing again, and comparing methods and solutions.
Nginx solution (replace LAMP with LNMP)
I'm still using Apache 2.2, but even with nginx (thanks to Bondye) there is not a big difference.
I tried LNMP on Debian Wheezy, but again without a big difference from LAMP.
For static files nginx is indeed faster.
Nginx with PHP-FPM is twice as fast, which could be a solution for some cases, but it does not solve my issues 100%.
Tune up your MySQL settings and MySQL queries
Tune up your MySQL server with better caching and buffering. Also check your max_connections and memory usage.
But the most important thing is to optimize your queries for best performance. Even if your queries perform well with the MySQL query cache, heavy traffic will bring your server down within minutes. YOU HAVE TO CACHE your output!!
Tune up Apache
Tune up your Apache mpm-prefork:
Reduce KeepAliveTimeout to 5 seconds max (default = 30)
Keep MaxClients set to a correct number (it depends on your RAM, and on the max-processes and max-servers directives) - see the config sketch below
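As an illustration only (Apache 2.2 prefork; the numbers depend entirely on your RAM and the size of your processes):

    KeepAlive On
    KeepAliveTimeout 5
    <IfModule mpm_prefork_module>
        StartServers          5
        MinSpareServers       5
        MaxSpareServers      10
        MaxClients          150
        MaxRequestsPerChild 4000
    </IfModule>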
Final conclusion
1) content cache = rule #1
I've found the best solution is: cache as much content as you can. There is no reason to generate the same output for every user when there is the option to display cached content. It's much faster and it saves your resources.
2) nginx for static content
You can use nginx to perform best with static files and content, there is much lower cpu load with multiple processes.
With PHP-FPM your code speeds up twice over (maybe a bit more). But I can't consider it the final solution.
3) test your website with benchmark tools
I've used siege, ApacheBench (ab) and mysqlslap.
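For example, ApacheBench can replay concurrent load like this (the URL and numbers are placeholders):

    ab -n 10000 -c 100 http://www.example.com/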
These steps helped me reduce the CPU load on my brand-new server, speed up my server response, and balance my peak traffic during big events.
Hope someone will find it helpful.

Apache Server Slots, Memory, KeepAlive, PHP and SSL - how to speed up

On a Debian web server (VPS) with a good CPU, 6 GB RAM, and a fast backbone Internet connection, I run a PHP application. PHP runs in "prefork" mode (incl. APC opcache), because whenever you search for PHP and the MPM worker, there are abundant warnings regarding thread safety. The PHP application is quite large, so each server process requires about 20 to 30 MB RAM. Sensitive data is processed by the application, therefore all connections to the Apache server are SSL-encrypted.
Typically, the application shows no or few images (about 1-3 files incl. CSS and JS per request) and each user sends a new request about every minute (30 sec. to 4 minutes, depending on the user).
Recently, this application faced a big storm of user requests (that was planned, not a DoS - about 2,500 concurrent users). While the CPU did fine (<50% use), my server quickly ran out of slots. The point is that - in prefork mode - each slot requires memory, and 6 GB is just enough for a "MaxClients" of about 200 slots.
Problem 1: According to the Apache server-status, most slots were occupied "..reading.." - sometimes reading for 10 seconds or more, while the PHP processing takes 0.1 to 2 seconds. Little data is sent by the users, so I guess this is actually the SSL handshake. It occupies lots of slots, of course. (I also enabled and configured mod_reqtimeout to drop very slow clients, and - following http://unhandledexpression.com/2013/01/25/5-easy-tips-to-accelerate-ssl/ - used SSLHonorCipherOrder to prefer faster ciphers; SSLCertificateChainFile is also transmitted.)
Problem 2: If I enable KeepAlive (only 1 or 2 seconds) to reduce the SSL overhead, slots stay open and are therefore occupied twice as long as the PHP processing would require.
Problem 3: If I actually wanted to serve 2,500 users and use KeepAlive to speed up SSL, I would need 2,500 slots. However, I won't have a machine with 32 GB RAM.
With enough users on the server to test its limits, I was stuck at about 110 requests per second, at about 50% CPU load on a quad-core system (max. 400%) - fewer req/sec if I (re-)enabled KeepAlive. 110 req/sec on a modern web server seems ridiculous! I cannot believe that this is actually all that Apache, PHP and SSL can perform.
Is there a major flaw in my thinking? Am I hitting a basic limitation of the prefork mode? Did I ignore the obvious? Is SSL actually such a performance-eater? Thanks for any hints!
I'm the author of that article about SSL performance. I don't think the handshake is responsible for the 8+ seconds on reads. You can get useful information by using http://www.webpagetest.org/ - the handshake is done when a request is marked as "connected".
My guess would be that the slow processing of the PHP app with a lot of concurrent users can make some users wait a lot more.
Here are some ideas to get better performance:
I don't think KeepAlive would be a good idea if each client only does a request every minute.
You could enable SSL session tickets to reduce the handshake overhead (see the config sketch after this list).
MPM-Worker works fine for a lot of different setups, so I encourage you to try it out.
caching will probably not help you if the clients receive a different response every time.
you should test PHP-FPM, which could speed up the PHP code.
also, test APC, to cache precompiled PHP code.
I don't know anything about the architecture of your app, but you could defer sending the results: receive the data from the client, send an immediate answer ("processing data..." or something like that), process the data in a background process, and then, on the next request, send the calculated answer.
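Regarding the session resumption idea in the list above: in Apache's mod_ssl, session resumption is driven by the session cache, roughly like the following sketch (the shmcb path and sizes are Debian-style defaults - adjust to your setup):

    SSLSessionCache shmcb:/var/run/apache2/ssl_scache(512000)
    SSLSessionCacheTimeout 300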

Redis request latency

I'm using Redis (2.6.8) with php-fpm and the phpredis driver, and I have some trouble with Redis latency. Under certain load, the first request to Redis from our application takes about 1-1.5s, and redis-cli --latency shows the same latency.
I've already checked the latency guide.
We use redis on the same host with Unix sockets
slowlog has no entries longer than 5ms
we don't use AOF
redis takes about 3.5GB of the 16GB memory available (I suppose that's not too much)
our system is not swapping
there is no other process doing disk I/O
I'm using persistent connections, and the number of connected clients varies from 5 to 25 (sometimes spiking to 60-80).
Here is the graph.
It looks like problems starts when there are 20 or more simultaneously connected clients.
Can you help me figure out where the problem is?
Update
I investigated the problem, and it seemed like Redis, for some reason, did not get enough processor time to operate properly.
I thoroughly checked the communication between php-fpm and Redis with the help of a network sniffer. Redis received the request over TCP but sent the answer back only after one and a half seconds. This clearly signified that the problem was inside Redis: it could not process so many requests under the given conditions (possibly processor starvation, as the CPU was only 50% loaded for the whole system).
The problem was resolved by moving Redis to another server that was nearly idle. I suppose we could have played with the Linux scheduler to make it work on the same server, but we have not done that yet.
Bear in mind that Redis is single-threaded. If the operations that you're doing err on the processor-intensive side, your requests could be blocking on each other. For instance, if you're doing HVALS against hashes with very large values, you're going to make all of your clients wait while you pull out all that data and copy it to the output buffer.
Part of what you need to do here (regardless of whether this is the issue) is to look at all of the commands you're using and determine the complexity of each. If you're doing a bunch of O(N) commands against very large amounts of data, it's not impossible that you're simply doing too much stuff at a time.
TL;DR: Nobody here can debug this issue with real certainty without knowing which commands you're using and what your data looks like. But you can look up the time complexity of each method you're using and make sure it's reasonable.
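For instance, if large hashes turn out to be the culprit, an incremental scan keeps every single call small. A sketch using phpredis (the socket path and key name are made up):

    $redis = new Redis();
    $redis->connect('/var/run/redis/redis.sock');  // unix socket, as in the question
    $redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);
    $it = NULL;
    // instead of one blocking HVALS/HGETALL, walk the hash in small steps
    while ($entries = $redis->hScan('big:hash', $it, '*', 100)) {
        foreach ($entries as $field => $value) {
            // process each field/value pair here
        }
    }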
I ran across this in researching an issue I'm working on but thought it might help here:
https://groups.google.com/forum/#!topic/redis-db/uZaXHZUl0NA
If you read through the thread there is some interesting info.
