I operate an online browser game that is very AJAX/database dependent, and the problem I am encountering is excessively high latency during peak hours.
I've created a simple AJAX ping that checks the server in a per-second loop, and the execution/response times of the 5 most recent pings are averaged into a "Connection Speed" that is displayed on the screen.
Most of the time this latency records anywhere from 100-350ms, depending on internet speed, the client's other running webpages, and various other things. However, during peak hours on my server, namely 10PM-11PM EST, the latency becomes so bad that my AJAX functions stop working. The latency during these times can be around 2000ms, with some people seeing it as high as 6800ms.
My question is: what would be the most likely cause of this? Is it a hardware issue on my server? Is it just infeasible to create a browser game powered purely by AJAX? During these times, I often encounter issues on the server itself, with my control panel returning many "Cannot allocate memory for selected task" errors, yet when I run free over SSH, not even 10% of the RAM is in use.
You are experiencing contention somewhere in your web app or database. This can be in so many places and therefore has so many possible resolutions that it is impossible to list them. Some of the things you can think about:
No threads available to handle incoming requests because they are making synchronous calls to the database, which locks each thread until the database returns, increasing latency
Contention at the database level. Are you using partitioning for your data to support true concurrency?
Are you serving static content through your web app which could be retrieved as a directly addressable resource?
Are you load balancing your web app?
Are you using caching on the web app? (See the sketch below for one approach.)
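On the caching point, here is a minimal sketch of what that could look like, assuming the APCu extension is available and that the hypothetical function load_player_state() is where your per-ping database query currently lives:
<?php
// Hypothetical example: cache the result of the expensive per-ping query
// for a few seconds so concurrent AJAX pings do not all hit the database.
function get_player_state(int $playerId): array
{
    $key = "player_state_" . $playerId;

    $cached = apcu_fetch($key, $success);
    if ($success) {
        return $cached;                            // served from the in-memory cache
    }

    $state = load_player_state($playerId);         // your existing DB query (assumed)
    apcu_store($key, $state, 5);                   // cache for 5 seconds
    return $state;
}
Even a very short TTL like this can collapse hundreds of identical per-second pings into a single database hit.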
It's a bit like "how long is a piece of string?"
Hope this helps some.
Related
I am running an HTTP API which should be called more than 30,000 times per minute, with many calls arriving simultaneously.
Currently I can call it 1,200 times per minute. If I make 1,200 calls per minute, all the requests complete and get a response immediately.
But if I make 12,000 calls per minute simultaneously, it takes 10 minutes to complete all the requests. And during those 10 minutes, I cannot browse any webpage on the server. It is very slow.
I am running CentOS 7
Server specification:
CPU: Intel® Xeon® E5-1650 v3 Hexa-Core (Haswell)
RAM: 256 GB DDR4 ECC
Hard drive: 2 x 480 GB SSD (software RAID 1)
Connection: 1 Gbit/s
The API is a simple PHP script that echoes the timestamp:
<?php echo time();
I checked the top command; there is no load on the server.
Please help me with this.
Thanks
Sounds like a congestion problem.
It doesn't matter how quick your script/page handling is; if the next request arrives within the execution time of the previous one:
It is going to use resources (cpu, ram, disk, network traffic and connections).
And make everything parallel to it slower.
There are multiple things you could do, but you need to figure out what exactly the problem is for your setup and decide if the measure produces the desired result.
If the core problem is that resources get hogged by parallel processes, you could lower connection limits so more connections go into wait mode, which keeps more resources available for actually handing out a page instead of congesting everything even more.
Take a look at this:
http://oxpedia.org/wiki/index.php?title=Tune_apache2_for_more_concurrent_connections
If the server accepts connections quicker than it can handle them, you are going to have a problem whichever setting you change; it should start dropping connections at some point. If you cram French baguettes down its throat faster than it can open its mouth, it is going to suffocate either way.
If the system gets overwhelmed at the network side of things (transfer speed limit, maximum possible number of concurrent connections for the OS, etc.), then you should consider using a load balancer. Only after the load balancer confirms the server has the capacity to actually take care of the page request will it send the user on.
This usually works well when you do any kind of processing which slows down page loading (server side code execution, large volumes of data etc).
Optimise performance
There are many ways to execute PHP code on a webserver, and I assume you use Apache. I am no expert, but there are modes like CGI and FastCGI, for example, which can greatly enhance execution speed, and tweaking the settings connected to these can also show you what is happening. It could, for example, be that you are using too few PHP threads to handle that number of concurrent connections.
Have a look at something like this for example
http://blog.layershift.com/which-php-mode-apache-vs-cgi-vs-fastcgi/
There is no 'best fit for all' solution here. To fix it, you need to figure out what the bottleneck for the server is, and act accordingly.
12,000 calls per minute == 200 calls per second.
You could limit your test case to a multiple of those 200 and increase or decrease it while changing settings. Your goal is to dish out that number of requests in as short a time as possible, thus ensuring the congestion never occurs. A rough sketch of such a test is below.
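As a minimal sketch only, assuming the endpoint lives at http://your-server/api.php (a placeholder) and that the curl extension is available, something like curl_multi in PHP can fire a batch of roughly 200 concurrent requests and report how long the whole batch took:
<?php
// Hypothetical load-test sketch: fire one batch of concurrent requests
// against the API endpoint and measure how long the whole batch takes.
$url     = 'http://your-server/api.php';    // placeholder URL
$batch   = 200;                             // ~12,000 calls/minute worth of load
$multi   = curl_multi_init();
$handles = [];

for ($i = 0; $i < $batch; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($multi, $ch);
    $handles[] = $ch;
}

$start = microtime(true);
do {
    curl_multi_exec($multi, $running);
    if (curl_multi_select($multi) === -1) {
        usleep(1000);                       // avoid busy-looping if select bails out
    }
} while ($running > 0);
$elapsed = microtime(true) - $start;

foreach ($handles as $ch) {
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);

printf("%d requests completed in %.2f seconds\n", $batch, $elapsed);
Running this while you tweak one setting at a time gives you a repeatable number to compare against, instead of waiting for real traffic to expose the congestion.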
That said: consequences.
When you implement changes to optimise for the maximum number of page loads you want to achieve, you will inadvertently introduce other conditions. For example, if maximum RAM usage by Apache is the problem, then upping that limit will give better performance, but it heightens the chance that the OS runs out of memory when other processes also want to claim more memory.
Adding a load balancer adds another possible layer of failure and possible slowdowns. Yes, you prevent congestion, but is it worth the slowdown caused by the rerouting?
Upping performance will increase the load on the system, making it possible to accept more concurrent connections, so somewhere along the line a different bottleneck will pop up. High traffic on different processes can always end in said process crashing. Apache is a very well-built web server, so it should in theory protect you against that problem, but tweaking settings wrongly could still cause crashes.
So experiment with care and test before you use it live.
I have been using elasticsearch for an e-commerce site for quite some time - not only for search, but also to retrieve product data (/index/type/{id}) to avoid SQL queries.
Generally this works really well and most requests are answered within 1-3ms. But there are some requests which take 100-250ms - just for a GET request like /index/type/{id}, where no actual searching is done and which normally takes 1-2ms. It seems to me that something must be wrong if such a response takes more than 100ms, because the server has a lot of RAM and a fast 6-core CPU, the data is stored on very fast SSDs, there are only 150'000 entries (about 300MB in Elasticsearch), and there is almost no load. Elasticsearch has 5GB of RAM, and there is enough spare RAM for Lucene to cache all entries all the time. Requests are made over a local network with a dedicated switch. The index has only one shard and I am running Elasticsearch 2.3.
I am making the requests from PHP. I have already tried using Nginx as a reverse proxy for Elasticsearch, but this did not solve anything - it happens with and without Nginx in between.
Edit: Slow requests happen about 1% of the time (in relation to total number of requests). I can also reproduce it by just making 1000 requests in PHP to /index/type/{id} in Elasticsearch - always 1% will be really slow, even when using the same ID like /index/type/55 (as long as the ID exists). This also means there is no "cache effect" - after the first request Elasticsearch should have the data "ready", yet the number of slow requests is the same no matter what IDs I request or if I request the same ID over and over.
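For reference, a minimal sketch of the kind of reproduction loop described above (the host, index, type and ID are placeholders), timing each GET with a reused cURL handle so the connection is kept alive, as in the setup described:
<?php
// Hypothetical reproduction sketch: issue 1000 identical GETs against a
// single Elasticsearch document and count how many exceed a threshold.
$url       = 'http://10.0.0.5:9200/index/type/55';  // placeholder host/index/type/id
$requests  = 1000;
$threshold = 0.05;   // 50 ms
$slow      = 0;

$ch = curl_init($url);                        // reusing one handle keeps the connection alive
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TCP_NODELAY, true);  // matches the keep-alive / TCP_NODELAY setup mentioned below

for ($i = 0; $i < $requests; $i++) {
    $start   = microtime(true);
    curl_exec($ch);
    $elapsed = microtime(true) - $start;
    if ($elapsed > $threshold) {
        $slow++;
        printf("request %d took %.1f ms\n", $i, $elapsed * 1000);
    }
}
curl_close($ch);

printf("%d of %d requests were slower than %d ms\n", $slow, $requests, (int)($threshold * 1000));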
Edit2: I have looked at the stats of my nodes with Marvel & Kibana, and nothing indicates slowdowns there: between 20-40% of JVM heap memory is used, and almost no latency (between 0.1ms and 0.5ms). It confirms that there are more than enough resources and I see no correlations or hints for the cause of any slow requests.
After a lot of testing, these are my definitive results:
The larger the response from Elasticsearch, the more likely slow requests are to happen; many small responses are far less likely to be exceptionally slow than one large response.
Bombarding Elasticsearch with simple GET requests reduces the likelihood of slow responses when I run more requests in parallel.
When using a simple search for one keyword over and over again, Elasticsearch tells me in the response that it "took" 2-3ms, even when the response takes 200ms to reach my application. But here too, the larger the response, the higher the chance of a slow response: a 1KB response is never slow when I run loops of requests, a 2.5KB response is only a little slow (30ms) in very few instances, and a 10KB response always has up to 1% of slow requests, taking up to 200ms.
I have taken into account that it might be a network "problem", especially when Elasticsearch thinks it is fast even when it is slow. But it would be a strange root cause, because my setup is so standard (Debian Jessie). Also, keep-alive connections and TCP_NODELAY do nothing to improve this problem.
Does anybody know how to find the root cause, and what could possibly be happening?
I finally found the reason for the measurable slow responses: it was the network driver, or maybe even the hardware implementation on the network card.
When running tests from the node itself, the slow responses disappeared. I also noticed that the older servers (8 years old, compared to the only-2-years-old newer servers) had no slow responses when running tests on them, which indicated the requesting server was at fault, not the responding ES server. It also indicated the network itself was fine, because only the "new" servers had this problem.
I went down the rabbit hole of TCP/network settings and found ethtool, which shows the network configuration and also allows changing it. I learned there is something called "offloading", where a lot of network operations are offloaded to the network card (especially splitting up requests and responses into segments), and tried the following command to disable all offloading:
ethtool -K eth1 tx off rx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off rxhash off
Afterwards my 1000-identical-requests tests against ES were as fast as expected - no slow requests anymore. My network card (Intel® 82574L dual-port GbE LAN on a SuperMicro X9SRL-F, running the e1000e driver) seems to do something in hardware which slows down responses, or holds them back, or whatever. The older servers run the tg3 driver - offloading is enabled on them too (according to ethtool), but it does not cause these delayed responses. Disabling offloading had no noticeable effect on CPU load, which is probably to be expected with any modern CPU.
With the new settings, I was able to lower the number of slow pages caused by slow Elasticsearch responses to 0.07%, where before it was about 1%. I also noticed that using Nginx as a reverse proxy for Elasticsearch introduced some slow responses of its own, though not many - usually about 3-5 responses out of every 150'000 were above 50ms. Without Nginx, querying Elasticsearch directly, I am now unable to reproduce any slow requests at all, even at a grand scale.
UPDATE 11/2017
After updating to Debian Stretch and running the server with kernel 4.9, all remaining "slow requests" disappeared. So this problem seems to be at least partly rooted in older Linux kernels.
My client has their web hosting account on a shared server, and the account has been suspended because it was "causing critical server overload". I have looked at the code; it is functionally programmed PHP that uses a lot of database queries. I have looked through them and most are "SELECT *". The database has tables with 10 or more rows and more than 1000 records.
I was wondering if the cause could be all the SQL queries not being freed up, but I'm not sure when "script execution" finishes - is it after the function has finished executing, or after the whole page has been rendered? Could it be the size of the tables (structure or records)? Does anyone have any other ideas?
It really depends on the kind of package your client was on and what type of script it is - custom coded, or a standard script like WordPress.
"Causing critical server overload" could be a multitude of things:
High memory usage: the script is not using a singleton model, or it's assigning huge amounts of data to arrays or variables, or including lots of files - basically bad design and code smell.
High CPU: insanely long scripts, long iterated loops with complex calculations in between, or infinite loops (sockets), etc., on each page view.
High network traffic: screen scrapers such as a crawler that requests large amounts of content from other sites, scripts that grab external content a lot, or something like a torrent tracker.
High disk usage: constantly bombarding the server's I/O stack (writing to and reading from the disk constantly).
A script with lots of database queries could fall into: high disk usage (reading) + high memory usage (iterating the result) + high CPU (doing stuff with the result).
You should use a tool to profile the script's performance locally, such as Xdebug or PQP, and find out what's really happening.
If your client is serious about their site, then they should invest in a VPS.
Make sure you are closing your SQL connections properly. If you are doing a lot of queries at once, it might be more efficient to leave the connection open for longer periods; or, if you are not closing them after each query, maybe try doing that. I must say 10 tables is not a lot, and it would surprise me if this were overloading the shared server. A rough sketch of both points is below.
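As a sketch only (the credentials, table and column names are made up), this shows selecting just the columns you need instead of "SELECT *", freeing the result set, and closing the connection explicitly instead of relying on script shutdown:
<?php
// Hypothetical example using mysqli: select only needed columns,
// free the result set, and close the connection explicitly.
$db = new mysqli('localhost', 'user', 'password', 'shop');   // placeholder credentials

// Instead of: SELECT * FROM products
$result = $db->query('SELECT id, name, price FROM products WHERE active = 1');

while ($row = $result->fetch_assoc()) {
    // ... do something with $row ...
}

$result->free();   // release the result set's memory as soon as possible
$db->close();      // close the connection instead of waiting for script shutdown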
The first page I load from my site after not visiting it for 20+ mins is very slow. Subsequent page loads are 10-20x faster. What are the common causes of this symptom? Could my server be sleeping or something when it's not receiving http requests?
I will answer this question generally because I'm sure it's something that confuses a lot of newcomers.
The really short answer is: caching.
Just about every program in your computer uses some form of caching to remember data that has already been loaded/processed recently, so it doesn't have to do the work again.
The size of the cache is invariably limited, so stuff has to be thrown out. And 99% of the time the main criterion for expiring cache entries is: how long ago was this last used?
Your operating system caches file data that is read from disk
PHP caches pages and keeps them compiled in memory
The CPU caches memory in its own special faster memory (although this may be less obvious to most users)
And some things that are not actually a cache work in the same way as a cache:
Virtual memory, aka swap: when there is not enough memory available for certain programs, the operating system has to make room for them by moving chunks of memory onto disk. On more recent operating systems, the OS will do this just so it can make the disk cache bigger.
Some web servers like to run multiple copies of themselves, and share the workload of requests between them. The copies individually cache stuff too, depending on the setup. When the workload is low enough the server can terminate some of these processes to free up memory and be nice to the rest of the computer. Later on if the workload increases, new processes have to be started, and their memory loaded with various data.
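If the cold start after ~20 minutes of inactivity bothers you, one common workaround (a sketch only; the URL and interval are assumptions) is to keep the caches and spare worker processes warm by requesting the page periodically from a cron job:
<?php
// Hypothetical warm-up script, run from cron every few minutes, e.g.:
//   */5 * * * * php /path/to/warmup.php
// Requesting the page keeps the opcode cache, disk cache and spare
// web-server workers warm so the first real visitor is not penalised.
$url = 'https://example.com/';     // placeholder URL

$start = microtime(true);
$body  = @file_get_contents($url); // ignore transient failures for a warm-up ping
$ms    = (microtime(true) - $start) * 1000;

error_log(sprintf('warmup: %s -> %s in %.0f ms',
    $url, $body === false ? 'FAILED' : strlen($body) . ' bytes', $ms));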
It's probably not sleeping. It just hasn't been visited for a while and has released its resources; it takes time to get it started again.
If the site is visited frequently by many users, it should respond quickly every time.
It sounds like it could be caching. Is the server running on the same machine as your browser? If not, what's the network configuration (same LAN, etc...)?
I have a game running on N EC2 servers, each with its own players inside (let's assume it is a self-contained game inside each server).
What is the best way to develop a frontend for this game, allowing me to have near real-time information on all the players on all servers?
My initial approach was:
Have a general-purpose shared-hosting PHP website polling data from each server (1 socket for each server). Because most shared solutions don't really offer permanent sockets, this would require me to create and process a connection every 5 seconds or so. Because there isn't a cron job with that granularity, I would end up using the request of one unfortunate client to make this update. There is so much wrong here; let's consider this the worst-case scenario.
The best scenario (I guess) would be to create a small EC2 instance with some Python/Ruby/PHP web-based frontend, with a server application designed just for polling the game servers and saving the data into the website database. Although this should work fine, I was looking for a solution where I don't need to spend that much money (even a micro instance is expensive for such a pet project).
What's the best and cheapest solution for this?
Is there a reason you can't have one server poll the others, stash the results in a JSON file, and then push that file to the web server in question? The clients could then use AJAX to update the listings in near real time pretty easily.
If you don't control the game servers, I'd pass the work of updating the JSON off to one of the random client requests. It's not as bad as you think, though.
Consider the following:
Deliver the (now expired) data to the client, including a timestamp.
Call flush(). (Test to make sure the page is fully rendered; you may need to send whitespace or something to fill the buffer, depending on how the webserver is configured. Appending flush(); sleep(4); echo "hi"; to a PHP script should be an easy way to test.)
Call ignore_user_abort() (http://php.net/manual/en/function.ignore-user-abort.php) so your script will continue executing regardless of what the user does.
Poll all the servers and update your file.
The client waits a suitable amount of time before retrieving the updated stats via AJAX.
Yes, that client ends up with a request that takes a long time, but it doesn't affect their page load, so they might not even notice. A rough sketch of this pattern is below.
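As a minimal sketch under the assumptions above (the server list, the stats.json path, and the fetch_server_stats() helper are all placeholders for whatever your game servers actually expose):
<?php
// Hypothetical sketch of the "piggyback on one client request" pattern:
// respond to the visitor first, then keep running to refresh the stats file.

// 1. Serve the current (possibly stale) stats to this visitor.
header('Content-Type: application/json');
echo file_get_contents(__DIR__ . '/stats.json');   // placeholder output path

// 2. Finish the response so the visitor is not kept waiting.
ignore_user_abort(true);      // keep running even if the user navigates away
flush();                      // push the buffered output to the client
// (depending on the web server you may also need ob_end_flush() or some padding)

// 3. Now poll the game servers and rewrite the stats file for the next visitor.
$servers = ['10.0.0.11', '10.0.0.12'];             // placeholder server list
$all = ['updated' => time(), 'servers' => []];
foreach ($servers as $host) {
    $all['servers'][$host] = fetch_server_stats($host);   // hypothetical helper
}
file_put_contents(__DIR__ . '/stats.json', json_encode($all), LOCK_EX);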
You don't provide the information needed to make a decision on this. It depends on the number of players, the number of servers, the number of games, communication between players, the amount of memory and CPU needed per game/player, the delay and transfer rate of the communication channels, the geographical distribution of your players, the update rate needed, the allowed movement of players, and mutual visibility. A database should not initially be part of the solution, as it only adds extra delay and complexity; make it work in real time first.
Really cheap would be to use netnews for this.