Here is the information about my CPU architecture:
root@jai [~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 86
Model name: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz
Stepping: 3
CPU MHz: 2099.998
BogoMIPS: 4199.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
The performance of my website was struggling: it sometimes worked intermittently, not loading pages completely one moment but then loading all pages fine a few seconds later. It was totally intermittent, and the same thing happened from different computers and phones in different locations. I suspect the problem was caused by exceeding my CPU capacity to handle the traffic and the processes that I was running at that moment. So I ran the following command while I was experiencing the problem. See the results:
root@cup [~]# ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | more
PID PPID CMD %MEM %CPU
26468 1499 php-fpm: pool m_jaimemontoya_n 1.0 57.3
26463 1499 php-fpm: pool m_jaimemontoya_n 0.7 42.3
26553 1499 php-fpm: pool m_jaimemontoya_n 0.6 38.5
26502 1499 php-fpm: pool m_jaimemontoya_n 0.2 35.6
26190 1499 php-fpm: pool m_jaimemontoya_n 0.3 29.4
18242 1499 php-fpm: pool m_jaimemontoya_n 1.1 22.8
19045 1499 php-fpm: pool m_jaimemontoya_n 1.0 20.6
18437 1499 php-fpm: pool m_jaimemontoya_n 0.6 20.2
18269 1499 php-fpm: pool m_jaimemontoya_n 1.1 18.5
18289 1499 php-fpm: pool m_jaimemontoya_n 0.9 13.1
19042 1499 php-fpm: pool m_jaimemontoya_n 1.1 11.9
26906 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 7.0
8546 8014 /usr/sbin/mysqld --basedir= 1.1 6.7
26872 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 6.0
26877 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 6.0
26869 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 5.0
26875 21851 /usr/sbin/exim -qG 0.0 5.0
26885 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 5.0
26891 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 5.0
26895 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 5.0
26888 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 4.0
26903 1 /usr/sbin/exim -Mc 1jTsTl-0 0.0 4.0
26880 26875 /usr/sbin/exim -qG 0.0 3.0
26881 26869 /usr/sbin/exim -Mc 1jTsTl-0 0.0 3.0
26882 26872 /usr/sbin/exim -Mc 1jTsTl-0 0.0 3.0
26899 26888 /usr/sbin/exim -Mc 1jTsTl-0 0.0 3.0
17419 1499 php-fpm: pool jaimemontoya_com 0.0 2.5
26814 1 /usr/sbin/exim -Mc 1jTsTj-0 0.0 2.5
26849 1 /usr/sbin/exim -Mc 1jTsTk-0 0.0 2.5
21290 1499 php-fpm: pool jaimemontoya_com 0.0 2.1
14959 1499 php-fpm: pool jaimemontoya_com 0.0 2.0
16122 1499 php-fpm: pool jaimemontoya_com 0.0 2.0
17085 1499 php-fpm: pool jaimemontoya_com 0.0 2.0
22367 1499 php-fpm: pool jaimemontoya_com 0.0 2.0
26826 1 /usr/sbin/exim -Mc 1jTsTk-0 0.0 2.0
26859 26849 /usr/sbin/exim -Mc 1jTsTk-0 0.0 2.0
26884 26877 /usr/sbin/exim -Mc 1jTsTl-0 0.0 2.0
26902 26895 /usr/sbin/exim -Mc 1jTsTl-0 0.0 2.0
18723 1499 php-fpm: pool jaimemontoya_com 0.0 1.7
21456 1499 php-fpm: pool jaimemontoya_com 0.0 1.7
21975 1499 php-fpm: pool jaimemontoya_com 0.0 1.7
13578 1499 php-fpm: pool jaimemontoya_com 0.0 1.6
I am trying to understand how to interpret the %CPU column. If I add up the following values, I get 404.2 as the total:
57.3+42.3+38.5+35.6+29.4+22.8+20.6+20.2+18.5+13.1+11.9+7.0+6.7+6.0+5.0+5.0+5.0+5.0+5.0+4.0+4.0+3.0+3.0+3.0+2.5+2.5+2.5+2.1+2.0+2.0+2.0+2.0+2.0+2.0+2.0+2.0+1.7+1.7+1.7+1.6
Considering that I have 16 CPUs, how should I interpret the 404.2 that results from adding up all of the values in the %CPU column for all processes? I have a server with 30 GB RAM and an 8-core / 16-thread CPU. Thank you.
NOTE: In the past I used to get this warning frequently in my server logs: "WARNING: [pool m_jaimemontoya_com] server reached max_children setting (5), consider raising it". So I raised it to 32. For that reason, you see more than 5 php-fpm processes running simultaneously.
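For reference, that max_children setting is the pm.max_children directive in the FPM pool configuration; a minimal sketch of locating and applying the change, assuming a typical php-fpm layout (the paths and service name may differ on your server, especially under a hosting panel):
# find where the pool sets pm.max_children (pool file locations vary by distro/panel)
grep -Rn 'pm.max_children' /etc/php-fpm.d/ /etc/php/*/fpm/pool.d/ 2>/dev/null
# after editing the value (e.g. pm.max_children = 32), reload the service
service php-fpm reload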
UPDATE 1:
Using man lscpu, I see this:
COLUMNS
CPU The logical CPU number of a CPU as used by the Linux kernel.
I asked AbraCadaver to write his comment as an answer because that was the solution for me. He did not do it, so I am copying his comment below as the answer to my question:
"That's a bad data point, for ps CPU% is: 'CPU usage is currently expressed as the percentage of time spent running during the entire lifetime of a process.' Might look at top or other utility. – AbraCadaver Apr 29 at 20:41"
I have a WordPress site hosted on a DigitalOcean droplet, installed through their one-click installation, that goes down very often.
Here is the URL: sinc.marchespettacolo.it
Almost once a week I have to restart Apache from the console because the site is down, and if I try to connect to it I get a timeout.
I'm not a pro at diagnosing what is causing this error.
Could someone help me work out how to find what's causing this situation and how I can solve the problem?
I also have another droplet where I've installed WordPress with the same process, and it gives no problems!
Some say it could be a problem connected to Apache opening too many processes.
This is the last process log, captured while the site was down.
root@sinc:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 33460 980 ? Ss 04:27 0:01 /sbin/init
root 2 0.0 0.0 0 0 ? S 04:27 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 04:27 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< 04:27 0:00 [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S 04:27 0:12 [rcu_sched]
root 8 0.0 0.0 0 0 ? R 04:27 0:13 [rcuos/0]
root 9 0.0 0.0 0 0 ? S 04:27 0:00 [rcu_bh]
root 10 0.0 0.0 0 0 ? S 04:27 0:00 [rcuob/0]
root 11 0.0 0.0 0 0 ? S 04:27 0:00 [migration/0]
root 12 0.0 0.0 0 0 ? S 04:27 0:00 [watchdog/0]
root 13 0.0 0.0 0 0 ? S< 04:27 0:00 [khelper]
root 14 0.0 0.0 0 0 ? S 04:27 0:00 [kdevtmpfs]
root 15 0.0 0.0 0 0 ? S< 04:27 0:00 [netns]
root 16 0.0 0.0 0 0 ? S< 04:27 0:00 [writeback]
root 17 0.0 0.0 0 0 ? S< 04:27 0:00 [kintegrityd]
root 18 0.0 0.0 0 0 ? S< 04:27 0:00 [bioset]
root 19 0.0 0.0 0 0 ? S< 04:27 0:00 [kworker/u3:0]
root 20 0.0 0.0 0 0 ? S< 04:27 0:00 [kblockd]
root 21 0.0 0.0 0 0 ? S< 04:27 0:00 [ata_sff]
root 22 0.0 0.0 0 0 ? S 04:27 0:00 [khubd]
root 23 0.0 0.0 0 0 ? S< 04:27 0:00 [md]
root 24 0.0 0.0 0 0 ? S< 04:27 0:00 [devfreq_wq]
root 25 0.0 0.0 0 0 ? S 04:27 0:05 [kworker/0:1]
root 27 0.0 0.0 0 0 ? S 04:27 0:00 [khungtaskd]
root 28 3.5 0.0 0 0 ? S 04:27 18:14 [kswapd0]
root 29 0.0 0.0 0 0 ? SN 04:27 0:00 [ksmd]
root 30 0.0 0.0 0 0 ? SN 04:27 0:00 [khugepaged]
root 31 0.0 0.0 0 0 ? S 04:27 0:00 [fsnotify_mark]
root 32 0.0 0.0 0 0 ? S 04:27 0:00 [ecryptfs-kthrea]
root 33 0.0 0.0 0 0 ? S< 04:27 0:00 [crypto]
root 45 0.0 0.0 0 0 ? S< 04:27 0:00 [kthrotld]
root 47 0.0 0.0 0 0 ? S 04:27 0:00 [vballoon]
root 48 0.0 0.0 0 0 ? S 04:27 0:00 [scsi_eh_0]
root 49 0.0 0.0 0 0 ? S 04:27 0:00 [scsi_eh_1]
root 70 0.0 0.0 0 0 ? S< 04:27 0:00 [deferwq]
root 71 0.0 0.0 0 0 ? S< 04:27 0:00 [charger_manager]
root 116 0.0 0.0 0 0 ? S 04:27 0:00 [scsi_eh_2]
root 118 0.0 0.0 0 0 ? S< 04:27 0:00 [kpsmoused]
root 119 0.0 0.0 0 0 ? S 04:27 0:00 [kworker/0:2]
root 126 0.0 0.0 0 0 ? S 04:27 0:00 [jbd2/vda1-8]
root 127 0.0 0.0 0 0 ? S< 04:27 0:00 [ext4-rsv-conver]
root 313 0.0 0.0 19476 0 ? S 04:27 0:00 upstart-udev-bridge --daemon
root 323 0.0 0.0 51340 128 ? Ss 04:27 0:00 /lib/systemd/systemd-udevd --daemon
message+ 326 0.0 0.0 39228 268 ? Ss 04:27 0:00 dbus-daemon --system --fork
root 392 0.0 0.0 43452 560 ? Ss 04:27 0:00 /lib/systemd/systemd-logind
syslog 396 0.0 0.0 255844 360 ? Ssl 04:27 0:01 rsyslogd
root 430 0.0 0.0 15408 4 ? S 04:27 0:00 upstart-file-bridge --daemon
root 542 0.0 0.0 0 0 ? S< 04:27 0:00 [ttm_swap]
root 635 0.0 0.0 0 0 ? S< 04:27 0:00 [kvm-irqfd-clean]
root 771 0.0 0.0 15656 4 ? S 04:27 0:00 upstart-socket-bridge --daemon
root 791 0.0 0.0 15820 116 tty4 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty4
root 795 0.0 0.0 15820 116 tty5 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty5
root 800 0.0 0.0 15820 116 tty2 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty2
root 801 0.0 0.0 15820 116 tty3 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty3
root 804 0.0 0.0 15820 116 tty6 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty6
root 837 0.0 0.0 61364 272 ? Ss 04:27 0:01 /usr/sbin/sshd -D
root 841 0.0 0.0 4368 96 ? Ss 04:27 0:00 acpid -c /etc/acpi/events -s /var/run/acpid.socket
root 843 0.0 0.0 23656 204 ? Ss 04:27 0:00 cron
daemon 844 0.0 0.0 19140 56 ? Ss 04:27 0:00 atd
whoopsie 868 0.0 0.0 335684 792 ? Ssl 04:27 0:00 whoopsie
mysql 916 0.1 2.5 918056 25668 ? Ssl 04:27 1:01 /usr/sbin/mysqld
root 1001 0.0 0.0 25344 376 ? Ss 04:27 0:00 /usr/lib/postfix/master
postfix 1006 0.0 0.0 27572 936 ? S 04:27 0:00 qmgr -l -t unix -u
root 1126 0.0 0.0 15820 116 tty1 Ss+ 04:27 0:00 /sbin/getty -8 38400 tty1
root 1384 0.0 0.0 0 0 ? S< 04:28 0:12 [kworker/u3:1]
root 1416 0.0 0.0 0 0 ? S 04:28 0:00 [kauditd]
root 2853 0.0 0.0 0 0 ? S 05:33 0:00 [kworker/u2:2]
root 3541 0.0 0.0 0 0 ? S 05:59 0:00 [kworker/u2:0]
postfix 6465 0.0 0.0 27408 968 ? S 12:48 0:00 pickup -l -t unix -u -c
root 6508 0.0 0.0 105632 596 ? Ss 12:58 0:00 sshd: root@pts/0
root 6578 0.0 0.1 22404 1484 pts/0 Ss 12:58 0:00 -bash
root 6639 0.0 0.0 0 0 ? S 13:00 0:00 [kworker/u2:1]
root 6673 0.0 1.5 312976 15964 ? Ss 13:00 0:00 /usr/sbin/apache2 -k start
www-data 6677 1.8 2.6 319132 27180 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6678 1.8 2.6 319132 27180 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6679 2.0 3.8 324792 38892 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6680 1.8 3.5 325320 35736 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6681 1.8 2.6 319132 27180 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6688 1.8 2.6 319132 27176 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6690 1.8 2.4 317344 25372 ? S 13:00 0:03 /usr/sbin/apache2 -k start
www-data 6691 1.7 2.6 319132 27176 ? S 13:00 0:02 /usr/sbin/apache2 -k start
root 6700 0.0 0.1 18448 1292 pts/0 R+ 13:03 0:00 ps aux
Any wonderful ideas?
Thanks in advance.
UPDATE
05-Apr-2016
Here is the Apache error log, as requested:
https://gist.github.com/iperdiscount/ff06cf131f7ac1ec31aa28761e36b1c9#file-gistfile1-txt
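Given how much CPU time kswapd0 has accumulated in the listing above, it may be worth checking whether the droplet is simply running out of memory while Apache forks workers; a minimal sketch of what to look at (the apache2 process name comes from the listing, the rest are generic checks):
free -m                          # how much RAM and swap are actually free
vmstat 5 5                       # non-zero si/so columns mean the box is actively swapping
# rough total resident memory used by all Apache workers, in MB
ps -o rss= -C apache2 | awk '{sum+=$1} END {printf "%.0f MB\n", sum/1024}'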
I'm using PHP cURL with nginx as a proxy. Here is my code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);
echo $curl_scraped_page;
After this has been running for some time, the nginx load becomes extremely slow and it sometimes returns error 500.
The log says:
failed (24: Too many open files),
Some more details:
root@proxy-s2:~# ulimit -Hn
4096
root@proxy-s2:~# ulimit -Sn
1024
There is nothing else running on the server, and no other script is using this proxy.
Is it an nginx bug, or what else could it be? How can it be resolved?
I didn't change the default nginx configuration.
Restarting nginx solved the problem (temporarily, I guess).
Here is my nginx.conf:
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 65;
    gzip on;

    server {
        listen 8080;

        location / {
            resolver 8.8.8.8;
            proxy_pass http://$http_host$uri$is_args$args;
        }

        error_page 500 502 503 504 /50x.html;

        location = /50x.html {
            root html;
        }
    }
}
top
top - 09:23:55 up 21:51, 1 user, load average: 0.09, 0.13, 0.08
KiB Mem: 496164 total, 444328 used, 51836 free, 12300 buffers
KiB Swap: 0 total, 0 used, 0 free. 336228 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8 root 20 0 0 0 0 S 0.0 0.0 4:57.56 rcuos/0
4904 nobody 20 0 97796 14128 1012 R 0.0 2.8 4:19.82 nginx
7 root 20 0 0 0 0 S 0.0 0.0 2:11.35 rcu_sched
3 root 20 0 0 0 0 S 0.0 0.0 0:18.50 ksoftirqd/0
832 root 20 0 139208 6808 172 S 0.0 1.4 0:13.11 nova-agent
45 root 20 0 0 0 0 S 0.0 0.0 0:06.21 xenbus
74 root 20 0 0 0 0 S 0.0 0.0 0:03.03 kworker/u30:1
155 root 20 0 0 0 0 S 0.0 0.0 0:02.73 jbd2/xvda1-8
46 root 20 0 0 0 0 R 0.0 0.0 0:02.39 kworker/0:1
57 root 20 0 0 0 0 S 0.0 0.0 0:01.91 kswapd0
1 root 20 0 33448 2404 1136 S 0.0 0.5 0:01.47 init
391 root 20 0 18048 1336 996 S 0.0 0.3 0:00.97 xe-daemon
1034 syslog 20 0 255840 2632 784 S 0.0 0.5 0:00.90 rsyslogd
1107 root 20 0 61364 3048 2364 S 0.0 0.6 0:00.73 sshd
40 root rt 0 0 0 0 S 0.0 0.0 0:00.29 watchdog/0
316 root 20 0 19472 456 252 S 0.0 0.1 0:00.12 upstart-udev-br
6 root 20 0 0 0 0 S 0.0 0.0 0:00.11 kworker/u30:0
1098 root 20 0 23652 1036 784 S 0.0 0.2 0:00.08 cron
7935 root 20 0 105632 4272 3284 S 0.0 0.9 0:00.07 sshd
330 root 20 0 51328 1348 696 S 0.0 0.3 0:00.06 systemd-udevd
7953 root 20 0 22548 3428 1680 S 0.0 0.7 0:00.05 bash
678 root 20 0 15256 524 268 S 0.0 0.1 0:00.04 upstart-socket-
8647 root 20 0 25064 1532 1076 R 0.0 0.3 0:00.03 top
mpstat
root@proxy-s2:~# mpstat
Linux 3.13.0-55-generic (proxy-s2) 07/09/2015 _x86_64_ (1 CPU)
09:22:17 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:22:17 AM all 0.94 0.00 1.63 0.16 0.00 2.16 0.92 0.00 0.00 94.20
iostat
root@proxy-s2:~# iostat
Linux 3.13.0-55-generic (proxy-s2) 07/09/2015 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.94 0.00 3.80 0.16 0.92 94.19
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
xvdc 0.01 0.02 0.00 1710 0
xvda 3.16 4.19 88.56 322833 6815612
Please try the following changes in your limits.conf.
vi /etc/security/limits.conf
For open files:
* soft nofile 64000
* hard nofile 64000
For max user processes:
* soft nproc 47758
* hard nproc 47758
For max memory size:
* soft rss unlimited
* hard rss unlimited
For virtual memory:
* soft as unlimited
* hard as unlimited
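Note that limits.conf is applied to new login sessions through PAM, so a daemon started by the init system may not pick up the new values until it is restarted from a fresh session. A quick sanity check (a sketch; "nobody" is the worker user shown in the top output above, adjust if yours differs):
# open a new shell (or re-login), then check the session limits
ulimit -Hn
ulimit -Sn
# check the limit the nginx worker user would receive
su -s /bin/sh -c 'ulimit -n' nobody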
Just put this at the top of your nginx configuration file:
worker_rlimit_nofile 40000;
events {
worker_connections 4096;
}
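To confirm the directive actually took effect after reloading, you can inspect the running processes; a sketch (the pid-file path is an assumption, adjust it for your distribution):
# effective open-files limit of the nginx master process
cat /proc/$(cat /run/nginx.pid)/limits | grep 'open files'
# how many descriptors each worker is currently holding
for pid in $(pgrep -f 'nginx: worker'); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
done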
I think I found the problem. Here is the nginx error.log:
2015/07/09 14:17:27 [error] 15390#0: *7549 connect() failed (111: Connection refused) while connecting to upstream, client: 23.239.194.233, server: , request: "GET http://www.lgqfz.com/ HTTP/1.1", upstream: "http://127.0.0.3:80/", host: "www.lgqfz.com", referrer: "http://www.baidu.com"
2015/07/09 14:17:29 [error] 15390#0: *8121 connect() failed (111: Connection refused) while connecting to upstream, client: 204.44.65.119, server: , request: "GET http://www.lgqfz.com/ HTTP/1.1", upstream: "http://127.0.0.3:80/", host: "www.lgqfz.com", referrer: "http://www.baidu.com"
2015/07/09 14:17:32 [error] 15390#0: *8650 connect() failed (101: Network is unreachable) while connecting to upstream, client: 78.47.53.98, server: , request: "GET http://188.8.253.161/ HTTP/1.1", upstream: "http://188.8.253.161:80/", host: "188.8.253.161", referrer: "http://188.8.253.161/"
It was a DDoS attack on my proxy, which I stopped by allowing only my IP to access the proxy.
I have found this to be common lately: when you crawl a site and the site identifies you as a crawler, it will sometimes DDoS your proxy until it goes dark.
One example of such a site is amazon.com.
I have a query that looks like this:
SELECT id FROM user WHERE id='47'
The ID is indexed, and reads for this query are always fast according to profiling data, like this:
SET profiling = 1;
SHOW PROFILES;
The queries always execute in around 0.0002 seconds.
However, if I profile the query from the PHP side, like this:
$current = microtime(true);
$data = $conn->query($full_query);
$elapsed = microtime(true) - $current;
Then occasionally, maybe 1 out of 50 of these queries will take something like 0.2 seconds. However, my test script also profiles the query using SET profiling = 1;, and even when the PHP round trip through PDO takes 0.2 seconds, the reported query time is still 0.0002.
Things I know, or know are not causing the issue:
The query isn't slow. When I look at the same query, from the same query run, profiled in PHP and profiled using SET PROFILING, the query is always fast and never logged in the slow query log, even when it shows as taking 0.2 seconds from the PHP side.
This is not skip-name-resolve related: the issue is intermittent, and I already have skip-name-resolve on.
This is not query cache related; the behavior exists in both cases.
This behavior happens even on queries coming out of the cache.
The query doesn't actually select the ID, but I use this query for testing to show that it isn't a disk access issue, since that field is definitely indexed.
This table is only 10-20 MB, with something like a 1 MB index. The machine shows very little load, and InnoDB is not using all of its buffers.
This is tested against a table that has no activity other than my test queries.
Does anyone have any ideas of what else to check? This seems to me to be a networking issue, but I need to be able to see it and find the cause in order to fix it, and I'm running out of places to check next. Any ideas?
I would profile the machine.
You say this occurs ~1 per 50 times, and that each query has a 0.2 sec benchmark. You should be able to put top in a screen, and then run a loop of queries in PHP to load-test the RDBMS and gather performance stats.
You will probably have to run for more than 50 * 0.2 = 10 seconds, since your "1 out of 50" statistic is probably based on hand-running individual queries, based on what I read in your description. Try 30-second and 90-second load tests.
During this time, watch your top process screen. Press P to sort by CPU consumption, so that the most-consuming processes are on top. (Pressing M sorts by memory usage; check the man page for more.)
Look for anything that bubbles to the top during the time(s) of your load-test. You should see something jump higher - however momentarily.
(note, such a process may not reach the top of the list — it need not, but could still introduce enough disk load or other activity to lag the MySQL server)
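A rough sketch of that setup, assuming a hypothetical test_queries.php script that simply runs the query from the question in a loop for the duration of the test:
# terminal 1: sample top in batch mode so any spike is captured to a file
for i in $(seq 30); do top -b -n 1 | head -n 20 >> top_samples.log; sleep 2; done
# terminal 2: drive the load (test_queries.php is your own loop around the query)
php test_queries.php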
I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single-table INSERT/UPDATE/REPLACE statements, never SELECTs. No load, locking, or thread build-up is evident.
I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.
Also Ruled Out:
Server load -- no correlation with high load
Engine -- happens with InnoDB/MyISAM/Memory
MySQL Query Cache -- happens whether it's on or off
Log rotations -- no correlation in events
Good for you to have been using the query profiler already. If you're using MySQL 5.6, you also have access to a lot of new performance measurements in the PERFORMANCE_SCHEMA. This has the capability to measure a lot more detail than the query profiler, and it also measures globally instead of just one session. P_S is reportedly going to replace the query profiler.
To diagnose your issue, I would start by confirming or ruling out a TCP/IP issue. For example, test the PHP script to see if it gets the same intermittent latency when connecting via the UNIX socket. You can do this by connecting to localhost which means the PHP script must run on the same server as the database. If the problem goes away when you bypass TCP/IP, this would tell you that the root cause is likely to be TCP/IP.
If you're in a virtual environment like a cloud hosting, you can easily experience variations in performance because of other users of the same cloud intermittently using up all the bandwidth. This is one of the downsides of the cloud.
If you suspect it's a TCP/IP issue, you can test TCP/IP latency independently from PHP or MySQL. Typical tools that are readily available include ping or traceroute. But there are many others. You can also test network speed with netcat. Use a tool that can measure repeatedly over time, because it sounds like you have good performance most of the time, with occasional glitches.
Another possibility is that the fault lies in PHP. You can try profiling PHP with XHProf to find out where it is spending its time.
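If you want to check the TCP path independently of PHP first, a quick sketch (the user name and password are placeholders, and the timings will include client start-up overhead):
# force a TCP connection to the local server
time mysql --protocol=TCP -h 127.0.0.1 -u testuser -ptestpass -e 'SELECT 1'
# force the UNIX socket for comparison
time mysql --protocol=SOCKET -u testuser -ptestpass -e 'SELECT 1'
# sample raw network latency to a remote DB host over time
ping -c 100 db-host | tail -n 2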
Try to isolate the problem. Run a little script like this:
https://drive.google.com/file/d/0B0P3JM22IdYZYXY3Y0h5QUg2WUk/edit?usp=sharing
... to see which steps in the chain are spiking. If you have ssh2 installed, it'll also return ps axu immediately after the longest-running test-loop to see what's running.
Running against localhost on my home development box, the results look like this:
Array
(
[tests summary] => Array
(
[host_ping] => Array
(
[total_time] => 0.010216474533081
[max_time] => 0.00014901161193848
[min_time] => 9.7036361694336E-5
[tests] => 100
[failed] => 0
[last_run] => 9.8943710327148E-5
[average] => 0.00010216474533081
)
[db_connect] => Array
(
[total_time] => 0.11583232879639
[max_time] => 0.0075201988220215
[min_time] => 0.0010058879852295
[tests] => 100
[failed] => 0
[last_run] => 0.0010249614715576
[average] => 0.0011583232879639
)
[db_select_db] => Array
(
[total_time] => 0.011744260787964
[max_time] => 0.00031399726867676
[min_time] => 0.00010991096496582
[tests] => 100
[failed] => 0
[last_run] => 0.0001530647277832
[average] => 0.00011744260787964
)
[db_dataless_query] => Array
(
[total_time] => 0.023221254348755
[max_time] => 0.00026106834411621
[min_time] => 0.00021100044250488
[tests] => 100
[failed] => 0
[last_run] => 0.00021481513977051
[average] => 0.00023221254348755
)
[db_data_query] => Array
(
[total_time] => 0.075078248977661
[max_time] => 0.0010559558868408
[min_time] => 0.00023698806762695
[tests] => 100
[failed] => 0
[last_run] => 0.00076413154602051
[average] => 0.00075078248977661
)
)
[worst full loop] => 0.039211988449097
[times at worst loop] => Array
(
[host_ping] => 0.00014400482177734
[db_connect] => 0.0075201988220215
[db_select_db] => 0.00012803077697754
[db_dataless_query] => 0.00023698806762695
[db_data_query] => 0.00023698806762695
)
[ps_at_worst] => USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2884 1368 ? Ss Sep19 0:29 /sbin/init
root 2 0.0 0.0 0 0 ? S Sep19 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Sep19 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S Sep19 0:06 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S Sep19 0:00 [migration/0]
root 6 0.0 0.0 0 0 ? S Sep19 0:25 [watchdog/0]
root 7 0.0 0.0 0 0 ? S Sep19 7:42 [events/0]
root 8 0.0 0.0 0 0 ? S Sep19 0:00 [cgroup]
root 9 0.0 0.0 0 0 ? S Sep19 0:00 [khelper]
root 10 0.0 0.0 0 0 ? S Sep19 0:00 [netns]
root 11 0.0 0.0 0 0 ? S Sep19 0:00 [async/mgr]
root 12 0.0 0.0 0 0 ? S Sep19 0:00 [pm]
root 13 0.0 0.0 0 0 ? S Sep19 0:23 [sync_supers]
root 14 0.0 0.0 0 0 ? S Sep19 0:24 [bdi-default]
root 15 0.0 0.0 0 0 ? S Sep19 0:00 [kintegrityd/0]
root 16 0.0 0.0 0 0 ? S Sep19 0:47 [kblockd/0]
root 17 0.0 0.0 0 0 ? S Sep19 0:00 [kacpid]
root 18 0.0 0.0 0 0 ? S Sep19 0:00 [kacpi_notify]
root 19 0.0 0.0 0 0 ? S Sep19 0:00 [kacpi_hotplug]
root 20 0.0 0.0 0 0 ? S Sep19 0:00 [ata/0]
root 21 0.0 0.0 0 0 ? S Sep19 0:00 [ata_aux]
root 22 0.0 0.0 0 0 ? S Sep19 0:00 [ksuspend_usbd]
root 23 0.0 0.0 0 0 ? S Sep19 0:00 [khubd]
root 24 0.0 0.0 0 0 ? S Sep19 0:00 [kseriod]
root 25 0.0 0.0 0 0 ? S Sep19 0:00 [md/0]
root 26 0.0 0.0 0 0 ? S Sep19 0:00 [md_misc/0]
root 27 0.0 0.0 0 0 ? S Sep19 0:01 [khungtaskd]
root 28 0.0 0.0 0 0 ? S Sep19 0:00 [kswapd0]
root 29 0.0 0.0 0 0 ? SN Sep19 0:00 [ksmd]
root 30 0.0 0.0 0 0 ? S Sep19 0:00 [aio/0]
root 31 0.0 0.0 0 0 ? S Sep19 0:00 [crypto/0]
root 36 0.0 0.0 0 0 ? S Sep19 0:00 [kthrotld/0]
root 38 0.0 0.0 0 0 ? S Sep19 0:00 [kpsmoused]
root 39 0.0 0.0 0 0 ? S Sep19 0:00 [usbhid_resumer]
root 70 0.0 0.0 0 0 ? S Sep19 0:00 [iscsi_eh]
root 74 0.0 0.0 0 0 ? S Sep19 0:00 [cnic_wq]
root 75 0.0 0.0 0 0 ? S< Sep19 0:00 [bnx2i_thread/0]
root 87 0.0 0.0 0 0 ? S Sep19 0:00 [kstriped]
root 123 0.0 0.0 0 0 ? S Sep19 0:00 [ttm_swap]
root 130 0.0 0.0 0 0 ? S< Sep19 0:04 [kslowd000]
root 131 0.0 0.0 0 0 ? S< Sep19 0:05 [kslowd001]
root 231 0.0 0.0 0 0 ? S Sep19 0:00 [scsi_eh_0]
root 232 0.0 0.0 0 0 ? S Sep19 0:00 [scsi_eh_1]
root 291 0.0 0.0 0 0 ? S Sep19 0:35 [kdmflush]
root 293 0.0 0.0 0 0 ? S Sep19 0:00 [kdmflush]
root 313 0.0 0.0 0 0 ? S Sep19 2:11 [jbd2/dm-0-8]
root 314 0.0 0.0 0 0 ? S Sep19 0:00 [ext4-dio-unwrit]
root 396 0.0 0.0 2924 1124 ? S<s Sep19 0:00 /sbin/udevd -d
root 705 0.0 0.0 0 0 ? S Sep19 0:00 [kdmflush]
root 743 0.0 0.0 0 0 ? S Sep19 0:00 [jbd2/sda1-8]
root 744 0.0 0.0 0 0 ? S Sep19 0:00 [ext4-dio-unwrit]
root 745 0.0 0.0 0 0 ? S Sep19 0:00 [jbd2/dm-2-8]
root 746 0.0 0.0 0 0 ? S Sep19 0:00 [ext4-dio-unwrit]
root 819 0.0 0.0 0 0 ? S Sep19 0:18 [kauditd]
root 1028 0.0 0.0 3572 748 ? Ss Sep19 0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root 1072 0.0 0.0 13972 828 ? S<sl Sep19 2:13 auditd
root 1090 0.0 0.0 2052 512 ? Ss Sep19 0:00 /sbin/portreserve
root 1097 0.0 0.2 37568 3940 ? Sl Sep19 2:01 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
rpc 1120 0.0 0.0 2568 800 ? Ss Sep19 0:09 rpcbind
rpcuser 1138 0.0 0.0 2836 1224 ? Ss Sep19 0:00 rpc.statd
root 1161 0.0 0.0 0 0 ? S Sep19 0:00 [rpciod/0]
root 1165 0.0 0.0 2636 472 ? Ss Sep19 0:00 rpc.idmapd
root 1186 0.0 0.0 2940 756 ? Ss Sep19 13:27 lldpad -d
root 1195 0.0 0.0 0 0 ? S Sep19 0:00 [scsi_tgtd/0]
root 1196 0.0 0.0 0 0 ? S Sep19 0:00 [fc_exch_workque]
root 1197 0.0 0.0 0 0 ? S Sep19 0:00 [fc_rport_eq]
root 1199 0.0 0.0 0 0 ? S Sep19 0:00 [fcoe_work/0]
root 1200 0.0 0.0 0 0 ? S< Sep19 0:00 [fcoethread/0]
root 1201 0.0 0.0 0 0 ? S Sep19 0:00 [bnx2fc]
root 1202 0.0 0.0 0 0 ? S< Sep19 0:00 [bnx2fc_l2_threa]
root 1203 0.0 0.0 0 0 ? S< Sep19 0:00 [bnx2fc_thread/0]
root 1206 0.0 0.0 2184 564 ? Ss Sep19 1:08 /usr/sbin/fcoemon --syslog
root 1240 0.0 0.0 8556 976 ? Ss Sep19 1:22 /usr/sbin/sshd
root 1415 0.0 0.1 12376 2088 ? Ss Sep19 6:09 sendmail: accepting connections
smmsp 1424 0.0 0.0 12168 1680 ? Ss Sep19 0:02 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
root 1441 0.0 0.0 5932 1260 ? Ss Sep19 0:56 crond
root 1456 0.0 0.0 2004 504 tty2 Ss+ Sep19 0:00 /sbin/mingetty /dev/tty2
root 1458 0.0 0.0 2004 504 tty3 Ss+ Sep19 0:00 /sbin/mingetty /dev/tty3
root 1460 0.0 0.0 2004 508 tty4 Ss+ Sep19 0:00 /sbin/mingetty /dev/tty4
root 1462 0.0 0.0 2004 504 tty5 Ss+ Sep19 0:00 /sbin/mingetty /dev/tty5
root 1464 0.0 0.0 2004 508 tty6 Ss+ Sep19 0:00 /sbin/mingetty /dev/tty6
root 1467 0.0 0.0 3316 1740 ? S< Sep19 0:00 /sbin/udevd -d
root 1468 0.0 0.0 3316 1740 ? S< Sep19 0:00 /sbin/udevd -d
apache 3796 0.0 0.4 32668 9452 ? S Dec16 0:08 /usr/sbin/httpd
apache 3800 0.0 0.4 32404 9444 ? S Dec16 0:08 /usr/sbin/httpd
apache 3801 0.0 0.4 33184 9556 ? S Dec16 0:07 /usr/sbin/httpd
apache 3821 0.0 0.4 32668 9612 ? S Dec16 0:08 /usr/sbin/httpd
apache 3840 0.0 0.4 32668 9612 ? S Dec16 0:07 /usr/sbin/httpd
apache 3841 0.0 0.4 32404 9464 ? S Dec16 0:07 /usr/sbin/httpd
apache 4032 0.0 0.4 32668 9632 ? S Dec16 0:07 /usr/sbin/httpd
apache 4348 0.0 0.4 32668 9460 ? S Dec16 0:07 /usr/sbin/httpd
apache 4355 0.0 0.4 32664 9464 ? S Dec16 0:07 /usr/sbin/httpd
apache 4356 0.0 0.5 32660 9728 ? S Dec16 0:07 /usr/sbin/httpd
apache 4422 0.0 0.4 32676 9460 ? S Dec16 0:06 /usr/sbin/httpd
root 5002 0.0 0.0 2004 504 tty1 Ss+ Nov21 0:00 /sbin/mingetty /dev/tty1
root 7540 0.0 0.0 5112 1380 ? S Dec17 0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql 7642 0.1 1.0 136712 20140 ? Sl Dec17 2:35 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root 8001 0.0 0.4 31028 9600 ? Ss Dec13 0:18 /usr/sbin/httpd
root 8092 0.0 0.0 0 0 ? S 13:47 0:00 [flush-253:2]
root 8511 0.0 0.0 0 0 ? S 13:48 0:00 [flush-8:0]
root 8551 16.0 0.4 28612 8008 pts/0 S+ 13:49 0:00 php test-mysql-connection.php exit
root 8552 44.0 0.1 11836 3252 ? Ss 13:49 0:00 sshd: root@notty
root 8560 0.0 0.0 4924 1032 ? Rs 13:49 0:00 ps axu
root 12520 0.0 0.1 11500 3212 ? Ss 09:05 0:00 sshd: jonwire [priv]
jonwire 12524 0.0 0.1 11832 1944 ? S 09:05 0:05 sshd: jonwire@pts/0
jonwire 12525 0.0 0.0 5248 1736 pts/0 Ss 09:05 0:00 -bash
root 16309 0.0 0.0 5432 1436 pts/0 S 12:01 0:00 su -
root 16313 0.0 0.0 5244 1732 pts/0 S 12:01 0:00 -bash
apache 16361 0.0 0.5 32908 9836 ? S Dec15 0:08 /usr/sbin/httpd
apache 16363 0.0 0.5 32908 9784 ? S Dec15 0:08 /usr/sbin/httpd
apache 16364 0.0 0.4 32660 9612 ? S Dec15 0:08 /usr/sbin/httpd
apache 16365 0.0 0.4 32668 9608 ? S Dec15 0:08 /usr/sbin/httpd
apache 16366 0.0 0.7 35076 13948 ? S Dec15 0:08 /usr/sbin/httpd
apache 16367 0.0 0.4 32248 9264 ? S Dec15 0:08 /usr/sbin/httpd
apache 16859 0.0 0.5 32916 9844 ? S Dec15 0:08 /usr/sbin/httpd
apache 20379 0.0 0.4 32248 8904 ? S Dec15 0:08 /usr/sbin/httpd
root 28368 0.0 0.0 0 0 ? S Nov01 0:21 [flush-253:0]
apache 31973 0.0 0.4 31668 8608 ? S Dec16 0:08 /usr/sbin/httpd
)
The results of ps axu here are pretty useless, because I'm connecting to localhost. But I can see from these results that the DB connect latency spikes occasionally, as does the "network" latency (some TCP/IP buffer?).
If I were you, I'd bump the number of test cycles up to 5000 or 50000.
I can merely guess, but since you eliminated server load, and I assume you checked for red flags in the InnoDB stats (phpMyAdmin is a great help on that one, although there are more professional tools), what remains is inconsistent usage of keys. Could it be that your query varies slightly, and that there is a constellation where suboptimal indices are used?
Please add a FORCE INDEX (PRIMARY) or similar and repeat your tests.
Something I've found immensely useful in diagnosing MySQL issues in this vein is mysqltuner. It's a Perl script that looks at your instance of MySQL and suggests various tuning improvements. Honestly, it gets hard to keep track of all the tuning you can do, and this script is awesome for giving you a breakdown of potential choke points.
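If you want to try it, a typical invocation looks like this (a sketch; the download path assumes the project's GitHub repository, so check the project page for the current location):
# download and run MySQLTuner; it will prompt for DB credentials
wget https://raw.githubusercontent.com/major/MySQLTuner-perl/master/mysqltuner.pl
perl mysqltuner.pl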
Something else to consider is how Linux itself works, which might also explain why you're lagging randomly. When you load top on a Linux box (any box, regardless of load), you'll notice your memory is almost totally used (unless you just rebooted). This isn't a problem or overloading of your box. Linux loads as much as it can into RAM to save time and swaps infrequently used things to your swap file, just like all modern operating systems (called virtual RAM). Normally not a big deal but you're probably using InnoDB as the table type (the current default), which loads things into RAM to save time as well. What could be happening is your query got loaded into RAM (speedy), but sat idle just long enough to get swapped out to the swap file (much slower). Thus you would get a small performance hit while Linux moved it back into RAM (swapfiles are more efficient at this than MySQL would be moving it from the disk). Neither MySQL nor InnoDB have any way to tell this because, as far as they are concerned, it's still in RAM. The problem is described in detail on this blog, with the relevant portion being
Normally a tiny bit of swap usage could be OK (we’re really concerned
about activity—swaps in and out), but in many cases, “real” useful
memory is being swapped: primarily parts of InnoDB’s buffer pool. When
it’s needed once again, a big performance hit is taken to swap it back
in, causing random delays in random queries. This can cause overall
unpredictable performance on production systems, and often once
swapping starts, the system may enter a performance death-spiral.
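A quick way to check whether that swap activity is actually happening on your box, and to make the kernel less eager to swap (the swappiness value is a common suggestion for database hosts, not a guaranteed fix):
vmstat 5 5                      # non-zero si/so columns mean pages are moving to/from swap
cat /proc/sys/vm/swappiness     # the default is usually 60
sudo sysctl -w vm.swappiness=10 # lower values make the kernel prefer keeping pages in RAM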
We found out that an issue with the underlying hardware was causing this. We moved the server to new hardware using vMotion and the issue went away. VMware was not showing alerts or issues with the hardware; nonetheless, a move off that hardware fixed the issue. Very, very odd.
Earlier today my nginx server was at 100% CPU usage; the process using all the CPU was php-cgi.
I logged in and killed all the php-cgi processes with this command:
kill -s 9 PID
Now, after restarting, my server is not working and I see the message "No input file specified.". I googled this message, but nothing worked. I suppose I just have to start php-cgi again, but I can't find how to start it.
UPDATE
If I run the top command, I can see php-cgi running:
1049 root 20 0 336m 20m 10m S 0.0 0.3 0:00.37 httpd
1051 apache 20 0 219m 5472 608 S 0.0 0.1 0:00.55 httpd
1080 root 20 0 20888 1180 592 S 0.0 0.0 0:00.02 crond
1182 root 20 0 19256 976 384 S 0.0 0.0 0:00.00 nginx
1183 nginx 20 0 19856 3176 1364 S 0.0 0.1 0:05.65 nginx
2326 apache 20 0 337m 13m 2512 S 0.0 0.2 0:02.07 httpd
2331 apache 20 0 337m 13m 2564 S 0.0 0.2 0:02.10 httpd
2696 root 20 0 96656 3820 2944 S 0.0 0.1 0:00.18 sshd
2701 root 20 0 12084 1696 1336 S 0.0 0.0 0:00.03 bash
2808 apache 20 0 337m 12m 1988 S 0.0 0.2 0:00.22 httpd
2864 root 20 0 12632 1228 948 R 0.0 0.0 0:00.29 top
2908 ulisses 20 0 183m 11m 6704 S 0.0 0.2 0:00.07 php-cgi
Running the ps aux command also shows php-cgi:
root 1049 0.0 0.3 344532 20700 ? Ss 14:39 0:00 /usr/sbin/httpd
apache 1051 0.0 0.0 224920 5472 ? S 14:39 0:00 /usr/sbin/httpd
root 1080 0.0 0.0 20888 1180 ? Ss 14:39 0:00 crond
root 1182 0.0 0.0 19256 976 ? Ss 14:43 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 1183 0.0 0.0 19856 3176 ? S 14:43 0:05 nginx: worker process
apache 2326 0.0 0.2 345492 13900 ? S 16:56 0:02 /usr/sbin/httpd
apache 2331 0.0 0.2 345480 13944 ? S 16:57 0:02 /usr/sbin/httpd
root 2696 0.0 0.0 96656 3820 ? Ss 17:41 0:00 sshd: root@pts/0
root 2701 0.0 0.0 12084 1696 pts/0 Ss 17:42 0:00 -bash
apache 2808 0.0 0.2 345164 12848 ? S 17:52 0:00 /usr/sbin/httpd
ulisses 2929 0.8 0.1 187732 11976 ? S 18:06 0:00 /usr/bin/php-cgi -c /var/www/vhosts/teclasap.com.br/etc/php.ini
root 2932 0.0 0.0 10480 932 pts/0 R+ 18:06 0:00 ps aux
You can kill all php-cgi processes with:
sudo killall -9 php-cgi
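Killing the processes does not bring PHP back, though; you still have to start the FastCGI backend again. A sketch under the assumption that this server uses php-fpm or spawn-fcgi to manage php-cgi (the service name, port, user and binary path are assumptions and may differ, especially under a hosting panel):
# if PHP runs under php-fpm
service php-fpm restart
# or respawn php-cgi directly with spawn-fcgi
spawn-fcgi -a 127.0.0.1 -p 9000 -u apache -g apache -f /usr/bin/php-cgi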