Last Friday, 0x972.info's server was suspended by the host provider (PulseHeberg) because of a high CPU load for 10 minutes.
CPU Load Indicator
The CPU load reflects how much of the time the CPU is actually running code rather than resting. The lower the better: 1 means 100% usage of one processor core, 2 means two cores fully used, etc. It's averaged over the last 1, 5 and 15 minutes:
    (desktop) $ uptime
    ... load average: 0.04, 0.06, 0.17

    (server) $ uptime
    ... load average: 0.11, 0.04, 0.01
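Since 1 means one fully used core, a load value is easier to judge once it is divided by the number of cores. Here is a minimal sketch of that arithmetic; the function name `load_per_core` is made up for this post, and the numbers in the example are illustrative, not measurements:

```sh
# load_per_core LOAD CORES
# Print LOAD divided by CORES with two decimals.
# A result close to 1.00 means every core is busy on average.
load_per_core() {
    # LC_ALL=C keeps the decimal separator a dot whatever the locale is.
    LC_ALL=C awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Illustrative values: a load of 1.5 on a 2-core machine.
load_per_core 1.5 2    # prints 0.75
```

On a Linux machine, something like `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` should apply it to the live 1-minute value, assuming `/proc/loadavg` and `nproc` are available.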
When applications misbehave, they can enter an infinite loop and thus never let the CPU rest. This can be verified easily with a while loop:

    while [ 1 -eq 1 ] ; do echo -n "" ; done
In the screenshot, we can see two infinite loops (to raise the load faster), and on the other side, the 1min CPU load reads 1.03. (You can also recognize my i3 desktop and custom task bar, with a visual indicator of the current load: both CPU cores are too high, and the temperature T is rising.)
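An infinite loop has to be killed by hand, which is easy to forget. A safer way to repeat the experiment is a loop that stops by itself. This is only a sketch: `burn_cpu` is a made-up name, and it assumes the `timeout` command from GNU coreutils is installed:

```sh
# burn_cpu SECONDS
# Keep one core busy for SECONDS seconds, then stop.
burn_cpu() {
    # timeout kills the busy loop after $1 seconds and then exits with
    # status 124; that status is the expected outcome, so treat it as success.
    timeout "$1" sh -c 'while :; do :; done' || [ "$?" -eq 124 ]
}
```

For instance, `burn_cpu 60 & burn_cpu 60 & wait` keeps two cores busy for one minute; running `uptime` in another terminal meanwhile should show the 1min value climbing.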
Suspending the server is the only safe thing the host provider can do: it's undoubtedly a failing application that caused the overload, and it would not escape the infinite loop by itself, so they shut down the entire server. But on the user side, we've got more control over the server, so there's certainly something less brutal that we can do.
And here comes monit (with a nice presentation on their website):
    check process sshd with pidfile /var/run/sshd.pid
      start program "/etc/init.d/ssh start"
      stop program "/etc/init.d/ssh stop"
      if failed protocol ssh then restart
      if 5 restarts within 5 cycles then alert
      if 5 restarts within 5 cycles then timeout
It's almost in plain English:
- monitor the process sshd with its PID stored in file ...,
- you can start and stop the program this way,
- if it fails the ssh test connection, restart it,
- if it fails too often, send me a mail alert and stop monitoring it
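To get a feeling for the first two bullets, here is what a hand-written equivalent of the pidfile check could look like. This is a sketch of the general technique, not monit's actual implementation, and `pid_is_alive` is a hypothetical helper:

```sh
# pid_is_alive PIDFILE
# Succeed if PIDFILE is readable and names a process that is currently running.
pid_is_alive() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    # kill -0 sends no signal, it only reports whether the process could be
    # signalled. Run it as root or as the process owner: for somebody else's
    # process it fails even when that process is alive.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

With the paths from the configuration above, `pid_is_alive /var/run/sshd.pid || /etc/init.d/ssh start` would restart sshd when its process is gone. Monit goes further: it also tests the protocol, counts the restarts and sends the alert.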
Apache httpd is a bit more complex, but still straightforward to read:
    check process apache with pidfile /var/run/apache2.pid
      group www
      start program = "/etc/init.d/apache2 start"
      stop program = "/etc/init.d/apache2 stop"
      if failed host www.0x972.info port 80 protocol HTTP request "/monit/token" then restart
      if failed host www.0x972.info port 443 type TCPSSL protocol HTTP request "/monit/token" then restart
      if 5 restarts within 5 cycles then timeout
      if cpu > 40% for 2 cycles then alert
      if totalcpu > 60% for 2 cycles then alert
      if totalcpu > 80% for 5 cycles then restart
      if mem > 100 MB for 5 cycles then restart
      if loadavg(5min) greater than 1.5 for 8 cycles then stop
I configured monit to run the checks every 2 minutes, and send mails to me in case of failures.
    set daemon 120
    set alert kevin@...
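With a 120-second polling interval, the "for N cycles" conditions translate into real time. The sketch below computes the worst-case delay before a rule fires, assuming a rule needs N consecutive failing checks and the problem starts right after a poll; `cycles_to_seconds` is a made-up helper name:

```sh
# cycles_to_seconds CYCLES [INTERVAL]
# Print the worst-case delay, in seconds, before a "for CYCLES cycles" rule
# fires, for a polling interval in seconds (default 120, as in "set daemon 120").
cycles_to_seconds() {
    echo $(( $1 * ${2:-120} ))
}

cycles_to_seconds 2    # prints 240 (the "for 2 cycles" alerts)
cycles_to_seconds 5    # prints 600 (the "for 5 cycles" restarts)
cycles_to_seconds 8    # prints 960 (the "for 8 cycles" stop)
```

So the alerts arrive within about 4 minutes, while the final "stop" on the load average can take up to 16 minutes.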
So far, I have configured the monitoring for apache2, dovecot, mysql, postfix and sshd, as well as system-wide properties. The alert mails look like this:
    From: ...
    To: ...
    Subject: monit alert -- Connection failed mysql (Tue, 16 Dec 2014 16:21:15)
    Date: Tue, 16 Dec 2014 15:21:15 GMT

    Connection failed Service mysql

    Action: restart
    Host: www.0x972.info
    Description: failed protocol test [MYSQL] at INET[0x972.info:3306] via TCP -- MYSQL: error receiving login response
    Date: Tue, 16 Dec 2014 16:21:15