(gdb) break *0x972

Debugging, GNU± Linux and WebHosting and ... and ...

Server Monitoring with Monit

Last Friday, 0x972.info's server was suspended by the host provider (PulseHeberg) because of a high CPU load for 10 minutes.

CPU Load Indicator

The CPU load reflects how much work is queued for the CPU: on Linux, it is the average number of processes running or waiting to run (plus those blocked on I/O). The lower the better: a value of 1 corresponds to one fully used processor core, 2 to two cores, and so on. It is averaged over 1, 5 and 15 minutes:

(desktop) $ uptime
... load average: 0.04, 0.06, 0.17
(server) $ uptime
... load average: 0.11, 0.04, 0.01
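
To read these numbers against the size of the machine, the load can be compared with the number of cores. Here is a minimal sketch, assuming a Linux system with `awk` available; the function name `is_overloaded` is made up for this example and is not part of any tool:

```shell
# Hypothetical helper (not part of any tool): succeeds when LOAD exceeds CORES.
# Usage: is_overloaded LOAD CORES
is_overloaded() {
    # "+ 0" forces a numeric comparison in awk
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Example call on Linux: the first field of /proc/loadavg is the 1min average.
# read -r one_min _ < /proc/loadavg
# is_overloaded "$one_min" "$(nproc)" && echo "load above core count"
```

With the values above, a load of 0.11 on a single core is far from the limit, whereas anything above 1 would mean processes are waiting for CPU time.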

When applications misbehave, they can enter an infinite loop and thus never let the CPU rest. This is easy to verify with a while loop: while [ 1 -eq 1 ]; do echo -n ""; done

While loop and CPU load

In the screenshot, we can see two infinite loops (to raise the load faster), and on the other side, the 1min CPU load reads 1.03. (You can also recognize my i3 desktop and custom task bar, with a visual indicator of the current load: the load on both CPU cores is too high, and the temperature T is rising).
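
To reproduce this experiment without leaving a runaway loop behind, the loop can be bounded in time. This is only a sketch, assuming the `timeout` command from GNU coreutils is installed; `burn_cpu` is a name invented here:

```shell
# Hypothetical helper: keep one core busy for N seconds, then stop.
# Assumes the `timeout` command (GNU coreutils) is available.
burn_cpu() {
    timeout "$1" sh -c 'while :; do :; done'
}

# burn_cpu 60 &   # start it twice in the background to load two cores
# burn_cpu 60 &
# uptime          # the 1min load average should rise while they run
```

`timeout` kills the loop when the delay expires and returns a non-zero status (124 with GNU coreutils), so nothing keeps spinning afterwards.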

Overload Prevention

Suspending the server is the only safe thing the host provider can do: the overload is almost certainly caused by a failing application that would never escape its infinite loop by itself, so they shut down the entire server. But on the user side, we have more control over the server, so there is certainly something less brutal that we can do.

And here comes monit (with a nice presentation on their website):

check process sshd with pidfile /var/run/sshd.pid
   start program  "/etc/init.d/ssh start"
   stop program  "/etc/init.d/ssh stop"
   if failed protocol ssh then restart
   if 5 restarts within 5 cycles then alert
   if 5 restarts within 5 cycles then timeout

It's almost in plain English:

  • monitor the process sshd with its PID stored in file ...,
  • you can start and stop the program this way,
  • if it fails the ssh test connection, restart it,
  • if it fails too often, send me a mail alert and stop monitoring it

Apache httpd is a bit more complex, but still straightforward to read:

check process apache with pidfile /var/run/apache2.pid
   group www
   start program = "/etc/init.d/apache2 start"
   stop  program = "/etc/init.d/apache2 stop"
   if failed host www.0x972.info port 80 
        protocol HTTP request "/monit/token" then restart
   if failed host www.0x972.info port 443 
             type TCPSSL protocol HTTP request "/monit/token" then restart
   if 5 restarts within 5 cycles then timeout
   if cpu > 40% for 2 cycles then alert
   if totalcpu > 60% for 2 cycles then alert
   if totalcpu > 80% for 5 cycles then restart
   if mem > 100 MB for 5 cycles then restart
   if loadavg(5min) greater than 1.5 for 8 cycles then stop

I configured monit to run the checks every 2 minutes, and send mails to me in case of failures.

set daemon 120
set alert kevin@...
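
Since the rules are expressed in cycles, it helps to translate them into wall-clock time. Here is a small sketch, assuming the 120-second cycle from `set daemon 120`; `cycles_to_minutes` is a name made up for this example:

```shell
# Cycle length in seconds, taken from "set daemon 120".
CYCLE_SECONDS=120

# Hypothetical helper: minutes covered by N consecutive cycles.
cycles_to_minutes() {
    echo $(( $1 * CYCLE_SECONDS / 60 ))
}

cycles_to_minutes 2   # "for 2 cycles" -> 4
cycles_to_minutes 5   # "for 5 cycles" -> 10
cycles_to_minutes 8   # "for 8 cycles" -> 16
```

So the `totalcpu > 80% for 5 cycles` rule restarts Apache after roughly 10 minutes of sustained load, and the `loadavg(5min)` rule stops it after roughly 16.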

So far, I have configured the monitoring for apache2, dovecot, mysql, postfix and sshd, as well as system-wide properties.
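
For the system-wide part, a rule set could look like the fragment below. It is only an illustration: the thresholds are assumptions chosen for the example, not the values actually running on the server, and the host name is the one reported in the alert mails.

```
# Illustrative thresholds only (assumed, not the real configuration)
check system www.0x972.info
   if loadavg (1min) > 4 then alert
   if loadavg (5min) > 2 then alert
   if memory usage > 75% then alert
```

Unlike the process checks, these tests apply to the whole machine, so they can warn about an overload even when it is not clear yet which service is responsible.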

From: ...
To: ...
Subject: monit alert --  Connection failed mysql (Tue, 16 Dec 2014 16:21:15)
Date: Tue, 16 Dec 2014 15:21:15 GMT
Connection failed Service mysql
    Action:      restart
    Host:        www.0x972.info
    Description: failed protocol test [MYSQL] at INET[0x972.info:3306] via TCP -- MYSQL: error receiving login response
    Date:        Tue, 16 Dec 2014 16:21:15

Wednesday, December 17, 2014 - No comments
