Monitoring and Alerting: keeping your website online for longer

29 January 2017

Big Blue Door has been a hosting provider since we first started, and we take great pride in our excellent uptime statistics and high-performance, secure hosting offering. We guarantee a minimum of 99.95% application uptime, but most of our websites far exceed this in any given calendar month (99.95% of a 30-day month still allows for around 21 minutes of downtime): we have some websites with 100% application uptime for many months running.

Unfortunately, problems do occur with websites, and when something happens to a server or a website we want to know instantly so that one of our sysadmins can respond.

There are a number of software packages we use for this, which provide near-instant reporting across our infrastructure, alerting the relevant stakeholders via a variety of methods, including Slack, email, and SMS.

Some of these packages are proactive tools, some reactive, but most come with graphing tools which make visualising the data very simple: many of our clients receive some of these graphs in monthly reports.

Pingdom

Pingdom is a great tool, and is considered a standard part of uptime monitoring for any hosting company. It runs servers in multiple geographical locations (across North America, Europe, and Asia) which send a page request to a website of your choosing at a configurable interval. For most of our sites we request a page once a minute, and if the page doesn’t load (for whatever reason) Pingdom sends an alert to pre-configured people via email, SMS, or APIs to other systems.

For each request it makes, Pingdom stores the time it took our server to respond and the location the request was made from.

Pingdom - checking every minute
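
As a rough illustration of the numbers Pingdom records, you can reproduce a single check from any machine with curl. This is just a sketch (we’re using our own homepage as the target):

# Request the page once, discard the body, and report the HTTP status
# code and the total response time: the same numbers Pingdom logs for
# every check it makes.
curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  "https://bigbluedoor.net/"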

All of these statistics are then collated into a simple graph, with the Uptime metric prominently displayed:

Uptime for bigbluedoor.net

Pingdom’s failing, however, is that it only provides reactive monitoring: that is, it only tells you when the site has actually become unavailable. The site may become unavailable due to unusually high traffic on the server, or because of a malicious attempt to overload it (e.g. a DoS attack). Wouldn’t it be good if we could monitor our servers to tell us of any problems before they actually caused a complete loss of service?

We use a couple of software packages to do this.

New Relic

New Relic is a performance monitoring tool with an extremely simple setup process: just a few commands on the server, and New Relic will track key hardware metrics from it, such as memory (RAM) usage, CPU usage, network usage, hard disk usage, and the overall “load average”, a single number which indicates how “busy” the server is.

New Relic dashboard
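
For context, these are the same raw numbers the server itself exposes, and you can eyeball them by hand with standard Linux tools (no New Relic required):

# Load average over the last 1, 5, and 15 minutes: the first field is
# the single "how busy is this server?" number described above.
cat /proc/loadavg

# Current memory (RAM) usage, in megabytes.
free -m

# Disk usage for each mounted filesystem.
df -h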

There are also “plugins” for many specific software packages (PHP, Memcached, MySQL, Varnish) which enable you to track more detailed metrics from specific elements of the infrastructure stack.

This, like Pingdom, is reactive monitoring: looking back over time to see when the particularly busy periods were, which can be immensely helpful when trying to ascertain the root cause of some infrastructure downtime. New Relic goes one step further, though, with configurable alerting thresholds that allow us to step in and sort issues before they cause a loss of service.

New Relic - configuring alerts

With the above settings, when the CPU usage reaches 60% and stays above 60% for more than 20 minutes, the sysadmin team will receive a notification. We can then log into the server remotely to figure out why it is starting to struggle before the website actually goes down. In some cases this may just be due to high traffic, but if at this stage we can see lots of requests coming in from a specific IP address then it’s clear that an attacker is attempting a DoS attack, and we can mitigate it before it causes website downtime.
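
As an example, a quick first pass at spotting a single-source flood is to count requests per client IP in the web server’s access log. This is a rough sketch, assuming Apache’s default log location:

# List the ten busiest client IPs in the access log; one IP with
# vastly more requests than the rest is a red flag for a DoS attempt.
awk '{ print $1 }' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -10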

It is difficult in New Relic, however, to set custom policies for the bespoke services we might require. For this, we use another tool: Monit.

Monit

Monit is by far the most difficult element of our reporting infrastructure to set up, but it allows much more granular control over bespoke alerts.

As an example of what we might use Monit for, consider a website which interacts with a CRM system. The CRM is on a private network (so, not accessible to the outside world, only from trusted computers), meaning we cannot use Pingdom to notify us if the CRM goes down, or if the webserver loses communication with the CRM server. New Relic would not allow us to set up a custom rule like this, but with Monit we can: Monit allows us to create custom scripts for any rule that we want. In this example, a script such as the following creates a custom “pingdom” for a specific API on the CRM that we know the webserver needs to access:

#!/bin/bash
#
# Custom "pingdom" check for the CRM's API status page. Exit 0 if the
# page returns HTTP 200; exit 1 otherwise, so Monit sees a failure.
CRM_URL="https://crm.domain.com/api-status-page"

# -s silences progress output, -o /dev/null discards the body, and
# -w '%{http_code}' prints just the response status code.
HTTP_STATUS_CODE=$(curl -s -o /dev/null -w '%{http_code}' "$CRM_URL")

if [ "$HTTP_STATUS_CODE" != "200" ]; then
  exit 1
fi
exit 0

Using a Monit configuration file, we can integrate this into the checks that Monit runs through:

check program live_crm_available with path /etc/monit/scripts/live-crm.sh
  if status > 0 for 4 times within 5 cycles then alert

This configuration says:

  • Run the script at /etc/monit/scripts/live-crm.sh
  • If the exit status is greater than 0 (i.e. if the CRM is not accessible), store a record of the failure
  • If there are four failures within five checks, throw an alert

This means that, with a one-minute check cycle, if the connection to the CRM were to fail we would be alerted within five minutes. The tolerance level (4 times within 5 cycles) allows for temporary “blips” without throwing too many false alerts.

Other examples of what we use custom Monit scripts for are:

  • Confirming that SMTP connections to mail servers (e.g. Mandrill, Sendgrid) can be made
  • Checking the ‘hash sums’ of SSH login files (authorized_keys) and alerting when new SSH keys are added to a server (see the sketch below)
  • Checking firewall rules on the server and alerting when changes are made
  • Confirming backup schedules are in place and running correctly
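
As a sketch of what the SSH key check above might look like (the paths and stored-hash location here are illustrative assumptions, not our production layout):

#!/bin/bash
#
# Compare the current SHA-256 hash of authorized_keys against a
# known-good hash recorded when the server was provisioned. A non-zero
# exit status tells Monit to raise an alert, just as in the CRM check
# above.
KEYS_FILE="/root/.ssh/authorized_keys"
KNOWN_HASH_FILE="/etc/monit/known-hashes/authorized_keys.sha256"

CURRENT_HASH=$(sha256sum "$KEYS_FILE" | awk '{ print $1 }')
KNOWN_HASH=$(cat "$KNOWN_HASH_FILE")

if [ "$CURRENT_HASH" != "$KNOWN_HASH" ]; then
  exit 1
fi
exit 0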

There’s another great part to Monit, though: it can proactively try to fix issues.

A web server runs multiple software programs, such as Apache, PHP, and Varnish. If any one of these programs crashes, the website goes down. We can configure Monit to automatically restart these programs if they fail. This is effectively the same as running a program on your computer that reopens Microsoft Word whenever it crashes.

The following Monit configuration runs this check for Varnish:

check process varnish with pidfile /var/run/varnishd.pid
  start program = "/etc/init.d/varnish start"
  stop program = "/etc/init.d/varnish stop"
  if failed host 127.0.0.1 port 81 for 2 times within 3 cycles then restart
  if failed host 127.0.0.1 port 81 for 2 times within 3 cycles then alert

In this example, we’re running Varnish on port 81 on the same server as the monitoring script. This configuration says:

  • Check port 81 and listen for a response
  • If there is no response two times within three checks, attempt to restart the program
  • Simultaneously, throw an alert

A “cycle” can run as frequently as you want with Monit (the interval is set with the “set daemon” directive in Monit’s configuration file): most of our servers run a cycle every ten seconds. So, if Varnish were to crash, Monit would restart it within about 20 seconds. This software runs 24/7/365, and we have similar checks in place for PHP, Apache, NGINX, MySQL, and Apache Solr: all the key parts of the infrastructure.

M/Monit

Pingdom and New Relic are fantastically easy to use: each provides a central status page from which you can quickly view the status of all of your servers. Monit is more of a command-line tool, with no such centralisation. Fortunately, however, there is a related software package, M/Monit, which provides a central reporting website.

M/Monit - Status page

At a glance, you can view the status of all servers, and see any which have recently raised alerts. Furthermore, you can correlate some of the metrics across multiple hosts: for example, grouping all the webservers within a load-balanced setup to analyse simultaneous changes in CPU usage.

M/Monit - Analytics by host group

Redundant (and high-volume) alerting

There are definite areas of crossover between some of the systems we utilise, and this provides redundancy: if Pingdom were to fail, for example, we’d still get alerts from New Relic and M/Monit. All three of these platforms report in via email, and also post to channels within our company Slack. For some highly critical alerts, we receive text messages to our company mobile phones. This does, however, mean we get a LOT of alerts coming through when things do go wrong, or when servers get a bit busier than normal.

Big Panda

We recently (in summer 2016) discovered a new and innovative solution to receiving hundreds of alert messages. Big Panda is not a monitoring platform, but a (very) intelligent correlation engine: instead of emailing, Slacking, or SMSing alerts to our team, all of our systems now alert Big Panda, and Big Panda correlates these alerts in near-real time before notifying our sysadmin team.

Take, for example, the following scenario: one of our shared servers, which hosts six websites, is being DoSed. The CPU usage increases (cue warning and critical alerts from New Relic and M/Monit) and very quickly causes website downtime for all six websites (cue critical alerts from Pingdom). For this very simple scenario we would have received two alerts from New Relic, two from M/Monit, and six from Pingdom. Each of these ten alerts would have emailed us, Slacked the team, and possibly sent an SMS: we’d be expecting around 25 alerts in total. Big Panda correlates all of these and sends us a single critical alert.

Big Panda has decreased our MTTR (Mean Time to Response) by allowing us to see the wood for the trees. Rather than having to triage tens, or even hundreds, of emails, Slack messages, and text messages, we now have a single alert to review and action.
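
Feeding a monitoring system into Big Panda amounts to posting JSON alerts to its REST API. The sketch below is hypothetical: the endpoint and field names follow Big Panda’s v2 alerts API as we understand it, and the token, app key, and hostname are placeholders:

#!/bin/bash
#
# Hypothetical sketch: push a single alert into Big Panda's v2 alerts
# API. The token, app key, and hostname below are placeholders.
BP_TOKEN="REPLACE_WITH_API_TOKEN"
BP_APP_KEY="REPLACE_WITH_APP_KEY"

curl -s -X POST "https://api.bigpanda.io/data/v2/alerts" \
  -H "Authorization: Bearer $BP_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"app_key\": \"$BP_APP_KEY\",
    \"status\": \"critical\",
    \"host\": \"web01.example.com\",
    \"check\": \"varnish port 81\",
    \"description\": \"Varnish failed to respond on port 81\"
  }"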

Big Panda can provide timeline graphs for the duration of an incident, correlating all the alerts together, and will reopen an existing alert rather than creating a new one if a system is seen to be “flapping”, i.e. momentarily up, then down, then up again.

Big Panda - incident timeline

We’ve been very impressed with Big Panda so far, and the development team is working on many new features, so we’re confident that this service will become even more useful to us in the future.


Utilising and integrating all of these systems has allowed us to build a resilient and effective alerting and monitoring platform which helps keep our servers and websites online, with downtime kept to an absolute minimum. That keeps our clients happy, and keeps our sysadmin team less busy and free to focus on more interesting things: building sites rather than firefighting!