This is another article in a series sharing techniques that I wrote about in Web Operations Dashboards, Monitoring, and Alerting. In this article, I’m going to talk about the Monitor Selection Principles.
While it can be tempting to start off by monitoring everything and alerting every time something slightly odd happens, there is a better pattern for choosing what you monitor and when to sound the alarm. This pattern is called the Monitor Selection Principles, because who doesn’t like principles?
Monitor Selection Principles
Here are the monitor selection principles, which will guide you in choosing the right things to monitor and in refining your alarms over time:
- Pick one metric that is a leading indicator of a fault
- Add a monitor with a reasonably sensitive alarm threshold
- Each time the alarm goes off, decide whether you need to:
  - (a) Take urgent action to resolve a fault, or
  - (b) Move the alarm threshold up to make the alarm less likely to sound
Every time an alarm sounds, you must choose either option a or option b. There is no option c, “ignore the alarm and don’t update the alerting”. Take a look at the Alerting Principles for more on this subject.
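The either/or rule can be sketched in code. This is only a minimal illustration, not a real monitoring tool: the `Monitor` and `Decision` names, the `queue_depth` metric, and the numbers are all hypothetical, and it assumes an alarm sounds when the metric rises above the threshold. The point is that every alarm must be closed out with one of the two options, and choosing the second one has to actually raise the threshold.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Decision(Enum):
    TAKE_ACTION = "a"      # a real fault: take urgent action
    RAISE_THRESHOLD = "b"  # not a fault: make the alarm less likely to sound


@dataclass
class Monitor:
    """Hypothetical monitor: one metric, one alarm threshold."""
    metric: str
    threshold: float
    history: List[Tuple[float, Decision]] = field(default_factory=list)

    def is_alarming(self, value: float) -> bool:
        # Assumption: the alarm sounds when the value exceeds the threshold.
        return value > self.threshold

    def resolve_alarm(self, value: float, decision: Decision,
                      new_threshold: Optional[float] = None) -> None:
        # There is no "ignore" path: a decision is always required.
        if not isinstance(decision, Decision):
            raise ValueError("every alarm needs option a or option b")
        if decision is Decision.RAISE_THRESHOLD:
            # Option b only counts if the threshold really moves up.
            if new_threshold is None or new_threshold <= self.threshold:
                raise ValueError("option b requires a higher threshold")
            self.threshold = new_threshold
        self.history.append((value, decision))


monitor = Monitor(metric="queue_depth", threshold=100.0)

# First alarm turns out to be harmless, so the threshold moves up.
if monitor.is_alarming(120.0):
    monitor.resolve_alarm(120.0, Decision.RAISE_THRESHOLD, new_threshold=150.0)

# A later alarm is a real fault, so the threshold stays where it is.
if monitor.is_alarming(160.0):
    monitor.resolve_alarm(160.0, Decision.TAKE_ACTION)
```

Keeping a history of decisions is a design choice for the sketch: it makes it easy to see later how often an alarm led to action versus a threshold change.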
Once you have hit a good balance with your first monitor, you can use failure events and near-misses to guide your next monitor. Just repeat the pattern of picking a metric that seems related to a fault and adjusting the threshold to achieve the desired result.