Monitoring and Alerting
Overview
Monitoring systems and alerting on issues as they arise are core responsibilities of system operators. Effective observability lets operators continuously assess system health, performance, and security. In this unit, we will explore how to design reliable monitoring infrastructure through sound architectural decisions. We will also examine how to tune and moderate alerts to minimize noise, prioritize actionable events, and ensure a timely response to real issues.
Learning Objectives
- Understand robust monitoring architecture.
- Understand what constitutes a well-architected monitoring pipeline.
- Understand alert fatigue and how to focus on pertinent, actionable alerts.
- Understand the trade-off between information flow and security.
- Get hands-on with Fail2Ban, Prometheus, and Grafana.
Key Terms and Definitions
| Tracing | Span |
|---|---|
| Label | Time Series Database (TSDB) |
| Queue | Upper control limit / Lower control limit (UCL/LCL) |
| Aggregation | SLO, SLA, SLI |
| Push vs. pull of data | Alerting rules |
| Alertmanager | Alert template |
| Routing | Throttling |
| Monitoring for defensive operations | SIEM |
| Intrusion Detection System (IDS) | Intrusion Prevention System (IPS) |