Monitoring and Alerting

Overview

Monitoring systems and alerting on issues as they arise are critical responsibilities for system operators. Effective observability allows system health, performance, and security to be assessed continuously. In this unit, we will explore how to design reliable monitoring infrastructure through sound architectural decisions. We will also examine how alerts can be tuned and moderated to minimize noise, prioritize actionable events, and ensure a timely response to real issues.

Learning Objectives

  1. Understand robust monitoring architecture.
  2. Understand what comprises a well-architected monitoring pipeline.
  3. Understand alert fatigue and how to focus on pertinent, actionable alerts.
  4. Understand the trade-off between information flow and security.
  5. Get hands-on with Fail2Ban, Prometheus, and Grafana.

Key Terms and Definitions

Tracing
Span
Label
Time Series Database (TSDB)
Queue
Upper control limit / Lower control limit (UCL/LCL)
Aggregation
SLO, SLA, SLI
Push v. Pull of data
Alerting rules
Alertmanager
Alert template
Routing
Throttling
Monitoring for defensive operations
SIEM
Intrusion Detection Systems (IDS)
Intrusion Prevention Systems (IPS)