Monitoring and Alerting

Overview

Monitoring systems and raising alerts when issues arise are critical responsibilities for system operators. Effective observability lets operators continuously assess system health, performance, and security. In this unit, we will explore how to design reliable monitoring infrastructures through sound architectural decisions. We will also examine how alerts can be tuned and moderated to minimize noise, prioritize actionable events, and ensure a timely response to real issues.

Learning Objectives

Understand robust monitoring architecture.
Understand what makes up a well-architected monitoring pipeline.
Understand alert fatigue and how to focus on pertinent, actionable alerts.
Understand the trade-off between information flow and security.
Get hands-on with Fail2Ban, Prometheus, and Grafana.

Key Terms and Definitions

Tracing	Span
Label	Time Series Database (TSDB)
Queue	Upper control limit / Lower control limit (UCL/LCL)
Aggregation	SLO, SLA, SLI
Push vs. pull of data	Alerting rules
Alertmanager	Alert template
Routing	Throttling
Monitoring for defensive operations	SIEM
Intrusion Detection Systems (IDS)	Intrusion Prevention Systems (IPS)