Overview


In this unit, we focus on Linux system monitoring with modern tools such as Grafana, Prometheus, Node Exporter, and Loki. For Linux administrators, monitoring is essential to keeping systems stable, performant, and secure across environments.

We will explore how to collect, analyze, and visualize system metrics, and discuss best practices for monitoring and dashboard design that can improve troubleshooting and proactive system management.

Learning Objectives


By the end of this unit, you will be able to:

  • Explain core monitoring concepts like metrics, logs, SLOs, SLIs, and KPIs
  • Set up Prometheus and Node Exporter to collect system metrics
  • Use Grafana to create dashboards for visualizing system health and performance
  • Write and execute PromQL queries to analyze system data
  • Interpret monitoring data to diagnose system issues and support teams with actionable insights
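
As a preview of the PromQL objective above, here is a minimal sketch of how a query could be sent to Prometheus through its HTTP API. It only builds the request URL, so it runs without a server. The address `http://localhost:9090` and the helper name `build_query_url` are assumptions for illustration; the metric `node_cpu_seconds_total` is one that Node Exporter exposes, and the query shown is one common way to estimate CPU usage, not the only one.

```python
from urllib.parse import urlencode

# Assumption: a Prometheus server on its default port 9090 on this host.
PROMETHEUS_URL = "http://localhost:9090"

# PromQL: percentage of CPU time NOT spent idle, averaged per instance,
# based on the per-second rate over the last 5 minutes.
CPU_BUSY_QUERY = (
    '100 - (avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Return an instant-query URL for the Prometheus HTTP API (/api/v1/query).

    Hypothetical helper for illustration; it URL-encodes the PromQL
    expression into the 'query' parameter.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Paste the printed URL into a browser or curl once Prometheus is running.
    print(build_query_url(PROMETHEUS_URL, CPU_BUSY_QUERY))
```

The same PromQL expression can be pasted directly into the Prometheus expression browser or a Grafana panel; the URL form is mainly useful for scripting.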

Relevance & Context


Monitoring is a core responsibility of Linux system administration: it lets you know what’s happening under the hood before issues escalate. Modern IT environments rely on monitoring to track system performance, security events, and overall stability, whether in production, development, or cloud environments.

This unit focuses on Grafana for visualization and Prometheus with Node Exporter for telemetry and metrics collection. These tools are commonly used in enterprise, cloud, and HPC (High-Performance Computing) environments.

Whether you're in a NOC, SysAdmin, or DevOps role, understanding monitoring and telemetry makes you a key contributor to system reliability and performance.

Prerequisites


Before starting Unit 11, you should have:

  • Basic understanding of Linux system administration and networking
  • Familiarity with system processes, performance metrics, and logs
  • Root or sudo access to a Linux system (Rocky 9 or equivalent)
  • Internet access to run the labs on Killercoda and to reach online resources
  • (Optional but recommended) Exposure to containers and to services like Grafana or Prometheus

Key Terms and Definitions


SLO (Service Level Objective)

SLA (Service Level Agreement)

SLI (Service Level Indicator)

KPI (Key Performance Indicator)

MTTD (Mean Time to Detect)

MTTR (Mean Time to Repair)
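
To make the terms above more concrete, the following sketch works through the arithmetic behind an SLI, an SLO check, MTTD, and MTTR. Everything in it is hypothetical: the request counts, the 99.9% target, and the two incidents are invented for illustration, and the `Incident` class and function names are not from any library. MTTR is measured here from the start of the fault to resolution; some teams measure it from detection instead, so treat that as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring (or a person) noticed it
    resolved: datetime  # when service was restored


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured percentage of requests that succeeded."""
    return 100.0 * good_requests / total_requests


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """MTTD: average delay between a fault starting and being detected."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """MTTR: average time from fault start to resolution (assumed definition)."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


# Hypothetical data for one reporting period.
SLO_TARGET = 99.9  # SLO: the availability target the team commits to, in percent
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
             datetime(2024, 1, 5, 10, 45)),
    Incident(datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 15),
             datetime(2024, 1, 19, 14, 35)),
]

sli = availability_sli(good_requests=999_500, total_requests=1_000_000)

if __name__ == "__main__":
    print(f"SLI: {sli:.2f}%  SLO met: {sli >= SLO_TARGET}")
    print(f"MTTD: {mean_time_to_detect(incidents)}")
    print(f"MTTR: {mean_time_to_repair(incidents)}")
```

In this sketch the SLI is what was measured, the SLO is the internal target it is compared against, and an SLA would be the external agreement that typically attaches consequences to missing such a target.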