Advanced Configuration for Prometheus Alerts

45 minutes
  • 3 Learning Objectives

About this Hands-on Lab

Prometheus Alertmanager provides several features for managing alerts beyond basic notification delivery. These features let you tune your alerts so they are more useful in real-world situations. In this lab, you will practice using three of them: alert grouping, inhibitions, and silences.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Combine the Web Server Down Alerts into a Single Group
  1. Log in to the Prometheus server.

  2. Edit the Alertmanager configuration file:

    sudo vi /etc/alertmanager/alertmanager.yml
  3. Add a new node to the routing tree, under the routes key of the top-level route block, to combine the WebServer.*Down alerts:

      - receiver: 'web.hook'
        group_by: ['service']
        match_re:
          alertname: 'WebServer.*Down'
  4. Load the new configuration:

    sudo killall -HUP alertmanager
  5. Check Alertmanager in a web browser at http://<PROMETHEUS_SERVER_PUBLIC_IP>:9093. You should see the Web Server alerts grouped together under the group service="webserver".
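
For orientation, the fragment below sketches how the child route from step 3 might sit inside the complete route block. The top-level values shown here (group_by, group_wait, group_interval, repeat_interval) are assumptions borrowed from the stock sample configuration, not values taken from this lab's file, so keep whatever your file already contains. Only the entry under routes comes from the step above.

```yaml
route:
  # Top-level defaults: assumed values, keep the ones already in your file.
  receiver: 'web.hook'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    # Child route from step 3: any alert whose alertname matches the regex
    # is grouped by its 'service' label instead of by alertname.
    - receiver: 'web.hook'
      group_by: ['service']
      match_re:
        alertname: 'WebServer.*Down'
```

Alertmanager anchors these regular expressions, so the pattern has to match the whole alert name: WebServer1Down matches, while WebBadGateway does not. Newer Alertmanager releases also accept a matchers list (for example, alertname =~ "WebServer.*Down") in place of match_re.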

Create an Inhibition to Stop the `WebBadGateway` Alert When a `WebServerDown` Alert Is Already Firing
  1. Edit the Alertmanager configuration file:

    sudo vi /etc/alertmanager/alertmanager.yml
  2. Add a new inhibit rule under the inhibit_rules key:

      - source_match_re:
          alertname: 'WebServer.*Down'
        target_match:
          alertname: 'WebBadGateway'
  3. Load the new configuration:

    sudo killall -HUP alertmanager
  4. Check Alertmanager in a web browser at http://<PROMETHEUS_SERVER_PUBLIC_IP>:9093. The WebBadGateway alert should no longer appear. You can select the Inhibited checkbox to make it appear again.
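
As a rough sketch, the inhibit rule could look like this in the configuration file. It assumes the rule lives in a top-level inhibit_rules list. If your file already has that key, append the rule to the existing list rather than declaring the key a second time.

```yaml
inhibit_rules:
  # Source: while any alert whose name matches this (anchored) regex is firing...
  - source_match_re:
      alertname: 'WebServer.*Down'
    # Target: ...suppress notifications for alerts with exactly this name.
    target_match:
      alertname: 'WebBadGateway'
    # No 'equal' list is given, so the rule applies regardless of the
    # other labels on the source and target alerts.
```

An inhibited alert is still tracked by Alertmanager. It is only hidden from the default view and excluded from notifications, which is why it can be shown again in the web interface.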

Silence the `WebServer1Down` Alert
  1. Access Alertmanager in a web browser at http://<PROMETHEUS_SERVER_PUBLIC_IP>:9093.

  2. Expand the service="webserver" group.

  3. Locate the alert with alertname="WebServer1Down", and click the Silence button for that alert.

  4. Fill out the Creator and Comment fields, set the Duration to 2h to cover the two hours requested in the scenario, and then click Create.

  5. Return to the main Alertmanager page. The WebServer1Down alert should no longer appear.

Additional Resources

Your company, LimeDrop, is using Alertmanager to handle Prometheus alerts. Alertmanager is set up to issue alerts when there are problems with the company's main website. The website is currently experiencing issues. Since you are the expert on Prometheus, the admin team has asked you to perform some tweaks in Alertmanager to make the alerts more relevant and useful.

Implement the following changes in Alertmanager:

  • There is a collection of web servers, and when issues arise there is usually more than one instance that is down at the same time. However, when multiple instances go down, the team gets a separate alert for each instance. Ensure these alerts are combined into a group in Alertmanager so there is only one alert message even if multiple web servers are down. The relevant alerts all have names that match the expression WebServer.*Down.
  • When web servers go down, the website will begin to respond with 502 (Bad Gateway) error messages. This message triggers an additional alert. However, when there are already alerts about the servers being down, this additional alert is unnecessary and distracting. Configure Alertmanager to inhibit the alert named WebBadGateway whenever any of the WebServer.*Down alerts are firing.
  • Web Server 1 is repeatedly going down and then recovering, resulting in multiple alert notifications. Temporarily silence the WebServer1Down alert for two hours.
