Alerts and Alertmanager in Prometheus

Naveen Singh
6 min read · Oct 13, 2023

Exporters in Prometheus and dashboards in Grafana were covered in an earlier post. We now know that Prometheus scrapes the metrics exposed by exporters on the targets listed in its config file.

Let’s understand what monitoring means from a DevOps and SRE perspective:

  • Monitoring means visualizing the health of the project, environment, servers, and services.
  • If something bad happens, fix it to keep the servers alive.
  • But nobody watches a dashboard 24x7, so how are we going to know that something bad has happened and there is downtime?
  • For this, alerts are created. An alert is basically an expression with a condition, such as:
# monitoring
apache_up # this metric shows the status of the Apache web server in PromQL
# 1 means up
# 0 means down

# alerts (pseudocode)
if (apache_up != 1) {
    printf("Apache is Down")
    # printf("Hey guys, I lost my job due to recession")
}
# the real syntax is not like this; I'm just giving an example of how it works

Alerts

apache_up on http://localhost:9090 (Image 1)

The image above shows the value of the apache_up metric, where 1 means UP and 0 means DOWN. If Apache goes down, my backend server is down and customers will face downtime. So let’s create an alert for this, but before that visit http://localhost:9090/alerts

Steps:

  • Create a new directory named “alerts” under /etc/prometheus, then a file named alerts.yml under the alerts directory.
  • Change the owner and group of both to prometheus (see the commands below).
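A minimal sketch of those two steps as shell commands (assuming Prometheus runs as the prometheus user and group, as in the earlier setup):

sudo mkdir /etc/prometheus/alerts
sudo touch /etc/prometheus/alerts/alerts.yml
sudo chown -R prometheus:prometheus /etc/prometheus/alerts
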
# sudo vim /etc/prometheus/alerts/alerts.yml
# insert the lines below

groups:                                # alerts need to be bundled under a group
  - name: My life My alerts My rules   # name of the group
    rules:                             # under this you can list multiple alerts
      - alert: Apache is Down          # name of the alert
        expr: apache_up == 0           # if the expression matches, the alert triggers
        for: 2m                        # how long the condition must hold before the alert fires
        labels:                        # labels are like tags; you can add your own
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"

# save and close the editor

Let’s understand the above YAML file

  • groups → Think of WhatsApp groups. There can be hundreds of groups, and one of them is named “My life My alerts My rules”.
  • rules → Under the “My life My alerts My rules” group there are multiple members, and one of them is “Apache is Down”, who only messages when “apache_up == 0”. Before sending the message he waits for 2 minutes and keeps checking the condition. After 2 minutes he messages with a summary and a description, along with some labels so his message can be identified easily. Here, the summary and description contain variables whose values are dynamic (check Image 1) and come from the metric’s labels, as shown in the rendered example below.
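
For example, assuming the Apache exporter target carries the hypothetical labels instance="localhost:9117" and job="apache", the annotations would render roughly like this:

# hypothetical labels: instance="localhost:9117", job="apache"
summary:     Apache Down on [localhost:9117]
description: Apache Down on [localhost:9117] for job [apache]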

Now that we have our alert, let’s tell Prometheus to read this file.

# sudo vim /etc/prometheus/prometheus.yml

# you will see a key named rule_files; add the alerts file under it
rule_files:
  - "alerts/alerts.yml"

# sudo systemctl restart prometheus
# it may ask you to run sudo systemctl daemon-reload first
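
Before restarting, you can optionally sanity-check the rule file with promtool, which ships with Prometheus (assuming promtool is on your PATH):

promtool check rules /etc/prometheus/alerts/alerts.yml
# prints the number of rules found, or the exact parse error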

Open http://localhost:9090/alerts again, but before that take a look at my alerts/alerts.yml file:

groups:
  - name: Apache Down Rule
    rules:
      - alert: Apache is Down
        expr: apache_up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"

http://localhost:9090/alerts (Image 2)

From Image 2 above:

  • Inactive → A green box showing alerts whose condition has not been met, so they are not triggered.
  • Pending → As soon as the alert’s expression matches, it moves to Pending and stays there for the “for” duration (2 minutes here).
  • Firing → After those 2 minutes the alert transitions to the Firing state.

It stays in Pending for 2 minutes (Image 3)

Check out the Labels column

Now in firing state (Image 4)

I have added one more rule:

groups:
  - name: Apache Down Rule # I forgot to rename it to Downtime Rule
    rules:
      - alert: Apache is Down
        expr: apache_up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"

      - alert: Target Down
        expr: up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "job [{{ $labels.job }}] is down on [{{ $labels.instance }}]"
          description: "job [{{ $labels.job }}] is down on [{{ $labels.instance }}]"

My custom exporter is not running, and I stopped node-exporter too. We are getting alerts, but how do we notify the team?

Alertmanager

Alertmanager can send notifications via Slack, Gmail (email), PagerDuty, Opsgenie, and other receivers so that alerts can be handled in real time.

Alertmanager needs to be set up separately.

wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

tar xvzf alertmanager-0.26.0.linux-amd64.tar.gz

cd alertmanager-0.26.0.linux-amd64

sudo mkdir /etc/alertmanager

sudo mv alertmanager /usr/bin
sudo mv * /etc/alertmanager

cat <<EOF | sudo tee /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/#alertmanager
After=network-online.target

[Service]
User=root
Group=root
Restart=on-failure
ExecStart=/usr/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

# http://localhost:9093

Alertmanager runs on http://localhost:9093, but we haven’t configured Prometheus to talk to it yet. Let’s do that.

# open /etc/prometheus/prometheus.yml in any text editor
# and change the following
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# this block is already in the default prometheus.yml; just uncomment
# the targets entry, update it, and restart Prometheus afterwards
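
You can also validate the whole Prometheus config before restarting (again assuming promtool is on your PATH):

promtool check config /etc/prometheus/prometheus.yml
# checks prometheus.yml plus every rule file it references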

As soon as the alert went into the firing state, I got this in my Alertmanager:

Alert manager dashboard (Image 5)

Well, Alertmanager received the alert, but it doesn’t know where to send it. Now we have to give Alertmanager the required configuration.

Here’s the current configuration of Alertmanager:

Default Alertmanager config (Image 6)

Configure it for Slack:

  • Go to your workspace → Tools & settings → Manage apps → search for Webhook (the Incoming Webhooks app) and add it to a channel.
  • Get the webhook URL, then open alertmanager.yml:
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/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'

route:
  receiver: 'slack-notifications'
  group_interval: 30s    # group_interval and repeat_interval are route-level options
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
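
Before restarting, you can optionally validate this file with amtool, which ships in the same release tarball (in this setup the extra files went to /etc/alertmanager, so the path below is an assumption about where amtool ended up):

/etc/alertmanager/amtool check-config /etc/alertmanager/alertmanager.yml
# reports the receivers and templates it found, or the parse error
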
  • sudo systemctl daemon-reload
  • sudo systemctl restart alertmanager
Notification for Slack

Similarly, for Gmail and PagerDuty you can follow this blog.

Alertmanager also exposes its own metrics at http://localhost:9093/metrics; you can add it as a scrape target in Prometheus, as sketched below.
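
A minimal sketch of that scrape job (the job name is my own choice; it goes under scrape_configs in /etc/prometheus/prometheus.yml):

scrape_configs:
  - job_name: "alertmanager"          # hypothetical job name
    static_configs:
      - targets: ["localhost:9093"]   # Alertmanager's own /metrics endpoint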

Custom Metrics (Recording Rules)

Sometimes we have a long expression that we need to run frequently. For example, node exporter doesn’t provide a metric for how much memory is used, so to get that:

(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3
# this expression gives used memory in GB

Prometheus lets us record such an expression as a new metric (a recording rule) and then use it directly in PromQL. But how?

# sudo mkdir /etc/prometheus/rules && sudo touch /etc/prometheus/rules/rules.yml
# sudo chown -R prometheus:prometheus /etc/prometheus/rules
# open this file with any text editor
groups:
  - name: Used Memory
    rules:
      - record: memory_used_in_GB
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3

# add this file to prometheus.yml under rule_files (see the snippet below)
# sudo systemctl daemon-reload
# sudo systemctl restart prometheus
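
After that change, the rule_files section of prometheus.yml would look roughly like this (assuming you kept the earlier alerts file as well):

rule_files:
  - "alerts/alerts.yml"
  - "rules/rules.yml"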

Similarly, for used disk space:

groups:
  - name: Used Memory
    rules:
      - record: memory_used_in_GB
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3

  - name: used disk space in percentage
    rules:
      - record: node_filesystem_used_space_percentage
        expr: 100 - (100 * node_filesystem_free_bytes / node_filesystem_size_bytes)
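
To tie this back to alerting, here’s a sketch of how such a recorded metric could feed an alert rule (the DiskAlmostFull name and the 90% threshold are my own choices, not part of the original setup):

groups:
  - name: Disk Usage Alerts
    rules:
      - alert: DiskAlmostFull                                # hypothetical alert name
        expr: node_filesystem_used_space_percentage > 90     # threshold is an assumption
        for: 5m
        labels:
          severity: Critical
        annotations:
          summary: "Disk usage above 90% on [{{ $labels.instance }}] ({{ $labels.mountpoint }})"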

Summary

An alert becomes pending when its expression matches, fires after the configured time, and is then sent to Alertmanager, which forwards the notification to the configured receivers. Prometheus also lets you create custom metrics through recording rules.

Give it a try with different receivers too

Thanks for reading!!!
