Alerts and Alertmanager in Prometheus
Exporters in Prometheus and dashboards in Grafana were covered in an earlier post. We now know that Prometheus scrapes the metrics exposed by exporters from the targets mentioned in its config file.
Let’s understand what monitoring means from a DevOps and SRE perspective
- Monitoring means visualizing the health of the project, environment, servers, and services.
- If something bad happens, fix it to keep the server alive.
- But no one monitors the dashboard 24x7, so how are we going to know that something bad happened and there’s downtime?
- For this, alerts are created. Alerts are basically an expression with some conditions, such as
# monitoring
apache_up # this metric shows the status of the Apache web server in PromQL
# 1 means up
# 0 means down
# alerts
if (apache_up != 1){
    printf("Apache is Down")
    # printf("Hey guys, I lost my job due to the recession")
}
# the real syntax is not like this, but this is an example of how alerting works
Alerts
The above image shows the value of the apache_up metric, where 1 means UP and 0 means DOWN. If Apache goes down, my backend server is down and customers will face downtime. So let’s create an alert for this, but before that, visit http://localhost:9090/alerts
Steps:
- Create a new directory named “alerts” under /etc/prometheus, then a file named alerts.yml under the alerts directory.
- Change the owner and group of both to prometheus (see the commands below).
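The commands for those two steps would look something like this (assuming the default /etc/prometheus layout):
sudo mkdir /etc/prometheus/alerts
sudo touch /etc/prometheus/alerts/alerts.yml
sudo chown -R prometheus:prometheus /etc/prometheus/alerts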
# sudo vim /etc/prometheus/alerts/alerts.yml
# insert the below lines
groups: # alerts need to be bundled under a group
  - name: My life My alerts My rules # name of the group
    rules: # under this you can create a list of alerts
      - alert: Apache is Down # name of the alert
        expr: apache_up == 0 # if the expression matches, there will be an alert
        for: 2m # how long the condition must hold before the alert fires
        labels: # labels are just like tags, you can provide your own
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"
# close the editor
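Before touching Prometheus itself, you can validate the rule file with promtool, the CLI that ships with Prometheus (assuming it is on your PATH):
promtool check rules /etc/prometheus/alerts/alerts.yml
# it should report the file as valid and the number of rules found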
Let’s understand the above YAML file
- groups → Think of WhatsApp groups. There can be hundreds of groups, and one of them is named “My life My alerts My rules”.
- rules → Under the “My life My alerts My rules” group there are multiple members, and one of them is “Apache is Down”, who only messages when “apache_up == 0”. Before sending the message he waits for 2 minutes and keeps checking the condition; after 2 minutes he messages with a summary and a description along with some labels, so his message can be identified easily. Here, the summary and description contain variables whose values are dynamic (check Image 1) and come from the metric’s labels (see the rendered example below).
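For example, with hypothetical label values instance="localhost:9117" and job="apache", the annotations would render roughly as:
summary:     Apache Down on [localhost:9117]
description: Apache Down on [localhost:9117] for job [apache]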
Now that we have our alert, let’s tell Prometheus to read this file
# sudo vim /etc/prometheus/prometheus.yml
# you will see a key named rule_files; add your file under it
rule_files:
  - "alerts/alerts.yml"
# sudo systemctl restart prometheus
# it may ask you to run sudo systemctl daemon-reload first
Again open http://localhost:9090/alerts, but before that, take a look at my alerts/alerts.yml file:
groups:
  - name: Apache Down Rule
    rules:
      - alert: Apache is Down
        expr: apache_up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"
From Image 2 (the image above):
- Inactive → A green box that shows alerts that haven’t met the condition to be triggered.
- Pending → As soon as the alert’s expression matches, it moves to Pending and stays there for the “for” duration (2 minutes here).
- Firing → After those 2 minutes the alert transitions to the Firing state.
Check out the Labels column
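You can also watch these state changes from the expression browser using the built-in ALERTS metric, for example:
ALERTS{alertname="Apache is Down"}
# the alertstate label on the result will be either "pending" or "firing"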
I have created another rule
groups:
  - name: Apache Down Rule # I forgot to rename it as Downtime Rule
    rules:
      - alert: Apache is Down
        expr: apache_up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "Apache Down on [{{ $labels.instance }}]"
          description: "Apache Down on [{{ $labels.instance }}] for job [{{ $labels.job }}]"
      - alert: Target Down
        expr: up == 0
        for: 2m
        labels:
          severity: Critical
        annotations:
          summary: "job [{{ $labels.job }}] is down on [{{ $labels.instance }}]"
          description: "job [{{ $labels.job }}] is down on [{{ $labels.instance }}]"
My custom exporter is not running and I stopped node-exporter too. We are getting alerts, but how do we notify the teams?
Alertmanager
Alertmanager provides the ability to send notifications via Slack, Gmail, PagerDuty, Opsgenie, and others so that alerts can be handled in real time.
Alertmanager needs to be set up separately.
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvzf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
sudo mkdir /etc/alertmanager
sudo mv alertmanager /usr/bin
sudo mv * /etc/alertmanager
cat <<EOF | sudo tee /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alert Manager
Documentation=https://prometheus.io/docs/alerting/latest/alertmanager/#alertmanager
After=network-online.target
[Service]
User=root
Group=root
Restart=on-failure
ExecStart=/usr/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
# http://localhost:9093
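To quickly confirm it came up, you can hit Alertmanager’s health endpoint (available on recent versions):
curl http://localhost:9093/-/healthy
# should return an OK response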
Alertmanager runs on http://localhost:9093, but we haven’t configured Prometheus to use it. Let’s do that
# open /etc/prometheus/prometheus.yml in any text editor
# and change the following
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# this block is already there; just uncomment the targets line and update it.
- Restart Prometheus, then go to http://localhost:9090/status; near the bottom you will see an Alertmanagers heading with the specified URL.
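You can also confirm the connection from the Prometheus HTTP API, for example:
curl http://localhost:9090/api/v1/alertmanagers
# the response should list localhost:9093 under activeAlertmanagers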
As soon as the alert went into the Firing state, I got this in my Alertmanager UI.
Well, Alertmanager got the alert, but it doesn’t know where to send it. Now we have to configure Alertmanager with the required configuration.
Here’s the current configuration of Alertmanager:
Configure it for Slack:
- Go to your workspace → Tools & settings → Manage apps → search for Incoming Webhooks.
- Get the webhook URL, then open alertmanager.yml
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/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'
route:
  receiver: 'slack-notifications'
  group_interval: 30s
  repeat_interval: 4h # note: group_interval and repeat_interval belong under route, not global
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
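Before restarting, you can sanity-check this file with amtool, which comes in the same Alertmanager tarball (assuming you placed it somewhere on your PATH):
amtool check-config /etc/alertmanager/alertmanager.yml
# it should report the config as valid along with the receivers it found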
- sudo systemctl daemon-reload
- sudo systemctl restart alertmanager
Similarly, for Gmail and PagerDuty you can follow this blog.
Alertmanager also exposes its own metrics at http://localhost:9093/metrics; you can add it as a target in Prometheus.
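A minimal entry for that under the existing scrape_configs section could look like this (the job name is my own choice):
  - job_name: alertmanager
    static_configs:
      - targets:
          - localhost:9093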
Custom Metrics (Recording Rules)
Sometimes we have a long expression that we need to run frequently. For example, node exporter doesn’t provide a metric for how much memory is used, so to get that:
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3
# this expression gives the used memory in GB
Prometheus allows us to create custom metrics from such expressions, called recording rules, and then use them directly in PromQL. But how?
# sudo mkdir /etc/prometheus/rules && sudo touch /etc/prometheus/rules/rules.yml
# sudo chown prometheus:prometheus -R /etc/prometheus/rules
# open this file with any text editor
groups:
  - name: Used Memory
    rules:
      - record: memory_used_in_GB
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3
# add this file to prometheus.yml under rule_files
# sudo systemctl daemon-reload
# sudo systemctl restart prometheus
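After adding it, the rule_files section of my prometheus.yml looks roughly like this:
rule_files:
  - "alerts/alerts.yml"
  - "rules/rules.yml"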
Similarly for used disk space
groups:
  - name: Used Memory
    rules:
      - record: memory_used_in_GB
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Shmem_bytes) / 1024^3
  - name: used disk space in percentage
    rules:
      - record: node_filesystem_used_space_percentage
        expr: 100 - (100 * node_filesystem_free_bytes / node_filesystem_size_bytes)
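These recorded metrics behave like any other metric, so you can alert on them too. Here’s a sketch of an alert built on the disk-usage rule (the 90% threshold and the names are my own choice):
groups:
  - name: Disk Alerts
    rules:
      - alert: Disk Almost Full
        expr: node_filesystem_used_space_percentage > 90
        for: 5m
        labels:
          severity: Warning
        annotations:
          summary: "Disk usage above 90% on [{{ $labels.instance }}]"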
Summary
An alert fires when its expression matches for the configured duration and is then sent to Alertmanager, which notifies the configured receivers. Prometheus also lets you create custom metrics with recording rules.
Give it a try with different receivers too.
Thanks for reading!!!