Alert Configuration
After Prometheus and the related service exporters are deployed, metric data is stored in Prometheus and visualized with Grafana. The following describes how to integrate Alertmanager with PrometheusAlert to define custom alert rules and deliver alert messages.
- Alertmanager: processes alerts generated by Prometheus and routes them to the configured receivers according to its routing rules.
- PrometheusAlert: a central alert message forwarding hub that relays alert messages received from Alertmanager to DingTalk, Enterprise WeChat, Feishu, email, and other channels.
Getting Started with Deployment
This document uses an Enterprise WeChat group bot as the alert receiver example.
- Obtain the webhook address of the Enterprise WeChat bot in advance.
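Before wiring up the whole pipeline, it can help to confirm the webhook itself works. A minimal sketch using the Enterprise WeChat group-bot message API (the `key=******` is a placeholder for your bot's real key, and the response is logged to a temporary file for inspection):

```shell
# Post a plain-text test message to the Enterprise WeChat group bot.
# Replace key=****** with your bot's real key; the fallback echo keeps
# the command usable on hosts without internet access.
{ curl -s -X POST 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******' \
    -H 'Content-Type: application/json' \
    -d '{"msgtype": "text", "text": {"content": "webhook connectivity test"}}' \
    || echo "webhook not reachable"; } | tee /tmp/wx_webhook_test.log
```

A successful call returns `{"errcode":0,"errmsg":"ok"}` and the test message appears in the group chat.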
Deploying PrometheusAlert
- Download PrometheusAlert

```bash
wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip
```

- Unzip PrometheusAlert and move it to the installation directory

```bash
unzip linux.zip
mv linux /usr/local/prometheusalert
```
- Configure start/stop scripts

```bash
cat > /usr/local/prometheusalert/start_prometheusalert.sh <<EOF
cd /usr/local/prometheusalert && nohup ./PrometheusAlert &
EOF
cat > /usr/local/prometheusalert/stop_prometheusalert.sh <<EOF
kill \$(pgrep -f 'PrometheusAlert')
EOF
chmod +x /usr/local/prometheusalert/start_prometheusalert.sh
chmod +x /usr/local/prometheusalert/stop_prometheusalert.sh
```
- Modify the PrometheusAlert configuration file

Edit the `/usr/local/prometheusalert/conf/app.conf` file; by default, only the following settings need to be updated with your actual deployment information:

```ini
#Login username
login_user=prometheusalert
#Login password
login_password=******
#Listening port
httpport = 8085
#Alert message title
title=Mingdao Alert Push
#Default Enterprise WeChat bot address
wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******
```

  - Customize the login username and password as needed
  - The default listening port is 8080, which may conflict with other common services on the server; change it if necessary (the example above uses 8085)
  - Customize the alert message title as needed
  - Set `wxurl` to the actual webhook address of your Enterprise WeChat bot (if using DingTalk, Feishu, etc., follow the default comments in `app.conf`)
  - If the server cannot access the internet, you can specify a forward proxy through the `proxy =` parameter so that messages can still be forwarded to external destinations.
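For example, a `proxy` setting might look like the following (the proxy address and port are placeholders for your actual forward proxy):

```ini
#Forward proxy for outbound webhook delivery (hypothetical address)
proxy = http://10.0.0.2:3128
```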
- Start PrometheusAlert

```bash
cd /usr/local/prometheusalert
chmod +x PrometheusAlert
bash start_prometheusalert.sh
```
- Add PrometheusAlert to startup on boot

```bash
echo "cd /usr/local/prometheusalert && /bin/bash start_prometheusalert.sh" >> /etc/rc.local
chmod +x /etc/rc.local
```
Configuring PrometheusAlert Alert Template
- Log in to PrometheusAlert at `http://$ip:$port`
- Edit a custom template

Enterprise WeChat template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}} ## Prometheus Recovery Message
> Event: **{{$v.labels.alertname}}**
> Alert Level: {{$v.labels.level}}
> Start Time: {{GetCSTtime $v.startsAt}}
> End Time: {{GetCSTtime $v.endsAt}}
> Host: {{$v.labels.instance}}
> <font color="info">**Event Details: {{$v.annotations.description}}**</font>
{{else}} ## Prometheus Alert Message
> Event: **{{$v.labels.alertname}}**
> Alert Level: {{$v.labels.level}}
> Start Time: {{GetCSTtime $v.startsAt}}
> Host: {{$v.labels.instance}}
> <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
{{end}}{{end}}
```
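Once the template is saved, the rendering path can be smoke-tested by posting a minimal Alertmanager-style payload to the template's URL. This is a sketch: the host, port, template name, and webhook key below are placeholders taken from this document's examples.

```shell
# Build a minimal payload in the format Alertmanager sends to webhooks.
cat > /tmp/test-alert.json <<'EOF'
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "TestAlert",
        "level": "Warning",
        "instance": "10.0.0.1:9100"
      },
      "annotations": {"description": "connectivity test"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
EOF
# Post it to the custom template endpoint (replace host/port/key with
# your actual values; this mirrors the url later used in alertmanager.yml).
curl -s -X POST -H 'Content-Type: application/json' \
  -d @/tmp/test-alert.json \
  'http://127.0.0.1:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******' \
  || echo "PrometheusAlert not reachable"
```

If everything is wired up, a rendered "Prometheus Alert Message" should arrive in the Enterprise WeChat group.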
Deploying Alertmanager
- Download the Alertmanager package

```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
```

- Extract and move it to the installation directory

```bash
tar xf alertmanager-0.26.0.linux-amd64.tar.gz
mv alertmanager-0.26.0.linux-amd64 /usr/local/alertmanager
```
- Configure start/stop scripts

```bash
cat > /usr/local/alertmanager/start_alertmanager.sh <<EOF
nohup /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --log.level=debug &
EOF
cat > /usr/local/alertmanager/stop_alertmanager.sh <<EOF
kill \$(pgrep -f '/usr/local/alertmanager/alertmanager')
EOF
chmod +x /usr/local/alertmanager/start_alertmanager.sh
chmod +x /usr/local/alertmanager/stop_alertmanager.sh
```
- Modify the `/usr/local/alertmanager/alertmanager.yml` configuration file

```yaml
route:
  group_by: ['instance']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 4h
  receiver: 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      # The url must be changed to the actual PrometheusAlert receiving address; it can be
      # copied from the custom template's path field in PrometheusAlert. The &at= parameter
      # does not need to be added by default.
      - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'
        send_resolved: true
```

Description of the alertmanager.yml content:

```yaml
route:
  group_by: ['instance']    # Group alerts by the instance label
  group_wait: 30s           # After the first alert of a group arrives, wait 30s so other alerts in the same group can be sent together
  group_interval: 2m        # After a group has been sent, wait at least 2m before sending new alerts for that group
  repeat_interval: 4h       # While an alert in the group keeps firing, resend it at most every 4h
  receiver: 'web.hook.prometheusalert'    # Send alert messages to the receiver named 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'      # The receiver's name
    webhook_configs:
      - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'  # Forward alerts to PrometheusAlert via webhook
        send_resolved: true               # Also send notifications when alerts resolve
```

  - Detailed explanations of each URL parameter can be found in the official PrometheusAlert documentation.
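Before starting Alertmanager, the edited file can be sanity-checked with `amtool check-config`; amtool ships in the same release tarball as the alertmanager binary. A sketch assuming this document's installation paths (it skips gracefully if amtool is absent):

```shell
# Validate alertmanager.yml syntax before starting Alertmanager.
AMTOOL=/usr/local/alertmanager/amtool
CONF=/usr/local/alertmanager/alertmanager.yml
if [ -x "$AMTOOL" ] && [ -f "$CONF" ]; then
  "$AMTOOL" check-config "$CONF" | tee /tmp/amtool_check.log
else
  echo "amtool or alertmanager.yml not found; skipping validation" | tee /tmp/amtool_check.log
fi
```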
- Start Alertmanager

```bash
cd /usr/local/alertmanager
bash start_alertmanager.sh
```

- Add Alertmanager to startup on boot

```bash
echo "cd /usr/local/alertmanager && /bin/bash start_alertmanager.sh" >> /etc/rc.local
chmod +x /etc/rc.local
```
Configuring Prometheus Alert Rules
- Modify prometheus.yaml

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.206.0.6:9093
rule_files:
  - 'alert_rules/*.yml'
```

  - Change the targets value to the actual Alertmanager service address
- Create the directory for storing alert rule files

```bash
mkdir /usr/local/prometheus/alert_rules
```
- Create alert rule files

Server host:

```bash
vim /usr/local/prometheus/alert_rules/host.yml
```

The default alert rules are as follows:

```yaml
groups:
  # CPU usage above 90% for 3m
  - name: HostCPU
    rules:
      - alert: High CPU Usage
        expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Current server CPU usage is high, current rate: {{ $value | printf \"%.2f\" }}%"
  # Memory usage above 90% for 3m
  - name: HostMEM
    rules:
      - alert: High Memory Usage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Current server memory usage is high, current rate: {{ $value | printf \"%.2f\" }}%"
  # Usage of the / and /data mount points above 90% for 3m
  - name: Disk
    rules:
      - alert: High Disk Usage
        expr: 100 * ((node_filesystem_size_bytes{fstype=~"xfs|ext4"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes{mountpoint=~"/|/data"}) > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Mount point: {{$labels.mountpoint}}, usage rate: {{ $value | printf \"%.2f\" }}%"
```

Microservices containers:

```bash
vim /usr/local/prometheus/alert_rules/service.yml
```

The default alert rules are as follows:

```yaml
groups:
  # Container CPU usage above 500% for 3m
  - name: ServiceCPU
    rules:
      - alert: High Microservices Container CPU Usage
        expr: irate(container_cpu_usage_seconds_total{container!="",pod!="",namespace="default"}[2m])*100 > 500
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
  # Container memory usage above 5G for 3m
  - name: ServiceMEM
    rules:
      - alert: High Microservices Container Memory Usage
        expr: container_memory_working_set_bytes{namespace="default"} / 1073741824 > 5
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```

  - In real environments you may need to adjust these default rules. Rules use PromQL expressions to query data from Prometheus; the additional alert rules later in this document can serve as a starting point for further tuning.
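Rule syntax can be validated with `promtool check rules` before reloading; promtool ships in the Prometheus release tarball. The paths below follow this document's layout, and the check is skipped gracefully when promtool is absent:

```shell
# Validate alert rule file syntax before reloading Prometheus.
PROMTOOL=/usr/local/prometheus/promtool
RULES=/usr/local/prometheus/alert_rules
if [ -x "$PROMTOOL" ] && [ -d "$RULES" ]; then
  "$PROMTOOL" check rules "$RULES"/*.yml | tee /tmp/promtool_check.log
else
  echo "promtool not found; skipping rule check" | tee /tmp/promtool_check.log
fi
```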
- Reload Prometheus to apply the modified configuration
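One way to reload is sending SIGHUP to the Prometheus process, which makes it re-read its configuration and rule files without a restart. A sketch assuming the process layout used above:

```shell
# Reload Prometheus via SIGHUP. The [p] in the pattern stops pgrep from
# matching this command itself; the outcome is logged for inspection.
PID=$(pgrep -f '/usr/local/prometheus/[p]rometheus' | head -n1)
if [ -n "$PID" ]; then
  kill -HUP "$PID" && echo "Prometheus reloaded" | tee /tmp/prom_reload.log
else
  echo "Prometheus process not found; start it first" | tee /tmp/prom_reload.log
fi
# Alternative when Prometheus runs with --web.enable-lifecycle:
#   curl -X POST http://localhost:9090/-/reload
```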
More Help
Alert Templates
Feishu
Feishu template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}<font color="green">**Recovery Alert Information**</font>
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
End Time: {{GetCSTtime $v.endsAt}}
Host: {{$v.labels.instance}}
<font color="green">**Event Details: {{$v.annotations.description}}**</font>
{{else}}**Alert Information**
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
Host: {{$v.labels.instance}}
<font color="red">**Event Details: {{$v.annotations.description}}**</font>
{{end}}
{{ end }}
```
DingTalk
DingTalk template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
#### Prometheus Recovery Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- End Time: {{GetCSTtime $v.endsAt}}
- Host: {{$v.labels.instance}}
- <font color="info">**Event Details**: {{$v.annotations.description}}</font>
{{else}}
#### Prometheus Alert Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- Host: {{$v.labels.instance}}
- <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
{{end}}{{end}}
```
Email
Modify the PrometheusAlert service's `conf/app.conf` to configure the email service as follows:

```ini
#---------------------↓ Email Configuration -----------------------
#Enable Email
open-email=1
#SMTP server address
Email_host=smtp.qq.com
#SMTP server port
Email_port=465
#Email account
Email_user=123456789@qq.com
#Email password
Email_password=xxxxxxx
#Email title
Email_title=Ops Alert
#Default recipient email list
Default_emails=123456@qq.com,123456@baidu.com
```
Email template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
<h3> Prometheus Recovery Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End Time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{else}}
<h3> Prometheus Alert Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{end}}{{end}}
```
Alert Rules
Exclude containers such as worksheetonlyworkflow and basic; alert when any other container's CPU usage stays above 500% for 3m:

```yaml
- name: ServiceCPU-Rule-Exclude
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container!~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 500
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
```

Only the specified containers (worksheetonlyworkflow, basic, etc.); alert when CPU usage stays above 800% for 3m:

```yaml
- name: ServiceCPU-Rule-Specify
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container=~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 800
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
```

Exclude containers such as basic and basiconlyworkflow; alert when any other container's memory usage stays above 5G for 3m:

```yaml
- name: ServiceMemory-Rule-Exclude
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container!~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline|workflowconsumer|workflowrouterconsumer",namespace="default"} / 1073741824 > 5
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```

Only the specified containers (basic, basiconlyworkflow, etc.); alert when memory usage stays above 8G for 3m:

```yaml
- name: ServiceMemory-Rule-Specify
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container=~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline",namespace="default"} / 1073741824 > 8
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```