Alert Configuration
After Prometheus and the related service exporters are deployed, metric data is stored in Prometheus and visualized with Grafana. The following describes how to integrate Alertmanager with PrometheusAlert to define custom alert rules and deliver alert messages.
- Alertmanager: processes alerts generated by Prometheus and routes them to the configured receivers according to its routing rules.
- PrometheusAlert: a central alert message forwarding hub that relays alert messages received from Alertmanager to DingTalk, Enterprise WeChat, Feishu, email, and other channels.
Getting Started with Deployment
This document uses an Enterprise WeChat group bot as the alert receiver example.
- Obtain the webhook address of the Enterprise WeChat bot in advance.
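Before wiring up the whole pipeline, it can help to confirm the webhook itself works. A minimal sketch using the Enterprise WeChat group-bot message API (the `key=******` is a placeholder for your bot's real key, and the response is logged to a temporary file for inspection):

```shell
# Post a plain-text test message to the Enterprise WeChat group bot.
# Replace key=****** with your bot's real key; the fallback echo keeps
# the command usable on hosts without internet access.
{ curl -s -X POST 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******' \
    -H 'Content-Type: application/json' \
    -d '{"msgtype": "text", "text": {"content": "webhook connectivity test"}}' \
    || echo "webhook not reachable"; } | tee /tmp/wx_webhook_test.log
```

A successful call returns `{"errcode":0,"errmsg":"ok"}` and the test message appears in the group chat.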
Deploying PrometheusAlert
- Download PrometheusAlert

```bash
wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip
```

- Unzip PrometheusAlert and move it to the installation directory

```bash
unzip linux.zip
mv linux /usr/local/prometheusalert
```
- Configure start/stop scripts

```bash
cat > /usr/local/prometheusalert/start_prometheusalert.sh <<EOF
cd /usr/local/prometheusalert && nohup ./PrometheusAlert &
EOF
cat > /usr/local/prometheusalert/stop_prometheusalert.sh <<EOF
kill \$(pgrep -f 'PrometheusAlert')
EOF
chmod +x /usr/local/prometheusalert/start_prometheusalert.sh
chmod +x /usr/local/prometheusalert/stop_prometheusalert.sh
```
- Modify the PrometheusAlert configuration file

Edit the `/usr/local/prometheusalert/conf/app.conf` file; by default, only the following settings need to be updated with your actual deployment information:

```ini
#Login username
login_user=prometheusalert
#Login password
login_password=******
#Listening port
httpport = 8085
#Alert message title
title=Mingdao Alert Push
#Default Enterprise WeChat bot address
wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******
```

  - Customize the login username and password as needed
  - The default listening port is 8080, which may conflict with other common services on the server; change it if necessary (the example above uses 8085)
  - Customize the alert message title as needed
  - Set `wxurl` to the actual webhook address of your Enterprise WeChat bot (if using DingTalk, Feishu, etc., follow the default comments in `app.conf`)
  - If the server cannot access the internet, you can specify a forward proxy through the `proxy =` parameter so that messages can still be forwarded to external destinations.
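For example, a `proxy` setting might look like the following (the proxy address and port are placeholders for your actual forward proxy):

```ini
#Forward proxy for outbound webhook delivery (hypothetical address)
proxy = http://10.0.0.2:3128
```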
- Start PrometheusAlert

```bash
cd /usr/local/prometheusalert
chmod +x PrometheusAlert
bash start_prometheusalert.sh
```
- Add PrometheusAlert to startup on boot

```bash
echo "cd /usr/local/prometheusalert && /bin/bash start_prometheusalert.sh" >> /etc/rc.local
chmod +x /etc/rc.local
```
Configuring PrometheusAlert Alert Template
- Log in to PrometheusAlert at `http://$ip:$port`
- Edit a custom template

Enterprise WeChat template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}} ## Prometheus Recovery Message
> Event: **{{$v.labels.alertname}}**
> Alert Level: {{$v.labels.level}}
> Start Time: {{GetCSTtime $v.startsAt}}
> End Time: {{GetCSTtime $v.endsAt}}
> Host: {{$v.labels.instance}}
> <font color="info">**Event Details: {{$v.annotations.description}}**</font>
{{else}} ## Prometheus Alert Message
> Event: **{{$v.labels.alertname}}**
> Alert Level: {{$v.labels.level}}
> Start Time: {{GetCSTtime $v.startsAt}}
> Host: {{$v.labels.instance}}
> <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
{{end}}{{end}}
```
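Once the template is saved, the rendering path can be smoke-tested by posting a minimal Alertmanager-style payload to the template's URL. This is a sketch: the host, port, template name, and webhook key below are placeholders taken from this document's examples.

```shell
# Build a minimal payload in the format Alertmanager sends to webhooks.
cat > /tmp/test-alert.json <<'EOF'
{
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "TestAlert",
        "level": "Warning",
        "instance": "10.0.0.1:9100"
      },
      "annotations": {"description": "connectivity test"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
EOF
# Post it to the custom template endpoint (replace host/port/key with
# your actual values; this mirrors the url later used in alertmanager.yml).
curl -s -X POST -H 'Content-Type: application/json' \
  -d @/tmp/test-alert.json \
  'http://127.0.0.1:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******' \
  || echo "PrometheusAlert not reachable"
```

If everything is wired up, a rendered "Prometheus Alert Message" should arrive in the Enterprise WeChat group.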
Deploying Alertmanager
- Download the Alertmanager package

```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
```

- Extract and move it to the installation directory

```bash
tar xf alertmanager-0.26.0.linux-amd64.tar.gz
mv alertmanager-0.26.0.linux-amd64 /usr/local/alertmanager
```
- Configure start/stop scripts

```bash
cat > /usr/local/alertmanager/start_alertmanager.sh <<EOF
nohup /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --log.level=debug &
EOF
cat > /usr/local/alertmanager/stop_alertmanager.sh <<EOF
kill \$(pgrep -f '/usr/local/alertmanager/alertmanager')
EOF
chmod +x /usr/local/alertmanager/start_alertmanager.sh
chmod +x /usr/local/alertmanager/stop_alertmanager.sh
```
- Modify the `/usr/local/alertmanager/alertmanager.yml` configuration file

```yaml
route:
  group_by: ['instance']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 4h
  receiver: 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      # The url must be changed to the actual PrometheusAlert receiving address; it can be
      # copied from the custom template's path field in PrometheusAlert. The &at= parameter
      # does not need to be added by default.
      - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'
        send_resolved: true
```

Description of the alertmanager.yml content:

```yaml
route:
  group_by: ['instance']    # Group alerts by the instance label
  group_wait: 30s           # After the first alert of a group arrives, wait 30s so other alerts in the same group can be sent together
  group_interval: 2m        # After a group has been sent, wait at least 2m before sending new alerts for that group
  repeat_interval: 4h       # While an alert in the group keeps firing, resend it at most every 4h
  receiver: 'web.hook.prometheusalert'    # Send alert messages to the receiver named 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'      # The receiver's name
    webhook_configs:
      - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'  # Forward alerts to PrometheusAlert via webhook
        send_resolved: true               # Also send notifications when alerts resolve
```

  - Detailed explanations of each URL parameter can be found in the official PrometheusAlert documentation.
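Before starting Alertmanager, the edited file can be sanity-checked with `amtool check-config`; amtool ships in the same release tarball as the alertmanager binary. A sketch assuming this document's installation paths (it skips gracefully if amtool is absent):

```shell
# Validate alertmanager.yml syntax before starting Alertmanager.
AMTOOL=/usr/local/alertmanager/amtool
CONF=/usr/local/alertmanager/alertmanager.yml
if [ -x "$AMTOOL" ] && [ -f "$CONF" ]; then
  "$AMTOOL" check-config "$CONF" | tee /tmp/amtool_check.log
else
  echo "amtool or alertmanager.yml not found; skipping validation" | tee /tmp/amtool_check.log
fi
```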
- Start Alertmanager

```bash
cd /usr/local/alertmanager
bash start_alertmanager.sh
```

- Add Alertmanager to startup on boot

```bash
echo "cd /usr/local/alertmanager && /bin/bash start_alertmanager.sh" >> /etc/rc.local
chmod +x /etc/rc.local
```
Configuring Prometheus Alert Rules
- Modify prometheus.yaml

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.206.0.6:9093
rule_files:
  - 'alert_rules/*.yml'
```

  - Change the targets value to the actual Alertmanager service address
- Create the directory for storing alert rule files

```bash
mkdir /usr/local/prometheus/alert_rules
```
- Create alert rule files

Server host:

```bash
vim /usr/local/prometheus/alert_rules/host.yml
```

The default alert rules are as follows:

```yaml
groups:
  # CPU usage above 90% for 3m
  - name: HostCPU
    rules:
      - alert: High CPU Usage
        expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Current server CPU usage is high, current rate: {{ $value | printf \"%.2f\" }}%"
  # Memory usage above 90% for 3m
  - name: HostMEM
    rules:
      - alert: High Memory Usage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Current server memory usage is high, current rate: {{ $value | printf \"%.2f\" }}%"
  # Usage of the / and /data mount points above 90% for 3m
  - name: Disk
    rules:
      - alert: High Disk Usage
        expr: 100 * ((node_filesystem_size_bytes{fstype=~"xfs|ext4"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes{mountpoint=~"/|/data"}) > 90
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Mount point: {{$labels.mountpoint}}, usage rate: {{ $value | printf \"%.2f\" }}%"
```

Microservices containers:

```bash
vim /usr/local/prometheus/alert_rules/service.yml
```

The default alert rules are as follows:

```yaml
groups:
  # Container CPU usage above 500% for 3m
  - name: ServiceCPU
    rules:
      - alert: High Microservices Container CPU Usage
        expr: irate(container_cpu_usage_seconds_total{container!="",pod!="",namespace="default"}[2m])*100 > 500
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
  # Container memory usage above 5G for 3m
  - name: ServiceMEM
    rules:
      - alert: High Microservices Container Memory Usage
        expr: container_memory_working_set_bytes{namespace="default"} / 1073741824 > 5
        for: 3m
        labels:
          level: Warning
        annotations:
          description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```

  - In real environments you may need to adjust these default rules. Rules use PromQL expressions to query data from Prometheus; the additional alert rules later in this document can serve as a starting point for further tuning.
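Rule syntax can be validated with `promtool check rules` before reloading; promtool ships in the Prometheus release tarball. The paths below follow this document's layout, and the check is skipped gracefully when promtool is absent:

```shell
# Validate alert rule file syntax before reloading Prometheus.
PROMTOOL=/usr/local/prometheus/promtool
RULES=/usr/local/prometheus/alert_rules
if [ -x "$PROMTOOL" ] && [ -d "$RULES" ]; then
  "$PROMTOOL" check rules "$RULES"/*.yml | tee /tmp/promtool_check.log
else
  echo "promtool not found; skipping rule check" | tee /tmp/promtool_check.log
fi
```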
- Reload Prometheus to apply the modified configuration
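One way to reload is sending SIGHUP to the Prometheus process, which makes it re-read its configuration and rule files without a restart. A sketch assuming the process layout used above:

```shell
# Reload Prometheus via SIGHUP. The [p] in the pattern stops pgrep from
# matching this command itself; the outcome is logged for inspection.
PID=$(pgrep -f '/usr/local/prometheus/[p]rometheus' | head -n1)
if [ -n "$PID" ]; then
  kill -HUP "$PID" && echo "Prometheus reloaded" | tee /tmp/prom_reload.log
else
  echo "Prometheus process not found; start it first" | tee /tmp/prom_reload.log
fi
# Alternative when Prometheus runs with --web.enable-lifecycle:
#   curl -X POST http://localhost:9090/-/reload
```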
More Help
Alert Templates
Feishu
Feishu template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}<font color="green">**Recovery Alert Information**</font>
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
End Time: {{GetCSTtime $v.endsAt}}
Host: {{$v.labels.instance}}
<font color="green">**Event Details: {{$v.annotations.description}}**</font>
{{else}}**Alert Information**
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
Host: {{$v.labels.instance}}
<font color="red">**Event Details: {{$v.annotations.description}}**</font>
{{end}}
{{ end }}
```
DingTalk
DingTalk template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
#### Prometheus Recovery Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- End Time: {{GetCSTtime $v.endsAt}}
- Host: {{$v.labels.instance}}
- <font color="info">**Event Details**: {{$v.annotations.description}}</font>
{{else}}
#### Prometheus Alert Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- Host: {{$v.labels.instance}}
- <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
{{end}}{{end}}
```
Email
Modify the PrometheusAlert service's `conf/app.conf` to configure the email service as follows:

```ini
#---------------------↓ Email Configuration -----------------------
#Enable Email
open-email=1
#SMTP server address
Email_host=smtp.qq.com
#SMTP server port
Email_port=465
#Email account
Email_user=123456789@qq.com
#Email password
Email_password=xxxxxxx
#Email title
Email_title=Ops Alert
#Default recipient email list
Default_emails=123456@qq.com,123456@baidu.com
```
Email template content:

```
{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
<h3> Prometheus Recovery Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End Time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{else}}
<h3> Prometheus Alert Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{end}}{{end}}
```
Alert Rules
Exclude containers such as worksheetonlyworkflow and basic; alert when any other container's CPU usage stays above 500% for 3m:

```yaml
- name: ServiceCPU-Rule-Exclude
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container!~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 500
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
```

Only the specified containers (worksheetonlyworkflow, basic, etc.); alert when CPU usage stays above 800% for 3m:

```yaml
- name: ServiceCPU-Rule-Specify
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container=~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 800
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"
```

Exclude containers such as basic and basiconlyworkflow; alert when any other container's memory usage stays above 5G for 3m:

```yaml
- name: ServiceMemory-Rule-Exclude
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container!~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline|workflowconsumer|workflowrouterconsumer",namespace="default"} / 1073741824 > 5
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```

Only the specified containers (basic, basiconlyworkflow, etc.); alert when memory usage stays above 8G for 3m:

```yaml
- name: ServiceMemory-Rule-Specify
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container=~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline",namespace="default"} / 1073741824 > 8
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
```