Alert Configuration

After deploying Prometheus and the related service exporters, monitoring metrics are stored in Prometheus and visualized with Grafana.

The following describes how to integrate Alertmanager with PrometheusAlert to implement custom alert rules and alert message delivery.

  • Alertmanager: Used for processing alerts generated by Prometheus services and routing them to specified receivers according to configured rules.

  • PrometheusAlert: A central alert message forwarding system that supports sending alert messages received from Alertmanager to DingTalk, Enterprise WeChat, Feishu, Email, etc.

Getting Started with Deployment

This document uses an Enterprise WeChat group bot as the alert receiver example.

  • Obtain the webhook address of the Enterprise WeChat bot in advance.

Deploying PrometheusAlert

  1. Download PrometheusAlert

    wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip
  2. Unzip PrometheusAlert and move to the installation directory

    unzip linux.zip
    mv linux /usr/local/prometheusalert
  3. Configure start/stop scripts

    cat > /usr/local/prometheusalert/start_prometheusalert.sh <<EOF
    cd /usr/local/prometheusalert && nohup ./PrometheusAlert &
    EOF

    cat > /usr/local/prometheusalert/stop_prometheusalert.sh <<EOF
    kill \$(pgrep -f 'PrometheusAlert')
    EOF

    chmod +x /usr/local/prometheusalert/start_prometheusalert.sh
    chmod +x /usr/local/prometheusalert/stop_prometheusalert.sh
  4. Modify PrometheusAlert configuration file

    Edit the /usr/local/prometheusalert/conf/app.conf file. In a default deployment, only the following settings need to be updated with your actual environment information:

    #Login username
    login_user=prometheusalert
    #Login password
    login_password=******
    #Listening port
    httpport = 8085
    #Alert message title
    title=Mingdao Alert Push
    #Default Enterprise WeChat bot address
    wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******
    • Customize the login username and password as needed
    • The default listening port is 8080, which may conflict with common services on the server; change it if necessary
    • Customize the alert message title as needed
    • Set the default Enterprise WeChat bot webhook address to your actual address (for DingTalk, Feishu, etc., follow the comments in app.conf)
    • If the server cannot access the internet, specify a forward proxy via the proxy = parameter so that messages can still be forwarded to external destinations
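For the proxy case, the app.conf entry looks like the following (the proxy address here is a hypothetical example; substitute your own):

```ini
# Forward proxy for outbound webhook requests (example address, replace with your own)
proxy = http://192.168.1.100:3128
```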
  5. Start PrometheusAlert

    cd /usr/local/prometheusalert
    chmod +x PrometheusAlert
    bash start_prometheusalert.sh
  6. Add PrometheusAlert to startup on boot

    echo "cd /usr/local/prometheusalert && /bin/bash start_prometheusalert.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
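Note that /etc/rc.local is not enabled on every systemd-based distribution. As an alternative, PrometheusAlert can be managed as a systemd service. A minimal sketch (the unit file name and description are our choice; paths match the installation directory used above):

```ini
# /etc/systemd/system/prometheusalert.service (hypothetical unit name)
[Unit]
Description=PrometheusAlert
After=network.target

[Service]
Type=simple
WorkingDirectory=/usr/local/prometheusalert
ExecStart=/usr/local/prometheusalert/PrometheusAlert
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After creating the unit, enable it with `systemctl daemon-reload && systemctl enable --now prometheusalert`.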

Configuring PrometheusAlert Alert Template

  1. Log in to the PrometheusAlert console at http://$ip:$port, using the login_user and login_password configured in app.conf

  2. Edit a custom template (the template name must match the tpl parameter used in the Alertmanager webhook url, e.g. prometheus-wx)

    Enterprise WeChat Template Content

    {{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}} ## Prometheus Recovery Message
    > Event: **{{$v.labels.alertname}}**
    > Alert Level: {{$v.labels.level}}
    > Start Time: {{GetCSTtime $v.startsAt}}
    > End Time: {{GetCSTtime $v.endsAt}}
    > Host: {{$v.labels.instance}}
    > <font color="info">**Event Details: {{$v.annotations.description}}**</font>
    {{else}} ## Prometheus Alert Message
    > Event: **{{$v.labels.alertname}}**
    > Alert Level: {{$v.labels.level}}
    > Start Time: {{GetCSTtime $v.startsAt}}
    > Host: {{$v.labels.instance}}
    > <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
    {{end}}{{end}}
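The fields the template reads ({{$v.labels.alertname}}, {{$v.startsAt}}, and so on) come from the webhook payload that Alertmanager sends. A minimal sample payload in Alertmanager's version-4 webhook format (all values here are hypothetical) lets you exercise the template before any real alert fires:

```shell
# Write a minimal version-4 Alertmanager webhook payload (hypothetical values).
cat > /tmp/test_alert.json <<'EOF'
{
  "receiver": "web.hook.prometheusalert",
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "High CPU Usage",
        "level": "Warning",
        "instance": "10.206.0.6:9100"
      },
      "annotations": {
        "description": "Current server CPU usage is high, current rate: 95.12%"
      },
      "startsAt": "2024-01-01T08:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
EOF

# POST the payload to the PrometheusAlert endpoint configured in Alertmanager
# (replace the host, port, tpl, and key with your actual values):
# curl -s -X POST -H 'Content-Type: application/json' -d @/tmp/test_alert.json \
#   'http://127.0.0.1:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'
echo "payload written to /tmp/test_alert.json"
```

If the template and webhook address are correct, the bot should post a formatted "Prometheus Alert Message" to the group.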

Deploying Alertmanager

  1. Download Alertmanager package

    wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
  2. Extract and move to the installation directory

    tar xf alertmanager-0.26.0.linux-amd64.tar.gz
    mv alertmanager-0.26.0.linux-amd64 /usr/local/alertmanager
  3. Configure start/stop scripts

    cat > /usr/local/alertmanager/start_alertmanager.sh <<EOF
    nohup /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --log.level=debug &
    EOF

    cat > /usr/local/alertmanager/stop_alertmanager.sh <<EOF
    kill \$(pgrep -f '/usr/local/alertmanager/alertmanager')
    EOF

    chmod +x /usr/local/alertmanager/start_alertmanager.sh
    chmod +x /usr/local/alertmanager/stop_alertmanager.sh
  4. Modify /usr/local/alertmanager/alertmanager.yml configuration file

    route:
      group_by: ['instance']
      group_wait: 30s
      group_interval: 2m
      repeat_interval: 4h
      receiver: 'web.hook.prometheusalert'

    receivers:
      - name: 'web.hook.prometheusalert'
        webhook_configs:
          # The url must be the actual PrometheusAlert receiving address; it can be
          # copied from the custom template's path field in PrometheusAlert. The &at=
          # parameter does not need to be added by default.
          - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******'
            send_resolved: true
    Description of alertmanager.yml content:

    route:
      group_by: ['instance'] # Group alerts by instance
      group_wait: 30s # Wait 30 seconds after the first alert of a group arrives, so other alerts in the same group can be sent together
      group_interval: 2m # After a group has been sent, wait at least 2 minutes before sending newly arrived alerts for that group
      repeat_interval: 4h # While an alert in the group keeps firing, resend it at most every 4 hours
      receiver: 'web.hook.prometheusalert' # Send alert messages to the receiver named 'web.hook.prometheusalert'

    receivers:
      - name: 'web.hook.prometheusalert' # The receiver's name
        webhook_configs:
          - url: 'http://129.211.209.91:8085/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=******' # Use a webhook to send alerts to PrometheusAlert
            send_resolved: true # Also send resolved notifications
  5. Start Alertmanager

    cd /usr/local/alertmanager
    bash start_alertmanager.sh
  6. Add Alertmanager to startup on boot

    echo "cd /usr/local/alertmanager && /bin/bash start_alertmanager.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
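A quick way to confirm Alertmanager came up is its standard health endpoint (assuming the default listen port 9093). The sketch below only probes the endpoint when the process is actually running:

```shell
# Check whether the Alertmanager process is running, then probe its health endpoint.
if pgrep -x alertmanager > /dev/null; then
    status=$(curl -s http://127.0.0.1:9093/-/healthy)
else
    status="alertmanager is not running"
fi
echo "$status"
```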

Configuring Prometheus Alert Rules

  1. Modify prometheus.yaml

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 10.206.0.6:9093

    rule_files:
      - 'alert_rules/*.yml'
    • Modify the targets value to the actual Alertmanager service address
  2. Create the alert rules file storage directory

    mkdir /usr/local/prometheus/alert_rules
  3. Create alert rule files

    Server Host

    vim /usr/local/prometheus/alert_rules/host.yml

    Default alert rules are as follows:

    groups:

      # CPU usage above 90% for 3m
      - name: HostCPU
        rules:
          - alert: High CPU Usage
            expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) > 90
            for: 3m
            labels:
              level: Warning
            annotations:
              description: "Current server CPU usage is high, current rate: {{ $value | printf \"%.2f\" }}%"

      # Memory usage above 90% for 3m
      - name: HostMEM
        rules:
          - alert: High Memory Usage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
            for: 3m
            labels:
              level: Warning
            annotations:
              description: "Current server memory usage is high, current rate: {{ $value | printf \"%.2f\" }}%"

      # Usage of the / and /data mount points above 90% for 3m
      - name: Disk
        rules:
          - alert: High Disk Usage
            expr: 100 * ((node_filesystem_size_bytes{fstype=~"xfs|ext4",mountpoint=~"/|/data"} - node_filesystem_avail_bytes{fstype=~"xfs|ext4",mountpoint=~"/|/data"}) / node_filesystem_size_bytes{fstype=~"xfs|ext4",mountpoint=~"/|/data"}) > 90
            for: 3m
            labels:
              level: Warning
            annotations:
              description: "Mount point: {{$labels.mountpoint}}, usage rate: {{ $value | printf \"%.2f\" }}%"

    Microservices Containers

    vim /usr/local/prometheus/alert_rules/service.yml

    Default alert rules are as follows:

    groups:

      # Container CPU usage above 500% continuously for 3m
      - name: ServiceCPU
        rules:
          - alert: High Microservices Container CPU Usage
            expr: irate(container_cpu_usage_seconds_total{container!="",pod!="",namespace="default"}[2m])*100 > 500
            for: 3m
            labels:
              level: Warning
            annotations:
              description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"

      # Container memory usage above 5G continuously for 3m
      - name: ServiceMEM
        rules:
          - alert: High Microservices Container Memory Usage
            expr: container_memory_working_set_bytes{namespace="default"} / 1073741824 > 5
            for: 3m
            labels:
              level: Warning
            annotations:
              description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"
    • In real environments you may need to adjust the default alert rules. Rules query Prometheus data with PromQL expressions; the additional alert rules in the More Help section below can serve as a starting point for further tuning.
  4. Reload Prometheus to apply the modified configuration
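Before reloading, the rule files can be validated with promtool, which ships in the Prometheus distribution (the path below assumes the installation layout used in this document). The reload itself can be triggered over HTTP only if Prometheus was started with --web.enable-lifecycle; otherwise a SIGHUP makes it re-read its configuration. A guarded sketch:

```shell
# 1) Validate alert rule syntax before asking Prometheus to load it
#    (promtool ships alongside the prometheus binary).
if [ -x /usr/local/prometheus/promtool ]; then
    /usr/local/prometheus/promtool check rules /usr/local/prometheus/alert_rules/*.yml
else
    echo "promtool not found at /usr/local/prometheus/promtool"
fi

# 2a) HTTP reload endpoint -- only if Prometheus runs with --web.enable-lifecycle:
#     curl -X POST http://127.0.0.1:9090/-/reload

# 2b) Otherwise send SIGHUP, which also makes Prometheus reload its configuration.
if pgrep -x prometheus > /dev/null; then
    pkill -HUP -x prometheus
    result="reload signal sent"
else
    result="prometheus process not found"
fi
echo "$result"
```

After the reload, the new rules appear under the Alerts page of the Prometheus web UI.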

More Help

Alert Templates

Feishu

Feishu Template Content

{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}<font color="green">**Recovery Alert Information**</font>
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
End Time: {{GetCSTtime $v.endsAt}}
Host: {{$v.labels.instance}}
<font color="green">**Event Details: {{$v.annotations.description}}**</font>
{{else}}**Alert Information**
Event: **{{$v.labels.alertname}}**
Alert Type: {{$v.status}}
Alert Level: {{$v.labels.level}}
Start Time: {{GetCSTtime $v.startsAt}}
Host: {{$v.labels.instance}}
<font color="red">**Event Details: {{$v.annotations.description}}**</font>
{{end}}
{{ end }}
DingTalk

DingTalk Template Content

{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}} 
#### Prometheus Recovery Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- End Time: {{GetCSTtime $v.endsAt}}
- Host: {{$v.labels.instance}}
- <font color="info">**Event Details**: {{$v.annotations.description}}</font>
{{else}}
#### Prometheus Alert Message
- Event: **{{$v.labels.alertname}}**
- Alert Level: {{$v.labels.level}}
- Start Time: {{GetCSTtime $v.startsAt}}
- Host: {{$v.labels.instance}}
- <font color="warning">**Event Details: {{$v.annotations.description}}**</font>
{{end}}{{end}}
Email

Modify the PrometheusAlert service's conf/app.conf file to configure the email service as follows:

#---------------------↓ Email Configuration -----------------------
#Enable Email
open-email=1
#SMTP server address
Email_host=smtp.qq.com
#SMTP server port
Email_port=465
#Email account
Email_user=123456789@qq.com
#Email password
Email_password=xxxxxxx
#Email title
Email_title=Ops Alert
#Default recipient email list
Default_emails=123456@qq.com,123456@baidu.com

Email Template Content

{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}} 
<h3> Prometheus Recovery Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End Time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{else}}
<h3> Prometheus Alert Message </h3>
<h5>=========start==========</h5>
<h5>Event: {{$v.labels.alertname}}</h5>
<h5>Alert Level: {{$v.labels.level}}</h5>
<h5>Start Time: {{GetCSTtime $v.startsAt}}</h5>
<h5>Host: {{$v.labels.instance}}</h5>
<h5>Event Details: {{$v.annotations.description}}</h5>
<h5>=========end==========</h5>
{{end}}{{end}}

Alert Rules

Exclude containers like worksheetonlyworkflow, basic; other containers with CPU usage over 500% for 3m

- name: ServiceCPU-Rule-Exclude
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container!~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 500
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"

Specify containers like worksheetonlyworkflow, basic; CPU usage over 800% for 3m

- name: ServiceCPU-Rule-Specify
  rules:
    - alert: High Microservices Container CPU Usage
      expr: irate(container_cpu_usage_seconds_total{container=~"worksheetonlyworkflow|basic|basiconlyworkflow|workflowconsumer|command",pod!="",namespace="default"}[2m])*100 > 800
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, CPU usage: {{ $value | printf \"%.2f\" }}%"

Exclude containers like basic, basiconlyworkflow; memory usage over 5G for 3m

- name: ServiceMemory-Rule-Exclude
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container!~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline|workflowconsumer|workflowrouterconsumer",namespace="default"} / 1073741824 > 5
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"

Specify containers like basic, basiconlyworkflow; memory usage over 8G for 3m

- name: ServiceMemory-Rule-Specify
  rules:
    - alert: High Microservices Container Memory Usage
      expr: container_memory_working_set_bytes{container=~"basic|basiconlyworkflow|api|wwwapi|worksheetexcelapi|worksheetexcelapiconsumer|command|datapipeline",namespace="default"} / 1073741824 > 8
      for: 3m
      labels:
        level: Warning
      annotations:
        description: "Container name: {{$labels.pod}}, memory usage: {{ $value | printf \"%.2f\" }}G"