
Prometheus Monitoring Stack Deployment

Monitoring System Components

Service              Port    Purpose
node_exporter        59100   Collects server runtime metrics
cadvisor             59101   Collects container runtime metrics
kafka_exporter       59102   Collects Kafka topic metrics
kube-state-metrics   30686   Collects k8s cluster metrics
prometheus           9090    Collects and stores monitoring data
grafana              3000    Visualizes monitoring data

node_exporter must be deployed on every server.

cadvisor only needs to be deployed on servers that run Docker, such as the nodes of the file cluster.

kafka_exporter only needs to be deployed on any single node of the Kafka cluster.

kube-state-metrics only needs to be deployed on one k8s master node, but the image must be pulled on every node in the k8s cluster.

prometheus and grafana can be deployed on the same server.

Network port connectivity requirements (a quick check sketch follows this list):

  • The prometheus server must be able to reach port 59100 on every server running node_exporter
  • The prometheus server must be able to reach port 59101 on every server running cadvisor
  • The prometheus server must be able to reach port 59102 on the server running kafka_exporter
  • The prometheus server must be able to reach ports 6443 and 30686 on the k8s master server
  • The grafana server must be able to reach port 9090 on the prometheus server
  • If a reverse proxy is configured for the grafana address, then:
    • The proxy server must be able to reach port 3000 on the grafana server
    • If prometheus is also reverse proxied, port 9090 on the prometheus server must be reachable as well
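
A quick way to test these ports from the prometheus server (a minimal sketch using bash's built-in /dev/tcp; the targets below are placeholders for your actual hosts):

    # Prints OK or FAIL for each host:port pair
    for target in 192.168.10.20:59100 192.168.10.16:59101 192.168.10.7:59102 192.168.10.20:6443; do
      host=${target%%:*}; port=${target##*:}
      timeout 3 bash -c "</dev/tcp/${host}/${port}" && echo "${target} OK" || echo "${target} FAIL"
    done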

Deploy node_exporter

  1. Download the node_exporter package

    wget https://pdpublic.mingdao.com/private-deployment/offline/common/node_exporter-1.9.1.linux-amd64.tar.gz
  2. Extract node_exporter

    tar xf node_exporter-1.9.1.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/node_exporter-1.9.1.linux-amd64 /usr/local/node_exporter
  3. Create the systemd unit file for node_exporter

    cat > /etc/systemd/system/node_exporter.service <<'EOF'
    [Unit]
    Description=Node Exporter for Prometheus
    Documentation=https://github.com/prometheus/node_exporter
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/node_exporter/node_exporter --web.listen-address=:59100
    User=root
    Group=root
    Restart=always
    RestartSec=10
    LimitNOFILE=102400

    [Install]
    WantedBy=multi-user.target
    EOF
  4. Start node_exporter

    systemctl daemon-reload
    systemctl enable node_exporter
    systemctl start node_exporter
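    • To verify, confirm the service is active and that metrics are served on port 59100:

      systemctl status node_exporter --no-pager
      curl -s http://127.0.0.1:59100/metrics | head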

Deploy cadvisor

  1. Download

    wget https://pdpublic.mingdao.com/private-deployment/offline/common/cadvisor-v0.52.1-linux-amd64
  2. Create the cadvisor directory

    mkdir /usr/local/cadvisor
  3. Move the binary and make it executable

    mv cadvisor-v0.52.1-linux-amd64 /usr/local/cadvisor/cadvisor
    chmod +x /usr/local/cadvisor/cadvisor
  4. Create the systemd unit file for cadvisor

    cat > /etc/systemd/system/cadvisor.service <<'EOF'
    [Unit]
    Description=cAdvisor Container Monitoring
    Documentation=https://github.com/google/cadvisor
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/cadvisor/cadvisor -port=59101
    User=root
    Group=root
    Restart=always
    RestartSec=10
    LimitNOFILE=102400

    [Install]
    WantedBy=multi-user.target
    EOF
  5. Start cadvisor

    systemctl daemon-reload
    systemctl enable cadvisor
    systemctl start cadvisor
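    • To verify, fetch a few container metrics locally on port 59101:

      curl -s http://127.0.0.1:59101/metrics | head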

Deploy kafka_exporter

  1. Download the package

    wget https://pdpublic.mingdao.com/private-deployment/offline/common/kafka_exporter-1.9.0.linux-amd64.tar.gz
  2. Extract to the installation directory

    tar -zxvf kafka_exporter-1.9.0.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/kafka_exporter-1.9.0.linux-amd64 /usr/local/kafka_exporter
  3. Create the systemd unit file for kafka_exporter

    # Note: replace the --kafka.server value with your actual Kafka address
    cat > /etc/systemd/system/kafka_exporter.service <<'EOF'
    [Unit]
    Description=Kafka Exporter for Prometheus
    Documentation=https://github.com/danielqsj/kafka_exporter
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/kafka_exporter/kafka_exporter --kafka.server=192.168.1.2:9092 --web.listen-address=:59102
    User=root
    Group=root
    Restart=always
    RestartSec=10
    LimitNOFILE=102400

    [Install]
    WantedBy=multi-user.target
    EOF
  4. Start kafka_exporter

    systemctl daemon-reload
    systemctl enable kafka_exporter
    systemctl start kafka_exporter
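    • To verify, fetch the metrics locally on port 59102; the output should include kafka_ series such as kafka_brokers:

      curl -s http://127.0.0.1:59102/metrics | grep ^kafka_ | head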

Deploy kube-state-metrics

  1. Pull the image (every node in the k8s cluster must pull it)

    crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0
  2. Create a directory for the manifest files

    mkdir -p /usr/local/kubernetes/ops-monit
    cd /usr/local/kubernetes/ops-monit
  3. Write the deployment manifests

    cat > cluster-role-binding.yaml <<\EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kube-state-metrics
    subjects:
    - kind: ServiceAccount
      name: kube-state-metrics
      namespace: ops-monit
    EOF

    cat > cluster-role.yaml <<\EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
      verbs:
      - list
      - watch
    - apiGroups:
      - extensions
      resources:
      - daemonsets
      - deployments
      - replicasets
      - ingresses
      verbs:
      - list
      - watch
    - apiGroups:
      - apps
      resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
      verbs:
      - list
      - watch
    - apiGroups:
      - batch
      resources:
      - cronjobs
      - jobs
      verbs:
      - list
      - watch
    - apiGroups:
      - autoscaling
      resources:
      - horizontalpodautoscalers
      verbs:
      - list
      - watch
    - apiGroups:
      - authentication.k8s.io
      resources:
      - tokenreviews
      verbs:
      - create
    - apiGroups:
      - authorization.k8s.io
      resources:
      - subjectaccessreviews
      verbs:
      - create
    - apiGroups:
      - policy
      resources:
      - poddisruptionbudgets
      verbs:
      - list
      - watch
    - apiGroups:
      - certificates.k8s.io
      resources:
      - certificatesigningrequests
      verbs:
      - list
      - watch
    - apiGroups:
      - storage.k8s.io
      resources:
      - storageclasses
      - volumeattachments
      verbs:
      - list
      - watch
    - apiGroups:
      - admissionregistration.k8s.io
      resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
      verbs:
      - list
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - networkpolicies
      verbs:
      - list
      - watch
    EOF

    cat > deployment.yaml <<\EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/name: kube-state-metrics
      template:
        metadata:
          labels:
            app.kubernetes.io/name: kube-state-metrics
            app.kubernetes.io/version: v2.3.0
        spec:
          containers:
          - image: registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0
            livenessProbe:
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              timeoutSeconds: 5
            name: kube-state-metrics
            ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
            readinessProbe:
              httpGet:
                path: /
                port: 8081
              initialDelaySeconds: 5
              timeoutSeconds: 5
          nodeSelector:
            kubernetes.io/os: linux
          serviceAccountName: kube-state-metrics
    EOF

    cat > service-account.yaml <<\EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    EOF

    cat > service.yaml <<\EOF
    apiVersion: v1
    kind: Service
    metadata:
      # annotations:
      #   prometheus.io/scrape: 'true'
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    spec:
      ports:
      - name: http-metrics
        port: 8080
        targetPort: http-metrics
        nodePort: 30686
      - name: telemetry
        port: 8081
        targetPort: telemetry
      type: NodePort
      selector:
        app.kubernetes.io/name: kube-state-metrics
    EOF

    # Create the Kubernetes service account used by the cadvisor scrape job
    cat > rbac.yaml <<\EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: kube-system

    ---
    # Create the Secret bound to the prometheus service account
    # (manual binding is required since Kubernetes 1.24)
    apiVersion: v1
    kind: Secret
    type: kubernetes.io/service-account-token
    metadata:
      name: prometheus
      namespace: kube-system
      annotations:
        kubernetes.io/service-account.name: "prometheus"
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - "extensions"
      resources:
      - ingresses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - configmaps
      - nodes/metrics
      verbs:
      - get
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: kube-system
    EOF
  4. Create the namespace

    kubectl create namespace ops-monit
  5. Deploy the monitoring resources

    kubectl apply -f .
  6. Retrieve the token

    # Read the token from the prometheus Secret created in rbac.yaml above
    kubectl get secret prometheus -n kube-system -o jsonpath='{.data.token}' | base64 -d
    • Copy the token output into the file /usr/local/prometheus/privatedeploy_kubernetes.token on the prometheus server
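    • To verify, check that the pod is running and that the NodePort serves metrics (192.168.10.20 is a placeholder for your k8s master address):

      kubectl get pods -n ops-monit
      curl -s http://192.168.10.20:30686/metrics | head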

Deploy prometheus

  1. Download the prometheus package

    wget https://pdpublic.mingdao.com/private-deployment/offline/common/prometheus-3.5.0.linux-amd64.tar.gz
  2. Extract

    tar -zxvf prometheus-3.5.0.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/prometheus-3.5.0.linux-amd64 /usr/local/prometheus
  3. Configure the prometheus.yml file

    global:
      scrape_interval: 15s

    scrape_configs:
      # Server monitoring
      - job_name: "node_exporter"
        static_configs:
          - targets: ["192.168.10.20:59100"]
            labels:
              nodename: hap-nginx-01
              origin_prometheus: node
          - targets: ["192.168.10.21:59100"]
            labels:
              nodename: hap-k8s-service-01
              origin_prometheus: node
          - targets: ["192.168.10.2:59100"]
            labels:
              nodename: hap-k8s-service-02
              origin_prometheus: node
          - targets: ["192.168.10.3:59100"]
            labels:
              nodename: hap-middleware-01
              origin_prometheus: node
          - targets: ["192.168.10.3:59100"]
            labels:
              nodename: hap-db-01
              origin_prometheus: node

      # Docker monitoring
      - job_name: "cadvisor"
        static_configs:
          - targets:
              - 192.168.10.16:59101

      # Kafka monitoring
      - job_name: kafka_exporter
        static_configs:
          - targets: ["192.168.10.7:59102"]

      # k8s monitoring
      - job_name: privatedeploy_kubernetes_metrics
        static_configs:
          - targets: ["192.168.10.20:30686"] # Replace with your k8s master node address
            labels:
              origin_prometheus: kubernetes

      - job_name: 'privatedeploy_kubernetes_cadvisor'
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
        kubernetes_sd_configs:
          - role: node
            api_server: https://192.168.10.20:6443 # Replace with your k8s master node address
            bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
            tls_config:
              insecure_skip_verify: true
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: 192.168.10.20:6443 # Replace with your k8s master node address
          - target_label: origin_prometheus
            replacement: kubernetes
          - source_labels: [__meta_kubernetes_node_name]
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        metric_relabel_configs:
          - source_labels: [instance]
            separator: ;
            regex: (.+)
            target_label: node
            replacement: $1
            action: replace
          - source_labels: [pod_name]
            separator: ;
            regex: (.+)
            target_label: pod
            replacement: $1
            action: replace
          - source_labels: [container_name]
            separator: ;
            regex: (.+)
            target_label: container
            replacement: $1
            action: replace
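
    • Optionally, validate the configuration with the promtool binary shipped in the prometheus tarball before starting the service:

      /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml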

  4. Create the systemd unit file for prometheus

    cat > /etc/systemd/system/prometheus.service <<'EOF'
    [Unit]
    Description=Prometheus Monitoring System
    Documentation=https://prometheus.io/docs/introduction/overview/
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/prometheus/prometheus \
      --storage.tsdb.path=/data/prometheus/data \
      --storage.tsdb.retention.time=30d \
      --config.file=/usr/local/prometheus/prometheus.yml \
      --web.enable-lifecycle
    ExecReload=/usr/bin/curl -X POST http://127.0.0.1:9090/-/reload
    User=root
    Group=root
    Restart=always
    RestartSec=10
    LimitNOFILE=102400

    [Install]
    WantedBy=multi-user.target
    EOF
  5. Start prometheus

    systemctl daemon-reload
    systemctl enable prometheus
    systemctl start prometheus
    • When the prometheus configuration changes, it can be hot-reloaded with systemctl reload prometheus
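    • To verify, query the health endpoint; it should report that the server is healthy:

      curl -s http://127.0.0.1:9090/-/healthy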

Deploy grafana

  1. Download the grafana package

    wget https://pdpublic.mingdao.com/private-deployment/offline/common/grafana_12.1.2_17957162798_linux_amd64.tar.gz
  2. Extract

    tar -xf grafana_12.1.2_17957162798_linux_amd64.tar.gz -C /usr/local/
    mv /usr/local/grafana-12.1.2 /usr/local/grafana
  3. Set the root_url value in /usr/local/grafana/conf/defaults.ini as follows

    root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/

    # One-liner to apply the change
    sed -ri 's#^root_url = .*#root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/#' /usr/local/grafana/conf/defaults.ini
    grep "^root_url" /usr/local/grafana/conf/defaults.ini
  4. Set the serve_from_sub_path value in /usr/local/grafana/conf/defaults.ini as follows

    serve_from_sub_path = true

    # One-liner to apply the change
    sed -ri 's#^serve_from_sub_path = .*#serve_from_sub_path = true#' /usr/local/grafana/conf/defaults.ini
    grep "^serve_from_sub_path" /usr/local/grafana/conf/defaults.ini
    • If you will access the grafana page directly via its IP address rather than through an nginx proxy, also set the domain value to the actual host (see the one-liner sketch below).
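    • For example, assuming grafana will be reached directly at 192.168.1.10 (a placeholder address), a one-liner in the same style:

      sed -ri 's#^domain = .*#domain = 192.168.1.10#' /usr/local/grafana/conf/defaults.ini
      grep "^domain" /usr/local/grafana/conf/defaults.ini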
  5. Create the systemd unit file for grafana

    cat > /etc/systemd/system/grafana.service <<'EOF'
    [Unit]
    Description=Grafana Dashboard
    Documentation=https://grafana.com/docs/
    After=network.target

    [Service]
    Type=simple
    WorkingDirectory=/usr/local/grafana
    ExecStart=/usr/local/grafana/bin/grafana-server web
    User=root
    Group=root
    Restart=always
    RestartSec=10
    LimitNOFILE=102400

    [Install]
    WantedBy=multi-user.target
    EOF
  6. Start grafana

    systemctl daemon-reload
    systemctl enable grafana
    systemctl start grafana
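    • To verify, query grafana's health API locally; it should return a JSON response with "database": "ok":

      curl -s http://127.0.0.1:3000/api/health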

Configure a reverse proxy for grafana

In the nginx reverse proxy configuration file, proxy the grafana page using rules like the following:

upstream grafana {
    server 192.168.1.10:3000;
}

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 80;
    server_name hap.domain.com;
    access_log /data/logs/weblogs/grafana.log main;
    error_log /data/logs/weblogs/grafana.mingdao.net.error.log;

    location /privatedeploy/mdy/monitor/grafana/ {
        #allow 1.1.1.1;
        #deny all;
        proxy_hide_header X-Frame-Options;
        proxy_set_header X-Frame-Options ALLOWALL;
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
        proxy_redirect http://localhost:3000 http://hap.domain.com:80/privatedeploy/mdy/monitor/grafana;
    }

    location /privatedeploy/mdy/monitor/grafana/api/live {
        rewrite ^/(.*) /$1 break;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
    }
}
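
After editing the configuration, validate and reload nginx (a standard check; the exact commands may vary with how nginx was installed):

    nginx -t && nginx -s reload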
  • Once the proxy is configured, the grafana page is available at http://hap.domain.com/privatedeploy/mdy/monitor/grafana, where you can then set up dashboards

    • To proxy prometheus as well, add rules like the following (usually unnecessary; the prometheus UI has no authentication, so take care with access security):

      upstream prometheus {
          server 192.168.1.10:9090;
      }

      location /privatedeploy/mdy/monitor/prometheus {
          rewrite ^/privatedeploy/mdy/monitor/prometheus$ / break;
          rewrite ^/privatedeploy/mdy/monitor/prometheus/(.*)$ /$1 break;
          proxy_pass http://prometheus;
          proxy_redirect /graph /privatedeploy/mdy/monitor/prometheus/graph;
      }