
Prometheus Monitoring Stack Deployment

Monitoring System Components

Service              Port    Purpose
node_exporter        59100   Collects host (server) runtime metrics
cadvisor             59101   Collects container runtime metrics
kafka_exporter       59102   Collects Kafka topic metrics
kube-state-metrics   30686   Collects k8s cluster metrics
prometheus           9090    Collects and stores monitoring data
grafana              3000    Visualizes monitoring data

node_exporter must be deployed on every server.

cadvisor only needs to be deployed on servers that run Docker, such as the nodes of the file cluster.

kafka_exporter only needs to be deployed on any single node of the Kafka cluster.

kube-state-metrics only needs to be deployed on one k8s master node, but the image must be pulled on every node in the k8s cluster.

prometheus and grafana can be deployed on the same server.

Network port connectivity requirements:

  • The prometheus server must be able to reach port 59100 on every server running node_exporter
  • The prometheus server must be able to reach port 59101 on every server running cadvisor
  • The prometheus server must be able to reach port 59102 on the server running kafka_exporter
  • The prometheus server must be able to reach ports 6443 and 30686 on the k8s master server
  • The grafana server must be able to reach port 9090 on the prometheus server
  • If a reverse proxy is configured for the grafana address:
    • The proxy server must be able to reach port 3000 on the grafana server
    • If prometheus is also reverse-proxied, port 9090 on the prometheus server must be reachable as well
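Before deploying, these requirements can be spot-checked from the prometheus and grafana servers. A minimal sketch, assuming bash and coreutils are available; the addresses are placeholders for your actual hosts:

    # Hypothetical targets -- substitute the exporter/master addresses used in your environment
    for target in 192.168.10.20:59100 192.168.10.16:59101 192.168.10.7:59102 192.168.10.20:30686; do
      # open a TCP connection with a 3-second timeout using bash's built-in /dev/tcp
      if timeout 3 bash -c "</dev/tcp/${target%:*}/${target#*:}" 2>/dev/null; then
        echo "$target reachable"
      else
        echo "$target NOT reachable"
      fi
    done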

Deploy Node_exporter

  1. Download the node_exporter package

    wget http://pdpublic.mingdao.com/private-deployment/offline/common/node_exporter-1.3.1.linux-amd64.tar.gz
  2. Extract node_exporter

    tar xf node_exporter-1.3.1.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/node_exporter-1.3.1.linux-amd64 /usr/local/node_exporter
  3. Create start/stop scripts

    cat > /usr/local/node_exporter/start_node_exporter.sh <<EOF
    nohup /usr/local/node_exporter/node_exporter --web.listen-address=:59100 &
    EOF

    cat > /usr/local/node_exporter/stop_node_exporter.sh <<EOF
    kill \$(pgrep -f '/usr/local/node_exporter/node_exporter --web.listen-address=:59100')
    EOF

    chmod +x /usr/local/node_exporter/start_node_exporter.sh
    chmod +x /usr/local/node_exporter/stop_node_exporter.sh
  4. Start node_exporter

    cd /usr/local/node_exporter/
    bash start_node_exporter.sh
  5. Enable start on boot

    echo "cd /usr/local/node_exporter/ && /bin/bash start_node_exporter.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
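Once started, node_exporter can be given a quick sanity check on the host itself; this is optional, just a way to confirm the listener answers on port 59100:

    curl -s http://127.0.0.1:59100/metrics | head -n 5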

Deploy Cadvisor

  1. Download

    wget http://pdpublic.mingdao.com/private-deployment/offline/common/cadvisor-v0.47.0-linux-amd64
  2. Create the cadvisor directory

    mkdir /usr/local/cadvisor
  3. Move the binary and make it executable

    mv cadvisor-v0.47.0-linux-amd64 /usr/local/cadvisor/cadvisor
    chmod +x /usr/local/cadvisor/cadvisor
  4. Create start/stop scripts

    cat > /usr/local/cadvisor/start_cadvisor.sh <<EOF
    nohup /usr/local/cadvisor/cadvisor -port 59101 &
    EOF

    cat > /usr/local/cadvisor/stop_cadvisor.sh <<EOF
    kill \$(pgrep -f '/usr/local/cadvisor/cadvisor')
    EOF

    chmod +x /usr/local/cadvisor/start_cadvisor.sh
    chmod +x /usr/local/cadvisor/stop_cadvisor.sh
  5. Start cadvisor

    cd /usr/local/cadvisor
    bash start_cadvisor.sh
  6. Enable start on boot

    echo "cd /usr/local/cadvisor && /bin/bash start_cadvisor.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
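As with node_exporter, an optional local check confirms cadvisor is exporting container metrics on port 59101:

    curl -s http://127.0.0.1:59101/metrics | grep -m 5 '^container_'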

Deploy Kafka_exporter

  1. Download the package

    wget http://pdpublic.mingdao.com/private-deployment/offline/common/kafka_exporter-1.4.2.linux-amd64.tar.gz
  2. Extract to the installation directory

    tar -zxvf kafka_exporter-1.4.2.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/kafka_exporter-1.4.2.linux-amd64 /usr/local/kafka_exporter
  3. Add management scripts

    # Note: replace the Kafka address with the actual IP of your Kafka service
    cat > /usr/local/kafka_exporter/start_kafka_exporter.sh <<EOF
    nohup /usr/local/kafka_exporter/kafka_exporter --kafka.server=192.168.1.2:9092 --web.listen-address=:59102 &
    EOF

    cat > /usr/local/kafka_exporter/stop_kafka_exporter.sh <<EOF
    kill \$(pgrep -f '/usr/local/kafka_exporter/kafka_exporter')
    EOF

    chmod +x /usr/local/kafka_exporter/start_kafka_exporter.sh
    chmod +x /usr/local/kafka_exporter/stop_kafka_exporter.sh
  4. Start the service

    cd /usr/local/kafka_exporter/ && bash start_kafka_exporter.sh
  5. Enable start on boot

    echo "sleep 60; cd /usr/local/kafka_exporter/ && bash start_kafka_exporter.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
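An optional sanity check on the kafka node confirms kafka_exporter can reach the broker and expose topic metrics on port 59102:

    curl -s http://127.0.0.1:59102/metrics | grep -m 5 '^kafka_'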

Deploy Kube-state-metrics

  1. Pull the image (every node in the k8s cluster needs to pull it)

    crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0
  2. Create a directory for the manifests

    mkdir -p /usr/local/kubernetes/ops-monit
    cd /usr/local/kubernetes/ops-monit
  3. Write the deployment manifests

    cat > cluster-role-binding.yaml <<\EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kube-state-metrics
    subjects:
    - kind: ServiceAccount
      name: kube-state-metrics
      namespace: ops-monit
    EOF

    cat > cluster-role.yaml <<\EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
    rules:
    - apiGroups:
      - ""
      resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
      verbs:
      - list
      - watch
    - apiGroups:
      - extensions
      resources:
      - daemonsets
      - deployments
      - replicasets
      - ingresses
      verbs:
      - list
      - watch
    - apiGroups:
      - apps
      resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
      verbs:
      - list
      - watch
    - apiGroups:
      - batch
      resources:
      - cronjobs
      - jobs
      verbs:
      - list
      - watch
    - apiGroups:
      - autoscaling
      resources:
      - horizontalpodautoscalers
      verbs:
      - list
      - watch
    - apiGroups:
      - authentication.k8s.io
      resources:
      - tokenreviews
      verbs:
      - create
    - apiGroups:
      - authorization.k8s.io
      resources:
      - subjectaccessreviews
      verbs:
      - create
    - apiGroups:
      - policy
      resources:
      - poddisruptionbudgets
      verbs:
      - list
      - watch
    - apiGroups:
      - certificates.k8s.io
      resources:
      - certificatesigningrequests
      verbs:
      - list
      - watch
    - apiGroups:
      - storage.k8s.io
      resources:
      - storageclasses
      - volumeattachments
      verbs:
      - list
      - watch
    - apiGroups:
      - admissionregistration.k8s.io
      resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
      verbs:
      - list
      - watch
    - apiGroups:
      - networking.k8s.io
      resources:
      - networkpolicies
      verbs:
      - list
      - watch
    EOF

    cat > deployment.yaml <<\EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/name: kube-state-metrics
      template:
        metadata:
          labels:
            app.kubernetes.io/name: kube-state-metrics
            app.kubernetes.io/version: v2.3.0
        spec:
          containers:
          - image: registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0
            livenessProbe:
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              timeoutSeconds: 5
            name: kube-state-metrics
            ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
            readinessProbe:
              httpGet:
                path: /
                port: 8081
              initialDelaySeconds: 5
              timeoutSeconds: 5
          nodeSelector:
            kubernetes.io/os: linux
          serviceAccountName: kube-state-metrics
    EOF

    cat > service-account.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    EOF

    cat > service.yaml <<\EOF
    apiVersion: v1
    kind: Service
    metadata:
      # annotations:
      #   prometheus.io/scrape: 'true'
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.3.0
      name: kube-state-metrics
      namespace: ops-monit
    spec:
      ports:
      - name: http-metrics
        port: 8080
        targetPort: http-metrics
        nodePort: 30686
      - name: telemetry
        port: 8081
        targetPort: telemetry
      type: NodePort
      selector:
        app.kubernetes.io/name: kube-state-metrics
    EOF

    # Create the Kubernetes service account Prometheus uses to scrape kubelet/cAdvisor metrics
    cat > rbac.yaml <<\EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: kube-system

    ---
    # Create the Secret bound to the prometheus service account
    # (from Kubernetes 1.24 on it has to be created and associated manually)
    apiVersion: v1
    kind: Secret
    type: kubernetes.io/service-account-token
    metadata:
      name: prometheus
      namespace: kube-system
      annotations:
        kubernetes.io/service-account.name: "prometheus"
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - "extensions"
      resources:
      - ingresses
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - configmaps
      - nodes/metrics
      verbs:
      - get
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: kube-system
    EOF
  4. Create the namespace

    kubectl create namespace ops-monit
  5. Deploy the monitoring service

    kubectl apply -f .
  6. Retrieve the token

    # Extract the token from the secret associated with the prometheus service account
    kubectl describe secret $(kubectl describe sa prometheus -n kube-system | sed -n '7p' | awk '{print $2}') -n kube-system | tail -n1 | awk '{print $2}'
    • Copy the token output into the /usr/local/prometheus/privatedeploy_kubernetes.token file on the prometheus server
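Before pointing prometheus at the cluster, it can help to confirm the pod is running, the NodePort answers, and the saved token is accepted by the API server. A sketch, assuming 192.168.10.20 stands in for your k8s master address:

    # check the kube-state-metrics pod and its NodePort
    kubectl get pods -n ops-monit -o wide
    curl -s http://192.168.10.20:30686/metrics | head -n 5

    # confirm the token grants access to the API server (should return a node list, not an auth error)
    TOKEN=$(cat /usr/local/prometheus/privatedeploy_kubernetes.token)
    curl -sk -H "Authorization: Bearer $TOKEN" https://192.168.10.20:6443/api/v1/nodes | head -n 20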

Deploy Prometheus

  1. Download the prometheus package

    wget http://pdpublic.mingdao.com/private-deployment/offline/common/prometheus-2.32.1.linux-amd64.tar.gz
  2. Extract

    tar -zxvf prometheus-2.32.1.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/prometheus-2.32.1.linux-amd64 /usr/local/prometheus
  3. Configure the prometheus.yml file

    global:
      scrape_interval: 15s

    scrape_configs:
      # Host monitoring
      - job_name: "node_exporter"
        static_configs:
          - targets: ["192.168.10.20:59100"]
            labels:
              nodename: service01
              origin_prometheus: node
          - targets: ["192.168.10.21:59100"]
            labels:
              nodename: service02
              origin_prometheus: node
          - targets: ["192.168.10.2:59100"]
            labels:
              nodename: db01
              origin_prometheus: node
          - targets: ["192.168.10.3:59100"]
            labels:
              nodename: db02
              origin_prometheus: node

      # Container monitoring
      - job_name: "cadvisor"
        static_configs:
          - targets:
            - 192.168.10.16:59101
            - 192.168.10.17:59101
            - 192.168.10.18:59101
            - 192.168.10.19:59101

      # Kafka monitoring
      - job_name: kafka_exporter
        static_configs:
          - targets: ["192.168.10.7:59102"]

      # Kubernetes monitoring
      - job_name: privatedeploy_kubernetes_metrics
        static_configs:
          - targets: ["192.168.10.20:30686"] # Replace with the k8s master node address
            labels:
              origin_prometheus: kubernetes

      - job_name: 'privatedeploy_kubernetes_cadvisor'
        scheme: https
        metrics_path: /metrics/cadvisor
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
        kubernetes_sd_configs:
          - role: node
            api_server: https://192.168.10.20:6443 # Replace with the k8s master node address
            bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
            tls_config:
              insecure_skip_verify: true
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: 192.168.10.20:6443 # Replace with the k8s master node address
          - target_label: origin_prometheus
            replacement: kubernetes
          - source_labels: [__meta_kubernetes_node_name]
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        metric_relabel_configs:
          - source_labels: [instance]
            separator: ;
            regex: (.+)
            target_label: node
            replacement: $1
            action: replace
          - source_labels: [pod_name]
            separator: ;
            regex: (.+)
            target_label: pod
            replacement: $1
            action: replace
          - source_labels: [container_name]
            separator: ;
            regex: (.+)
            target_label: container
            replacement: $1
            action: replace

  4. Create start/stop/reload scripts

    cat > /usr/local/prometheus/start_prometheus.sh <<EOF
    nohup /usr/local/prometheus/prometheus --storage.tsdb.path=/data/prometheus/data --storage.tsdb.retention.time=30d --config.file=/usr/local/prometheus/prometheus.yml --web.enable-lifecycle &
    EOF

    cat > /usr/local/prometheus/stop_prometheus.sh <<EOF
    kill \$(pgrep -f '/usr/local/prometheus/prometheus')
    EOF

    cat > /usr/local/prometheus/reload_prometheus.sh <<EOF
    curl -X POST http://127.0.0.1:9090/-/reload
    EOF

    chmod +x /usr/local/prometheus/start_prometheus.sh
    chmod +x /usr/local/prometheus/stop_prometheus.sh
    chmod +x /usr/local/prometheus/reload_prometheus.sh
  5. Start prometheus

    cd /usr/local/prometheus/
    bash start_prometheus.sh
  6. Enable start on boot

    echo "cd /usr/local/prometheus/ && /bin/bash start_prometheus.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
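The configuration and the running server can be verified with the promtool binary shipped in the same tarball and with Prometheus' HTTP API (optional sanity checks):

    # validate prometheus.yml syntax before (re)loading it
    /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml

    # confirm the server is healthy and inspect the state of the scrape targets
    curl -s http://127.0.0.1:9090/-/healthy
    curl -s http://127.0.0.1:9090/api/v1/targets | head -c 500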

Deploy Grafana

  1. Download the grafana package

    wget http://pdpublic.mingdao.com/private-deployment/offline/common/grafana-10.1.1.linux-amd64.tar.gz
  2. Extract

    tar -xf grafana-10.1.1.linux-amd64.tar.gz -C /usr/local/
    mv /usr/local/grafana-10.1.1 /usr/local/grafana
  3. Create start/stop scripts

    cat > /usr/local/grafana/start_grafana.sh <<EOF
    cd /usr/local/grafana && nohup ./bin/grafana-server web &
    EOF

    cat > /usr/local/grafana/stop_grafana.sh <<EOF
    kill \$(pgrep -f 'grafana server web')
    EOF

    chmod +x /usr/local/grafana/start_grafana.sh
    chmod +x /usr/local/grafana/stop_grafana.sh
  4. Change the root_url value in /usr/local/grafana/conf/defaults.ini to the following

    root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/

    # One-liner to apply the change
    sed -ri 's#^root_url = .*#root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/#' /usr/local/grafana/conf/defaults.ini
    grep "^root_url" /usr/local/grafana/conf/defaults.ini
  5. Change the serve_from_sub_path value in /usr/local/grafana/conf/defaults.ini to the following

    serve_from_sub_path = true

    # One-liner to apply the change
    sed -ri 's#^serve_from_sub_path = .*#serve_from_sub_path = true#' /usr/local/grafana/conf/defaults.ini
    grep "^serve_from_sub_path" /usr/local/grafana/conf/defaults.ini
    • If you do not access the grafana UI through an nginx proxy but directly via the grafana server's IP, you also need to set the domain value to the actual host.
  6. Start grafana

    cd /usr/local/grafana/
    bash start_grafana.sh
  7. Enable start on boot

    echo "cd /usr/local/grafana/ && /bin/bash start_grafana.sh" >> /etc/rc.local
    chmod +x /etc/rc.local
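An optional health check confirms grafana is up; the /api/health endpoint needs no login:

    curl -s http://127.0.0.1:3000/api/health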

Configure a Reverse Proxy for Grafana

In the nginx reverse proxy configuration file, use rules like the following to proxy the grafana UI:

upstream grafana {
    server 192.168.1.10:3000;
}

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 80;
    server_name mdy.domain.com;
    access_log /data/logs/weblogs/grafana.log main;
    error_log /data/logs/weblogs/grafana.mingdao.net.error.log;

    location /privatedeploy/mdy/monitor/grafana/ {
        #allow 1.1.1.1;
        #deny all;
        proxy_hide_header X-Frame-Options;
        proxy_set_header X-Frame-Options ALLOWALL;
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
        proxy_redirect http://localhost:3000 http://mdy.domain.com:80/privatedeploy/mdy/monitor/grafana;
    }

    location /privatedeploy/mdy/monitor/grafana/api/live {
        rewrite ^/(.*) /$1 break;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
    }
}
  • Once the proxy is configured, the grafana UI can be reached at http://mdy.domain.com/privatedeploy/mdy/monitor/grafana, where you can then set up dashboards

    • To proxy prometheus as well, rules like the following can be added (usually unnecessary, and since the prometheus UI has no authentication, take care to restrict access)

      upstream prometheus {
          server 192.168.1.10:9090;
      }

      location /privatedeploy/mdy/monitor/prometheus {
          rewrite ^/privatedeploy/mdy/monitor/prometheus$ / break;
          rewrite ^/privatedeploy/mdy/monitor/prometheus/(.*)$ /$1 break;
          proxy_pass http://prometheus;
          proxy_redirect /graph /privatedeploy/mdy/monitor/prometheus/graph;
      }
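After adding the proxy rules, the nginx configuration can be tested and reloaded, and the proxied grafana path checked. A sketch, assuming nginx is in PATH and mdy.domain.com resolves to the proxy server:

    # validate the configuration, reload nginx, then confirm the grafana login page answers through the proxy
    nginx -t && nginx -s reload
    curl -sI http://mdy.domain.com/privatedeploy/mdy/monitor/grafana/login | head -n 1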