Prometheus Monitoring Stack Deployment

Monitoring Components

| Service | Port | Purpose |
|---|---|---|
| node_exporter | 59100 | Collects host-level runtime metrics |
| cadvisor | 59101 | Collects container runtime metrics |
| kafka_exporter | 59102 | Collects Kafka topic metrics |
| kube-state-metrics | 30686 | Collects Kubernetes cluster metrics |
| prometheus | 9090 | Scrapes and stores monitoring data |
| grafana | 3000 | Visualizes monitoring data |
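
- node_exporter must be deployed on every server.
- cadvisor is needed only on servers that run docker, such as the nodes of the file cluster.
- kafka_exporter only needs to be deployed on any one node of the kafka cluster.
- kube-state-metrics only needs to be deployed from one k8s master node, but its image must be pulled on every node in the k8s cluster.
- prometheus and grafana can be deployed on the same server.
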
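Network port connectivity requirements:

- The prometheus server must be able to reach port 59100 on every server running node_exporter
- The prometheus server must be able to reach port 59101 on every server running cadvisor
- The prometheus server must be able to reach port 59102 on the server running kafka_exporter
- The prometheus server must be able to reach ports 6443 and 30686 on the k8s master server
- The grafana server must be able to reach port 9090 on the prometheus server
- If a reverse proxy is configured for the grafana address:
  - The proxy server must be able to reach port 3000 on the grafana server
  - If prometheus is reverse-proxied as well, the proxy server must also be able to reach port 9090 on the prometheus server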
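
The connectivity requirements above can be spot-checked from the prometheus server before any scraping is configured. The following is only a sketch: `check_port` is a hypothetical helper name, and it assumes `bash` and the coreutils `timeout` command are installed on the host.

```shell
# check_port HOST PORT
# Hypothetical helper (not part of any tool in this guide): tries to open a TCP
# connection within 3 seconds and prints OK or FAIL for the given endpoint.
# Assumes bash (for its /dev/tcp feature) and coreutils timeout are available.
check_port() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
    return 1
  fi
}
```

For example, `check_port 192.168.10.20 59100` tests the node_exporter port of one host (the address is a placeholder). The function returns non-zero on failure, so it can be called in a loop over all hosts and ports listed above.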

Deploy node_exporter

- Download the node_exporter package:

      wget https://pdpublic.mingdao.com/private-deployment/offline/common/node_exporter-1.9.1.linux-amd64.tar.gz

- Extract node_exporter:

      tar xf node_exporter-1.9.1.linux-amd64.tar.gz -C /usr/local/
      mv /usr/local/node_exporter-1.9.1.linux-amd64 /usr/local/node_exporter

- Write the systemd unit file for node_exporter:

      cat > /etc/systemd/system/node_exporter.service <<'EOF'
      [Unit]
      Description=Node Exporter for Prometheus
      Documentation=https://github.com/prometheus/node_exporter
      After=network.target

      [Service]
      Type=simple
      ExecStart=/usr/local/node_exporter/node_exporter --web.listen-address=:59100
      User=root
      Group=root
      Restart=always
      RestartSec=10
      LimitNOFILE=102400

      [Install]
      WantedBy=multi-user.target
      EOF

- Start node_exporter:

      systemctl daemon-reload
      systemctl enable node_exporter
      systemctl start node_exporter
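
Once the service is started, it can be useful to confirm that the exporter actually serves metrics on its port. The sketch below assumes `curl` is installed; `count_metrics` is a hypothetical helper that counts sample lines with a given name prefix in Prometheus text-format output.

```shell
# count_metrics PREFIX
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# how many lines start with PREFIX. Comment lines begin with "#", so a metric
# name prefix such as "node_" never matches them.
count_metrics() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the helper
  # usable in scripts that run with "set -e".
  grep -c "^$1" || true
}
```

On a host running node_exporter, `curl -s http://127.0.0.1:59100/metrics | count_metrics node_` should print a number greater than zero; a result of 0 suggests the service is not serving metrics on port 59100. The same check can be pointed at cadvisor (port 59101, prefix `container_`) or kafka_exporter (port 59102, prefix `kafka_`).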

Deploy cadvisor

- Download the cadvisor binary:

      wget https://pdpublic.mingdao.com/private-deployment/offline/common/cadvisor-v0.52.1-linux-amd64

- Create the cadvisor directory:

      mkdir -p /usr/local/cadvisor

- Move the binary into place and make it executable:

      mv cadvisor-v0.52.1-linux-amd64 /usr/local/cadvisor/cadvisor
      chmod +x /usr/local/cadvisor/cadvisor

- Write the systemd unit file for cadvisor:

      cat > /etc/systemd/system/cadvisor.service <<'EOF'
      [Unit]
      Description=cAdvisor Container Monitoring
      Documentation=https://github.com/google/cadvisor
      After=network.target

      [Service]
      Type=simple
      ExecStart=/usr/local/cadvisor/cadvisor -port=59101
      User=root
      Group=root
      Restart=always
      RestartSec=10
      LimitNOFILE=102400

      [Install]
      WantedBy=multi-user.target
      EOF

- Start cadvisor:

      systemctl daemon-reload
      systemctl enable cadvisor
      systemctl start cadvisor

Deploy kafka_exporter

- Download the package:

      wget https://pdpublic.mingdao.com/private-deployment/offline/common/kafka_exporter-1.9.0.linux-amd64.tar.gz

- Extract it into the installation directory:

      tar -zxvf kafka_exporter-1.9.0.linux-amd64.tar.gz -C /usr/local/
      mv /usr/local/kafka_exporter-1.9.0.linux-amd64 /usr/local/kafka_exporter

- Write the systemd unit file for kafka_exporter:

      # Replace the --kafka.server value with your actual Kafka address
      cat > /etc/systemd/system/kafka_exporter.service <<'EOF'
      [Unit]
      Description=Kafka Exporter for Prometheus
      Documentation=https://github.com/danielqsj/kafka_exporter
      After=network.target

      [Service]
      Type=simple
      ExecStart=/usr/local/kafka_exporter/kafka_exporter --kafka.server=192.168.1.2:9092 --web.listen-address=:59102
      User=root
      Group=root
      Restart=always
      RestartSec=10
      LimitNOFILE=102400

      [Install]
      WantedBy=multi-user.target
      EOF

- Start kafka_exporter:

      systemctl daemon-reload
      systemctl enable kafka_exporter
      systemctl start kafka_exporter
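
Deploy kube-state-metrics

- Download the image (required on every node in the k8s cluster):

  - If the server has internet access:

        crictl pull registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0

  - If the server has no internet access:

        # Offline image download link; after downloading, upload the file to the deployment server
        wget https://pdpublic.mingdao.com/private-deployment/offline/common/kube-state-metrics.tar.gz
        # Decompress the image file
        gunzip -d kube-state-metrics.tar.gz
        # Import the offline image
        ctr -n k8s.io image import kube-state-metrics.tar

- Create the directory that holds the configuration files:

      mkdir -p /usr/local/kubernetes/ops-monit
      cd /usr/local/kubernetes/ops-monit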

- Write the deployment manifests:

      cat > cluster-role-binding.yaml <<\EOF
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        labels:
          app.kubernetes.io/name: kube-state-metrics
          app.kubernetes.io/version: v2.3.0
        name: kube-state-metrics
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: kube-state-metrics
      subjects:
      - kind: ServiceAccount
        name: kube-state-metrics
        namespace: ops-monit
      EOF

      cat > cluster-role.yaml <<\EOF
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels:
          app.kubernetes.io/name: kube-state-metrics
          app.kubernetes.io/version: v2.3.0
        name: kube-state-metrics
      rules:
      - apiGroups:
        - ""
        resources:
        - configmaps
        - secrets
        - nodes
        - pods
        - services
        - resourcequotas
        - replicationcontrollers
        - limitranges
        - persistentvolumeclaims
        - persistentvolumes
        - namespaces
        - endpoints
        verbs:
        - list
        - watch
      - apiGroups:
        - extensions
        resources:
        - daemonsets
        - deployments
        - replicasets
        - ingresses
        verbs:
        - list
        - watch
      - apiGroups:
        - apps
        resources:
        - statefulsets
        - daemonsets
        - deployments
        - replicasets
        verbs:
        - list
        - watch
      - apiGroups:
        - batch
        resources:
        - cronjobs
        - jobs
        verbs:
        - list
        - watch
      - apiGroups:
        - autoscaling
        resources:
        - horizontalpodautoscalers
        verbs:
        - list
        - watch
      - apiGroups:
        - authentication.k8s.io
        resources:
        - tokenreviews
        verbs:
        - create
      - apiGroups:
        - authorization.k8s.io
        resources:
        - subjectaccessreviews
        verbs:
        - create
      - apiGroups:
        - policy
        resources:
        - poddisruptionbudgets
        verbs:
        - list
        - watch
      - apiGroups:
        - certificates.k8s.io
        resources:
        - certificatesigningrequests
        verbs:
        - list
        - watch
      - apiGroups:
        - storage.k8s.io
        resources:
        - storageclasses
        - volumeattachments
        verbs:
        - list
        - watch
      - apiGroups:
        - admissionregistration.k8s.io
        resources:
        - mutatingwebhookconfigurations
        - validatingwebhookconfigurations
        verbs:
        - list
        - watch
      - apiGroups:
        - networking.k8s.io
        resources:
        - networkpolicies
        verbs:
        - list
        - watch
      EOF

      cat > deployment.yaml <<\EOF
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app.kubernetes.io/name: kube-state-metrics
          app.kubernetes.io/version: v2.3.0
        name: kube-state-metrics
        namespace: ops-monit
      spec:
        replicas: 1
        selector:
          matchLabels:
            app.kubernetes.io/name: kube-state-metrics
        template:
          metadata:
            labels:
              app.kubernetes.io/name: kube-state-metrics
              app.kubernetes.io/version: v2.3.0
          spec:
            containers:
            - image: registry.cn-hangzhou.aliyuncs.com/mdpublic/kube-state-metrics:2.3.0
              livenessProbe:
                httpGet:
                  path: /healthz
                  port: 8080
                initialDelaySeconds: 5
                timeoutSeconds: 5
              name: kube-state-metrics
              ports:
              - containerPort: 8080
                name: http-metrics
              - containerPort: 8081
                name: telemetry
              readinessProbe:
                httpGet:
                  path: /
                  port: 8081
                initialDelaySeconds: 5
                timeoutSeconds: 5
            nodeSelector:
              kubernetes.io/os: linux
            serviceAccountName: kube-state-metrics
      EOF

      cat > service-account.yaml <<\EOF
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        labels:
          app.kubernetes.io/name: kube-state-metrics
          app.kubernetes.io/version: v2.3.0
        name: kube-state-metrics
        namespace: ops-monit
      EOF

      cat > service.yaml <<\EOF
      apiVersion: v1
      kind: Service
      metadata:
        # annotations:
        #   prometheus.io/scrape: 'true'
        labels:
          app.kubernetes.io/name: kube-state-metrics
          app.kubernetes.io/version: v2.3.0
        name: kube-state-metrics
        namespace: ops-monit
      spec:
        ports:
        - name: http-metrics
          port: 8080
          targetPort: http-metrics
          nodePort: 30686
        - name: telemetry
          port: 8081
          targetPort: telemetry
        type: NodePort
        selector:
          app.kubernetes.io/name: kube-state-metrics
      EOF

      # Create the Kubernetes user that prometheus needs for scraping cadvisor metrics
      cat > rbac.yaml <<\EOF
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: prometheus
        namespace: kube-system
      ---
      # Create the Secret bound to the prometheus ServiceAccount
      # (from Kubernetes 1.24 onward it must be created manually)
      apiVersion: v1
      kind: Secret
      type: kubernetes.io/service-account-token
      metadata:
        name: prometheus
        namespace: kube-system
        annotations:
          kubernetes.io/service-account.name: "prometheus"
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: prometheus
      rules:
      - apiGroups:
        - ""
        resources:
        - nodes
        - services
        - endpoints
        - pods
        - nodes/proxy
        verbs:
        - get
        - list
        - watch
      - apiGroups:
        - "extensions"
        resources:
        - ingresses
        verbs:
        - get
        - list
        - watch
      - apiGroups:
        - ""
        resources:
        - configmaps
        - nodes/metrics
        verbs:
        - get
      - nonResourceURLs:
        - /metrics
        verbs:
        - get
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: prometheus
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: prometheus
      subjects:
      - kind: ServiceAccount
        name: prometheus
        namespace: kube-system
      EOF

- Create the namespace:

      kubectl create namespace ops-monit

- Start the monitoring service:

      kubectl apply -f .

- Get the token:

      kubectl describe secret $(kubectl describe sa prometheus -n kube-system | sed -n '7p' | awk '{print $2}') -n kube-system | tail -n1 | awk '{print $2}'

  - Copy the token into the file /usr/local/prometheus/privatedeploy_kubernetes.token on the prometheus server.
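
The pipeline above depends on the exact line layout of `kubectl describe` output. As an alternative sketch, the token can be read directly from the Secret with a JSONPath query. This assumes the Secret is named `prometheus` in the `kube-system` namespace, as defined in rbac.yaml, and that `kubectl` is configured on the machine where it runs; both function names are hypothetical.

```shell
# decode_token: base64-decodes the token data read from stdin.
decode_token() {
  base64 -d
}

# fetch_prometheus_token: hypothetical wrapper that reads the token field of
# the Secret "prometheus" in kube-system and prints the decoded bearer token.
fetch_prometheus_token() {
  kubectl get secret prometheus -n kube-system -o jsonpath='{.data.token}' | decode_token
}
```

Running `fetch_prometheus_token` on the k8s master prints the token, which can then be saved to /usr/local/prometheus/privatedeploy_kubernetes.token on the prometheus server.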

Deploy prometheus

- Download the prometheus package:

      wget https://pdpublic.mingdao.com/private-deployment/offline/common/prometheus-3.5.0.linux-amd64.tar.gz

- Extract it:

      tar -zxvf prometheus-3.5.0.linux-amd64.tar.gz -C /usr/local/
      mv /usr/local/prometheus-3.5.0.linux-amd64 /usr/local/prometheus

- Configure the prometheus.yml file (the addresses below are examples; replace them with your own):

      global:
        scrape_interval: 15s

      scrape_configs:
        # Server monitoring
        - job_name: "node_exporter"
          static_configs:
            - targets: ["192.168.10.20:59100"]
              labels:
                nodename: hap-nginx-01
                origin_prometheus: node
            - targets: ["192.168.10.21:59100"]
              labels:
                nodename: hap-k8s-service-01
                origin_prometheus: node
            - targets: ["192.168.10.2:59100"]
              labels:
                nodename: hap-k8s-service-02
                origin_prometheus: node
            - targets: ["192.168.10.3:59100"]
              labels:
                nodename: hap-middleware-01
                origin_prometheus: node
            - targets: ["192.168.10.3:59100"]
              labels:
                nodename: hap-db-01
                origin_prometheus: node

        # docker monitoring
        - job_name: "cadvisor"
          static_configs:
            - targets:
                - 192.168.10.16:59101

        # kafka monitoring
        - job_name: kafka_exporter
          static_configs:
            - targets: ["192.168.10.7:59102"]

        # k8s monitoring
        - job_name: privatedeploy_kubernetes_metrics
          static_configs:
            - targets: ["192.168.10.20:30686"]  # Replace with the k8s master node address
              labels:
                origin_prometheus: kubernetes

        - job_name: 'privatedeploy_kubernetes_cadvisor'
          scheme: https
          metrics_path: /metrics/cadvisor
          tls_config:
            insecure_skip_verify: true
          bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
          kubernetes_sd_configs:
            - role: node
              api_server: https://192.168.10.20:6443  # Replace with the k8s master node address
              bearer_token_file: /usr/local/prometheus/privatedeploy_kubernetes.token
              tls_config:
                insecure_skip_verify: true
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: 192.168.10.20:6443  # Replace with the k8s master node address
            - target_label: origin_prometheus
              replacement: kubernetes
            - source_labels: [__meta_kubernetes_node_name]
              target_label: __metrics_path__
              replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
          metric_relabel_configs:
            - source_labels: [instance]
              separator: ;
              regex: (.+)
              target_label: node
              replacement: $1
              action: replace
            - source_labels: [pod_name]
              separator: ;
              regex: (.+)
              target_label: pod
              replacement: $1
              action: replace
            - source_labels: [container_name]
              separator: ;
              regex: (.+)
              target_label: container
              replacement: $1
              action: replace
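
Because the static_configs entries are maintained by hand, the same address can end up listed more than once. In the sample above, 192.168.10.3:59100 appears under both hap-middleware-01 and hap-db-01, which may or may not be intended. The sketch below is a rough way to spot such repeats, assuming targets are written inline as `targets: ["host:port"]`; `duplicate_targets` is a hypothetical helper, not a Prometheus tool.

```shell
# duplicate_targets
# Hypothetical helper: reads a prometheus.yml on stdin and prints every
# IPv4 host:port that appears more than once on inline "targets:" lines.
# Limitation: targets written as a multi-line YAML list are not inspected.
duplicate_targets() {
  grep 'targets:' \
    | grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+' \
    | sort \
    | uniq -d
}
```

`duplicate_targets < /usr/local/prometheus/prometheus.yml` prints each repeated address once, or nothing if there are none. Syntax itself can be validated separately with `/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml`; promtool ships in the Prometheus tarball.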

- Write the systemd unit file for prometheus:

      cat > /etc/systemd/system/prometheus.service <<'EOF'
      [Unit]
      Description=Prometheus Monitoring System
      Documentation=https://prometheus.io/docs/introduction/overview/
      After=network.target

      [Service]
      Type=simple
      ExecStart=/usr/local/prometheus/prometheus \
        --storage.tsdb.path=/data/prometheus/data \
        --storage.tsdb.retention.time=30d \
        --config.file=/usr/local/prometheus/prometheus.yml \
        --web.enable-lifecycle
      ExecReload=/usr/bin/curl -X POST http://127.0.0.1:9090/-/reload
      User=root
      Group=root
      Restart=always
      RestartSec=10
      LimitNOFILE=102400

      [Install]
      WantedBy=multi-user.target
      EOF

- Start prometheus:

      systemctl daemon-reload
      systemctl enable prometheus
      systemctl start prometheus

  - After the prometheus configuration is modified, hot-reload it with:

        systemctl reload prometheus

Deploy grafana

- Download the grafana package:

      wget https://pdpublic.mingdao.com/private-deployment/offline/common/grafana_12.1.2_17957162798_linux_amd64.tar.gz

- Extract it:

      tar -xf grafana_12.1.2_17957162798_linux_amd64.tar.gz -C /usr/local/
      mv /usr/local/grafana-12.1.2 /usr/local/grafana

- Set root_url in /usr/local/grafana/conf/defaults.ini to the following value:

      root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/

  To apply the change in one step and verify it:

      sed -ri 's#^root_url = .*#root_url = %(protocol)s://%(domain)s:%(http_port)s/privatedeploy/mdy/monitor/grafana/#' /usr/local/grafana/conf/defaults.ini
      grep "^root_url" /usr/local/grafana/conf/defaults.ini

- Set serve_from_sub_path in /usr/local/grafana/conf/defaults.ini to the following value:

      serve_from_sub_path = true

  To apply the change in one step and verify it:

      sed -ri 's#^serve_from_sub_path = .*#serve_from_sub_path = true#' /usr/local/grafana/conf/defaults.ini
      grep "^serve_from_sub_path" /usr/local/grafana/conf/defaults.ini

  - If the grafana page is accessed directly by the grafana server's IP rather than through an nginx proxy, also set domain to the actual host.

- Write the systemd unit file for grafana:

      cat > /etc/systemd/system/grafana.service <<'EOF'
      [Unit]
      Description=Grafana Dashboard
      Documentation=https://grafana.com/docs/
      After=network.target

      [Service]
      Type=simple
      WorkingDirectory=/usr/local/grafana
      ExecStart=/usr/local/grafana/bin/grafana-server web
      User=root
      Group=root
      Restart=always
      RestartSec=10
      LimitNOFILE=102400

      [Install]
      WantedBy=multi-user.target
      EOF

- Start grafana:

      systemctl daemon-reload
      systemctl enable grafana
      systemctl start grafana

Configure a reverse proxy for the grafana address

Use the following nginx reverse proxy rules as a reference for proxying the grafana page:

    upstream grafana {
        server 192.168.1.10:3000;
    }

    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }

    server {
        listen 80;
        server_name hap.domain.com;
        access_log /data/logs/weblogs/grafana.log main;
        error_log /data/logs/weblogs/grafana.mingdao.net.error.log;

        location /privatedeploy/mdy/monitor/grafana/ {
            #allow 1.1.1.1;
            #deny all;
            proxy_hide_header X-Frame-Options;
            proxy_set_header X-Frame-Options ALLOWALL;
            proxy_set_header Host $http_host;
            proxy_pass http://grafana;
            proxy_redirect http://localhost:3000 http://hap.domain.com:80/privatedeploy/mdy/monitor/grafana;
        }

        location /privatedeploy/mdy/monitor/grafana/api/live {
            rewrite ^/(.*) /$1 break;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header Host $http_host;
            proxy_pass http://grafana;
        }
    }

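- Once the proxy is configured, the grafana page is reachable at http://hap.domain.com/privatedeploy/mdy/monitor/grafana, where you can then set up dashboards.

- To proxy prometheus as well, add the rules below. This is usually unnecessary: the prometheus page has no authentication, so pay attention to access security if you expose it.

      # Goes in the http context, next to the grafana upstream
      upstream prometheus {
          server 192.168.1.10:9090;
      }

      # Goes inside the server block
      location /privatedeploy/mdy/monitor/prometheus {
          rewrite ^/privatedeploy/mdy/monitor/prometheus$ / break;
          rewrite ^/privatedeploy/mdy/monitor/prometheus/(.*)$ /$1 break;
          proxy_pass http://prometheus;
          proxy_redirect /graph /privatedeploy/mdy/monitor/prometheus/graph;
      }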