
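Overview #
Prometheus is an open-source monitoring and alerting toolkit that scrapes metrics over HTTP, stores them as time series, and queries them with PromQL.
Core Components #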

    Prometheus Server
    ├── Time Series Database (TSDB)
    ├── HTTP Server
    └── PromQL Query Engine
    Exporters
    ├── Node Exporter (host metrics)
    ├── cAdvisor (container metrics)
    └── blackbox_exporter (black-box probing)
    Alertmanager
    └── Alert management

Installation and Configuration #
Docker Deployment #

    # docker-compose.yml
    version: '3'
    services:
      prometheus:
        image: prom/prometheus:v2.47.0
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
    volumes:
      prometheus-data:

Configuration File #
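
    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Prometheus scrapes its own metrics endpoint
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      # Node targets are discovered from a file that is re-read periodically
      - job_name: 'node'
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/node.json
            refresh_interval: 1m

Metric Types #
Four Basic Metric Types #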
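| Metric type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing count; resets to zero only on restart | Total requests |
| Gauge | Value that can go up or down | Memory usage |
| Histogram | Observations counted in configurable buckets; quantiles are estimated at query time | Response time |
| Summary | Quantiles computed on the client over a sliding time window | Request latency |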
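
To make the table more concrete, here is a toy Python model of a counter and a histogram. It is only a sketch: the class names `ToyCounter` and `ToyHistogram` are invented for this illustration and are not the API of any Prometheus client library. The point it tries to show is that histogram buckets are cumulative, so each bucket counts every observation less than or equal to its upper bound.

```python
import bisect


class ToyCounter:
    """Hypothetical counter: the value may only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class ToyHistogram:
    """Hypothetical histogram that reports cumulative bucket counts."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per finite bound, plus a final slot for +Inf.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # Bucket bounds are inclusive, so pick the first bound >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        result, running = [], 0
        bounds = self.upper_bounds + [float("inf")]
        for bound, count in zip(bounds, self.counts):
            running += count
            result.append((bound, running))
        return result


requests = ToyCounter()
latency = ToyHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    requests.inc()
    latency.observe(seconds)
print(requests.value, latency.cumulative_buckets())
```

A gauge needs no model of its own: it is a plain number that can be set to any value at any time.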
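Common Metrics #

    # Total HTTP requests (counter)
    http_requests_total
    # CPU time in seconds, per mode (counter; apply rate() to derive utilization)
    node_cpu_seconds_total
    # Available memory in bytes
    node_memory_MemAvailable_bytes
    # Available filesystem space in bytes
    node_filesystem_avail_bytes
    # Bytes received over the network (counter)
    node_network_receive_bytes_total

PromQL Syntax #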
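
PromQL expressions like the ones in this section can be typed into the Prometheus web UI, or sent to the HTTP API at `/api/v1/query`. The sketch below builds such a request with the Python standard library. The helper names `build_query_url` and `run_query` are invented for this example, and the base URL assumes the `9090:9090` port mapping from the Docker setup.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql, time=None):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    params = {"query": promql}
    if time is not None:
        # Optional evaluation timestamp (Unix seconds or RFC 3339).
        params["time"] = str(time)
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(params)


def run_query(base_url, promql):
    """Send the query and return the decoded JSON body.

    This needs a reachable Prometheus server, so it is not called here.
    """
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


print(build_query_url("http://localhost:9090", 'http_requests_total{status="200"}'))
```

URL-encoding the expression matters because PromQL selectors contain braces, quotes, and equals signs.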
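Basic Queries #

    # Current value of every matching series
    http_requests_total
    # Per-second rate, averaged over the last 5 minutes
    rate(http_requests_total[5m])
    # Per-second rate summed across all series, averaged over the last hour
    sum(rate(http_requests_total[1h]))
    # Group by label
    sum by (path) (rate(http_requests_total[5m]))
    # Filter on a label value
    http_requests_total{status="200"}

Common Functions #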
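
The two functions used most often with counters, `rate` and `increase`, can be approximated by hand. The following is a simplified sketch, not Prometheus's actual implementation: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus applies, so real results differ slightly. The function names and the sample data are made up for illustration.

```python
def simple_increase(samples):
    """Total growth of a counter over a list of (timestamp, value) samples.

    A drop in value is treated as a counter reset to zero, so the new
    value itself counts as growth since the reset.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def simple_rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Hypothetical samples scraped every 15 s; the counter resets before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_increase(window), simple_rate(window))
```

Because of the reset handling, the drop from 160 to 10 contributes 10 rather than a negative number, which is why `rate` is only meaningful on counters.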

    # rate: per-second average rate of increase
    rate(http_requests_total[5m])
    # increase: total increase over the range
    increase(http_requests_total[1h])
    # sum: add up all series
    sum(rate(http_requests_total[5m]))
    # avg: average across series
    avg(node_memory_MemAvailable_bytes)
    # topk: the N largest series
    topk(5, rate(http_requests_total[5m]))
    # histogram_quantile: estimate a quantile from histogram buckets
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method))

Practical Examples #
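
As a first worked example, the `histogram_quantile` call from the previous section can be reproduced by hand. The sketch below assumes classic cumulative buckets and linear interpolation inside the bucket that contains the requested rank. The function name `estimate_quantile` and the bucket counts are invented for illustration, and the sketch ignores edge cases that the real function handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls into +Inf: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request durations: 100 observations in total.
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.5, duration_buckets))
print(estimate_quantile(0.95, duration_buckets))
```

The estimate can never be more precise than the bucket layout, so bucket bounds should bracket the latencies that matter for alerting.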
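Node Exporter #

    # Start Node Exporter with access to the host's namespaces and filesystem
    docker run -d \
      --name node-exporter \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      prom/node-exporter:v1.6.1 \
      --path.rootfs=/host  # read host metrics from the mounted root filesystem

Monitoring Nginx #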
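
    # nginx-servicemonitor.yaml
    # ServiceMonitor is a custom resource provided by the Prometheus Operator on Kubernetes
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: nginx-monitor
    spec:
      selector:
        matchLabels:
          app: nginx
      endpoints:
        - port: http        # name of the Service port to scrape
          interval: 15s
          path: /metrics

Alerting Rules #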

    # alert-rules.yml
    groups:
      - name: example
        rules:
          - alert: HighRequestRate
            expr: sum(rate(http_requests_total[5m])) > 1000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High request rate"
              description: "Request rate is above 1000/s"
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "High CPU usage"
              description: "CPU usage is above 80%"

Configuring Alertmanager #
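
Before an alert reaches Alertmanager, the `for` clause in the rules above keeps it in a pending state until the expression has been true for the whole duration. The following sketch models that state machine. The function name `alert_states` and the tick-based evaluation are simplifications chosen for this example, not Prometheus internals.

```python
def alert_states(condition_by_tick, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_tick: one boolean per evaluation, True when the
    alert expression returned a result.
    interval_s: seconds between evaluations (evaluation_interval).
    for_s: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for tick, is_true in enumerate(condition_by_tick):
        now = tick * interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states


# Evaluations every 15 s with a 30 s `for` duration.
print(alert_states([True, True, True, False, True], interval_s=15, for_s=30))
```

A single false evaluation resets the timer, which is why a short `for` combined with a noisy expression tends to produce flapping alerts.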

    # alertmanager.yml
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'severity']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'webhook'
    receivers:
      - name: 'webhook'
        webhook_configs:
          - url: 'http://alertmanager-webhook:5001/webhook'
            send_resolved: true
    # Suppress warning alerts while a critical alert with the same
    # alertname and instance is firing.
    # Newer Alertmanager releases prefer source_matchers / target_matchers.
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']

Grafana Integration #

    # Additions to docker-compose.yml for Grafana:
    # merge into the existing services and volumes sections
    services:
      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        volumes:
          - grafana-data:/var/lib/grafana
        depends_on:
          - prometheus
    volumes:
      grafana-data:

Dashboards #
Common dashboard IDs:
- 1860: Node Exporter Full
- 3662: Prometheus Stats
- 14057: Kubernetes cluster monitoring
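Best Practices #
- Sensible scrape_interval: balance resolution against load
- Alert grouping: avoid alert storms
- Use recording rules: precompute expensive queries
- Alert severity levels: set priorities
- Regular review: remove stale data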
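
The scrape_interval trade-off in the list above can be put into rough numbers. The sketch below uses the sizing rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The default of 2 bytes per sample, the series count, and the function names are assumptions for illustration; real usage depends on the data.

```python
def samples_per_second(active_series, scrape_interval_s):
    """Ingestion rate if every series is scraped once per interval."""
    return active_series / scrape_interval_s


def estimated_disk_bytes(active_series, scrape_interval_s, retention_days,
                         bytes_per_sample=2.0):
    """Rough disk estimate: retention * ingestion rate * bytes per sample.

    bytes_per_sample is an assumed average, not a measured value.
    """
    retention_s = retention_days * 24 * 60 * 60
    rate = samples_per_second(active_series, scrape_interval_s)
    return retention_s * rate * bytes_per_sample


# Hypothetical setup: 10,000 active series kept for 15 days.
for interval in (15, 60):
    size_gb = estimated_disk_bytes(10_000, interval, 15) / 1e9
    print(f"scrape_interval={interval}s -> about {size_gb:.2f} GB")
```

Under these assumptions, moving from a 15s to a 60s interval cuts the ingested samples and the estimated disk usage to a quarter, at the cost of coarser graphs and slower alert detection.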
Summary #
Prometheus is a powerful monitoring system that is well suited to cloud-native environments.