A Complete Guide to the Prometheus Monitoring System

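Overview

Prometheus is an open-source monitoring and alerting toolkit. It scrapes metrics from targets over HTTP, stores them in a built-in time series database, and makes them queryable through PromQL.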

Core Components

Prometheus Server
├── Time Series Database (TSDB)
├── HTTP Server
└── PromQL Query Engine

Exporters
├── Node Exporter (host metrics)
├── cAdvisor (container metrics)
└── blackbox_exporter (black-box probing)

Alertmanager
└── Alert grouping, routing, and notification

Installation and Configuration

Docker Deployment

# docker-compose.yml
version: '3'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

volumes:
  prometheus-data:
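
The configuration in this guide refers to files other than prometheus.yml (a file_sd target file under /etc/prometheus/targets, and an alert rule file), and those files are only visible to Prometheus if they are mounted into the container. The following is a sketch of an extended prometheus service, not a required layout: the host-side ./targets directory and the container path for alert-rules.yml are assumptions made for illustration.

```yaml
# docker-compose.yml (prometheus service only, extended sketch)
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Assumed host directory holding file_sd target files such as node.json
      - ./targets:/etc/prometheus/targets:ro
      # Assumed container path for the alert rule file
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # 15d is the Prometheus default; set it explicitly to make retention visible
      - '--storage.tsdb.retention.time=15d'

volumes:
  prometheus-data:
```

Mounting the extra files read-only keeps the container from modifying configuration that lives in version control.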

Configuration File

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node.json
        refresh_interval: 1m
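
The node job reads its targets from a file rather than listing them inline, so targets can be added or removed without restarting Prometheus. File-based discovery accepts both JSON and YAML files. Below is a sketch of the YAML form; using it would mean pointing files at a .yml path, while the node.json referenced above holds the same structure as a JSON array. The hostnames and the env label are hypothetical.

```yaml
# /etc/prometheus/targets/node.yml (hypothetical file; YAML equivalent of node.json)
- targets:
    - 'node1.example.internal:9100'   # 9100 is Node Exporter's default port
    - 'node2.example.internal:9100'
  labels:
    env: 'production'                 # attached to every series scraped from these targets
```

Prometheus watches the file for changes and also re-reads it every refresh_interval.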

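Metric Types

The Four Basic Metric Types

| Type | Description | Example |
|------|-------------|---------|
| Counter | Monotonically increasing value; it only resets to zero when the process restarts | Total requests |
| Gauge | Value that can go up or down | Memory usage |
| Histogram | Observations counted into configurable buckets; quantiles are estimated at query time | Response time |
| Summary | Quantiles computed on the client side over a sliding time window | Request latency |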

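Common Metrics

# Total number of HTTP requests (counter; application-defined metric)
http_requests_total

# CPU time spent in each mode, in seconds (counter; derive utilization with rate())
node_cpu_seconds_total

# Memory available to applications, in bytes
node_memory_MemAvailable_bytes

# Filesystem space available to non-root users, in bytes
node_filesystem_avail_bytes

# Bytes received per network interface (counter)
node_network_receive_bytes_total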

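PromQL Syntax

Basic Queries

# Latest value of every series with this metric name
http_requests_total

# Per-second rate, averaged over the past 5 minutes
rate(http_requests_total[5m])

# Per-second rate averaged over the past hour, summed across all series
sum(rate(http_requests_total[1h]))

# Group by label
sum by (path) (rate(http_requests_total[5m]))

# Filter by label value
http_requests_total{status="200"}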

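Common Functions

# rate - per-second average rate of increase of a counter
rate(http_requests_total[5m])

# increase - total increase of a counter over the range
increase(http_requests_total[1h])

# sum - sum across series
sum(rate(http_requests_total[5m]))

# avg - average across series
avg(node_memory_MemAvailable_bytes)

# topk - the N series with the largest values
topk(5, rate(http_requests_total[5m]))

# histogram_quantile - estimate a quantile from histogram buckets (here the 99th percentile per method)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, method))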

Practical Examples

Node Exporter

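# Start Node Exporter
# The host root filesystem is mounted at /host, so the exporter must be told
# to read from it with --path.rootfs; otherwise it reports the container's view.
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:v1.6.1 \
  --path.rootfs=/host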

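Monitoring Nginx

# nginx-servicemonitor.yaml
# ServiceMonitor is a Prometheus Operator resource, so this manifest applies to a
# Kubernetes cluster running the Operator, not to the Docker Compose setup above.
# Nginx does not serve Prometheus metrics natively; the selected Service is expected
# to front an exporter such as nginx-prometheus-exporter.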
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-monitor
spec:
  selector:
    matchLabels:
      app: nginx
  endpoints:
  - port: http
    interval: 15s
    path: /metrics
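
A ServiceMonitor does not point at pods directly. It selects a Kubernetes Service by label, and its endpoints[].port refers to a named port on that Service. The following sketch shows a Service the manifest above would match. The Service name is hypothetical, and port 9113 assumes the metrics come from nginx-prometheus-exporter listening on its default port.

```yaml
# nginx-service.yaml (hypothetical Service matched by nginx-monitor)
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx          # matched by spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: nginx          # pods backing this Service
  ports:
    - name: http        # matched by endpoints[].port in the ServiceMonitor
      port: 9113        # assumed: nginx-prometheus-exporter default port
      targetPort: 9113
```

If no targets appear, the usual causes are a label mismatch, an unnamed Service port, or a ServiceMonitor created in a namespace the Prometheus resource does not select.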

Alerting Rules

# alert-rules.yml
groups:
  - name: example
    rules:
    - alert: HighRequestRate
      expr: sum(rate(http_requests_total[5m])) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High request rate"
        description: "Request rate has exceeded 1000/s"

    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High CPU usage"
        description: "CPU usage has exceeded 80%"
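
Prometheus only evaluates a rule file that is listed under rule_files in prometheus.yml. Below is a minimal sketch of that addition, assuming alert-rules.yml is available inside the container at /etc/prometheus/alert-rules.yml (a path chosen here for illustration).

```yaml
# prometheus.yml (addition)
rule_files:
  # Assumed container path; glob patterns such as /etc/prometheus/rules/*.yml also work
  - /etc/prometheus/alert-rules.yml
```

Running promtool check rules alert-rules.yml before reloading Prometheus catches syntax and expression errors early.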

Configuring Alertmanager

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'webhook'

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://alertmanager-webhook:5001/webhook'
        send_resolved: true

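# source_matchers/target_matchers replace the deprecated
# source_match/target_match fields (Alertmanager 0.22 and later)
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']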
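
Alertmanager only receives alerts once Prometheus knows where to send them. The sketch below shows the corresponding block in prometheus.yml, assuming Alertmanager is reachable from Prometheus at the hypothetical hostname alertmanager on its default port 9093.

```yaml
# prometheus.yml (addition)
alerting:
  alertmanagers:
    - static_configs:
        # Hypothetical hostname; 9093 is Alertmanager's default port
        - targets: ['alertmanager:9093']
```

With this in place, firing alerts show up in the Alertmanager UI and are routed according to the route tree above.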

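Grafana Integration

# Additions to docker-compose.yml: put grafana under the existing services: key
# and grafana-data under the existing top-level volumes: key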
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  grafana-data:
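
Grafana still needs a data source before dashboards can query Prometheus. It can be added by hand in the UI, or provisioned from a file so the setup is reproducible. Below is a sketch of a provisioning file, assuming it is mounted into the Grafana container under /etc/grafana/provisioning/datasources/ and that the Prometheus service is reachable by the Compose service name prometheus. The file name is hypothetical.

```yaml
# prometheus-datasource.yml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus:9090    # Compose service name and port from the setup above
    isDefault: true
```

Grafana reads provisioning files at startup, so the container has to be restarted after the file changes.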

Dashboards

Commonly used Grafana dashboard IDs:

  • 1860 - Node Exporter Full
  • 3662 - Prometheus 2.0 Overview
  • 14057 - Kubernetes cluster monitoring

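Best Practices

  1. Choose a sensible scrape_interval - balance resolution against scrape load and storage cost
  2. Group alerts - avoid alert storms
  3. Use recording rules - precompute expensive queries
  4. Tier alerts by severity - set priorities
  5. Review regularly - prune unused metrics and rules, and set a retention period (--storage.tsdb.retention.time) so old data expires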
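
To illustrate the recording-rule practice, here is a sketch that precomputes two expressions used earlier in this guide. The file name, group name, and rule names are hypothetical; the rule names follow the level:metric:operations naming convention recommended in the Prometheus documentation. Like any rule file, it would need to be listed under rule_files.

```yaml
# recording-rules.yml (hypothetical file name)
groups:
  - name: precomputed
    rules:
      # Per-path request rate, so dashboards do not recompute it on every refresh
      - record: path:http_requests:rate5m
        expr: sum by (path) (rate(http_requests_total[5m]))

      # CPU utilization per instance as a ratio between 0 and 1
      - record: instance:node_cpu_utilization:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Dashboards and alerts can then query path:http_requests:rate5m directly, which is cheaper than evaluating the full expression on every request.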

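Summary

Prometheus combines pull-based metric collection, the PromQL query language, and Alertmanager-driven alerting into a monitoring system that is well suited to cloud-native environments.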