Prometheus + confd + etcd automatic discovery

  • Architecture

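    1. The Prometheus configuration files are generated by confd from data it reads from etcd.
    2. Metrics are collected by exporters such as node-exporter, kafka-exporter, and mysql-exporter. On startup, each exporter must call the CMDB API to write its own registration data into etcd.
    3. codoon-alert talks to etcd to configure rules, alert silencing, and similar settings.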
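
    As a rough illustration of step 2, the sketch below builds the key and JSON value that a registration would place in etcd. The key prefix and field names follow the etcdctl examples later in this post; the helper name `build_registration` is hypothetical, and the real write goes through the CMDB API, which is not shown here.

```python
import json

# Assumption: same key layout as the etcdctl examples in this post.
DISCOVERY_PREFIX = "/prometheus/discovery/host"

def build_registration(name, address, labels=None):
    """Return the (key, value) pair that describes one exporter in etcd."""
    payload = {"name": name, "address": address}
    if labels:
        # The confd template expects labels as a list of {"key", "val"} objects.
        payload["labels"] = [{"key": k, "val": v} for k, v in labels.items()]
    value = json.dumps(payload, separators=(",", ":"))
    return f"{DISCOVERY_PREFIX}/{name}", value

key, value = build_registration("test2", "10.12.10.1:9092", {"label1": "test1"})
print(key)
print(value)
```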
  • Main configuration file

    global:
      scrape_interval: 10s      # scrape interval
      scrape_timeout: 10s       # scrape timeout
      evaluation_interval: 15s  # rule evaluation interval
    alerting:
      alertmanagers:
      - scheme: http
        timeout: 10s
        api_version: v1
        static_configs:
        - targets:
          - 127.0.0.1:9093
    rule_files:
    - /codoon/prometheus/etc/rules/rule_*.yml
    scrape_configs:
    - job_name: prometheus
      honor_timestamps: true
      scrape_interval: 10s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      static_configs:
      - targets:
        - 127.0.0.1:9090
    - job_name: codoon_ops
      honor_timestamps: true
      scrape_interval: 10s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      file_sd_configs:
      - files:
        - /codoon/prometheus/etc/targets/target_*.json
        refresh_interval: 20s   # interval for re-reading the target files
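
    Prometheus rejects a scrape config whose scrape_timeout is greater than its scrape_interval, so the two 10s values in this file sit right at the limit. The sketch below checks that relationship before a config is rolled out. It only understands single-unit durations such as 10s or 1m, and the function names are illustrative.

```python
import re

UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}

def to_seconds(duration):
    """Convert a single-unit duration such as '10s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNITS[match.group(2)]

def timeout_fits_interval(scrape_interval, scrape_timeout):
    """True when the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)

print(timeout_fits_interval("10s", "10s"))  # the values used in this config
print(timeout_fits_interval("10s", "15s"))  # a timeout that is too long
```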
  • Prometheus startup command

    /codoon/prometheus/prometheus --web.enable-lifecycle --config.file=/codoon/prometheus/etc/prometheus.yml --storage.tsdb.path=/codoon/prometheus

    nohup ./prometheus --web.enable-lifecycle --config.file=./etc/prometheus.yml --storage.tsdb.path=/codoon/prometheus --web.external-url=xxx.com/ > prometheus.log 2>&1 &
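
    The --web.enable-lifecycle flag is what turns on the HTTP reload endpoint (/-/reload) that the confd reload_cmd in this post calls with curl. A minimal Python equivalent of that call might look like the sketch below; the helper name is hypothetical, and the request is only built here, not sent.

```python
import urllib.request

def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Build (but do not send) the POST request for the reload endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")

req = build_reload_request()
print(req.get_method(), req.full_url)

# To trigger the reload against a running server:
# urllib.request.urlopen(req, timeout=5)
```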
  • confd configuration files

    # Service discovery
    # conf.d/discovery_host.toml
    [template]
    src = "discovery_host.tmpl"
    dest = "/codoon/prometheus/etc/targets/target_host.json"
    mode = "0777"
    keys = [
      "/prometheus/discovery/host",
    ]
    reload_cmd = "curl -XPOST 'http://127.0.0.1:9090/-/reload'"

    # templates/discovery_host.tmpl
    [
    {{- range $index, $info := getvs "/prometheus/discovery/host/*" -}}
    {{- $data := json $info -}}
    {{- if ne $index 0 }},{{- end }}
      {
        "targets": [
          "{{$data.address}}"
        ],
        "labels": {
          "instance": "{{$data.name}}"
          {{- if $data.labels -}}
          {{- range $data.labels -}}
          ,"{{.key}}": "{{.val}}"
          {{- end}}
          {{- end}}
        }
      }{{- end }}
    ]

    # Rule delivery
    # conf.d/rule_host.toml
    [template]
    src = "rule_host.tmpl"
    dest = "/codoon/prometheus/etc/rules/rule_host.yml"
    mode = "0777"
    keys = [
      "/prometheus/rule/host",
    ]
    reload_cmd = "curl -XPOST 'http://127.0.0.1:9090/-/reload'"

    # templates/rule_host.tmpl
    groups:
    - name: host
      rules:
    {{- range $info := getvs "/prometheus/rule/host/*"}}
    {{- $data := json $info}}
    {{- if $data.status}}
      - alert: {{$data.alert}}
        expr: {{$data.expr}}
        for: {{$data.for}}
    {{- if $data.labels}}
        labels:
    {{- range $data.labels}}
          {{.key}}: {{.val}}
    {{- end}}
    {{- end}}
        annotations:
    {{- if $data.summary}}
          summary: "{{$data.summary}}"
    {{- end}}
    {{- if $data.description}}
          description: "{{$data.description}}"
    {{- end}}
    {{- end }}
    {{- end }}
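
    To make the rule template's control flow easier to follow, here is a Python approximation of what rule_host.tmpl does with the values stored under /prometheus/rule/host/. It is a sketch of the same logic, not confd code, and `render_rules` is a hypothetical name. Note that an entry is emitted only when its "status" field is truthy.

```python
import json

def render_rules(values, group="host"):
    """Approximate rule_host.tmpl for a list of JSON strings from etcd."""
    lines = ["groups:", f"- name: {group}", "  rules:"]
    for raw in values:
        data = json.loads(raw)
        if not data.get("status"):
            continue  # missing or false status: the template skips the rule
        lines.append("  - alert: " + data["alert"])
        lines.append("    expr: " + data["expr"])
        lines.append("    for: " + data["for"])
        if data.get("labels"):
            lines.append("    labels:")
            for label in data["labels"]:
                lines.append("      " + label["key"] + ": " + label["val"])
        lines.append("    annotations:")
        if data.get("summary"):
            lines.append('      summary: "' + data["summary"] + '"')
        if data.get("description"):
            lines.append('      description: "' + data["description"] + '"')
    return "\n".join(lines) + "\n"

values = [
    '{"status":true,"alert":"no data","expr":"up == 0","for":"5m","summary":"no data"}',
    '{"alert":"ignored","expr":"up == 0","for":"1m"}',
]
print(render_rules(values))
```

    The second value has no "status" field, so it does not appear in the output.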
  • confd startup command

    /codoon/prometheus/confd-0.16.0-linux-amd64 -confdir /codoon/prometheus/confd/ -backend etcdv3  -watch -node http://127.0.0.1:2379

    nohup ./confd-0.16.0-linux-amd64 -confdir ./confd/ -backend etcdv3 -watch -node http://127.0.0.1:2379 > confd.log 2>&1 &
  • Simulating service discovery

    # By default the only label is instance: <name>
    etcdctl put /prometheus/discovery/host/test1 '{"name":"test1","address":"10.12.10.1:9091"}'
    # With custom labels
    etcdctl put /prometheus/discovery/host/test2 '{"name":"test2","address":"10.12.10.1:9092","labels":[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}'
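
    For the two keys written above, discovery_host.tmpl should produce a file_sd target file along the lines of what the following sketch prints. The Python code mirrors the template logic for illustration only; `render_targets` is a hypothetical name, and the exact whitespace of the confd output will differ.

```python
import json

def render_targets(values):
    """Approximate discovery_host.tmpl: one file_sd entry per etcd value."""
    entries = []
    for raw in values:
        data = json.loads(raw)
        labels = {"instance": data["name"]}  # always present
        for label in data.get("labels") or []:
            labels[label["key"]] = label["val"]  # optional custom labels
        entries.append({"targets": [data["address"]], "labels": labels})
    return json.dumps(entries, indent=2)

values = [
    '{"name":"test1","address":"10.12.10.1:9091"}',
    '{"name":"test2","address":"10.12.10.1:9092","labels":'
    '[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}',
]
print(render_targets(values))
```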
  • Simulating rule delivery

    # "status" must be true; rule_host.tmpl skips entries without it
    etcdctl put /prometheus/rule/host/test1 '{"status":true,"alert":"test1 is down","expr":"up == 0","for":"30s","summary":"s1","description":"d1"}'
    # With custom labels
    etcdctl put /prometheus/rule/host/test2 '{"status":true,"alert":"test2 is down","expr":"up == 0","for":"1m","summary":"s1","description":"d1","labels":[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}'
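
    Because rule_host.tmpl only renders entries whose "status" field is truthy, a payload can be written to etcd successfully and still never reach Prometheus. A small pre-flight check, sketched below under the assumption that alert, expr, and for are the fields every rule needs, can catch that before the put. The function name is hypothetical.

```python
import json

REQUIRED = ("alert", "expr", "for")

def check_rule_payload(raw):
    """Return a list of problems that would keep the rule from working."""
    data = json.loads(raw)
    problems = [f"missing field: {name}" for name in REQUIRED if not data.get(name)]
    if not data.get("status"):
        problems.append('"status" is not true, so rule_host.tmpl skips this rule')
    for label in data.get("labels") or []:
        if "key" not in label or "val" not in label:
            problems.append(f"label needs key and val: {label}")
    return problems

print(check_rule_payload('{"alert":"test1 is down","expr":"up == 0","for":"30s"}'))
```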
  • alertmanager

    nohup ./alertmanager-0.21.0.linux-amd64/alertmanager --config.file=alertmanager-0.21.0.linux-amd64/alertmanager.yml > alertmanager.log 2>&1 &
  • Common PromQL

    /prometheus/rule/host/nodata
    # No data
    {"status":true,"alert":"no data","expr":"up == 0","for":"5m","summary":"no data","description":"{{$labels.instance}} no data for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    /prometheus/rule/host/availcpult20
    # CPU availability below 20%
    {"status":true,"alert":"avail cpu lt 20%","expr":"avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (type,instance,env,ip) < 0.2","for":"5m","summary":"avail cpu lt 20%","description":"avail cpu lt 20% for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    /prometheus/rule/host/availmemlt20
    # Memory availability below 20%
    {"status":true,"alert":"avail mem lt 20%","expr":"1-(node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) /node_memory_MemTotal_bytes < 0.2","for":"5m","summary":"avail mem lt 20%","description":"avail mem lt 20% for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    /prometheus/rule/host/availdisklt20
    # Disk availability below 20%
    {"status":true,"alert":"avail disk lt 20%","expr":"node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container.*|kubelet\"} /node_filesystem_size_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container.*|kubelet\"} < 0.2","for":"5m","summary":"avail disk lt 20%","description":"mount: {{ $labels.mountpoint }} avail lt 20% for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    /prometheus/rule/host/load1toohigh
    # 1-minute load
    {"status":true,"alert":"load1 is too high","expr":"node_load1/2 > on(type,instance,env,ip) count(node_cpu_seconds_total{mode=\"system\"}) by (type,instance,env,ip)","for":"5m","summary":"load1 is too high","description":"load1 is too high for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    /prometheus/rule/host/useiopsgt80
    # Disk I/O utilization above 80%
    {"status": true,"alert":"iops too high","expr":"rate(node_disk_io_time_seconds_total[5m]) > 0.8","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

    (1 - (node_memory_MemFree_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} + node_memory_Buffers_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} + node_memory_Cached_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"}) / node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"}) * 100

    ((node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_MemFree_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_Buffers_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} )) * 100

    # Alert rule summary
    1-minute load greater than the number of CPU cores, for 5m
    node_load1 > on(instance,ip) count(node_cpu_seconds_total{mode="system"}) by (instance,ip)

    CPU availability below 20%, for 5m
    avg(rate(node_cpu_seconds_total{mode="system"}[5m])) by (instance) *100
    avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance) *100
    avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) *100
    avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) *100

    Disk availability below 20% and less than 20G free, for 5m
    (node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} / node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} < 0.2) and node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} < 20*1024^3

    Memory usage above 80%, for 5m
    (node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) /node_memory_MemTotal_bytes

    IOPS: reads above 1000 or writes above 200, for 5m
    rate(node_disk_reads_completed_total[5m]) > 1000 or rate(node_disk_writes_completed_total[5m]) > 200

    NIC: total traffic over 1 hour (MiB) and 5-minute rate (bit/s)
    increase(node_network_receive_bytes_total[60m]) /1024/1024
    increase(node_network_transmit_bytes_total[60m]) /1024/1024
    rate(node_network_receive_bytes_total[5m])*8
    rate(node_network_transmit_bytes_total[5m])*8
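
    The disk rule above combines a ratio check with an absolute floor of 20*1024^3 bytes, so very large volumes do not alert while they still have plenty of free space. The arithmetic can be checked with a few sample numbers; the values below are made up for illustration.

```python
GIB = 1024 ** 3

def disk_alert(avail_bytes, size_bytes, ratio=0.2, floor_bytes=20 * GIB):
    """True when both conditions of the disk rule hold for one filesystem."""
    return avail_bytes / size_bytes < ratio and avail_bytes < floor_bytes

# 2 TiB volume with 150 GiB free: about 7.3% available, but above the 20 GiB floor.
print(disk_alert(150 * GIB, 2048 * GIB))  # False
# 50 GiB volume with 5 GiB free: 10% available and below the 20 GiB floor.
print(disk_alert(5 * GIB, 50 * GIB))      # True
```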
  • temp

    {"status": true,"alert":"rw iops too high","expr":"rate(node_disk_io_time_seconds_total[5m]) > 0.8","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxxx,xxxx,xxx"}]}

    etcdctl put /prometheus/discovery/host/codoon-istio-master01 '{"name":"codoon-istio-master01","address":"10.10.16.73:9100","labels": [{"key":"type","val":"host"},{"key":"ip","val":"10.10.16.73"}]}'

    etcdctl put /prometheus/rule/host/cpuavail20 '{"status":true,"alert":"cpu avail less 20","expr":"avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance) < 0.2","for":"5m","summary":"avail less 20","description":"cpu avail less 20 for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxx"}]}'

    etcdctl put /prometheus/rule/host/memuse80 '{"status":true,"alert":"mem use gt 80","expr":"(node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) /node_memory_MemTotal_bytes > 0.8","for":"5m","summary":"use gt 80","description":"mem use gt 80 for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxx"}]}'

    etcdctl put /prometheus/rule/host/iopsth '{"status":true,"alert":"rw iops too high","expr":"rate(node_disk_reads_completed_total[5m]) > 1000 or rate(node_disk_writes_completed_total[5m]) > 200","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxxx"}]}'

    {
      "status": true,
      "alert": "avail disk lt 20%",
      "expr": "node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container.*|kubelet\"} /node_filesystem_size_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container.*|kubelet\"} < 0.2 and node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container.*|kubelet\"} < 50*1024^3",
      "for": "2m",
      "summary": "avail disk lt 20%",
      "description": "mount: {{ $labels.mountpoint }} avail lt 20% for 2m, curr: {{ $value }}",
      "labels": [{
        "key": "severity",
        "val": "warnning"
      }]
    }

    etcdctl put /prometheus/rule/host/load1too2high '{"status":true,"alert":"load1 is too2 high","expr":"node_load1 > on(type,instance,env,ip) count(node_cpu_seconds_total{mode=\"system\"}) by (type,instance,env,ip) /1.5","for":"2m","summary":"load1 is too2 high","description":"load1 is too2 high for 2m, curr: {{ $value }}","labels":[{"key":"severity","val":"critical"}]}'
  • Startup script (systemd unit)

    vim /usr/lib/systemd/system/prometheus.service
    [Unit]
    Description=prometheus
    Documentation=codoon_ops
    After=network.target
    [Service]
    EnvironmentFile=-/etc/sysconfig/prometheus
    User=prometheus
    ExecStart=/usr/local/prometheus/prometheus \
      --web.enable-lifecycle \
      --storage.tsdb.path=/codoon/prometheus/data \
      --config.file=/codoon/prometheus/etc/prometheus.yml \
      --web.listen-address=0.0.0.0:9090 \
      --web.external-url= $PROM_EXTRA_ARGS \
      --log.level=debug
    Restart=on-failure
    StartLimitInterval=1
    RestartSec=3
    [Install]
    WantedBy=multi-user.target

    systemctl daemon-reload
    systemctl enable prometheus
  • docker

    docker run --name promconfd -d -v /codoon/prometheus/etc:/opt/prometheus/etc -v /codoon/prometheus/data:/opt/prometheus/data -v /codoon/prometheus/confd/etc:/opt/confd/etc -p 9090:9090 dockerhub.xxxx.com/prom/prometheus:v2.24.1
  • Deployment

    Prometheus + confd are deployed as Docker containers on prom-monitor
    TSDB data path: /codoon/prometheus/data
    Prometheus configuration path: /codoon/prometheus/etc
    confd configuration path: /codoon/prometheus/confd/etc

    ops-etcd012
    etcd keys for automatic service discovery
    /prometheus/discovery/host/*
    /prometheus/discovery/db/*
    ...

    etcd keys for automatic rule delivery
    /prometheus/rule/host/*
    /prometheus/rule/host/*
    ...
  • Notification policy

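    1. A warnning-level alert waits 1 minute before its first notification.
       If a critical-level alert of the same type appears in that window, the critical alert is sent immediately and the warnning-level alert is not sent.
    2. warnning-level alerts are re-sent every 20 minutes.
    3. critical-level alerts are re-sent every 10 minutes.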
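
    The three rules above can be read as a small decision function. The sketch below is one possible interpretation, not the codoon-alert implementation: the function name, its parameters, and the use of plain seconds for timestamps are all assumptions.

```python
WAIT_BEFORE_FIRST_WARNING = 60  # seconds a warnning alert is held back
REPEAT_INTERVAL = {"warnning": 20 * 60, "critical": 10 * 60}  # seconds

def should_send(severity, now, first_seen, last_sent=None, critical_active=False):
    """Decide whether to notify for one alert at time `now` (in seconds)."""
    if severity == "warnning":
        if critical_active:
            return False  # the critical alert of the same type is sent instead
        if last_sent is None:
            return now - first_seen >= WAIT_BEFORE_FIRST_WARNING
    elif last_sent is None:
        return True  # a critical alert goes out immediately
    return now - last_sent >= REPEAT_INTERVAL[severity]

print(should_send("warnning", now=30, first_seen=0))   # still inside the 1-minute wait
print(should_send("critical", now=30, first_seen=0))   # sent right away
```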
  • Silence configuration

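    Silences are configured through opscenter. They work by filtering on labels, and the best match is selected.
    Inhibition: for the same alert, the higher severity automatically inhibits the lower one; the inhibition is lifted automatically once the higher-severity alert resolves.
    Silence configuration is stored in ops-etcd under /prometheus/silencev2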

    Supports regex matching on alertname:instance:labels... (alert name, instance, IP, severity, and so on)
    Add a silence (POST)
    curl -X POST -H 'Content-Type: application/json' -d '{"sc_key":"tidb","sc_val":"instance:severity:alertname:tidb-(nodessd-[0-9]+)warnning(load1.*avail cpu.*)"}' codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/silence
    Delete a silence (DELETE)
    curl -X DELETE codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/silence/tidb
    List silences (GET)
    curl codoon-alert.in.xx.com:8875/backend/codoon_alert/api/v1/silence

    View the alertconfig configuration (GET)
    curl codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/alertconfig?cfg_key=noticewaitclearreslove

    {
      "data": {
        "apitmporcheckall": "instance:alertname:(nginx-api-tmpapicheck(-[0-9])?)(.*)",
        "intwarnall": "instance:severity:alertname:integrationwarnning(.*)",
        "istio": "instance:severity:alertname:(codoon[0-9]+istio)warnning(load1.*)",
        "monitor_roy": "instance:severity:alertname:monitor_roywarnning(load1.*)",
        "testall": "instance:alertname:testall(.*)",
        "tidb": "instance:severity:alertname:tidb-(nodessd-[0-9]+)warnning(load1.*avail cpu.*)"
      },
      "description": "ok",
      "status": "OK"
    }
  • Alert configuration

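    Works the same way as the silence configuration: alerts are filtered by label, and the best match is chosen by default.
    Label matching checks = and != first, then =~ and !~ (regex).
    Alert configuration is stored in ops-etcd under /prometheus/receiver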
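
    The matching order described above (equality matchers first, regex matchers second) can be sketched as follows. How "best match" is actually scored is not stated in this post; the sketch assumes the matching configuration with the most matchers wins. The data layout and all names are hypothetical.

```python
import re

def matches(matcher, labels):
    """Evaluate one (name, op, value) matcher against an alert's labels."""
    name, op, value = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    found = re.fullmatch(value, actual) is not None
    return found if op == "=~" else not found  # remaining case is "!~"

def config_matches(matchers, labels):
    # Equality checks run first, regex checks second.
    ordered = sorted(matchers, key=lambda m: m[1] in ("=~", "!~"))
    return all(matches(m, labels) for m in ordered)

def best_match(configs, labels):
    """Pick the matching config with the most matchers (assumed scoring)."""
    hits = [(name, ms) for name, ms in configs.items() if config_matches(ms, labels)]
    return max(hits, key=lambda item: len(item[1]))[0] if hits else None

configs = {
    "default": [("type", "=", "host")],
    "tidb": [("type", "=", "host"), ("instance", "=~", "tidb-.*")],
}
print(best_match(configs, {"type": "host", "instance": "tidb-node-1"}))
```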
  • Alert templates

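    Templates are customized through opscenter. When more than 3 alerts fire, they are automatically collapsed into one message,
    and an additional email containing the full alert details is sent.
    Template configuration is stored in ops-etcd under /prometheus/template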
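
    A minimal sketch of the collapsing behaviour, assuming the short message keeps the first 3 alerts and points to the email for the rest. The actual message layout is defined by the opscenter templates and is not shown in this post, so the format and names below are illustrative only.

```python
COLLAPSE_THRESHOLD = 3

def build_notifications(alerts):
    """Return (message, email_body); email_body is None when nothing is collapsed."""
    lines = [f"{a['alertname']} on {a['instance']}" for a in alerts]
    if len(alerts) <= COLLAPSE_THRESHOLD:
        return "\n".join(lines), None
    shown = lines[:COLLAPSE_THRESHOLD]
    shown.append(f"... and {len(alerts) - COLLAPSE_THRESHOLD} more (see email)")
    # The email carries every alert, including the ones hidden in the message.
    return "\n".join(shown), "\n".join(lines)

alerts = [{"alertname": "load1 is too high", "instance": f"host{i}"} for i in range(5)]
message, email_body = build_notifications(alerts)
print(message)
```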
  • Other notes

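    For alerts labeled type=service, the recipients are looked up in the CMDB by service name (service=xxx)
    To stop receiving recovery notifications, set resolved=no in the labels
    Pod CPU/memory alerts (pprof_type=memory/cpu) also send a pprof profile
    Service error/panic alerts (log_type: ERRO/PANIC) fetch the details from Loki and include them in the notification
    servicemap: mapping between log names and services; watches err_check/service_map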