
Monitoring CoreDNS with Prometheus

Published: 2023-09-30 12:06:29

1. Introduction


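The prometheus plugin exposes CoreDNS monitoring data. Besides CoreDNS itself, other plugins that support Prometheus (such as the cache plugin) also expose their metrics through the prometheus plugin once they are enabled. By default, the metrics are exposed at localhost:9153 under the path /metrics. The prometheus plugin can be used only once per server block in the configuration file. Below are some of the metrics for CoreDNS itself: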

  • coredns_build_info{version, revision, goversion} - information about CoreDNS itself
  • coredns_panics_total{} - total number of panics
  • coredns_dns_requests_total{server, zone, proto, family, type} - total number of queries
  • coredns_dns_request_duration_seconds{server, zone, type} - time taken to process each query
  • coredns_dns_request_size_bytes{server, zone, proto} - size of the request, in bytes
  • coredns_dns_do_requests_total{server, zone} - queries that have the DO bit set
  • coredns_dns_response_size_bytes{server, zone, proto} - size of the response, in bytes
  • coredns_dns_responses_total{server, zone, rcode} - number of responses per zone and response code
  • coredns_plugin_enabled{server, zone, name} - whether each plugin is enabled, per zone

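Several labels appear repeatedly in the list above and deserve some extra explanation:

  • zone: every request/response metric carries a zone label, so most of the metrics above can be broken down per zone. This is very useful when you need detailed statistics or are troubleshooting a problem.
  • server: identifies the server handling the request, usually in the format <scheme>://[<bind>]:<port>. The default is dns://:53; if the bind plugin specifies a listening IP, it may look like dns://127.0.0.53:53.
  • proto: the transport protocol, normally udp or tcp
  • family: the IP protocol family (1 = IP (IP version 4), 2 = IP6 (IP version 6))
  • type: the DNS query type, split into the common types (A, AAAA, MX, SOA, CNAME, PTR, TXT, NS, SRV, DS, DNSKEY, RRSIG, NSEC, NSEC3, IXFR, AXFR and ANY) and "other" for everything else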
If monitoring is enabled, queries that do not enter the plugin chain are exported under the fake name “dropped” (without a closing dot - this is never a valid domain name).
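
When post-processing scraped metrics, it can help to split the server label into its parts. The following is a minimal sketch based only on the <scheme>://[<bind>]:<port> format described above; the names ServerLabel and parse_server_label are hypothetical, and stripping square brackets from the bind address is an assumption made for bracketed addresses.

```python
from typing import NamedTuple


class ServerLabel(NamedTuple):
    scheme: str
    bind: str  # empty string when no bind address is part of the label
    port: int


def parse_server_label(value: str) -> ServerLabel:
    """Split a server label such as 'dns://:53' into scheme, bind address and port."""
    scheme, sep, rest = value.partition("://")
    if not sep:
        raise ValueError(f"missing scheme separator in {value!r}")
    # The port follows the last colon; everything before it is the bind address.
    bind, sep, port = rest.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"missing or invalid port in {value!r}")
    # Assumption: a bracketed bind address is stored without its brackets.
    return ServerLabel(scheme, bind.strip("[]"), int(port))


print(parse_server_label("dns://:53"))
print(parse_server_label("dns://127.0.0.53:53"))
```

With the default label the bind field comes back empty, which makes it easy to tell servers that listen on all addresses from those restricted by the bind plugin.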

2. Monitoring CoreDNS: what to look for?


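Disclaimer: CoreDNS metrics may differ between Kubernetes versions. Here we used Kubernetes 1.18 and CoreDNS. You can check which metrics are available for your version in the Kubernetes repository (linked for version 1.18.8).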

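Request latency: following the golden signals, request latency is an important indicator for detecting service degradation. To check it, always compare percentiles against the average. In Prometheus, this is done with the histogram_quantile function: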

histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le))
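
To make the query above less of a black box, here is a simplified Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is an illustration under stated assumptions, not the Prometheus implementation: it assumes a final +Inf bucket, non-negative bucket bounds, and 0 <= q <= 1, and the bucket values in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    upper, count = buckets[idx]
    if count == below:
        return upper  # empty bucket, only reachable when rank is 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical per-bucket rates (le in seconds, cumulative count).
example = [(0.001, 50), (0.01, 90), (0.1, 99), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, example))
```

Because the result is interpolated inside a bucket, its accuracy depends on how finely the buckets are spaced around the latency range of interest.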

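Error rate: the error rate is another golden signal you must monitor. Although errors are not always caused by DNS failures, it is still a key metric to watch closely. One of the key CoreDNS metrics for errors is coredns_dns_responses_total, and its rcode label is relevant too. For example, the NXDOMAIN error means that the DNS query failed because the queried domain name does not exist.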

# HELP coredns_dns_responses_total Counter of response status codes.
# TYPE coredns_dns_responses_total counter
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 1336
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 471519
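
As a quick sanity check on output like the sample above, the share of each response code can be computed directly from the exposition text. The sketch below is a deliberately small parser (it does not handle escaped quotes inside label values), and rcode_shares is a hypothetical helper name. Note that these counters are cumulative since process start, so the result is a lifetime ratio rather than a current rate.

```python
import re
from collections import defaultdict

# The two sample lines shown above.
SAMPLE = '''\
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 1336
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 471519
'''

LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def rcode_shares(text: str) -> dict:
    """Return the fraction of responses per rcode found in exposition text."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != "coredns_dns_responses_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("rcode", "unknown")] += float(match.group("value"))
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {rcode: count / grand_total for rcode, count in totals.items()}


print(rcode_shares(SAMPLE))
```

For the sample above, NXDOMAIN makes up roughly 99.7% of all responses, which is the kind of imbalance worth investigating.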

 

3. Configuring a Grafana dashboard


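In number and richness, the Prometheus metrics that CoreDNS supports natively are arguably second to none among DNS systems. The Grafana website also offers many ready-made dashboards (https://grafana.com/grafana/dashboards?search=coredns). Because the vast majority of the metrics are common to all of them, panels from different dashboards can be freely copied, dragged, and combined into a new dashboard without compatibility concerns. This makes it easy to configure monitoring for authoritative, recursive, or combined DNS according to your actual needs.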

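[Figure: Prometheus monitoring CoreDNS]

As the figure above shows, we can monitor the number of requests by DNS query type and by zone, and other parameters such as request latency and total request count can be monitored thoroughly as well.

[Figure: Prometheus monitoring CoreDNS]

As the figure above shows, we can also monitor the transport-layer protocol of requests, the cache size, the cache hit rate, and other information.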

 

Alerts


Complete list of pregenerated alerts is available here.

coredns

CoreDNSDown

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsdown

alert: CoreDNSDown
annotations:
  message: CoreDNS has disappeared from Prometheus target discovery.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsdown
expr: |
  absent(up{job="kube-dns"} == 1)
for: 15m
labels:
  severity: critical

CoreDNSLatencyHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednslatencyhigh

alert: CoreDNSLatencyHigh
annotations:
  message: CoreDNS has 99th percentile latency of {{ $value }} seconds for server {{ $labels.server }} zone {{ $labels.zone }}.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednslatencyhigh
expr: |
  histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le)) > 4
for: 10m
labels:
  severity: critical

CoreDNSErrorsHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserrorshigh

alert: CoreDNSErrorsHigh
annotations:
  message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserrorshigh
expr: |
  sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
    /
  sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.03
for: 10m
labels:
  severity: critical

CoreDNSErrorsHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserrorshigh

alert: CoreDNSErrorsHigh
annotations:
  message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserrorshigh
expr: |
  sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
    /
  sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.01
for: 10m
labels:
  severity: warning
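
The two CoreDNSErrorsHigh rules differ only in threshold and severity. As a rough sketch of how the SERVFAIL ratio maps to a severity, the hypothetical helper below applies the same 0.03 and 0.01 thresholds to a pair of per-second rates. It returns only the highest matching severity and ignores the 10-minute `for` window, whereas in Prometheus the two rules are evaluated independently.

```python
from typing import Optional

CRITICAL_RATIO = 0.03  # threshold of the critical rule
WARNING_RATIO = 0.01   # threshold of the warning rule


def servfail_severity(servfail_rate: float, total_rate: float) -> Optional[str]:
    """Return the highest severity whose threshold the SERVFAIL ratio exceeds."""
    if total_rate <= 0:
        # No traffic: the PromQL division yields no usable value, so nothing fires.
        return None
    ratio = servfail_rate / total_rate
    if ratio > CRITICAL_RATIO:
        return "critical"
    if ratio > WARNING_RATIO:
        return "warning"
    return None


print(servfail_severity(5.0, 100.0))
print(servfail_severity(2.0, 100.0))
```

Both comparisons are strict, matching the `>` operator in the rule expressions.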

coredns_forward

CoreDNSForwardLatencyHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardlatencyhigh

alert: CoreDNSForwardLatencyHigh
annotations:
  message: CoreDNS has 99th percentile latency of {{ $value }} seconds forwarding requests to {{ $labels.to }}.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardlatencyhigh
expr: |
  histogram_quantile(0.99, sum(rate(coredns_forward_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(to, le)) > 4
for: 10m
labels:
  severity: critical

CoreDNSForwardErrorsHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh

alert: CoreDNSForwardErrorsHigh
annotations:
  message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
expr: |
  sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
    /
  sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.03
for: 10m
labels:
  severity: critical

CoreDNSForwardErrorsHigh

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh

alert: CoreDNSForwardErrorsHigh
annotations:
  message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
expr: |
  sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
    /
  sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.01
for: 10m
labels:
  severity: warning

CoreDNSForwardHealthcheckFailureCount

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckfailurecount

alert: CoreDNSForwardHealthcheckFailureCount
annotations:
  message: CoreDNS health checks have failed to upstream server {{ $labels.to }}.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckfailurecount
expr: |
  sum(rate(coredns_forward_healthcheck_failures_total{job="kube-dns"}[5m])) by (to) > 0
for: 10m
labels:
  severity: warning

CoreDNSForwardHealthcheckBrokenCount

https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckbrokencount

alert: CoreDNSForwardHealthcheckBrokenCount
annotations:
  message: CoreDNS health checks have failed for all upstream servers.
  runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckbrokencount
expr: |
  sum(rate(coredns_forward_healthcheck_broken_total{job="kube-dns"}[5m])) > 0
for: 10m
labels:
  severity: warning

  CoreDNS : Embedded exporter (1 rules)

# CoreDNS Panic Count  Number of CoreDNS panics encountered 

  - alert: CorednsPanicCount
    expr: increase(coredns_panics_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: CoreDNS Panic Count (instance {{ $labels.instance }})
      description: "Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

 

Dashboards


The following dashboards are generated from mixins and hosted on GitHub:

  • coredns