How to Monitor the Kubernetes API Server
Sources: http://dockone.io/article/9769 and https://sysdig.com/blog/monitor-kubernetes-api-server/
API Server Monitoring
The API server is the hub through which all components of a Kubernetes cluster interact. The table below lists the main metrics for monitoring the API server.
Metric | Description |
---|---|
Request latency | Latency of responses to resource requests, broken down by HTTP verb (the apiserver_request_duration_seconds histogram is in seconds). It's a good idea to use percentiles to understand the latency spread: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m])) by (verb, le)) |
Requests per second | Number of requests that kube-apiserver receives per second. |
Learning how to monitor the Kubernetes API server is of vital importance when running Kubernetes in production. Monitoring kube-apiserver will let you detect and troubleshoot latency and errors, and validate that the service performs as expected. Keep reading to learn how you can collect the most important metrics from the kube-apiserver and use them to monitor this service.
As with any other microservice, we are going to take the Golden Signals approach to monitoring the Kubernetes API server's health and performance:
- Latency
- Request rate
- Errors
- Saturation
But before we dive into the meaning of each one, let’s see how to fetch those metrics.
Monitor Kubernetes API server: What to look for?
We can use the Golden Signals to monitor the Kubernetes API server. Golden Signals is a technique for monitoring a service through a small number of metrics that give insight into how it is performing for its consumers (here, kubectl users and the internal cluster components). These metrics are latency, request rate, errors and saturation (how busy the server is relative to its maximum capacity with its current resources).
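For saturation, one rough proxy (an assumption on our part; kube-apiserver does not expose a single saturation metric) is the server's own CPU and memory usage from the standard Prometheus process collector series it exposes alongside its own metrics:
rate(process_cpu_seconds_total{job="kubernetes-apiservers"}[5m])
process_resident_memory_bytes{job="kubernetes-apiservers"}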
Disclaimer: API server metrics might differ between Kubernetes versions. Here we used Kubernetes 1.15. You can check the metrics available for your version in the Kubernetes repo (link for the 1.15.3 version).
Latency: Latency can be extracted from the apiserver_request_duration_seconds histogram buckets:
# TYPE apiserver_request_duration_seconds histogram
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="125000"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="250000"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="500000"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="1e+06"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="2e+06"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="4e+06"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="8e+06"} 2
apiserver_request_duration_seconds{resource="adapters",scope="cluster",subresource="",verb="LIST",le="+Inf"} 2
apiserver_request_duration_seconds_sum{resource="adapters",scope="cluster",subresource="",verb="LIST"} 50270
apiserver_request_duration_seconds_count{resource="adapters",scope="cluster",subresource="",verb="LIST"} 2
It’s a good idea to use percentiles to understand the latency spread:
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m])) by (verb, le))
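The same histogram's _sum and _count series can also give an average request latency per verb (a quick sketch, assuming the same job label as above):
sum(rate(apiserver_request_duration_seconds_sum{job="kubernetes-apiservers"}[5m])) by (verb) / sum(rate(apiserver_request_duration_seconds_count{job="kubernetes-apiservers"}[5m])) by (verb)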
Request rate: The metric apiserver_request_total can be used to monitor the requests to the service: where they come from, which resource and verb they target, and whether they were successful:
# TYPE apiserver_request_total counter
apiserver_request_total{client="Go-http-client/1.1",code="0",contentType="",resource="pods",scope="namespace",subresource="portforward",verb="CONNECT"} 4
apiserver_request_total{client="Go-http-client/2.0",code="200",contentType="application/json",resource="alertmanagers",scope="cluster",subresource="",verb="LIST"} 1
apiserver_request_total{client="Go-http-client/2.0",code="200",contentType="application/json",resource="alertmanagers",scope="cluster",subresource="",verb="WATCH"} 72082
apiserver_request_total{client="Go-http-client/2.0",code="200",contentType="application/json",resource="clusterinformations",scope="cluster",subresource="",verb="LIST"} 1
For example, you can get all the successful requests across the service like this:
sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"2.."}[5m]))
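To see how that traffic breaks down by resource, verb and response code, you can group the same counter by its labels (a sketch, assuming the same job label):
sum(rate(apiserver_request_total{job="kubernetes-apiservers"}[5m])) by (verb, resource, code)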
Errors: You can use the same query used for the request rate, but filter for 4xx and 5xx response codes:
sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"[45].."}[5m]))
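Dividing that by the total request rate gives an error ratio, which is usually easier to alert on than an absolute error count (a sketch, assuming the same job label):
sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"[45].."}[5m])) / sum(rate(apiserver_request_total{job="kubernetes-apiservers"}[5m]))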
Examples of issues
You detect an increase in the latency of requests to the API.
This is typically a sign of overload on the API server. Your cluster probably has a lot of load and the API server needs to be scaled out.
You can segment the metrics by type of request, by resource or by verb to pinpoint where the problem is. Maybe you are having issues reading from or writing to etcd and need to fix that.
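For example, segmenting the 99th percentile latency by resource and verb (a sketch based on the apiserver_request_duration_seconds histogram shown earlier) can point to the slow path:
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m])) by (resource, verb, le))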
You detect an increase in the depth and the latency of the work queue.
You are having issues scheduling actions. You should check that the scheduler is working. Maybe some of your nodes are overloaded and you need to scale out your cluster. Maybe one node is having issues and you want to replace it.
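As a starting point, the standard client-go work queue metrics can be queried like this (a sketch; the exact names, here assumed to be workqueue_depth and workqueue_queue_duration_seconds_bucket, vary across Kubernetes versions):
sum(workqueue_depth) by (name)
histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket[5m])) by (name, le))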
Source: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-4-the-kubernetes-api-server-72f1e1210770
Request Rates and Latencies
The API server understands the Kubernetes nouns like nodes, pods, and namespaces. If we want to get a feel for how often these resources are being requested, we can look at the metric apiserver_request_count. This is the Rate metric:
sum(rate(apiserver_request_count[5m])) by (resource, subresource, verb)
This will give you a five-minute rate of requests for all the Kubernetes resources, broken down by verb. The verbs here are the Kubernetes API request verbs (derived from the HTTP methods): WATCH, PUT, POST, PATCH, LIST, GET, DELETE, and CONNECT.
Errors for the API server can be tracked as HTTP 5xx responses. Use this query to get the ratio of errors to the overall request rate:
rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
For Duration, we will look at the 90th percentile latency across all resources and verbs, using the metric apiserver_request_latencies_bucket:
histogram_quantile(0.9, sum(rate(apiserver_request_latencies_bucket[5m])) by (le, resource, subresource, verb) ) / 1e+06
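If your cluster exposes the apiserver_request_duration_seconds histogram used in the Latency section above, an equivalent query (a sketch, assuming its buckets are in seconds as the metric name suggests, so the /1e+06 conversion is not needed) would be:
histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, resource, subresource, verb))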