Easy to master! How Prometheus monitors metrics and quickly locates faults

Monitoring business metrics with Prometheus

Now that Kubernetes has become the de facto standard for container orchestration, deploying microservices has become very easy. However, as the scale of microservices grows, the challenges of service governance grow with it. In this context, the concept of service observability emerged.

In a distributed system, a failure can occur at any node. How can we quickly locate and solve the problem when a failure occurs? Better still, how can we detect anomalies in the service before a failure happens and nip it in the bud? This is what observability is all about.

Observability

Observability is built on logging, metrics, and tracing, which are known as the three pillars of observability.

(figure: the three pillars of observability)

  • Logging records the events generated while an application runs, or the log output produced during program execution. It can describe the system's running state in detail, but storing and querying logs consumes a lot of resources, so filters are often used to reduce the data volume.

  • Metrics are aggregated values that take up little storage space. They can show the status and trends of a system, but they lack the detail needed to pinpoint problems; multidimensional data structures such as contour metrics are then used to enhance the expressiveness of details. For example, statistics such as a service's TBS accuracy rate, success rate, and traffic are typical examples aimed at a single metric or a particular database.

  • Tracing is request-oriented and makes it easy to analyze where a request went wrong. However, like logging, it consumes a lot of resources, so sampling is usually used to reduce the data volume. The scope of tracing is a single request: any call initiated from a browser or mobile client is a flow that we need to follow end to end.

The topic of this article is metrics in observability. With Kubernetes as infrastructure, we know that Kubernetes itself is a complex container orchestration system whose stable operation is crucial. Its accompanying metrics monitoring system, Prometheus, has likewise become the de facto standard for monitoring in cloud-native environments.

Most people are familiar with monitoring resource-level metrics such as CPU, memory, and network, and application-level metrics such as the number of HTTP requests and request latency. So how do you use Prometheus to monitor and alert on business-level metrics? That is the core topic of this article.

Taking one of our business scenarios as an example: multiple types of tasks run in the system, each with different running times, and each task goes through various states such as pending, running, succeeded, and failed. If we want to keep the system running stably, we must have a thorough understanding of how each type of task is behaving, for example whether tasks are currently piling up, whether too many tasks are failing, and whether an alert will fire when a threshold is exceeded.

To solve the monitoring and alerting problems above, we first need to understand Prometheus's metric types.

Metrics

Metric definition

Formally, all metrics are represented in the following format:

<metric name>{<label name>=<label value>, ...}

The metric name reflects the meaning of the monitored sample (for example, http_requests_total indicates the total number of HTTP requests received by the current system). Metric names may only contain ASCII letters, digits, underscores, and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

Labels reflect the characteristic dimensions of the sample; Prometheus uses them to filter and aggregate the sample data. Label names may only contain ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*.
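
For example, a concrete metric in this format might look like the following (the label values here are purely illustrative):

http_requests_total{method="POST", path="/api/tasks", status="200"}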

Metric types

Prometheus defines 4 different metric types: Counter, Gauge, Histogram, and Summary.

Counter

Counter-type metrics behave like a counter: they only increase and never decrease (unless the system is reset). Common monitoring metrics such as http_requests_total and node_cpu are Counter-type metrics. It is generally recommended to use the _total suffix when naming a Counter-type metric.

With a Counter metric, we can easily see the rate at which an event occurs.

For example, use the rate() function to obtain the growth rate of HTTP requests:

rate(http_requests_total[5m])
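
If the absolute growth over a time window is wanted instead of the per-second rate, the closely related increase() function can be used, for example:

increase(http_requests_total[1h])
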
Gauge

Gauge-type metrics reflect the current state of the system, so their sample values can go up or down. Common metrics such as node_memory_MemFree (the amount of currently free memory on the host) and node_memory_MemAvailable (available memory size) are Gauge-type metrics.

With a Gauge metric, we can directly view the current state of the system:

node_memory_MemFree
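
Gauge values can also be combined with functions such as delta() to observe how they changed over a time window, for example:

delta(node_memory_MemFree[2h])
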
Summary

A Summary is mainly used for statistics and analysis of the sample distribution. For example, most responses of an HTTP endpoint may complete within 100 ms while a few individual requests take 5 seconds; in that case an average cannot reflect the real situation. If instead the Summary metric lets us immediately see the 90th percentile of the response time, it becomes a much more meaningful metric.

For example

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.98e-05
go_gc_duration_seconds{quantile="0.25"} 5.31e-05
go_gc_duration_seconds{quantile="0.5"} 6.77e-05
go_gc_duration_seconds{quantile="0.75"} 0.0001428
go_gc_duration_seconds{quantile="1"} 0.0008099
go_gc_duration_seconds_sum 0.0114183
go_gc_duration_seconds_count 85
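
The _sum and _count series can also be combined to compute an average; for example, the average GC pause duration since the process started:

go_gc_duration_seconds_sum / go_gc_duration_seconds_count
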
Histogram

Histogram-type metrics are also used for statistical analysis of samples. Like a Summary, a Histogram exposes the total number of recorded samples (with the _count suffix) and the sum of their values (with the _sum suffix). The difference is that a Histogram directly reports the number of samples falling into different intervals (buckets), with the intervals defined by the le label. The quantiles of a Histogram can then be calculated with the histogram_quantile() function.

For example

# HELP prometheus_http_response_size_bytes Histogram of response size for HTTP requests.
# TYPE prometheus_http_response_size_bytes histogram
prometheus_http_response_size_bytes_bucket{handler="/",le="100"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="10000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="100000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1e+06"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="+Inf"} 1
prometheus_http_response_size_bytes_sum{handler="/"} 29
prometheus_http_response_size_bytes_count{handler="/"} 1
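
With the bucket series, a quantile can be estimated via histogram_quantile(); for example, the 90th percentile of the response size over the last 5 minutes:

histogram_quantile(0.9, rate(prometheus_http_response_size_bytes_bucket[5m]))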

Application metric monitoring

Exposing metrics

The most common way for Prometheus to collect metrics is to scrape them via a pull model. So we first expose the service's metrics on a /metrics endpoint, so that the Prometheus server can scrape our business metrics over HTTP.

Interface example

server := gin.New()
// Register middlewares, including the metrics middleware
server.Use(middlewares.AccessLogger(), middlewares.Metric(), gin.Recovery())

server.GET("/health", func(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{
        "message": "ok",
    })
})

// Expose the metrics endpoint for the Prometheus server to scrape
server.GET("/metrics", Monitor)

func Monitor(c *gin.Context) {
    h := promhttp.Handler()
    h.ServeHTTP(c.Writer, c.Request)
}
Defining metrics

To make this easier to follow, three metric types and two business scenarios are used as examples here.
Example


var (
    // HTTPReqDuration metric: http_request_duration_seconds
    HTTPReqDuration *prometheus.HistogramVec
    // HTTPReqTotal metric: http_requests_total
    HTTPReqTotal *prometheus.CounterVec
    // TaskRunning metric: task_running
    TaskRunning *prometheus.GaugeVec
)

func init() {
    // Request latency of the monitored interfaces
    // The metric type is Histogram
    HTTPReqDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "http request latencies in seconds",
        Buckets: nil,
    }, []string{"method", "path"}) // "method", "path" are labels

    // Number of interface requests
    // The metric type is Counter
    HTTPReqTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "total number of http requests",
    }, []string{"method", "path", "status"}) // "method", "path", "status" are labels

    // Number of tasks currently executing
    // The metric type is Gauge
    TaskRunning = prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "task_running",
        Help: "current count of running task",
    }, []string{"type", "state"}) // "type", "state" are labels

    prometheus.MustRegister(
        HTTPReqDuration,
        HTTPReqTotal,
        TaskRunning,
    )
}

The code above defines and registers the metrics we want to monitor.

Generating metrics

Example

start := time.Now()
c.Next()

duration := float64(time.Since(start)) / float64(time.Second)
path := c.Request.URL.Path

// Increase the request count by 1
controllers.HTTPReqTotal.With(prometheus.Labels{
    "method": c.Request.Method,
    "path":   path,
    "status": strconv.Itoa(c.Writer.Status()),
}).Inc()

// Record the processing time of this request
controllers.HTTPReqDuration.With(prometheus.Labels{
    "method": c.Request.Method,
    "path":   path,
}).Observe(duration)

// Simulate a new task
controllers.TaskRunning.With(prometheus.Labels{
    "type":  shuffle([]string{"video", "audio"}),
    "state": shuffle([]string{"process", "queue"}),
}).Inc()

// Simulate a task completing
controllers.TaskRunning.With(prometheus.Labels{
    "type":  shuffle([]string{"video", "audio"}),
    "state": shuffle([]string{"process", "queue"}),
}).Dec()
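
With these metrics exposed, the questions from the business scenario above can be expressed as PromQL queries. A minimal sketch, assuming the task_running and http_request_duration_seconds metrics defined earlier (the Go client uses its default buckets since Buckets is nil):

# Number of queued tasks per task type (a sustained high value indicates a backlog)
sum(task_running{state="queue"}) by (type)

# 95th percentile request latency per path over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
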
Scraping metrics

Prometheus scrape target configuration:

global:
  # Scrape interval
  scrape_interval: 5s

# Targets
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'local-service'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8000']

In real applications, statically configuring target addresses is not practical. On Kubernetes, Prometheus integrates with the Kubernetes API and currently supports five service discovery modes: Node, Service, Pod, Endpoints, and Ingress.
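
A minimal sketch of pod-based service discovery; the job name is illustrative, and it assumes pods are annotated with prometheus.io/scrape: "true":

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods whose prometheus.io/scrape annotation is "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true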

The metrics are displayed as follows:

(metric display screenshots)

Source: https://www.lxkaka.wang/app-metrics/