A little research on TiDB monitoring methods

Author: buddyyuan | Original source: https://tidb.net/blog/c591a993

TiDB alerting is quite complex. The Prometheus client library is integrated directly into the TiDB code, so while tidb-server is running, its status port 10080 exposes the metrics of the running tidb-server. Prometheus then obtains the monitoring data by pulling from this endpoint and displays it. To understand this mechanism clearly, let's study it today with a bit of Go.

1. Create the application

We first write a Go program that uses the Prometheus client library to expose the program's own metrics. The code is as follows:

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
   // expose the default Prometheus registry on /metrics and start an HTTP server
   http.Handle("/metrics", promhttp.Handler())
   http.ListenAndServe(":8080", nil)
}

After writing the code, click Run to start the program. We can then fetch the program's metrics with curl http://127.0.0.1:8080/metrics, as shown below:

[root@vmtest ~]# curl -s http://127.0.0.1:8080/metrics | grep -v '#'
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
go_goroutines 6
go_info{version="go1.18.1"} 1
go_memstats_alloc_bytes 2.15052e+06
go_memstats_alloc_bytes_total 2.15052e+06
go_memstats_buck_hash_sys_bytes 4224

From the results, we can see some metrics of the Go process itself, including allocated heap memory, the number of goroutines, and so on.

2. Custom metrics

After creating the application, we can define our own metrics. While the program is running, we register a metric and then expose it; once exposed, Prometheus can pull the data.

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus"
   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
   // create a Counter metric and register it with the default registry
   temp := prometheus.NewCounter(prometheus.CounterOpts{Name: "test_Counter", Help: "The current Counter"})
   prometheus.MustRegister(temp)

   // hard-code a value so there is something to scrape
   temp.Add(20)

   http.Handle("/metrics", promhttp.Handler())
   http.ListenAndServe(":8080", nil)
}

This program creates a new Counter through the prometheus package and registers it with MustRegister. Once registered, the metric can be accessed through the API. Click Run again to start the program, and curl http://127.0.0.1:8080/metrics returns our custom test_Counter:

[root@vmtest ~]# curl -s http://127.0.0.1:8080/metrics | grep -i test
# HELP test_Counter The current Counter
# TYPE test_Counter counter
test_Counter 20

Of course, the value here is hard-coded. In more complex programs you can use the four metric types that Prometheus provides:

- Counter: a counter that only increases and never decreases
- Gauge: a gauge value that can increase or decrease
- Histogram: samples observations and counts them in configurable buckets
- Summary: similar to a histogram, also used to analyze the distribution of samples, with quantiles calculated on the client side

At this point, we only need to configure this endpoint in Prometheus, and it will periodically scrape the data through the API.
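As a quick illustration (the metric names below are invented for the demo and are not from TiDB), the other three types can be created and registered in the same way as the Counter above:

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus"
   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
   // Gauge: a value that can go up and down
   inFlight := prometheus.NewGauge(prometheus.GaugeOpts{
      Name: "demo_in_flight_requests", Help: "Current number of in-flight demo requests.",
   })

   // Histogram: samples observations into configurable buckets
   latency := prometheus.NewHistogram(prometheus.HistogramOpts{
      Name:    "demo_latency_seconds",
      Help:    "Latency of demo requests.",
      Buckets: prometheus.ExponentialBuckets(0.0005, 2, 10),
   })

   // Summary: distribution of samples with client-side quantiles
   respSize := prometheus.NewSummary(prometheus.SummaryOpts{
      Name: "demo_response_bytes", Help: "Size of demo responses.",
   })

   prometheus.MustRegister(inFlight, latency, respSize)

   // record some sample values
   inFlight.Set(3)
   latency.Observe(0.042)
   respSize.Observe(512)

   http.Handle("/metrics", promhttp.Handler())
   http.ListenAndServe(":8080", nil)
}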

3. Study the monitoring in TiDB

Next, let's explore the monitoring inside TiDB, taking the alert TiDB_query_duration as an example. First look at its alert expression; if we strip away the outer Prometheus functions, the innermost metric is tidb_server_handle_query_duration_seconds_bucket.

histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) BY (le, instance)) > 1

From the expression above and the description in the official documentation, we know this is a histogram of SQL statement execution time. In the TiDB code, all metrics are placed under the metrics folder, which contains the monitoring for each component, and the definition of this metric can be found in metrics->distsql.go.

The definition itself is very simple: it uses the prometheus library to create a histogram metric named handle_query_duration_seconds. The main purpose of the histogram type is to record data within a certain range by sampling observations into configurable buckets, so that the samples can later be filtered by interval. Therefore a set of buckets also has to be defined, and this definition uses the ExponentialBuckets function, which takes three parameters: the value of the first bucket, the factor, and the number of buckets.

This call generates a series of buckets. The algorithm simply takes the minimum value, multiplies it by the factor 2 to get the next value, and so on, until 29 numbers have been produced. Let's write a similar piece of code and run it.

package main

import "fmt"

func main() {
   // reproduce the bucket boundaries: start at 0.0005 and keep doubling, 29 times
   buckets := make([]float64, 29)
   start := 0.0005
   for i := range buckets {
      buckets[i] = start
      start *= 2
   }
   fmt.Println("demo:", buckets)
}

Running this code produces the full range of buckets: the minimum value is 0.0005 and the maximum is 134217.728.

demo: [0.0005 0.001 0.002 0.004 0.008 0.016 0.032 0.064 0.128 0.256 0.512 1.024 2.048 4.096 8.192 16.384 32.768 65.536 131.072 262.144 524.288 1048.576 2097.152 4194.304 8388.608 16777.216 33554.432 67108.864 134217.728]
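For comparison, the client library provides a helper that produces the same slice directly, which is presumably what the TiDB definition calls with these parameters:

package main

import (
   "fmt"

   "github.com/prometheus/client_golang/prometheus"
)

func main() {
   // same parameters: first bucket 0.0005, factor 2, 29 buckets
   fmt.Println(prometheus.ExponentialBuckets(0.0005, 2, 29))
}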

Once the metric is created, it can be registered. The registration code is implemented in the RegisterMetrics function of metrics->metrics.go:

prometheus.MustRegister(DistSQLQueryHistogram)
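Putting the pieces together, here is a minimal sketch of what the definition, registration, and observation of such a histogram look like with the prometheus client library. The variable name, namespace, and label set below are illustrative and are not copied from the TiDB source:

package main

import (
   "time"

   "github.com/prometheus/client_golang/prometheus"
)

// illustrative stand-in for the histogram defined in the metrics package
var queryDurationHistogram = prometheus.NewHistogramVec(
   prometheus.HistogramOpts{
      Namespace: "tidb",
      Subsystem: "server",
      Name:      "handle_query_duration_seconds",
      Help:      "Bucketed histogram of SQL statement handling duration (seconds).",
      Buckets:   prometheus.ExponentialBuckets(0.0005, 2, 29),
   }, []string{"sql_type"})

func main() {
   // registration, as done in RegisterMetrics
   prometheus.MustRegister(queryDurationHistogram)

   // every handled query records its duration into the matching bucket,
   // similar to what happens when TiDB fetches a query result
   start := time.Now()
   // ... handle the query ...
   queryDurationHistogram.WithLabelValues("Select").Observe(time.Since(start).Seconds())
}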

Once tidb-server is running and SQL statements are executed, the histogram is updated in the fetchResp function of distsql->select_result.go. At this point everything is in place, and the data can be queried from outside through the API. The following is a set of values whose sql_type is Select.

curl http://127.0.0.1:10080/metrics | grep -i tidb_server_handle_query_duration_seconds_bucket | grep -i select
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.0005"} 32
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.001"} 37
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.002"} 39
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.004"} 39
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.008"} 40
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.016"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.032"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.064"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.128"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.256"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.512"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="1.024"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="2.048"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="4.096"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="8.192"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="16.384"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="32.768"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="65.536"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="131.072"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="262.144"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="524.288"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="1048.576"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="2097.152"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="4194.304"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="8388.608"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="16777.216"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="33554.432"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="67108.864"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="134217.728"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="+Inf"} 46

Seeing this set of values, you may find it a little strange that the count has stayed at 46 ever since the bucket le="0.256". In fact, 32 samples fall in the interval (-∞, 0.0005], 5 samples fall in (0.0005, 0.001], and 2 samples fall in (0.001, 0.002]: each bucket value is cumulative. Let's run an Analyze statement to verify this. First, we record the current value of each bucket in the histogram.

curl http://127.0.0.1:10080/metrics | grep -i tidb_server_handle_query_duration_seconds_bucket | grep -i AnalyzeTable
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.001"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.002"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.004"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.008"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.016"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.032"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.064"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.128"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.256"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.512"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1.024"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2.048"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4.096"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8.192"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16.384"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="32.768"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="65.536"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="131.072"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="262.144"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="524.288"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1048.576"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2097.152"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4194.304"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8388.608"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16777.216"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="33554.432"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="67108.864"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="134217.728"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="+Inf"} 2

Use the MySQL client to run an Analyze statement. This SQL took 8.38 seconds.

MySQL [test]> analyze table stock_bak;
Query OK, 0 rows affected, 1 warning (8.38 sec)

According to our reasoning, 8.38 seconds falls into the bucket le="16.384". That bucket currently holds 1, so it should become 2; by the same logic, every bucket up to le="524.288" should become 2, and from le="1048.576" onward each bucket should increase by 1 and become 3. Let's curl again to verify whether this reasoning holds.

tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.001"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.002"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.004"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.008"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.016"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.032"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.064"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.128"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.256"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.512"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1.024"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2.048"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4.096"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8.192"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16.384"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="32.768"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="65.536"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="131.072"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="262.144"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="524.288"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1048.576"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2097.152"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4194.304"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8388.608"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16777.216"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="33554.432"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="67108.864"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="134217.728"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="+Inf"} 3

With the above information in hand, we can think about how to add native monitoring to TiDB: define a metric and register it yourself in the way introduced above.

end

This article has briefly described how monitoring works inside TiDB and how to add a metric in the native way. Of course, you could also obtain the same results by querying through SQL and exposing them to Prometheus via node_exporter, but querying through SQL places an extra burden on the database. After all, some SQL queries against data dictionary views are very slow; once such a query is delayed, queries start to pile up, and the impact on the database can be huge.