Quick Introduction: Using Prometheus (synced from Evernote)

Background

I knew almost nothing about Prometheus monitoring, only that it collects monitoring data. I could not write PromQL and did not know its drawbacks.
So I started with a few questions:

Question 1
  1. In Spring Boot Actuator, what is the relationship between the exposed metrics endpoint and the prometheus endpoint?
  2. What is Micrometer, and what is its relationship with Prometheus?

Learning notes

I had set up Prometheus and exposed all Spring Boot Actuator endpoints (*), but no data arrived for a long time, so I searched for articles on how to connect the two systems:
VMWare: https://tanzu.vmware.com/developer/guides/spring-prometheus/?utm_source=pocket_saves
CalliCoder: https://www.callicoder.com/spring-boot-actuator-metrics-monitoring-dashboard-prometheus-grafana/?utm_source=pocket_saves This article is the better one: it shows that the Prometheus configuration file needs extra settings before monitoring data is collected, namely metrics_path.

Answer 1
  1. The metrics and prometheus endpoints expose the same data, displayed in different formats.
  2. Micrometer is similar to SLF4J: both define a facade. For example, PrometheusMeterRegistry extends Micrometer's MeterRegistry.

    Micrometer provides a simple facade over the instrumentation clients for a number of popular monitoring systems. Currently, it supports the following monitoring systems: Atlas, Datadog, Graphite, Ganglia, Influx, JMX, and Prometheus.

Data type

Refer to yunlzheng.gitbook
Counter: a counter that only increases and never decreases
Gauge: like a speedometer, a value that can go up and down, for example CPU usage
Histogram: records the number of samples that fall into each interval (bucket); quantiles are calculated on the server side
Summary: quantiles are calculated on the client side, so queries perform better
Histogram and Summary are both used to describe the distribution of data. I remember that the jmap tool also uses the word histogram when it displays the number of instances of each class.
Example:

Histogram
prometheus_tsdb_compaction_chunk_range_bucket{le="100"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="400"} 0
Summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} 0.012352463
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} 0.014458005
Both types also expose additional _count and _sum series
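
To make "the quantile is calculated on the server side" concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative le buckets by linear interpolation, in the spirit of histogram_quantile. The function names and the sample data are made up for illustration; this is not the Prometheus implementation.

```python
import math

def to_cumulative_buckets(observations, upper_bounds):
    """Count observations into cumulative `le` buckets, plus a final +Inf bucket."""
    bounds = sorted(upper_bounds) + [math.inf]
    return [(le, sum(1 for v in observations if v <= le)) for le in bounds]

def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) by interpolating linearly inside the bucket
    that contains the target rank. The first bucket is assumed to start at 0."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == lower_count:
                # Nothing to interpolate: fall back to the lower bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return math.nan

# Hypothetical observations, bucketed with the bounds from the example above.
example_buckets = to_cumulative_buckets([50, 150, 150, 250, 350, 500], [100, 400])
```

The estimate is only as precise as the bucket layout, which is the trade-off against a Summary, where the client reports the quantile directly.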

Quick syntax

Arithmetic operators: addition, subtraction, multiplication, division, modulo and power are available (+, -, *, /, %, ^)
Logical operators: and, or, unless (set difference). There is also the bool modifier, which makes comparisons such as > and < return 0 or 1
Label matchers: =~ regex match, !~ negated regex match, = equals, != not equals
Example statement:
http_requests_total{environment=~"staging|testing|development",method!="GET"}[5m] offset 1d
The metric name is stored in the special label __name__
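
The four label matchers can be sketched in Python as below. This is an illustration under assumptions: the function name matches and the tuple format are hypothetical, and Python's re module stands in for the RE2 engine Prometheus uses. The one behaviour worth noticing is that regex matchers are fully anchored, so "staging" does not match "staging-eu".

```python
import re

def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True

# The selector from the example statement, written as matcher tuples.
selector = [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
```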

Question 2
  3. How do you write a sum over a Prometheus Counter, including handling counter resets?
  4. How do you write Prometheus alerts?
  5. How does Prometheus discover HTTP service endpoints without manual configuration?
  6. When a Prometheus alert expression matches multiple pods and each value is > 0, how is it evaluated?
  7. How does Prometheus discover a newly deployed service, and how is this handled in a k8s cluster?
  8. What is the calculation logic of Prometheus increase?
  9. Should it be sum first and then rate, or rate first and then sum?
Answers 3 and 8: increase and rate handle counter resets

Put simply, increase uses extrapolation. Extrapolation is a statistical concept; like interpolation, it uses known values to infer unknown ones. Methods include linear extrapolation, polynomial extrapolation, conic extrapolation and others, all described on Wikipedia. When trying to understand the extrapolation in increase, I think of it as an integral: increase computes an increment! Take the v2.43 source code as an example: https://github.com/prometheus/prometheus/blob/v2.43.0/promql/functions.go#L66 It is easy to read. The core of the calculation scales the result by the ratio of the extrapolated interval extrapolateToInterval to sampledInterval. Two points deserve attention. One is the handling of counter resets (this is very important), as follows:

resultValue = samples.Points[len(samples.Points)-1].V - samples.Points[0].V
prevValue := samples.Points[0].V
for _, currPoint := range samples.Points[1:] {
    if currPoint.H != nil {
        return nil // Range contains a mix of histograms and floats.
    }
    if !isCounter {
        continue
    }
    if currPoint.V < prevValue {
        resultValue += prevValue
    }
    prevValue = currPoint.V
}

The other is how the extrapolation interval is calculated: the threshold is the average interval between samples multiplied by the coefficient 1.1.
There is also a discussion on GitHub arguing that extrapolation is harmful: "rate()/increase() extrapolation considered harmful" (#3746). That discussion is currently locked:
"It's appropriate to lock a conversation when the entire conversation is not constructive or violates your community's code of conduct or GitHub's Community Guidelines" (locking-conversations). Evidently the Prometheus developers did not accept this proposal.
The SF answer "Do I understand Prometheus's rate vs increase functions correctly?" is also worth reading: in an ideal world, the samples are taken exactly on whole seconds, and the observation window also starts and ends on whole seconds.
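
The two points above (reset compensation and the 1.1 threshold) can be put together in a short Python sketch. It follows my reading of extrapolatedRate in v2.43 for plain float counters only; the function name, the tuple format and the sample numbers are made up, and histogram samples are ignored.

```python
def extrapolated_increase(samples, range_start, range_end, is_rate=False):
    """samples: list of (timestamp_seconds, value), sorted by time, inside the range."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    first_t, first_v = samples[0]
    last_t, last_v = samples[-1]

    # Raw difference, plus the value lost at every counter reset.
    result = last_v - first_v
    prev = first_v
    for _, v in samples[1:]:
        if v < prev:
            result += prev
        prev = v

    duration_to_start = first_t - range_start
    duration_to_end = range_end - last_t
    sampled_interval = last_t - first_t
    avg_gap = sampled_interval / (len(samples) - 1)

    # A counter cannot be negative, so never extrapolate past its zero point.
    if result > 0 and first_v >= 0:
        duration_to_zero = sampled_interval * (first_v / result)
        duration_to_start = min(duration_to_start, duration_to_zero)

    # Extrapolate to the range boundary only if a sample sits close to it.
    threshold = avg_gap * 1.1
    extrapolate_to = sampled_interval
    extrapolate_to += duration_to_start if duration_to_start < threshold else avg_gap / 2
    extrapolate_to += duration_to_end if duration_to_end < threshold else avg_gap / 2

    factor = extrapolate_to / sampled_interval
    if is_rate:
        factor /= (range_end - range_start)
    return result * factor

# Hypothetical 60 s window scraped every 15 s, with a reset between t=20 and t=35.
window = [(5, 10), (20, 20), (35, 5), (50, 15)]
```

For this window the raw increase is 25 over 45 sampled seconds, and it is scaled by 60/45 because both edge samples are within the threshold of the window boundaries. That scaling is why increase can return non-integer values for integer counters.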

Answer 9

The documentation says that when rate and aggregation are both needed, rate must come first: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate The reason is that rate handles counter resets (see the function extrapolatedRate), while an aggregation such as sum does not. If you aggregate first, a reset of one target is either hidden inside the sum, so the increase is lost and the result is smaller than the true value, or it makes the whole sum drop and is compensated as if every target had reset.

Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.

Back then I tried to write sum first and then rate, and I could not get it to work for a long time!
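
A small Python sketch makes the ordering problem visible. It uses a simplified increase without extrapolation, and the two pods and their values are invented for illustration.

```python
def raw_increase(values):
    """Increase of one counter series with reset compensation (no extrapolation)."""
    result = values[-1] - values[0]
    prev = values[0]
    for v in values[1:]:
        if v < prev:  # counter reset: add back what was lost
            result += prev
        prev = v
    return result

def increase_then_sum(series):
    """Correct order: compensate resets per series, then aggregate."""
    return sum(raw_increase(values) for values in series)

def sum_then_increase(series):
    """Wrong order: aggregate first, so per-series resets are no longer visible."""
    summed = [sum(point) for point in zip(*series)]
    return raw_increase(summed)

# pod_b restarts before its last sample in both scenarios.
spike_case = [[100, 110, 120], [50, 60, 5]]    # the sum drops: looks like a full reset
hidden_case = [[100, 110, 200], [50, 60, 5]]   # the sum still grows: reset goes unnoticed
```

In the first scenario the true increase is 35 but summing first yields 145; in the second the true increase is 115 but summing first yields 55.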

Answers 3-7

3- Once you have learned the expression syntax you can write it: just sum directly; for counter resets, see the source code in answer 8
4- Alerts have a dedicated alerting rule structure
5- This is the general problem of service discovery. If Prometheus does not learn about a data source through configuration, the only alternative is for the data source to report actively. Spring Actuator can only be pulled passively, so some configuration is always needed, for example the annotation prometheus.io/scrape
6- Each matching series fires its own alert. What was I trying to ask back then?
7- Prometheus Operator, ServiceMonitor CRD, Git-Book

Going further

Shannon sampling theorem – Nyquist sampling law

Analyzing monitoring data is similar to signal processing. Recently I saw our Kafka consumer lag show a sawtooth pattern: it increased slowly and then suddenly dropped to zero. This made no sense at all, because the consumption capacity of a pod cannot change that much. After searching, I found a StackOverflow answer (How does a sawtooth pattern of Kafka consumer lag emerge?) that is worth recording. One possibility is that the observation window is larger than the data collection interval. In signal processing terms, this violates Shannon's sampling theorem: to restore an analog signal without distortion, the sampling frequency must be greater than twice the highest frequency in the signal's spectrum. When the sampling frequency is too low, aliasing occurs, also known as the moire effect (the stripes that appear when a phone camera photographs a screen). Ah, great information science and technology! In other words, the sampling frequency may be too low, which distorts the plotted graph. On second thought, though, the backlog cannot change that much in a short time, so the more likely cause is that the messages expired.
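
Aliasing is easy to reproduce in a few lines of Python. The numbers below are purely illustrative and have nothing to do with the Kafka setup: a signal with a 10 second period, sampled every 9 seconds, yields exactly the same samples as a much slower, inverted signal with a 90 second period.

```python
import math

def sample(signal, interval, count):
    """Evaluate signal(t) at t = 0, interval, 2*interval, ..."""
    return [signal(k * interval) for k in range(count)]

def fast_signal(t):
    return math.sin(2 * math.pi * t / 10)    # true period: 10 s

def apparent_signal(t):
    return -math.sin(2 * math.pi * t / 90)   # what the samples seem to show: 90 s

# Sampling every 9 s is far below the required rate of more than one sample per 5 s.
fast_samples = sample(fast_signal, interval=9, count=20)
apparent_samples = sample(apparent_signal, interval=9, count=20)
max_difference = max(abs(a - b) for a, b in zip(fast_samples, apparent_samples))
```

Since max_difference is zero up to floating point error, nothing in the sampled data can tell the two signals apart, which is how a slow sawtooth can appear on a dashboard without existing in the underlying system.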

vector cannot contain metrics with the same labelset

A vector cannot contain metrics with the same labelset. But why would two metrics end up with the same labelset in the first place?
A vector is a set of related time series:

Since Prometheus is a timeseries database, all data is in the context of some timestamp. The series that maps a timestamp to recorded data is called a timeseries. A set of related timeseries is called a vector.
When Prometheus transforms data, the metric name is dropped, because the result may no longer carry the original meaning (issues-380).
So once __name__ is removed, series that differed only by name have the same labelset within the vector. The fix is to use label_replace to copy the value of __name__ into a new label before applying rate.
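
The failure and the fix can be imitated in Python. This is a sketch under assumptions: the two metric names are invented, label_replace here supports only a $1 reference, and Python's re module stands in for RE2.

```python
import re

def drop_metric_name(vector):
    """Strip __name__ from every series; fail if two series then share a labelset."""
    seen = set()
    out = []
    for labels in vector:
        rest = {k: v for k, v in labels.items() if k != "__name__"}
        key = frozenset(rest.items())
        if key in seen:
            raise ValueError("vector cannot contain metrics with the same labelset")
        seen.add(key)
        out.append(rest)
    return out

def label_replace(vector, dst_label, replacement, src_label, regex):
    """Copy the first capture group of a fully anchored match into dst_label.
    Series whose src_label does not match are returned unchanged."""
    pattern = re.compile(regex)
    out = []
    for labels in vector:
        new_labels = dict(labels)
        match = pattern.fullmatch(labels.get(src_label, ""))
        if match:
            new_labels[dst_label] = replacement.replace("$1", match.group(1))
        out.append(new_labels)
    return out

# Two hypothetical metrics that differ only by name.
vector = [
    {"__name__": "camel_proxy_a_count", "pod": "p1"},
    {"__name__": "camel_proxy_b_count", "pod": "p1"},
]
fixed = drop_metric_name(label_replace(vector, "name_label", "$1", "__name__", "(.+)"))
```

Calling drop_metric_name(vector) directly raises the error, while the label_replace variant keeps the two series distinct through name_label.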

how-to-avoid-vector-cannot-contain-metrics-with-the-same-labelset-error-when-p
rate(label_replace({__name__=~"camel_proxy.*count"}, "name_label", "$1", "__name__", "(.+)")[5m:])

  1. $1 refers to the first capture group matched by the regular expression in the last argument
    Label replacement function:
    label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string)
  2. Note the [5m:]. This is subquery syntax: label_replace returns an instant vector, while rate needs a range vector
    VictoriaMetrics doesn't have this problem; they avoid it
    In the early years of Prometheus, users complained that keeping the original __name__ after a data transformation caused confusion, so the metric name was removed. A few years later, users complain about the error caused by the missing __name__
    Jianshu: cloud native monitoring, Prometheus reports duplicate labels when calculating the sample rate?
    StackOverflow: Prometheus instant vector vs range vector

Instant vector – a set of timeseries where every timestamp maps to a single data point at that “instant”
Range vector – a set of timeseries where every timestamp maps to a “range” of data points, recorded some duration into the past.
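
The two definitions can be sketched in Python as follows. The function names and the sample data are hypothetical. The 300 second lookback mirrors the default lookback delta of 5 minutes, and the window boundaries are treated as inclusive on both sides, which is a simplification.

```python
def instant_vector(series, t, lookback=300):
    """For each series, keep the newest sample at or before t within the lookback."""
    out = {}
    for name, samples in series.items():
        candidates = [(ts, v) for ts, v in samples if t - lookback <= ts <= t]
        if candidates:
            out[name] = candidates[-1]  # samples are sorted by timestamp
    return out

def range_vector(series, t, window):
    """For each series, keep every sample recorded within `window` seconds before t."""
    out = {}
    for name, samples in series.items():
        selected = [(ts, v) for ts, v in samples if t - window <= ts <= t]
        if selected:
            out[name] = selected
    return out

# One hypothetical series scraped every 15 s, as (timestamp_seconds, value) pairs.
series = {'up{job="a"}': [(0, 1), (15, 1), (30, 0), (45, 1)]}
```

An instant vector holds one point per series, so it can be graphed or compared directly; a range vector holds a list per series and has to go through a function such as rate before it can be graphed.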