[Loki] Best Practices – Metrics based on LogQL

Directory

    • 1. Preface
    • 2. Best Practices
    • 3. Postscript
    • 4. Reference

1. Preface

My career has been spent in the traditional software industry, so most of the systems I have worked with are monolithic systems with a fairly low ceiling on scale. As a result, in terms of both the team's technical culture and the actual resources invested, monitoring has been treated as dispensable, like the proverbial rabbit bagged on New Year’s Eve: the new year gets celebrated with or without it.

Although I have had few real opportunities to work with monitoring in large-scale software architectures, I have kept trying to deepen my understanding of it through everyday reading of the theory and through deliberate observation and reflection in my daily work.

Running an application without monitoring is like driving with your eyes closed: you are simply gambling.
~
If you can’t measure it, you can’t optimize it, so monitoring should be the first step of any improvement.
~
More importantly, a piece of common sense still needs to be repeated: using monitoring tools is not the same as actually realizing and efficiently applying monitoring. The main purposes of monitoring are:

  1. When a problem is reported, help locate it faster and keep shortening its MTTR. (This work never ends.)
  2. In the early stage of a problem, detect it before the customer does, which gives you more room to manoeuvre in handling it.
  3. Through statistical analysis, anticipate problems before they occur and provide guidance and direction for application optimization.

This article focuses on the third point above, “providing guidance and direction for application optimization through statistical analysis.” In my view, this is where monitoring delivers its greatest value: solving problems is only the most basic CMMI level 1, while being able to predict problems is at least CMMI level 4.

2. Best Practices

Note: The queries below are only examples meant to inspire thinking. What matters most is to look at the system as a whole, from both the R&D and the product perspective, put yourself in others’ shoes, and independently analyze and summarize more indicators. Keep pointing out the direction for system optimization, keep control of that direction firmly in your own hands, and turn passivity into initiative.

Before we officially start, let’s first explain the background.

  1. The project in the background is a microservice architecture, and its logs fall into two categories: access logs and business logs. The specific formats are as follows:

    # access log (system access log, automatically implemented using the logback-access component)
    [%t{yy-MM-dd HH:mm:ss.SSS}][%tid][%clientHost][%requestURL,%statusCode][%elapsedTime,%i{Referer}][%reqAttribute{client}][%i{User-Agent}][%reqAttribute{userId}][%reqAttribute{serviceName}][%reqAttribute{serviceSourceType}][%reqAttribute{serviceType}][%reqAttribute{serviceOwner}][#%requestContent#][#%responseContent#]
    
    # business log (log output using log.xxx() method in business code)
    [%d{yy-MM-dd HH:mm:ss.SSS}][%X{tid}][pid:${PID:-}][tid:%.15t][%-40.40logger:%line][%5p] %msg%n
    
  2. When promtail collects the logs, it attaches only the necessary labels: module (the module the log belongs to), job, and filename. (Following best practices, we keep the number of labels to a minimum.)
    2.1 For the module label, we simply use the existing module names: api-gateway, xxx, etc.
    2.2 For the job label, we divide it into gatewayLog (the access log of the gateway module, separated out for dedicated statistics), accessLog (the access logs of the other microservice modules), normalLog (info/warn-level logs), and errorLog (error-level logs).
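
To double-check that these labels arrive in Loki as intended, you can first run a simple aggregation over the label set. This is only a sanity-check sketch based on the labelling scheme above, not one of the metric queries themselves:

    # count log lines per (module, job) pair over the last hour,
    # using only the labels attached by promtail
    sum by (module, job) (count_over_time({module=~".+"}[1h]))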

Against the above background, we have so far summarized the following metric queries:

######################### System QPS - using api-gateway as the entry point (last five minutes)
rate({module="api-gateway", job="gatewayLog"} | drop filename [5m])
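
# A hedged follow-up sketch: wrap the rate above in sum() to collapse the per-stream
# rates into a single system-wide QPS series, which is easier to read on a dashboard panel.
sum(rate({module="api-gateway", job="gatewayLog"} | drop filename [5m]))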

######################### Total system visits - using api-gateway as the entry point (in the past 2 days)
count_over_time({job="gatewayLog"} | drop filename [2d])
 
######################### System error rate - using api-gateway as the entry point (past five minutes)
rate({module="api-gateway", job="errorLog"} | drop filename [5m])
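
# A hedged combination sketch: divide the error rate by the access rate to get an error
# ratio, assuming the gateway access log approximates total traffic.
sum(rate({module="api-gateway", job="errorLog"} | drop filename [5m]))
  /
sum(rate({module="api-gateway", job="gatewayLog"} | drop filename [5m]))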
 
######################### Total system errors - using api-gateway as the entry point (past five minutes)
count_over_time({module="api-gateway", job="errorLog"} | drop filename [5m])
 
######################### Total number of errors in each module of the system (past two days)
# The result here is very interesting: the errors mainly occur in the api-gateway and server-manager modules.
count_over_time({job="errorLog"} | drop filename [2d])
 
 
######################### Total number of ordinary logs for each module of the system (past two days)
# Combined with the "total number of errors" above, some interesting statistics emerge:
# over the past two days the server-manager module produced 42981 error-log lines but only 117 normal-log lines;
# api-gateway remains the largest source of logs, ahead of the rest by three orders of magnitude.
count_over_time({job="normalLog"} | drop filename [2d])
 
 
######################### Total number of all logs in each module of the system (past two days) ---- choose one of the two below
sum(count_over_time({module=~".+"} | drop filename [2d])) by (module)

count_over_time({module=~".+"} | drop filename, job [2d])
 
######################### Ranking of URL request time consumption
# Filter out the ten most time-consuming URLs in the system and analyze whether there is room for further optimization.
sort_desc(topk(10, quantile_over_time(0.99,
  {module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | requestUrl !~ ".*-proxy/.*"
    | unwrap elapsedTime [1h]) by (requestUrl)) by (elapsedTime))
 
 
sort_desc(topk(10, avg_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | drop clientIp,filename,job,level,logtime,method,module,msg,protocol,referer,serviceName,serviceOwner,serviceSourceType,serviceType,statusCode,tid,userAgent,userName
    | unwrap elapsedTime [1h]) by (requestUrl)))
 
######################### P99 request latency for a given URL (past hour)
quantile_over_time(0.99,
  {module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | requestUrl = "/api/server-manager/xxx/yyy/zzz"
    | unwrap elapsedTime [1h]) by (requestUrl)
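
# A hedged companion sketch: the median (P50) for the same URL, useful to compare against
# the P99 above and gauge how heavy the long tail is.
quantile_over_time(0.50,
  {module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | requestUrl = "/api/server-manager/xxx/yyy/zzz"
    | unwrap elapsedTime [1h]) by (requestUrl)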
 
######################### Average request time for a given URL (past hour)
# Swap avg_over_time for max_over_time or min_over_time to get the maximum or minimum request time over the past hour.
avg_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | requestUrl = "/api/server-manager/xxx/yyy/zzz"
    | drop clientIp,filename,job,level,logtime,method,module,msg,protocol,referer,serviceName,serviceOwner,serviceSourceType,serviceType,statusCode,tid,userAgent,userName
    | unwrap elapsedTime [1h])
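
# A hedged sketch of the max_over_time variant mentioned in the comment above
# (use min_over_time in the same way for the minimum).
max_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | level = "ACCESS"
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | requestUrl = "/api/server-manager/xxx/yyy/zzz"
    | unwrap elapsedTime [1h])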
 
######################### Troubleshoot cases where the serviceName field is empty
sum(count_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
    | drop clientIp,filename,job,level,logtime,method,module,msg,protocol,referer,serviceOwner,serviceSourceType,serviceType,statusCode,tid,userAgent,userName
    | serviceName = "" [2d])) by (requestUrl)

######################### Whether an interface has been called, and how many times: used to filter out obsolete interfaces
{module="gis-manager", job="accessLog"}
  | json
  | __error__ = ""
  #| level = "ACCESS"
  | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
  | requestUrl =~ ".*/services.*"
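
# A hedged counting sketch: the same filter turned into a per-URL call count over the
# past two days, so interfaces that are never called stand out at a glance.
sum by (requestUrl) (count_over_time({module="gis-manager", job="accessLog"}
  | json
  | __error__ = ""
  | label_format requestUrl=`{{regexReplaceAll "(.*)\?.*" .requestUrl "${1}"}}`
  | requestUrl =~ ".*/services.*" [2d]))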

#====================== Request volume caused by non-manual access
sum(count_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | userAgent = "fasthttp" or userAgent = "Apache-HttpClient/4.5.13 (Java/1.8.0_332)" [2d])
    )
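
# A hedged breakdown sketch: group the same count by userAgent to see which automated
# clients generate the traffic.
sum by (userAgent) (count_over_time({module="api-gateway", job="gatewayLog"}
    | json
    | __error__ = ""
    | userAgent = "fasthttp" or userAgent = "Apache-HttpClient/4.5.13 (Java/1.8.0_332)" [2d]))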

# api-gateway exception log statistics: count the total number of each exception type and the corresponding URLs, analyze where the most problems occur, and identify optimization points.
sum(count_over_time({module="api-gateway", job="errorLog"}
    | drop filename
    !~ "(?s).*PreAuthFilter.*"
    |= "Exception"
    | json
    | __error__ = ""
    | label_format exceptionType=`{{regexReplaceAll "(?s).+?\s(.*?)Exception:.*" .msg "${1}Exception"}}`
    | drop msg [2h]
    )) by (exceptionType)

3. Postscript

As you can see, once you are familiar with LogQL, the expressions above can be written on the spot as the need arises. This article is therefore meant as a summary and a source of new ideas; I hope to keep building out the system’s real-time metric library and slow down the system’s decay.

In many past optimizations, although we tried to consider the overall picture, without a truly global perspective the actual effect was mostly single-point optimization.

After introducing observability metrics, however, the situation can fundamentally change: there is now a global means of inspection, available at any time, to check whether your ideas have drifted off course, and an objective, data-backed global view to identify the main contradictions of the current system, instead of relying on “feeling” to decide which optimizations should come first.

4. Reference

  1. Official Site – LogQL: Log query language