Still using Zipkin for distributed service tracing? Give this a try!

Since the advent of Spring Cloud, microservices have taken the world by storm, and enterprise architectures are shifting from traditional SOA to microservices. Microservices are a double-edged sword, however: along with their many advantages they bring real difficulties in operations, performance monitoring, and error troubleshooting.

In large projects, the service architecture may include dozens or even hundreds of service nodes, and a single request often spans multiple microservices. Figuring out which nodes a request passes through and how each node performs is an urgent problem to solve. This is where APM systems for distributed architectures come in.

What is an APM system?

An APM (Application Performance Monitor) system helps you understand system behavior and analyze performance problems, so that when a failure occurs the root cause can be located and resolved quickly.

[Google Dapper](http://bigbully.github.io/Dapper-translation), described in a paper Google published, can be considered the earliest APM system. It proved so helpful to Google's developers and operations teams that Google shared its design publicly in the Dapper paper.

Since then, many technology companies have designed excellent APM frameworks based on the principles in that paper, such as `Pinpoint` and `SkyWalking`.

Spring Cloud also officially integrates such a system: `Spring Cloud Sleuth`, combined with `Zipkin`.

Basic principles of APM

At present, most APM systems are implemented based on Google’s Dapper principle. Let’s briefly take a look at the concepts and implementation principles in Dapper.

Let’s first look at an example request call:

1. The service cluster includes: front-end (A), two middle layers (B and C), and two back-ends (D and E)

2. When the user initiates a request, it first reaches the front-end service A, and then service A makes RPC calls to service B and service C respectively;

3. Service B responds to A after processing, but service C still needs to interact with the back-end service D and service E before returning to service A. Finally, service A responds to the user’s request;

(Figure 1: the example request call across services A–E)

How can tracking be implemented?

Google’s Dapper has designed the following concepts to record request links:

- Span: the basic unit of work in a request. Each call in the chain (RPC, REST, database call) creates a Span. Its approximate structure is as follows:

    type Span struct {
        TraceID    int64        // ID of the complete request this span belongs to
        Name       string       // name of this unit of work
        ID         int64        // span_id of the current call
        ParentID   int64        // span_id of the upper-layer (calling) service; empty for the root service's span
        Annotation []Annotation // annotations recording details of the call, such as timing
    }

- Trace: a complete call chain, a tree structure made up of multiple Spans and identified by a unique TraceID.

The spans of a single request can be chained together through spanId and parentId:

(Figure 2: spans of one request linked by spanId and parentId)

Naturally, from the moment the request reaches the server until the response is returned, every span carries the same unique identifier, trace_id.
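To make this concrete, here is a minimal, hypothetical Java sketch (Java 16+; this is not Dapper's or SkyWalking's real data model) showing how the spans of the A–E example above would share one trace_id and chain together through spanId and parentId:

```java
// Illustrative sketch only: spans of one request share a traceId and
// link into a tree via parentId.
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class TraceSketch {

    // One unit of work: an RPC/REST/database call.
    record Span(String traceId, long spanId, Long parentId, String name) {}

    public static void main(String[] args) {
        String traceId = UUID.randomUUID().toString(); // shared by every span of the request
        List<Span> trace = new ArrayList<>();

        Span a = new Span(traceId, 1, null, "front-end A");      // root span: no parent
        Span b = new Span(traceId, 2, a.spanId(), "service B");  // A -> B
        Span c = new Span(traceId, 3, a.spanId(), "service C");  // A -> C
        Span d = new Span(traceId, 4, c.spanId(), "service D");  // C -> D
        Span e = new Span(traceId, 5, c.spanId(), "service E");  // C -> E
        trace.addAll(List.of(a, b, c, d, e));

        // Reconstructing the call tree is just a matter of following parentId links.
        for (Span s : trace) {
            System.out.printf("trace=%s span=%d parent=%s name=%s%n",
                    s.traceId(), s.spanId(), s.parentId(), s.name());
        }
    }
}
```

Walking the parentId links from any span back to the root reconstructs exactly the call tree sketched in Figure 2.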

APM selection criteria

Current mainstream APM frameworks all include the following components to collect and display trace information:

- Probe (Agent): collects trace information about service calls while the application is running and sends it to the collector

- Collector: formats the collected data and writes it to storage

- Storage: persists the data

- WebUI: aggregates and displays the collected information

Therefore, choosing a qualified APM framework comes down to comparing how these components behave. The main comparison criteria are:

- Probe performance

Mainly the agent's impact on the service's throughput, CPU, and memory. If the probe noticeably slows a microservice down while collecting its runtime data, few people will be willing to use it.

- Collector scalability

Whether the collector can scale horizontally to support large server clusters and remain highly available.

- Comprehensive call-chain data analysis

Analysis should be fast and cover as many dimensions as possible. A tracing system that gives feedback quickly enough lets you react quickly to anomalies in production, and ideally it provides code-level visibility so failure points and bottlenecks are easy to locate.

- Transparency to development, easy to switch on and off

As an auxiliary component, it should be as non-intrusive to business systems as possible, transparent to users, and add little burden for developers.

- Complete call-chain application topology

Automatically detects the application topology to help you understand the application architecture.

Next, we compare these criteria across three common APM frameworks:

- [Zipkin](https://link.juejin.im/?target=http://zipkin.io/): a distributed tracing system open sourced by Twitter. It collects timing data from services to troubleshoot latency problems in microservice architectures, covering data collection, storage, search, and display.

- [Pinpoint](https://pinpoint.com/): an APM tool for large-scale distributed systems, written in Java and open sourced by Korea's Naver.

- [Skywalking](https://skywalking.apache.org/zh/): an excellent APM component originating in China, a system for tracing, alerting on, and analyzing the runtime behavior of distributed Java application clusters. It is now an Apache top-level project.

The comparison between the three is as follows:

| Comparison item | Zipkin | Pinpoint | SkyWalking |
| --------------- | ------ | -------- | ---------- |
| Probe performance | Medium | Low | High |
| Collector scalability | High | Medium | High |
| Call-chain data analysis | Low | High | Medium |
| Transparency to development | Medium | High | High |
| Call-chain application topology | Medium | High | Medium |
| Community support | High | Medium | High |

As you can see, Zipkin's probe performance, transparency to development, and data analysis capabilities are all unremarkable, which makes it the least attractive of the three.

Pinpoint has clear advantages in data analysis capability and transparency to development, but it is relatively complex to deploy and demands significant hardware resources.

Skywalking excels in probe performance and transparency to development, and its data analysis capability is also good. More importantly, it is easier and more flexible to deploy, which makes it a better fit for small and medium-sized companies than Pinpoint.

Introduction to Skywalking

SkyWalking was created in 2015 and initially provided distributed tracing. Starting with version 5.x, the project evolved into a fully functional Application Performance Management system.

It is used to trace, monitor, and diagnose distributed systems, especially those built with microservice architectures, cloud-native or container technologies. It provides the following main features:

- Distributed tracing and context propagation

- Application, instance, and service performance metric analysis

- Root cause analysis

- Application topology analysis

- Application and service dependency analysis

- Slow service detection

- Performance optimization

Official website address: http://skywalking.apache.org/

(Figure 3)

Main features:

- Multi-language probes or libraries
  - Java automatic probe: trace and monitor applications without modifying their source code (an optional in-code API is sketched after this list)
  - Other language probes provided by the community:
    - [.NET Core](https://github.com/OpenSkywalking/skywalking-netcore)
    - [Node.js](https://github.com/OpenSkywalking/skywalking-nodejs)

- Multiple backend storage options: ElasticSearch, H2

- OpenTracing support
  - The Java automatic probe works together with the OpenTracing API

- Lightweight, fully functional back-end aggregation and analysis

- Modern Web UI

- Log integration

- Alarms for applications, instances, and services
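The Java agent itself requires no code changes, but for cases like log correlation or custom tags SkyWalking also offers an optional `apm-toolkit-trace` module. The sketch below is only an illustration under that assumption: the `OrderService` class and its logic are made up, and the calls are harmless no-ops if the agent is not attached.

```java
// Optional: assumes the org.apache.skywalking:apm-toolkit-trace dependency
// is on the classpath. The agent itself needs no code changes; this API is
// only for applications that want to enrich traces manually.
import org.apache.skywalking.apm.toolkit.trace.ActiveSpan;
import org.apache.skywalking.apm.toolkit.trace.Trace;
import org.apache.skywalking.apm.toolkit.trace.TraceContext;

public class OrderService {   // hypothetical business class, for illustration only

    @Trace   // creates a local span for this method within the current trace
    public String placeOrder(String itemId) {
        // Tag the active span so the value shows up in the SkyWalking UI.
        ActiveSpan.tag("item.id", itemId);

        // The current trace ID can be written to logs to correlate them with traces.
        String traceId = TraceContext.traceId();
        System.out.println("placing order, traceId=" + traceId);

        return "order-created";
    }
}
```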

Installation of SkyWalking

First, let's take a look at the official architecture diagram of SkyWalking:

(Figure 4: SkyWalking architecture)

It is roughly divided into four parts:

- skywalking-oap-server: the Observability Analysis Platform (OAP) service, which collects and processes the data sent by the probes

- skywalking-UI: the Web UI service provided by SkyWalking; it graphically displays service chains, topology diagrams, traces, performance monitoring, and so on

- Agent: the probe, which gathers trace and performance information about service calls and sends it to SkyWalking's OAP service

- Storage: the storage layer; elasticsearch is the usual choice

SkyWalking can be deployed on Windows or Linux. Here we install it on Linux, so first make sure elasticsearch is up and running in your Linux environment.

The next installation is divided into three steps:

- Download the installation package

- Install SkyWalking's OAP service and WebUI

- Deploy probes in the services

Download the installation package

The installation package can be downloaded from Skywalking’s official website, http://skywalking.apache.org/downloads/

The version used here is 8.0.1:


(Figure 5: download page)

Downloaded installation package:

(Figure 6: downloaded package)

Install OAP service and WebUI

Install

Extract the downloaded installation package to a directory on Linux:

tar xvf apache-skywalking-apm-es7-8.0.1.tar.gz

Then rename the unzipped folder:

mv apache-skywalking-apm-es7 skywalking

Enter the unzipped directory:

cd skywalking

View the directory structure:

(Figure 7: directory structure)

Several key directories:

- agent: the probe

- bin: startup scripts

- config: configuration files

- logs: log files

- oap-libs: dependencies

- webapp: the WebUI

You need to modify the application.yml file in the config directory. For the full set of options, see the official documentation: https://github.com/apache/skywalking/blob/v8.0.1/docs/en/setup/backend/backend-setup.md

Configuration

Enter the `config` directory and modify `application.yml`; the main change is switching the storage option from h2 to elasticsearch.

You can use the following configuration directly:

cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
core:
  selector: ${SW_CORE:default}
  default:
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    restContextPath: ${SW_CORE_REST_CONTEXT_PATH:/}
    gRPCHost: ${SW_CORE_GRPC_HOST:0.0.0.0}
    gRPCPort: ${SW_CORE_GRPC_PORT:11800}
    gRPCSslEnabled: ${SW_CORE_GRPC_SSL_ENABLED:false}
    gRPCSslKeyPath: ${SW_CORE_GRPC_SSL_KEY_PATH:""}
    gRPCSslCertChainPath: ${SW_CORE_GRPC_SSL_CERT_CHAIN_PATH:""}
    gRPCSslTrustedCAPath: ${SW_CORE_GRPC_SSL_TRUSTED_CA_PATH:""}
    downsampling:
      - Hour
      - Day
      - Month
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turn it off then automatically metrics data delete will be close.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    metricsDataTTL: ${SW_CORE_METRICS_DATA_TTL:7} # Unit is day
    # Cache metric data for 1 minute to reduce database queries, and if the OAP cluster changes within that minute,
    # the metrics may not be accurate within that minute.
    enableDatabaseSession: ${SW_CORE_ENABLE_DATABASE_SESSION:true}
    topNReportPeriod: ${SW_CORE_TOPN_REPORT_PERIOD:10} # top_n record worker report cycle, unit is minute
    # Extra model column are the column defined by in the codes, These columns of model are not required logically in aggregation or further query,
    # and it will cause more load for memory, network of OAP and storage.
    # But, being activated, user could see the name in the storage entities, which make users easier to use 3rd party tool, such as Kibana->ES, to query the data by themselves.
    activeExtraModelColumns: ${SW_CORE_ACTIVE_EXTRA_MODEL_COLUMNS:false}
    # The max length of service + instance names should be less than 200
    serviceNameMaxLength: ${SW_SERVICE_NAME_MAX_LENGTH:70}
    instanceNameMaxLength: ${SW_INSTANCE_NAME_MAX_LENGTH:70}
    # The max length of service + endpoint names should be less than 240
    endpointNameMaxLength: ${SW_ENDPOINT_NAME_MAX_LENGTH:150}
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:""}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:1} # Represent the number of days in the one minute/hour/day index.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # The index shards number is for store metrics data rather than basic segment record
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # Super data set has been defined in the codes, such as trace segments. This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:0}
    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the bulk every 1000 requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
receiver-sharing-server:
  selector: ${SW_RECEIVER_SHARING_SERVER:default}
  default:
    authentication: ${SW_AUTHENTICATION:""}
receiver-register:
  selector: ${SW_RECEIVER_REGISTER:default}
  default:

receiver-trace:
  selector: ${SW_RECEIVER_TRACE:default}
  default:
    sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
    slowDBAccessThreshold: ${SW_SLOW_DB_THRESHOLD:default:200,mongodb:100} # The slow database access thresholds. Unit ms.

receiver-jvm:
  selector: ${SW_RECEIVER_JVM:default}
  default:

receiver-clr:
  selector: ${SW_RECEIVER_CLR:default}
  default:

receiver-profile:
  selector: ${SW_RECEIVER_PROFILE:default}
  default:

service-mesh:
  selector: ${SW_SERVICE_MESH:default}
  default:

istio-telemetry:
  selector: ${SW_ISTIO_TELEMETRY:default}
  default:

envoy-metric:
  selector: ${SW_ENVOY_METRIC:default}
  default:
    acceptMetricsService: ${SW_ENVOY_METRIC_SERVICE:true}
    alsHTTPAnalysis: ${SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS:""}

prometheus-fetcher:
  selector: ${SW_PROMETHEUS_FETCHER:default}
  default:
    active: ${SW_PROMETHEUS_FETCHER_ACTIVE:false}

receiver_zipkin:
  selector: ${SW_RECEIVER_ZIPKIN:-}
  default:
    host: ${SW_RECEIVER_ZIPKIN_HOST:0.0.0.0}
    port: ${SW_RECEIVER_ZIPKIN_PORT:9411}
    contextPath: ${SW_RECEIVER_ZIPKIN_CONTEXT_PATH:/}

receiver_jaeger:
  selector: ${SW_RECEIVER_JAEGER:-}
  default:
    gRPCHost: ${SW_RECEIVER_JAEGER_HOST:0.0.0.0}
    gRPCPort: ${SW_RECEIVER_JAEGER_PORT:14250}

query:
  selector: ${SW_QUERY:graphql}
  graphql:
    path: ${SW_QUERY_GRAPHQL_PATH:/graphql}

alarm:
  selector: ${SW_ALARM:default}
  default:

telemetry:
  selector: ${SW_TELEMETRY:none}
  none:
  prometheus:
    host: ${SW_TELEMETRY_PROMETHEUS_HOST:0.0.0.0}
    port: ${SW_TELEMETRY_PROMETHEUS_PORT:1234}

configuration:
  selector: ${SW_CONFIGURATION:none}
  none:
  grpc:
    host: ${SW_DCS_SERVER_HOST:""}
    port: ${SW_DCS_SERVER_PORT:80}
    clusterName: ${SW_DCS_CLUSTER_NAME:SkyWalking}
    period: ${SW_DCS_PERIOD:20}
  apollo:
    apolloMeta: ${SW_CONFIG_APOLLO:http://106.12.25.204:8080}
    apolloCluster: ${SW_CONFIG_APOLLO_CLUSTER:default}
    apolloEnv: ${SW_CONFIG_APOLLO_ENV:""}
    appId: ${SW_CONFIG_APOLLO_APP_ID:skywalking}
    period: ${SW_CONFIG_APOLLO_PERIOD:5}
  zookeeper:
    period: ${SW_CONFIG_ZK_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
    nameSpace: ${SW_CONFIG_ZK_NAMESPACE:/default}
    hostPort: ${SW_CONFIG_ZK_HOST_PORT:localhost:2181}
    #RetryPolicy
    baseSleepTimeMs: ${SW_CONFIG_ZK_BASE_SLEEP_TIME_MS:1000} # initial amount of time to wait between retries
    maxRetries: ${SW_CONFIG_ZK_MAX_RETRIES:3} # max number of times to retry
  etcd:
    period: ${SW_CONFIG_ETCD_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
    group: ${SW_CONFIG_ETCD_GROUP:skywalking}
    serverAddr: ${SW_CONFIG_ETCD_SERVER_ADDR:localhost:2379}
    clusterName: ${SW_CONFIG_ETCD_CLUSTER_NAME:default}
  consul:
    # Consul host and ports, separated by comma, e.g. 1.2.3.4:8500,2.3.4.5:8500
    hostAndPorts: ${SW_CONFIG_CONSUL_HOST_AND_PORTS:1.2.3.4:8500}
    # Sync period in seconds. Defaults to 60 seconds.
    period: ${SW_CONFIG_CONSUL_PERIOD:1}
    # Consul aclToken
    aclToken: ${SW_CONFIG_CONSUL_ACL_TOKEN:""}

exporter:
  selector: ${SW_EXPORTER:-}
  grpc:
    targetHost: ${SW_EXPORTER_GRPC_HOST:127.0.0.1}
    targetPort: ${SW_EXPORTER_GRPC_PORT:9870}

Startup

Make sure elasticsearch has been started and the firewall has ports 8080, 11800, and 12800 open.

Enter the `bin` directory and execute the startup script:

./startup.sh

The default UI port is 8080, so you can access it at http://192.168.150.101:8080 (replace the IP with your server's address):

(Figure 8: SkyWalking UI)

Deploy microservice probes

Now that the SkyWalking server is running, we still need to attach the probe (agent) to each microservice so that data is collected.

Decompress

First, unzip the compressed package provided with the pre-class materials.

(Figure 9)

Extract the `agent` directory to a location whose path contains no Chinese characters. Its structure is as follows:

(Figure 10: agent directory structure)

One of the files is `skywalking-agent.jar`, which is the probe we will use.

Configuration

If you run the service as a jar package, you can specify the probe with JVM arguments at startup:

    java -javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar -Dskywalking.agent.service_name=ly-registry -Dskywalking.collector.backend_service=192.168.150.101:11800 -jar xxx.jar

Note that the `-javaagent` and `-D` options must come before `-jar xxx.jar`; otherwise the JVM passes them to the application as ordinary program arguments.
In this example, we run the service from the development tool and configure the probe there.

Open one of your projects in IDEA, select the run configuration you want to modify, right-click and choose `Edit Configuration`:

(Figure 11: Edit Configuration)

Then, in the pop-up window, click `Environment` and expand the field behind `VM options`:

(Figure 12: VM options)

In the expanded input box, enter the following configuration:
    -javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar
    -Dskywalking.agent.service_name=ly-registry
    -Dskywalking.collector.backend_service=192.168.150.101:11800

Note:

- `-javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar`: the location of the skywalking-agent.jar package; change it to the directory where you stored the agent.

- `-Dskywalking.agent.service_name=ly-registry`: the name of the current service; set it to `ly-registry`, `ly-gateway`, `ly-item-service`, and so on for each respective service.

- `-Dskywalking.collector.backend_service=192.168.150.101:11800`: the address of SkyWalking's OAP service. The agent talks to it over gRPC, so the port is 11800, not 8080.

Startup

SkyWalking's probe modifies class bytecode as the application's classes are loaded, implanting the tracing logic before any business code runs, with zero intrusion into the business code itself. So all we need to do is start the project and the probe takes effect.

Start the project, then call any business interface in it, and the probe will start reporting data.
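For example, an ordinary Spring MVC endpoint like the hypothetical one below (assuming a Spring Boot web application; the class name and path are invented for illustration) needs no tracing-specific code at all. Once the service is started with the `-javaagent` options above, every request to it shows up as a traced endpoint in SkyWalking:

```java
// Hypothetical business endpoint; nothing here is SkyWalking-specific.
// With the -javaagent options above, each request to /items/{id} appears
// in the SkyWalking UI as a traced endpoint of the service.
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ItemController {

    @GetMapping("/items/{id}")
    public String findItem(@PathVariable("id") Long id) {
        // Ordinary business logic; the agent instruments the web framework
        // (and any downstream HTTP client or JDBC calls) automatically.
        return "item-" + id;
    }
}
```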

WebUI interface

Visit http://192.168.150.101:8080 and you can see that statistics are now available:

(Figure 13: service statistics)

Performance monitoring of service instances:

(Figure 14: instance performance monitoring)

Service topology diagram:

(Figure 15: service topology)

(Figure 16)

Trace information for a single request:

(Figure 17: trace detail)

Table view:

(Figure 18: trace table view)

The copyright of this article belongs to the Dark Horse Programmer Java Training Academy. Reprinting is welcome; please credit the author and source. Thanks!

Author: Dark Horse Programmer Java Training Academy

First release: https://java.itheima.com
