Continuous Profiling on Amazon EKS: Diagnosing Application Performance with Pyroscope


The Current State of Continuous Profiling

In the field of observability, traces, logs, and metrics serve as the “three pillars” that help engineers gain insight into problems inside applications. However, developers often still need to drill down into the application to find the root cause of a bottleneck. Within the three pillars, this information is usually gathered from logs, but that approach is very time-consuming and rarely provides enough detail to pinpoint application performance issues.

A more effective method is profiling. Profiling is a dynamic program-analysis technique that collects information about an application at runtime, such as CPU utilization or the frequency and duration of function calls, in order to study system behavior and locate performance hot spots. By analyzing this data, you can pinpoint which parts of the application consume the most resources or time, and optimize overall system performance accordingly. Profiling technology generally takes the following forms:

  • System tools: such as Linux’s strace/perf and Solaris’ DTrace. Using these tools requires a solid foundation in C and operating systems, and often the ability to understand OS-level system calls;

  • Language-native profilers: provided through a programming language’s profiling libraries, such as Go’s net/http/pprof and runtime/pprof. Engineers import these packages into the program and inspect the results with specialized tools. Amazon CodeGuru Profiler also provides language agents for profiling Java and Python applications;

  • eBPF: eBPF profiling tackles application observability from the infrastructure side. eBPF is a very popular technology in the current Linux kernel; an eBPF profiler can obtain stack traces for the entire system from the kernel without modifying application code (and eBPF is useful for much more than profiling).

Note that for compiled languages such as Go/Java/C/C++, an eBPF profiler can obtain information very similar to a non-eBPF profiler. But for interpreted languages such as Python, the runtime stack trace cannot easily be recovered from the kernel, and language-native profilers give better results in this scenario. Given the respective advantages and disadvantages of language-native and eBPF profiling, commercial products generally provide access to both methods.

On the other hand, raw profiling data is often difficult to read and understand. To solve this problem, Brendan Gregg, the author of “Systems Performance: Enterprise and the Cloud”, invented the flame graph (FlameGraph), which visualizes the stack traces and durations collected during profiling layer by layer, making it possible to intuitively, quickly, and accurately identify the most frequently executed and most resource-consuming code paths. Almost all mainstream profiling tools use flame graphs for visualization. For how to interpret a flame graph, see “Performance Tuning Tool: Flame Graph”.

  • Performance tuning tool: Flame graph:

    https://www.infoq.cn/article/a8kmnxdhbwmzxzsytlga
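As a concrete sketch of this workflow on Linux, a CPU flame graph can be produced with perf plus Brendan Gregg’s FlameGraph scripts (the sampling rate, duration, and script paths below are illustrative, and the scripts are assumed to be in the current directory):

```
# Sample stack traces on all CPUs at 99 Hz for 30 seconds (requires root)
perf record -F 99 -a -g -- sleep 30
# Fold the stacks into one line per unique stack, then render an SVG flame graph
perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > flame.svg
```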


Profiling alone is often not enough. Modern applications widely use immutable infrastructure. Taking Kubernetes as an example: after a fault occurs the application often crashes, the liveness probe fails, the Pod is destroyed, and a new application Pod replaces it to provide service. If profiling is not captured in time, the application’s stack trace information is lost when the Pod’s lifecycle ends. In addition, for problems such as memory leaks and OOM kills, it is often necessary to compare profiling data from different points in time. Continuous Profiling adds a time dimension to profiling, helping you locate, debug, and fix performance issues by understanding how a program’s profile changes over time.

How to use Pyroscope with Amazon EKS

Introduction to Pyroscope Architecture

Pyroscope is an open-source Continuous Profiling project; the company behind it was acquired by Grafana in March 2023, and Grafana merged its own Phlare project into Pyroscope. Similar to the technical implementation of the Trace pillar, Pyroscope supports both SDK instrumentation and auto-instrumentation to produce profiling data, and uses a combination of push and pull to collect, store, and display it. This article uses Pyroscope to demonstrate how to implement Continuous Profiling of modern applications and gain insight into their performance.


The deployment of Pyroscope is divided into two parts, Pyroscope Server and Client:

  • The server collects, processes, stores, and displays the data reported by clients, and exposes APIs to the outside world. You can use Grafana to display Pyroscope profiling data as flame graphs.

  • The client is the Grafana Agent. The agent can use eBPF to collect profiles and push them to the server, or pull profiling data directly from the application. Besides the agent, users can also use an SDK to push the generated profiling data directly to the server.

Installing Pyroscope in EKS

To use Pyroscope with EKS, you need to install the Pyroscope service first. The installation steps are as follows:

1. Prepare an S3 bucket for Pyroscope persistence

Pyroscope supports persisting data to S3, using Thanos’ object store client. Since the documentation does not specify whether IRSA authentication is supported, you can create a separate IAM user for Pyroscope to access the S3 bucket. Keep the user’s access key/secret key safe to avoid leaks, and attach a fine-grained IAM policy scoped to the S3 bucket used by Pyroscope. Remember to replace <YOUR-S3-BUCKET> in the following template.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Pyroscope",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetObjectTagging",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-S3-BUCKET>/*",
                "arn:aws:s3:::<YOUR-S3-BUCKET>"
            ]
        }
    ]
}


2. Use the following command to create a Kubernetes namespace for Pyroscope

kubectl create namespace pyroscope

3. Add the Pyroscope Helm repo

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update


4. Download the Pyroscope values template and adjust the deployment parameters. Pyroscope supports distributed deployment and ships a bundled MinIO as the back-end persistent object store; in distributed mode, the Helm template cannot be modified to use S3 as the long-term persistence layer, so the monolithic deployment mode is used here.

curl -Lo pyroscope-values.yaml \
https://raw.githubusercontent.com/grafana/pyroscope/main/operations/pyroscope/helm/pyroscope/values.yaml


The Pyroscope server is a StatefulSet. You can adjust the number of Pyroscope server instances by configuring replicaCount.

pyroscope:
  replicaCount: 3

If running in production, it is recommended to set the resource requests and limits of the Pyroscope server Pod (the chart leaves them empty by default).

resources:
    {}
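A hypothetical starting point is shown below; the values are illustrative and should be sized from your own load testing:

```yaml
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    memory: 4Gi
```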

By default, after the Pyroscope ingester module receives profiling data, it retains recent data in memory; when a size threshold is reached or the data is more than 3 hours old, Pyroscope persists it to block storage. Since S3 can be used for long-term tiering, the PV does not need to be very large.

persistence:
    enabled: True
    accessModes:
      - ReadWriteOnce
    size: 10Gi
    annotations: {}

When object storage is configured, completed data blocks are uploaded to S3 for persistence. You can use S3 Intelligent-Tiering to reduce the cost of persisting this data. Adjust the following S3 configuration parameters to match your environment.

config: |
    storage:
      backend: s3
      s3:
        region: <YOUR-S3-REGION>
        endpoint: s3.<YOUR-S3-REGION>.amazonaws.com
        bucket_name: <YOUR-S3-BUCKET>
        access_key_id: <YOUR-ACCESS-KEY>
        secret_access_key: <YOUR-SECRET-KEY>


By default, Pyroscope deploys MinIO object storage for long-term data persistence. Use the following settings to disable the MinIO service and use S3 directly:

minio:
  enabled: false


5. After modifying the deployment configuration, run helm install to install the Pyroscope server

helm -n pyroscope install pyroscope grafana/pyroscope --values pyroscope-values.yaml


The command output is roughly as follows:

NAME: pyroscope
LAST DEPLOYED: Tue Sep 5 09:10:31 2023
NAMESPACE: pyroscope
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thanks for deploying Grafana Pyroscope.


# Pyroscope UI & Grafana


Pyroscope database comes with a built-in UI, to access it from your localhost you can use:


```
kubectl --namespace pyroscope port-forward svc/pyroscope 4040:4040
```


You can also use Grafana to explore Pyroscope data.
For that, you'll need to add the Pyroscope data source to your Grafana instance and configure the query URL accordingly.
See https://grafana.com/docs/grafana/latest/datasources/grafana-pyroscope/ for more details.


The in-cluster query URL for the data source in Grafana is:


```
http://pyroscope.pyroscope.svc.cluster.local.:4040
```


# Collecting profiles.




The Grafana Agent has been installed to scrape and discover pprof profiles endpoint via pod annotations.


As an example, to start collecting memory and cpu profile using the 8080 port, add the following annotations to your workload:


```
profiles.grafana.com/memory.scrape: "true"
profiles.grafana.com/memory.port: "8080"
profiles.grafana.com/cpu.scrape: "true"
profiles.grafana.com/cpu.port: "8080"
```




To learn more supported annotations, read our guide https://grafana.com/docs/pyroscope/next/deploy-kubernetes/#optional-scrape-your-own-workloads-profiles


There are various ways to collect profiles from your application depending on your needs.
Follow our guide to setup profiling data collection for your workload:


https://grafana.com/docs/pyroscope/next/configure-client/


Note the in-cluster query URL in the output:

http://pyroscope.pyroscope.svc.cluster.local.:4040

It is required when provisioning the Grafana data source.
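If you provision Grafana data sources from a file, a minimal entry could look like the following (the data source name is illustrative; the type matches Grafana’s built-in Pyroscope data source plugin):

```yaml
apiVersion: 1
datasources:
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    access: proxy
    url: http://pyroscope.pyroscope.svc.cluster.local.:4040
```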

6. Pyroscope uses Grafana for data querying and display. Flame graph support in Grafana is not enabled by default, so use the following parameters to turn on this feature when installing the Helm chart.

helm upgrade -n pyroscope --install grafana grafana/grafana \
  --set image.repository=grafana/grafana \
  --set image.tag=main \
  --set env.GF_FEATURE_TOGGLES_ENABLE=flameGraph \
  --set env.GF_AUTH_ANONYMOUS_ENABLED=true \
  --set env.GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
  --set env.GF_DIAGNOSTICS_PROFILING_ENABLED=true \
  --set env.GF_DIAGNOSTICS_PROFILING_ADDR=0.0.0.0 \
  --set env.GF_DIAGNOSTICS_PROFILING_PORT=6060 \
  --set-string 'podAnnotations.pyroscope\.grafana\.com/scrape=true' \
  --set-string 'podAnnotations.pyroscope\.grafana\.com/port=6060'


7. Log in to Grafana

Installing Grafana generates a login secret. The default account is admin; retrieve the password with the following command:

kubectl get secret --namespace pyroscope grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo


Use the following command to run port-forward (you can also expose the service via an Ingress or a LoadBalancer) and map the Grafana UI to localhost for access.

export POD_NAME=$(kubectl get pods --namespace pyroscope -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace pyroscope port-forward $POD_NAME 3000


With port-forwarding to local port 3000, open localhost:3000 in a browser, log in with the admin username and the password obtained above, and add a data source of type Grafana Pyroscope.


Fill in the URL:

http://pyroscope.pyroscope.svc.cluster.local.:4040

Save this data source.


In Grafana Explore, select the Pyroscope data source, filter the profiling data by label, and click Run query; you can see that Grafana displays the flame graph of the pyroscope-0 Pod.


So far, the Pyroscope server has been set up, but only the server’s own profiling flame graphs can be viewed. To display an application’s profiling data, you also need to either integrate the SDK into the application code or install the eBPF agent on the nodes, and then push the generated profiling data to the Pyroscope server.

Automatic Profiling using Pyroscope eBPF agent

1. Create the agent configuration

The Pyroscope eBPF agent is deployed on each EKS node as a DaemonSet. It uses Kubernetes’ service discovery mechanism to obtain the Pod list and relabels the collected data with configured rules. The configuration below was written with reference to the Grafana Agent documentation. Set the endpoint URL according to your environment and save the configuration as pyroscope-ebpf-values.yaml.

agent:
  mode: 'flow'
  configMap:
    create: true
    content: |
      discovery.kubernetes "all_pods" {
        selectors {
          field = "spec.nodeName=" + env("HOSTNAME")
          role = "pod"
        }
        role = "pod"
      }
      discovery.relabel "local_pods" {
        targets = discovery.kubernetes.all_pods.targets
        rule {
          action = "replace"
          replacement = "${1}/${2}"
          separator = "/"
          source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
          target_label = "service_name"
        }
      }
      pyroscope.ebpf "instance" {
        forward_to = [pyroscope.write.endpoint.receiver]
        targets = discovery.kubernetes.local_pods.targets
      }
      pyroscope.write "endpoint" {
        endpoint {
          url = "http://pyroscope.pyroscope.svc.cluster.local:4040"
        }
      }


  securityContext:
    privileged: true
    runAsGroup: 0
    runAsUser: 0


controller:
  hostPID: true


You can also refer to the configuration manual to adjust the configuration and relabel rules as needed:

https://grafana.com/docs/pyroscope/latest/configure-client/grafana-agent/ebpf/

2. Install the Pyroscope eBPF agent

helm install -n pyroscope pyroscope-ebpf grafana/grafana-agent -f pyroscope-ebpf-values.yaml


3. Query the application flame graph generated via eBPF in Grafana

Since Istio is deployed in my environment, the agent relabels the profiling data after collection, and the data can be queried directly by label.


You can see the flame graph for Istio’s jaeger component:


As you can see, eBPF continuously collects profiling data from the jaeger program without any modification to jaeger’s code, and the flame graph is displayed in Grafana.

Profiling your app using the SDK

In addition to using eBPF for continuous profiling of applications, Pyroscope also supports profiling using the SDK.

Below is sample Go code provided by Pyroscope. The program uses goroutines to repeatedly run two functions, fastFunction and slowFunction, which call work with 200,000,000 and 800,000,000 loop iterations respectively to generate profiling data, and attach labels for easy querying. The generated profiling data is pushed to PYROSCOPE_ENDPOINT for storage, querying, and display. Besides Go, Pyroscope also provides SDKs for other development languages.

package main


import (
    "context"
    "fmt"
    "os"
    "runtime"
    "runtime/pprof"
    "sync"


    "github.com/grafana/pyroscope-go"
)


//go:noinline
func work(n int) {
    // revive:disable:empty-block this is fine because this is an example app, not real production code
    for i := 0; i < n; i++ {
    }
    fmt.Println("work")
    // revive:enable:empty-block
}


var m sync.Mutex


func fastFunction(c context.Context, wg *sync.WaitGroup) {
    m.Lock()
    defer m.Unlock()


    pyroscope.TagWrapper(c, pyroscope.Labels("function", "fast"), func(c context.Context) {
        work(200000000)
    })
    wg.Done()
}


func slowFunction(c context.Context, wg *sync.WaitGroup) {
    m.Lock()
    defer m.Unlock()


    // standard pprof.Do wrappers work as well
    pprof.Do(c, pprof.Labels("function", "slow"), func(c context.Context) {
        work(800000000)
    })
    wg.Done()
}


func main() {
    runtime.SetMutexProfileFraction(5)
    runtime.SetBlockProfileRate(5)
    pyroscope.Start(pyroscope.Config{
        ApplicationName: os.Getenv("SERVICE_NAME"),
        ServerAddress: os.Getenv("PYROSCOPE_ENDPOINT"),
        Logger: pyroscope.StandardLogger,
        AuthToken: os.Getenv("PYROSCOPE_AUTH_TOKEN"),
        TenantID: os.Getenv("PYROSCOPE_TENANT_ID"),
        BasicAuthUser: os.Getenv("PYROSCOPE_BASIC_AUTH_USER"),
        BasicAuthPassword: os.Getenv("PYROSCOPE_BASIC_AUTH_PASSWORD"),
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileInuseSpace,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileGoroutines,
            pyroscope.ProfileMutexCount,
            pyroscope.ProfileMutexDuration,
            pyroscope.ProfileBlockCount,
            pyroscope.ProfileBlockDuration,
        },
        HTTPHeaders: map[string]string{"X-Extra-Header": "extra-header-value"},
    })


    pyroscope.TagWrapper(context.Background(), pyroscope.Labels("foo", "bar"), func(c context.Context) {
        for {
            wg := sync.WaitGroup{}
            wg.Add(2)
            go fastFunction(c, &wg)
            go slowFunction(c, &wg)
            wg.Wait()
        }
    })
}
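The sample imports github.com/grafana/pyroscope-go, so before building it you need a Go module that pulls the SDK in. A typical setup (the module name here is illustrative):

```
go mod init pyroscope-demo
go get github.com/grafana/pyroscope-go
go mod tidy
```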


Use the following Dockerfile to build the code into an image and save it in ECR:

FROM golang:alpine AS build-env
RUN apk update && apk add ca-certificates
WORKDIR /usr/src/app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -ldflags '-extldflags "-static"'
FROM scratch
COPY --from=build-env /usr/src/app/pyroscope-demo /pyroscope-demo
COPY --from=build-env /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
CMD ["/pyroscope-demo"]
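The build-and-push steps could look roughly like this; the account ID, region, and repository name are placeholders to replace with your own:

```
AWS_ACCOUNT_ID=<AWS_ACCOUNT_ID>
AWS_REGION=ap-northeast-1
REPO=pyroscope-demo

aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION"
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
docker build -t "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest" .
docker push "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest"
```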


Use the following configuration to deploy the pyroscope-demo application to the EKS cluster, and use environment variables to set the endpoint of the Pyroscope server:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pyroscope-demo
  labels:
    app: pyroscope-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pyroscope-demo
  template:
    metadata:
      labels:
        app: pyroscope-demo
    spec:
      containers:
      - name: pyroscope-demo
        image: "<AWS_ACCOUNT_ID>.dkr.ecr.ap-northeast-1.amazonaws.com/pyroscope-demo:latest"
        env:
        - name: PYROSCOPE_ENDPOINT
          value: "http://pyroscope.pyroscope.svc.cluster.local:4040"
        - name: SERVICE_NAME
          value: "Pyroscope-demo"


After the deployment is running, query the profiling data of the pyroscope-demo application in Grafana Explore; you can see how the application behaved over the past 5 minutes. The flame graph displays profiling very intuitively: the vertical layers represent calling relationships, and the width of each horizontal bar represents the amount of resources consumed. The main.work call under slowFunction ran for 3.98 minutes versus 59.9 s under fastFunction, which closely matches the 4:1 ratio of loop iterations in the two functions.


Through this example, you can intuitively analyze the resources an application consumes and find the cause of a performance problem, making it easier to troubleshoot issues and optimize application performance.

Summary

In short, Continuous Profiling is the future of performance analysis inside modern applications. Combined with the other three pillars of observability, eBPF-based profiling without code intrusion helps customers analyze and debug application performance more simply, at larger scale, and more continuously, covering infrastructure, applications, and the middleware in between. By correlating profiles with logs, metrics, and traces, customers can quickly locate problems in key business scenarios and continuously optimize and improve their applications.

Reference materials

  • https://www.cncf.io/blog/2022/05/31/what-is-continuous-profiling/

  • https://www.brendangregg.com/flamegraphs.html

  • https://github.com/brendangregg/FlameGraph

  • https://www.infoq.cn/article/a8kmnxdhbwmzxzsytlga

  • https://github.com/grafana/pyroscope

  • https://grafana.com/docs/pyroscope/latest/configure-client/language-sdks/

  • https://opentelemetry.io/community/roadmap/

The author of this article


Lin Xufang

Amazon Cloud Technology Solutions Architect, mainly responsible for promoting Amazon Cloud Technology’s cloud technologies and solutions, with rich hands-on experience in containers, hosts, storage, disaster recovery, and related areas.


Li Junjie

Amazon Cloud Technology Solutions Architect, responsible for consulting and architecture design of cloud computing solutions, and committed to the research and promotion of containers. Before joining Amazon Cloud Technology, he worked in the IT department of the financial industry on modernizing traditional financial systems, and has extensive experience in transforming and containerizing traditional applications.
