The ultimate method to verify refactored systems: traffic replication

1. Introduction

When refactoring a system, you will run into the following situation: after business analysis, redesign, and code refactoring, the hardest part turns out to be verifying that the optimized system is correct.

You might ask: isn't there QA? Isn't this exactly what QA is supposed to ensure?

Yes, QA should test and verify the system thoroughly, but that is not enough, because:

Software systems are highly complex. Even R&D cannot guarantee that every case is covered, let alone QA, who stand outside the black box;
Users' environments and usage scenarios always differ. No matter how many testing resources you invest, you cannot fully simulate the real usage scenarios of all production users;

As the people ultimately responsible for the refactoring results, we must do everything we can to verify the system and reduce the risks the refactoring introduces.

The most effective way is to verify the system with real online user traffic, which is today's topic: traffic replication.

2. Implementation ideas

The essence of traffic replication is to copy each request sent to the old system and forward the copy to the new system. The sections below cover copying the request, forwarding it, controlling the flow, and verifying the results.

2.1 Request copying

Deep copy: expand the request layer by layer and recursively copy the fields of each layer. Use this method if you want to keep the copied request and the original from affecting each other in concurrent scenarios. In Go, http.Request.Clone does this; its source is excerpted below.

// Clone returns a deep copy of r with its context changed to ctx.
// The provided ctx must be non-nil.
func (r *Request) Clone(ctx context.Context) *Request {
	if ctx == nil {
		panic("nil context")
	}
	r2 := new(Request)
	*r2 = *r
	r2.ctx = ctx
	r2.URL = cloneURL(r.URL)
	if r.Header != nil {
		r2.Header = r.Header.Clone()
	}
	if r.Trailer != nil {
		r2.Trailer = r.Trailer.Clone()
	}
	if s := r.TransferEncoding; s != nil {
		s2 := make([]string, len(s))
		copy(s2, s)
		r2.TransferEncoding = s2
	}
	r2.Form = cloneURLValues(r.Form)
	r2.PostForm = cloneURLValues(r.PostForm)
	r2.MultipartForm = cloneMultipartForm(r.MultipartForm)
	return r2
}
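To see why the deep copy matters, here is a minimal sketch that contrasts Clone with the shallow copy made by WithContext. The URL and the X-Env header are made up for illustration; only the two standard library methods come from net/http.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// headerAfterMutation copies a request (deeply via Clone, or shallowly via
// WithContext), changes a header on the copy, and returns the value that the
// original request sees afterwards.
func headerAfterMutation(deep bool) string {
	orig, err := http.NewRequest(http.MethodGet, "http://old.example/api", nil)
	if err != nil {
		panic(err)
	}
	orig.Header.Set("X-Env", "prod")

	var cp *http.Request
	if deep {
		cp = orig.Clone(context.Background()) // Header map is copied
	} else {
		cp = orig.WithContext(context.Background()) // Header map is shared
	}
	cp.Header.Set("X-Env", "mirror")
	return orig.Header.Get("X-Env")
}

func main() {
	fmt.Println("original after shallow copy:", headerAfterMutation(false))
	fmt.Println("original after deep copy:", headerAfterMutation(true))
}
```

With the shallow copy the original request picks up the change made to the copy, while with Clone it keeps its own value. Note that Clone still shares the Body with the original.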

2.2 Traffic forwarding

Traffic ultimately needs to be forwarded to a target server, so first configure a mirror server address to receive traffic:

MirrorServerUrl = http://192.168.28.212

Unlike an ordinary request, a replicated request must have its target server address rewritten before it is forwarded. We can encapsulate this step in a method:

func MirrorTraffic(request *http.Request) {
	serverUrl := beego.AppConfig.String("MirrorServerUrl") // Mirror server address
	u, err := url.Parse(serverUrl)
	if err != nil {
		// Error handling omitted
		return
	}

	// Deep-copy the request and point the copy at the mirror address.
	// Clone is used rather than WithContext: WithContext makes a shallow copy,
	// and in recent Go versions that copy shares the original's URL, so
	// rewriting it would also redirect the original request.
	// Note: Clone does not copy Body; buffer it first if both systems read it.
	r := request.Clone(context.Background())
	r.URL.Scheme = u.Scheme
	r.URL.Host = u.Host

	// Send the request
	resp, err := http.DefaultTransport.RoundTrip(r)
	if err != nil {
		// Error handling omitted
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Error handling omitted
		return
	}

	// Read the response
	responseBody, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		// Error handling omitted
		return
	}
	// Subsequent verification of responseBody omitted
	_ = responseBody
}

In this way, when user requests arrive at the old system, the traffic can be continuously forwarded to the new system for verification.
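One practical detail when forwarding is that a request body can only be read once, so the old system and the mirror copy cannot both consume the same Body. The following is a minimal sketch of one way to handle this, assuming Go 1.16 or later (for io.ReadAll and io.NopCloser). The helper newMirrorRequest and the request path are hypothetical; the mirror address is the one configured above.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// newMirrorRequest is a hypothetical helper. It reads the body of r once,
// restores it on r so the old system can still read it, and returns a deep
// copy of r that targets the mirror server.
func newMirrorRequest(r *http.Request, mirror *url.URL) (*http.Request, error) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	r.Body.Close()
	r.Body = io.NopCloser(bytes.NewReader(body))

	m := r.Clone(context.Background())
	m.Body = io.NopCloser(bytes.NewReader(body))
	m.URL.Scheme = mirror.Scheme
	m.URL.Host = mirror.Host
	m.RequestURI = "" // http.Client rejects requests with RequestURI set
	return m, nil
}

// demo builds an incoming request, mirrors it, and returns the mirror URL
// plus the body as read by the old system and by the mirror copy.
func demo() (target, oldBody, mirrorBody string) {
	in := httptest.NewRequest(http.MethodPost, "/uniform/rs", strings.NewReader(`{"eventId":1}`))
	mirror, err := url.Parse("http://192.168.28.212")
	if err != nil {
		panic(err)
	}
	m, err := newMirrorRequest(in, mirror)
	if err != nil {
		panic(err)
	}
	ob, _ := io.ReadAll(in.Body)
	mb, _ := io.ReadAll(m.Body)
	return m.URL.String(), string(ob), string(mb)
}

func main() {
	target, oldBody, mirrorBody := demo()
	fmt.Println(target, oldBody, mirrorBody)
}
```

In a real handler the mirrored request would typically be sent from a separate goroutine so that the mirror's latency does not slow down the response of the old system. Buffering the whole body in memory is a trade-off that suits small payloads only.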

However, if the target server used for verification has limited resources and cannot handle all the traffic of the production environment, problems such as slowness, timeouts, and failures may occur due to physical resource bottlenecks. At this point, flow control is required.

2.3 Flow Control

We can encapsulate a simple flow control module that is sufficient for the copy scenario. The general idea is:

Define a field copyRatio to set the proportion of traffic to copy;
Define a field copiedCounts to record the number of copied requests;
Define a field reqCounts to record the total number of requests in real time, and calculate from the ratio the number of requests currently allowed to be copied, allowCounts;
If allowCounts > copiedCounts, the request may be copied and copiedCounts is incremented by 1;

An example implementation is as follows:

// Simple statistics for request traffic and copied traffic.
// Note: counts under concurrent requests are not fully accurate; for this
// scenario and for performance reasons, no locking is performed.
type TrafficStats struct {
	reqCounts    int64   // Total number of requests flowing through the server
	copiedCounts int64   // Number of copied requests
	copyRatio    float64 // Copy ratio, for example: 0.5
}

// Modify the traffic copy ratio
func (t *TrafficStats) SetCopyRatio(ratio float64) {
	…
}

// Determine whether this request can be copied
func (t *TrafficStats) AllowCopy() bool {
	t.reqCounts++
	// Number of requests allowed to be copied so far, given the ratio
	allowCounts := int64(math.Floor(float64(t.reqCounts) * t.copyRatio))
	if allowCounts > t.copiedCounts {
		t.copiedCounts++
		return true
	}
	return false
}

Note: The above traffic statistics are relatively rough and may not be completely accurate in concurrent scenarios. If accuracy is actually required, locking can be used.
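As an alternative to locking, the counters can be updated with sync/atomic. The sketch below is one possible variant, not the implementation used above: the type AtomicTrafficStats and the helper functions are hypothetical, and the ratio is fixed at construction to keep the example short.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// AtomicTrafficStats is a hypothetical concurrency-safe variant of TrafficStats.
type AtomicTrafficStats struct {
	reqCounts    int64   // Total requests, updated atomically
	copiedCounts int64   // Copied requests, updated atomically
	copyRatio    float64 // Fixed after construction in this sketch
}

// AllowCopy reports whether this request may be copied. The copied count
// never exceeds floor(total * ratio), although under contention it may end
// up slightly below that bound.
func (t *AtomicTrafficStats) AllowCopy() bool {
	total := atomic.AddInt64(&t.reqCounts, 1)
	allowCounts := int64(math.Floor(float64(total) * t.copyRatio))
	for {
		copied := atomic.LoadInt64(&t.copiedCounts)
		if copied >= allowCounts {
			return false
		}
		if atomic.CompareAndSwapInt64(&t.copiedCounts, copied, copied+1) {
			return true
		}
	}
}

// countAllowed sends n sequential requests through a fresh counter.
func countAllowed(n int, ratio float64) int {
	t := &AtomicTrafficStats{copyRatio: ratio}
	allowed := 0
	for i := 0; i < n; i++ {
		if t.AllowCopy() {
			allowed++
		}
	}
	return allowed
}

// countAllowedConcurrent sends n requests from n goroutines.
func countAllowedConcurrent(n int, ratio float64) int64 {
	t := &AtomicTrafficStats{copyRatio: ratio}
	var allowed int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if t.AllowCopy() {
				atomic.AddInt64(&allowed, 1)
			}
		}()
	}
	wg.Wait()
	return allowed
}

func main() {
	fmt.Println("sequential, 10 requests at 0.5:", countAllowed(10, 0.5))
	fmt.Println("concurrent, 1000 requests at 0.5:", countAllowedConcurrent(1000, 0.5))
}
```

The compare-and-swap loop guarantees that two goroutines never claim the same copy slot, which is the property the unlocked version cannot provide.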

2.4 Result Verification

Result verification depends on the scenario: different businesses and requirements call for different methods, so we only discuss it briefly here.

Simple verification scenario

For example, in our business scenario, refactoring a traffic diverter, verification is relatively simple: you only need to compare the target environment URLs that the old and new systems produce as diversion results.
To do this, we can pass the expected result to the new system in request headers, so that the new system knows how to run and what to verify.

r.Header.Set("Run-Mode", "test") // Tell the new system to run in test mode: only verify the diversion result, without forwarding the request to the business side.
r.Header.Set("Expect-Target-Url", targetUrl) // The diversion target address expected by the old system
r.Header.Set("Original-Request-ID", requestID) // Unique ID of the request in the old system, used for troubleshooting when the old and new results differ.

Complex verification scenarios

If a pure business system is being refactored, it may be necessary to verify the interface responses. In this case, there are several options:

Real-time verification: when copying traffic, get the responses of the old system and the new system directly and compare their content byte by byte;
Collect first, then verify: asynchronously collect the responses of the new and old systems into a system such as ES, then analyze and compare them according to suitable strategies;
Request sampling verification: randomly copy a certain proportion of sample traffic and manually inspect whether the results on the new system are as expected. This is inefficient, but sometimes it is the only option;
Field sampling verification: define in advance, per request type, a set of key fields that must be verified. If these key fields are consistent, the verification passes;
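Field sampling verification can be sketched as follows, assuming both systems return a JSON object. The function diffKeyFields and the sample field names (status, serverTime) are hypothetical and only illustrate the idea of comparing a predefined set of keys while ignoring fields that are expected to differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// diffKeyFields decodes two JSON object responses and returns the names of
// the key fields whose values differ between the old and new systems.
func diffKeyFields(oldResp, newResp []byte, keys []string) ([]string, error) {
	var o, n map[string]interface{}
	if err := json.Unmarshal(oldResp, &o); err != nil {
		return nil, fmt.Errorf("old response: %w", err)
	}
	if err := json.Unmarshal(newResp, &n); err != nil {
		return nil, fmt.Errorf("new response: %w", err)
	}
	var diff []string
	for _, k := range keys {
		if !reflect.DeepEqual(o[k], n[k]) {
			diff = append(diff, k)
		}
	}
	return diff, nil
}

func main() {
	oldResp := []byte(`{"eventId":644164,"status":"ok","serverTime":1}`)
	newResp := []byte(`{"eventId":644164,"status":"ok","serverTime":2}`)

	// serverTime legitimately differs, so it is left out of the key fields.
	diff, err := diffKeyFields(oldResp, newResp, []string{"eventId", "status"})
	fmt.Println(diff, err)
}
```

A field that is missing on both sides compares as equal here; whether that should count as a pass depends on the business rules.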

3. Open source tools

The method discussed above requires modifying the system's code. Is there a way to copy traffic without touching the code?

The open source tool goreplay can do this. It records real-time traffic in the system for playback, analysis, and load testing. It works as follows:

Start the gor program on the machine where the business server runs and capture traffic on the network port that the business service listens on;
Capture requests from the port and replay them in the QA test environment;
Compare and analyze the responses returned from the production line and test environments;

Before use, you need to download and install the executable package for the corresponding platform. Download address:

https://github.com/buger/goreplay/releases

There are many ways to use this tool, including:

Simple test: capture requests and print them to the console

# --input-raw specifies the network port to capture the request data from, here port 80
gor --input-raw :80 --output-stdout

Interface: eth0 . BPF Filter: ((tcp dst port 80) and (dst host 10.255.0.187 or dst host fe80::250:56ff:febe:42df))
Interface: lo . BPF Filter: ((tcp dst port 80) and (dst host 127.0.0.1 or dst host ::1))
2023/10/17 15:25:38 [PPID 32153 and PID 32719] Version:1.3.0

1 8ff400500aff0180d8069d4a 1697527542939395480 0
POST /uniform/rs/conference/geteventsimpleinfo HTTP/1.1
Accept: application/json
Content-Type: application/json;charset=UTF-8
Content-Length: 18
Host: testcloudb.quanshi.com
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.12 (Java/1.8.0_202)
Accept-Encoding: gzip,deflate

{"eventId":644164}
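Each captured payload in this output starts with a metadata line such as "1 8ff400500aff0180d8069d4a 1697527542939395480 0". The sketch below parses it under the assumption that the fields are, in order, the payload type (1 for a request), the request ID, a timestamp in nanoseconds, and a latency in nanoseconds; check this against the goreplay documentation for your version. The type PayloadMeta and the function parsePayloadHeader are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// PayloadMeta holds the fields of a payload header line (assumed layout).
type PayloadMeta struct {
	Type      string        // "1" is assumed to mean a request
	ID        string        // Request ID
	Timestamp time.Time     // Parsed from nanoseconds since the Unix epoch
	Latency   time.Duration // Optional fourth field, in nanoseconds
}

// parsePayloadHeader splits a header line on whitespace and converts the
// numeric fields. It returns an error if the line has fewer than 3 fields.
func parsePayloadHeader(line string) (PayloadMeta, error) {
	f := strings.Fields(line)
	if len(f) < 3 {
		return PayloadMeta{}, fmt.Errorf("unexpected header line: %q", line)
	}
	ts, err := strconv.ParseInt(f[2], 10, 64)
	if err != nil {
		return PayloadMeta{}, err
	}
	meta := PayloadMeta{Type: f[0], ID: f[1], Timestamp: time.Unix(0, ts).UTC()}
	if len(f) > 3 {
		lat, err := strconv.ParseInt(f[3], 10, 64)
		if err != nil {
			return PayloadMeta{}, err
		}
		meta.Latency = time.Duration(lat)
	}
	return meta, nil
}

func main() {
	meta, err := parsePayloadHeader("1 8ff400500aff0180d8069d4a 1697527542939395480 0")
	if err != nil {
		panic(err)
	}
	fmt.Println(meta.Type, meta.ID, meta.Timestamp.Format(time.RFC3339Nano), meta.Latency)
}
```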

Record traffic from the specified port and copy directly to the target server:

sudo ./gor --input-raw :80 --output-http="http://copytest.quanshi.com"

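Record traffic to a local file (by default gor adds a chunk index to the file name, so the recording ends up in a file such as requests_0.gor)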

gor --input-raw :80 --output-file=requests.gor

Replay requests recorded in a local file

gor --input-file=requests_0.gor --output-http="http://copytest.quanshi.com"

Request filtering: only play back requests under the specified URL path

gor --input-file=requests_0.gor --output-http="http://copytest.quanshi.com" --http-allow-url=/uniform

Speed limit: playback does not exceed 10% of the original traffic

gor --input-file=requests_0.gor --output-http="http://copytest.quanshi.com|10%"

Performance stress test: simulate a large number of concurrent user requests, and the order of requests can be ignored in performance scenarios.

# --input-file: read the request data from the file; |1000% replays at 10x the recorded speed
# --input-file-loop: loop over the file indefinitely instead of stopping at the end
# --output-http-workers: simulate 100 concurrent users
# --stats --output-http-stats: output TPS data every 5 seconds
gor --input-file="requests_0.gor|1000%" --input-file-loop --output-http="http://testcloud3.quanshi.com" --output-http-workers 100 --stats --output-http-stats

Output the request traffic to kafka for asynchronous analysis of the traffic and comparison of results.

# --output-kafka-topic: Specify the topic name of kafka
# --output-kafka-host: Specify the broker address of kafka
# --input-raw-track-response: By default, only the request is recorded. This option can also output the response together.
# --output-kafka-json-format: Use json format to output to kafka
gor --input-raw :80 --input-raw-track-response --output-kafka-host "10.255.0.94:9092" --output-kafka-topic "testdd" --output-kafka-json-format

4. Summary

This article introduces a method for detecting problems in a new system in advance: traffic replication. First, using Go as an example, it shows how to copy request traffic and control the forwarding rate. Then, using the open source tool goreplay as an example, it shows how to copy traffic without modifying the code.

The latter is generally the recommended option. However, if you need special logic or verification, implementing the code yourself is more flexible. There are other ways to copy traffic as well, such as the nginx mirror directive and tcpcopy. See the reference links below for details.

Reference reading:

goreplay download address: https://github.com/buger/goreplay/releases
Common traffic replication tools: https://blog.csdn.net/zuozewei/article/details/116466415