[etcd] Fixing "Auto sync endpoints failed." when go-zero registers with etcd

go: v1.20.3

go-zero: v1.5.4

etcd: v3.5.9

Description of the problem

go-zero uses etcd for service registration and discovery: RPC services register themselves in etcd, and other services discover the registered services and call them. In my case, though, the log of the registered RPC service kept reporting the error below. The log was flooded with "Auto sync endpoints failed." messages, yet the service could still be reached, which was very strange.

{"level":"warn","ts":"2023-07-30T15:57:02.004+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007281c0/192.168.2.2:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused\""}
{"level":"info","ts":"2023-07-30T15:57:02.004+0800","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

If you are a veteran, you probably know where the problem is as soon as you see this. I am a rookie, so I had to analyze it step by step.

If you just want the solution, skip to the end. What follows is my analysis process.

Troubleshooting

When something goes wrong, my first thought is that the package is at fault, not me. So I went straight to tracing the go-zero source code where it registers with etcd.

I read through go-zero's etcd wrapper several times and found nothing wrong: registration and discovery both worked normally. (Note: at this point my go-zero version was still v1.3.2.)

Finding nothing, I figured my go-zero version was too old and should be upgraded. According to an issue I found, the etcd version might also be too old.

Verification One

I upgraded go-zero to v1.5.4 (via v1.4.4) and etcd to v3.5.9 (via v3.5.7), but it did not help: "Auto sync endpoints failed." still appeared.

On reflection, go-zero should not have this kind of problem. Could it be the way I start etcd? I start etcd with docker and, because it is only for testing, run a single node. At this point I had no idea what the configured IPs meant or what they were for.

docker run -d --name ai-etcd --network=host --restart always \
 -v $PWD/etcd.conf.yml:/opt/bitnami/etcd/conf/etcd.conf.yml \
 -e ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 \
 -e ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
 -e ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380 \
 -e ETCD_INITIAL_ADVERTISE_PEER_URLS=http://0.0.0.0:2380 \
 -e ALLOW_NONE_AUTHENTICATION=yes \
 bitnami/etcd:3.5.9

I found this docker run command online as well, and it looked fine. etcd starts without errors, and go-zero can register with it.

To see where etcd goes wrong, let's look at its client source code.

// client.go L196
func (c *Client) autoSync() {
	if c.cfg.AutoSyncInterval == time.Duration(0) {
		return
	}

	for {
		select {
		case <-c.ctx.Done():
			return
		case <-time.After(c.cfg.AutoSyncInterval):
			ctx, cancel := context.WithTimeout(c.ctx, 5*time.Second)
			err := c.Sync(ctx)
			cancel()
			if err != nil && err != c.ctx.Err() {
				c.lg.Info("Auto sync endpoints failed.", zap.Error(err))
			}
		}
	}
}

// Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
func (c *Client) Sync(ctx context.Context) error {
	mresp, err := c.MemberList(ctx)
	if err != nil {
		return err
	}
	var eps []string
	for _, m := range mresp.Members {
		...
	}
	c.SetEndpoints(eps...)
	return nil
}

This is the code that reports the error. Since "Auto sync endpoints failed." keeps appearing, go-zero must be configuring AutoSyncInterval somewhere; otherwise execution would never get here. The following is where go-zero does that: it simply initializes the parameters of an etcd client.

// registry.go L337
// DialClient dials an etcd cluster with given endpoints.
func DialClient(endpoints []string) (EtcdClient, error) {
	cfg := clientv3.Config{
		Endpoints:            endpoints,
		AutoSyncInterval:     autoSyncInterval,
		DialTimeout:          DialTimeout,
		DialKeepAliveTime:    dialKeepAliveTime,
		DialKeepAliveTimeout: DialTimeout,
		RejectOldCluster:     true,
		PermitWithoutStream:  true,
	}
	...
}
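To make the effect of a periodic sync concrete, here is a minimal sketch of what replacing the endpoint list amounts to. It assumes the elided loop in Sync collects each member's advertised client URLs; the member type and the advertisedEndpoints function are hypothetical stand-ins for illustration, not etcd's real types.

```go
package main

import "fmt"

// member is a hypothetical, trimmed-down stand-in for one entry of a
// member list response; only the fields this sketch needs are kept.
type member struct {
	Name       string
	ClientURLs []string // the URLs the server was told to advertise
}

// advertisedEndpoints gathers the client URLs of all members. This is an
// assumption about what the elided loop body does, not a copy of it.
func advertisedEndpoints(members []member) []string {
	var eps []string
	for _, m := range members {
		eps = append(eps, m.ClientURLs...)
	}
	return eps
}

func main() {
	// What the client was configured with.
	endpoints := []string{"192.168.2.2:2379"}

	// What a single-node server started with
	// ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 would report.
	members := []member{{Name: "default", ClientURLs: []string{"http://0.0.0.0:2379"}}}

	// After a sync, the configured list is replaced by the advertised one.
	endpoints = advertisedEndpoints(members)
	fmt.Println(endpoints)
}
```

Under these assumptions, the address the client dials after the first sync is whatever the server advertises, not what was written in the client configuration.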

At this point I felt helpless and had no idea where to look next. The service reported no errors and could be called, so I really did not want to keep digging. But this kind of problem is tormenting, and the noise buries the important log lines, which is ugly. I searched further: someone had raised the same problem in go-zero's issues, but there was no solution, and I found nothing in etcd's issues either. Eventually I wondered why I kept assuming go-zero was at fault. Why not write a bare etcd client and try it? A minimal example would tell me whether the problem was in go-zero or in etcd.

I won't list the full example here; see my other article, "[etcd] docker start single-point etcd".

cli, err = clientv3.New(clientv3.Config{
	Endpoints:        []string{"192.168.2.2:2379"},
	DialTimeout:      time.Second * 5,
	AutoSyncInterval: time.Second * 5,
})

I set AutoSyncInterval here too and, frustratingly, the bare client reported the same error.

There is an important lesson here: when you hit a problem, don't limit your investigation to the framework you are using.

So the problem is on the etcd side, and we have to go back to the docker run command and the error. I reduced the error to the line below. Look carefully and you can see that the client failed to dial 0.0.0.0:2379. The RPC service and etcd run on two different servers. Dialing 0.0.0.0:2379 means connecting to port 2379 on the local machine (here, inside the container), where no etcd is listening, so it is bound to fail. The key question is: where did 0.0.0.0:2379 come from?

"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused\""

The RPC service's configuration contains the real etcd host, certainly not 0.0.0.0:2379, so the etcd server itself must have returned that address. That means looking at the URLs in the docker run command. There are four; two of them use port 2380, which seems to be for cluster (peer) traffic, so I can ignore them. The problem should be in the remaining two:

-e ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379 \
-e ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
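The comparison made above, between the address in the error and the endpoints the client was configured with, can be sketched as a small helper. The function names and the regular expression are my own assumptions for illustration; the pattern only handles the IPv4 "dial tcp host:port" shape seen in this log.

```go
package main

import (
	"fmt"
	"regexp"
)

// dialAddr matches the "dial tcp host:port" fragment of a dial error.
// It is written for the IPv4 form seen in the log above.
var dialAddr = regexp.MustCompile(`dial tcp ([^\s:]+:\d+)`)

// dialedAddress returns the host:port the client tried to reach,
// or "" if the message contains no such fragment.
func dialedAddress(errMsg string) string {
	m := dialAddr.FindStringSubmatch(errMsg)
	if m == nil {
		return ""
	}
	return m[1]
}

// isConfigured reports whether addr is one of the configured endpoints.
func isConfigured(addr string, endpoints []string) bool {
	for _, ep := range endpoints {
		if ep == addr {
			return true
		}
	}
	return false
}

func main() {
	errMsg := `transport: Error while dialing dial tcp 0.0.0.0:2379: connect: connection refused`
	configured := []string{"192.168.2.2:2379"}

	addr := dialedAddress(errMsg)
	// If the dialed address was never configured, it must have come
	// from somewhere else, for example from the server's advertised URLs.
	fmt.Printf("dialed %s, configured: %v\n", addr, isConfigured(addr, configured))
}
```

A dialed address that is absent from the configured endpoints is a strong hint that the server handed it to the client.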

Here is what the documentation says:

ETCD_LISTEN_CLIENT_URLS: List of comma separated URLs to listen on for client traffic.
ETCD_ADVERTISE_CLIENT_URLS: List of this member's client URLs to advertise to the public. The URLs needed to be a comma-separated list.

As I understand it, ETCD_LISTEN_CLIENT_URLS is the IP:PORT the server listens on, which determines who can reach it, much like the bind setting of other services, so 0.0.0.0 is fine there. ETCD_ADVERTISE_CLIENT_URLS, on the other hand, looks like the address announced to others: clients are told to use this IP:PORT to reach the server. That looks like the culprit. Let's try changing it to my server's IP.
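The distinction between a listen address and an advertised address can be turned into a quick sanity check. The sketch below flags advertised URLs whose host is the unspecified address (0.0.0.0 or ::), which is valid to bind to but meaningless for a remote client to dial. The function name unroutableAdvertiseURLs is hypothetical; the check uses only the Go standard library.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// unroutableAdvertiseURLs takes a comma-separated URL list and returns the
// entries that cannot be parsed or whose host is an unspecified IP address.
func unroutableAdvertiseURLs(list string) []string {
	var bad []string
	for _, raw := range strings.Split(list, ",") {
		raw = strings.TrimSpace(raw)
		if raw == "" {
			continue
		}
		u, err := url.Parse(raw)
		if err != nil {
			bad = append(bad, raw)
			continue
		}
		// Hostnames that are not IP literals are left alone here.
		if ip := net.ParseIP(u.Hostname()); ip != nil && ip.IsUnspecified() {
			bad = append(bad, raw)
		}
	}
	return bad
}

func main() {
	// The value from the first docker run command.
	fmt.Println(unroutableAdvertiseURLs("http://0.0.0.0:2379"))
	// A value that names the server's own IP.
	fmt.Println(unroutableAdvertiseURLs("http://192.168.2.2:2379"))
}
```

Running a check like this against the environment variables before starting the container would have caught the misconfiguration without any log reading.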

Verification Two

docker run -d --name ai-etcd --network=host --restart always \
 -v $PWD/etcd.conf.yml:/opt/bitnami/etcd/conf/etcd.conf.yml \
 -e ETCD_ADVERTISE_CLIENT_URLS=http://192.168.2.2:2379 \
 -e ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
 -e ALLOW_NONE_AUTHENTICATION=yes \
 bitnami/etcd:3.5.9

Well, I tested it with my etcd example, and the error really was gone. I am speechless at my own mistake: all along I thought it was a go-zero problem, but it has nothing to do with go-zero at all.

Summary

This problem had been with me for a long time, ever since v1.3.2. I shelved it because I could not figure it out and it did not affect usage. In the end my perfectionism won: I could not stand the noise, so I sat down to analyze it. From the start I had a preconception that go-zero was at fault. I searched the issues, Baidu, and Google, but always with "go-zero" in the query, when in fact it was not go-zero's problem. My own analysis skills still fall short. Most docker commands for starting etcd that you find online look the same, and since etcd started without complaint, I never questioned them.

The underlying cause is that I still don't understand etcd well, in particular what its various URLs mean. The other lesson is to experiment more and test with the smallest possible unit; that way you get twice the result with half the effort.

Solution

Change ETCD_ADVERTISE_CLIENT_URLS in the etcd docker run command to the server's IP, then re-register the RPC service.

This is my own analysis and understanding. There may be deeper issues behind it; if you understand them, please share.