Database | Binlog troubleshooting: the drainer goes on strike periodically

Table of Contents

1. Background

2. Failure symptoms

3. Analysis process

4. Solution

5. Reflections

6. Summary


1. Background

Recently, users reported that the drainers in our production TiDB cluster fail frequently. The failures show up either as the service crashing and refusing to start, or as data going missing while it is running, which causes the business a lot of trouble. The cluster serves an offline analytics workload with roughly 20 TB of data. It runs the v4 version of drainer, with multiple drainers replicating data downstream to targets including Kafka, files, and TiDB.

We had fixed one of these failures only two days earlier, and now the same problem was back, so it was time for a proper root-cause investigation.

2. Failure symptoms

The business side reported that a certain downstream Kafka had not received any TiDB data for several hours, even though the pump and drainer nodes all showed a normal status. I had received similar feedback a few days earlier; that time, the TiDB server had stopped writing binlog because of an unexplained exception. The binlog status can be checked through the following API:

curl http://{tidb-server ip}:{status_port}/info/all

This API returns meta information for all TiDB servers, including each instance's binlog write status (the binlog_status field). It is normally on; if the TiDB server has ignore-error set, it usually switches to skipping once binlog writing fails.
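
For reference, an abridged, illustrative response looks roughly like the following (the field layout varies by version and the values here are examples, not output from this cluster):

{
  "servers_num": 7,
  "is_all_server_version_consistent": true,
  "all_servers_info": {
    "{ddl-id}": {
      "ip": "{tidb-server ip}",
      "status_port": 10080,
      "binlog_status": "Skipping"
    }
  }
}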

A quick check confirmed that binlog_status on all seven TiDB servers was skipping, the same problem as before.

The fix itself is simple: restart the TiDB server. But to avoid yet another recurrence, the cause had to be found before restarting.
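
For reference, if the cluster is managed by tiup (an assumption here; the cluster name is a placeholder), only the tidb-server role needs to be restarted:

tiup cluster restart {cluster-name} -R tidb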

3. Analysis process

When data stops syncing, the first suspect is naturally the drainer, and the most common cause is a large transaction making the drainer panic and crash. However, after logging in to the drainer host and going through its logs, nothing abnormal was found: the savepoint was written normally, the checkpoint kept advancing, and the instance status was up, so the drainer was not the problem.

Given the skipping binlog status, the next place to look was the TiDB server monitoring. The number of skipped binlogs is shown in the TiDB -> Server -> Skip Binlog Count panel.

The monitoring showed that Skip Binlog Count suddenly dropped from a high level to 0 around 6 pm on August 18, exactly when the previous fault was fixed by restarting the TiDB server. Then, around the morning of the 21st, the skip count started to climb again, so that is the time window the log analysis had to focus on.

Further analysis of the monitoring showed that the rise in Skip Binlog Count coincides with the Critical Error panel, meaning that a large number of binlog write failures began around 07:06 on the morning of the 21st. The next step was to check the pump log around that time point.

With the precise time point, the panic was quickly located in the pump log, and right after the panic there was a very useful piece of information.
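
A minimal sketch of how to find it, assuming a typical deployment layout (the log path is a placeholder; the keyword comes from the message discussed below):

grep -n -A 5 "StopWriteAtAvailableSpace" {pump-deploy-dir}/log/pump.log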

The log confirms that binlog writing had indeed stopped, and it also gives the reason: insufficient disk space. The key message is StopWriteAtAvailableSpace, which means pump stops accepting writes once the available space on its disk falls below this parameter. Let's use edit-config to look at pump's configuration:

pump:
  gc: 2                    # binlog retention, in days
  heartbeat-interval: 2
  security:
    ssl-ca: ""
    ssl-cert: ""
    ssl-key: ""
  storage:
    stop-write-at-available-space: 1gib   # pump stops writing below this much free space

The log and the configuration parameter matched up, but was the disk really out of space? The reflexive next step was a quick df -h to find out.

Unexpectedly, df -h showed that only 1% of the pump data disk was in use, with plenty of free space left, which seemed to contradict the reason given in the log.

I had overlooked an important time factor: by the time I stepped in to investigate, 3 days had already passed since the pump failure (as the monitoring shown earlier indicates), while the pump gc time is set to 2 days. That means the binlogs pump had recorded before the failure had already been deleted by gc, and it was no longer possible to tell from the disk whether the drainer had consumed them normally. Fortunately, Grafana has a panel that records how pump storage changes over time: Binlog -> pump -> Storage Size.

Putting the Storage Size curve together with the two recent failures, the picture suddenly becomes clear: the TiDB server was restarted around 6 pm on the 18th and binlog writing resumed, so pump's available space began to shrink; it was almost exhausted by around 7 am on the 21st, which triggered the StopWriteAtAvailableSpace condition, binlog writing stopped, and the status switched to skipping. Meanwhile pump gc kept running with no new binlog coming in, so two days later the existing data had been fully gc'ed, and the disk returned to a nearly empty level around 7 am on the 23rd, where it has stayed ever since.

I took over this cluster halfway through its life and know little of its background, so strange problems often take me a long time to track down. Such is the daily life of an on-call engineer.

4. Solution

Talking with the users, we learned that there was no way to add more disk, which left the classic two options, increase supply or cut consumption, and in practice only cutting consumption was feasible.

To solve the problem, we decided to shorten the pump gc time to 1 day. Shortening gc carries its own risk, though: if a drainer stays down for more than 1 day, data will be lost.
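
A minimal sketch of the change, assuming the cluster is managed by tiup (the cluster name is a placeholder; on a tidb-ansible deployment the steps differ):

tiup cluster edit-config {cluster-name}      # change gc under the pump section from 2 to 1 (days of binlog retained)
tiup cluster reload {cluster-name} -R pump   # push the new configuration to the pump nodes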

Finally, we adjusted the gc time and restarted the TiDB server at the same time around 6 pm on the 24th, which brought the binlog synchronization back to normal.

After a few days of observation, pump's disk usage stabilized at roughly half capacity, so a recurrence looks unlikely and we can finally stop worrying about this particular failure.

Fortunately, the downstream Kafka consumers are not sensitive to data loss and can be resynchronized from the current time at any point, which saved us a lot of time and effort; otherwise we would have faced a much bigger challenge.

5. Reflections

It has to be said that needing to restart the TiDB server just to restore binlog writing is painful; in a production environment a restart is never a casual operation. Fortunately, I later discovered that there is an API for recovering binlog writing, and next time I run into a similar problem I will try it first:

curl http://{tidb-server ip}:{status_port}/binlog/recover
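
Presumably (I have not used it on this cluster yet), after calling it on each affected instance, the /info/all API from earlier can be used to confirm that binlog_status has switched back from skipping to on:

curl http://{tidb-server ip}:{status_port}/info/all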

The documentation hides this remarkably well; it makes me wonder how many more of these backdoor tricks TiDB is keeping tucked away.

Then there is the binlog primary-secondary switchover (failure recovery) procedure I saw in the documentation, which genuinely makes me shudder: whoever has to run it is signing up for 996 overtime.

In my experience, once ignore-error is set, critical errors are quite likely to occur sooner or later. Binlog replication seems very fragile and stops working at the slightest provocation, and recovery calls for a full backup, restore, and TiDB server restart every time, which is an outright disaster for a cluster that easily reaches tens of terabytes. It is not only time-consuming; with that much data, the chance of something going wrong along the way is also high, which is scary just to think about.

6. Summary

The first takeaway from this case is to leave some headroom in parameter settings so there is time to react when problems come up. Here, for example, stop-write-at-available-space could be set larger; when disk space runs low, you can quickly lower the value so binlog writing resumes and replication continues, buying time to expand the disk or work out another plan.
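
For illustration only (the 100gib value is a made-up example and should be sized to the actual disk capacity and binlog write rate), the idea looks like this in the pump configuration:

pump:
  storage:
    stop-write-at-available-space: 100gib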

In addition, pay attention to the impact of pump gc: it does not check whether the drainer has consumed the binlog before deleting it, so when setting the gc window, reserve enough time to handle a drainer failure.

Finally, it pays to study the monitoring metrics more thoroughly, so that troubleshooting takes fewer detours.

Author: He Ao | Senior back-end development engineer

Copyright Statement: This article was compiled and written by the Digital China Cloud Base team. Please indicate the source when reprinting.

Search for the Digital China Cloud Base official account and reply "database" to join the database technology exchange group.
