Interpretation of TiCDC source code (3): Analysis of the working process of a TiCDC cluster

Author: LingJin. Original source: https://tidb.net/blog/24ad9e7b

Summary of content

TiCDC is an incremental data synchronization tool for TiDB. By pulling the data change logs of the upstream TiKV nodes, TiCDC parses them into ordered row-level change data and outputs it downstream.

This article is the third part of the TiCDC source code interpretation series. It describes the startup and basic working process of a TiCDC cluster, expanding on the following aspects:

  1. The TiCDC Server startup process, and the concepts of Server / Capture / Owner / Processor Manager and how they relate
  2. The TiCDC Changefeed creation process
  3. The role of Etcd in a TiCDC cluster
  4. An introduction to the Owner and Processor Manager concepts, and the Owner election and switching process
  5. The role of EtcdWorker in TiCDC

Start TiCDC Server

To start a TiCDC Server, use the following command, passing in the PD address of the current upstream TiDB cluster:

cdc server --pd=http://127.0.0.1:2379

This starts a running instance of TiCDC Server and writes TiCDC-related metadata to PD's ETCD server. The specific keys are as follows:

/tidb/cdc/default/__cdc_meta__/capture/${capture_id}

/tidb/cdc/default/__cdc_meta__/owner/${session_id}

The first key is the Capture Key, which registers the information of the Capture running on a TiCDC Server; the corresponding key and value are written every time a Capture starts.

The second key is the Campaign Key; each Capture registers such a key to campaign for the Owner role. The first Capture to write the Owner Key becomes the Owner node.
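For illustration, here is a minimal Go sketch (not part of TiCDC itself) that uses the etcd clientv3 API to list whatever is currently stored under this metadata prefix. The endpoint matches the PD address from the command above, and the default cluster name default is assumed; the function name listCDCMeta is hypothetical.

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// listCDCMeta prints every key under TiCDC's metadata prefix, which is where
// the capture and owner-campaign keys described above are stored.
func listCDCMeta(ctx context.Context, endpoints []string) error {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return err
    }
    defer cli.Close()

    resp, err := cli.Get(ctx, "/tidb/cdc/default/__cdc_meta__/", clientv3.WithPrefix())
    if err != nil {
        return err
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s => %s\n", kv.Key, kv.Value)
    }
    return nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := listCDCMeta(ctx, []string{"http://127.0.0.1:2379"}); err != nil {
        fmt.Println("list failed:", err)
    }
}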

Server startup proceeds by parsing the startup parameters, validating them, and then creating and running the TiCDC Server. While the Server runs, it starts multiple threads. First, an HTTP Server thread is started to expose the HTTP OpenAPI externally. Next, a series of server-level resources are created, whose main purpose is to support the Capture thread. Most importantly, the Capture thread is created and run; it is the main provider of TiCDC Server functionality.

When the Capture runs, it first writes its own Capture information into ETCD. It then starts two threads: one runs the ProcessorManager, which is responsible for managing all Processors; the other runs campaignOwner, which campaigns for the Owner role and, if elected, carries out the Owner's responsibilities. As shown below, after the TiCDC Server starts it creates a Capture thread, and the Capture in turn creates the ProcessorManager and Owner threads, each responsible for different tasks.
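As a rough structural sketch of this arrangement (not the actual TiCDC code; runCapture and the three function parameters are placeholders), the Capture can be thought of as registering itself in etcd and then running its two long-lived tasks concurrently, stopping both if either fails:

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// runCapture sketches the structure described above: register the capture in
// etcd, then run the processor manager and the owner campaign concurrently.
func runCapture(
    ctx context.Context,
    registerCaptureInfo func(context.Context) error, // writes the capture key into etcd
    runProcessorManager func(context.Context) error, // manages all processors on this node
    campaignAndRunOwner func(context.Context) error, // campaigns for owner; runs owner duties if elected
) error {
    if err := registerCaptureInfo(ctx); err != nil {
        return err
    }

    g, gctx := errgroup.WithContext(ctx)
    g.Go(func() error { return runProcessorManager(gctx) })
    g.Go(func() error { return campaignAndRunOwner(gctx) })
    return g.Wait() // the first error cancels gctx, which stops the other task
}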

Create TiCDC Changefeed

The command used to create a changefeed is as follows:

cdc changefeed create --server=http://127.0.0.1:8300 --sink-uri="blackhole://" --changefeed-id="blackhole-test"

The server parameter identifies a running TiCDC node, which recorded the PD address at startup. When a changefeed is created, that server accesses the ETCD server in PD and writes the Changefeed metadata under the following keys:

/tidb/cdc/default/default/changefeed/info/${changefeed_id}

/tidb/cdc/default/default/changefeed/status/${changefeed_id}
  • The first key identifies a Changefeed and stores its static metadata, such as changefeed-id, sink-uri, and other identifiers used at runtime.
  • The second key records the runtime progress of the Changefeed, mainly the Checkpoint and ResolvedTs, and is updated periodically. A simplified sketch of both value shapes follows.
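For illustration only, the two values might be sketched as the structs below. The real definitions live in TiCDC's model package and carry many more fields; the field sets and JSON tags here are simplified assumptions.

// ChangefeedInfo sketches the static metadata stored under the info key.
type ChangefeedInfo struct {
    ChangefeedID string `json:"changefeed-id"`
    SinkURI      string `json:"sink-uri"`
    StartTs      uint64 `json:"start-ts"`
    State        string `json:"state"` // e.g. normal, stopped, failed
}

// ChangefeedStatus sketches the runtime progress stored under the status key.
type ChangefeedStatus struct {
    CheckpointTs uint64 `json:"checkpoint-ts"` // data before this TSO has been written downstream
    ResolvedTs   uint64 `json:"resolved-ts"`   // no new changes with a smaller TSO will arrive
}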

What Etcd does

ETCD serves as a very important metadata store for the entire TiCDC cluster, recording key information such as Captures and Changefeeds. By continuously recording and updating each Changefeed's Checkpoint and ResolvedTs, it ensures that Changefeeds can move forward steadily. As the figure above shows, when a Capture starts it writes its own metadata into ETCD. After that, operations such as creating, pausing, and removing a Changefeed are executed through the running TiCDC Owner, which is responsible for updating ETCD.

Owner election and switching

A TiCDC cluster can contain multiple TiCDC nodes, and each node runs a campaignOwner thread that takes part in the Owner election; if it wins, it performs the Owner's duties. Only one node in the cluster wins the election and executes the Owner's work logic, while the threads on the other nodes stay blocked in the election, waiting to become the Owner.

The TiCDC Owner election is based on ETCD elections. After each Capture starts, it creates an ETCD Session, calls the NewElection method with that Session to create an election on the Owner Key /tidb/cdc/${ClusterID}/__cdc_meta__/owner, and then calls Election.Campaign to start campaigning. The basic code flow is as follows:

sess, err := concurrency.NewSession(etcdClient, concurrency.WithTTL(10)) // session TTL is set to 10s
if err != nil {
    return err
}

election := concurrency.NewElection(sess, key) // key is `/tidb/cdc/${ClusterID}/__cdc_meta__/owner`

// Campaign blocks until this capture wins the election or ctx is canceled;
// the capture's ID is written as the value of the owner key.
if err := election.Campaign(ctx, captureID); err != nil {
    return err
}

...

Interested readers can use the Capture.Run method as an entry point to browse this part of the code and deepen their understanding of the process. In a real cluster, multiple TiCDC nodes come online one after another and start campaigning for Owner at different times. The first instance to write the Owner Key into ETCD becomes the Owner. As shown in the figure below, TiCDC-0 writes the Owner Key at time t=1 and becomes the Owner. If it later fails and resigns its ownership, TiCDC-1 becomes the new Owner node. When the old Owner node comes back online, it calls the Election.Campaign method again to rejoin the election, and the cycle repeats.
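Continuing the snippet above, this lifecycle can be sketched as a loop. This is only an illustration, not the actual TiCDC code: runOwner is a placeholder for the owner's work loop, while Election.Campaign and Election.Resign are the real etcd concurrency APIs used to win and give up ownership.

import (
    "context"

    "go.etcd.io/etcd/client/v3/concurrency"
)

// campaignLoop sketches the cycle described above: campaign for the owner key,
// serve as owner until the work loop returns, resign so another capture can
// win, and then campaign again.
func campaignLoop(
    ctx context.Context,
    election *concurrency.Election,
    captureID string,
    runOwner func(context.Context) error, // placeholder for the owner's work loop
) error {
    for {
        // Blocks until this capture becomes the owner or ctx is canceled.
        if err := election.Campaign(ctx, captureID); err != nil {
            return err
        }

        ownerErr := runOwner(ctx)

        // Delete the owner key so another capture can win the election.
        if err := election.Resign(ctx); err != nil {
            return err
        }
        if ctx.Err() != nil {
            return ctx.Err()
        }
        _ = ownerErr // a real implementation would log this before re-campaigning
    }
}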

EtcdWorker module

EtcdWorker is a very important module inside TiCDC. It is mainly responsible for reading data from ETCD, mapping it into TiCDC's memory, and then driving the Owner and ProcessorManager to run. Concretely, EtcdWorker obtains the changes of all TiCDC-related keys by calling the ETCD Watch interface and maps them into the GlobalReactorState structure that it maintains. As the definition below shows, it records information such as Captures, Changefeeds, and the Owner.

type GlobalReactorState struct {
    ClusterID   string
    Owner       map[string]struct{}
    Captures    map[model.CaptureID]*model.CaptureInfo
    Upstreams   map[model.UpstreamID]*model.UpstreamInfo
    Changefeeds map[model.ChangeFeedID]*ChangefeedReactorState

    // ...
}

Both Owner and ProcessorManager are implementations of the Reactor interface, and both rely on the information in GlobalReactorState to advance their work. Specifically, the Owner polls each Changefeed recorded in GlobalReactorState so that every Changefeed steadily advances its replication progress; it is also responsible for operations on a Changefeed's running state, such as Pause / Resume / Remove. ProcessorManager polls each Processor so that they can update their running status in time.
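To make the pattern concrete, here is a heavily simplified sketch of an EtcdWorker-style loop, not the real implementation: it watches the TiCDC key prefix, folds each change into the in-memory state, and then lets each reactor (Owner, ProcessorManager) tick on that state. The Reactor interface shown here is a simplified stand-in, and applyEvent is a placeholder for decoding a key/value change into the state.

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// Reactor is a simplified stand-in for the interface that Owner and
// ProcessorManager implement: given the latest state, advance one step of work.
type Reactor interface {
    Tick(ctx context.Context, state *GlobalReactorState) error
}

// runEtcdWorker sketches the loop described above: watch the TiCDC key prefix,
// fold every change into the in-memory state, then let each reactor tick.
func runEtcdWorker(
    ctx context.Context,
    cli *clientv3.Client,
    prefix string, // e.g. /tidb/cdc/default/
    state *GlobalReactorState,
    reactors []Reactor,
    applyEvent func(*GlobalReactorState, *clientv3.Event), // placeholder: updates Captures / Changefeeds / Owner
) error {
    watchCh := cli.Watch(ctx, prefix, clientv3.WithPrefix())
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case resp := <-watchCh:
            if err := resp.Err(); err != nil {
                return err
            }
            for _, ev := range resp.Events {
                applyEvent(state, ev)
            }
        }
        for _, r := range reactors {
            if err := r.Tick(ctx, state); err != nil {
                return err
            }
        }
    }
}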

Closing

That’s all for this article. I hope readers can understand the following questions:

  • How a TiCDC Server starts, how a Changefeed is created, and how both interact with ETCD.
  • How EtcdWorker reads etcd data and drives Owner and Processor Manager to run.
  • TiCDC Owner election and switching process.

Next time we will introduce the working principle of the Scheduler module inside TiCDC Changefeed.

Read the series of articles on TiCDC source code interpretation:

  • Interpretation of TiCDC source code (1) — Overview of TiCDC architecture

  • Interpretation of TiCDC source code (2) — Introduction to TiKV CDC module
