In-depth explanation of SQL Server CDC data synchronization

Introduction

SQL Server is a long-established relational database. It was jointly launched by Microsoft, Sybase and Ashton-Tate in 1988, has been updated continuously ever since, and has a large user base.

Today, when we say SQL Server we usually mean Microsoft SQL Server 2000 and later versions.

SQL Server 2008 was a milestone release that added many new features, including new syntax, richer data types and the CDC capability discussed in this article. This capability makes it much more convenient to synchronize data out of SQL Server in real time.

This article introduces how the new version of CloudCanal further optimizes SQL Server data synchronization, and how that works in practice.

What does SQL Server CDC look like?

Raw logs

Relational databases commonly have the following two types of logs:

  • redo log
    • Records the forward changes to data. In simple terms, a transaction commit is usually written to this log first, before success is returned to the application, which ensures durability
  • undo log
    • Used to ensure transaction atomicity; for example, executing a rollback replays the contents of the undo log in reverse to undo the changes

The process of writing a DML statement to the database is as follows

  • In most relational databases, one or more changes are implicitly or explicitly wrapped into a transaction
  • When the transaction starts, the database engine locates the file position of the affected data rows and generates a before image and an after image based on the existing data
  • The after image is recorded in the redo log, and the before image is recorded in the undo log
  • After the transaction is committed, the log commit point (checkpoint) advances, and the committed log content may later be overwritten or released

SQL Server's redo/undo log uses the ldf format, and the file is reused cyclically.

  • The ldf log file is composed of multiple VLFs (virtual log files) concatenated end to end, which together form the complete database log
  • When the logical log reaches the end of the physical log file, new log records wrap around to the beginning of the physical file and overwrite old data

The ldf files are the incremental logs that CDC parses.
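
As a quick check, the VLF layout of the ldf file can be inspected with the undocumented but long-standing DBCC LOGINFO command (one result row per VLF); the database name console below is simply the example used later in this article.

use [console];
dbcc loginfo; -- one row per VLF: FileId, FileSize, StartOffset, Status, ...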

Enable CDC

Execute the exec [console].sys.sp_cdc_enable_db command to enable CDC for the console database. This statement actually creates two jobs: cdc.console_capture and cdc.console_cleanup

Use the exec sp_cdc_help_jobs command to view the details of both jobs.
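
Putting the above together, a minimal sequence looks like this (assuming a database named console and sufficient privileges, typically sysadmin):

use [console];
exec sys.sp_cdc_enable_db;   -- enables CDC for the current database and creates the two jobs
exec sys.sp_cdc_help_jobs;   -- shows the configuration of the capture and cleanup jobs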

  • cdc.console_capture
    • Responsible for analyzing the ldf log, parsing the console database's change events, and writing them into the CDC capture tables
    • By default it scans every 5 seconds, with up to 10 scan rounds per cycle and a maximum of 500 transactions per scan
  • cdc.console_cleanup
    • Responsible for periodically cleaning up older data in the CDC capture tables
    • By default it keeps 3 days of CDC log data (4320 minutes)

After the CDC function is enabled, the SQL Server database will have an additional schema named cdc, and the following tables will appear in it.

  • change_tables
    • Document each CDC-enabled source table and its corresponding capture table
  • captured_columns
    • Record information corresponding to each column in the capture table
  • index_columns
    • Record the primary key information contained in the source table (if any)
  • lsn_time_mapping
    • Record the start/end time and LSN location information of each transaction
  • ddl_history
    • Records the DDL statements that add or drop columns on the source table; other DDL is not recorded

With the above preparation in place, you can start change data capture (CDC) for the source table, that is, incremental data capture.
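
For example, the change_tables table can be queried directly to verify which capture instances exist and from which LSN they start (a simple inspection query, not something CloudCanal requires):

use [console];
select capture_instance,
       object_name(source_object_id) as source_table,
       start_lsn
from cdc.change_tables;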

Capture table changes

Suppose we have the following source table

create table [dbo].[test_table] (
  [id] [bigint] NOT NULL primary key,
  [test] [nchar](10) NULL
)

To enable CDC for it, execute the following command

exec [console].[sys].[sp_cdc_enable_table]
    @source_schema = N'dbo',
    @source_name = N'test_table',
    @role_name = NULL,
    @capture_instance = N'dbo_test_table', -- optional
    @supports_net_changes = 0;

An extra table named dbo_test_table_CT now appears under the cdc schema; this is the capture table

  • Perform some DML operations on the source table [dbo].[test_table]; the change records usually appear in the capture table within about 5 seconds
  • Perform some add/drop column operations on the source table, and the corresponding DDL appears in the ddl_history table

Other tables can obtain their corresponding incremental changes through similar settings. The whole mechanism looks quite intuitive and simple.
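
For reference, the captured DML can also be read through the table-valued function that CDC generates for each capture instance; the sketch below assumes the dbo_test_table capture instance created above:

use [console];
declare @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_test_table');
declare @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
-- with N'all', updates are returned as a single after-image row (__$operation = 4)
select __$start_lsn, __$seqval, __$operation, id, test
from cdc.fn_cdc_get_all_changes_dbo_test_table(@from_lsn, @to_lsn, N'all')
order by __$start_lsn, __$seqval;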

What are the challenges?

Difficulty 1: DDL synchronization

The CDC capture table only reflects data changes; it contains no DDL information

DDL has to be obtained separately, which also means its ordering relative to DML needs extra handling

To solve this, DDL and DML events have to be merged into a single ordered sequence, for example by executing the following SQL, but in practice this approach runs into serious performance problems.

select * from (
  select __$start_lsn lsn, __$operation oper, __$update_mask mask, null ddl, id data_id, test data_test
  from [console].[cdc].[dbo_test_table_CT]
  union
  select ddl_lsn lsn, -1 oper, null mask, ddl_command ddl, null data_id, null data_test
  from [console].[cdc].[ddl_history]
) t order by lsn

Difficulty 2: Unable to obtain new column data

The structure of the CDC capture table does not change when DDL occurs on the source table, which means the values of newly added columns cannot be obtained

Difficulty 3: Database limitations

Using the CDC feature itself also comes with some restrictions, which fall roughly into two categories

Hard limits

  • The truncate table statement cannot be executed on a source table with CDC enabled; it fails with an error (see the sketch after this list)

  • A CDC capture table is essentially an ordinary table, so a large number of subscriptions inflates the total number of tables in the database

  • CDC relies on SQL Server Agent; if the Agent is not running or the jobs fail, no new data is written to the capture tables

  • A table can have at most 2 CDC capture tables, that is, at most 2 incremental subscriptions

  • CDC on a table can only be enabled or disabled; it is not possible to rebuild CDC and resume capture from a specified LSN
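
The truncate restriction mentioned in the first bullet is easy to reproduce (a sketch using the example table from earlier):

use [console];
-- expected to fail, because [dbo].[test_table] has CDC enabled;
-- the error states the table is published for replication or enabled for Change Data Capture
truncate table [dbo].[test_table];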

Soft limits

  • The data retention time in the CDC capture table defaults to 3 days

  • When inserting or updating large fields, CDC by default only processes data up to 64 KB

    • If the data content exceeds this limit, the CDC capture task will report an error and stop working
    • There are 7 affected types: text, ntext, varchar(max), nvarchar(max), varbinary(max), xml, image

CloudCanal's workarounds

The basic processing model of CloudCanal's SQL Server incremental consumption is as follows; it guarantees the change order within each individual table, which covers most scenarios

  • Determine a work queue based on the change_tables table
  • Determine the starting point from which the incremental scan of each capture table begins
  • Process the events on the work queue concurrently
  • Each worker scans its assigned CDC capture table starting from that LSN (see the sketch after this list)
  • Each worker maintains its own LSN progress as it scans
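
A worker's scan round can be pictured as a plain query on its capture table, filtered by the LSN it last processed; the sketch below seeds the LSN from the capture instance's minimum just for illustration (CloudCanal persists its own progress):

use [console];
-- in a real run, @last_lsn comes from the worker's stored progress
declare @last_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_test_table');
select __$start_lsn, __$seqval, __$operation, __$update_mask, id, test
from [cdc].[dbo_test_table_CT]
where __$start_lsn > @last_lsn
order by __$start_lsn, __$seqval;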

Solving Difficulty 1: DML/DDL Reordering

Each record in the CDC capture table carries an LSN, and the ddl_history table records an LSN for each DDL as well. Therefore, an interpolation approach can be used to insert DDL events into the ordinary DML event sequence. The principle is as follows:

  1. Perform a pre-query on the ddl_history table; the DDL events obtained are later compared against the DML events by LSN (see the sketch after this list)
  2. Query the dbo_test_table_CT data capture table
  3. While processing each capture-table record, check whether a DDL event should be inserted before it
  4. Emit the complete, ordered event stream
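
The pre-query from step 1 can be sketched as follows; it pulls each DDL event together with its LSN so that it can be interleaved into the DML stream by LSN comparison (table name as in the earlier example):

use [console];
select ddl_lsn, ddl_command, ddl_time
from cdc.ddl_history
where source_object_id = object_id(N'dbo.test_table')
order by ddl_lsn;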

Solving Difficulty 2: Reverse Check and Supplement Missing Data

That a source table can have at most 2 CDC capture tables is a hard limit, but it can still be used to solve this problem: creating a second CDC capture table after the DDL occurs makes the data changes introduced by the DDL visible

  1. Create the first CDC capture table dbo_test_table_1_CT
  2. Add a new column between two data insertions
  3. Create a second CDC capture table dbo_test_table_2_CT (see the sketch after this list)
  4. Insert one more row
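
Step 3 is just another call to sp_cdc_enable_table with a different capture instance name (a sketch; the instance names used here mirror the example):

exec [console].[sys].[sp_cdc_enable_table]
    @source_schema = N'dbo',
    @source_name = N'test_table',
    @role_name = NULL,
    @capture_instance = N'dbo_test_table_2',
    @supports_net_changes = 0;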

As the figure above shows, dbo_test_table_2_CT can see the new column's data while dbo_test_table_1_CT cannot

Unfortunately, the data written after the DDL occurs and before the second CDC capture table is created is still missing

The above example is illustrated in the figure below (events shown in gray indicate missing events or data)

Taking the LSN at which the DDL occurs as the cut-off point:

  • The data in dbo_test_table_1_CT from before the DDL occurred is fully trustworthy
  • After the DDL occurs, dbo_test_table_1_CT has no field for the new column, so its data is incomplete and cannot be fully trusted
  • dbo_test_table_2_CT is created after the DDL occurs, so compared with dbo_test_table_1_CT it is missing the earlier data
  • In addition, there is a blind window between dbo_test_table_1_CT and dbo_test_table_2_CT, which makes the INSERT event that falls into it untrustworthy in both tables

CloudCanal's solution is to use the primary key to look the missing rows up in the source table and back-fill the gaps in both tables (the dark gray part in the figure above)
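
The back-fill itself can be pictured as a primary-key lookup against the source table; the column name new_col and the key value below are purely illustrative:

use [console];
declare @missing_pk bigint = 1; -- PK taken from the incomplete capture-table record
select id, test, new_col        -- new_col: the column added by the DDL (illustrative name)
from [dbo].[test_table]
where id = @missing_pk;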

In an extreme case, another DDL occurs while the second CDC capture table is being created, which makes the newly created capture table unreliable as well; the second CDC capture table then has to be recreated, and the data range that needs to be back-filled by reverse lookup is widened accordingly (the dark gray part in the figure below)

It is this series of mechanisms that lets CloudCanal obtain complete incremental data around DDL events

Solving Difficulty 3: Targeted Optimizations and Workarounds

For the hard limits, CloudCanal does not have a direct cure, but it offers alternative approaches (such as triggers, scheduled incremental scans, and the new version of the SQL Server CDC solution) to compensate.

The soft limits, however, can be tuned as follows:

  • Set the data retention time in the CDC capture table through the retention parameter in the following command

    exec console.sys.sp_cdc_change_job
        @job_type = 'cleanup',
        @retention = 4320 -- unit: minutes (4320 minutes = 3 days)
    
  • Adjust the maximum data bytes processed by CDC with the following command

    exec sp_configure 'show advanced options', 1;
    reconfigure;
    exec sp_configure 'max text repl size', -1; -- -1 means unlimited
    reconfigure;
    

Summary

This article briefly introduced SQL Server CDC technology, and then described how CloudCanal builds on it to achieve stable incremental DML + DDL synchronization and to solve the problems encountered along the way.