Change data capture with SQL Server

Typical uses

Typical uses of change data capture are described in Microsoft's official documentation:

Change Data Capture uses SQL Server Agent to record inserts, updates, and deletes that occur in a table. A typical example of a data consumer targeted by this technology is extract, transform, and load (ETL) applications. ETL applications incrementally load changed data from SQL Server source tables into a data warehouse or data mart.

To these I would add one more: scenarios that need to share and synchronize data between heterogeneous systems can also make full use of this technology.

The loading of dimension tables and fact tables in data warehouses generally takes one of the following forms:

  • Brute-force full load: delete all the data in the target dimension table, then reload it completely from the source table. This has two problems: (1) it destroys the surrogate keys in the dimension table, which may become unusable altogether (e.g. auto-incrementing IDs); (2) SCD Type 2 (historical) changes cannot be retained.

  • Still a full load, but when actually populating the dimension and fact tables, write your own primary-key lookup comparison to find which rows do not exist in the target table and which need to be updated. Such methods can easily identify new rows, but updated rows still require complex comparison logic, which greatly slows data loading.

  • Add a dedicated last_update_date field to the source table to record each row's update time. The loading process obtains changed and new rows by filtering on this field.

In the author's career, these three methods are the most common. The huge disadvantage of the first two is the full load itself: if the dimension or fact table in the data mart contains hundreds of thousands or even millions of rows, loading can take hours, or the approach may not be feasible at all. Because all the data is read, the source table is also fully scanned (and, under the default isolation level, locked), degrading UPDATE/DELETE performance on the source table. The disadvantage of the third approach is that it depends on the table schema: in some cases the source system is not under your control, and changing its schema, let alone the application, is not allowed. It also relies heavily on the accuracy of that field; for example, the field may be set incorrectly by the application, causing the load to pick up the wrong data.
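To make the second method's pain concrete, here is a minimal sketch of the hand-written primary-key lookup comparison it requires (the table layout and column names are hypothetical; rows are modeled as dicts keyed by primary key):

```python
def diff_full_load(source_rows, target_rows, key="id"):
    """Classify source rows as inserts or updates relative to the target."""
    target_by_key = {row[key]: row for row in target_rows}
    inserts, updates = [], []
    for row in source_rows:
        existing = target_by_key.get(row[key])
        if existing is None:
            inserts.append(row)        # new rows are easy to identify
        elif existing != row:          # updates need a full column-by-column compare
            updates.append(row)
    return inserts, updates

source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]
target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
inserts, updates = diff_full_load(source, target)
```

Note that every source row must be compared against the target, every time, which is exactly the work CDC lets you avoid.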

Therefore, major database vendors recognized this scenario long ago and introduced change data capture (CDC) technology early on. Unfortunately, not many people pay attention to it, so I am writing this article to share it. I hope you can make full use of this technology in data synchronization, data warehousing, and other data processing scenarios. Now let's walk through, step by step, how to use SQL Server change data capture.

Enable change capture for the database

Take my newly created Test database as an example; it contains an Account table.


-- ====
-- Enable change data capture for the database
-- ====
USE Test
GO
EXEC sys.sp_cdc_enable_db
GO

Enable capture for the table


USE Test
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Account', -- table name
@role_name = NULL, -- For demonstration, no gating role is used here. Production environments often assign a dedicated role so that only members of that role can read the change data.
@supports_net_changes = 1 -- Enable net-change capture, which keeps only the last change for each row.
GO

Query change capture data


DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Account');
SET @to_lsn = sys.fn_cdc_get_max_lsn();
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Account(@from_lsn, @to_lsn, 'all');

[screenshot: result of the net-change query]

If you are a little confused, let me explain step by step. The screenshot above shows the query result. It comes from the change table (a shadow table) created when CDC was enabled for the table; this table is dedicated to storing change information and can be found under the system tables. You can see that the change table consists of the source table's columns plus several additional columns that describe the capture.

[screenshot: the change table under System Tables]


SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Account');
-- Gets the minimum LSN for the captured table. An LSN (log sequence number) identifies a record
-- in the SQL Server transaction log; LSNs increase monotonically.
-- sys.fn_cdc_get_min_lsn returns this LSN in binary format. All CDC capture functions work in terms of LSNs.
SET @to_lsn = sys.fn_cdc_get_max_lsn();
-- Gets the maximum log LSN. Why is there no table name here? Because the maximum LSN is aligned across
-- all tables: if one transaction updates both the Account table and another captured table, say Sales,
-- the maximum LSN seen in both change tables is the same value, so no table name needs to be passed.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_Account(@from_lsn, @to_lsn, 'all');
-- Calls the net-change capture function over the interval between the two LSNs. 'all' returns the
-- final state of each changed row, without marking the changed columns in __$update_mask.

Now let's go back and look at the meaning of the column names in the query result set.

Column name — Description

__$start_lsn — the LSN to which this row change belongs.
__$operation — the type of change:
  1 = delete
  2 = insert
  3 = row value before update
  4 = row value after update
  5 = merge, i.e. the row was either inserted or updated (returned only when the function is called with 'all with merge' as its last parameter)
__$update_mask — a bitmask marking which columns were changed. It can only be read with the sys.fn_cdc_is_bit_set function; how to use it to identify changed columns is demonstrated later.
The remaining columns map one-to-one to the source table's columns, so I won't go into detail.
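To make __$update_mask less mysterious, here is a sketch of how such a bitmask can be interpreted. The real sys.fn_cdc_is_bit_set runs inside SQL Server; the layout below, with column ordinal 1 in the least significant bit of a big-endian value, is an illustrative assumption:

```python
def is_bit_set(ordinal, mask_bytes):
    """Return 1 if the column with the given 1-based ordinal changed, else 0."""
    value = int.from_bytes(mask_bytes, "big")
    return (value >> (ordinal - 1)) & 1

# Suppose a 3-column table where only column 2 (e.g. Name) changed:
mask = (0b010).to_bytes(1, "big")
```

With this mask, `is_bit_set(2, mask)` reports the Name column as changed while columns 1 and 3 report unchanged.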

Now run two updates and a delete and look at the results.


UPDATE Account
SET name = 'Gary3'
WHERE ID = 20

UPDATE Account
SET name3 = 'Gary name 3'
WHERE ID = 19

DELETE FROM Account
WHERE ID = 18

[screenshot: net-change query result after the updates and delete]

The first row corresponds to the second update: __$operation = 4, which is correct.

The second row corresponds to the third operation, the delete: __$operation = 1, also correct.

The third row corresponds to the first update, yet __$operation = 2. Wait, why 2? Because this row was first inserted manually and then updated, and we are reading with the net-change function, so it still looks like a new row. Simply put, the capture range used here runs from the minimum LSN to the maximum LSN; the earliest change to this row in that range is the insert, so the operation returned is always 2, while the row values are the latest ones. To see only the newer changes, you must raise the starting LSN.

The following screenshot shows all records in the change table. You can clearly see that the earliest record for ID = 20 is the first row, with __$operation = 2, meaning that transaction was an insert; the ID = 20 row was then updated twice, once to Gary and once to Gary3, so the function returns the final value Gary3. This is the meaning of net change. If operation = 4 were returned instead, the record would be treated as an update and its newly-inserted status would be lost, which would be wrong: when doing ETL, we need to know that a record is new while still retaining its latest values.
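The net-change behavior described above can be sketched as follows: collapse each row's change history (ordered by LSN) into a single net row, keeping the first operation in the window and the latest values. The operation codes follow the article (1 = delete, 2 = insert, 4 = update); the in-memory representation is an illustrative assumption:

```python
def net_changes(history):
    """history: list of (lsn, operation, key, values) in LSN order."""
    net = {}
    for lsn, op, key, values in history:
        if key not in net:
            net[key] = [op, values]   # first change in the window fixes the operation
        elif op == 1:
            if net[key][0] == 2:
                net.pop(key)          # inserted then deleted inside the window: no net change
            else:
                net[key] = [1, None]
        else:
            net[key][1] = values      # keep the first operation, take the latest values
    return {k: tuple(v) for k, v in net.items()}

history = [
    (1, 2, 20, "Gary0"),  # insert
    (2, 4, 20, "Gary"),   # update
    (3, 4, 20, "Gary3"),  # update: net result stays an insert, with the latest value
]
result = net_changes(history)
```

For the ID = 20 history above, the net result is an insert (operation 2) carrying the final value Gary3, matching what the CDC net-change function returns.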

[screenshot: all records in the change table]

How to know whether a column has changed?

This is also one of the most powerful features of change data capture. Without it, you can only resort to tedious, ugly oldValue == newValue comparisons to determine whether a value changed. Let's make a change: modify the name of the record with ID = 19.


UPDATE Account
SET name3 = 'Gary name 4'
WHERE ID = 19

[screenshot: the change table showing the new update record]

We can see that the last record in the change table (change tables end in _CT among the system tables) reflects this change: __$operation = 4, indicating an update.

Here we need to record the __$start_lsn of this update, because the starting LSN of this change has moved forward. If we still capture from the minimum LSN, this row will again be reported in its initial state (this is very important), so we capture from the new start LSN. For demonstration purposes I simply hard-coded the copied value.


DECLARE @from_lsn binary(10), @to_lsn binary(10), @name int
SET @from_lsn = 0x0000002C000003D60003
SET @to_lsn = sys.fn_cdc_get_max_lsn();
SET @name = sys.fn_cdc_get_column_ordinal('dbo_Account', 'Name');
SELECT *,
       sys.fn_cdc_is_bit_set(@name, __$update_mask) AS 'Is_Name_Changed'
FROM cdc.fn_cdc_get_net_changes_dbo_Account(@from_lsn, @to_lsn, 'all with mask');

[screenshot: query result with the Is_Name_Changed column]

You can see that the last column, Is_Name_Changed = 1, means the Name column was modified. With this flag we can simply use operation = 4 AND Is_Name_Changed = 1 as the condition for deciding whether the column was updated. During ETL, this judgment lets us record the field's change history in the dimension table.

Capture only the columns you need

The @captured_column_list parameter below configures which columns to capture, instead of capturing all columns by default; this is very helpful for performance.


USE Test
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Account', -- table name
@role_name = NULL,
@supports_net_changes = 1, -- Enable net-change capture.
@captured_column_list = N'DepartmentID, Name, GroupName'
GO

Sliding-window loading

As mentioned earlier, the most important parameters for change capture are @from_lsn and @to_lsn. Simply put, these two values determine the interval over which changed data is read. Whether a returned row appears as an insert, delete, or update is determined entirely by this interval: a row may have been inserted in an earlier interval but updated or deleted in a later one. So choosing the interval is very important.

So in real ETL loading, we usually need to record the from_lsn and to_lsn of each load interval. On the next load, we use the previous to_lsn to compute the new starting LSN.

For example, suppose the maximum LSN after the last load completed was 0x0000002C000006370004, and the source table has since undergone many inserts, updates, and deletes.

The next LSN is 0x0000002C000006370005, and we need to use it as the starting point of the next load. So how do we derive the next LSN from the last one?

That is what the sys.fn_cdc_increment_lsn function is for: its parameter is the maximum LSN from the last load, and it returns the next LSN. Let's modify the code accordingly.


DECLARE @from_lsn binary(10), @to_lsn binary(10), @name int
SET @from_lsn = sys.fn_cdc_increment_lsn(0x0000002C000003D60003);
SET @to_lsn = sys.fn_cdc_get_max_lsn();
SET @name = sys.fn_cdc_get_column_ordinal('dbo_Account', 'Name');
SELECT *,
       sys.fn_cdc_is_bit_set(@name, __$update_mask) AS 'Is_Name_Changed'
FROM cdc.fn_cdc_get_net_changes_dbo_Account(@from_lsn, @to_lsn, 'all with mask');
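Conceptually, sys.fn_cdc_increment_lsn just treats the 10-byte LSN as a big-endian integer and adds 1. The real function runs inside SQL Server; this sketch only mimics that arithmetic:

```python
def increment_lsn(lsn_bytes):
    """Return the next LSN: the 10-byte value, as a big-endian integer, plus one."""
    value = int.from_bytes(lsn_bytes, "big") + 1
    return value.to_bytes(10, "big")

# The example from the text: ...0004 increments to ...0005.
last_lsn = bytes.fromhex("0000002C000006370004")
next_lsn = increment_lsn(last_lsn)
```

This matches the earlier example, where 0x0000002C000006370004 is followed by 0x0000002C000006370005.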

To improve this further, we use a table to record the LSN after each load, then query that LSN and pass it in dynamically. Assume I have created this history table, Capture_Last_lsn, which always holds exactly one record: the LSN of the last load. We pass the stored value to sys.fn_cdc_increment_lsn, which returns the next LSN; we use that as the starting point of the load, and after the load completes we save the maximum @to_lsn back into Capture_Last_lsn.


DECLARE @last_load_lsn binary(10), @from_lsn binary(10), @to_lsn binary(10), @name int
SELECT @last_load_lsn = last_load_lsn FROM Capture_Last_lsn
SET @from_lsn = sys.fn_cdc_increment_lsn(@last_load_lsn);
SET @to_lsn = sys.fn_cdc_get_max_lsn();
SET @name = sys.fn_cdc_get_column_ordinal('dbo_Account', 'Name');
SELECT *,
       sys.fn_cdc_is_bit_set(@name, __$update_mask) AS 'Is_Name_Changed'
FROM cdc.fn_cdc_get_net_changes_dbo_Account(@from_lsn, @to_lsn, 'all with mask');

With this in place, every load reads the changed data starting exactly where the previous load ended.
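The sliding-window bookkeeping can be sketched end to end as follows, with the CDC calls stubbed out (get_max_lsn and get_net_changes are hypothetical stand-ins for sys.fn_cdc_get_max_lsn and cdc.fn_cdc_get_net_changes_*, and the state dict plays the role of the Capture_Last_lsn table):

```python
def run_load(state, get_max_lsn, get_net_changes):
    """One ETL load: read changes since the last saved LSN, then advance the window."""
    # fn_cdc_increment_lsn equivalent: last LSN, as a big-endian integer, plus one
    from_int = int.from_bytes(state["last_load_lsn"], "big") + 1
    from_lsn = from_int.to_bytes(10, "big")
    to_lsn = get_max_lsn()
    rows = get_net_changes(from_lsn, to_lsn)
    state["last_load_lsn"] = to_lsn   # persist the window end for the next run
    return rows

state = {"last_load_lsn": bytes(10)}  # initial state: LSN zero
captured = run_load(
    state,
    get_max_lsn=lambda: (5).to_bytes(10, "big"),
    get_net_changes=lambda a, b: [("row-in-window", a, b)],
)
```

After the run, the saved LSN has advanced to the window's end, so a second call would start just past it.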

Summary

Change data capture is a critical core technology. If you want to design a reliable data capture and transfer solution, it may be the only dependable option, and it is very efficient. Let's summarize its advantages and caveats:

Advantages:

  • Reliability is provided at the database level, without relying on third-party technologies.

  • Each load reads only the captured changes, without scanning all the data in the source table.

  • Capture can be enabled and disabled transparently to the source table (the only requirement is that the source table has a primary key).

  • Capture is produced by the replication log-reader technology in the SQL Server engine, which reads the transaction log asynchronously, so it has little impact on the source table.

  • Changing the source table's schema does not break capture: dropped columns return NULL, and newly added columns do not appear in the change table. Capture remains enabled until explicitly disabled.

Caveats:

  • Capture relies on SQL Server Agent jobs for both capturing changes and cleaning them up.

  • The change data retention period defaults to 3 days, though it can be extended. That is, if you do not run a capture query for 3 days, SQL Server will automatically purge changes older than 3 days from the change table. So consider your ETL load frequency carefully; for example, run a capture query once a day for data synchronization. And the longer the retention period, the more data accumulates in the change table, which is a significant burden for large, frequently updated tables.
