How to build a strongly consistent system? Understanding the 2PC and TCC modes of data consistency and how to implement them (with pictures)

Background

First of all, before reading this article you should understand what a transaction is and what a distributed transaction is.
Let me give two examples here. There are two typical scenarios:
1. An application has two databases, one for orders and one for points, and placing an order must award points to the user. You cannot have points awarded without the order being placed, nor the order placed without the points being awarded.
2. An application has two microservices. The call to the order service succeeds and returns HTTP 200, but the call to the points service returns HTTP 500. Because an exception is thrown, the front end can only tell the user that the order failed, yet the order has actually been placed.

2PC mode

To solve this problem, the first thing that comes to mind is to commit the transactions of the two data sources at the same time. However, errors can occur anywhere in the business logic, so this has to be done by wrapping a framework layer around the business code and proxying it.

Because multiple database connections are easy to manage inside one process, the 2PC mode generally suits a monolithic architecture: one monolith facing multiple databases. You can use a framework such as Atomikos. Without further ado, look at the diagram of the execution process.

Whether or not an exception occurs, every commit operation is captured by the overridden commit method and held until the whole code segment has finished executing. This intercepted phase is the pre-commit (prepare) phase. If anything goes wrong at any step and an exception is thrown, the entire process is rolled back.

Simple enough; that is the principle of 2PC. Enough about principles, let's look at the core code:

import java.sql.Connection;
import javax.transaction.TransactionManager;

import com.alibaba.druid.pool.xa.DruidXADataSource;
import com.atomikos.icatch.jta.UserTransactionManager;
import com.atomikos.jdbc.AtomikosDataSourceBean;

DruidXADataSource dataSource = new DruidXADataSource();
dataSource.setUrl("...");              // fill in the JDBC URL here
dataSource.setUsername("...");         // fill in the username here
dataSource.setPassword("...");         // fill in the password here
dataSource.setDriverClassName("...");  // fill in the driver class here
dataSource.setInitialSize(5);          // minimum (initial) connection pool size
dataSource.setMaxActive(20);           // maximum connection pool size
dataSource.setValidationQuery("SELECT 1");

AtomikosDataSourceBean atomikosDs = new AtomikosDataSourceBean();
atomikosDs.setXaDataSource(dataSource);
atomikosDs.setMaxPoolSize(20);           // maximum connection pool size
atomikosDs.setMinPoolSize(5);            // minimum connection pool size
atomikosDs.setUniqueResourceName("ds1"); // globally unique name, e.g. ds1, ds2
atomikosDs.init();

// Get connections from the Atomikos data source and write your business logic; you can open multiple Connections.
Connection conn1 = atomikosDs.getConnection();

// This transaction manager is globally unique and thread-bound
TransactionManager tx = new UserTransactionManager();
// Begin the global (distributed) transaction
tx.begin();

// Write your business code here

// Unified commit of all branches; if the business code throws, call tx.rollback() instead
tx.commit();

If you want Spring Boot or Spring Cloud to take over the transaction using annotations, then write the following two classes:

import com.atomikos.icatch.jta.UserTransactionManager;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.TransactionException;
import org.springframework.transaction.support.AbstractPlatformTransactionManager;
import org.springframework.transaction.support.DefaultTransactionStatus;

/**
 * Transaction manager
 * @author Yang Ruoyu
 */
public class DbTransactionManager extends AbstractPlatformTransactionManager {
    @Override
    protected Object doGetTransaction() throws TransactionException {
        return new UserTransactionManager();
    }

    @Override
    protected void doBegin(Object transaction, TransactionDefinition definition) throws TransactionException {
        UserTransactionManager tx = (UserTransactionManager) transaction;
        try {
            tx.begin();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void doCommit(DefaultTransactionStatus status) throws TransactionException {
        UserTransactionManager tx = (UserTransactionManager) status.getTransaction();
        try {
            tx.commit();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    protected void doRollback(DefaultTransactionStatus status) throws TransactionException {
        UserTransactionManager tx = (UserTransactionManager) status.getTransaction();
        try {
            tx.rollback();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

/**
 * Transaction manager configuration
 * @author Yang Ruoyu
 */
@Configuration
public class TransactionConfig {
    @Bean
    public PlatformTransactionManager transactionManager() {
        return new DbTransactionManager();
    }
}

Where the data sources are created is up to you; they can be defined anywhere (for example, as Spring beans).
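
With these two classes in place, a service method can join the global transaction through the standard annotation. Here is a minimal sketch of what that usage might look like; OrderMapper and PointsMapper are hypothetical DAOs, one per XA data source:

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    // Hypothetical DAOs, each backed by one of the Atomikos XA data sources
    private final OrderMapper orderMapper;
    private final PointsMapper pointsMapper;

    public OrderService(OrderMapper orderMapper, PointsMapper pointsMapper) {
        this.orderMapper = orderMapper;
        this.pointsMapper = pointsMapper;
    }

    // Both writes run inside the same global transaction managed by DbTransactionManager;
    // if either one throws, both are rolled back together
    @Transactional
    public void placeOrder(long userId, long orderId) {
        orderMapper.insertOrder(orderId, userId);
        pointsMapper.addPoints(userId, 100);
    }
}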

TCC mode

First, let's understand how data state transitions work in TCC mode.
Try: tentative execution; the data body itself does not change.
Confirm: actual execution; the data body really changes.
Cancel: the execution is cancelled and the data body is restored to its state before Try.
In the diagram, the white rows are invisible to normal queries (similar to a soft-delete filter), while the blue-green rows are retrieved as usual; a rough sketch of these operations follows below.
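
To make the three operations concrete, here is a minimal sketch of a TCC participant interface plus the row status it manipulates. The names (TccParticipant, RowStatus) are my own, not from any particular framework:

// Status carried on every business row; normal queries only see READY rows
public enum RowStatus { READY, CREATING, UPDATING }

// Each participating service exposes the three TCC operations for a global transaction (identified by xid)
public interface TccParticipant {

    // Try: reserve the change (mark the row CREATING or UPDATING); the data body itself is not changed yet
    void tryExecute(String xid);

    // Confirm: actually apply the change and set the row back to READY
    void confirm(String xid);

    // Cancel: undo the reservation and restore the row to its state before Try
    void cancel(String xid);
}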

So what should be done when the services are distributed? Look at the diagram below and walk through it carefully.

Only the Try and Confirm phases are shown here. If an exception occurs, the front-end service notifies the distributed coordinator to send Cancel instructions to all services. Note that we assume Confirm and Cancel themselves never throw; if an exception does occur in those two phases, manual intervention is required.
The approach shown in the diagram is very simple and crude; in practice all of this can be handled by Seata.
Note that, to guarantee strong consistency of the data, in step 5 the service blocks and waits for the status to change to Ready. If you do not want throughput to drop, the only alternative is to throw an exception, let the front end know the data is still Updating, and trigger a Cancel.
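
As a tiny illustration of that alternative (the method and names here are made up): instead of blocking, the service can fail fast when it sees a row that is not Ready.

// Fail fast instead of blocking: the caller learns the row is being updated and can trigger Cancel
public void ensureReady(RowStatus status, long rowId) {
    if (status != RowStatus.READY) {
        throw new IllegalStateException("Row " + rowId + " is " + status + ", data is still updating, try again later");
    }
}
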
In step 7, why deep-copy the data into the coordinator's Redis? Mainly so that during the Cancel phase the row can simply be overwritten with the original data stored in Redis.
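
A sketch of that snapshot step, using Spring's StringRedisTemplate (the key layout and the JSON representation are assumptions of mine):

import org.springframework.data.redis.core.StringRedisTemplate;

public class TccSnapshotStore {

    private final StringRedisTemplate redis;

    public TccSnapshotStore(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // During Try: keep a deep copy (serialized as JSON) of the row before it is touched
    public void snapshotBeforeTry(String xid, String rowId, String rowJson) {
        redis.opsForValue().set("tcc:snapshot:" + xid + ":" + rowId, rowJson);
    }

    // During Cancel: read the snapshot back and overwrite the row with the original data
    public String loadForCancel(String xid, String rowId) {
        return redis.opsForValue().get("tcc:snapshot:" + xid + ":" + rowId);
    }
}
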
During this process you also need to respect idempotence. In plain terms: if you generate a random UUID during Try and pass it to the points service, then generate a new UUID during Confirm, the points service will receive a different UUID and misbehave in unpredictable ways (for example, recording the wrong order primary key). So the XID, every randomly generated value produced during Try, and every value that depends on other data's state should be cached and read back out during Confirm. In other words, if I read a status of 1 and branch on that logic, then even if another user changes it to 2 a few seconds later, I still follow the logic for 1: I process according to the context captured a few seconds ago. When building the underlying architecture you can proxy both the getters and the setters to enforce this. If Try and Confirm would produce different results, that is not acceptable, so whenever idempotence is violated the entire GlobalTransaction must be rolled back.
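
A minimal sketch of recording the Try-time context under the XID so that Confirm reuses exactly the same values instead of regenerating them (an in-memory map is used here for brevity; in practice this would live in Redis or a table):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class TryContextStore {

    // xid -> values generated or observed during Try (UUIDs, read statuses, etc.)
    private final Map<String, Map<String, String>> contexts = new ConcurrentHashMap<>();

    // Called during Try: generate the value once and remember it under the xid
    public String rememberUuid(String xid, String key) {
        return contexts.computeIfAbsent(xid, k -> new ConcurrentHashMap<>())
                       .computeIfAbsent(key, k -> UUID.randomUUID().toString());
    }

    // Called during Confirm: take out exactly what Try produced, never regenerate
    public String recall(String xid, String key) {
        Map<String, String> ctx = contexts.get(xid);
        return ctx == null ? null : ctx.get(key);
    }
}
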
As for step 10: none of the data has actually been updated yet; only its status has changed. One case not shown here is newly created data, whose status should be Creating; when other features query concurrently, their search conditions should filter out rows whose status is Creating.
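
For example, every ordinary read path would exclude rows that are still being created (the table and column names below are just assumptions):

// Normal queries only return READY rows, so half-created data never leaks into other features
String sql = "SELECT id, user_id, points FROM points_account WHERE user_id = ? AND status = 'READY'";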

Various abnormal situations of TCC

The microservice suddenly goes down

On restart we need to delete all Creating data and change all Updating data back to Ready; writing a Spring Boot initialization hook saves a lot of trouble. The Creating rows are only there because the service went down mid-transaction, so the caller must have treated the call as failed and already told the other services to Cancel.
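
A hedged sketch of that startup hook, assuming a JdbcTemplate and a points_account table with a status column:

import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class TccRecoveryConfig {

    @Bean
    public CommandLineRunner tccRecovery(JdbcTemplate jdbc) {
        return args -> {
            // Rows stuck in CREATING belong to calls the caller already treated as failed: drop them
            jdbc.update("DELETE FROM points_account WHERE status = 'CREATING'");
            // Rows stuck in UPDATING still hold their pre-Try body, so releasing them back to READY is safe
            jdbc.update("UPDATE points_account SET status = 'READY' WHERE status = 'UPDATING'");
        };
    }
}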

The distributed coordinator suddenly goes down

This is exactly why the Redis copy matters. If you run multiple distributed coordinators and the caller cannot reach one, it retries against another; the data is still in Redis, so business continuity is preserved.

What if the distributed coordinator crashes halfway through Confirm?

It doesn't matter, same as above. If in step 14 the caller sees an error such as HTTP 500 or Connection Refused, it retries three times; as long as one coordinator is still available, it keeps sending the Confirm command.
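
A rough sketch of that retry from the caller's side; CoordinatorClient is a hypothetical interface wrapping the HTTP call to one coordinator:

import java.util.List;

public class ConfirmRetry {

    // Try every known coordinator, up to three rounds, before giving up on the Confirm command
    public boolean confirmWithRetry(List<CoordinatorClient> coordinators, String xid) {
        for (int attempt = 0; attempt < 3; attempt++) {
            for (CoordinatorClient coordinator : coordinators) {
                try {
                    coordinator.confirm(xid);   // may fail with HTTP 500 or Connection Refused
                    return true;
                } catch (Exception e) {
                    // this coordinator is unreachable, fall through to the next one
                }
            }
        }
        return false;   // every coordinator failed three times: manual intervention is needed
    }

    // Hypothetical client for one distributed coordinator
    public interface CoordinatorClient {
        void confirm(String xid);
    }
}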

What if the entire computer room loses power? Combine the first and third questions.

Well... extreme as that is, there is still a solution. When a microservice restarts, it cancels all data still caught in a distributed transaction; likewise, when a distributed coordinator starts, it cancels everything that has not been confirmed, sending the instructions to the microservices one by one. If another coordinator finished its Confirm and wrote back to Redis in the 0.01 seconds before the power failure, those records will not be cancelled after the cluster restarts.

Postscript

Writing this article was a brain-burning effort. Anyway, please give it a thumbs up.