Summary of solutions for synchronizing MySQL data to Elasticsearch

Overview

In real-world projects, we often use MySQL as the business database and ES as the query database: reads are separated from writes, query pressure on MySQL is relieved, and complex queries over massive data sets become feasible. A crucial question is how to synchronize data between MySQL and ES. This article walks through the various solutions for MySQL-to-ES data synchronization.

Characteristics of MySQL and Elasticsearch

1. Why choose MySQL?

MySQL did not hold a particularly advantageous position in the history of relational databases: Oracle, DB2, and PostgreSQL (Ingres) were developed roughly 20 years earlier. However, riding the Internet boom around 2000, the LAMP stack spread quickly, and, especially in China, the master data of most emerging enterprises' IT systems ended up in MySQL.
Core features: open source and free, high concurrency, stability, transaction support, SQL support

  • High concurrency: the MySQL kernel is especially suited to high-concurrency simple SQL operations; its lightweight connections (thread model), optimizer, executor, and transaction engine are relatively simple, while the storage engine layer is more refined
  • Good stability: the primary requirement for a main database is to be stable and never lose data, and the simplicity of the MySQL kernel contributes to that stability. Primary-standby setups were available early on to handle fast failover after a crash, and the InnoDB storage engine ensures durability on disk
  • Convenient operation: a good user experience (compared with PostgreSQL) makes it easy for application developers to get started, with a low learning curve
  • Open source ecosystem: being open source makes it simple for upstream and downstream vendors to build tooling around MySQL. HA proxies and sharding (sub-database/sub-table) middleware greatly enhance its practicality, and the open source license has won it a large user base

2. Why choose ES?

Several notable features of ES effectively make up for MySQL's shortcomings in enterprise-level data-operation scenarios, which is an important reason we choose it as a downstream data store.

Core features: tokenized full-text search, good multi-dimensional filtering performance, support for massive-data queries

  • Text search capability: ES is a search engine built on inverted indexes. With a variety of tokenizers, it performs well at fuzzy text matching and covers a wide range of business scenarios.
  • Good multi-dimensional filtering performance: billion-scale data sets are pre-built into wide tables (eliminating joins) and indexed on every field, giving ES an overwhelming advantage in multi-dimensional filtering. This capability is the core requirement of enterprise operation systems such as CRM, BOSS, and MIS, and combined with text search it is hard to match.
  • Open source and commercial offerings in parallel: the ES open source ecosystem is very active with a large user base, and it is backed by an independent commercial company, giving users diverse and incremental options to match their own needs.
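To make the combination concrete, a single ES query can mix tokenized text matching with exact multi-dimensional filters. This is a generic sketch of the standard query DSL; the index fields (`name`, `category`, `price`) are hypothetical:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "wireless headphones" } }
      ],
      "filter": [
        { "term":  { "category": "electronics" } },
        { "range": { "price": { "gte": 100, "lte": 500 } } }
      ]
    }
  }
}
```

The `match` clause is scored against the inverted index, while the `filter` clauses are cached exact conditions, which is what makes wide-table multi-dimensional filtering fast.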

Data synchronization solutions

1. Synchronous double writing

This is the simplest approach: every time data is written to MySQL, the same data is also written to ES.

Example code:

/**
 * Add a new product: write to MySQL first, then index it into ES.
 */
@Transactional(rollbackFor = Exception.class)
public void addGoods(GoodsDto goodsDto) throws IOException {
    // 1. Save to MySQL
    Goods goods = new Goods();
    BeanUtils.copyProperties(goodsDto, goods);
    goodsMapper.insert(goods);

    // 2. Save to ES
    IndexRequest indexRequest = new IndexRequest("goods_index");
    indexRequest.source(JSON.toJSONString(goods), XContentType.JSON);
    indexRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
    highLevelClient.index(indexRequest, RequestOptions.DEFAULT);
}

Advantages:

  1. Simple business logic
  2. High real-time visibility: data appears in ES almost immediately

Disadvantages:

  1. Hard coding: every place that writes to MySQL must also add code to write to ES
  2. Strong coupling between business logic and synchronization logic
  3. Risk of data loss if the second write fails
  4. Reduced performance: MySQL throughput is not high to begin with, and adding a synchronous ES write slows each request further

Note:
The double-write failure risks mentioned above include the following:

  1. The ES cluster is unavailable;
  2. A network failure occurs between the application and ES;
  3. The application restarts before the ES write completes.

If strong data consistency is required, both writes must be wrapped in a single transaction, and once a transaction is involved the performance degradation becomes even more pronounced.

2. Asynchronous double writing (MQ mode)

For scenarios that write to multiple data sources, MQ can be used for asynchronous multi-source writing. The write logic for each source is independent, so a failure or slowdown in one data source does not affect the others, and overall write throughput improves. However, because MQ consumption is asynchronous, this approach is not suitable for business scenarios that require real-time visibility.
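The pattern above can be sketched as a toy model: the business thread writes only to "MySQL" and publishes a change message, while an independent consumer replays messages into "ES". This is a minimal illustration, not production code; the `Map` stand-ins and a `BlockingQueue` in place of a real MQ (RocketMQ, Kafka, etc.) are assumptions for the sketch.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy model of asynchronous double write via a message queue. */
public class AsyncDoubleWrite {
    // Stand-ins for the real stores; in production these are MySQL and ES.
    static final Map<Long, String> mysql = new ConcurrentHashMap<>();
    static final Map<Long, String> es    = new ConcurrentHashMap<>();
    // Stand-in for the MQ topic.
    static final BlockingQueue<Long> topic = new LinkedBlockingQueue<>();

    /** Business write path: only MySQL + MQ publish, never touches ES. */
    static void addGoods(long id, String goods) {
        mysql.put(id, goods);   // 1. commit to the primary store
        topic.offer(id);        // 2. publish the change event
    }

    /** Consumer: reads ids from MQ and upserts them into ES. */
    static Thread startConsumer() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Long id = topic.poll(1, TimeUnit.SECONDS);
                    if (id == null) break;       // idle: stop the toy consumer
                    es.put(id, mysql.get(id));   // idempotent upsert into "ES"
                }
            } catch (InterruptedException ignored) { }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Thread consumer = startConsumer();
        addGoods(1L, "phone");
        addGoods(2L, "laptop");
        consumer.join();
        System.out.println(es.get(1L) + "," + es.get(2L)); // phone,laptop
    }
}
```

Because the consumer upserts by primary key, redelivered messages are harmless, which is what makes MQ-level retry safe.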


Advantages:

  1. High performance
  2. Low risk of data loss, thanks to MQ's delivery guarantees: if ES is down or a write fails, the message can be re-consumed
  3. Multi-source writes are isolated from each other, making it easy to add more data source consumers

Disadvantages:

  1. Hard coding remains: each new data source requires new consumer code
  2. Increased system complexity from introducing message middleware
  3. Possible delay: MQ consumption is asynchronous, so data written by a user may not be immediately visible

3. Scheduled scan synchronization based on MySQL table

Both solutions above suffer from hard coding: every add, delete, update, or query path against MySQL must be accompanied by either ES code or MQ code, which is highly intrusive.
If real-time requirements are loose, a timer-based approach can be used instead. The steps are as follows:

  1. Add a timestamp field to the relevant tables; any CRUD operation updates this field;
  2. The CRUD operations in the original program remain unchanged;
  3. Add a timer program that scans the specified tables at a fixed interval and extracts the rows changed within that interval;
  4. Write the changed rows to ES one by one.
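The steps above can be sketched as a toy model: each "timer tick" selects rows whose `update_time` is newer than a checkpoint and upserts them into "ES". This is an illustrative sketch only; the in-memory `List`/`Map` stand in for the real MySQL table and ES index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of timer-driven incremental sync by timestamp checkpoint. */
public class TimedScanSync {
    record Row(long id, String data, long updateTime) {}

    static final List<Row> table = new ArrayList<>();    // stand-in for the MySQL table
    static final Map<Long, String> es = new HashMap<>(); // stand-in for the ES index
    static long lastSyncTime = 0;                        // checkpoint kept by the worker

    /** One timer tick: SELECT * FROM table WHERE update_time > lastSyncTime. */
    static int scanOnce() {
        int synced = 0;
        long maxSeen = lastSyncTime;
        for (Row r : table) {
            if (r.updateTime() > lastSyncTime) {
                es.put(r.id(), r.data());   // upsert by primary key -> idempotent
                maxSeen = Math.max(maxSeen, r.updateTime());
                synced++;
            }
        }
        lastSyncTime = maxSeen;             // advance the checkpoint
        return synced;
    }

    public static void main(String[] args) {
        table.add(new Row(1, "phone", 100));
        table.add(new Row(2, "laptop", 200));
        System.out.println(scanOnce());     // 2 -> both rows are new
        table.add(new Row(3, "watch", 300));
        System.out.println(scanOnce());     // 1 -> only the row changed since
    }
}
```

Note that rows changed between two ticks wait until the next tick, which is exactly the latency drawback discussed below.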



The typical implementation of this solution uses Logstash: on a configured schedule, it runs a SQL query that selects newly changed rows and writes them to ES, achieving incremental synchronization.

For a concrete implementation, see: Scheduled incremental synchronization of MySQL data to ES via Logstash
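A minimal sketch of such a Logstash pipeline, using the `logstash-input-jdbc` plugin. The assumptions here: a `goods` table with an `id` primary key and an auto-updated `update_time` column; hosts, database name, and credentials are placeholders.

```conf
input {
  jdbc {
    jdbc_driver_class      => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mall"
    jdbc_user              => "root"
    jdbc_password          => "<password>"
    schedule               => "* * * * *"   # run the query once per minute
    # Only fetch rows changed since the last run; Logstash persists :sql_last_value
    statement              => "SELECT * FROM goods WHERE update_time > :sql_last_value"
    use_column_value       => true
    tracking_column        => "update_time"
    tracking_column_type   => "timestamp"
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "goods_index"
    document_id => "%{id}"   # upsert by primary key so re-reads stay idempotent
  }
}
```

Setting `document_id` to the primary key makes repeated scans idempotent: a re-read row overwrites the same ES document instead of duplicating it.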


Advantages:

  1. No changes to the original code: no intrusion, no hard coding;
  2. No coupling with business logic, and no impact on the performance of the original program;
  3. The worker code is easy to write and does not need to distinguish between inserts, updates, and deletes;

Disadvantages:

  1. Poor timeliness: because a timer polls the table at a fixed frequency, there is always some delay, even with a synchronization period at the second level.
  2. Polling puts some pressure on the database. One improvement is to poll a lightly loaded slave (replica) instead of the master.

4. Real-time synchronization based on Binlog

The three solutions above suffer from either code intrusion, hard coding, or delay. Is there a solution that guarantees real-time data synchronization with no intrusion at all?
Yes: MySQL's binlog can be used for synchronization. The principle is as follows:

Specific steps are as follows:

  1. Read MySQL's binlog to obtain the change events of the specified tables;
  2. Push the parsed events to MQ;
  3. Write an MQ consumer program;
  4. The consumer continuously reads messages from MQ and writes each one into ES.
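The consumer side of this pipeline can be sketched as a toy model: row events (as a binlog parser such as canal would emit them) are routed by type into an in-memory stand-in for the ES index. The event shape and class names are illustrative assumptions, not canal's actual API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a binlog/MQ consumer replaying row events into "ES". */
public class BinlogConsumer {
    enum EventType { INSERT, UPDATE, DELETE }
    record RowEvent(EventType type, long id, String doc) {}

    static final Map<Long, String> esIndex = new HashMap<>(); // stand-in for ES

    /** One message off the queue: route by event type, as the ES writer would. */
    static void consume(RowEvent e) {
        switch (e.type()) {
            case INSERT, UPDATE -> esIndex.put(e.id(), e.doc()); // upsert document
            case DELETE         -> esIndex.remove(e.id());       // drop document
        }
    }

    public static void main(String[] args) {
        consume(new RowEvent(EventType.INSERT, 1, "{\"name\":\"phone\"}"));
        consume(new RowEvent(EventType.UPDATE, 1, "{\"name\":\"phone-2\"}"));
        consume(new RowEvent(EventType.INSERT, 2, "{\"name\":\"laptop\"}"));
        consume(new RowEvent(EventType.DELETE, 2, null));
        System.out.println(esIndex); // only id 1 remains, with the updated doc
    }
}
```

Because updates are applied as upserts keyed by primary key, replaying the same event twice leaves the index unchanged, which is why at-least-once MQ delivery is acceptable here.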

Advantages:

  1. No code intrusion, no hard coding;
  2. The original system needs no changes and is unaware of the synchronization;
  3. High performance;
  4. Business decoupling: no need to understand the business logic of the original system.

Disadvantages:

  1. Building a binlog-based pipeline is complex;
  2. If the parsed binlog events are delivered through MQ, the same delay risk as in solution 2 applies.

The currently popular solution in the industry is to use canal to listen to the binlog and synchronize data to ES.


canal, whose name translates as waterway/pipe/ditch, provides incremental data subscription and consumption based on parsing MySQL's incremental logs (binlog).
In plain terms, it synchronizes data incrementally from MySQL's binlog. To understand how canal works, first understand MySQL master-slave replication:

  1. All create, update, and delete operations go to the MySQL master node
  2. The master records every database operation in its binlog file
  3. Slave nodes subscribe to the master's binlog and replay it, synchronizing data to themselves as an incremental copy
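For canal (or any binlog subscriber) to receive row-level change events, the master must have the binlog enabled in ROW format. A minimal my.cnf sketch; the `server-id` value and log basename are illustrative:

```ini
[mysqld]
server-id     = 1          # unique ID, required once binlog is enabled
log-bin       = mysql-bin  # enable binary logging
binlog-format = ROW        # row-level events, required by canal
```

STATEMENT or MIXED format logs SQL text rather than row images, which is why ROW format is the prerequisite for this class of tooling.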

canal works by impersonating a MySQL slave node and subscribing to the master's binlog. The main flow is:

  1. The canal server sends a dump request to the MySQL master node
  2. On receiving the dump request, the master pushes binlog entries to the canal server, which parses the raw byte stream into structured objects (e.g. JSON)
  3. A canal client receives the events from the canal server over TCP or via MQ and writes the data into ES

Data migration and synchronization tool selection

There are many data migration and synchronization tools to choose from. The table below compares only the tools the author has used or researched for MySQL-to-ES synchronization; readers can pick the product that fits their actual needs.

| Feature \ Product | Canal | DTS | CloudCanal |
| --- | --- | --- | --- |
| Supports self-built ES | Yes | No | Yes |
| ES version coverage | Medium (ES6, ES7) | High (ES5, ES6, ES7) | Medium (ES6, ES7) |
| Nested type support | join / nested / object | object | nested / object |
| Join support | Based on join parent-child documents & reverse lookup | None | Based on pre-built wide tables & reverse lookup |
| Structure (schema) migration | No | Yes | Yes |
| Full migration | Yes | Yes | Yes |
| Incremental migration | Yes | Yes | Yes |
| Data filtering | Medium (where condition in full stage only) | High (where condition in full and incremental stages) | High (where condition in full and incremental stages) |
| Time zone conversion | No | Yes | Yes |
| Sync rate limiting | No | Yes | Yes |
| Task editing | No | Yes | No |
| Data source coverage | Medium | High | Medium |
| Architecture | Subscribe/consume mode; must write to a message queue first | Direct connection | Direct connection |
| Monitoring metrics | Medium (performance metrics) | Medium (performance metrics) | High (performance and resource metrics) |
| Alerting | None | Phone alerts on delay and anomalies | DingTalk, SMS, and email alerts on delay and anomalies |
| Visual task creation / configuration / management | No | Yes | Yes |
| Open source | Yes | No | No |
| Free | Yes | No | Community Edition and SaaS free tier |
| Standalone deployment | Yes | No (delivered as part of the cloud platform) | Yes |
| SaaS offering | No | Yes | Yes |

Summary

This article summarized the common solutions for synchronizing data between MySQL and ES.

  1. Synchronous double writing is the simplest approach and offers the best real-time guarantees; its biggest problem is severe code intrusion.
  2. Asynchronous double writing introduces message middleware. Because MQ consumption is asynchronous, data synchronization may be delayed; the advantages are higher throughput under large-scale message volume, easy access to more data sources, and mutual isolation between the consumers of each data source.
  3. Timer-based table scanning synchronizes incremental data by periodically scanning tables. It avoids code intrusion, but the periodic scan inevitably introduces synchronization delay. A typical implementation uses Logstash for incremental synchronization.
  4. Binlog-based real-time synchronization listens to MySQL's binlog for incremental changes. There is no code intrusion and synchronization is near real-time; the drawback is that building the binlog pipeline is relatively complex. A typical implementation uses canal.

Article reference: https://blog.csdn.net/w1014074794/article/details/127124574