ElasticSearch (3): Hot-Updating the IK Dictionary from MySQL

Foreword

"A fine view lies before me, but I cannot put it into words — Cui Hao's poem is already inscribed above." In other words, someone has already written this up well. (Reprinted from: https://blog.51cto.com/u_13270529/5962113)

Adding words to the IK tokenizer's extended dictionary files requires restarting the ES server after every dictionary update, which is absolutely not acceptable in a production environment. (IK's built-in remote-dictionary mechanism avoids the restart, but it requires hosting the dictionary files behind an HTTP endpoint.) If we instead store the extended dictionary data in a third-party component such as Redis or MySQL, ElasticSearch can poll Redis or MySQL on a schedule for the latest dictionary data and rebuild the dictionary, so the IK dictionary can be hot-updated without restarting the ElasticSearch server every time the dictionary changes.

Should the hot update be a full reload or an incremental one? The dictionary itself is not large — a few hundred thousand entries at most, and usually not every word belongs in it anyway. Moreover, the dictionary data in the database serves the whole ES cluster; an incremental scheme involves a lot of fiddly details, and updating must not disturb queries that are using the current dictionary. All things considered, it is better to reload the dictionary in full and swap it in atomically, as sketched below.
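As a minimal sketch of the build-then-swap idea (the class and method names below are illustrative, not the real IK Analyzer API): the reload thread builds a complete new snapshot off to the side while tokenizer threads keep reading the old one, and only the final reference assignment is visible to readers, so they never see a half-loaded dictionary.

import java.util.Collection;
import java.util.Set;

// Illustrative sketch of a full-reload-then-swap dictionary update;
// this is NOT the real IK Analyzer API.
class HotSwapDictionary {

    // volatile: tokenizer threads immediately see the newly swapped-in snapshot
    private volatile Set<String> words = Set.of();

    // Readers (tokenizer threads) always query whichever snapshot is current.
    boolean contains(String word) {
        return words.contains(word);
    }

    // The reload thread builds a complete new snapshot on the side, then
    // publishes it with a single reference assignment, so readers are never
    // blocked and never observe a partially loaded dictionary.
    void reloadFrom(Collection<String> freshWords) {
        words = Set.copyOf(freshWords);
    }
}

This is the same pattern the reLoadFromMySQL method in step 4 follows when it swaps the _MainDict and _StopWords references on the singleton.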

1. Environment

ElasticSearch 8.1.0

MySQL 8.0.28

2. Steps to hot-update the dictionary

1. Create the "extended dictionary" and "extended stopword dictionary" tables in MySQL

# Extended stopword dictionary table
CREATE TABLE stop_words (word VARCHAR(200));
# Extended dictionary table
CREATE TABLE ext_words (word VARCHAR(200));
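Populating the tables is plain SQL. For example, to add the word used in the tests in section 3, and to remove it again later (either change is picked up on the next scheduled reload):

# Add a word to the extended dictionary
INSERT INTO ext_words (word) VALUES ('福雷尔卓德');
# Remove it again; the next full reload will drop it from the dictionary
DELETE FROM ext_words WHERE word = '福雷尔卓德';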

2. Download the IK tokenizer source package matching your ES version and open the project in IDEA

Download address: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v8.1.0

3. Add the following dependencies to pom.xml

# The MySQL connector is used by ES to pull the dictionary data from the MySQL database
        <!-- MySQL connector dependency -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.20</version>
        </dependency>

# Only the StrUtil.isNotBlank() method from Hutool's core module is used, to check whether a dictionary word is blank
        <!-- Hutool tool class -->
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-core</artifactId>
            <version>5.6.3</version>
        </dependency>

4. Modify the Dictionary class in the org.wltea.analyzer.dic package

4.1. Comment out the initial method in the Dictionary class

4.2. In place of the commented-out initial method, add the following code

/**
 * Dictionary initialization.
 * IK Analyzer initializes its dictionary through static methods of the Dictionary class,
 * so the dictionary is only loaded when the class is first used, which slows down the
 * first tokenization request. This method provides a way to initialize the dictionary
 * during the application loading phase instead.
 */
public static synchronized void initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {

                singleton = new Dictionary(cfg);
                singleton.loadMainDict();
                singleton.loadSurnameDict();
                singleton.loadQuantifierDict();
                singleton.loadSuffixDict();
                singleton.loadPrepDict();
                singleton.loadStopWordDict();

                if (cfg.isEnableRemoteDict()) {
                    // Start the monitor threads for the remote dictionaries
                    for (String location : singleton.getRemoteExtDictionarys()) {
                        // 10 is the initial delay, 60 the polling interval (both in seconds); adjust as needed
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                    for (String location : singleton.getRemoteExtStopWordDictionarys()) {
                        pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
                    }
                }

                new Thread(() -> {
                    Properties pro = new Properties();
                    try {
                        // Read the MySQL configuration from mysql.properties, which is created
                        // in the config directory under the IK plugin folder (see step 7)
                        pro.load(new FileInputStream(PathUtils.get(singleton.getDictRoot(), "mysql.properties").toFile()));
                    } catch (IOException e) {
                        e.printStackTrace();
                    }

                    while (true) {
                        try {
                            TimeUnit.SECONDS.sleep(5);
                            logger.info("Start loading from MySQL....");
                            // Open a MySQL connection with the url, user and password from the configuration file
                            try (Connection conn = DriverManager.getConnection(pro.getProperty("mysql.url"), pro.getProperty("mysql.user"), pro.getProperty("mysql.password"))) {
                                // Reload the dictionary from the database
                                reLoadFromMySQL(conn, pro);
                            }
                            logger.info("Loading from MySQL completed...");
                        } catch (Exception e) {
                            logger.error("load from mysql error..", e);
                        }
                    }
                }).start();
            }
        }
    }
}

// Register the MySQL JDBC driver
static {
    try {
        Class.forName("com.mysql.cj.jdbc.Driver");
    } catch (ClassNotFoundException ignored) {
    }
}

// Reload the dictionary from MySQL
private static void reLoadFromMySQL(Connection conn, Properties pro) throws Exception {
    logger.info("Reload dictionary from MySQL...start");
    // Build the new dictionary in a fresh instance so that reloading does not
    // disturb queries that are using the current dictionary
    Dictionary tmpDict = new Dictionary(getSingleton().configuration);
    tmpDict.configuration = getSingleton().configuration;
    // IK's own method for loading the main (and remote extended) dictionary
    tmpDict.loadMainDict();
    // Load the extended dictionary from MySQL
    tmpDict.reloadExtDictFromMySQL(conn);
    // IK's own method for loading the (remote) stopword dictionary
    tmpDict.loadStopWordDict();
    // Load the stopword dictionary from MySQL
    tmpDict.reloadStopDictFromMySQL(conn);
    // Swap the freshly built dictionaries into the live singleton
    getSingleton()._MainDict = tmpDict._MainDict;
    getSingleton()._StopWords = tmpDict._StopWords;
    logger.info("Dictionary reloading from MySQL completed...end");
}

// Load the stopword dictionary from MySQL
private void reloadStopDictFromMySQL(Connection conn) throws SQLException {
    try (Statement statement = conn.createStatement(); ResultSet rs = statement.executeQuery("select word from stop_words")) {
        while (rs.next()) {
            String word = rs.getString("word");
            if (StrUtil.isNotBlank(word)) {
                _StopWords.fillSegment(word.toCharArray());
            }
        }
    }
}

// Load the extended dictionary from MySQL
private void reloadExtDictFromMySQL(Connection conn) throws SQLException {
    try (Statement statement = conn.createStatement(); ResultSet rs = statement.executeQuery("select word from ext_words")) {
        while (rs.next()) {
            String word = rs.getString("word");
            if (StrUtil.isNotBlank(word)) {
                _MainDict.fillSegment(word.toCharArray());
            }
        }
    }
}

5. Package the plugin with Maven (for example, run mvn clean package -DskipTests in the project root; the built jar appears under target/)

6. Go to the IK plugin directory under the ElasticSearch root (e.g. plugins/elasticsearch-analysis-ik-8.1.0)

6.1. Delete the original elasticsearch-analysis-ik-8.1.0.jar

6.2. Add the elasticsearch-analysis-ik-8.1.0.jar built above

6.3. Add the MySQL connector driver jar (mysql-connector-java-8.0.20.jar)

6.4. Add the Hutool core jar (hutool-core-5.6.3.jar)

The plugin directory should now contain elasticsearch-analysis-ik-8.1.0.jar, mysql-connector-java-8.0.20.jar, and hutool-core-5.6.3.jar alongside the plugin's original files.

7. In the config directory under the IK plugin root, create a new mysql.properties file and a socketPolicy.policy file. The JDBC URL below points at a database named ik that holds the two tables from step 1; the policy file is needed because ES runs plugins under the Java security manager, which by default refuses to let plugin code open socket connections.

Configuration content:

mysql.properties:

mysql.url=jdbc:mysql://localhost:3306/ik?serverTimezone=UTC&characterEncoding=utf8&useUnicode=true&useSSL=false
mysql.user=root
mysql.password=123456

socketPolicy.policy:

grant {
   permission java.net.SocketPermission "*:*","accept,connect,resolve";
   permission java.lang.RuntimePermission "setContextClassLoader";
};

8. Enter the config folder in the ES root directory and edit the jvm.options file to add the following configuration

# Fix garbled Chinese output in the ES console (Windows)
-Dfile.encoding=GBK

# Absolute path to the socketPolicy.policy file
-Djava.security.policy=D:\software\ElasticSearch8\elasticsearch-8.1.0\plugins\elasticsearch-analysis-ik-8.1.0\config\socketPolicy.policy

9. Start ElasticSearch
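If everything is wired up correctly, the messages logged by the reload thread in step 4 should start appearing in the ES log every few seconds, along the lines of:

Start loading from MySQL....
Reload dictionary from MySQL...start
Dictionary reloading from MySQL completed...end
Loading from MySQL completed...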

3. Test

# 1. Use the IK tokenizer's ik_max_word granularity to run an analysis test on "福雷尔卓德" (Freljord)
GET /_analyze
{
    "text":"福雷尔卓德",
    "analyzer":"ik_max_word"
}


response:
{
    "tokens": [
        {
            "token": "福",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "雷",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "er",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "Zhuo",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "德",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 4
        }
    ]
}


# 2. Insert "福雷尔卓德" into the MySQL extended dictionary table ext_words (e.g. with the INSERT statement from step 1) and run the same analysis again (no ES restart needed)
GET /_analyze
{
    "text":"福雷尔卓德",
    "analyzer":"ik_max_word"
}


response:
{
    "tokens": [
        {
            "token": "Freljord",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}


# 3. Insert "福雷尔卓德" into the MySQL extended stopword table stop_words and run the test again (no ES restart needed)
GET /_analyze
{
    "text":"福雷尔卓德",
    "analyzer":"ik_max_word"
}


response:
{
    "tokens": []
}