2. Core kernel instance, IK tokenizer, Solr (stand-alone, cluster)

Table of Contents


1. Introduction, download and installation of Apache Solr

2. Core kernel instance, IK tokenizer, Solr (stand-alone, cluster)

3. Solr basic commands (start, stop, system information)

4. Solr’s solrconfig.xml configuration and managed-schema mode

5. Solr Admin UI operations (XML, JSON add|modify|delete|query index)

6. Configuring DataImport in Solr to import index data, and IK word-segmentation queries

7. Use Solr in Java, historical version (after 7.0.0, 5.0.0~6.6.6, before 4.10.4)

8. Traditional Spring integration with Solr

9. Spring Boot integrates Solr


Core kernel instance, IK tokenizer, Solr (stand-alone, cluster)

  • Table of contents
  • Core kernel instance, IK tokenizer, Solr (stand-alone, cluster)
    • Create a core kernel instance (two types)
      • 1. Use the solr create -c name command
      • 2. Create a core directly using the Admin UI page
    • Tokenizer (IKAnalyzer)
      • Download
    • Stand-alone Solr
      • 1. Put the jar package in the Solr installation directory
      • 2. Configure Solr’s managed-schema and add ik tokenizer
      • 3. Start the Solr service to test the tokenizer
    • Solr-Cloud (cluster)
      • 1. Put the jar package in the Solr installation directory
      • 2. Put ik.conf and dynamicdic.txt into the solr configuration folder
      • 3. Configure Solr’s managed-schema and add ik tokenizer
      • 4. Test word segmentation
      • 5. Test the dynamic dictionary

Core kernel instance, IK tokenizer, Solr (stand-alone, cluster)

Create a core kernel instance (two types)

Simply put, a core is an instance of Solr. A single Solr service can host multiple cores, and each core has its own index library and corresponding configuration files. A core must therefore be created before you can build an index in Solr, because indexes are stored inside a core.
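For orientation, a minimal sketch of what a core directory looks like on disk (names follow Solr's defaults; new_core_one is just an example core name):

server/solr/new_core_one/
    core.properties   marks this directory as a core
    conf/             configuration files (solrconfig.xml, managed-schema, ...)
    data/             the index library of this core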

1. Use the solr create -c name command

Creates a core for indexing and searching.

The new core uses a data-driven (schemaless) configuration that tries to guess the correct field type when a document is added to the index.

Execute solr create -c name in the bin directory to create the core; by default it is placed under server/solr in the Solr installation directory.

command                      description
bin/solr create -c <name>    Create a core
bin/solr create -help        See all available options for creating a new core
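For example, run from the Solr installation directory (new_core_one is just a sample core name):

bin/solr create -c new_core_one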

2. Create a core directly using the Admin UI page

If an error is reported when creating the core this way, the cause is a missing solrconfig.xml configuration file.

Solution: copy the files under example\example-DIH\solr\solr to the new_core_one directory (a sketch follows).
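A minimal sketch of that copy on Windows (paths are relative to the Solr installation directory; the target server\solr\new_core_one is an assumption based on the default core location):

xcopy /E /I example\example-DIH\solr\solr\conf server\solr\new_core_one\conf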

Next, restart the Solr service in the previously launched cmd window by entering the following command in the console:

solr restart -p 8983

A set of data files will be generated automatically after accessing the Admin UI.

Tokenizer (IKAnalyzer)

IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java.

Download

Solr 8.x requires a newer version to be downloaded; the old 2012 release does not work with it.
GitHub address: https://github.com/magese/ik-analyzer-solr

<!-- Maven repository coordinates -->
<dependency>
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.3.0</version>
</dependency>

Stand-alone Solr

1. Put the jar package in the Solr installation directory

Put the jar package into the webapp/WEB-INF/lib/ directory of the Jetty or Tomcat serving Solr, i.e. under \server\solr-webapp\webapp\WEB-INF\lib in the Solr installation directory.

Put the five configuration files from the resources directory into the webapp/WEB-INF/classes/ directory of the Jetty or Tomcat serving Solr:
① IKAnalyzer.cfg.xml ② ext.dic ③ stopword.dic ④ ik.conf ⑤ dynamicdic.txt

These files already exist inside the jar package, so the step above is not required.

Document Description
(1) IKAnalyzer.cfg.xml configuration file

name           type     description                                                         default
use_main_dict  boolean  Whether to use the default main dictionary                          true
ext_dict       String   Extended dictionary file names, multiple separated by semicolons    ext.dic;
ext_stopwords  String   Stop-word dictionary file names, multiple separated by semicolons   stopword.dic;
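For reference, a sketch of what IKAnalyzer.cfg.xml typically looks like, assuming the standard IK Java-properties XML format with the default values from the table above:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- whether to load the built-in main dictionary -->
    <entry key="use_main_dict">true</entry>
    <!-- extended dictionaries, multiple file names separated by semicolons -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- stop-word dictionaries, multiple file names separated by semicolons -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>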

(2) ik.conf file

files = dynamicdic.txt
lastupdate = 0
Configuration description

files       The list of dynamic dictionaries; multiple dictionary files can be set, separated by commas. The default dynamic dictionary is dynamicdic.txt.
lastupdate  Defaults to 0; increment it every time you modify the dynamic dictionary, otherwise the new words will not be loaded into memory. lastupdate was originally an int, so timestamps were not supported unless you changed the int in the source code to long; as of 2018-08-23 the source code uses a long for lastUpdate, so a timestamp can now be used.

(3) dynamicdic.txt, the dynamic dictionary
Words configured in this file can be loaded into memory without restarting the service. Lines starting with # are treated as comments and will not be loaded into memory.
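As a sketch, a dynamicdic.txt update might look like this (the words are placeholder examples, not from the source):

# a comment line, not loaded into memory
深度学习
云原生

After adding the words, bump lastupdate in ik.conf (for example from 0 to 1, or to a current timestamp) so the new entries are loaded.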

2. Configure Solr’s managed-schema, add ik tokenizer

In the created core instance, modify the managed-schema configuration file and add the following code to the file:

<!-- IK tokenizer -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

3. Start the Solr service to test the tokenizer

solr start
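Once the service is up, the tokenizer can be tried on the Analysis page of the Admin UI (choose the text_ik field type), or against Solr's built-in field-analysis handler; a sketch, assuming the core is named new_core_one and using the URL-encoded phrase 中华人民共和国:

curl "http://localhost:8983/solr/new_core_one/analysis/field?analysis.fieldtype=text_ik&analysis.fieldvalue=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD"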

Solr-Cloud (cluster)

Because the configuration files in Solr-Cloud are managed by ZooKeeper, the dynamic dictionary files should also be uploaded to ZooKeeper to make updating the dictionary easy; keep them in the same directory as Solr's configuration files.

Note: Because a configuration file in ZooKeeper cannot exceed 1 MB, the dictionary must be split into multiple files when the dictionary list grows too large.

1. Put the jar package in the Solr installation directory

Put the jar package into the webapp/WEB-INF/lib/ directory of the Jetty or Tomcat serving Solr on every server;
Put IKAnalyzer.cfg.xml, ext.dic, and stopword.dic from the resources directory into the webapp/WEB-INF/classes/ directory of the Jetty or Tomcat serving Solr:
① IKAnalyzer.cfg.xml (IK's default configuration file, used to configure the built-in extended dictionary and stop-word dictionary)
② ext.dic (the default extended dictionary)
③ stopword.dic (the default stop-word dictionary)
Note: Unlike the stand-alone setup, do not put ik.conf and dynamicdic.txt in the classes directory.

2. Put ik.conf and dynamicdic.txt into the Solr configuration folder

Put ik.conf and dynamicdic.txt from the resources directory into the Solr configuration folder, in the same directory as Solr's managed-schema file.
① ik.conf (dynamic dictionary configuration file)

Configuration description

files       The list of dynamic dictionaries; multiple dictionary files can be set, separated by commas. The default dynamic dictionary is dynamicdic.txt.
lastupdate  Defaults to 0; change the value after every modification of the dynamic dictionary, and make it greater than the previous value, otherwise the new words will not be loaded into memory.

② dynamicdic.txt
The default dynamic dictionary. Words configured in this file can be loaded into memory without restarting the service; lines starting with # are treated as comments and will not be loaded into memory.
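A sketch of uploading the whole configuration directory with Solr's built-in ZooKeeper tooling (the ZooKeeper address localhost:2181, the config set name myconf, and the local path are assumptions; substitute your cluster's values):

bin/solr zk upconfig -z localhost:2181 -n myconf -d /path/to/conf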

3. Configure Solr’s managed-schema, add ik tokenizer

The schema tells Solr how to build an index. Its configuration revolves around a schema configuration file, which determines how Solr builds the index: the data type of each field, the word-segmentation method, and so on. In old versions the schema file was named schema.xml and was edited by hand. In new versions it is named managed-schema and is no longer meant to be edited manually; instead it is configured through the Schema API. The official explanation is that after modifying managed-schema through the Schema API there is no need to reload the core or restart Solr, which is better suited to maintenance in a production environment. If you change the configuration by hand without reloading the core, the changes may be lost.

<!-- ik tokenizer -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
      <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
      <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Upload the configuration file to ZooKeeper; on first use, restart the service or reload the collection.
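Reloading can be done through the Collections API; a sketch, assuming a collection named mycollection:

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"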

4. Test word segmentation

At this point the dynamic dictionary file is empty and lastupdate in the configuration file is 0.

Test the word segmentation.

5. Test the dynamic dictionary

Add words to the dynamic dictionary and upload it to ZooKeeper; then modify the configuration file (increase lastupdate) and upload it to ZooKeeper as well.
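A sketch of pushing the two updated files back to ZooKeeper with solr zk cp (the config set name myconf and the ZooKeeper address are assumptions):

bin/solr zk cp file:dynamicdic.txt zk:/configs/myconf/dynamicdic.txt -z localhost:2181
bin/solr zk cp file:ik.conf zk:/configs/myconf/ik.conf -z localhost:2181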

Test the word segmentation again.
