How does Elasticsearch implement full-text content retrieval in Word, PDF, and TXT?

Foreword

  1. Support file upload and download.

  2. Support keyword search over the text inside the files; the supported file types must include Word, PDF, and TXT.

File upload and download are relatively simple. The hard part is retrieving the text inside the files as accurately as possible, which involves many considerations. In the end I decided to use Elasticsearch.

Since I was preparing to look for a job and brushing up my skills, I noticed that many interviewers ask about Elasticsearch, and at the time I didn't even know what it was, so I decided to try something new. I have to say that Elasticsearch releases new versions really quickly: I was using 7.9.1 just a few days ago, and 7.9.2 came out on the 25th.

Introduction to Elasticsearch

Elasticsearch is an open-source document search engine. Roughly speaking, you send it keywords through a REST request and it returns the matching content. It's that simple.

Elasticsearch encapsulates Lucene, an open-source full-text search engine library from the Apache Software Foundation. Calling Lucene directly is fairly complicated, so Elasticsearch wraps it and adds higher-level features such as distributed storage.

There are many plug-ins based on Elasticsearch. There are two main ones I used this time, one is kibana and the other is Elasticsearch-head.

  • kibana is mainly used to construct requests; it provides a lot of auto-completion.

  • Elasticsearch-head is mainly used to visualize Elasticsearch.

Development environment

First install Elasticsearch, Elasticsearch-head, and kibana. All three work out of the box: just double-click to run. Note that the kibana version must match the Elasticsearch version.

Elasticsearch-head is a visual interface for Elasticsearch. Elasticsearch is operated through a REST-style API; with a visual interface you don't have to hand-write a GET request for every query, which improves development efficiency.

Elasticsearch-head is developed with Node.js. You may run into a cross-domain problem during installation: Elasticsearch's default port is 9200 while Elasticsearch-head's is 9100, so the Elasticsearch configuration file needs to be changed. I won't go into the details here; a search engine will turn them up.
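For reference, the usual fix (an assumption on my part, not spelled out in the original article; check the docs for your version) is to enable CORS in config/elasticsearch.yml:

# config/elasticsearch.yml, allow elasticsearch-head (port 9100) to call port 9200
http.cors.enabled: true
http.cors.allow-origin: "*"

After restarting Elasticsearch, elasticsearch-head should be able to connect.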

After Elasticsearch is installed, access port 9200 and an interface like the following will appear.

[Figure: default JSON response from Elasticsearch on port 9200]
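The same check can be done from a terminal; any HTTP client works:

# returns the cluster name and version banner shown above
curl http://localhost:9200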

Core issues

There are two core issues to solve: uploading files and querying by keyword.

File upload

First of all, plain-text TXT is simple: just pass the content in directly. But PDF and Word are special formats that contain a lot of information besides the text, such as images and PDF tags, so the files need preprocessing.

Elasticsearch 5.x and later provides a feature called ingest node, which can preprocess incoming documents. As shown in the figure, when a PUT request comes in, Elasticsearch first checks whether a pipeline is specified; if so, the document passes through the ingest node for preprocessing before it is formally indexed.

[Figure: indexing flow, where a request with a pipeline passes through the ingest node before normal indexing]

The Ingest Attachment Processor Plugin is a text-extraction plug-in. It builds on the ingest node feature of Elasticsearch and provides a key preprocessor called attachment. Run the following command in the installation directory to install it.

./bin/elasticsearch-plugin install ingest-attachment

Define the text extraction pipeline
PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}

The field to extract from is specified as content in the attachment processor, so when writing to Elasticsearch the document body must be placed in the content field (the remove processor then deletes the raw Base64 content once the text has been extracted).

The running result is as shown in the figure:

[Figure: acknowledged response from creating the attachment pipeline]
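Before indexing real documents, the pipeline can be verified with Elasticsearch's simulate API (a quick check I'm adding here; "SGVsbG8gV29ybGQ=" is simply "Hello World" in Base64):

POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "SGVsbG8gV29ybGQ="
      }
    }
  ]
}

The response should contain an attachment object with the extracted content plus metadata such as content_type.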

Establish document structure mapping

We need to define a document structure mapping to determine the form in which text files are stored after going through the preprocessor. PUT-ing a mapping automatically creates the index, so we first create a docwrite index for testing.

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "type":{
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

The attachment field holds what Elasticsearch attaches automatically after the pipeline named attachment extracts the text from the document. It is a nested field with multiple subfields, including the extracted text content and some document metadata.

We also specify the ik_max_word analyzer for the file name, so that Elasticsearch performs Chinese word segmentation on it when building the full-text index.

[Figure: acknowledged response from creating the docwrite index]

Test

After the above two steps, let's run a simple test. Because Elasticsearch is a document database based on JSON, an attachment document must be Base64-encoded before it is inserted. First convert a PDF file into Base64 text, for example with an online PDF to Base64 converter.
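Alternatively, the encoding can be done on the command line (assuming the GNU coreutils base64 tool; -w 0 disables line wrapping):

base64 -w 0 test.pdf > test-base64.txt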

The test document is as shown in the figure:

[Figure: the test PDF document]

Then upload it with the following request. I used a fairly large PDF file. The key point is to specify the pipeline we just created; the result is shown in the figure.
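The request looks roughly like this (a reconstruction, since the original shows only a screenshot; the document id and the Base64 payload are placeholders):

PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "test.pdf",
  "type": "pdf",
  "content": "...Base64-encoded file content..."
}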

[Figure: index response for the uploaded PDF]

Earlier versions of Elasticsearch had index types; types are deprecated in newer versions, and the default type is _doc.

Then we use a GET request to check whether our document was uploaded successfully. You can see that it has been parsed.

[Figure: GET response showing the extracted attachment fields]

If no pipeline is specified, the document is not parsed.

[Figure: GET response for a document indexed without the pipeline, with no extracted text]

From the results we can see that the PDF file passed through our custom pipeline and then formally entered the docwrite index.

Keyword query

Keyword query means that the input text must first be segmented into words. For example, the string "database, computer network, my computer" should be split into three keywords, "database", "computer network", and "my computer", and the query then runs on those keywords.

Elasticsearch's built-in analyzer supports all Unicode characters, but for Chinese it simply splits character by character. For example, the four characters of "imported red wine" (进口红酒) are split into the four single characters 进 ("import"), 口 ("mouth"), 红 ("red"), and 酒 ("wine"), so the query results also include matches for 进口 ("import"), 口红 ("lipstick"), and 红酒 ("red wine").

[Figure: search results polluted by single-character matches]

This is not the result we want. We want it split into just two terms, 进口 ("imported") and 红酒 ("red wine"), and the query run on those. That requires an analyzer that supports Chinese.

ik word segmenter

The ik word segmenter is a popular open-source Chinese word segmentation plug-in. Install it first; note that the command below cannot be used as-is, you need to find the release matching your Elasticsearch version.

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...find your version here

ik word segmenter includes two modes.

  1. ik_max_word splits Chinese into as many words as possible.

  2. ik_smart splits according to common usage; for example, "imported red wine" (进口红酒) is split into 进口 ("imported") and 红酒 ("red wine"). The two modes can be compared with the _analyze API, as shown below.
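For example (my sketch, runnable in kibana; swap in ik_max_word to see the finer-grained splitting):

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}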

[Figure: ik tokenization of the example phrase]

We use the ik word segmenter when querying documents. For example, searching the test document we inserted in ik_smart mode gives the results shown in the figure.

GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Experiment 1",
        "analyzer": "ik_smart"
      }
    }
  }
}

[Figure: search hits returned for the ik_smart query]

We can enable highlighting in Elasticsearch to wrap the matched text in tags; the tags are then inserted before and after each match, as shown in the figure.
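In the query DSL this looks roughly as follows (a sketch; the <em> tags mirror the ones used in the Java code later):

GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Experiment 1",
        "analyzer": "ik_smart"
      }
    }
  },
  "highlight": {
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "attachment.content": {}
    }
  }
}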

[Figure: highlighted fragments in the search response]

Coding

The development environment is IDEA + Maven. First import the dependencies; they must match your Elasticsearch version.

Import dependencies

Elasticsearch has two Java APIs. We use the relatively well-encapsulated high-level REST client.

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.9.1</version>
</dependency>
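The code below uses a client field. A minimal way to construct it (my sketch, assuming a local single-node cluster, which the original article doesn't show):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

//connect to the local Elasticsearch instance on the default port
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));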

File upload

First create a FileObj class corresponding to the mapping above:

public class FileObj {
    String id;      //file id
    String name;    //file name
    String type;    //file type: pdf, word, or txt
    String content; //the file content, Base64-encoded
    //getters and setters omitted (the code below uses setName, setType, setContent)
}

As described above, we first read the file as a byte array and then convert it to Base64.

public FileObj readFile(String path) throws IOException {
    //read the file
    File file = new File(path);
    
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    
    //read the whole file into a byte array (java.nio.file.Files)
    byte[] bytes = Files.readAllBytes(file.toPath());
    
    //convert the file content to Base64 encoding
    String base64 = Base64.getEncoder().encodeToString(bytes);
    fileObj.setContent(base64);
    
    return fileObj;
}

java.util.Base64 already provides a ready-made encoder: Base64.getEncoder().encodeToString.

Next, we can upload the file through the Elasticsearch API.

The upload uses an IndexRequest object. FastJson converts fileObj to JSON before uploading, and indexRequest.setPipeline specifies the pipeline defined above. This way the file is preprocessed by the pipeline and then enters the fileindex index.

public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");
    
    //while uploading, use the attachment pipeline to extract the file text
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment"); //must match the pipeline name defined earlier
    
    IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(indexResponse);
}

File query

File query uses the SearchRequest object. First, specify that our keywords are segmented with the ik word segmenter in ik_smart mode.

//search the same index the files were uploaded to
SearchRequest searchRequest = new SearchRequest("fileindex");
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);

Then we can take each hit from the returned response object and read out the matched content.

SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
SearchHits hits = response.getHits();
Iterator<SearchHit> iterator = hits.iterator();
while (iterator.hasNext()) {
    SearchHit hit = iterator.next();
    System.out.println(hit.getSourceAsString()); //the stored document as JSON
}

A very powerful feature of Elasticsearch is highlighting, so we set up a highlighter to highlight the matched text.

HighlightBuilder highlightBuilder = new HighlightBuilder();
HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("attachment.content");
highlightBuilder.field(highlightContent);
//wrap every match in <em></em> tags
highlightBuilder.preTags("<em>");
highlightBuilder.postTags("</em>");
srb.highlighter(highlightBuilder);

The tags set above wrap the matches, so the highlighted fragments come back in the query results wrapped in <em></em>. Reading them out is sketched below.
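Reading the fragments out of each hit looks roughly like this (a sketch using the client's getHighlightFields API; Text is org.elasticsearch.common.text.Text):

//inside the while loop over the hits shown earlier
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
HighlightField contentHighlight = highlightFields.get("attachment.content");
if (contentHighlight != null) {
    //each fragment is a text snippet with <em></em> around the matches
    for (Text fragment : contentHighlight.fragments()) {
        System.out.println(fragment.string());
    }
}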

Multiple file testing

A simple demo is now written, but the effect needs to be tested with multiple files. This is one of my test folders, with various types of files inside.

[Figure: the test folder containing various file types]

After uploading all the files in this folder, view the imported documents in the elasticsearch-head visual interface.

[Figure: imported documents shown in elasticsearch-head]

Search code:

/**
 * Queries the index with the entered keywords and returns the matching results.
 * @throws IOException
 */
@Test
public void fileSearchTest() throws IOException {
    ElasticOperation elo = eloFactory.generate();

    elo.search("Database State Council Computer Network");
}

Run our demo and the query results are as shown in the figure.

[Figure: query results from the multi-file test]

Some problems that still exist

1. File length issue

Testing showed that for files whose text exceeds 100,000 characters, Elasticsearch keeps only the first 100,000 and truncates the rest. This is most likely the attachment processor's indexed_chars setting, which defaults to 100000; supporting longer texts needs further investigation, as sketched below.
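If the full text is needed, the limit can likely be lifted when defining the pipeline via the attachment processor's indexed_chars option (-1 means no limit, at the cost of memory). This is my addition, not something the original article tested:

PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "indexed_chars": -1,
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}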

2. Some coding issues

In my code, every file is read fully into memory before any processing happens. This will obviously cause problems, for example with a single file larger than memory or with several large files at once. In a real production environment, file uploads would also take up a considerable share of the server's memory and bandwidth, so further optimization based on the specific requirements is needed.

Author: HENG_Blog

https://www.cnblogs.com/strongchenyu/p/13777596.html
