Foreword
- Support file upload and download.
- Support searching for files by keyword, which requires full-text search over the file contents; the supported file types are Word, PDF, and txt.

File upload and download are relatively simple. The hard part is extracting the text from the files as accurately as possible, which involves many considerations. For this I decided to use Elasticsearch.
Because I was preparing to look for a job and brushing up my skills, I noticed that many interviewers asked about Elasticsearch. I didn't even know what Elasticsearch was at the time, so I decided to try something new. I have to say that Elasticsearch versions are released really quickly: I was using 7.9.1 just a few days ago, and 7.9.2 came out on the 25th.
Introduction to Elasticsearch
Elasticsearch is an open source document search engine. Roughly speaking, you send it a keyword via a REST request and it returns the matching content. It's that simple.

Elasticsearch wraps Lucene, an open source full-text search engine toolkit from the Apache Software Foundation. Lucene is relatively complicated to call directly, so Elasticsearch encapsulates it and provides higher-level features such as distributed storage.
There are many plug-ins built on Elasticsearch. I mainly used two of them this time: kibana and Elasticsearch-head.

- kibana is mainly used to construct requests; it provides a lot of auto-completion.
- Elasticsearch-head is mainly used to visualize Elasticsearch.
Development environment
First install Elasticsearch, Elasticsearch-head, and kibana. All three work out of the box: just double-click to run. Note that the kibana version must match the Elasticsearch version.
Elasticsearch-head is the visual interface for Elasticsearch. Elasticsearch is operated through a REST-style API; with a visual interface you don't have to issue a GET request for every query, which improves development efficiency.

Elasticsearch-head is developed in node.js, and you may run into cross-origin problems during installation: Elasticsearch listens on port 9200 by default, while Elasticsearch-head listens on port 9100, so you need to adjust the configuration file. I won't go into the details; a search engine will tell you how.
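For reference, the usual fix is to enable CORS in Elasticsearch's config/elasticsearch.yml. Allowing all origins as below is fine for local development but too permissive for production:

```
http.cors.enabled: true
http.cors.allow-origin: "*"
```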
After Elasticsearch is installed, visit its port and the following interface appears.
Core issues
There are two core problems to solve: file upload and keyword queries.
File upload
Plain-text txt files are simple: just pass the content in directly. But PDF and Word files contain a lot of irrelevant information besides the text, such as images and PDF tags, so these files need preprocessing.
Elasticsearch 5.x and later provides a feature called the ingest node, which can preprocess incoming documents. As shown in the figure, when a PUT request arrives, Elasticsearch first checks whether a pipeline is specified; if so, the document goes through the ingest node for preprocessing before it is formally indexed.
The Ingest Attachment Processor Plugin is a text extraction plug-in. It essentially uses the ingest node feature of Elasticsearch to provide a key preprocessor called attachment. Run the following command in the installation directory to install it.
./bin/elasticsearch-plugin install ingest-attachment
Define text extraction pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}
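Before using the pipeline for real uploads, you can sanity-check it with the _simulate endpoint. The sample payload below is just an assumption for illustration (it is the Base64 encoding of "this is a test"):

```
POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "dGhpcyBpcyBhIHRlc3Q="
      }
    }
  ]
}
```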
The field that attachment extracts from is specified as content, so when writing to Elasticsearch you need to put the document content in the content field.
The running result is as shown in the figure:
Establish document structure mapping
We need to establish a document structure mapping to define how text files are stored after passing through the preprocessor. PUT-ing a document structure mapping automatically creates an index, so we first create an index called docwrite for testing.
PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "type": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}
The attachment field is filled in automatically after the pipeline named attachment extracts the text from the document attachment. It is a nested field containing several subfields, including the extracted text content and some document metadata.
We also specify the analyzer ik_max_word for the file name, so that Elasticsearch performs Chinese word segmentation on it when building the full-text index.
Test
After the two steps above, we run a simple test. Because Elasticsearch is a JSON-based document database, attachment documents must be Base64-encoded before being inserted into Elasticsearch. First convert a PDF file into Base64 text through the website below: PDF to Base64.
The test document is as shown in the figure:
Then upload it with the following request. I picked a fairly large PDF file. Note that you must specify the pipeline we just created; the result is as shown in the figure.
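For reference, the upload request has roughly this shape; the document id and the Base64 payload are placeholders, and the pipeline query parameter points at the pipeline defined earlier:

```
PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "test.pdf",
  "type": "pdf",
  "content": "...base64-encoded file content..."
}
```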
The original index had a type field, which is deprecated in newer versions; the default is now _doc.
Then we use a GET request to check whether our document was uploaded successfully. You can see that it has been parsed successfully.
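The check itself is just a document GET against the index (id 1 here assumes the id used during upload):

```
GET /docwrite/_doc/1
```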
If the pipeline is not specified, the document will not be parsed.
From the results we can see that our PDF file went through our custom pipeline and then formally entered the docwrite index.
Keyword query
Keyword query means the input text must first be segmented into words. For example, the string "database computer network my computer" should be split into the three keywords "database", "computer network", and "my computer", and the query then runs on those keywords.
Elasticsearch ships with a tokenizer that supports all Unicode characters, but for Chinese it produces the most fine-grained split. For example, the four characters of "进口红酒" (imported red wine) are split into the four single-character tokens "进" (enter), "口" (mouth), "红" (red), and "酒" (wine), so the query results will also match "进口" (import), "口红" (lipstick), and "红酒" (red wine).

This is not the result we want. We want it split into just two tokens, "进口" (import) and "红酒" (red wine), with the query matching accordingly. This requires a tokenizer that supports Chinese.
ik analyzer

The ik analyzer is a popular Chinese word segmentation plug-in from the open source community. First install it. Note that the command below cannot be used as-is: find the release matching your Elasticsearch version.
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...find your version here
The ik analyzer provides two modes:

- ik_max_word splits Chinese text into as many words as possible.
- ik_smart splits according to common usage. For example, "进口红酒" (imported red wine) is split into "进口" (import) and "红酒" (red wine).
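You can compare the two modes directly with the _analyze endpoint, for example:

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```

Swapping the analyzer to ik_max_word shows the finer-grained split for the same text.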
We use the ik analyzer when querying documents. For example, for the test document inserted above, we search in ik_smart mode, and the results are as shown in the figure.
GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Experiment 1",
        "analyzer": "ik_smart"
      }
    }
  }
}
We can enable highlighting in Elasticsearch to add tags around the matched text; tags are then inserted before and after each matched fragment, as shown in the figure.
Coding

The development environment is IDEA + Maven. First import the dependency, which must match the Elasticsearch version.
Import dependencies
Elasticsearch provides two APIs for Java; we use the better-encapsulated high-level REST client.
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.9.1</version>
</dependency>
File upload
First create a FileObj class corresponding to the mapping above.
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // file type: pdf, word, or txt
    String content; // file content as a Base64-encoded string
}
As described above, we first read the file as a byte array and then convert it into Base64 encoding.
public FileObj readFile(String path) throws IOException {
    // read the file
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    byte[] bytes = Files.readAllBytes(file.toPath());
    // convert the file content to Base64 encoding
    String base64 = Base64.getEncoder().encodeToString(bytes);
    fileObj.setContent(base64);
    return fileObj;
}
java.util.Base64 already provides a ready-made encoder, Base64.getEncoder().encodeToString, for us to use.
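As a minimal standalone sketch of just the encoding step (the sample string stands in for the file's bytes):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        // encode raw bytes (in the real code, the file's bytes) as a Base64 string
        byte[] bytes = "hello elasticsearch".getBytes(StandardCharsets.UTF_8);
        String base64 = Base64.getEncoder().encodeToString(bytes);
        System.out.println(base64);

        // decoding restores the original bytes
        String restored = new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
        System.out.println(restored); // prints "hello elasticsearch"
    }
}
```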
Next, we can use the Elasticsearch API to upload the file.
The upload uses an IndexRequest object. Convert fileObj to JSON with FastJson before uploading, and call indexRequest.setPipeline to specify the pipeline we defined above. This way the file is preprocessed through the pipeline before entering the fileindex index.
public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    // use the attachment pipeline defined earlier to extract the file text during indexing
    indexRequest.setPipeline("attachment");
    IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(indexResponse);
}
File query
File query uses a SearchRequest object. First, specify the ik_smart mode of the ik analyzer for segmenting our keywords.
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);
Then we can iterate over each hit in the returned Response object and read the returned content.
Iterator<SearchHit> iterator = hits.iterator();
int count = 0;
while (iterator.hasNext()) {
    SearchHit hit = iterator.next();
    // process each hit here, e.g. read hit.getSourceAsMap()
}
A very powerful feature of Elasticsearch is highlighting, so we can set up a highlighter to highlight the matched text.
HighlightBuilder highlightBuilder = new HighlightBuilder();
HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("attachment.content");
highlightContent.highlighterType("unified"); // explicitly use the default unified highlighter
highlightBuilder.field(highlightContent);
highlightBuilder.preTags("<em>");
highlightBuilder.postTags("</em>");
srb.highlighter(highlightBuilder);
The pre and post tags set above wrap the matched text, so the corresponding fragments in the query results will be enclosed in them.
Multiple file testing
A simple demo is now written, but the effect needs to be tested with multiple files. This is one of my test folders, with various types of files in it.

After uploading all the files in this folder, use the Elasticsearch-head visual interface to view the imported files.
Search code:
/**
 * Queries the index with the entered keywords and returns the matching results.
 * @throws IOException
 */
@Test
public void fileSearchTest() throws IOException {
    ElasticOperation elo = eloFactory.generate();
    elo.search("Database State Council Computer Network");
}
Run our demo and the query results are as shown in the figure.
Some problems that still exist
1. File length issue
Through testing, I found that for files whose text exceeds 100,000 characters, elasticsearch keeps only the first 100,000 characters and truncates the rest. Supporting longer texts requires a closer look at how Elasticsearch handles large documents.
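The truncation comes from the attachment processor's indexed_chars option, which defaults to 100,000 characters; setting it to -1 removes the limit (at the cost of higher memory use during extraction). The pipeline could be redefined like this:

```
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information without the length limit",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "indexed_chars": -1,
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}
```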
2. Some coding issues
In my code, files are read entirely into memory before processing. This will undoubtedly cause problems, for example with a single file larger than memory, or with several large files at once. In a real production environment, file upload will also consume a considerable share of the server's memory and bandwidth, so further optimization is needed based on specific requirements.
Author: HENG_Blog
https://www.cnblogs.com/strongchenyu/p/13777596.html