Contents
1. System environment and software requirements
2. Software description
3. Install the plugin
4. Define the text extraction pipeline
5. Create the index and set the document structure mapping
6. Insert documents
7. Query documents
8. Other notes
The requirement is to process the content of local emails and their attachments (PDF, Excel, Word, etc.) and save them to Elasticsearch, enabling full-text search over email bodies and attachments.
1. System environment and software requirements
System: CentOS 7.3
Elasticsearch version: 7.13.3
Kibana version: 7.16.3
ingest-attachment plugin version: 7.13.3
2. Software description
Kibana is an open-source analytics and visualization platform designed to work with Elasticsearch. Here it is used mainly to run commands in Kibana's Dev Tools console.
ingest-attachment is an out-of-the-box Elasticsearch plugin that writes files in common formats into an index as attachments. It uses Apache Tika to extract text; supported formats include TXT, DOC, PPT, XLS, and PDF, so text extraction and import can be automated. Note: the source field must contain base64-encoded binary data.
Limitation: for XLS and XLSX files, individual sheets cannot be indexed separately; the whole file can only be inserted into Elasticsearch as a single document.
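Since the plugin expects base64-encoded binary in the source field, a minimal sketch (Python standard library only) of preparing attachment bytes looks like this:

```python
import base64

def encode_attachment(data: bytes) -> str:
    """Return the base64 string that the ingest-attachment source field expects."""
    return base64.b64encode(data).decode("ascii")

# Example: encode a small byte string as it would be read from a file.
print(encode_attachment(b"hello"))  # aGVsbG8=
```

In practice the bytes would come from reading the attachment file in binary mode (`open(path, "rb").read()`).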
3. Install the plugin
Here ingest-attachment is installed offline: download the offline package matching the Elasticsearch version with wget.
wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.13.3.zip
Upload it to a directory on the server:
/home/es/install/ingest-attachment-7.13.3.zip
Enter the ES_HOME directory and run the following commands to install it:
cd /home/elasticsearch/
./bin/elasticsearch-plugin install file:///home/es/install/ingest-attachment-7.13.3.zip
Restart the Elasticsearch service after the installation completes; the plugin is then ready to use.
4. Define the text extraction pipeline
Execute the following in Kibana's Dev Tools.
An email here may carry multiple attachments, so the pipeline is defined for multiple attachments, and it is configured to remove the base64 binary data after processing.
Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the correct fields will not be matched.
PUT _ingest/pipeline/multiple_attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach" : {
        "field" : "attachments",
        "processor" : {
          "attachment" : {
            "target_field" : "_ingest._value.attachment",
            "field" : "_ingest._value.content"
          }
        }
      }
    },
    {
      "foreach" : {
        "field" : "attachments",
        "processor" : {
          "remove" : {
            "field" : "_ingest._value.content"
          }
        }
      }
    }
  ]
}
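When the pipeline is created from application code rather than Dev Tools, the same body can be built programmatically, which keeps the two foreach fields consistent. A sketch (pure dict construction; sending it to the cluster, e.g. with curl or an HTTP client, is left out):

```python
import json

# Build the multiple_attachment pipeline body; the structure mirrors
# the PUT _ingest/pipeline/multiple_attachment request above.
pipeline = {
    "description": "Extract attachment information from arrays",
    "processors": [
        {"foreach": {
            "field": "attachments",
            "processor": {"attachment": {
                "field": "_ingest._value.content",
                "target_field": "_ingest._value.attachment"}}}},
        {"foreach": {
            "field": "attachments",
            "processor": {"remove": {"field": "_ingest._value.content"}}}},
    ],
}
print(json.dumps(pipeline, indent=2))
```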
Pipeline parameters of the ingest-attachment plugin:
| Name | Required | Default | Description |
| --- | --- | --- | --- |
| field | yes | – | The field from which to read the base64-encoded data |
| target_field | no | attachment | The field that holds the extracted attachment information; mainly relevant for multiple attachments |
| indexed_chars | no | 100000 | Maximum number of characters to extract and store in the field; -1 for unlimited |
| indexed_chars_field | no | – | A field in the document whose value overrides indexed_chars |
| properties | no | all properties | The properties to store, e.g. content, title, name, author, keywords, date, content_type, content_length, language |
| ignore_missing | no | false | If true and the field does not exist, the document is written through unchanged; otherwise an error is raised |
5. Create the index and set the document structure mapping
PUT mail
{
  "settings": {
    "index": {
      "max_result_window": 100000000
    },
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "mfrom": { "type": "keyword" },
      "mto": { "type": "keyword" },
      "mcc": { "type": "keyword" },
      "mbcc": { "type": "keyword" },
      "rcvtime": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
      "subject": { "type": "keyword" },
      "importance": { "type": "keyword" },
      "savepath": { "type": "keyword" },
      "mbody": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              },
              "filename": { "type": "keyword" },
              "type": { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}
On success, the following is returned:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mail"
}
6. Insert documents
You can use Postman to call Elasticsearch's REST API to insert or update documents.
Request type: POST
Request address: http://192.168.31.200:9200/mail/_doc?pipeline=multiple_attachment
In the URL, mail is the index name, and pipeline=multiple_attachment specifies that the multiple_attachment pipeline should be applied.
The request body content is in JSON format:
{
  "mfrom": "[email protected]",
  "mto": "[email protected]",
  "mcc": "",
  "mbcc": "",
  "rcvtime": "2023-05-18 23:35:29",
  "subject": "Magic mail 2023066- ",
  "importance": "1",
  "savepath": "d:\\mail\\TEST123.eml",
  "mbody": "This is the email content",
  "attachments": [
    {
      "filename": "attachment name 1.pdf",
      "type": ".pdf",
      "content": "5oiR54ix5L2g5Lit5Zu9MjAyMw=="
    },
    {
      "filename": "attachment name 2.xlsx",
      "type": ".xlsx",
      "content": "Q2hhdEdQVCDniZvpgLwh"
    }
  ]
}
attachments is a JSON array holding the information for two attachments: filename is the attachment name and content is the base64-encoded string produced from the attachment file. On insert, the document passes through the pipeline, which automatically extracts the content; everything else works the same as with an ordinary index.
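A sketch of assembling such a request body from raw attachment bytes (file names and byte contents here are illustrative placeholders, not the document from the example above):

```python
import base64
import json

# Illustrative attachment list: (filename, extension, raw bytes).
attachments = [
    ("attachment name 1.pdf", ".pdf", b"example attachment text"),
]

# Build the document in the shape the mail index expects; the content
# field carries the base64 encoding of the raw attachment bytes.
doc = {
    "mfrom": "[email protected]",
    "mto": "[email protected]",
    "subject": "Magic mail 2023066- ",
    "mbody": "This is the email content",
    "attachments": [
        {"filename": name, "type": ext,
         "content": base64.b64encode(data).decode("ascii")}
        for name, ext, data in attachments
    ],
}
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

The resulting JSON would then be POSTed to /mail/_doc?pipeline=multiple_attachment as in the Postman example above.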
The following is what is returned on successful execution:
{
  "_index": "mail",
  "_type": "_doc",
  "_id": "eiCNNIgBUc2qXUv978Tg",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
Screenshot of Postman
7. Query documents
7.1 View a document by _id
Send a GET request to http://192.168.31.200:9200/mail/_doc/eiCNNIgBUc2qXUv978Tg with no parameters or body.
Here eiCNNIgBUc2qXUv978Tg is the document _id and mail is the index being queried.
Return result:
{
  "_index": "mail",
  "_type": "_doc",
  "_id": "eiCNNIgBUc2qXUv978Tg",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "savepath": "d:\\mail\\TEST123.eml",
    "mbody": "This is the email content",
    "attachments": [
      {
        "filename": "attachment name 1.pdf",
        "attachment": {
          "content_type": "text/plain; charset=UTF-8",
          "language": "lt",
          "content": "I love you China 2023",
          "content_length": 10
        },
        "type": ".pdf"
      },
      {
        "filename": "attachment name 2.xlsx",
        "attachment": {
          "content_type": "text/plain; charset=UTF-8",
          "language": "lt",
          "content": "ChatGPT awesome!",
          "content_length": 12
        },
        "type": ".xlsx"
      }
    ],
    "mbcc": "",
    "subject": "Magic mail 2023066- ",
    "importance": "1",
    "mfrom": "[email protected]",
    "mto": "[email protected]",
    "mcc": "",
    "rcvtime": "2023-05-18 23:35:29"
  }
}
Screenshot of Postman
7.2 Fuzzy query by attachment name
Send a POST request to http://192.168.31.200:9200/mail/_search.
The request body is a JSON string; attachments.filename.keyword is the attachment name (not analyzed):
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "attachments.filename.keyword": "*attachment*"
          }
        }
      ]
    }
  }
}
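A small sketch of building this wildcard body for an arbitrary filename substring (the helper name is illustrative):

```python
# Build a wildcard query over the un-analyzed attachment filename,
# matching any filename that contains the given substring.
def filename_query(substring: str) -> dict:
    return {
        "query": {
            "bool": {
                "should": [
                    {"wildcard": {
                        "attachments.filename.keyword": f"*{substring}*"}}
                ]
            }
        }
    }

print(filename_query("attachment"))
```

Note that wildcard queries on keyword fields with leading wildcards can be slow on large indices, since they cannot use the term index efficiently.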
Return result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "mail",
        "_type": "_doc",
        "_id": "eiCNNIgBUc2qXUv978Tg",
        "_score": 1.0,
        "_source": {
          "savepath": "d:\\mail\\TEST123.eml",
          "mbody": "This is the email content",
          "attachments": [
            {
              "filename": "attachment name 1.pdf",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "I love you China 2023",
                "content_length": 10
              },
              "type": ".pdf"
            },
            {
              "filename": "attachment name 2.xlsx",
              "attachment": {
                "content_type": "text/plain; charset=UTF-8",
                "language": "lt",
                "content": "ChatGPT awesome!",
                "content_length": 12
              },
              "type": ".xlsx"
            }
          ],
          "mbcc": "",
          "subject": "Magic mail 2023066- ",
          "importance": "1",
          "mfrom": "[email protected]",
          "mto": "[email protected]",
          "mcc": "",
          "rcvtime": "2023-05-18 23:35:29"
        }
      }
    ]
  }
}
7.3 Fuzzy query attachment content
Send a POST request to http://192.168.31.200:9200/mail/_search.
The request body is JSON; attachments.attachment.content is the extracted attachment text (plain text, not base64):
{
  "size": "10000",
  "_source": [ "_id", "seqnbr", "subject", "eml" ],
  "query": {
    "match": {
      "attachments.attachment.content": "ChatGPT"
    }
  }
}
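Because match is a full-text (analyzed) query, a plain term is enough; wildcard characters are not interpreted by match. A sketch of building this body (the helper name and the reduced _source list are illustrative):

```python
# Build a match query over the extracted attachment text. "match" analyzes
# the input, so a bare term suffices; no wildcard syntax is needed.
def content_query(term: str, size: int = 10000) -> dict:
    return {
        "size": size,
        "_source": ["subject"],
        "query": {"match": {"attachments.attachment.content": term}},
    }

print(content_query("ChatGPT"))
```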
Return result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "mail",
        "_type": "_doc",
        "_id": "eiCNNIgBUc2qXUv978Tg",
        "_score": 0.2876821,
        "_source": {
          "subject": "Magic mail 2023066- "
        }
      }
    ]
  }
}
8. Other notes
For completeness, here is a separate text extraction pipeline, single_attachment, for the single-attachment case.
Execute it in Kibana's Dev Tools:
PUT _ingest/pipeline/single_attachment
{
  "description" : "Extract single attachment information",
  "processors" : [
    {
      "attachment" : {
        "field": "data",
        "indexed_chars" : -1,
        "ignore_missing" : true
      }
    }
  ]
}
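A document for this pipeline simply carries the base64-encoded file in the "data" field that the processor reads. A minimal sketch (the byte content is a placeholder):

```python
import base64

# Shape of a document for the single_attachment pipeline: the processor
# reads base64 data from the "data" field.
doc = {"data": base64.b64encode(b"single attachment text").decode("ascii")}
print(doc)
```

It would be inserted with POST .../<index>/_doc?pipeline=single_attachment, analogous to the multiple-attachment case above.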
The rest is a matter of integrating this into application code. The use of the IK Chinese word-segmentation plugin will be explained in detail later.