Elasticsearch in practice: processing email attachments for full-text content retrieval

Contents

1. System environment and software requirements

2. Software description

3. Define the text extraction pipeline

4. Create index and set document structure mapping

5. Insert document

6. Query documents

7. Other instructions


The requirement: process local emails together with their attachments (PDF, Excel, Word, and so on) and store them in Elasticsearch, so that both email bodies and attachment contents support full-text retrieval.

1. System environment and software requirements

System: CentOS 7.3

Elasticsearch version: 7.13.3

Kibana version: 7.16.3

ingest-attachment plugin version: 7.13.3

2. Software description

Kibana is an open source analytics and visualization platform designed to work with Elasticsearch. Here it is used mainly to run commands in its Dev Tools console.

Ingest-Attachment is an out-of-the-box plugin that lets files in common formats be written to an index as attachments. It uses Apache Tika to extract text from files; supported formats include TXT, DOC, PPT, XLS, and PDF. Extraction happens automatically when a document is indexed through a pipeline. Note: the source field must contain the base64-encoded binary content of the file.

Disadvantage: for xls and xlsx files, individual sheets cannot be indexed separately; the whole file can only be inserted into Elasticsearch as a single document.
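The base64 requirement noted above can be met with a few lines of Python. This is a minimal sketch using only the standard library; the function names encode_attachment and decode_attachment are illustrative and are not part of the plugin.

```python
import base64
from pathlib import Path


def encode_attachment(path):
    """Read a file and return its content as a base64 string (ASCII text)."""
    raw = Path(path).read_bytes()
    return base64.b64encode(raw).decode("ascii")


def decode_attachment(encoded):
    """Inverse of encode_attachment; handy for checking a payload by hand."""
    return base64.b64decode(encoded)
```

The returned string is what goes into the source field of the document, for example the content field of each attachment used later in this article.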

2.1 Install the plug-in

Here I install Ingest-Attachment offline. Use wget to download the plugin package whose version matches your Elasticsearch version:

wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.13.3.zip

Upload the file to this directory on the server:

/home/es/install/ingest-attachment-7.13.3.zip

Go to the Elasticsearch home directory (ES_HOME) and run the following commands to install the plugin:

cd /home/elasticsearch/

./bin/elasticsearch-plugin install file:///home/es/install/ingest-attachment-7.13.3.zip

Restart the Elasticsearch service after the installation completes.

The plugin is now installed.
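After the restart, one way to confirm that the plugin was loaded is to call GET _cat/plugins?format=json and look for ingest-attachment. The sketch below assumes an unsecured cluster reachable over plain HTTP; the helper names fetch_plugins and nodes_with_plugin are illustrative.

```python
import json
import urllib.request


def fetch_plugins(base_url):
    """Fetch the plugin list; each row has "name" (node), "component" and "version"."""
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as response:
        return json.load(response)


def nodes_with_plugin(rows, component, version=None):
    """Return the sorted names of the nodes that report the given plugin."""
    return sorted(
        row["name"]
        for row in rows
        if row["component"] == component
        and (version is None or row["version"] == version)
    )
```

For the setup in this article the call would be nodes_with_plugin(fetch_plugins("http://192.168.31.200:9200"), "ingest-attachment", "7.13.3"). An empty list suggests the plugin is missing or the node was not restarted.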

3. Define the text extraction pipeline

Run the following in Kibana's Dev Tools console.

An email may have several attachments, so I define a text extraction pipeline named multiple_attachment that handles an array of attachments. It also removes the base64 binary data once extraction is done.

Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processor cannot find the correct field.

PUT _ingest/pipeline/multiple_attachment
{
    "description" : "Extract attachment information from arrays",
    "processors" : [
      {
        "foreach" : {
          "field" : "attachments",
          "processor" : {
            "attachment" : {
              "target_field" : "_ingest._value.attachment",
              "field" : "_ingest._value.content"
            }
          }
        }
      },
      {
        "foreach" : {
          "field" : "attachments",
          "processor" : {
            "remove" : {
              "field" : "_ingest._value.content"
            }
          }
        }
      }
    ]
}
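Before indexing real mail, the pipeline can be tried out with the simulate API (POST _ingest/pipeline/multiple_attachment/_simulate), which runs the processors on sample documents without storing anything. The following is a minimal sketch assuming an unsecured cluster; build_simulate_body and simulate are illustrative helper names.

```python
import json
import urllib.request


def build_simulate_body(encoded_contents):
    """Wrap base64 strings in the document shape the pipeline expects."""
    attachments = [{"content": content} for content in encoded_contents]
    return {"docs": [{"_source": {"attachments": attachments}}]}


def simulate(base_url, pipeline, body):
    """Send the sample documents through the pipeline and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/_ingest/pipeline/{pipeline}/_simulate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In the reply, each processed document should carry an attachment object next to every array entry, and the content field should be gone because of the remove processor.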

Parameters of the attachment processor provided by the ingest-attachment plugin

Name                 Required  Default         Description
field                yes       -               Field to read the base64-encoded content from.
target_field         no        attachment      Field that holds the extracted attachment information; mainly needed when handling multiple attachments.
indexed_chars        no        100000          Maximum number of characters extracted into the field. Use -1 for no limit.
indexed_chars_field  no        -               Field in the document whose value overrides indexed_chars for that document.
properties           no        all properties  Properties to store, for example content, title, name, author, keywords, date, content_type, content_length, language.
ignore_missing       no        false           If true and the field does not exist, the processor is skipped and the document is written unchanged; otherwise an error is reported.
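To show how these parameters combine, here is a small sketch that assembles an attachment processor definition as a Python dictionary and leaves unset options to the plugin defaults. The helper attachment_processor is hypothetical, and the list of allowed property names is simply the one given in the table above.

```python
ALLOWED_PROPERTIES = {
    "content", "title", "name", "author", "keywords",
    "date", "content_type", "content_length", "language",
}


def attachment_processor(field, target_field=None, indexed_chars=None,
                         indexed_chars_field=None, properties=None,
                         ignore_missing=None):
    """Build an attachment processor definition; None means "use the default"."""
    options = {"field": field}
    if target_field is not None:
        options["target_field"] = target_field
    if indexed_chars is not None:
        options["indexed_chars"] = indexed_chars
    if indexed_chars_field is not None:
        options["indexed_chars_field"] = indexed_chars_field
    if properties is not None:
        unknown = set(properties) - ALLOWED_PROPERTIES
        if unknown:
            raise ValueError(f"unknown properties: {sorted(unknown)}")
        options["properties"] = list(properties)
    if ignore_missing is not None:
        options["ignore_missing"] = ignore_missing
    return {"attachment": options}
```

For example, attachment_processor("data", indexed_chars=-1, ignore_missing=True) yields a processor that extracts the full text of the data field and skips documents that lack it.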

4. Create an index and set the document structure mapping

PUT mail
{
  "settings": {
    "index": {
      "max_result_window": 100000000
    },
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "mfrom": {
        "type": "keyword"
      },
      "mto": {
        "type": "keyword"
      },
      "mcc": {
        "type": "keyword"
      },
      "mbcc": {
        "type": "keyword"
      },
      "rcvtime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "subject": {
        "type": "keyword"
      },
      "importance": {
        "type": "keyword"
      },
      "savepath": {
        "type": "keyword"
      },
      "mbody": {
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "ignore_above": 256,
                    "type": "keyword"
                  }
                }
              },
              "filename": {
                "type": "keyword"
              },
              "type": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}

On success, the following is returned:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mail"
}
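The rcvtime field only accepts values in the yyyy-MM-dd HH:mm:ss format declared in the mapping. When the documents are produced from code, the timestamp has to be rendered accordingly. A minimal Python sketch follows; the name format_rcvtime is illustrative. Because the format carries no time zone, Elasticsearch treats the value as UTC, so converting to UTC before formatting is a reasonable assumption rather than a requirement stated in this article.

```python
from datetime import datetime, timezone

# Python strftime equivalent of the mapping format yyyy-MM-dd HH:mm:ss
RCVTIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_rcvtime(moment):
    """Render a datetime the way the rcvtime mapping expects it."""
    if moment.tzinfo is not None:
        # Assumption: store UTC, since the format has no time zone part.
        moment = moment.astimezone(timezone.utc)
    return moment.strftime(RCVTIME_FORMAT)
```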

5. Insert document

You can use Postman to call the Elasticsearch RESTful API to insert or update documents.

Request type: POST
Request address: http://192.168.31.200:9200/mail/_doc?pipeline=multiple_attachment

In the request address, mail is the index name, and pipeline=multiple_attachment specifies the ingest pipeline to use.

The request body is JSON:

{
    "mfrom": "[email protected]",
    "mto": "[email protected]",
    "mcc": "",
    "mbcc": "",
    "rcvtime": "2023-05-18 23:35:29",
    "subject": "Magic mail 2023066- ",
    "importance": "1",
    "savepath": "d:\\mail\\TEST123.eml",
    "mbody": "This is the email content",
    "attachments": [
        {
            "filename": "attachment name 1.pdf",
            "type": ".pdf",
            "content": "5oiR54ix5L2g5Lit5Zu9MjAyMw=="
        },
        {
            "filename": "attachment name 2.xlsx",
            "type": ".xlsx",
            "content": "Q2hhdEdQVCDniZvpgLwh"
        }
    ]
}

attachments is a JSON array that holds the information for two attachments. filename is the attachment name, and content is the base64-encoded content of the attachment file. On insertion, the pipeline processes each content value and extracts its text automatically; everything else works the same as with an ordinary index.
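The same request can be produced from code instead of Postman. The sketch below parses a local .eml file with Python's standard email package, builds a document with the field names used in this article, and posts it through the pipeline. It is only one possible approach: build_mail_doc and index_mail are hypothetical helpers, importance is passed in by the caller because the header it comes from is not specified here, and converting the Date header to UTC is an assumption.

```python
import base64
import json
import os
import urllib.request
from datetime import timezone
from email import message_from_bytes, policy
from email.utils import parsedate_to_datetime


def build_mail_doc(raw_eml, savepath, importance=""):
    """Turn the raw bytes of an .eml file into a document for the mail index."""
    msg = message_from_bytes(raw_eml, policy=policy.default)

    attachments = []
    for part in msg.iter_attachments():
        filename = part.get_filename() or ""
        payload = part.get_payload(decode=True) or b""
        attachments.append({
            "filename": filename,
            "type": os.path.splitext(filename)[1].lower(),
            "content": base64.b64encode(payload).decode("ascii"),
        })

    body = msg.get_body(preferencelist=("plain", "html"))
    doc = {
        "mfrom": str(msg["From"] or ""),
        "mto": str(msg["To"] or ""),
        "mcc": str(msg["Cc"] or ""),
        "mbcc": str(msg["Bcc"] or ""),
        "subject": str(msg["Subject"] or ""),
        "importance": importance,
        "savepath": savepath,
        "mbody": body.get_content() if body is not None else "",
        "attachments": attachments,
    }
    if msg["Date"]:
        received = parsedate_to_datetime(str(msg["Date"]))
        if received.tzinfo is not None:
            # Assumption: store UTC, since the mapping format has no time zone.
            received = received.astimezone(timezone.utc)
        doc["rcvtime"] = received.strftime("%Y-%m-%d %H:%M:%S")
    return doc


def index_mail(base_url, doc, index="mail", pipeline="multiple_attachment"):
    """POST the document through the ingest pipeline and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_doc?pipeline={pipeline}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The attachments key is always present, even when it is empty, because the foreach processor in the pipeline expects the field to exist.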

The following is what is returned on successful execution:

{
    "_index": "mail",
    "_type": "_doc",
    "_id": "eiCNNIgBUc2qXUv978Tg",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}


6. Query documents

6.1 View documents by _id

Request type: GET
Request address: http://192.168.31.200:9200/mail/_doc/eiCNNIgBUc2qXUv978Tg

Parameters and body: none

Here eiCNNIgBUc2qXUv978Tg is the document _id, and mail is the index to query.

Returned result:

{
    "_index": "mail",
    "_type": "_doc",
    "_id": "eiCNNIgBUc2qXUv978Tg",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "savepath": "d:\\mail\\TEST123.eml",
        "mbody": "This is the email content",
        "attachments": [
            {
                "filename": "attachment name 1.pdf",
                "attachment": {
                    "content_type": "text/plain; charset=UTF-8",
                    "language": "lt",
                    "content": "I love you China 2023",
                    "content_length": 10
                },
                "type": ".pdf"
            },
            {
                "filename": "attachment name 2.xlsx",
                "attachment": {
                    "content_type": "text/plain; charset=UTF-8",
                    "language": "lt",
                    "content": "ChatGPT awesome!",
                    "content_length": 12
                },
                "type": ".xlsx"
            }
        ],
        "mbcc": "",
        "subject": "Magic mail 2023066- ",
        "importance": "1",
        "mfrom": "[email protected]",
        "mto": "[email protected]",
        "mcc": "",
        "rcvtime": "2023-05-18 23:35:29"
    }
}


6.2 Fuzzy query by attachment name

Request type: POST
Request address: http://192.168.31.200:9200/mail/_search

The request body is JSON. attachments.filename.keyword holds the attachment name as a single unanalyzed term (no word segmentation).

{
    "query": {
        "bool": {
            "should": [
                {
                    "wildcard": {
                        "attachments.filename.keyword": "*attachment*"
                    }
                }
            ]
        }
    }
}

Returned result:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "mail",
                "_type": "_doc",
                "_id": "eiCNNIgBUc2qXUv978Tg",
                "_score": 1.0,
                "_source": {
                    "savepath": "d:\\mail\\TEST123.eml",
                    "mbody": "This is the email content",
                    "attachments": [
                        {
                            "filename": "attachment name 1.pdf",
                            "attachment": {
                                "content_type": "text/plain; charset=UTF-8",
                                "language": "lt",
                                "content": "I love you China 2023",
                                "content_length": 10
                            },
                            "type": ".pdf"
                        },
                        {
                            "filename": "attachment name 2.xlsx",
                            "attachment": {
                                "content_type": "text/plain; charset=UTF-8",
                                "language": "lt",
                                "content": "ChatGPT awesome!",
                                "content_length": 12
                            },
                            "type": ".xlsx"
                        }
                    ],
                    "mbcc": "",
                    "subject": "Magic mail 2023066- ",
                    "importance": "1",
                    "mfrom": "[email protected]",
                    "mto": "[email protected]",
                    "mcc": "",
                    "rcvtime": "2023-05-18 23:35:29"
                }
            }
        ]
    }
}
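When the search fragment comes from user input, the wildcard characters * and ? inside it would be interpreted as patterns. The sketch below builds the same query as above and escapes those characters with a backslash, which to my knowledge is the escape character of the underlying Lucene wildcard syntax; treat that as an assumption to verify on your version. The name wildcard_filename_query is illustrative.

```python
def wildcard_filename_query(fragment):
    """Build a "contains" wildcard query on the attachment file name."""
    escaped = (
        fragment.replace("\\", "\\\\")  # escape the escape character first
        .replace("*", "\\*")
        .replace("?", "\\?")
    )
    return {
        "query": {
            "bool": {
                "should": [
                    {"wildcard": {"attachments.filename.keyword": f"*{escaped}*"}}
                ]
            }
        }
    }
```

Patterns that start with * have to scan many terms, so this kind of query can become slow on large indices.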

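6.3 Fuzzy query by attachment content

Request type: POST
Request address: http://192.168.31.200:9200/mail/_search

The request body is JSON. attachments.attachment.content holds the plain text extracted from the attachment (no longer base64-encoded). The match query analyzes the search text, so wildcard characters are not needed.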

{
    "size": 10000,
    "_source": [
        "_id",
        "seqnbr",
        "subject",
        "eml"
    ],
    "query": {
        "match": {
            "attachments.attachment.content": "ChatGPT"
        }
    }
}

Returned result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "mail",
                "_type": "_doc",
                "_id": "eiCNNIgBUc2qXUv978Tg",
                "_score": 0.2876821,
                "_source": {
                    "subject": "Magic mail 2023066- "
                }
            }
        ]
    }
}
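In application code, the search reply usually needs to be reduced to the few values that matter. Here is a minimal sketch that posts a query to the _search endpoint and pulls the _id, _score and subject out of each hit; search and summarize_hits are illustrative names, and an unsecured cluster is assumed.

```python
import json
import urllib.request


def search(base_url, body, index="mail"):
    """POST a query body to the _search endpoint and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_hits(reply):
    """Return (_id, _score, subject) for every hit in a search reply."""
    return [
        (hit["_id"], hit["_score"], hit.get("_source", {}).get("subject"))
        for hit in reply["hits"]["hits"]
    ]
```

Applied to the reply shown above, summarize_hits returns a single tuple holding the document _id, the score 0.2876821 and the subject.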

7. Other instructions

For reference, the following text extraction pipeline, single_attachment, handles documents that carry a single attachment in the data field.

Run it in Kibana's Dev Tools console:

PUT _ingest/pipeline/single_attachment
{
  "description" : "Extract single attachment information",
  "processors" : [
    {
      "attachment" : {
        "field": "data",
        "indexed_chars" : -1,
        "ignore_missing" : true
      }
    }
  ]
}

The rest is a matter of integrating these requests into application code. The use of the Chinese word segmentation plugin IK will be explained in detail in a later article.