A brief discussion on Elasticsearch indexing and mapping mechanism

Elasticsearch is a distributed, real-time search and analytics engine suited to scenarios such as full-text search, structured search, and analytics. To deliver high-performance search and analysis, Elasticsearch stores data in a structure called an index and defines how that data is structured and processed through a mapping. This article walks through Elasticsearch's indexing and mapping mechanism in detail, covering field types, analyzers, character filters, tokenizers, and token filters.

1. Overview of indexing and mapping

In Elasticsearch, an index is a data structure used to store and manage data. Each index consists of one or more shards, and each shard can have zero or more replicas. Data in the index is stored in the form of documents, and each document contains a set of fields and their corresponding values.
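For example, shard and replica counts can be set when an index is created. A minimal sketch (the index name and counts are illustrative):

PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}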

A mapping is metadata that defines the structure of the documents in an index: field names, field types, field attributes, and so on. The mapping tells Elasticsearch how each field's data is structured and processed, which lets it query and analyze that data with better performance and flexibility.

2. Field types

In Elasticsearch, a field type defines a field's data type and how the field is processed. The field type determines how values are stored and indexed, and how they can be used in queries and aggregations. Here are some commonly used field types:

2.1 Text

The text type stores full-text data such as articles and comments. Text fields are run through an analyzer, which breaks the text into tokens that are matched at query time. Text fields support full-text search and fuzzy matching, but they are not suitable for sorting or aggregation.
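For instance, a match query analyzes the query string with the field's analyzer and matches the resulting tokens. A sketch (index and field names are illustrative):

GET /my_index/_search
{
  "query": {
    "match": {
      "title": "elasticsearch mapping"
    }
  }
}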

2.2 Keyword

The keyword type stores non-full-text data such as tags and categories. Keyword fields are not processed by an analyzer, which makes them suitable for exact matching. Keyword fields support exact matching, range queries, sorting, and aggregations, but not full-text search or fuzzy matching.
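For example, a term query matches a keyword field exactly, and a terms aggregation groups documents by its values (index and field names are illustrative):

GET /my_index/_search
{
  "query": {
    "term": {
      "tags": "search"
    }
  },
  "aggs": {
    "by_tag": {
      "terms": {
        "field": "tags"
      }
    }
  }
}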

2.3 Numeric types (integer, long, float, double)

Numeric types store numerical data such as ages and prices. Numeric fields support range queries, sorting, and aggregations. Elasticsearch provides several numeric types, including integer (32-bit integer), long (64-bit integer), float (32-bit floating point), and double (64-bit floating point), to meet different precision and storage requirements.
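A typical range query on a numeric field (names and bounds are illustrative):

GET /my_index/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 10,
        "lte": 100
      }
    }
  }
}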

2.4 Date

The date type stores date and time data, such as publication and update times. Date fields support range queries, sorting, and aggregations. Elasticsearch automatically recognizes several date-time formats, such as 2021-01-01 and 2021-01-01T12:34:56, and you can customize the accepted formats via the format parameter.
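Date fields also accept date-math expressions such as now in range queries. A sketch (names and values are illustrative):

GET /my_index/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2021-01-01",
        "lte": "now"
      }
    }
  }
}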

2.5 Boolean

The boolean type stores true/false values, such as whether an item is available or has been deleted. Boolean fields support exact matching and aggregations.
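A boolean field is typically matched with a term query (names are illustrative):

GET /my_index/_search
{
  "query": {
    "term": {
      "is_available": true
    }
  }
}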

2.6 Binary

The binary type stores binary data such as images and files. Binary fields are not searchable or analyzable; they are only stored and retrieved. In Elasticsearch, binary values must be supplied as Base64-encoded strings.
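A sketch of a binary field and a document carrying a Base64-encoded value (names are illustrative; the string decodes to "Some binary content"):

PUT /my_index
{
  "mappings": {
    "properties": {
      "attachment": {
        "type": "binary"
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "attachment": "U29tZSBiaW5hcnkgY29udGVudA=="
}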

2.7 Complex types

Elasticsearch also supports complex types such as object, nested, and the geo types (geo_point, geo_shape). These types store and query structured data such as JSON objects and geographic coordinates.
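A sketch of a mapping combining a geo_point field with a nested array of objects (names are illustrative):

PUT /my_index
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "comments": {
        "type": "nested",
        "properties": {
          "user": {
            "type": "keyword"
          },
          "message": {
            "type": "text"
          }
        }
      }
    }
  }
}

Unlike a plain object, a nested field indexes each array element as its own hidden document, so a query can require that user and message match within the same comment.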

2.8 Field type configuration

In Elasticsearch, field types can be defined through mapping configurations. The following is an example of a mapping configuration:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "tags": {
        "type": "keyword"
      },
      "publish_date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "price": {
        "type": "float"
      },
      "is_available": {
        "type": "boolean"
      },
      "location": {
        "type": "geo_point"
      },
      "author": {
        "type": "object",
        "properties": {
          "name": {
            "type": "text"
          },
          "email": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

In this example, we define a mapping with the fields title (text), tags (keyword), publish_date (date), price (float), is_available (boolean), location (geo_point), and author (object). We also specify the standard analyzer for the title field and a date format for the publish_date field.

3. Analyzers

3.1 Analyzer Overview

An analyzer is a component that processes text data. It is built from character filters, a tokenizer, and token filters. In Elasticsearch, analyzers break text fields into tokens for full-text search and analysis.

The analyzer workflow is as follows (the _analyze example after this list exercises all three stages):

  1. First, character filters preprocess the input text, for example removing HTML tags or replacing characters.
  2. Then, the tokenizer splits the processed text into tokens, for example on whitespace or punctuation.
  3. Finally, token filters transform the generated tokens, for example removing stop words or adding synonyms.
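For instance, all three stages can be combined in a single _analyze call. A minimal sketch (the text value is illustrative):

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK Brown Foxes</p>"
}

This should return the tokens quick, brown, and foxes: the HTML tags are stripped, the text is split on word boundaries, the tokens are lowercased, and the stop word "the" is removed.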

3.2 Character Filter

Character filters preprocess the characters of the incoming text, for example removing HTML tags or replacing characters. In Elasticsearch, you can use the predefined character filters or define custom ones.

Here are the predefined character filters (a usage example follows the list):

  • html_strip: Remove HTML tags and decode HTML entities.
  • mapping: Replace characters according to a configured map of keys and values.
  • pattern_replace: Replace characters that match a regular expression.
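For example, html_strip can be tried on its own by pairing it with the keyword tokenizer, which keeps the whole input as a single token (the text is illustrative):

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "keyword",
  "text": "<b>Hello</b> world"
}

The response should contain the single token "Hello world".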

3.3 Tokenizer

A tokenizer splits text into tokens, for example on whitespace or punctuation. In Elasticsearch, you can use the predefined tokenizers or define custom ones.

Here are some commonly used predefined tokenizers (a usage example follows the list):

  • standard: Split on word boundaries according to the Unicode text segmentation rules.
  • whitespace: Split on whitespace characters.
  • keyword: Emit the entire input as a single token.
  • ngram: Generate n-gram tokens within a configured length range.
  • edge_ngram: Generate n-gram tokens anchored to the start of each word, within a configured length range.
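As a quick illustration, the whitespace tokenizer can be invoked directly (the text is illustrative):

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown foxes"
}

This should return the tokens Quick, brown, and foxes, with no lowercasing, since a tokenizer alone does not transform tokens.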

3.4 Token Filter

Token filters transform the tokens produced by the tokenizer, for example removing stop words or adding synonyms. In Elasticsearch, you can use the predefined token filters or define custom ones.

Here are some commonly used predefined token filters (a usage example follows the list):

  • lowercase: Convert tokens to lowercase.
  • stop: Remove stop words.
  • synonym: Add synonym tokens.
  • stemmer: Reduce tokens to their stems.
  • asciifolding: Convert Unicode characters to their closest ASCII equivalents.
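For instance, chaining lowercase and asciifolding after the standard tokenizer (the text is illustrative):

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Déjà Vu"
}

This should return the tokens deja and vu.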

3.5 Analyzer Configuration

In Elasticsearch, custom analyzers are defined in the analysis section of the index settings. The following is an example analyzer configuration:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_html_strip": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "and", "is"]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "char_filter": ["my_html_strip"],
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

In this example, we define a custom analyzer my_custom_analyzer consisting of a character filter my_html_strip (removes HTML tags but keeps b and i tags), a tokenizer my_ngram_tokenizer (generates n-gram tokens of 2 to 3 characters), and the token filters lowercase and my_stopwords (removes the configured stop words).
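Assuming these settings were applied when creating an index named my_index (the name is illustrative), the analyzer can be tested with the _analyze API:

GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>The Quick Fox</p>"
}

The p tags should be stripped, the remaining text split into 2- and 3-character n-grams, the grams lowercased, and any gram matching a configured stop word (such as "the") removed.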

4. Mapping configuration

4.1 Mapping configuration overview

A mapping configuration is metadata that defines the structure of the documents in an index: field names, field types, field attributes, and so on. The mapping configuration tells Elasticsearch how each field's data is structured and processed, which lets it query and analyze that data with better performance and flexibility.

In Elasticsearch, a mapping can be created together with its index via a PUT request, and extended later through the _mapping endpoint. The following is an example of a mapping configuration:

PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "tags": {
        "type": "keyword"
      },
      "publish_date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "price": {
        "type": "float"
      },
      "is_available": {
        "type": "boolean"
      }
    }
  }
}

In this example, we create an index named my_index and define a mapping with the fields title (text), tags (keyword), publish_date (date), price (float), and is_available (boolean). We also specify the standard analyzer for the title field and a date format for the publish_date field.

4.2 Field attributes

In a mapping configuration, you can set a variety of attributes on a field to control how it is stored, indexed, and queried. Here are some commonly used field attributes (an example follows the list):

  • type: The data type of the field, such as text, keyword, date, etc.
  • analyzer: Analyzer used to process text fields, such as standard, whitespace, etc.
  • format: Date format used to parse and format date fields, such as strict_date_optional_time, epoch_millis, etc.
  • index: Whether the field is indexed. Defaults to true, meaning the field is searchable; when set to false the field cannot be searched, but its value is still kept in _source and can be retrieved.
  • store: Whether the field value is stored separately. Defaults to false, meaning values are retrieved by parsing the _source field; when set to true the value is stored on its own and can be fetched directly via stored_fields without loading _source.
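A sketch combining both attributes (field and index names are illustrative):

PUT /my_index
{
  "mappings": {
    "properties": {
      "internal_note": {
        "type": "text",
        "index": false
      },
      "title": {
        "type": "text",
        "store": true
      }
    }
  }
}

Here internal_note can be retrieved but not searched, while the title value can be fetched directly with stored_fields.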

4.3 Dynamic mapping

In Elasticsearch, you can use the dynamic mapping feature to automatically create mapping configurations for new fields. Dynamic mapping automatically infers the data type and attributes of the field based on the type and format of the field value.

Dynamic mapping can be controlled via the following settings (an example follows the list):

  • dynamic: The dynamic mapping behavior. Defaults to true, meaning mappings for new fields are created automatically; false means new fields are ignored (kept in _source but not indexed); strict means documents containing unmapped fields are rejected with an error.
  • dynamic_templates: Templates for dynamic mapping. They let you define different mapping rules for different kinds of fields, such as setting an analyzer for string fields or a date format for date fields.
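For example, a dynamic template that maps every newly seen string field to keyword (index and template names are illustrative):

PUT /my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}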

5. Summary

This article has covered Elasticsearch's indexing and mapping mechanism in detail, including field types, analyzers, character filters, tokenizers, and token filters. Understanding these concepts helps you take full advantage of Elasticsearch's performance for real-time analysis and large-scale data processing. Properly configured mappings and analyzers can significantly improve search and analysis performance, reduce system load, and deliver a better experience for your business.