Elasticsearch advanced queries and index principles

  • Advanced Search
    • Match query [match_all]
    • Keyword query [term]
    • range query [range]
    • prefix query [prefix]
    • Wildcard query [wildcard]
    • Query by id array [ids]
    • Fuzzy query [fuzzy]
    • Boolean query [bool]
      • must query
      • should query
      • must_not query
      • filter query
      • Boolean combination query
    • Multi-field query [multi_match]
    • Default field word segmentation query [query_string]
    • Highlight query [highlight]
    • Return the specified number of items [size]
    • Paging query [from]
    • Specify field sorting [sort]
    • Return the specified field [_source]
  • Index principle

Advanced query

ES provides a powerful way to retrieve data called Query DSL. Query DSL sends requests in JSON format through the REST API, and its rich query syntax makes ES retrieval both more powerful and more concise.

Match query [match_all]

  • match_all: returns all documents in the index
  • match: the search text is analyzed (segmented into terms) first, and the terms are then matched against the target field. A document matches if any one of the terms occurs in the field
  • match_phrase: the search text is treated as a phrase: every term must occur in the field, in the same order and adjacent to each other. Punctuation is dropped during analysis, so it does not affect the match
  • match_phrase_prefix: similar to match_phrase, except that the last term is matched as a prefix

Explain the difference between them with an example

  • First store a document whose content is "i like eating and cooking". The default tokenizer should divide the content into “i”, “like”, “eating”, “and”, “cooking”

Query term / match type | match | match_phrase | match_phrase_prefix
i                       | yes   | yes          | yes
i like                  | yes   | yes          | yes
i like singing          | yes   | no           | no
i like ea               | yes   | no           | yes
and                     | yes   | yes          | yes

Summary:

  1. match analyzes the search text and matches a document if any of the resulting terms matches; match_phrase and match_phrase_prefix treat the search text as one ordered phrase instead of a set of independent terms
  2. match and match_phrase match whole terms exactly, and match_phrase additionally requires the terms to appear in the field in the same order and next to each other
  3. match_phrase_prefix is not an exact match: on top of match_phrase, it allows the last term to match as a prefix
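
The three behaviours summarized above can be imitated in a few lines of Python. This is only a minimal sketch, assuming a simplified analyzer that lowercases the text and splits it on non-word characters; the function names are made up for illustration and are not part of any Elasticsearch client.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def match(query, field):
    """True if any term of the query occurs in the field."""
    field_terms = set(analyze(field))
    return any(term in field_terms for term in analyze(query))

def match_phrase(query, field):
    """True if the query terms occur in the field in order and next to each other."""
    q, f = analyze(query), analyze(field)
    if not q:
        return False
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

def match_phrase_prefix(query, field):
    """Like match_phrase, but the last query term only has to be a prefix."""
    q, f = analyze(query), analyze(field)
    if not q:
        return False
    for i in range(len(f) - len(q) + 1):
        window = f[i:i + len(q)]
        if window[:-1] == q[:-1] and window[-1].startswith(q[-1]):
            return True
    return False

doc = "i like eating and cooking"
for query in ["i", "i like", "i like singing", "i like ea", "and"]:
    print(query, match(query, doc), match_phrase(query, doc), match_phrase_prefix(query, doc))
```

Running the loop prints one row per search text, in the same column order as the comparison table.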

Keyword query [term]

term: queries for an exact term; the search value itself is not analyzed

  • keyword type: when using term on a keyword field, the whole field value must match
  • integer, double and date types: no word segmentation, the whole value must match
  • text type: segmented by the ES standard tokenizer by default

So every type except text is stored without word segmentation.

ES uses the standard tokenizer by default: Chinese is segmented character by character, English word by word.

# query statement
GET /products/_search
{
  "query": {
    "term": {
      "title": {
        "value": "Pigman"
      }
    }
  }
}
#result
"hits": {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2039728,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "rj5iCYABiB8dDekOlCwE",
        "_score" : 1.2039728,
        "_source" : {
          "id" : 2,
          "title" : "Pigman",
          "price" : 0.5,
          "created_at" : "2022-04-08",
          "description" : "5 cents a pack"
        }
      }
    ]
  }
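
Why the term query above finds "Pigman" in the keyword field title can be sketched in Python. This is a simplified model rather than the real implementation: it assumes keyword fields store the whole value as one term and text fields store lowercase words, and the helper names are hypothetical.

```python
import re

def index_terms(value, field_type):
    """Terms stored for a field: the whole value for keyword, lowercase words for text."""
    if field_type == "keyword":
        return {value}
    return set(re.findall(r"\w+", value.lower()))

def term_query(value, stored_terms):
    """term does not analyze the search value; it must equal a stored term exactly."""
    return value in stored_terms

title_terms = index_terms("Pigman", "keyword")
description_terms = index_terms("5 cents a pack", "text")

print(term_query("Pigman", title_terms))                  # whole keyword value
print(term_query("pigman", title_terms))                  # case differs
print(term_query("cents", description_terms))             # one stored word
print(term_query("5 cents a pack", description_terms))    # whole sentence is not a stored term
```

Under these assumptions, a term query against a text field only succeeds when the search value equals a single stored word.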

Range query [range]

range: queries documents whose field value falls within the specified range

# range query
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 2, # lower bound
        "lte": 4 # upper bound
      }
    }
  }
}

Prefix query [prefix]

prefix: queries documents that contain a term starting with the specified prefix

# prefix query prefix
GET /products/_search
{
  "query": {
    "prefix": {
      "FIELD": {
        "value": ""
      }
    }
  }
}

Wildcard query [wildcard]

Wildcard queries can be used:

  • ? matches any single character
  • * matches zero or more characters

# wildcard query
GET /products/_search
{
  "query": {
    "wildcard": {
      "FIELD": {
        "value": "VALUE"
      }
    }
  }
}
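
To get a feel for the two wildcard symbols, the pattern can be translated into a regular expression. The following is an illustrative Python sketch only; the helper names are hypothetical, and Elasticsearch evaluates wildcards against the indexed terms internally rather than in this way.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? (any single character) and * (zero or more characters) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def wildcard_match(pattern, term):
    """True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

print(wildcard_match("Pig*", "Pigman"))
print(wildcard_match("Pig?an", "Pigman"))
print(wildcard_match("Pig?", "Pigman"))
```

Note that the pattern has to cover the whole term, which is why "Pig?" does not match "Pigman" in this sketch.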

Query by id array [ids]

Query documents through an id array

#Query through a set of ids
GET /products/_search
{
  "query": {
    "ids": {
      "values": [1,2]
    }
  }
}

Fuzzy query [fuzzy]

Fuzzy search for documents that contain terms similar to the specified keyword

Note: the maximum edit distance (fuzziness) of a fuzzy query must be between 0 and 2

  • Search keyword of length 2 or less: no fuzziness is allowed, the term must match exactly
  • Search keyword of length 3-5: one edit is allowed
  • Search keyword longer than 5: at most 2 edits are allowed

GET /products/_search
{
  "query": {
    "fuzzy": {
      "FIELD": "xxxx"
    }
  }
}
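
The length rule above can be sketched in Python together with a plain Levenshtein distance. This is a minimal illustration, not the engine's algorithm: the function names are hypothetical, and the sketch counts only insertions, deletions and substitutions, whereas Elasticsearch by default also treats swapping two adjacent characters as a single edit.

```python
def auto_fuzziness(keyword):
    """Allowed number of edits for a search keyword, following the length rule above."""
    n = len(keyword)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(keyword, term):
    """True if term is within the allowed edit distance of the search keyword."""
    return edit_distance(keyword, term) <= auto_fuzziness(keyword)

print(fuzzy_match("pigmen", "pigman"))  # 6 characters, 1 substitution
print(fuzzy_match("pg", "pig"))         # 2 characters, exact match required
```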

Boolean query [bool]

Elasticsearch can use the bool keyword to combine multiple conditions into complex queries, similar to the AND, OR and NOT operations in SQL.

The Boolean logic types supported by Elasticsearch include the following:

  • must: the document must satisfy all of the query conditions. With multiple conditions it behaves like AND in SQL, or the && operator

  • should: the document must satisfy one or more of the query conditions (the number of conditions that must be satisfied can be set with minimum_should_match). With multiple conditions it behaves like OR in SQL, or the || operator

  • must_not: the document must not satisfy any of the query conditions, similar to NOT in SQL. It does not take part in score calculation; when it is the only clause, the returned scores are all 0

  • filter: first filters out the documents that satisfy the conditions, without calculating a score. In general, use filter first to narrow down the data and then use a scoring query for precise matching, which improves query efficiency

must query

When using a must query, documents must match all of the query conditions included therein.

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "age": 20
          }
        }
      ]
    }
  }
}

This query is equivalent to the corresponding SQL statement below

SELECT * FROM xxx WHERE age = 20;

When using must, you can specify multiple query conditions at the same time. In DSL, it is expressed in the form of an array, and the effect is similar to the AND operation in SQL. For example the following example:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "age": 20 } },
        { "term": { "gender": "male" } }
      ]
    }
  }
}

should query

The should query is similar to the OR statement in SQL. When it contains two or more conditions, a result must satisfy at least one of them. When it contains only one condition, a result must satisfy that condition.

{
  "query": {
    "bool": {
      "should": [
        { "term": { "age": 20 } },
        { "term": { "gender": "male" } },
        { "range": { "height": { "gte": 170 } } }
      ]
    }
  }
}

This query is equivalent to the corresponding SQL statement below:

SELECT * FROM xxx WHERE age = 20 OR gender = "male" OR height >= 170;

The difference between the should query and the OR operation in SQL is that the should query can use the minimum_should_match parameter to specify how many conditions must be satisfied at least. In the following example, a result has to satisfy two or more of the query conditions:

{
  "query": {
    "bool": {
      "should": [
        { "term": { "age": 20 } },
        { "term": { "gender": "male" } },
        { "term": { "height": 170 } }
      ],
      "minimum_should_match": 2
    }
  }
}

If the same bool statement contains no must or filter clause, minimum_should_match defaults to 1, that is, at least one should condition must be satisfied. If a must or filter clause is present, minimum_should_match defaults to 0, which means the should clauses no longer restrict the result by default (they only influence the score).

For example, in the following query every returned document must have an age of 20. Without minimum_should_match, the result could also include documents whose status is not “active”; setting "minimum_should_match": 1 on the bool query, as done here, makes both conditions take effect.

{
  "query": {
    "bool": {
      "must": {
        "term": {
          "age": 20
        }
      },
      "should": {
        "term": {
          "status": "active"
        }
      },
      "minimum_should_match": 1
    }
  }
}

must_not query

The must_not query is similar to the NOT operation in SQL statements, it will only return documents that do not meet the specified conditions. For example:

{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "age": 20 } },
        { "term": { "gender": "male" } }
      ]
    }
  }
}

This query is equivalent to the following SQL statement (written with != instead of NOT):

SELECT * FROM xxx WHERE age != 20 AND gender != "male";

In addition, must_not works like filter: it is executed as a filter and does not calculate document scores, so a query that contains only must_not returns results with a score of 0.

filter query

A filter query has the same matching effect as a must query. The difference is that it only filters out the documents that satisfy the conditions and does not calculate a score.

For example, the following query returns all documents whose status is "active", and their scores are all 0.0.

{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}

Boolean combination query

Bool queries can also be nested inside one another. Note that must, should, must_not and filter clauses are only valid inside a bool query, so the nested query has to be wrapped in its own bool statement.

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              { "term": { "age": 20 } },
              { "term": { "age": 25 } }
            ]
          }
        },
        {
          "range": {
            "level": {
              "gte": 3
            }
          }
        }
      ]
    }
  }
}

This query statement is equivalent to the following SQL statement:

SELECT * FROM xxx WHERE (age = 20 OR age = 25) AND level >= 3;
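
The combination rules of must, should, must_not and filter, including the default value of minimum_should_match, can be checked with a small in-memory evaluator. This is a sketch under stated assumptions: documents are plain dicts, only term, range and bool are supported, minimum_should_match is an integer, scoring is ignored, and the function names are hypothetical.

```python
import operator

RANGE_OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}

def as_list(clauses):
    """bool clauses may be written as a single object or as an array."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]

def matches(query, doc):
    """Return True if doc (a plain dict) satisfies a term, range or bool query."""
    (kind, body), = query.items()
    if kind == "term":
        (field, value), = body.items()
        if isinstance(value, dict):  # long form: {"field": {"value": ...}}
            value = value["value"]
        return doc.get(field) == value
    if kind == "range":
        (field, bounds), = body.items()
        if field not in doc:
            return False
        return all(RANGE_OPS[op](doc[field], bound) for op, bound in bounds.items())
    if kind == "bool":
        must = as_list(body.get("must")) + as_list(body.get("filter"))
        should = as_list(body.get("should"))
        must_not = as_list(body.get("must_not"))
        if not all(matches(q, doc) for q in must):
            return False
        if any(matches(q, doc) for q in must_not):
            return False
        if not should:
            return True
        # Default described in the text: 1 without must/filter, otherwise 0.
        minimum = body.get("minimum_should_match", 0 if must else 1)
        return sum(matches(q, doc) for q in should) >= minimum
    raise ValueError(f"unsupported query type: {kind}")

nested = {"bool": {"must": [
    {"bool": {"should": [{"term": {"age": 20}}, {"term": {"age": 25}}]}},
    {"range": {"level": {"gte": 3}}},
]}}
print(matches(nested, {"age": 25, "level": 3}))
print(matches(nested, {"age": 30, "level": 5}))
```

The nested query is the content of "query" from the example above, so the two calls correspond to rows that would and would not be returned by the SQL statement.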

Multi-field query [multi_match]

The query text is segmented into terms, and each term is then queried against the specified fields

For example, “instant noodles” is segmented into “instant” and “noodles”, and each term is used for the query separately

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "instant noodles",
      "fields": ["title","description"]
    }
  }
}

Default field word segmentation query [query_string]

  • If the type of the queried field is not word-segmented, the query text is not segmented either
  • If the type of the queried field is word-segmented, the query text is segmented before querying

GET /products/_search
{
  "query": {
    "query_string": {
      "default_field": "description",
      "query": "xxxx"
    }
  }
}

Highlight query [highlight]

Key words in eligible documents can be highlighted

  • Only string fields such as text and keyword are highlighted; other field types are skipped
  • * means match all fields
  • Highlighting does not modify the original document; the highlighted fragments are returned in a separate highlight section of each hit

GET /products/_search
{
  "query": {
    "term": {
      "description": {
        "value": "instant noodles"
      }
    }
  },
  "highlight": {
    "fields": {
      "*":{}
    }
  }
}

Custom highlight html tags: You can use pre_tags and post_tags in highlight

GET /products/_search
{
  "query": {
    "term": {
      "description": {
        "value": "xxx"
      }
    }
  },
  "highlight": {
    "post_tags": ["</span>"],
    "pre_tags": ["<span style='color:red'>"],
    "fields": {
      "*":{}
    }
  }
}

Multi-field highlighting: set require_field_match to false to also highlight matches in fields other than the one being queried

GET /products/_search
{
  "query": {
    "term": {
      "description": {
        "value": "xxx"
      }
    }
  },
  "highlight": {
    "require_field_match": "false",
    "post_tags": ["</span>"],
    "pre_tags": ["<span style='color:red'>"],
    "fields": {
      "*":{}
    }
  }
}
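
How the highlight section relates to the untouched _source can be pictured with a short Python sketch. It is only an approximation under simple assumptions (whole-word matching on lowercase words, one fragment per field); the function name and the sample hit are hypothetical, and the default tags used here are the <em> and </em> pair that Elasticsearch applies when no custom tags are given.

```python
import re

def highlight_field(text, query_terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap every word of text that equals one of the query terms in the given tags."""
    wanted = {term.lower() for term in query_terms}

    def wrap(match):
        word = match.group(0)
        return f"{pre_tag}{word}{post_tag}" if word.lower() in wanted else word

    return re.sub(r"\w+", wrap, text)

source = {"title": "Pigman", "description": "5 cents a pack"}
hit = {
    "_source": source,  # the stored document stays unchanged
    "highlight": {
        "description": [
            highlight_field(source["description"], ["pack"],
                            pre_tag="<span style='color:red'>", post_tag="</span>")
        ]
    },
}
print(hit["highlight"]["description"][0])
```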

Return the specified number of items [size]

size: specifies the number of records to return from the query results. The default is 10

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "size": 5
}

Paging query [from]

from: specifies the offset of the first result to return; combined with size, it implements paging

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "size": 5,
  "from": 0 # (page - 1) * size
}
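
The relationship between a 1-based page number and from can be written as a small helper. This is a minimal sketch; the function names are hypothetical and the returned dict is just the request body shown above built in Python.

```python
def page_to_from(page, size=10):
    """Offset of the first hit on a 1-based page: (page - 1) * size."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if size < 0:
        raise ValueError("size must not be negative")
    return (page - 1) * size

def build_page_query(page, size=10):
    """Request body for one page of a match_all search."""
    return {
        "query": {"match_all": {}},
        "from": page_to_from(page, size),
        "size": size,
    }

print(build_page_query(page=3, size=5))
```

Keep in mind that from + size is capped by the index setting index.max_result_window, which defaults to 10000, so very deep pages need a different approach.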

Specify field sorting [sort]

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

Return the specified field [_source]

_source: an array that specifies which fields to include in the returned results

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["title","description"]
}
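
The effect of sort, from, size and _source on a result set can be imitated on a plain Python list. This is a conceptual sketch only: the function name is hypothetical, the sample data are the three products used later in this article, and the real engine sorts and pages on the shards rather than on a finished list.

```python
def shape_results(sources, sort_field=None, order="asc", from_=0, size=10, source_fields=None):
    """Sort the documents, cut out one page, then keep only the requested fields."""
    hits = list(sources)
    if sort_field is not None:
        hits.sort(key=lambda doc: doc[sort_field], reverse=(order == "desc"))
    page = hits[from_:from_ + size]
    if source_fields is not None:
        page = [{k: v for k, v in doc.items() if k in source_fields} for doc in page]
    return page

products = [
    {"title": "Blue Moon laundry detergent", "price": 19.9,
     "description": "Blue Moon laundry detergent is very efficient"},
    {"title": "iphone13", "price": 19.9, "description": "very good phone"},
    {"title": "Little raccoon crisp noodles", "price": 1.5,
     "description": "Little raccoon is delicious"},
]

# Cheapest product first, one hit per page, only the title field returned.
print(shape_results(products, sort_field="price", order="asc", size=1, source_fields=["title"]))
```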

Index principle

An inverted index is also called a reverse index; where there is a reverse direction, there is also a forward one. A forward index finds the value through the key, and an inverted index finds the key through the value.

Under the hood, ES uses an inverted index when searching.

Test case

Existing indexes and mappings are as follows:

{
  "products" : {
    "mappings" : {
      "properties" : {
        "description" : {
          "type" : "text"
        },
        "price" : {
          "type" : "float"
        },
        "title" : {
          "type" : "keyword"
        }
      }
    }
  }
}

Enter the following data

_id | title                        | price | description
1   | Blue Moon laundry detergent  | 19.9  | Blue Moon laundry detergent is very efficient
2   | iphone13                     | 19.9  | very good phone
3   | Little raccoon crisp noodles | 1.5   | Little raccoon is delicious

Visual representation

[Figure: ElasticSearch.assets/image-20220410092110246.png (image not available)]

  • ES builds the index according to whether a field can be word-segmented. If it can, the index is built on the individual words; if it cannot, the index is built on the entire field value:
    • For example, the keyword type cannot be word-segmented: when indexing, the entire field value is used as the index key
    • The text type can be word-segmented: the field value is segmented into words first, and the index is then built on those words
  • The ES index is similar in structure to an index created by MySQL's InnoDB engine. The key of the index structure stores the indexed field value, and the value stores the ids of the documents that contain it. A query first finds the ids through the index, and then uses them to fetch the corresponding complete documents from the metadata area
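
The points above can be made concrete with a toy inverted index over the three test documents. This is a minimal sketch, assuming a simplified tokenizer (lowercase words) for text fields; the names MAPPING, DOCS, build_inverted_index and lookup are hypothetical, and a real Lucene index stores considerably more than a set of ids per term.

```python
import re
from collections import defaultdict

MAPPING = {"title": "keyword", "price": "float", "description": "text"}

DOCS = {
    1: {"title": "Blue Moon laundry detergent", "price": 19.9,
        "description": "Blue Moon laundry detergent is very efficient"},
    2: {"title": "iphone13", "price": 19.9, "description": "very good phone"},
    3: {"title": "Little raccoon crisp noodles", "price": 1.5,
        "description": "Little raccoon is delicious"},
}

def field_terms(value, field_type):
    """text fields are split into lowercase words; other types keep the whole value."""
    if field_type == "text":
        return re.findall(r"\w+", str(value).lower())
    return [value]

def build_inverted_index(docs, mapping):
    """Map field -> term -> set of ids of the documents containing that term."""
    index = {field: defaultdict(set) for field in mapping}
    for doc_id, doc in docs.items():
        for field, field_type in mapping.items():
            for term in field_terms(doc[field], field_type):
                index[field][term].add(doc_id)
    return index

def lookup(index, docs, field, term):
    """First find the ids through the index, then fetch the full documents by id."""
    return [docs[doc_id] for doc_id in sorted(index[field].get(term, ()))]

index = build_inverted_index(DOCS, MAPPING)
print(sorted(index["price"][19.9]))                  # ids sharing the same price
print(lookup(index, DOCS, "description", "very"))    # word from a text field
print(lookup(index, DOCS, "title", "Blue"))          # keyword fields are not split
```

The last lookup returns nothing because the title field is indexed as one whole value, so only the complete title would be found.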