[ElasticSearch] In-depth exploration of DSL query syntax to achieve different levels of retrieval of documents, as well as sorting, paging and highlighting of search results

Article directory

  • Foreword
  • 1. Classification of Elasticsearch DSL Query
  • 2. Full text search query
    • 2.1 `match` query
    • 2.2 `multi_match` query
  • 3. Precise query
    • 3.1 term query
    • 3.2 range query
  • 4. Geographical coordinate query
    • 4.1 geo_bounding_box query
    • 4.2 geo_distance query
  • 5. Compound query
    • 5.1 function score query
    • 5.2 boolean query
  • 6. Processing of search results
    • 6.1 Sorting search results
    • 6.2 Paginating search results
    • 6.3 Highlight search keywords in search results

Foreword

Elasticsearch (ES for short) is a powerful open source search and analytics engine used in a wide variety of applications, from enterprise-level search engines to log and metric analysis. Its power lies in its flexible data model and rich query language, which allows users to easily perform full-text retrieval, precise query, geographical coordinate query and other operations.

This article explores Elasticsearch’s query DSL (Domain Specific Language) in several parts. We first look at how DSL queries are classified, and then examine full-text search queries, precise queries, geographical coordinate queries, and compound queries in turn.

Each section provides detailed syntax and practical examples so that readers can better understand and use the powerful query capabilities in Elasticsearch. Finally, we will introduce practical processing techniques such as sorting, paging, and search keyword highlighting to improve the search experience.

Let’s dive into Elasticsearch DSL Query and discover how to leverage its powerful features to improve search and analysis.

1. Classification of Elasticsearch DSL Query

Elasticsearch provides a powerful and flexible query DSL (Domain Specific Language) for defining different types of queries on the documents in an index. Some common DSL query types are outlined below:

1. Query all – match_all

Matches all documents. It is typically used for testing or to retrieve every document in an index.

{
  "query": {
    "match_all": {}
  }
}

2. Full text search query

  • match query

Uses an analyzer to split the text entered by the user into terms, and then matches those terms against the index.

{
  "query": {
    "match": {
      "field_name": "search_text"
    }
  }
}

  • multi_match query

Performs a full-text search across multiple fields.

{
  "query": {
    "multi_match": {
      "query": "search_text",
      "fields": ["field1", "field2"]
    }
  }
}

3. Precise query

  • ids Query

Retrieves documents by document ID.

{
  "query": {
    "ids": {
      "values": ["doc_id1", "doc_id2"]
    }
  }
}

  • term Query

Finds documents by an exact term value. Suitable for keyword, numeric, date, boolean, and similar field types.

{
  "query": {
    "term": {
      "field_name": "exact_value"
    }
  }
}

  • range Query

Finds documents whose field value falls within a range. Suitable for numeric values, dates, and similar types.

{
  "query": {
    "range": {
      "field_name": {
        "gte": "start_value",
        "lte": "end_value"
      }
    }
  }
}

4. Geographical query

  • geo_distance Query

Finds documents within a specified distance of a point given by latitude and longitude.

{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 40.73,
        "lon": -73.98
      }
    }
  }
}

  • geo_bounding_box Query

Finds documents located inside a specified rectangle.

{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": {
          "lat": 40.73,
          "lon": -74.1
        },
        "bottom_right": {
          "lat": 40.01,
          "lon": -71.12
        }
      }
    }
  }
}

5. Compound query

  • bool query

Combines multiple query conditions using logical clauses such as must, must_not, and should.

{
  "query": {
    "bool": {
      "must": [
        { "match": { "field1": "value1" } },
        { "range": { "field2": { "gte": 10, "lte": 20 } } }
      ],
      "must_not": [
        { "term": { "field3": "value2" } }
      ],
      "should": [
        { "match": { "field4": "value3" } }
      ]
    }
  }
}

  • function_score Query

Re-scores the matching documents using one or more score functions, which makes it possible to boost documents that meet specific conditions.

{
  "query": {
    "function_score": {
      "query": { "match": { "field": "search_text" } },
      "functions": [
        {
          "filter": { "range": { "field2": { "gte": 10, "lte": 20 } } },
          "weight": 2
        }
      ],
      "score_mode": "multiply"
    }
  }
}

The above are some common Elasticsearch DSL query types. When used, these query conditions can be flexibly combined according to specific needs to achieve complex search and filtering functions. Below are detailed descriptions and query demonstrations for these different queries.

2. Full text search query

A full-text search query analyzes the text entered by the user, splitting it into terms, and then matches those terms against the index, which makes text search more flexible. In Elasticsearch, the commonly used full-text query types are match and multi_match.
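The idea of splitting text into terms and looking them up in an index can be illustrated with a small Python sketch. This is only a toy model: the `analyze` function (lowercase, split on non-word characters) is an assumption standing in for a real analyzer, and the sample documents are made up.

```python
import re
from collections import defaultdict

def analyze(text):
    """Very simplified stand-in for an analyzer: lowercase, split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def build_inverted_index(docs):
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def match(index, query_text):
    """Return IDs of documents containing at least one query term (OR semantics)."""
    hits = set()
    for term in analyze(query_text):
        hits |= index.get(term, set())
    return hits

# Made-up sample documents
docs = {
    1: "Home Inn Shanghai Pudong",
    2: "Hilton Shanghai",
    3: "Home Inn Beijing",
}
index = build_inverted_index(docs)
print(sorted(match(index, "Home Inn")))  # documents 1 and 3
```

The sketch ignores relevance scoring entirely; it only shows why a query for several words can match documents that contain any one of them.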

2.1 match query

The match query searches a single field and is suited to full-text retrieval on that field. In practice, you can use copy_to to copy the values of several fields into one field, which gives an effect similar to multi_match.

For example, a full-text search query for the hotel index:

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "Home Inn"
    }
  }
}

Here the all field is assumed to collect the contents of several fields via copy_to, namely “brand”, “business”, and “name”. The query results are as follows:

[Figure: match query results]

2.2 multi_match Query

The multi_match query runs a full-text search across multiple fields. Note that the more fields a query involves, the slower it is likely to be.

For example, using the multi_match query:

GET /hotel/_search
{
  "query": {
    "multi_match": {
      "query": "Home Inn",
      "fields": ["brand", "business", "name"]
    }
  }
}

In this example, the fields participating in the query are again “brand”, “business”, and “name”. The query results are as follows:

[Figure: multi_match query results]

The match and multi_match queries return the same results here. In practice, which one to use depends on your requirements and on performance considerations.
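The all field used in section 2.1 depends on copy_to being declared in the index mapping. The article does not show that mapping, so the following Python sketch builds a hypothetical one: the field types (text and keyword) are assumptions, and only the copy_to relationships come from the description above.

```python
import json

# Hypothetical mapping for the hotel index; the field types are assumptions.
hotel_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text", "copy_to": "all"},
            "brand": {"type": "keyword", "copy_to": "all"},
            "business": {"type": "keyword", "copy_to": "all"},
            # Target field that collects the copied values for searching
            "all": {"type": "text"},
        }
    }
}

def fields_copied_to(mapping, target):
    """List the fields whose copy_to setting points at the target field."""
    props = mapping["mappings"]["properties"]
    return sorted(name for name, spec in props.items() if spec.get("copy_to") == target)

print(json.dumps(hotel_mapping, indent=2))     # body that could be sent with PUT /hotel
print(fields_copied_to(hotel_mapping, "all"))  # the three source fields
```

With a mapping along these lines, a single match query on all searches the combined content of the three source fields.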

3. Precise query

Precise query is a commonly used query method in search engines, and is especially suitable for precise retrieval of keywords, numerical values, dates, boolean and other types of fields. In applications such as hotel booking websites, users usually want to perform accurate information retrieval based on specific conditions, such as city, star rating, brand, price range, etc.

The following will introduce how to use term and range to perform precise queries in Elasticsearch.

3.1 term query

The term query is used to query based on the exact value of a term. For example, in a hotel booking website, if the user wants to find all hotels in Shanghai, the following DSL statement can be used:

GET /hotel/_search
{
  "query": {
    "term": {
      "city": {
        "value": "Shanghai"
      }
    }
  }
}

The above query returns all hotels in Shanghai. Note, however, that a term query does not analyze the search value: it must match the indexed term exactly. If you change the value to “Shanghai Beijing”, no documents match.

Therefore, when using a term query, make sure the search value is exactly the term stored in the index.
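The exact-match behaviour can be modelled with a short Python sketch. It assumes that city is stored as a single unanalyzed value (as a keyword field would be); the sample hotels are made up.

```python
def term_query(docs, field, value):
    """Keep documents whose field equals the value exactly (keyword-style comparison)."""
    return [doc for doc in docs if doc.get(field) == value]

# Made-up sample data
hotels = [
    {"name": "Hotel A", "city": "Shanghai"},
    {"name": "Hotel B", "city": "Beijing"},
    {"name": "Hotel C", "city": "Shanghai"},
]

print(len(term_query(hotels, "city", "Shanghai")))          # two hotels match
print(len(term_query(hotels, "city", "Shanghai Beijing")))  # nothing matches
```

Because the value is compared as a whole, “Shanghai Beijing” is not treated as two separate city names.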

3.2 range query

range query is used to query based on the range of values, especially suitable for numeric fields, such as price ranges. For example, if a user wants to find hotels with prices between 100 and 200, they can use the following DSL statement:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 100,
        "lte": 200
      }
    }
  }
}


In the above query, gte is used to indicate greater than or equal to a certain value, and lte is used to indicate less than or equal to a certain value. Through such queries, hotel information that meets the price range can be accurately obtained.
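The bound semantics can be sketched in Python as follows. Besides gte and lte, the range query also accepts the strict bounds gt and lt; the `in_range` helper and the price list are illustrative only.

```python
def in_range(value, gte=None, gt=None, lte=None, lt=None):
    """Return True when value satisfies every bound that was supplied."""
    if gte is not None and value < gte:
        return False
    if gt is not None and value <= gt:
        return False
    if lte is not None and value > lte:
        return False
    if lt is not None and value >= lt:
        return False
    return True

prices = [80, 100, 150, 200, 260]
print([p for p in prices if in_range(p, gte=100, lte=200)])  # inclusive bounds
print([p for p in prices if in_range(p, gt=100, lt=200)])    # strict bounds
```

The inclusive form keeps the boundary values 100 and 200, while the strict form drops them.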

In general, precise queries are a very common and practical feature of search engines. Used appropriately, term and range queries meet users’ needs for exact information retrieval.

4. Geographical coordinate query

Geographic coordinate query is one of the common functions in Elasticsearch. It is especially suitable for scenarios that require searching based on geographical location information, such as querying nearby hotels, taxis, or nearby people. Two commonly used geographical coordinate query methods are introduced below: geo_bounding_box and geo_distance.

4.1 geo_bounding_box query

The geo_bounding_box query searches for all documents that have a geo_point value within a certain rectangle. Here is an example:

GET /indexName/_search
{
  "query": {
    "geo_bounding_box": {
      "FIELD": {
        "top_left": {
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": {
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}

In this example, by specifying the latitude and longitude of the upper left corner and lower right corner of the rectangle, you can query all documents located within this rectangle.
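The rectangle test itself is simple, as this Python sketch shows. It is a simplified model that assumes the box does not cross the 180° meridian; the function name is made up for illustration.

```python
def in_bounding_box(lat, lon, top_left, bottom_right):
    """Check whether a point lies inside the rectangle spanned by two corners.

    Simplification: assumes the box does not cross the 180-degree meridian.
    """
    lat_ok = bottom_right["lat"] <= lat <= top_left["lat"]
    lon_ok = top_left["lon"] <= lon <= bottom_right["lon"]
    return lat_ok and lon_ok

# Corners taken from the example query above
top_left = {"lat": 31.1, "lon": 121.5}
bottom_right = {"lat": 30.9, "lon": 121.7}

print(in_bounding_box(31.0, 121.6, top_left, bottom_right))  # inside the rectangle
print(in_bounding_box(31.2, 121.6, top_left, bottom_right))  # too far north
```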

[Figure: the search range, rectangular in shape]

4.2 geo_distance query

The geo_distance query is used to query all documents whose distance to the specified center point is less than a certain distance value. Here is an example:

GET /indexName/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km",
      "FIELD": "31.21,121.5"
    }
  }
}

In this example, by specifying the latitude, longitude and distance value of the center point, you can query all documents that are less than 15 kilometers away from the center point.
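To get a feel for what “within 15 km of a center point” means, the following Python sketch computes great-circle distance with the haversine formula. This is an illustration under a spherical-Earth assumption (radius 6371 km), not a statement of how Elasticsearch computes distances internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(point, center, distance_km):
    """Return True when the point is no farther than distance_km from the center."""
    return haversine_km(point[0], point[1], center[0], center[1]) <= distance_km

center = (31.21, 121.5)            # center point from the example query (lat, lon)
hotel = (31.224612, 121.507712)    # a nearby point, used here only for illustration

print(round(haversine_km(*hotel, *center), 2))  # distance in kilometers
print(within_distance(hotel, center, 15))
```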

[Figure: the search range, circular in shape]

Through these two methods, flexible and accurate queries can be implemented in geographical coordinate information to meet users’ location search needs in different application scenarios.

5. Compound query

Compound queries can combine other simple queries to implement more complex search logic, for example:

  • function score: The function score query controls the relevance score of documents and therefore their ranking, similar to the promoted advertisements at the top of Baidu search results.

  • Boolean query: A combination of one or more query clauses. Subqueries can be combined in the following ways:

    • must: Every subquery must match, similar to “and”.
    • should: Subqueries may optionally match, similar to “or”.
    • must_not: Must not match; does not participate in scoring, similar to “not”.
    • filter: Must match; does not participate in scoring.

5.1 function score query

Suppose that some of the documents in the index are advertisements, and we want them to appear at the top of the query results. We can use the function score query to modify the score of specific documents, for example to rank hotels of the “Home Inn” brand higher.

To achieve this goal, we need to provide the following three elements:

  1. Which documents need to be weighted?

    • Hotels whose brand is Home Inn.
  2. What is the score function?

    • weight.
  3. What is the weighting mode?

    • multiply (the product of the query score and the function score).

The query DSL statement is as follows:

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "all": "Shanghai"
        }
      },
      "functions": [
        {
          "filter": {
            "term": {
              "brand": "Home Inn"
            }
          },
          "weight": 10
        }
      ],
      "boost_mode": "multiply"
    }
  }
}

Description:

  • query: The original query. Documents are searched and scored by relevance (the query score).
  • filter: The filter condition. Only documents that satisfy it are re-scored.
  • weight: The score function. Its result is called the function score, which is then combined with the query score to produce the new score. Common score functions include:
    • weight: Uses a constant value as the function score.
    • field_value_factor: Uses the value of a field in the document as the function score.
    • random_score: Generates a random value as the function score.
    • script_score: Uses a custom formula; the result of the formula is the function score.
  • boost_mode: The weighting mode, which defines how the function score and the query score are combined:
    • multiply: Multiply the two. This is the default.
    • replace: Replace the query score with the function score.
    • Others: sum, avg, max, min.
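How boost_mode combines the two scores can be sketched in a few lines of Python. This is a simplified model of the arithmetic only; the `combine` helper and the sample hits with their scores are made up for illustration.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query score and the function score according to boost_mode."""
    modes = {
        "multiply": lambda q, f: q * f,
        "replace": lambda q, f: f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": max,
        "min": min,
    }
    if boost_mode not in modes:
        raise ValueError(f"unknown boost_mode: {boost_mode}")
    return modes[boost_mode](query_score, function_score)

# Made-up hits with their original query scores
hits = [
    {"name": "Hotel A", "brand": "Home Inn", "_score": 1.5},
    {"name": "Hotel B", "brand": "Hilton", "_score": 2.0},
]
for hit in hits:
    # Only documents passing the filter get the weight applied; in this sketch
    # the other documents simply keep their query score.
    if hit["brand"] == "Home Inn":
        hit["_score"] = combine(hit["_score"], 10, "multiply")

ranked = sorted(hits, key=lambda h: h["_score"], reverse=True)
print([(h["name"], h["_score"]) for h in ranked])  # Hotel A now ranks first
```

With weight 10 and multiply, the Home Inn hotel moves from a score of 1.5 to 15.0 and overtakes the hotel that originally scored higher.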

Query results:

[Figure: function score query results]

5.2 boolean query

For example, suppose we need to search for hotels whose name contains “Home Inn”, whose price is not higher than 400, and that are within 10 km of the coordinates 31.21, 121.5 (latitude, longitude).

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "Home Inn"
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "price": {
              "gt": 400
            }
          }
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": {
              "lat": 31.21,
              "lon": 121.5
            }
          }
        }
      ]
    }
  }
}

Description:

  • must: Must match each subquery, similar to “and”.
  • must_not: must not match and does not participate in scoring, similar to “not”.
  • filter: Must match and does not participate in scoring.

Search results:

[Figure: Boolean query results]

Through this example, we can see how to use Boolean queries to combine multiple conditions to achieve more precise searches.
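The way the clauses combine can be modelled with plain predicates in Python. This sketch covers only which documents are kept, not how they are scored; the `bool_query` helper, the sample hotels, and the precomputed `distance_km` field are all hypothetical.

```python
def bool_query(docs, must=(), must_not=(), filters=()):
    """Keep documents that satisfy every must and filter clause and no must_not clause."""
    result = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue
        if not all(clause(doc) for clause in filters):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        result.append(doc)
    return result

# Made-up hotels; distance_km is a hypothetical precomputed distance to the center point
hotels = [
    {"name": "Home Inn Pudong", "price": 300, "distance_km": 3},
    {"name": "Home Inn Bund", "price": 500, "distance_km": 2},
    {"name": "Hilton", "price": 350, "distance_km": 1},
    {"name": "Home Inn Airport", "price": 400, "distance_km": 25},
    {"name": "Home Inn Center", "price": 400, "distance_km": 5},
]

matches = bool_query(
    hotels,
    must=[lambda d: "home inn" in d["name"].lower()],  # name contains "Home Inn"
    must_not=[lambda d: d["price"] > 400],              # exclude prices above 400
    filters=[lambda d: d["distance_km"] <= 10],         # within 10 km
)
print([d["name"] for d in matches])
```

Note that a hotel priced at exactly 400 is kept, because only prices strictly above 400 are excluded.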

6. Processing of search results

6.1 Sort search results

Elasticsearch supports sorting search results. By default, results are sorted by relevance score (_score). Sortable field types include keyword, numeric, geo point (geographical coordinates), and date.

Example 1: Sort hotel data by user rating in descending order, then by price in ascending order

The rating is stored in the score field and the price in the price field. Just add the two sort rules in order.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "score": "desc"
    },
    {
      "price": "asc"
    }
  ]
}
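The effect of listing two sort rules in order can be reproduced in Python with a composite sort key. The sample hotels below are made up; the second rule only matters when two hotels have the same rating.

```python
# Made-up sample data: score is the user rating, price is the room price
hotels = [
    {"name": "Hotel A", "score": 45, "price": 300},
    {"name": "Hotel B", "score": 47, "price": 500},
    {"name": "Hotel C", "score": 47, "price": 200},
]

# First rule: score descending (negated); second rule: price ascending as tiebreaker
ranked = sorted(hotels, key=lambda h: (-h["score"], h["price"]))
print([h["name"] for h in ranked])  # Hotel C, Hotel B, Hotel A
```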

Example 2: Sort hotel data in ascending order of distance from the point with longitude 121.507712 and latitude 31.224612

How to get the latitude and longitude: https://lbs.amap.com/demo/jsapi-v2/example/map/click-to-get-lnglat/.

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat": 31.224612,
          "lon": 121.507712
        },
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}

6.2 Paging search results

By default, Elasticsearch returns only the top 10 documents. To retrieve more, you need to change the paging parameters: the page of results to return is controlled by the from and size parameters.

The basic syntax is as follows:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 990, // The starting offset of the page, the default is 0
  "size": 10, // The number of documents to return
  "sort": [
    {"price": "asc"}
  ]
}

Example: Use a match_all query and then paginate the results

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 5
}

Only 5 results are now returned:

[Figure: paging results]

Deep pagination issues:

Elasticsearch is distributed, so it faces the deep paging problem. For example, to fetch the page with from = 990 and size = 10 after sorting by price:

  1. First, each shard sorts its own data and returns its top 1000 documents.
  2. Then the results from all shards are aggregated, and the top 1000 documents are selected by re-sorting in memory.
  3. Finally, from these 1000 documents, the 10 documents starting at position 990 are returned.

The deeper the page, or the larger the result set (from + size), the higher the memory and CPU consumption. For this reason, Elasticsearch caps from + size at 10,000 by default (the index.max_result_window setting).
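The three steps above can be simulated in Python to show how much data is gathered for a single deep page. This is a toy model: shards are plain lists of prices, and the `paged_search` helper is made up for illustration.

```python
import heapq

def paged_search(shards, from_, size):
    """Simulate from/size paging over several shards, sorted by price ascending."""
    window = from_ + size
    # Step 1: every shard sorts its own data and returns its top `window` entries.
    per_shard = [sorted(shard)[:window] for shard in shards]
    # Step 2: the coordinating node merges the sorted lists and keeps the top `window`.
    merged = list(heapq.merge(*per_shard))[:window]
    # Step 3: only the last `size` entries of that window are returned.
    page = merged[from_:window]
    collected = sum(len(part) for part in per_shard)
    return page, collected

# Three made-up shards with 2000 prices each
shards = [list(range(i, 6000, 3)) for i in range(3)]
page, collected = paged_search(shards, from_=990, size=10)
print(page)       # the 10 prices at positions 990..999
print(collected)  # entries gathered across shards to produce those 10
```

In this simulation, 3000 entries are collected and sorted in order to return 10, which is why the cost grows with from + size and with the number of shards.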

Deep pagination solutions:

For deep paging, Elasticsearch provides two solutions:

  1. **search after:** Requires sorting. It works by fetching the next page starting after the sort values of the last document on the previous page. This is the officially recommended method.
  2. **scroll:** Takes a snapshot of the sorted data and keeps it in memory. It is no longer officially recommended for deep paging.
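As a rough illustration of search after, the Python sketch below builds the request body for the next page from the sort values of the last hit on the previous page. The helper names are made up, and the `id` tiebreaker field is a hypothetical keyword field assumed to exist in the hotel index so that the sort order is unique.

```python
import copy

def first_page_body(size):
    """Request body for the first page, sorted by price with a tiebreaker."""
    return {
        "query": {"match_all": {}},
        "size": size,
        # "id" is a hypothetical keyword field used only as a tiebreaker
        "sort": [{"price": "asc"}, {"id": "asc"}],
    }

def next_page_body(previous_body, last_hit):
    """Build the next request from the sort values of the previous page's last hit."""
    body = copy.deepcopy(previous_body)
    body.pop("from", None)                   # search_after is not combined with a from offset
    body["search_after"] = last_hit["sort"]  # each hit carries its sort values
    return body

# Made-up last hit of the previous page
last_hit = {"_id": "h42", "_source": {"price": 200}, "sort": [200, "h42"]}
body = next_page_body(first_page_body(10), last_hit)
print(body["search_after"])
```

Each request fetches only size documents per shard regardless of how deep the page is, which is what avoids the from + size cost.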

6.3 Highlighting search keywords in search results

Highlighting marks the search keywords in the search results so that they stand out.

Principle:

  • Wrap the keywords in the search results with tags.
  • Add CSS styles for those tags on the page.
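The tagging step can be imitated with a small Python sketch. This is only a toy model of the idea: the `highlight` helper is made up, and it matches whole words case-insensitively instead of using a real analyzer.

```python
import re

def highlight(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap each whole-word occurrence of the given terms in pre/post tags."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.sub(
        pattern,
        lambda m: pre_tag + m.group(0) + post_tag,
        text,
        flags=re.IGNORECASE,
    )

print(highlight("Home Inn Shanghai Pudong", ["home", "inn"]))
# <em>Home</em> <em>Inn</em> Shanghai Pudong
```

The page then only needs a CSS rule for the em tag (or whichever tag is chosen) to make the keywords stand out.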

Syntax:

GET /hotel/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT"
    }
  },
  "highlight": {
    "fields": {
      "FIELD": {
        "pre_tags": "<em>", // Tag inserted before each highlighted keyword
        "post_tags": "</em>" // Tag inserted after each highlighted keyword
      }
    }
  }
}

Example: Highlight a searched brand name

GET /hotel/_search
{
  "query": {
    "match": {
      "brand": "Home Inn"
    }
  },
  "highlight": {
    "fields": {
      "brand": {
        "pre_tags": "<em>",
        "post_tags": "</em>"
      }
    }
  }
}

Search results:

[Figure: highlighted search results]

The above are some examples and basic operations for sorting, paginating and keyword highlighting of ElasticSearch search results.