Elasticsearch fuzzy query and intelligent search suggestions

Fuzzy query

Prefix search: prefix

Concept: Matches documents that contain a term starting with the given prefix. No relevance score is calculated.

Note:

- Prefix search matches against individual terms in the inverted index, not against the whole field value.

- Prefix search performance is poor.

- Prefix search is not cached.

- Make the prefix as long as possible to keep the number of matching terms small.

Syntax:

GET <index>/_search
{
  "query": {
    "prefix": {
      "<field>": {
        "value": "<word_prefix>"
      }
    }
  }
}
index_prefixes: a mapping parameter that indexes term prefixes at index time to speed up prefix search. Defaults: "min_chars": 2, "max_chars": 5
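
To see why a prefix query matches terms rather than whole field values, and why a longer prefix is cheaper, the following Python sketch models the term dictionary as a sorted list. It is only an illustration under that assumption, not how Lucene is implemented; the helper names build_terms and prefix_terms are made up for this example.

```python
from bisect import bisect_left

def build_terms(docs):
    """Split each document on whitespace and return the sorted, de-duplicated terms.
    (Hypothetical stand-in for an analyzer plus a term dictionary.)"""
    return sorted({word.lower() for doc in docs for word in doc.split()})

def prefix_terms(terms, prefix):
    """Return every term in the sorted list that starts with the prefix."""
    start = bisect_left(terms, prefix)  # first term >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

docs = ["elasticsearch stack", "elasticsearch search", "elastic blog"]
terms = build_terms(docs)

print(prefix_terms(terms, "el"))               # short prefix: more terms to expand
print(prefix_terms(terms, "elastics"))         # longer prefix: fewer terms
print(prefix_terms(terms, "elasticsearch s"))  # a whole-field prefix matches no term
```

In this model the last call returns nothing, because no single term contains a space. That is the same reason a multi-word prefix on an analyzed text field finds no documents.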

Wildcard: wildcard

Concept: Wildcard operators are placeholders that match one or more characters. For example, the * wildcard operator matches zero or more characters, and the ? wildcard operator matches any single character. You can combine wildcard operators with other characters to create wildcard patterns.

Note:

- Like prefix search, a wildcard query matches against terms, not against the whole field value.

Syntax:

GET <index>/_search
{
  "query": {
    "wildcard": {
      "<field>": {
        "value": "<word_with_wildcard>"
      }
    }
  }
}
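
The difference between matching terms and matching the whole field value can be sketched in Python. This is an approximation that uses fnmatchcase from the standard library, whose * and ? behave like the wildcard operators described above (unlike Elasticsearch, it also treats [ ] as a character class, which the patterns here avoid). The helper name wildcard_terms is hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_terms(terms, pattern):
    """Return the terms that match the pattern in full.
    * matches zero or more characters, ? matches exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]

field_value = "my english is good"
analyzed_terms = field_value.split()  # roughly what an analyzed text field indexes
keyword_terms = [field_value]         # a keyword field indexes the whole value as one term

print(wildcard_terms(analyzed_terms, "eng*ish"))  # ['english']
print(wildcard_terms(analyzed_terms, "my eng*"))  # []
print(wildcard_terms(keyword_terms, "my eng*"))   # ['my english is good']
```

A pattern that spans several words therefore only works against a keyword sub-field such as text.keyword.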

Regular expression: regexp

Concept: The performance of a regexp query can vary depending on the regular expression provided. To improve performance, avoid using wildcard patterns such as .* or .*?+ without a prefix or suffix.

Syntax:

GET <index>/_search
{
  "query": {
    "regexp": {
      "<field>": {
        "value": "<regex>",
        "flags": "ALL"
      }
    }
  }
}

flags

- ALL
  Enable all optional operators.

- COMPLEMENT
  Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example:
  a~bc # matches ‘adc’ and ‘aec’ but not ‘abc’

- INTERVAL
  Enable the <> operator. You can use <> to match numerical ranges. For example
  foo<1-100> # matches ‘foo1’, ‘foo2’ … ‘foo99’, ‘foo100’
  foo<01-100> # matches ‘foo01’, ‘foo02’ … ‘foo99’, ‘foo100’

- INTERSECTION
  Enables the & operator, which acts as an AND operator. The match succeeds only if the patterns on both the left and the right side match. For example:
  aaa.+&.+bbb # matches ‘aaabbb’

- ANYSTRING
  Enables the @ operator. You can use @ to match any entire string.
  You can combine the @ operator with the & and ~ operators to create “everything except” logic. For example:
  @&~(abc.+) # matches everything except terms beginning with ‘abc’

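Fuzzy query: fuzzy

Concept: Returns documents that contain terms similar to the search term, so that common typing mistakes are tolerated:

- Changed characters (box → fox)
- Missing characters (black → lack)
- Extra characters (sic → sick)
- Transposed adjacent characters (act → cat)

Syntax: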

GET <index>/_search
{
  "query": {
    "fuzzy": {
      "<field>": {
        "value": "<keyword>"
      }
    }
  }
}

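Parameters:

- value: (required, string) The term to search for.

- fuzziness: (optional) The maximum edit distance allowed for a match (0, 1, or 2). Bigger is not better: a larger value raises the recall rate but makes the results less accurate.

  1. The Damerau-Levenshtein distance between two pieces of text is the number of insertions, deletions, substitutions and transpositions required to make one string match the other.
  2. Distance algorithm: the classic Levenshtein distance counts only insertions, deletions and substitutions. By default, Elasticsearch uses the Damerau-Levenshtein variant, which also counts a transposition of two adjacent characters as a single edit.

  axe => aex: Levenshtein = 2, Damerau-Levenshtein = 1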

- transpositions: (optional, boolean) Indicates whether the edit includes transpositions of two adjacent characters (ab→ba). Default is true.
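
The two distance measures can be compared with a short Python sketch. It is a textbook dynamic-programming implementation written for illustration, not the code Lucene runs. The transposition-aware function is the restricted "optimal string alignment" form, which is enough to show the effect of the transpositions parameter.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

def damerau_levenshtein(a, b):
    """Same as above, but swapping two adjacent characters costs one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # adjacent transposition, e.g. "xe" <-> "ex"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

for left, right in [("box", "fox"), ("black", "lack"), ("sic", "sick"),
                    ("act", "cat"), ("axe", "aex")]:
    print(left, right, levenshtein(left, right), damerau_levenshtein(left, right))
```

Only the last two pairs differ between the two functions: a swap of adjacent characters costs 2 edits without transpositions and 1 edit with them.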

Phrase prefix: match_phrase_prefix

match_phrase:

- match_phrase analyzes the query text, splitting it into terms.

- The searched field must contain all the terms of the match_phrase query, in the same order.

- No other terms may appear between the query terms in the searched field.

Concept:

match_phrase_prefix works like match_phrase, with one extra feature: the last term of the query text is treated as a prefix. If the query is a single word, such as a, it matches all documents whose field contains a term starting with a. If the query is a phrase, such as “this is ma”, it first looks up the terms prefixed with ma in the inverted index, and then runs a match_phrase query on the matched documents. (Some sources on the Internet claim that match_phrase runs first and the prefix search second; that is wrong.)

Parameters:

- analyzer: the analyzer used to split the query phrase into terms

- max_expansions: the maximum number of terms that the last (prefix) term may expand to; the default is 50

- boost: the weight of the query

- slop: how far apart the query terms may be while the document is still considered a match; the default is 0. "How far apart" means how many times a term has to be moved for the query to match the document.
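
The two-step behaviour described above (prefix expansion first, phrase check second) and the effect of max_expansions can be sketched in Python. This is a simplified model with slop fixed at 0 and whitespace tokenization; the function names tokenize and phrase_prefix_match are hypothetical and do not correspond to any Elasticsearch API.

```python
def tokenize(text):
    """Lowercase and split on whitespace (stand-in for a real analyzer)."""
    return text.lower().split()

def phrase_prefix_match(doc_text, query_text, term_dictionary, max_expansions=50):
    """Simplified match_phrase_prefix with slop 0."""
    query = tokenize(query_text)
    if not query:
        return False
    doc = tokenize(doc_text)
    head, last = query[:-1], query[-1]
    # Step 1: expand the last query term into at most max_expansions index terms.
    expansions = [t for t in sorted(term_dictionary) if t.startswith(last)]
    expansions = expansions[:max_expansions]
    # Step 2: run an exact phrase check for each expansion.
    for candidate in expansions:
        phrase = head + [candidate]
        n = len(phrase)
        if any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1)):
            return True
    return False

docs = ["shouji zhong de hongzhaji", "erji zhong de huangmenji"]
term_dictionary = {t for d in docs for t in tokenize(d)}

print(phrase_prefix_match(docs[0], "zhong de hong", term_dictionary))  # True
print(phrase_prefix_match(docs[0], "de zhong hong", term_dictionary))  # False: wrong order
print(phrase_prefix_match(docs[1], "zhong de h", term_dictionary, max_expansions=1))  # False
```

In the last call the prefix h expands to hongzhaji and huangmenji, but max_expansions=1 keeps only the first, so the second document is missed. A small max_expansions can hide valid matches in the same way, while a large one makes the query more expensive.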

Principle analysis: see "How to Use Fuzzy Searches in Elasticsearch" on the Elastic Blog.

N-gram and edge ngram

tokenizer

GET _analyze
{
  "tokenizer": "ngram",
  "text": "reba always loves me"
}

token filter

min_gram: the minimum length of the character grams produced when the index is built

max_gram: the maximum length of the character grams produced when the index is built

ngram: emits grams of min_gram to max_gram characters starting from every character of a token; suitable for prefix and infix retrieval

edge_ngram: emits grams of min_gram to max_gram characters starting only from the first character of a token; suitable for prefix matching scenarios
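
The difference between the two can be made concrete with a small Python sketch. It works on single tokens and only mirrors the definitions above; the function names ngrams and edge_ngrams are chosen for this example and are not Elasticsearch APIs.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, starting at every position."""
    return [token[i:i + n]
            for i in range(len(token))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(token)]

def edge_ngrams(token, min_gram, max_gram):
    """Only the substrings anchored at the first character of the token."""
    return [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]

print(ngrams("reba", 1, 2))       # ['r', 're', 'e', 'eb', 'b', 'ba', 'a']
print(edge_ngrams("reba", 1, 2))  # ['r', 're']

words = "reba always loves me".split()
print([g for w in words for g in edge_ngrams(w, 2, 3)])
# ['re', 'reb', 'al', 'alw', 'lo', 'lov', 'me']
```

ngram produces many more grams per token, which is the price of supporting infix matches such as eb. edge_ngram keeps the index smaller but can only serve prefixes.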

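# prefix: prefix search
# Note: the ik_max_word analyzer used below requires the IK analysis plugin.
DELETE my_index
# With index_prefixes (min_chars 2, max_chars 4), the term "elasticsearch"
# (from "elasticsearch stack" or "elasticsearch search") also gets these prefixes indexed:
# el
# ela
# elas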
PUT my_index
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "ik_max_word",
        "type": "text",
        "index_prefixes":{
          "min_chars":2,
          "max_chars":4
        },
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
GET my_index/_mapping
POST /my_index/_bulk?filter_path=items.*.error
{"index":{"_id":"1"}}
{"text":"The urban management called the vendors to set up stalls"}
{"index":{"_id":"2"}}
{"text":"Xiaoguo Culture responds to traders and old farmers setting up stalls"}
{"index":{"_id":"3"}}
{"text":"It took the old farmer 17 years to grow the chair tree"}
{"index":{"_id":"4"}}
{"text":"The couple had been married for more than 30 years under the AA system and were arrested by the urban management"}
{"index":{"_id":"5"}}
{"text":"A black man bravely tried to stop a robbery but was handcuffed"}
GET my_index/_search
GET my_index/_mapping
GET _analyze
{
  "text": ["The couple has been married for more than 30 years under the AA system and was arrested by the urban management"]
}
# A prefix query matches a single term, so the value should be one word
GET my_index/_search
{
  "query": {
    "prefix": {
      "text": {
        "value": "urban"
      }
    }
  }
}

#############################################################
# wildcard
DELETE my_index
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }
DELETE product_en
POST /product_en/_bulk
{ "index": { "_id": "1"} }
{ "title": "my english","desc" : "shouji zhong de zhandouji","price" : 3999, "tags": [ "xingjiabi", "fashao", "buka", "1"]}
{ "index": { "_id": "2"} }
{ "title": "xiaomi nfc phone","desc" : "zhichi quangongneng nfc,shouji zhong de jianjiji","price" : 4999, "tags": [ "xingjiabi", "fashao", "gongjiaoka" , " asd2fgas"]}
{ "index": { "_id": "3"} }
{ "title": "nfc phone","desc" : "shouji zhong de hongzhaji","price" : 2999, "tags": [ "xingjiabi", "fashao", "menjinka" , "as345"]}
{ "index": { "_id": "4"} }
{ "title": "xiaomi erji","desc" : "erji zhong de huangmenji","price" : 999, "tags": [ "low", "bufangshui", "yinzhicha", "4dsg" ]}
{ "index": { "_id": "5"} }
{ "title": "hongmi erji","desc" : "erji zhong de kendeji","price" : 399, "tags": [ "lowbee", "xuhangduan", "zhiliangx" , "sdg5"]}
GET my_index/_search
GET product_en/_search

GET my_index/_search
{
  "query": {
    "wildcard": {
      "text.keyword": {
        "value": "my eng*ish"
      }
    }
  }
}
GET product_en/_mapping
#exact value
GET product_en/_search
{
  "query": {
    "wildcard": {
      "tags.keyword": {
        "value": "men*inka"
      }
    }
  }
}

#####################################################
# regexp
GET product_en/_search
{
  "query": {
    "regexp": {
      "title": ".*nfc.*"
    }
  }
}
GET product_en/_search
{
  "query": {
    "regexp": {
      "desc": {
        "value": "zh~dng",
        "flags": "COMPLEMENT"
      }
    }
  }
}
GET product_en/_search
{
  "query": {
    "regexp": {
      "tags.keyword": {
        "value": ".*<2-3>.*",
        "flags": "INTERVAL"
      }
    }
  }
}
############################################
# fuzzy: fuzzy query
# fuzzy is a term-level query: the value is not analyzed, so pass a single term
GET product_en/_search
{
  "query": {
    "fuzzy": {
      "desc": {
        "value": "quanggonneng",
        "fuzziness": "2"
      }
    }
  }
}

GET product_en/_search
{
  "query": {
    "match": {
      "desc": {
        "query": "nfe quasdasdasdasd",
        "fuzziness": 1
      }
    }
  }
}
#####################################
# match_phrase_prefix
GET product_en/_search
{
  "query": {
    "match_phrase": {
      "desc": "shouji zhong de"
    }
  }
}

GET product_en/_search
{
  "query": {
    "match_phrase_prefix": {
      "desc": {
        "query": "de zhong shouji hongzhaji",
        "max_expansions": 50,
        "slop":3
      }
    }
  }
}


GET product_en/_search
{
  "query": {
    "match_phrase_prefix": {
      "desc": {
        "query": "zhong hongzhaji",
        "max_expansions": 50,
        "slop": 3
      }
    }
  }
}


# source: zhong de hongzhaji
# query:  zhong hongzhaji

# source: shouji zhong de hongzhaji
# query:  de zhong shouji hongzhaji

# de shouji/zhong hongzhaji 1 time
# shouji/de zhong hongzhaji 2 times
# shouji zhong/de hongzhaji 3 times
# shouji zhong de hongzhaji 4 times

############################################
# ngram and edge-ngram
#ngram min_gram =1 "max_gram": 2

GET _analyze
{
  "tokenizer": "ik_max_word",
  "filter": [ "edge_ngram" ],
  "text": "reba always loves me"
}

#min_gram =1 "max_gram": 1
#r a l m

#min_gram =1 "max_gram": 2
#r a l m
#re al lo me

#min_gram =2 "max_gram": 3
#re al lo me
#reb alw lov me

# my_index already exists from the wildcard example, so delete it first
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "2_3_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_edge_ngram": {
          "type":"custom",
          "tokenizer": "standard",
          "filter": [ "2_3_edge_ngram" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer":"my_edge_ngram",
        "search_analyzer": "standard"
      }
    }
  }
}
GET /my_index/_mapping


POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }


GET /my_index/_search
GET /my_index/_mapping
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": "my eng is goo"
    }
  }
}
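
Why the phrase "my eng is goo" can match here is easier to see in a small model. The Python sketch below assumes the mapping above: edge n-grams of 2 to 3 characters at index time, plain tokens at search time, with all grams of a token sharing that token's position. The function names are hypothetical, and whitespace splitting stands in for the standard tokenizer.

```python
def edge_ngrams(token, min_gram=2, max_gram=3):
    """Edge n-grams of one token, as a set."""
    return {token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)}

def index_positions(text):
    """Index-time analysis: one set of edge n-grams per token position."""
    return [edge_ngrams(tok) for tok in text.lower().split()]

def phrase_matches(indexed, query):
    """Search-time analysis: plain tokens that must hit consecutive positions."""
    q = query.lower().split()
    return any(all(q[k] in indexed[i + k] for k in range(len(q)))
               for i in range(len(indexed) - len(q) + 1))

doc = index_positions("my english is good")
print(doc)  # one set of grams per position: my / en, eng / is / go, goo

print(phrase_matches(doc, "my eng is goo"))                                # True
print(phrase_matches(index_positions("my chinese is good"), "my eng is goo"))  # False
print(phrase_matches(doc, "my english is good"))                           # False
```

The last result is worth noting: in this model the full word english is never indexed because max_gram is 3, so a query containing the complete word finds nothing. A larger max_gram would avoid that at the cost of a bigger index.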



PUT my_index2
{
  "settings": {
    "analysis": {
      "filter": {
        "2_3_grams": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_edge_ngram": {
          "type":"custom",
          "tokenizer": "standard",
          "filter": [ "2_3_grams" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer":"my_edge_ngram",
        "search_analyzer": "standard"
      }
    }
  }
}
GET /my_index2/_mapping
POST /my_index2/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }

GET /my_index2/_search
{
  "query": {
    "match_phrase": {
      "text": "my eng is goo"
    }
  }
}

GET _analyze
{
  "tokenizer": "ik_max_word",
  "filter": [ "ngram" ],
  "text": "Make skin with your heart, play games with your feet"
}