
Efficient prefix search in Elasticsearch with edge n-gram

Author: Flash Gene

Preface

Over the past decade, Elasticsearch has become the most popular open-source search and data analytics engine on the market thanks to its excellent performance. It plays a central role in offline data warehousing, real-time retrieval, and search services for enterprise customers, and a wealth of practical experience and optimization results has accumulated around it.

However, for consumer-facing scenarios that combine high concurrency, high availability, and large data volumes, published optimization experience is still relatively scarce. With this article we hope to provide a useful reference for our peers through a concrete optimization case from our large-screen search business, and to inspire further thinking and practice around Elasticsearch optimization.

Dangbei uses Elasticsearch extensively as the core retrieval engine for its film and television search, and it has successfully handled the surges of search traffic during peak periods over the past few years. However, as content and data volumes have skyrocketed, request processing time and CPU load have climbed with them. After in-depth analysis, we found that the performance bottleneck lay mainly in prefix search. To solve this problem, we adopted an edge n-gram based tokenization strategy to optimize prefix search, which greatly improved query efficiency.

Background

In the current film and television content search business, Elasticsearch plays a vital role as the primary search engine. To provide a more flexible and convenient search experience, the business logic includes a dedicated search mechanism, a wildcard prefix search, for queries of three characters or fewer entered by the user.

Specifically, when a user types three or fewer characters into the search box, the system automatically runs a prefix-matching query on the input to quickly retrieve the titles that start with what the user has typed. This design takes into account how users actually type: the results reflect the user's query intent in real time, which greatly improves the user experience and search efficiency. Users can find the title they want without typing the full name, which matters most when they are browsing and selecting quickly.

Some pain points of prefix search

The following is an example of a query statement:

POST /xy_test_pinyin_ik/_search
{
  "query": {
    "wildcard": {
      "text": {
        "value": "杭州当贝*"
      }
    }
  }
}
           

This kind of search often leads to the following problems:

  • Wildcard queries, especially those with a wildcard at the beginning of the string, force Elasticsearch to traverse a large number of terms in the index, which slows queries down (see the example after this list).
  • Because of their complexity, wildcard queries often cannot take advantage of index optimizations such as the inverted index and caching.
  • With a prefix query, the longer the prefix, the fewer documents it matches and the better it performs; if the prefix is very short, for example a single character, it matches far too much data and performance suffers.
  • Elasticsearch's caches, such as the query cache and the filter cache, are generally of little use for wildcard queries, because their results can change frequently as the index changes.
  • Wildcard queries can also make Elasticsearch use more memory, because it has to maintain more state to handle the complex matching logic.
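
A leading wildcard is the worst case: Elasticsearch cannot narrow the candidate terms by prefix and has to examine essentially every term in the field. A hypothetical example against the same test index:

POST /xy_test_pinyin_ik/_search
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*当贝"
      }
    }
  }
}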

What is n-gram

N-gram is a text-segmentation technique that divides text into consecutive, overlapping sequences of n characters.

For example, for the word "Elasticsearch" (a runnable _analyze sketch follows this list):

  • A 2-gram ("bigram") split produces "El", "la", "as", "st", "ti", "ic", "cs", "se", "ea", "ar", "rc", "ch".
  • A 3-gram ("trigram") split produces "Ela", "las", "ast", "sti", "tic", "ics", "cse", "sea", "ear", "arc", "rch".
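
These splits can be reproduced with the _analyze API. The following is a minimal sketch using the built-in ngram tokenizer; note that its defaults are min_gram 1 and max_gram 2, so the response contains single characters as well as the bigrams listed above:

POST /_analyze
{
  "tokenizer": "ngram",
  "text": "Elasticsearch"
}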

What is edge_ngram

edge_ngram is an n-gram tokenizer designed for prefix matching, which generates n-grams from the beginning of text only. This approach is effective for improving search performance, reducing index size, and enhancing the user experience, especially in autocomplete and suggestion features that quickly return words that match the beginning of the text entered by the user.

The main benefits of edge_ngram include:

  • Fast prefix matching: Only the prefix portion of each word is indexed, allowing Elasticsearch to quickly locate all words that begin with the user's entered character.
  • Improved search performance: prefix queries no longer scan the term dictionary with a wildcard; they become lookups of pre-generated prefix terms, which is much faster.
  • Reduced index size: edge_ngram produces far fewer tokens than a regular n-gram tokenizer, saving disk space and memory.
  • Optimized autocomplete experience: Designed to respond quickly to user input and provide relevant suggestions, ensuring instant feedback.
  • Improve search relevance: Focus on prefix matching and ensure that only search results that start with the term entered by the user are returned.
  • Easy to configure: Easily control the size of your n-grams by adjusting min_gram and max_gram parameters to meet specific search needs.
  • Multi-language support: Prefix generation for any language that doesn't depend on language-specific rules.
  • Reduce unnecessary matches: Reduce the occurrence of irrelevant search results and improve search accuracy with accurate prefix matching.

With these advantages, the edge_ngram tokenizer is well suited to improving search efficiency and user experience.

By applying the edge_ngram tokenizer, we can decompose the word "Elastic" into a set of substrings taken from the beginning of the word, as follows (a short _analyze sketch follows this list):

  • Length 1: E
  • Length 2: El
  • Length 3: Ela
  • Length 4: Elas
  • Length 5: Elast
  • Length 6: Elasti
  • Length 7: Elastic
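
Note that the built-in edge_ngram tokenizer defaults to min_gram 1 and max_gram 2, so running it through _analyze as-is only yields the first two substrings above; the custom tokenizer created in the next section raises max_gram to 7 to cover the whole list:

POST /_analyze
{
  "tokenizer": "edge_ngram",
  "text": "Elastic"
}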

Use edge_ngram for prefix searches

1. Create an index with an Edge Ngram tokenizer

PUT xy_test_pinyin_ik
{
  "settings": {
    "number_of_shards": 1, 
    "number_of_replicas": 0, 
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "custom_tokenizer"
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 7
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
           

The custom tokenizer generates edge n-grams with a minimum length of 1 and a maximum length of 7.

2. Use the custom_analyzer (built on custom_tokenizer) to view the tokenization result

The result is in line with expectations.

POST /xy_test_pinyin_ik/_analyze
{
  "text": "杭州当贝网络科技有限公司",
  "analyzer": "custom_analyzer"
}
           

Outcome:

{
  "tokens" : [
    {
      "token" : "杭",
      ...
      "position" : 0
    },
    {
      "token" : "杭州",
      ...
      "position" : 1
    },
    {
      "token" : "杭州当",
      ...
      "position" : 2
    },
    {
      "token" : "杭州当贝",
      ...
      "position" : 3
    },
    {
      "token" : "杭州当贝网",
      ...
      "position" : 4
    },
    {
      "token" : "杭州当贝网络",
      ...
      "position" : 5
    },
    {
      "token" : "杭州当贝网络科",
      ...
      "position" : 6
    }
  ]
}
           

3. Write data

POST /xy_test_pinyin_ik/_doc/1
{"text": "杭州当贝网络科技有限公司"}

POST /xy_test_pinyin_ik/_doc/2
{"text": "杭州"}
           

4. Query the data

If you use a match query, the document that contains only "杭州" is also returned, because match performs full-text matching on the analyzed terms; it just receives a relatively low score.
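
For comparison, such a match query on the same field looks like this (a sketch; it returns both documents, with the "杭州"-only document scored lower, as noted above):

GET /xy_test_pinyin_ik/_search
{
  "query": {
    "match": {
      "text": "杭州当贝"
    }
  }
}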

It is recommended to use match_phrase instead: it requires every term to be present and the term positions to be exactly one apart, which matches our expectation of prefix matching.

GET /xy_test_pinyin_ik/_search
{
  "query": {
    "match_phrase": {
      "text": "杭州当贝"
    }
  }
}
           

Outcome:

{
  ...
  "hits" : {
    ...
    "hits" : [
      {
        "_index" : "xy_test_pinyin_ik",
        ...
        "_source" : {
          "text" : "杭州当贝网络科技有限公司"
        }
      }
    ]
  }
}
           

Summary

This article analyzed the problems we encountered in this search business scenario, selected a suitable technique, integrated it into a solution, and verified it. We ended up using edge n-gram, which completely removed the performance bottleneck in this scenario. We hope it offers a useful line of thinking for other engineers facing similar Elasticsearch performance problems.


Source: WeChat public account "Dangbei technical team", https://mp.weixin.qq.com/s/tr9e6jzrP6A6IU3T3wb85g