天天看點

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念

在拆解評分算法之前,必須簡單解釋一下背後的理論——Elasticsearch 基于 Lucene。要了解 Elasticsearch,我們必須了解 Lucene。

1、Okapi BM25 基本概念

Okapi BM25 模型的計算公式如下:

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念

類似的公式,我看到後的第一反應:這是科研人員才能搞懂的事情,我等隻能圍觀。

但,為了進一步深入算分機制,我們一個個參數拆解一下,期望能“撥開雲天、豁然開朗”!

上述公式中:

D:代表文檔。

Q:代表查詢。

K1:自由參數,預設值:1.2。

b:自由參數,預設值:0.75。

參見 Lucene 官方文檔:

https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

2、詞頻 TF

詞頻英文釋義:TF(Term Frequency) ,即:分詞單元(Term)在文檔中出現的頻率。

由于每個文本的長度不同,一個單詞在長文檔中出現的次數可能比短文檔中出現的次數要多得多。

一個詞出現的次數越多,它的得分就越高。

可以簡記為:

    特定分詞單元 Term 出現次數 (Number of times term t appears in a document)

TF =  ----------------------------------------------

        所在文檔 Terms 總數 (Total number of terms in the document)

3、逆文檔頻率 IDF

逆文檔頻率英文釋義: IDF(Inverse Document Frequency),衡量分詞單元Term的重要性。

但是,衆所周知,諸如“the”、“is”、“of、“that”、“的”、“嗎”等之類的特定詞可能會出現很多次但重要性不大。

是以,我們需要通過計算以下公式來降低常用分詞單元的權重,同時擴大稀有分詞單元的權重。

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念

               文檔數(Total number of documents)

IDF = log  ---------------------------------------

         包含特定分詞單元 Term 的文檔數 (Number of documents with term t in it)

4、實戰探索

4.1 索引準備

本文基于:7.12.0 版本的 Elasticsearch 進行拆解驗證。

建立索引:got,并制定字段 quote 為 text 類型,同時指定:english 分詞器。

DELETE got

PUT got

{

 "settings": {

   "number_of_shards": 1,

   "number_of_replicas": 0

 },

 "mappings": {

   "properties": {

     "quote": {

       "type": "text",

       "analyzer": "english"

     }

   }

 }

}

4.2 資料準備

bulk 批量導入資料,資料來自《權利的遊戲》電視劇的台詞。

POST _bulk

{ "index" : { "_index" : "got", "_id" : "1" } }

{ "quote" : "A mind needs books as a sword needs a whetstone, if it is to keep its edge." }

{ "index" : { "_index" : "got", "_id" : "2" } }

{ "quote" : "Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armor yourself in it, and it will never be used to hurt you." }

{ "index" : { "_index" : "got", "_id" : "3" } }

{ "quote" : "Let them see that their words can cut you, and you’ll never be free of the mockery. If they want to give you a name, take it, make it your own. Then they can’t hurt you with it anymore." }

{ "index" : { "_index" : "got", "_id" : "4" } }

{ "quote" : "When you play the game of thrones, you win or you die. There is no middle ground." }

{ "index" : { "_index" : "got", "_id" : "5" } }

{ "quote" : "The common people pray for rain, healthy children, and a summer that never ends. It is no matter to them if the high lords play their game of thrones, so long as they are left in peace." }

{ "index" : { "_index" : "got", "_id" : "6" } }

{ "quote" : "If you would take a man’s life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die." }

{ "index" : { "_index" : "got", "_id" : "7" } }

{ "quote" : "Sorcery is the sauce fools spoon over failure to hide the flavor of their own incompetence." }

{ "index" : { "_index" : "got", "_id" : "8" } }

{ "quote" : "Power resides where men believe it resides. No more and no less." }

{ "index" : { "_index" : "got", "_id" : "9" } }

{ "quote" : "There’s no shame in fear, my father told me, what matters is how we face it." }

{ "index" : { "_index" : "got", "_id" : "10" } }

{ "quote" : "Love is poison. A sweet poison, yes, but it will kill you all the same." }

{ "index" : { "_index" : "got", "_id" : "11" } }

{ "quote" : "What good is this, I ask you? He who hurries through life hurries to his grave." }

{ "index" : { "_index" : "got", "_id" : "12" } }

{ "quote" : "Old stories are like old friends, she used to say. You have to visit them from time to time." }

{ "index" : { "_index" : "got", "_id" : "13" } }

{ "quote" : "The greatest fools are ofttimes more clever than the men who laugh at them." }

{ "index" : { "_index" : "got", "_id" : "14" } }

{ "quote" : "Everyone wants something, Alayne. And when you know what a man wants you know who he is, and how to move him." }

{ "index" : { "_index" : "got", "_id" : "15" } }

{ "quote" : "Always keep your foes confused. If they are never certain who you are or what you want, they cannot know what you are like to do next. Sometimes the best way to baffle them is to make moves that have no purpose, or even seem to work against you." }

{ "index" : { "_index" : "got", "_id" : "16" } }

{ "quote" : "One voice may speak you false, but in many there is always truth to be found." }

{ "index" : { "_index" : "got", "_id" : "17" } }

{ "quote" : "History is a wheel, for the nature of man is fundamentally unchanging." }

{ "index" : { "_index" : "got", "_id" : "18" } }

{ "quote" : "Knowledge is a weapon, Jon. Arm yourself well before you ride forth to battle." }

{ "index" : { "_index" : "got", "_id" : "19" } }

{ "quote" : "I prefer my history dead. Dead history is writ in ink, the living sort in blood." }

{ "index" : { "_index" : "got", "_id" : "20" } }

{ "quote" : "In the game of thrones, even the humblest pieces can have wills of their own. Sometimes they refuse to make the moves you’ve planned for them. Mark that well, Alayne. It’s a lesson that Cersei Lannister still has yet to learn." }

{ "index" : { "_index" : "got", "_id" : "21" } }

{ "quote" : "Every man should lose a battle in his youth, so he does not lose a war when he is old." }

{ "index" : { "_index" : "got", "_id" : "22" } }

{ "quote" : "A reader lives a thousand lives before he dies. The man who never reads lives only one." }

{ "index" : { "_index" : "got", "_id" : "23" } }

{ "quote" : "The fisherman drowned, but his daughter got Stark to the Sisters before the boat went down. They say he left her with a bag of silver and a bastard in her belly. Jon Snow, she named him, after Arryn." }

{ "index" : { "_index" : "got", "_id" : "24" } }

{ "quote" : "You could make a poultice out of mud to cool a fever. You could plant seeds in mud and grow a crop to feed your children. Mud would nourish you, where fire would only consume you, but fools and children and young girls would choose fire every time." }

{ "index" : { "_index" : "got", "_id" : "25" } }

{ "quote" : "Men live their lives trapped in an eternal present, between the mists of memory and the sea of shadow that is all we know of the days to come." }

{ "index" : { "_index" : "got", "_id" : "26" } }

{ "quote" : "No. Hear me, Daenerys Targaryen. The glass candles are burning. Soon comes the pale mare, and after her the others. Kraken and dark flame, lion and griffin, the sun’s son and the mummer’s dragon. Trust none of them. Remember the Undying. Beware the perfumed seneschal." }

4.3 實施檢索

GET got/_search

 "query": {

   "match": {

     "quote": "live"

傳回結果(僅列舉評分、Quote 字段)如下:

Score Quote

3.3297362 A reader lives a thousand lives before he dies.  The man who never reads lives only one.

2.847715 Men live their lives trapped in an eternal present, between the mists of memory  and the sea of shadow that is all we know of the days to come.

2.313831 I prefer my history dead. Dead history is writ in ink, the living sort in blood.

這時候會面臨我們的終極疑惑——這些評分咋來的?咋計算的呢?

别急,我們一步步拆解。

5、評分拆解

加上 "explain":true 一探究竟。

     "quote": "you"

 "explain": true

拿第一個傳回文檔也就是評分為:3.3297362 的結果資料為例,自頂向下的方法有利于了解計算。

如下拆解結果所示,分數 3.3297362 是分詞單元 live 的 boost * IDF * TF 三者的乘積,簡記為:

總評分 = 2.2 * 2.043074 * 0.74080354 = 3.3297362。

explain 執行後的結果,核心部分如下所示:

       "_shard" : "[got][0]",

       "_node" : "m9VCQfPDRqqMuupU_Xz5Eg",

       "_index" : "got",

       "_type" : "_doc",

       "_id" : "22",

       "_score" : 3.3297362,

       "_source" : {

         "quote" : "A reader lives a thousand lives before he dies. The man who never reads lives only one."

       },

       "_explanation" : {

         "value" : 3.3297362,

         "description" : "weight(quote:live in 21) [PerFieldSimilarity], result of:",

         "details" : [

           {

             "value" : 3.3297362,

             "description" : "score(freq=3.0), computed as boost * idf * tf from:",

             "details" : [

               {

                 "value" : 2.2,

                 "description" : "boost",

                 "details" : [ ]

               },

                 "value" : 2.043074,

                 "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",

                 "value" : 0.7408035,

                 "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",

               }

             ]

           }

         ]

       }

5.1 詞頻 TF 拆解

執行 explain 後,詞頻 TF 拆解計算如下,

   "value":0.7408035,

   "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",

   "details":[

       {

           "value":"3.0",

           "description":"freq, occurrences of term within document",

           "details":[

           ]

           "value":1.2,

           "description":"k1, term saturation parameter",

           "value":0.75,

           "description":"b, length normalization parameter",

           "value":"14.0",

           "description":"dl, length of field",

           "value":16.807692,

           "description":"avgdl, average length of field",

   ]

詞頻計算涉及參數如下:

freq = 分詞單元 live 在文檔中出現的次數為 3 次,如下圖所示:

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念
  • k1:1.2,預設值。
  • b:0.75 預設值。
  • dl:該文檔的分詞後分詞單元的個數(number of tokens),為 14。

可以借助——analyze API 驗證:

POST got/_analyze

 "text": "A reader lives a thousand lives before he dies. The man who never reads lives only one",

 "analyzer": "english"

分詞後的 token 為:

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念

avgdl:等于所有文檔的分詞單元的總數  / 文檔個數) ,計算結果為:16.807692。

如何計算的呢?這裡有同學會有疑惑,解讀如下:

avgdl 計算步驟 1:所有文檔的分詞單元的總數。

如下所示:共 437個。文檔數為 26 個。

為了方面檢視,我把 26 個文檔的全部 document 内容集合到一個文檔裡面,求得的分詞後的結果值為 437 。

幹貨 | 一步步拆解 Elasticsearch BM25 模型評分細節1、Okapi BM25 基本概念

avgdl 計算步驟 2:avgdl = 437 / 26 = 16.807692。

最終 TF 詞頻 求解結果為:0.740803524(該手算值精度和最終 Elasticsearch 傳回結果精度值不完全一緻,屬于精度問題,不影響了解全局),其求解步驟如下:

TF = freq / (freq + k1 * (1 - b + b * dl / avgdl))  

= 3 / (3 + 1.2 *( 1 - 0.75 + 0.75 * 14 / 16.807692))

= 3 / (3 + 1.2 *0.87471397)

= 3/(3+1.049656764)

= 3/4.049656764

= 0.740803524

5.2 逆文檔頻率 IDF 拆解

執行 explain 後,逆文檔頻率 IDF 拆解如下:

   "value":2.043074,

   "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",

           "value":3,

           "description":"n, number of documents containing term",

           "value":26,

           "description":"N, total number of documents with field",

N:待檢索文檔數,本示例為 26。

n:包含分詞單元 live 的文檔數目,本示例為 3。

最終 IDF 求解結果為:2.043074,其計算公式如下:

IDF = log(1 + (N - n + 0.5) / (n + 0.5))

= log( 1 + ( 26 - 3 + 0.5) / ( 3 + 0.5))

= log( 1 + 23.5/3.5)

= log( 1 + 6.714285)

= log(7.714285)

= 2.043074

如上計算對數, 底數為 e。

5.3 總評分拆解

總評分

= boost * TF * IDF

= 2.2 * 0.74080354  * 2.043074 = 3.3297362

boost 為什麼等于 2.2 ?

如果我們不指定 boost,boost 就是使用預設值,預設值是 2.2。

boost 參見:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html https://www.infoq.com/articles/similarity-scoring-elasticsearch/

6、小結

一步步拆解,才能知道 BM25 模型的評分‘奧秘’所在,原來難懂的數學計算公式,也變得清晰明朗!

有了拆解,再來看其他的檢索評分問題自然會“毫不費力"。

本文由英文部落格:

https://blog.mimacom.com/bm25-got/

翻譯而來,較原來部落格内容,增加了計算的細節和個人解讀,確定每一個計算細節國小生都能看懂。

歡迎就評分問題留言交流細節。

參考

實戰 | Elasticsearch自定義評分的N種方法

https://ruby-china.org/topics/31934 https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html