
Using HanLP to Enhance Elasticsearch Tokenization

Source code for the hanlp-ext plugin: http://git.oschina.net/hualongdata/hanlp-ext and https://github.com/hualongdata/hanlp-ext

By default, Elasticsearch tokenizes Chinese text character by character, which clearly cannot meet the needs of word-based search. The official SmartCN plugin provides Chinese analysis, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, to segment Chinese text.

Elasticsearch

Elasticsearch's default tokenization of Chinese is dreadful.

GET /_analyze?pretty
{
  "text" : ["重慶華龍網海數科技有限公司"]
}           

Output:

{
  "tokens": [
    {
      "token": "重",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "慶",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "華",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "龍",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "網",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "海",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "數",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "科",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "技",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "有",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "限",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "公",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "司",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    }
  ]
}

As you can see, the default analyzer splits the text into individual characters.

elasticsearch-hanlp

HanLP

HanLP is an excellent natural language processing toolkit implemented in Java, offering the following features (a brief Java sketch follows this list):

Chinese word segmentation

Part-of-speech tagging

Named entity recognition

Keyword extraction

Automatic summarization

Phrase extraction

Pinyin conversion

Simplified/Traditional Chinese conversion

Text recommendation

Dependency parsing

Corpus tools
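
The HanLP Java API can also be called directly, outside of Elasticsearch. Below is a minimal sketch assuming the HanLP artifact (com.hankcs:hanlp) is on the classpath; the calls shown are part of HanLP's public API, but the exact output depends on your HanLP version and dictionaries.

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class HanLPDemo {
    public static void main(String[] args) {
        // Chinese word segmentation with part-of-speech tags
        List<Term> terms = HanLP.segment("重慶華龍網海數科技有限公司");
        for (Term term : terms) {
            System.out.println(term.word + "\t" + term.nature);
        }

        // Keyword extraction: pick the top 3 keywords from a piece of text
        List<String> keywords = HanLP.extractKeyword("重慶華龍網海數科技有限公司是一家科技公司", 3);
        System.out.println(keywords);

        // Traditional-to-Simplified conversion
        System.out.println(HanLP.convertToSimplifiedChinese("重慶華龍網海數科技有限公司"));
    }
}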

After installing the elasticsearch-hanlp plugin (installation instructions: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the tokenization results again.
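
Elasticsearch must be restarted after installing a plugin. Before re-running the analysis, you can optionally confirm that the node has loaded it via the _cat API (the exact plugin name listed depends on the release you installed):

GET /_cat/plugins?v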

GET /_analyze?pretty
{
  "analyzer" : "hanlp",
  "text" : ["重慶華龍網海數科技有限公司"]
}           

Output:

{
  "tokens": [
    {
      "token": "重慶",
      "start_offset": 0,
      "end_offset": 2,
      "type": "ns",
      "position": 0
    },
    {
      "token": "華龍網",
      "start_offset": 2,
      "end_offset": 5,
      "type": "nr",
      "position": 1
    },
    {
      "token": "海數",
      "start_offset": 5,
      "end_offset": 7,
      "type": "nr",
      "position": 2
    },
    {
      "token": "科技",
      "start_offset": 7,
      "end_offset": 9,
      "type": "n",
      "position": 3
    },
    {
      "token": "有限公司",
      "start_offset": 9,
      "end_offset": 13,
      "type": "nis",
      "position": 4
    }
  ]
}
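
With the plugin in place, the hanlp analyzer can be applied to text fields like any built-in analyzer. The following is a minimal sketch: the index name company, type doc, and field name are illustrative, and the exact mapping syntax may differ across Elasticsearch versions.

PUT /company
{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "hanlp"
        }
      }
    }
  }
}

PUT /company/doc/1
{
  "name": "重慶華龍網海數科技有限公司"
}

GET /company/_search
{
  "query": {
    "match": { "name": "華龍網" }
  }
}

Because both the indexed text and the query string are segmented by HanLP, the document above matches a search for the whole word 華龍網 rather than only individual characters.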

HanLP offers far more than basic Chinese word segmentation; many of its other features can also be integrated into Elasticsearch.

This article originally appeared on 羊八井 (Yangbajing)'s blog.