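hanlp-ext plugin source code: http://git.oschina.net/hualongdata/hanlp-ext or https://github.com/hualongdata/hanlp-ext
By default, Elasticsearch tokenizes Chinese text character by character, which falls well short of what word-based search needs. There is an official SmartCN Chinese analysis plugin, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.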
Elasticsearch
Elasticsearch's default segmentation of Chinese text is dismal.
GET /_analyze?pretty
{
"text" : ["重慶華龍網海數科技有限公司"]
}
Output:
{
"tokens": [
{
"token": "重",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "慶",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "華",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "龍",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "網",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "海",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "數",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "科",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "技",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "有",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "限",
"start_offset": 10,
"end_offset": 11,
"type": "<IDEOGRAPHIC>",
"position": 10
},
{
"token": "公",
"start_offset": 11,
"end_offset": 12,
"type": "<IDEOGRAPHIC>",
"position": 11
},
{
"token": "司",
"start_offset": 12,
"end_offset": 13,
"type": "<IDEOGRAPHIC>",
"position": 12
}
]
}
As you can see, the default analyzer splits the text into single characters.
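To make the shape of that response concrete, here is a minimal Python sketch that reproduces the per-character token list. It is only an imitation of the output shown above, not Elasticsearch's actual tokenizer: the function name `char_tokens` is made up for illustration, and it assumes the input consists solely of CJK ideographs from the Basic Multilingual Plane, so that Python string indices match the offsets Elasticsearch reports.

```python
def char_tokens(text):
    """Emit one token per character, imitating the response shown above.

    Assumption: every character is a CJK ideograph, so each one becomes
    its own token of type <IDEOGRAPHIC>.
    """
    tokens = []
    for position, ch in enumerate(text):
        tokens.append({
            "token": ch,
            "start_offset": position,
            "end_offset": position + 1,
            "type": "<IDEOGRAPHIC>",
            "position": position,
        })
    return tokens


tokens = char_tokens("重慶華龍網海數科技有限公司")
print(len(tokens))  # 13, one token per character
```

With one token per character, a query for a word such as 科技 can only be matched as the two separate characters 科 and 技, which is why word-level segmentation matters for search.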
elasticsearch-hanlp
HanLP
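HanLP is an excellent natural language processing toolkit implemented in Java. It offers the following features:
Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified/Traditional Chinese conversion
Text suggestion
Dependency parsing
Corpus tools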
After installing the elasticsearch-hanlp plugin (for installation instructions, see https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the segmentation result again.
GET /_analyze?pretty
{
"analyzer" : "hanlp",
"text" : ["重慶華龍網海數科技有限公司"]
}
Output (the entries of the "tokens" array):
{
"token": "重慶",
"start_offset": 0,
"end_offset": 2,
"type": "ns",
"position": 0
},
{
"token": "華龍網",
"start_offset": 2,
"end_offset": 5,
"type": "nr",
"position": 1
},
{
"token": "海數",
"start_offset": 5,
"end_offset": 7,
"type": "nr",
"position": 2
},
{
"token": "科技",
"start_offset": 7,
"end_offset": 9,
"type": "n",
"position": 3
},
{
"token": "有限公司",
"start_offset": 9,
"end_offset": 13,
"type": "nis",
"position": 4
}
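HanLP does more than simple Chinese word segmentation; many of its other features can also be integrated into Elasticsearch.
Source: the blog of 羊八井 (Yangbajing).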