HanLP: Separating Code from Dictionaries, Approach and Workflow

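I had been using the portable version of HanLP in our Spark environment, but its vocabulary was too small, and I also wanted to add the jieba and scws dictionaries. Dictionaries from other segmenters such as ik and ansj_seg carry no part-of-speech tags, so I left them out. The main change here is to package the dictionary directory data together with hanlp.properties into a single data.jar file.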

1. Configuring pom.xml to filter resource files

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>${maven-jar-plugin.version}</version>
    <configuration>
        <excludes>
            <exclude>**/*.properties</exclude>
        </excludes>
    </configuration>
</plugin>

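This removes properties files from the jar, so the resulting hanlp jar contains none. Whether to keep properties files in the jar depends on your needs. Because I plan to ship hanlp.properties together with the dictionary directory, hanlp.properties has to be filtered out here.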

2. Modifying hanlp.properties

root=

# Set the root directory to empty, or comment out root

CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; scws.txt; jieba.txt; 現代漢語補充詞庫.txt; 全國地名大全.txt ns; 人名詞典.txt; 機構名詞典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;

# Add more dictionary files; here the jieba and scws dictionaries are added

#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter

IOAdapter=com.hankcs.hanlp.corpus.io.JarIOAdapter

# Switch the IOAdapter so that dictionaries are loaded from a jar
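To make the CustomDictionaryPath syntax easier to read, here is a minimal sketch of how such a value appears to be expanded: entries are separated by `;`, an entry that starts with a space reuses the directory of the previous entry, and a trailing word such as `ns` is kept as the default part-of-speech tag. This reflects my understanding of the format and is not HanLP's actual code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of HanLP).
final class CustomPathSketch {
    /** Expands a CustomDictionaryPath-style value into full paths (the tag suffix is kept). */
    static List<String> resolve(String root, String value) {
        List<String> resolved = new ArrayList<>();
        String prevDir = root;
        for (String entry : value.split(";")) {
            if (entry.trim().isEmpty()) continue;      // ignore empty segments
            if (entry.startsWith(" ")) {
                // Leading space: same directory as the previous entry.
                resolved.add(prevDir + entry.trim());
            } else {
                // Otherwise the entry is relative to root and sets the current directory.
                String full = root + entry.trim();
                int lastSlash = full.lastIndexOf('/');
                if (lastSlash != -1) prevDir = full.substring(0, lastSlash + 1);
                resolved.add(full);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        String value = "data/dictionary/custom/CustomDictionary.txt; scws.txt; jieba.txt;"
                + "data/dictionary/person/nrf.txt nrf;";
        for (String p : resolve("", value)) System.out.println(p);
    }
}
```

With an empty root, every resolved path stays relative (for example `data/dictionary/custom/scws.txt`), which is the form a classpath lookup needs.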

3. Modifying HanLP.java

if ( root.length() != 0 && !root.endsWith("/")) root += "/";

When root has length 0, no '/' is appended to the root string.
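The effect of that guard can be seen in isolation. The sketch below wraps the same condition in a hypothetical helper (`normalizeRoot` is not a HanLP method): an empty root stays empty, so dictionary paths remain relative names such as `data/dictionary/...`, whereas a root of "/" would turn them into names with a leading slash, which `ClassLoader.getResourceAsStream` does not resolve.

```java
// Hypothetical helper, for illustration only (not part of HanLP).
final class RootPathSketch {
    /** Appends a trailing '/' only when root is non-empty and lacks one. */
    static String normalizeRoot(String root) {
        if (root.length() != 0 && !root.endsWith("/")) root += "/";
        return root;
    }

    public static void main(String[] args) {
        // Empty root: the path stays a relative classpath resource name.
        System.out.println(normalizeRoot("") + "data/dictionary/custom/CustomDictionary.txt");
        // Non-empty root: a separator is added when missing.
        System.out.println(normalizeRoot("/opt/hanlp") + "data/dictionary/custom/CustomDictionary.txt");
    }
}
```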

4. Adding the source file that handles the dictionary jar: JarIOAdapter.java

package com.hankcs.hanlp.corpus.io;

import java.io.*;

/**
 * IO adapter that loads resources from jars on the classpath
 * @author hankcs
 */
public class JarIOAdapter implements IIOAdapter
{
    @Override
    public InputStream open(String path) throws FileNotFoundException
    {
        // Loading via the first (commented-out) line fails in a distributed environment,
        // so the second line is used instead.
        //return ClassLoader.getSystemClassLoader().getResourceAsStream(path);
        return JarIOAdapter.class.getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws FileNotFoundException
    {
        return new FileOutputStream(path);
    }
}
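One thing to watch with this approach is that `getResourceAsStream` returns null, rather than throwing, when the resource is missing. The following is a minimal sketch of a stricter variant, assuming you would prefer an explicit FileNotFoundException; the class name `ClasspathOpenSketch` is hypothetical and this is not the adapter used above.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper, for illustration only (not part of HanLP).
final class ClasspathOpenSketch {
    /** Opens a classpath resource through this class's own loader, failing loudly if absent. */
    static InputStream open(String path) throws FileNotFoundException {
        InputStream in = ClasspathOpenSketch.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            throw new FileNotFoundException("resource not found on classpath: " + path);
        }
        return in;
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "hanlp.properties";
        try (InputStream in = open(name)) {
            System.out.println("found " + name);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Using the class's own loader instead of the system class loader matters when jars are added at run time by a framework-specific class loader, which is likely why the first variant fails on a cluster.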

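Running DemoStopWord with

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoStopWord

failed. The cause is inconsistent path handling: MDAG passed an absolute file path to the IOAdapter, which a jar-based adapter cannot resolve. Modify MDAG.java as follows:

public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));
    // ... rest of the constructor unchanged
}

That fixes it.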
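The difference between the two calls is easy to demonstrate with plain java.io.File. In this sketch (the helper name `toResourceName` and the sample path are illustrative assumptions), `getPath()` keeps the relative name that was passed in, while `getAbsolutePath()` prefixes the current working directory and therefore no longer matches any entry name inside a jar.

```java
import java.io.File;

// Hypothetical helper, for illustration only (not part of HanLP).
final class ResourceNameSketch {
    /** Returns the path as given, with '/' separators, suitable as a classpath resource name. */
    static String toResourceName(File f) {
        return f.getPath().replace(File.separatorChar, '/');
    }

    public static void main(String[] args) {
        File f = new File("data/dictionary/stopwords.txt");
        System.out.println("getPath():         " + toResourceName(f));
        System.out.println("getAbsolutePath(): " + f.getAbsolutePath());
    }
}
```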

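5. Packaging the dictionaries and the configuration file into one jar

It is best to convert the txt dictionaries into bin or dat files before building the jar; once they are packaged, the bin or dat cache files can no longer be written at run time.

The simple way is to run the demos, which generates the corresponding bin or dat files.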

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoAtFirstSight

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoChineseNameRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoJapaneseNameRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoPinyin

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoPlaceRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoOrganizationRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoTokenizerConfig # named entity recognition, including the person and place names above

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoTraditionalChinese2SimplifiedChinese

Alternatively, the following shell script does the same:

:>a;while read cl; do echo $cl; echo "=========="$cl"=======" >>a;java -cp .:test.jar:hanlp-1.3.2.jar $cl 1>> a 2>&1;done < <(jar tvf test.jar | awk '$(NF)~"Demo"{print $(NF)}' | sed 's/.class$//;s/\//./g')

Put the data directory and the hanlp.properties file in one directory, say xxx:

cd xxx

jar cvf data.jar .

This produces data.jar.
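Before shipping the jar it can be worth confirming that hanlp.properties sits at the jar root and that the dictionaries live under data/, since a classpath lookup depends on those exact entry names. Below is a minimal sketch of such a check using java.util.jar; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, for illustration only (not part of HanLP).
final class DataJarCheckSketch {
    /** Returns the names of all entries in the given jar. */
    static List<String> entryNames(Path jar) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jf = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jf.entries();
            while (entries.hasMoreElements()) names.add(entries.nextElement().getName());
        }
        return names;
    }

    /** True if hanlp.properties is at the root and at least one entry is under data/. */
    static boolean looksLikeDataJar(List<String> names) {
        boolean hasProps = names.contains("hanlp.properties");
        boolean hasData = names.stream().anyMatch(n -> n.startsWith("data/"));
        return hasProps && hasData;
    }

    public static void main(String[] args) throws IOException {
        Path jar = Paths.get(args.length > 0 ? args[0] : "data.jar");
        System.out.println(looksLikeDataJar(entryNames(jar)) ? "layout OK" : "unexpected layout");
    }
}
```

If the jar was built from the parent directory by mistake, the entries would carry an extra `xxx/` prefix and the check would report an unexpected layout.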

6. Running

[dxp@Flyme-SearchTag-32-220 makeNewDict]$ ls

data.jar hanlp-1.3.2.jar README.md test test.jar

[dxp@Flyme-SearchTag-32-220 makeNewDict]$ java -cp data.jar:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoAtFirstSight

7. Using it in Spark

In a Maven project in an IDE (such as IntelliJ IDEA), add the following dependency:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>1.3.2</version>
    <scope>system</scope>
    <systemPath>${LocalPath}/hanlp-1.3.2.jar</systemPath>
</dependency>

When submitting the job with spark-submit, add:

--jars hanlp-1.3.2.jar,data.jar

Reposted from cicido's personal space.
