Building a Lightweight Search Engine for Small and Medium-Sized Applications with the Lucene Library

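About Lucene:

Lucene is an open-source full-text search toolkit. It is not a complete full-text search engine but a framework for building one, providing a complete query engine and indexing engine. In the Java ecosystem Lucene is a mature, free, open-source tool, and it has been the most popular free Java information-retrieval library in recent years. [1]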

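The ElasticSearch (ES) search engine:

ElasticSearch is built on top of Lucene and offers high performance; most large internet companies currently use it to provide search services. However, ES is a product-grade open-source project: the latest Windows release exceeds 200 MB, and the Linux release is about 80 MB. For an individual developer who only needs site search in a small or medium-sized application, ES is heavyweight and most of its features would go unused. A lightweight search engine built on the Lucene toolkit, which is only about 8 MB, is sufficient.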

Download the development tools:

Lucene library: https://github.com/totoro-dev/lucene/raw/master/lib/Lucene8.2.zip
Analyzer with Chinese support: https://github.com/totoro-dev/lucene/raw/master/lib/MyAnalyzer.jar
JSON toolkit (FastJson): https://github.com/totoro-dev/lucene/raw/master/lib/fastjson-1.2.2.jar

Create a project and import the libraries:

If you develop with IDEA, remember to mark the jar files with "Add as Library".

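A sample search-engine scenario:

Suppose we need a search engine that stores the title (title), summary (note), and link (link) of various articles. Given a keyword, it returns the title, summary, and link of every article that matches the keyword, and outputs the results in sorted order.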

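Create the article index files:

Specify the directory (Directory) in which the index files are saved. Note that this is only a directory; Lucene creates the index files inside it automatically.
Create the index writer configuration (IndexWriterConfig) and specify the analyzer used to tokenize the indexed content (MyAnalyzer; the class used in the code below is MyIKAnalyzer). It is a modified IKAnalyzer, adapted to Lucene 8.2, that supports Chinese word segmentation.
Create the indexed fields (TextField). These are the fields whose content can be matched against keywords during a search.
Create the stored field (StoredField). This kind of field is saved verbatim; it is not tokenized or indexed, and instead serves as the target content that a match on the indexed fields leads to.
Create the index document (Document) and add all of the fields to it, forming one index entry.
Use IndexWriter to write the document into the index directory.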

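// Imports required by this snippet; MyIKAnalyzer is assumed to come from MyAnalyzer.jar
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// xx/xx/index: the directory in which the index files are stored
Directory directory = FSDirectory.open(new File("xx/xx/index").toPath());
// Argument: the analyzer object
IndexWriterConfig config = new IndexWriterConfig(new MyIKAnalyzer(true));
IndexWriter indexWriter = new IndexWriter(directory, config);
// title, note and url hold the article's values and are supplied by the caller
// First argument: field name; second: field content; third: whether to store the value
Field titleField = new TextField("title", title, Field.Store.YES);
Field noteField = new TextField("note", note, Field.Store.YES);
// StoredField is stored only; it is not tokenized or searchable
Field linkField = new StoredField("link", url);
Document document = new Document();
document.add(titleField);
document.add(noteField);
document.add(linkField);
indexWriter.addDocument(document);
// Closing the writer commits the changes to the index directory
indexWriter.close();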
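
To make the difference between an indexed field and a stored field more concrete, here is a toy inverted index in plain Java. It only illustrates the idea and is not how Lucene is implemented: the class name `ToyIndex`, its methods, and the whitespace tokenizer are all invented for this sketch (a real Chinese analyzer does far more than split on spaces).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index: illustrates indexed vs. stored fields, not Lucene internals.
final class ToyIndex {
    // term -> ids of the documents whose indexed text contains that term
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // document id -> stored value (kept verbatim, never tokenized)
    private final List<String> storedLinks = new ArrayList<>();

    // Tokenizes `text` on whitespace (assumption: stands in for a real analyzer)
    // and stores `link` untouched so it can be returned with a hit.
    void add(String text, String link) {
        int docId = storedLinks.size();
        storedLinks.add(link);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Avoid duplicate ids when a term occurs twice in the same document.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
        }
    }

    // Looks a term up in the postings and returns the stored links of matching documents.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        List<Integer> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (int docId : ids) {
            hits.add(storedLinks.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Lucene is a search library", "link1");
        index.add("A lightweight search engine", "link2");
        System.out.println(index.search("search"));  // [link1, link2]
        System.out.println(index.search("link1"));   // [] (stored only, not searchable)
    }
}
```

The second lookup returns nothing because the link was only stored, never tokenized into the postings. This mirrors why the link field above uses StoredField while title and note use TextField.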

Testing the index. Contents of the index that was created:

title: 黃龍淼是程式員 ("Huang Longmiao is a programmer")

note: note1

link: link1

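Search the index:

1. Create the collections that hold the search results:

// Weight of each link in the search results; the main basis for ranking
private final Map<String, Integer> linksWeight = new LinkedHashMap<>();
// All links found by the search
private final List<String> links = new ArrayList<>();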

2. Create the searcher (IndexSearcher):

Directory directory = FSDirectory.open(new File("xx/xx/index").toPath());
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

3. Build the search condition (Query) from the keyword. MultiFieldQueryParser makes it possible to search several indexed fields at once.

QueryParser parser = new MultiFieldQueryParser(new String[]{"title","note"}, new MyIKAnalyzer(true));
Query query = parser.parse(key);

4. Use the query to search the index:

// Query the index, keeping at most the 100 best hits
TopDocs topDocs = indexSearcher.search(query, 100);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
String links[] = new String[scoreDocs.length];
// Iterate over the hits
for (int i = 0; i < scoreDocs.length; i++) {
    int docId = scoreDocs[i].doc;
    // Load the document object by its id
    Document document = indexSearcher.doc(docId);
    String link = document.get("link");
    links[i] = link;
    if (this.links.contains(link)) {
        int origin = linksWeight.get(link);
        origin++;
        linksWeight.put(link, origin);
    } else {
        this.links.add(link);
        linksWeight.put(link, 1);
    }
}
// Close the index
indexSearcher.getIndexReader().close();
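
The contains/get/put bookkeeping in the loop above can be written more compactly. The following is a minimal sketch that uses only the Java standard library; the names `LinkCounter` and `count` are invented for illustration, and the input list stands in for the `link` values read from each hit. Because a `LinkedHashMap` keeps first-seen order, the separate list of links is arguably not needed in this variant.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts link occurrences the way the loop above does.
final class LinkCounter {
    // Counts how often each link occurs, preserving first-seen order.
    static Map<String, Integer> count(List<String> hitLinks) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String link : hitLinks) {
            // Insert 1 for a new link, otherwise add 1 to the existing weight.
            weights.merge(link, 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("link1", "link2", "link1")));
        // prints {link1=2, link2=1}
    }
}
```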

5. Sort the search results. limit caps the number of sorted results returned.

// Returns up to `limit` links ordered by descending weight.
private String[] sort(int limit) {
    if (links.isEmpty()) return new String[0];
    String[] result = new String[Math.min(limit, links.size())];
    for (int i = 0; i < result.length; i++) {
        // Select the not-yet-chosen link with the largest weight.
        String best = null;
        int maxWeight = -1;
        for (String link : links) {
            int weight = linksWeight.get(link);
            if (weight > maxWeight) {
                maxWeight = weight;
                best = link;
            }
        }
        result[i] = best;
        // Mark the link as chosen (real weights are >= 1) so later passes skip it.
        linksWeight.put(best, -1);
    }
    return result;
}
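
As an alternative to a hand-written selection loop, the same ranking can be sketched with the standard library's stable sort. This is a variant under stated assumptions rather than the original code: `LinkRanker` and `top` are hypothetical names, and the weight map is passed in as a parameter instead of being read from a field. It leaves the weights untouched.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks links by weight without modifying the weight map.
final class LinkRanker {
    // Returns at most `limit` links, highest weight first; ties keep insertion order.
    static String[] top(Map<String, Integer> weights, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(weights.entrySet());
        // List.sort is stable, so equal weights stay in the map's iteration order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        int size = Math.max(0, Math.min(limit, entries.size()));
        String[] result = new String[size];
        for (int i = 0; i < size; i++) {
            result[i] = entries.get(i).getKey();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("link1", 1);
        weights.put("link2", 3);
        weights.put("link3", 1);
        System.out.println(String.join(", ", top(weights, 2)));  // link2, link1
    }
}
```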

Testing the search. Search results:

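Lessons learned from the Lucene engine:

These are the general steps for building on top of Lucene: create the index files with IndexWriter, search the index with IndexSearcher, and finally sort the results appropriately. With that, a lightweight search engine is born. The process is not complicated to implement, mainly because Lucene handles the complex indexing and retrieval work underneath. Lucene alone is largely enough to implement site search for a small or medium-sized application, without depending on a heavyweight search engine such as ElasticSearch.

It is worth noting that the search above uses only one of the many kinds of Query that Lucene provides. For scenarios in a project that have specific search requirements, Lucene's more complex query conditions can be used. If I get the chance, I will share more experience with precise, condition-based searches.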

References:

[1] Baidu Baike: lucene
