邊學邊記（四） lucene index索引結構詳解一

lucene的索引檔案資訊主要包括段（segment），文檔（document），域（field），項（term）

說到lucene的索引存儲的存儲結構，堪稱精妙。lucene給出的存儲的資料類型有以下幾種

Primitive Types
- Byte
- UInt32
- Uint64
- VInt
- Chars
- String

bype就是一個位元組8位二進制，對于lucene中的數字的存儲，先看java中的兩種整數類型 int long

一個int的數值的範圍是－231 到 231 －1 對應lucene中的UInt32

一個long型的數值的範圍是-263 到 263 -1 對應lucene中的UInt64

VINT 類型lucene給出了定義是一個可變長度的byte組合來表示一個正整數

VInt lucene給出了一個例子表格

Value	First byte	Second byte	Third byte
00000000
1	00000001
2	00000010
...
127	01111111
128	10000000	00000001
129	10000001	00000001
130	10000010	00000001
...
16,383	11111111	01111111
16,384	10000000	10000000	00000001
16,385	10000001	10000000	00000001

看表格裡德資料意思就是一個byte 表示0-127的整數如果次數<=127那麼隻占用一個byte，如果>127則會追加byte來存儲此整數的值，那麼第一個byte和第二個byte的關系和規則是什麼呢看表格中的表示形式貌似是如果此byte不夠表示的話那麼将目前byte的二進制中的頭位變1，這樣一來就沒有符号位的表示了，首位0說明隻有一個byte首位1則說明後續還有byte來存儲此數字。看看lucene的代碼中是如果來存儲一個int數字為一個VInt類型的可變的byte的吧

/** Writes an int in a variable-length format.  Writes between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see IndexInput#readVInt()
   */
  public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7f) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
  }

首先~0x7F的二進制表示是

11111111 11111111 11111111 10000000

一個正整數如果<127的話和此值進行邏輯與操作的話都是0 這樣就隻寫入一個byte

在看大數的情況下是如果确定第一個byte的值的就拿128來說吧

0x7F的和0x80的二進制表示形式

0x7F的二進制如下：

00000000 00000000 00000000 01111111 就是~0x7F的補碼

0x80的二進制如下：

00000000 00000000 00000000 10000000

128 & 0x7F

00000000 00000000 00000000 10000000 &

00000000 00000000 00000000 01111111

= 00000000 00000000 00000000 00000000 |

00000000 00000000 00000000 10000000

= 00000000 00000000 00000000 10000000

和清單中給出的值是一樣的強轉為byte的話取低8位寫入

然後帶符号位右移7位舍掉前7位将1寫入

按照這個邏輯0-127隻有一位 128-255 一位隻有低8位表示是以經過第二次右移以後第二個byte始終都是00000001 以此類推 256 -

27*2 - 1 隻有兩位byte。。。。。。

看看這樣移位以後再讀取的時候lucene的反向移位運算代碼：

/** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see IndexOutput#writeVInt(int)
   */
  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

反向運算中先于0x7F取與如果<=127那麼次操作過後i的值就等于b的值直接傳回如果>127 則i的值隻是b的低7位不涉及符号位這時進入循環将第二個byte取出還是去掉符号位(和0x7F取于的話就是去掉符号位的低7位)就是正向操作裡頭一次将高位付給第一個byte以後的值

高位在左移7 14 21.。。。就可以傳回原來的高位在和i進行或操作就對高位和低位進行了合并操作以此類推傳回原值

這個的存儲就有點麻煩線比較兩外兩種存儲數字的類型就比較容易 UInt32 UInt64分别固定儲存為4個byte和8個byte 和vint不同的是這兩種類型可以儲存負數

這個邏輯比較簡單有興趣的可以二進制計算下以下是lucene中寫入和讀取兩種資料類型的源碼

寫入一個UInt32的資料 (int 類型)

/** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see IndexOutput#writeVInt(int)
   */
  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

寫入一個UInt64(long類型)

/** Writes a long as eight bytes.
   * @see IndexInput#readLong()
   */
  public void writeLong(long i) throws IOException {
    writeInt((int) (i >> 32));
    writeInt((int) i);
  }

讀取的時候就是反向移位在進行或運算

讀取UInt32

/** Reads four bytes and returns an int.
   * @see IndexOutput#writeInt(int)
   */
  public int readInt() throws IOException {
    return ((readByte() & 0xFF) << 24) | ((readByte() & 0xFF) << 16)
         | ((readByte() & 0xFF) <<  8) |  (readByte() & 0xFF);
  }

讀取UInt64

/** Reads eight bytes and returns a long.
   * @see IndexOutput#writeLong(long)
   */
  public long readLong() throws IOException {
    return (((long)readInt()) << 32) | (readInt() & 0xFFFFFFFFL);
  }

這兩種類型的資料相對比較簡單就不在分析正向反向的移位運算和邏輯運算了

這樣對數字存儲的話有效地控制了數字類型的資料占用的索引檔案的容量

舉個例子加入将數字作為字元串寫入磁盤的話一位數字就是一個byte

0就占一個byte 127就占三個byte 100000 就占6個byte 一個最大的整數231 -1 = 2147483647 就需要10個byte 數字位數越多byte數就越多

而按照lucene的這三個數字類型的話 VInt是可變的根據數字的大小使用不同的byte數目

相應的UInt32 和UInt64也一樣固定需要4個byte和8個byte

有了這幾個類型資料的存儲和讀取方法我們就可以讀取lucene的file format中說明的segment.gen檔案了讀取的函數直接繼承lucene的org.apache.lucene.store.BufferedIndexInput類作為lucene索引檔案的讀取類隻要将其中的byte能翻譯成我們想要的結果自然就知道裡面存的是什麼值怎麼存儲了。

lucene3.0以前的版本沒用過，在3.0中lucene建立索引的時候預設使用混合檔案模式檔案隻有cfx，cfs，gen，segments_N檔案如果做過删除操作的話應該還有.del的檔案其他的檔案類型.tvd，.tvx等什麼時候出現現在我還不曉得。

先建立包含一兩個document的索引檔案（不使用混合檔案模式）

/****************
 *
 *Create Class:CreateIndex.java
 *Author:a276202460
 *Create at:2010-6-1
 */
package com.rich.lucene.index;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class CreateIndex {
	/**
	 * @param args
	 * @throws IOException
	 * @throws CorruptIndexException
	 */
	public static void main(String[] args) throws CorruptIndexException,
			IOException {
		String indexdir = "D:/lucenetest/indexs/txtindex/index4";
		Directory indexdirectory = FSDirectory.open(new File(indexdir));
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
		IndexWriter writer = null;
		try {
			writer = new IndexWriter(indexdirectory, analyzer, true,
					IndexWriter.MaxFieldLength.LIMITED);
			writer.setUseCompoundFile(false);
			writer.addDocument(getDocument("http://www.baidu.com", "百度搜尋",
					"百度搜尋引擎，國内最大的搜尋引擎"));
			writer.addDocument(getDocument("http://www.g.cn", "谷歌搜尋",
					"全球做大的搜尋引擎"));
			writer.optimize();
		} finally {
			writer.close();
		}
	}
	public static Document getDocument(String url, String title, String content) {
		Document doc = new Document();
		doc.add(new Field("title", title, Field.Store.YES,
						Field.Index.ANALYZED));
		doc.add(new Field("url", url, Field.Store.YES,
						Field.Index.NOT_ANALYZED));
		doc.add(new Field("content", content, Field.Store.NO,
				Field.Index.ANALYZED));
		return doc;
	}
}

索引建立完成後檔案結構如下：

邊學邊記（四） lucene index索引結構詳解一

lucene的索引檔案讀取類IndexFileInput：

/****************
 *
 *Create Class:IndexFileInput.java
 *Author:a276202460
 *Create at:2010-6-1
 */
package com.rich.lucene.io;
import java.io.IOException;
import java.io.RandomAccessFile;
import org.apache.lucene.store.BufferedIndexInput;
public class IndexFileInput extends BufferedIndexInput {
	
	private final IndexFile file;
	boolean isClone;
	
	public IndexFileInput(String filepath) throws Exception{
		super();
		file = new IndexFile(filepath,"r");
	}
	protected void readInternal(byte[] b, int offset, int length)
			throws IOException {
		 synchronized(file){
			 long position = this.getFilePointer();
			 if(position != file.position){
				 file.seek(position);
				 file.position = position;
			 }
			 int total = 0;
			 do{
				 int readlength = length - total;
				 final int i = file.read(b, offset+total,	 readlength);
				 if(i == -1){
					 throw new IOException("read past EOF");
				 }
				 total += i;
			 }while(total < length);
		 }
         
	}
	protected void seekInternal(long pos) throws IOException {
		 
	}
	public void close() throws IOException {
		 if(!isClone) file.close();
	}
	public long length() {
		 
		return file.length;
	}
	
	public Object Clone(){
		IndexFileInput input = (IndexFileInput)super.clone();
		input.isClone = true;
		return input;
	}
}
class IndexFile extends RandomAccessFile{
	
	protected volatile boolean isOpen;
    long position;
    final long length;
	public IndexFile(String name, String mode) throws Exception {
		super(name, mode);
		isOpen = true;
		length = this.length();
		 
	}
	 
    public void close() throws IOException {
      if (isOpen) {
        isOpen=false;
        super.close();
      }
    }
	
}

lucene的文檔中對segments.gen檔案的解釋是這樣的

As of 2.1, there is also a file segments.gen . This file contains the current generation (the _N in segments_N ) of the index. This is used only as a fallback in case the current generation cannot be accurately determined by directory listing alone (as is the case for some NFS clients with time-based directory cache expiraation). This file simply contains an Int32 version header (SegmentInfos.FORMAT_LOCKLESS = -2), followed by the generation recorded as Int64, written twice

其中就隻存了兩項一個是Int32 version header 一個是段資訊檔案中的the _N in segments_N

按照存儲的内容此檔案隻有20個byte 用IndexFileInput來讀取這個檔案看看裡面存儲的什麼資訊

/****************
 *
 *Create Class:Segmentread.java
 *Author:a276202460
 *Create at:2010-6-1
 */
package com.rich.lucene.io;
 
public class Segmentread {
	/**
	 * @param args
	 * @throws Exception 
	 */
	public static void main(String[] args) throws Exception {
		String segmentpath = "D:/lucenetest/indexs/txtindex/index4/segments.gen";
		 IndexFileInput input = new IndexFileInput(segmentpath);
		  
		 int length = (int)input.length();
		 System.out.println("segments.gen檔案存儲的byte數目為:"+length);
		 for(int i = 0;i < length;i++ ){
			 System.out.print(input.readByte()+" ");
		 }
		 System.out.println();
		 input.seek(0);
		 System.out.println(input.readInt());
		 System.out.println(input.readLong());
		 System.out.println(input.readLong());
		 input.close();
	}
}

運作結果如下：

segments.gen檔案存儲的byte數目為:20

-1 -1 -1 -2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2

-2

存儲的byte個數确實是20個按照lucene的及時前4個byte存儲的是一個UInt32的數字後面存了兩個8位的UInt64的資料

邊學邊記（四） lucene index索引結構詳解一

邊學邊記（四） lucene index索引結構詳解一

繼續閱讀

解析pdf、word2003、Excel2003、word2007、Excel2007、PowerPoint、Text 可用于Lucene

eclipse中配置heritrix的圖文過程----heritrix-1.14.3

Lucene 基本原理

ajax技術學習網址

Ajax學習--網址備忘錄

開放源代碼搜尋引擎

轉：基于lucene實作自己的推薦引擎

基于LUCENE實作自己的推薦引擎

Lucene.net和盤古分詞使用小結

Apache Lucene 5.x 內建中文分詞庫 IKAnalyzer

JFLex使用者手冊中文版安裝與配置運作JFLEX 配置檔案編寫

svn配置權限

MySQL和Lucene索引對比分析1. MySQL索引實作2. Lucene索引實作3. MySQL與Lucence對比參考：

Lucence的基本原理

lucene 關鍵字高亮

專家訪談：搜尋開源力量：Lucene技術前景