In Lucene, the on-disk index is described in terms of segments, and the current segments generation is tracked through the segments.gen file.
Like segments.gen, a segments_N file starts with a header identifying the format version of the segments in this index (the format value). That value determines what information the file contains and how it is laid out.
The information stored in a segments_N file, and its layout, is as follows:
Format, Version, NameCounter, SegCount, <SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile, DeletionCount, HasProx, Diagnostics>^SegCount, CommitUserData, Checksum
Format is a 32-bit int, here -9 (SegmentInfos.FORMAT_DIAGNOSTICS, the current format in Lucene 3.0).
Version is a long timestamp, seeded from System.currentTimeMillis(), recording when the index was last changed.
NameCounter is used to generate the name of the next segment.
SegCount records how many segments the index contains; the per-segment fields in angle brackets above are repeated that many times.
Per-segment information:
SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile, DeletionCount, HasProx, Diagnostics
SegName -- the file-name prefix shared by all index files of this segment; for the index built in the previous post this is "_0".
SegSize -- the number of documents in this segment.
DelGen -- the generation of the .del file produced by delete operations: -1 means no deletions file exists, 0 means look for _X.del, and > 0 means look for _X_N.del.
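The naming rule above can be sketched as a tiny helper. DelFileName is a hypothetical illustration, not a Lucene API; Lucene builds these names itself in its file-name utilities, writing generations in base 36:

```java
// Hypothetical helper illustrating the DelGen naming rule described above.
public class DelFileName {
    static String delFileName(String segName, long delGen) {
        if (delGen == -1) return null;                 // no deletions file at all
        if (delGen == 0) return segName + ".del";      // pre-lockless format: _X.del
        // lockless format: _X_N.del, with the generation written in base 36
        return segName + "_" + Long.toString(delGen, Character.MAX_RADIX) + ".del";
    }
}
```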
DocStoreOffset -- the offset of this segment's documents inside a shared doc store. -1 means the segment stores its documents in its own files; only when the offset is not -1 are [DocStoreSegment, DocStoreIsCompoundFile] written.
DocStoreSegment -- the name of the segment whose doc store this segment shares.
DocStoreIsCompoundFile -- whether the shared doc store uses the compound file format; if 1, readers look for a .cfx file.
HasSingleNormFile -- if 1, the norms for all fields are stored together in a single file (.nrm); otherwise each field's norms live in a separate .fN file.
NumField -- the number of fields, i.e. the size of the NormGen array, which holds one long norm generation per field (used when norms are stored per field).
IsCompoundFile -- whether this segment itself is stored as a compound file (.cfs). It is read as a byte: 1 means compound, -1 means not, and 0 means "check the directory" for pre-lockless indexes (CHECK_DIR in the code below).
DeletionCount -- the number of deleted documents in this segment.
HasProx -- whether term frequency and position (prox) data exists in this segment; individual fields may omit it via omitTermFreqAndPositions. If the stored byte is 1, prox data is present.
Diagnostics -- a key-value map of diagnostic data: the Lucene version, OS version, Java version, and so on.
CommitUserData records arbitrary user-supplied properties, written during commit/prepareCommit operations.
Checksum is a checksum computed over every byte of the segments_N file that precedes it. It is verified when the index is opened; if someone hand-edits the file, the stored checksum no longer matches and the index is reported as corrupt.
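That verification can be sketched in plain Java. This is an illustrative sketch, not Lucene's code: it assumes the file ends with a CRC32 value stored as a big-endian long, computed over all preceding bytes (which is what ChecksumIndexInput accumulates); VerifySegmentsChecksum is a made-up name:

```java
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

public class VerifySegmentsChecksum {
    // Compares the trailing 8-byte checksum against a CRC32 computed
    // over every byte that precedes it.
    static boolean verify(String path) throws Exception {
        RandomAccessFile f = new RandomAccessFile(path, "r");
        try {
            byte[] body = new byte[(int) (f.length() - 8)]; // all bytes before the stored checksum
            f.readFully(body);
            long stored = f.readLong();                     // big-endian long, as Java writes it
            CRC32 crc = new CRC32();
            crc.update(body, 0, body.length);
            return crc.getValue() == stored;
        } finally {
            f.close();
        }
    }
}
```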
All of this is held in the SegmentInfos bean, and each individual segment's data in a SegmentInfo bean. Using the index-reading class written in the previous post, we can inspect what segments_N stores. Only the first few values are read by hand here; the rest are parsed by SegmentInfos itself (its read methods come from IndexInput).
The CheckSegmentsInfo class:
/****************
 *
 * Create Class: CheckSegmentsInfo.java
 * Author: a276202460
 * Create at: 2010-6-2
 */
package com.rich.lucene.io;
import java.io.File;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
public class CheckSegmentsInfo {
/**
 * @param args
 * @throws Exception
 */
public static void main(String[] args) throws Exception {
String indexdir = "D:/lucenetest/indexs/txtindex/index4";
IndexFileInput input = null;
try {
input = new IndexFileInput(indexdir + "/segments_2");
System.out.println("seg format:" + input.readInt());
System.out.println("seg Version:" + input.readLong());
} finally {
if (input != null) {
input.close();
}
}
Directory directory = FSDirectory.open(new File(indexdir));
SegmentInfos infos = new SegmentInfos();
infos.read(directory, "segments_2");
System.out.println("info Version:" + infos.getVersion());
System.out.println("info Counter:" + infos.counter);
System.out.println("info Seg Count:" + infos.size());
for (int i = 0; i < infos.size(); i++) {
SegmentInfo info = infos.get(i);
System.out.println("****************** segment [" + i + "]");
System.out.println("segment name:" + info.name);
System.out.println("the doc count in segment:" + info.docCount);
System.out.println("del doc count in segment:" + info.getDelCount());
System.out.println("segment doc store offset:" + info.getDocStoreOffset());
if (info.getDocStoreOffset() != -1) {
System.out.println("segment's DocStoreSegment:" + info.getDocStoreSegment());
System.out.println("segment's DocStoreIsCompoundFile:" + info.getDocStoreIsCompoundFile());
}
System.out.println("segment IsCompoundFile :" + info.getUseCompoundFile());
System.out.println("segment's delcount:" + info.getDelCount());
System.out.println("segment's is hasprox:" + info.getHasProx());
Map<String, String> infodiag = info.getDiagnostics();
for (Map.Entry<String, String> diag : infodiag.entrySet()) {
System.out.println("Diagnostic key:" + diag.getKey() + " Diagnostic value:" + diag.getValue());
}
Map<String, String> userdatas = infos.getUserData();
for (Map.Entry<String, String> data : userdatas.entrySet()) {
System.out.println("user data key:" + data.getKey() + " value:" + data.getValue());
}
}
}
}
The output:
seg format:-9
seg Version:1275404730705
info Version:1275404730705
info Counter:1
info Seg Count:1
****************** segment [0]
segment name:_0
the doc count in segment:2
del doc count in segment:0
segment doc store offset:0
segment's DocStoreSegment:_0
segment's DocStoreIsCompoundFile:false
segment IsCompoundFile :false
segment's delcount:0
segment's is hasprox:true
Diagnostic key:os.version Diagnostic value:5.1
Diagnostic key:os Diagnostic value:Windows XP
Diagnostic key:lucene.version Diagnostic value:3.0.0 883080 - 2009-11-22 15:43:58
Diagnostic key:source Diagnostic value:flush
Diagnostic key:os.arch Diagnostic value:x86
Diagnostic key:java.version Diagnostic value:1.6.0
Diagnostic key:java.vendor Diagnostic value:Sun Microsystems Inc.
With the byte-level meanings described above you can check what the file actually stores. Next, let's look more closely at how Lucene stores and reads strings.
Lucene's string-writing code:
public void writeString(String s) throws IOException {
UnicodeUtil.UTF16toUTF8(s, 0, s.length(), utf8Result);
writeVInt(utf8Result.length);
writeBytes(utf8Result.result, 0, utf8Result.length);
}
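The writeVInt call above stores the length in Lucene's variable-length integer format: seven data bits per byte, least-significant group first, with the high bit flagging that more bytes follow. A minimal standalone sketch (VIntSketch is not a Lucene class):

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    // Encode an int as a Lucene-style VInt: 7 bits per byte, low bits first,
    // high bit set on every byte except the last.
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // continuation bit set
            i >>>= 7;
        }
        out.write(i);                     // final byte, continuation bit clear
        return out.toByteArray();
    }
}
```

A small length like 3 fits in one byte (0x03), while 128 becomes the two bytes 0x80 0x01.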
/**
 * Encode characters from this String, starting at offset
 * for length characters. Returns the number of bytes
 * written to bytesOut.
 */
public static void UTF16toUTF8(final String s, final int offset, final int length, UTF8Result result) {
final int end = offset + length;
byte[] out = result.result;
int upto = 0;
for (int i = offset; i < end; i++) {
final int code = (int) s.charAt(i);
if (upto + 4 > out.length) {
byte[] newOut = new byte[2 * out.length];
assert newOut.length >= upto + 4;
System.arraycopy(out, 0, newOut, 0, upto);
result.result = out = newOut;
}
if (code < 0x80) out[upto++] = (byte) code;
else if (code < 0x800) {
out[upto++] = (byte) (0xC0 | (code >> 6));
out[upto++] = (byte) (0x80 | (code & 0x3F));
} else if (code < 0xD800 || code > 0xDFFF) {
out[upto++] = (byte) (0xE0 | (code >> 12));
out[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
out[upto++] = (byte) (0x80 | (code & 0x3F));
} else {
// surrogate pair
// confirm valid high surrogate
if (code < 0xDC00 && (i < end - 1)) {
int utf32 = (int) s.charAt(i + 1);
// confirm valid low surrogate and write pair
if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
i++;
out[upto++] = (byte) (0xF0 | (utf32 >> 18));
out[upto++] = (byte) (0x80 | ((utf32 >> 12) & 0x3F));
out[upto++] = (byte) (0x80 | ((utf32 >> 6) & 0x3F));
out[upto++] = (byte) (0x80 | (utf32 & 0x3F));
continue;
}
}
// replace unpaired surrogate or out-of-order low surrogate
// with substitution character
out[upto++] = (byte) 0xEF;
out[upto++] = (byte) 0xBF;
out[upto++] = (byte) 0xBD;
}
}
assert matches(s, offset, length, out, upto);
result.length = upto;
}
Lucene writes strings in UTF-8. Java strings are internally Unicode, encoded as UTF-16 (one or two 16-bit code units per character); UTF-16 is one concrete encoding of Unicode.
The advantages of UTF-8 are beyond the scope of this post. Unlike UTF-16, UTF-8 uses 1 to 4 bytes per character: 1 to 3 for characters in the Basic Multilingual Plane, and 4 for supplementary characters encoded as surrogate pairs, as the code above shows.
After converting to UTF-8, the length of the resulting byte array is written as a VInt prefix, followed by the string's bytes. For example, the string "我" is stored as:
3 -26 -120 -111
3 is the byte length of the UTF-8 encoding of "我"; the three bytes that follow are that encoding, printed as signed Java bytes.
The length prefix makes reading straightforward; see the IndexInput class for the reading side.
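You can confirm those byte values with the JDK's own UTF-8 encoder:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Demo {
    public static void main(String[] args) {
        byte[] utf8 = "我".getBytes(StandardCharsets.UTF_8);
        // "我" (U+6211) encodes to three UTF-8 bytes: 0xE6 0x88 0x91,
        // which print as -26 -120 -111 when treated as signed Java bytes.
        System.out.println(utf8.length + " " + Arrays.toString(utf8));
    }
}
```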
A Map<String, String> is just a count followed by that many pairs:
mapsize, <String, String>^N
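Putting the pieces together, that layout can be read with a short sketch mirroring what IndexInput.readStringStringMap does in Lucene 3.0: an int32 count, then VInt-length-prefixed UTF-8 strings. StringMapSketch itself is not a Lucene class; it relies on Lucene's fixed-width integers being big-endian, like DataInputStream's:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class StringMapSketch {
    // Lucene-style VInt: 7 data bits per byte, low bits first,
    // high bit set while more bytes follow.
    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // A string is its UTF-8 byte length as a VInt, then that many bytes.
    static String readString(DataInputStream in) throws IOException {
        byte[] buf = new byte[readVInt(in)];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // mapsize (int32) followed by <key, value> string pairs.
    static Map<String, String> readStringStringMap(DataInputStream in) throws IOException {
        int count = in.readInt();
        Map<String, String> map = new LinkedHashMap<String, String>();
        for (int i = 0; i < count; i++) {
            map.put(readString(in), readString(in));
        }
        return map;
    }
}
```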
The Lucene code that reads a segments_N file:
/**
 * Read a particular segmentFileName. Note that this may
 * throw an IOException if a commit is in process.
 *
 * @param directory -- directory containing the segments file
 * @param segmentFileName -- segment file to load
 * @throws CorruptIndexException if the index is corrupt
 * @throws IOException if there is a low-level IO error
 */
public final void read(Directory directory, String segmentFileName) throws CorruptIndexException, IOException {
boolean success = false;
// Clear any previous segments:
clear();
ChecksumIndexInput input = new ChecksumIndexInput(directory.openInput(segmentFileName));
generation = generationFromSegmentsFileName(segmentFileName);
lastGeneration = generation;
try {
int format = input.readInt();
if (format < 0) {
// file contains explicit format info
// check that it is a format we can understand
if (format < CURRENT_FORMAT) throw new CorruptIndexException("Unknown format version: " + format);
version = input.readLong();
// read version
counter = input.readInt();
// read counter
} else {
// file is in old format without explicit format info
counter = format;
}
for (int i = input.readInt(); i > 0; i--) {
// read segmentInfos
add(new SegmentInfo(directory, format, input));
}
if (format >= 0) {
// in old format the version number may be at the end of the file
if (input.getFilePointer() >= input.length())
version = System.currentTimeMillis();
// old file format without version number
else version = input.readLong();
// read version
}
if (format <= FORMAT_USER_DATA) {
if (format <= FORMAT_DIAGNOSTICS) {
userData = input.readStringStringMap();
} else if (0 != input.readByte()) {
userData = Collections.singletonMap("userData", input.readString());
} else {
userData = Collections.<String, String>emptyMap();
}
} else {
userData = Collections.<String, String>emptyMap();
}
if (format <= FORMAT_CHECKSUM) {
final long checksumNow = input.getChecksum();
final long checksumThen = input.readLong();
if (checksumNow != checksumThen) throw new CorruptIndexException("checksum mismatch in segments file");
}
success = true;
} finally {
input.close();
if (!success) {
// Clear any segment infos we had loaded so we
// have a clean slate on retry:
clear();
}
}
}
/**
 * Construct a new SegmentInfo instance by reading a
 * previously saved SegmentInfo from input.
 *
 * @param dir directory to load from
 * @param format format of the segments info file
 * @param input input handle to read segment info from
 */
SegmentInfo(Directory dir, int format, IndexInput input) throws IOException {
this.dir = dir;
name = input.readString();
docCount = input.readInt();
if (format <= SegmentInfos.FORMAT_LOCKLESS) {
delGen = input.readLong();
if (format <= SegmentInfos.FORMAT_SHARED_DOC_STORE) {
docStoreOffset = input.readInt();
if (docStoreOffset != -1) {
docStoreSegment = input.readString();
docStoreIsCompoundFile = (1 == input.readByte());
} else {
docStoreSegment = name;
docStoreIsCompoundFile = false;
}
} else {
docStoreOffset = -1;
docStoreSegment = name;
docStoreIsCompoundFile = false;
}
if (format <= SegmentInfos.FORMAT_SINGLE_NORM_FILE) {
hasSingleNormFile = (1 == input.readByte());
} else {
hasSingleNormFile = false;
}
int numNormGen = input.readInt();
if (numNormGen == NO) {
normGen = null;
} else {
normGen = new long[numNormGen];
for (int j = 0; j < numNormGen; j++) {
normGen[j] = input.readLong();
}
}
isCompoundFile = input.readByte();
preLockless = (isCompoundFile == CHECK_DIR);
if (format <= SegmentInfos.FORMAT_DEL_COUNT) {
delCount = input.readInt();
assert delCount <= docCount;
} else delCount = -1;
if (format <= SegmentInfos.FORMAT_HAS_PROX) hasProx = input.readByte() == 1;
else hasProx = true;
if (format <= SegmentInfos.FORMAT_DIAGNOSTICS) {
diagnostics = input.readStringStringMap();
} else {
diagnostics = Collections.<String, String>emptyMap();
}
} else {
delGen = CHECK_DIR;
normGen = null;
isCompoundFile = CHECK_DIR;
preLockless = true;
hasSingleNormFile = false;
docStoreOffset = -1;
docStoreIsCompoundFile = false;
docStoreSegment = null;
delCount = -1;
hasProx = true;
diagnostics = Collections.<String, String>emptyMap();
}
}