Lucene in Action notes: search

I. How to run a search with Lucene

1. Create an IndexSearcher

IndexSearcher searcher = new IndexSearcher(directory);

2. Build the term to search for

Term t = new Term("subject", "ant");

3. Create the query

Query query = new TermQuery(t);

4. Search and get the results

Hits hits = searcher.search(query);

The query here is built from a single term. More complex queries should be generated with QueryParser:

Query query = QueryParser.parse("+junit +ant -mock", "contents", new SimpleAnalyzer());

static public Query parse(String query, String field, Analyzer analyzer) throws ParseException

query: the query string to parse

field: the default field

analyzer: the analyzer applied to the query string, which handles case normalization and similar processing

II. Using IndexSearcher

1. search

Hits search(Query query): straightforward searches needing no filtering.

Hits search(Query query, Filter filter): searches constrained to a subset of available documents, based on filter criteria.

void search(Query query, HitCollector results): used only when all documents found from a search will be needed.

Unlike Hits, a HitCollector receives every matching result, so keep this method's performance cost in mind.

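2. Working with Hits

Hits should keep track of every matching result but loads only the top 100 into memory, since that is usually all a user needs. Calling doc(n) for a document that is not cached has to load it from the index first, so call it only when the document is actually needed.

length(): number of documents in the Hits collection

doc(n): Document instance of the nth top-scoring document

id(n): document ID of the nth top-scoring document

score(n): normalized score (based on the score of the topmost document) of the nth top-scoring document, guaranteed to be greater than 0 and less than or equal to 1

For paging through hits, the recommended solution is to rerun the search for every page to produce a fresh Hits and take only the page you need from it. This keeps the server side stateless.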
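The per-page re-query approach can be sketched in plain Java. This is a minimal illustration and not Lucene code: the PageWindow class and its pageStart/pageEnd helpers are hypothetical names, and totalHits stands in for hits.length(). It only computes which hit indices one page covers; a real request handler would rerun the search and call hits.doc(i) for each index in that window.

```java
/** Hypothetical helper: computes the slice of hit indices for one page. */
final class PageWindow {

    private static void check(int page, int pageSize, int totalHits) {
        if (page < 0 || pageSize <= 0 || totalHits < 0) {
            throw new IllegalArgumentException("page >= 0, pageSize > 0, totalHits >= 0 required");
        }
    }

    /** Index of the first hit on the zero-based page, clamped to totalHits. */
    static int pageStart(int page, int pageSize, int totalHits) {
        check(page, pageSize, totalHits);
        long start = (long) page * pageSize; // long avoids int overflow
        return (int) Math.min(start, (long) totalHits);
    }

    /** One past the index of the last hit on the page, clamped to totalHits. */
    static int pageEnd(int page, int pageSize, int totalHits) {
        check(page, pageSize, totalHits);
        long end = ((long) page + 1) * pageSize;
        return (int) Math.min(end, (long) totalHits);
    }

    public static void main(String[] args) {
        int totalHits = 25; // stands in for hits.length() after a fresh search
        int page = 2;       // third page, zero-based
        for (int i = pageStart(page, 10, totalHits); i < pageEnd(page, 10, totalHits); i++) {
            System.out.println("would load hit #" + i); // e.g. hits.doc(i)
        }
    }
}
```

Because the window is derived from the page number alone, the server keeps nothing between requests, which is the stateless property the note recommends.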

3. Reading indexes into memory

If you have enough memory and the index itself does not change, you can load the index into memory, which makes searching very fast.

RAMDirectory ramDir = new RAMDirectory(dir);

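III. Understanding Lucene scoring

How the ranking is computed is not covered here.

There is, however, a method that shows the details behind a ranking and answers the question of why a given document came first:

Explanation explanation = searcher.explain(query, hits.id(0));

System.out.println(explanation.toString());

The factor values it prints show why that document ranked first.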

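IV. Creating queries programmatically

This section builds queries in code; the alternative is to generate them with QueryParser.

1. Searching by term: TermQuery

Term t = new Term("contents", "java"); // Term(field, value)

A term's value is case-sensitive, so the case used at search time must match the case used at index time.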
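Why case matters can be shown without Lucene. The sketch below is a stand-in built on assumptions: a Set<String> plays the role of the terms indexed for one field, and normalize plays the role of a lowercasing analyzer. TermCaseDemo, indexTerms, normalize and matches are all hypothetical names, not Lucene API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Plain-Java stand-in for exact term matching; not Lucene code. */
final class TermCaseDemo {

    /** Assumed analyzer behavior: lowercase every token. */
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** "Indexes" the tokens by storing their normalized form. */
    static Set<String> indexTerms(String... tokens) {
        Set<String> indexed = new HashSet<String>();
        for (String token : tokens) {
            indexed.add(normalize(token));
        }
        return indexed;
    }

    /** A term matches only if its value equals an indexed term exactly. */
    static boolean matches(Set<String> indexed, String termValue) {
        return indexed.contains(termValue);
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Java", "Lucene");
        System.out.println(matches(indexed, "java"));            // same case as the index
        System.out.println(matches(indexed, "Java"));            // case differs from the index
        System.out.println(matches(indexed, normalize("Java"))); // normalized before lookup
    }
}
```

The point of the design is that a hand-built term skips analysis, so the caller has to apply the same normalization the index used.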

2. Searching within a range: RangeQuery

Term begin = new Term("pubmonth","198805");

Term end = new Term("pubmonth","198810");

RangeQuery query = new RangeQuery(begin, end, true); // the final flag says whether begin and end are included

In QueryParser syntax: [begin TO end] (inclusive) or {begin TO end} (exclusive).
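Term ranges are compared as strings, which is why a fixed-width value such as a YYYYMM pubmonth works well. The following plain-Java sketch illustrates that comparison only; it is not RangeQuery's implementation, and TermRangeSketch and inRange are hypothetical names.

```java
/** Plain-Java stand-in for how term ranges compare; not Lucene code. */
final class TermRangeSketch {

    /**
     * Returns true if value lies between begin and end when compared
     * as strings. The inclusive flag covers both endpoints at once,
     * mirroring the single boolean in the RangeQuery call above.
     */
    static boolean inRange(String value, String begin, String end, boolean inclusive) {
        int lower = value.compareTo(begin);
        int upper = value.compareTo(end);
        return inclusive ? (lower >= 0 && upper <= 0)
                         : (lower > 0 && upper < 0);
    }

    public static void main(String[] args) {
        System.out.println(inRange("198807", "198805", "198810", true));  // inside the range
        System.out.println(inRange("198805", "198805", "198810", false)); // endpoint, exclusive range
        // String order is not numeric order: "100" sorts between "10" and "20".
        System.out.println(inRange("100", "10", "20", true));
    }
}
```

The last call shows why values of varying width need zero padding before they are indexed for range searches.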

3. Searching on a string: PrefixQuery

PrefixQuery matches by prefix. The query below finds this category path and everything under it:

Term term = new Term("category", "/technology/computers/programming");

PrefixQuery query = new PrefixQuery(term);

In QueryParser syntax: prefix*
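Prefix matching on category paths behaves like a plain string startsWith test. The sketch below shows this with ordinary Java collections; PrefixSketch and withPrefix are hypothetical names, and the category values other than the one in the note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for prefix matching on term values; not Lucene code. */
final class PrefixSketch {

    /** Keeps the values that start with the given prefix, in input order. */
    static List<String> withPrefix(List<String> values, String prefix) {
        List<String> matched = new ArrayList<String>();
        for (String value : values) {
            if (value.startsWith(prefix)) {
                matched.add(value);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Illustrative category values, assumed for this sketch.
        List<String> categories = Arrays.asList(
                "/technology/computers/programming",
                "/technology/computers/programming/education",
                "/technology/computers/ai",
                "/health");
        for (String c : withPrefix(categories, "/technology/computers/programming")) {
            System.out.println(c);
        }
    }
}
```

A pure prefix test would also accept a sibling such as "/technology/computers/programming-languages", so a path scheme that needs strict parent/child matching may want a trailing separator in the indexed value.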

4. Combining queries: BooleanQuery

TermQuery searchingBooks = new TermQuery(new Term("subject","search"));

RangeQuery currentBooks = new RangeQuery(new Term("pubmonth","200401"), new Term("pubmonth","200412"), true);

BooleanQuery currentSearchingBooks = new BooleanQuery();

currentSearchingBooks.add(searchingBooks, true, false);

currentSearchingBooks.add(currentBooks, true, false);

As shown, add attaches subqueries to a BooleanQuery, and its last two parameters determine how the clauses combine (AND, OR, NOT).

The two parameters are required and prohibited; they cannot both be true.

false, false: clause is optional (OR)

true, false: clause must match (AND)

false, true: clause must not match (NOT)

In QueryParser syntax: -, +, AND, OR, and NOT

The newer Lucene interface is:

public void add(Query query, BooleanClause.Occur occur)

where occur can be BooleanClause.Occur.MUST, BooleanClause.Occur.SHOULD, or BooleanClause.Occur.MUST_NOT

(AND) BooleanClause.Occur.MUST means exactly that: only documents matching that clause are considered.

(OR) BooleanClause.Occur.SHOULD means the term is optional.

(NOT) BooleanClause.Occur.MUST_NOT means any documents matching this clause are excluded from the results.
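The correspondence between the older (required, prohibited) flag pair and the three occur values can be written down as a small mapping. This is a plain-Java sketch under stated assumptions: the nested Occur enum is a local stand-in for BooleanClause.Occur so the example runs without the Lucene jar, and OccurMapping and fromFlags are hypothetical names.

```java
/** Maps the old boolean flag pair onto the three occur values; not Lucene code. */
final class OccurMapping {

    /** Local stand-in for BooleanClause.Occur. */
    enum Occur { MUST, SHOULD, MUST_NOT }

    static Occur fromFlags(boolean required, boolean prohibited) {
        if (required && prohibited) {
            // The notes state that both flags cannot be true together.
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        if (required) {
            return Occur.MUST;      // true, false: clause must match (AND)
        }
        if (prohibited) {
            return Occur.MUST_NOT;  // false, true: clause must not match (NOT)
        }
        return Occur.SHOULD;        // false, false: clause is optional (OR)
    }

    public static void main(String[] args) {
        System.out.println(fromFlags(true, false));
        System.out.println(fromFlags(false, false));
        System.out.println(fromFlags(false, true));
    }
}
```

A single enum value removes the invalid fourth combination that the two-boolean form allowed callers to write.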

5. PhraseQuery, WildcardQuery, and FuzzyQuery come next in the book but are not covered here; look them up when you need them.

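V. Parsing query expressions: QueryParser

Building queries through the API works well, but sometimes a human-readable textual query representation is needed.

For an existing Query, Query.toString returns a human-readable textual representation of it.

The details are skipped here. QueryParser looks good, but it cannot express every query; some must be built through the API. Parsing with QueryParser should also cost some extra time. For applications that offer users a query UI, QueryParser is very convenient.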

Excerpted from 博客园 (cnblogs); originally published 2011-07-04.