Lucene in Action notes: search

I. How to run a search with Lucene

1. Create an IndexSearcher

IndexSearcher searcher = new IndexSearcher(directory);

2. Build the term to search for

Term t = new Term("subject", "ant");

3. Create the query

Query query = new TermQuery(t);

4. Search and get the results

Hits hits = searcher.search(query);

The query here is built directly from a single Term. For more complex queries, use QueryParser to generate the query.

Query query = QueryParser.parse("+junit +ant -mock", "contents", new SimpleAnalyzer());

static public Query parse(String query, String field, Analyzer analyzer) throws ParseException

query: the query string to parse

field: the default field

analyzer: the Analyzer applied to the query string, which handles things like case

II. Using IndexSearcher

1. Search

Hits search(Query query): straightforward searches needing no filtering.

Hits search(Query query, Filter filter): searches constrained to a subset of available documents, based on filter criteria.

void search(Query query, HitCollector results): used only when all documents found from a search will be needed.

Unlike Hits, a HitCollector receives every matching document, so watch the performance cost of this method.

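2. Working with Hits

Hits presumably keeps track of all matching results but loads only the top 100 into memory, since that is usually all a user needs. Calling doc(n) for a document that is not cached has to load it from the index first, so it is best to call it only when the document is actually needed.

length(): number of documents in the Hits collection

doc(n): Document instance of the nth top-scoring document

id(n): document ID of the nth top-scoring document

score(n): normalized score (based on the score of the topmost document) of the nth top-scoring document, guaranteed to be greater than 0 and less than or equal to 1

For paging through Hits, the recommended solution is to rerun the search for every page to get a fresh Hits, then take only the page you need from it. This fits the principle of a stateless server.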
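The paging approach above can be sketched in plain Java. This is only an illustration under stated assumptions: HitsPager and page are hypothetical names, and a List stands in for the ranked results of a freshly executed search. With the real Hits object you would loop from the start index to the end index and call hits.doc(i) instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks one page out of a ranked result list.
// The List stands in for Lucene's Hits; nothing is cached between requests.
class HitsPager {
    /** Returns the results for a 0-based page, or an empty list past the end. */
    static <T> List<T> page(List<T> ranked, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        long start = (long) pageIndex * pageSize;   // long avoids int overflow
        if (start >= ranked.size()) {
            return new ArrayList<>();
        }
        int from = (int) start;
        int to = (int) Math.min(start + pageSize, ranked.size());
        return new ArrayList<>(ranked.subList(from, to));
    }
}
```

Each request reruns the query and calls page with the requested index, so the server keeps no per-user search state.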

3. Reading indexes into memory

If you have enough memory and the index itself does not change, you can load the index into memory, and searches become very fast.

RAMDirectory ramDir = new RAMDirectory(dir);

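III. Understanding Lucene scoring

How the ranking is computed is not covered here.

There is an API for inspecting the details of a ranking: why did this document come first?

Explanation explanation = searcher.explain(query, hits.id(0));

System.out.println(explanation.toString());

The printed values of the individual factors show why that document ranked first.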

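IV. Creating queries programmatically

This section builds queries in code. The other approach is to generate them with QueryParser.

1. Searching by term: TermQuery

Term t = new Term("contents", "java"); // Term(field, value)

A Term's value is case-sensitive, so the case used at search time must match the case used at index time.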
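To make the case-sensitivity point concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's implementation: TinyIndex is a hypothetical class, and its add method only imitates an analyzer that splits on non-letters and lowercases (roughly what SimpleAnalyzer does, restricted here to ASCII letters). The lookup is an exact string match, just as a term lookup is.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index: term -> IDs of the documents containing it.
class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index time: lowercase the text, then split on anything that is not a-z.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    // Search time: exact match on the stored term, no analysis applied.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

After add(1, "Java in Action"), a search for "java" finds document 1 while a search for "Java" finds nothing, because only the lowercased term was stored.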

2. Searching within a range: RangeQuery

Term begin = new Term("pubmonth","198805");

Term end = new Term("pubmonth","198810");

RangeQuery query = new RangeQuery(begin, end, true); // the last flag says whether begin and end themselves are included

In QueryParser syntax: [begin TO end] (inclusive) or {begin TO end} (exclusive).
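The range here is a comparison of term strings, which is why a fixed-width format such as YYYYMM works for pubmonth. The following is a small sketch of that comparison in plain Java. TermRange and inRange are hypothetical names, and the code only illustrates the inclusive/exclusive flag, not how Lucene enumerates terms internally.

```java
// Hypothetical helper: does a term value fall between begin and end,
// compared as strings (lexicographically)?
class TermRange {
    static boolean inRange(String value, String begin, String end, boolean inclusive) {
        int lo = value.compareTo(begin); // < 0 means value sorts before begin
        int hi = value.compareTo(end);   // > 0 means value sorts after end
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }
}
```

With begin "198805" and end "198810", the value "198805" is in range only when the flag is true, and "198807" is in range either way. Because the comparison is on strings, unpadded numbers misbehave: "9" sorts after "10", so inRange("9", "1", "10", true) is false.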

3. Searching on a string: PrefixQuery

Matches by prefix. The example below finds this directory and all of its subdirectories.

Term term = new Term("category", "/technology/computers/programming");

PrefixQuery query = new PrefixQuery(term);

In QueryParser syntax: prefix*
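One way to picture a prefix search is as a scan over a sorted set of terms: all terms sharing a prefix sit next to each other, so the scan can start at the prefix and stop at the first term that no longer matches. The sketch below shows that idea in plain Java. PrefixScan and withPrefix are hypothetical names, and the TreeSet is only a stand-in for a sorted term list, not Lucene's actual data structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical helper: collect every term that starts with the given prefix.
class PrefixScan {
    static List<String> withPrefix(NavigableSet<String> sortedTerms, String prefix) {
        List<String> out = new ArrayList<>();
        // tailSet(prefix, true) starts at the first term >= prefix.
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match either
            }
            out.add(term);
        }
        return out;
    }
}
```

For the category example, a scan for "/technology/computers/programming" returns that category and "/technology/computers/programming/education", and skips "/technology/physics".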

4. Combining queries: BooleanQuery

TermQuery searchingBooks = new TermQuery(new Term("subject","search"));

RangeQuery currentBooks = new RangeQuery(new Term("pubmonth","200401"), new Term("pubmonth","200412"), true);

BooleanQuery currentSearchingBooks = new BooleanQuery();

currentSearchingBooks.add(searchingBooks, true, false);

currentSearchingBooks.add(currentBooks, true, false);

As shown, sub-queries are added to a BooleanQuery with add, and the last two parameters decide how the clauses combine (AND, OR, NOT).

The two parameters are required and prohibited. They cannot both be true.

false, false: clause is optional (OR)

true, false: clause must match (AND)

false, true: clause must not match (NOT)

In QueryParser syntax: -, +, AND, OR, and NOT

Lucene's newer API is:

public void add(Query query, BooleanClause.Occur occur)

where occur can be BooleanClause.Occur.MUST, BooleanClause.Occur.SHOULD or BooleanClause.Occur.MUST_NOT

(AND) BooleanClause.Occur.MUST means exactly that: only documents matching that clause are considered.

(OR) BooleanClause.Occur.SHOULD means the term is optional.

(NOT) BooleanClause.Occur.MUST_NOT means any documents matching this clause are excluded from the results.
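The three occurrence values can be illustrated with a small matcher in plain Java. This is a sketch of the matching rules only, with no scoring. BooleanMatch is a hypothetical class, its nested Occur enum merely mirrors the names of BooleanClause.Occur, and each clause is reduced to a single term. It also assumes the usual rule that a query with no MUST clause needs at least one SHOULD clause to match.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical matcher: decides whether a document's terms satisfy
// a set of single-term clauses. Illustrative only; no scoring.
class BooleanMatch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static boolean matches(Set<String> docTerms, Map<String, Occur> clauses) {
        boolean hasMust = false;
        boolean shouldMatched = false;
        for (Map.Entry<String, Occur> clause : clauses.entrySet()) {
            boolean present = docTerms.contains(clause.getKey());
            switch (clause.getValue()) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // required term missing
                    break;
                case MUST_NOT:
                    if (present) return false;    // prohibited term found
                    break;
                case SHOULD:
                    if (present) shouldMatched = true;
                    break;
            }
        }
        // Assumption: without any MUST clause, at least one SHOULD must match.
        return hasMust || shouldMatched;
    }
}
```

The earlier query string "+junit +ant -mock" corresponds to junit and ant as MUST and mock as MUST_NOT: a document containing junit and ant matches, and adding mock to it makes it fail.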

5. The book goes on to PhraseQuery, WildcardQuery and FuzzyQuery. They are not covered one by one here; look them up when you need them.

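V. Parsing query expressions: QueryParser

Building queries through the API works well, but sometimes a human-readable textual query representation is needed.

For an existing query, Query.toString returns its human-readable textual representation.

The details are skipped here. QueryParser looks good, but it cannot express every query; some queries have to be built through the API. Parsing with QueryParser should also cost some extra time. For applications that offer users a search UI, QueryParser is very convenient.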

This article is excerpted from cnblogs (博客园). Original publication date: 2011-07-04
