Implementing Web Page Snapshots for Nutch 1.3 with Solr Integration (Part 1)

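Starting with version 1.3, Nutch uses Solr as its indexing provider, which brings large improvements in indexing efficiency and clustering. Compared with Nutch 1.2, however, the Solr-based setup lacks the web page snapshot feature. After following the integration steps in the official manual, every query result contains only the parsed body text of the HTML page, as shown in the figure below: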

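For users who need a snapshot of the original page, this is a serious inconvenience. Nutch 1.3 therefore needs some changes so that page snapshots still work after the integration.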

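Looking at how Nutch 1.2 handled this, its built-in indexer actually indexed the entire page. In 1.3, Nutch strips the HTML markup itself before calling the Solr service (the internal mechanism is not discussed here), so Solr receives only the "body text" of the page, which is what appears inside the Content tag in the figure above. The core of the work is therefore to hand the cached copy of the whole page to Solr as well, and to return it as part of the result when Solr is queried.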
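The difference between the two kinds of document can be illustrated with a minimal, self-contained sketch. It does not use the Nutch or Solr APIs. The class name SnapshotSketch, the field name "cache", and the regex-based tag stripping are all assumptions made for illustration only (Nutch's real parser is far more thorough). The sketch simply contrasts a document that carries only the stripped text with one that also carries the raw page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SnapshotSketch {
    // Crude stand-in for HTML parsing: replace tags with spaces, collapse whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Builds the fields that would be sent to the index.
    // "cache" is a hypothetical field name for the raw page snapshot.
    static Map<String, String> buildDoc(String url, String rawHtml, boolean keepSnapshot) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("content", stripTags(rawHtml)); // what the index receives today
        if (keepSnapshot) {
            doc.put("cache", rawHtml);          // the extra field a snapshot needs
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hello</h1><p>world</p></body></html>";
        System.out.println(buildDoc("http://example.com/", html, false));
        System.out.println(buildDoc("http://example.com/", html, true));
    }
}
```

Without the extra field, nothing in the indexed document allows the original markup to be rebuilt, which is why the raw page has to travel to Solr alongside the stripped text.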

In the project, find the "indexSolr" method of the "SolrIndexer" class:

public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  LOG.info("SolrIndexer: starting at " + sdf.format(start));

  // Configure the Hadoop job that feeds documents to Solr
  final JobConf job = new NutchJob(getConf());
  job.setJobName("index-solr " + solrUrl);

  // Register the input paths and the mapper/reducer classes
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

  job.set(SolrConstants.SERVER_URL, solrUrl);
  NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);
  job.setReduceSpeculativeExecution(false);

  // Temporary output directory, deleted in the finally block
  final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
      new Random().nextInt());
  FileOutputFormat.setOutputPath(job, tmp);

  try {
    JobClient.runJob(job);
    // do the commits once and for all the reducers in one go
    SolrServer solr = new CommonsHttpSolrServer(solrUrl);
    solr.commit();
    long end = System.currentTimeMillis();
    LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
  } catch (Exception e) {
    LOG.error(e);
  } finally {
    FileSystem.get(job).delete(tmp, true);
  }
}

Nutch relies on Hadoop's distributed computing mechanism here. Let's jump into the "IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job)" method:

public static void initMRJob(Path crawlDb, Path linkDb,
    Collection<Path> segments,
    JobConf job) {
  LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
  LOG.info("IndexerMapReduce: linkdb: " + linkDb);

  // Per-segment inputs: fetch/parse status, parse metadata, and parsed text
  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
  }

  // Global inputs: the crawl database and the link database
  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  // The same class acts as both mapper and reducer
  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}

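As the FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME)); call and its neighbours show, the only page content read from each segment folder comes from "parse_data" and "parse_text" (alongside the "crawl_fetch" and "crawl_parse" status entries). The segment's "content" directory, which holds the raw fetched page, is never added as an input.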
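The gap can be made concrete with a small standalone sketch. It uses plain strings rather than Hadoop's Path and FileInputFormat, and the class and method names (SegmentInputs, inputsFor) are hypothetical. The directory names are the values these constants are assumed to hold in Nutch 1.3, with "content" being the segment subdirectory that stores the raw fetched pages. This is not a patch, only an outline of which inputs exist today and which one a snapshot feature would additionally need.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentInputs {
    // Assumed values of CrawlDatum.FETCH_DIR_NAME, CrawlDatum.PARSE_DIR_NAME,
    // ParseData.DIR_NAME and ParseText.DIR_NAME.
    static final String[] INDEXED_DIRS = {
        "crawl_fetch", "crawl_parse", "parse_data", "parse_text"
    };

    // Assumed name of the segment subdirectory holding raw page content.
    static final String RAW_CONTENT_DIR = "content";

    // Lists the per-segment input paths; the raw content directory is
    // included only when includeRawContent is true.
    static List<String> inputsFor(String segment, boolean includeRawContent) {
        List<String> inputs = new ArrayList<>();
        for (String dir : INDEXED_DIRS) {
            inputs.add(segment + "/" + dir);
        }
        if (includeRawContent) {
            inputs.add(segment + "/" + RAW_CONTENT_DIR);
        }
        return inputs;
    }

    public static void main(String[] args) {
        // The segment name below is a made-up example.
        String segment = "crawl/segments/20111118000000";
        System.out.println("current inputs:  " + inputsFor(segment, false));
        System.out.println("with raw pages:  " + inputsFor(segment, true));
    }
}
```

Adding the extra input is only the first step: the mapper and reducer would also have to recognise the raw content records and copy them into the document sent to Solr.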

This article is reposted from william_xu's 51CTO blog. Original link: http://blog.51cto.com/williamx/722707. To republish it, please contact the original author.
