
Nutch 1.3 Study Notes, Side Story: Extending Nutch with a Plugin to Add Custom Index Fields


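1. Using Nutch with Solr

  1.1 Basic configuration

  • Add the http.agent.name property to conf/nutch-site.xml.
  • Create a seed directory with mkdir -p urls, create a seed file in it, and write one URL into that file, for example http://nutch.apache.org/
  • Edit conf/regex-urlfilter.txt to configure the URL filter. The defaults are usually fine; to crawl only nutch.apache.org you can add the following rule:
      +^http://([a-z0-9]*\.)*nutch.apache.org/
  • Crawl the pages with the following command:
      bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    Options:
      -dir    name of the directory that receives the crawl results
      -depth  crawl depth
      -topN   maximum number of pages fetched at each level
    When the crawl finishes you should normally see these directories:
      crawl/crawldb
      crawl/linkdb
      crawl/segments
  • Build the index with the following command:
      bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

  This command assumes that the default Solr server is already running. To start it:
      cd ${APACHE_SOLR_HOME}/example
      java -jar start.jar
  Once the server is up, you can test it by opening these addresses in a browser:
      http://localhost:8983/solr/admin/
      http://localhost:8983/solr/admin/stats.jsp

  To use Solr together with Nutch, Solr also needs a matching schema. The Nutch conf directory ships a default one, which only has to be copied into the corresponding Solr directory:
      cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
  Restart Solr after copying the file.

  Once the index has been built you can search it by keyword. By default Solr returns the result as an XML document.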
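To get a feel for what the nutch.apache.org rule from the URL filter step accepts, the pattern can be tried on its own with java.util.regex. This is only a sketch: the class UrlRuleDemo is a made-up name, and it assumes that the rule is applied by searching for the pattern in the URL rather than matching the whole string.

```java
import java.util.regex.Pattern;

/** Stand-alone illustration; this is not Nutch's own URL filter class. */
class UrlRuleDemo {
    // The accept rule from regex-urlfilter.txt, without its leading '+'.
    static final Pattern RULE =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    /** Assumption: a URL is accepted when the pattern is found in it. */
    static boolean accepts(String url) {
        return RULE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://nutch.apache.org/"));
        System.out.println(accepts("http://wiki.nutch.apache.org/faq"));
        System.out.println(accepts("http://lucene.apache.org/"));
    }
}
```

Note that the dots in nutch.apache.org are not escaped in the rule, so each of them matches any single character; writing nutch\.apache\.org would be stricter.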

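2. Nutch's indexing filter plugins

 Apart from a few metadata fields such as segment, boost and digest, all of Nutch's index fields are produced by plugins called indexing filters, for example index-basic, index-more and index-anchor. They use Nutch's plugin mechanism to generate the fields of the index document. To define your own index fields you have to implement the IndexingFilter interface, which is defined as follows: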

/** Extension point for indexing.  Permits one to add metadata to the indexed
 * fields.  All plugins found which implement this extension point are run
 * sequentially on the parse.
 */
public interface IndexingFilter extends Pluggable, Configurable {
  /** The name of the extension point. */
  final static String X_POINT_ID = IndexingFilter.class.getName();




  /**
   * Adds fields or otherwise modifies the document that will be indexed for a
   * parse. Unwanted documents can be removed from indexing by returning a null value.
   * 
   * @param doc document instance for collecting fields
   * @param parse parse data instance
   * @param url page url
   * @param datum crawl datum for the page
   * @param inlinks page inlinks
   * @return modified (or a new) document instance, or null (meaning the document
   * should be discarded)
   * @throws IndexingException
   */
  NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
    throws IndexingException;
}
           

The main task is to implement the abstract filter method.

NutchDocument instances are created in the reduce method of IndexerMapReduce.java, which calls the indexing filter plugins to set the index fields.
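The contract described in the interface comment (filters run one after another, and a null return value discards the document) can be sketched without any Nutch classes. Everything below is a simplified stand-in: SimpleFilter, runAll and the map-based document are hypothetical names, not the Nutch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-ins; these are NOT the Nutch classes. */
class FilterChainDemo {
    /** Hypothetical stand-in for IndexingFilter, reduced to a document and a URL. */
    interface SimpleFilter {
        Map<String, Object> filter(Map<String, Object> doc, String url);
    }

    /** Runs the filters in order; a null result means "do not index this page". */
    static Map<String, Object> runAll(List<SimpleFilter> filters,
                                      Map<String, Object> doc, String url) {
        for (SimpleFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // a filter discarded the document
            }
        }
        return doc;
    }

    /** Builds a three-filter chain and applies it to an empty document. */
    static Map<String, Object> demo(String url) {
        List<SimpleFilter> filters = new ArrayList<>();
        filters.add((doc, u) -> { doc.put("url", u); return doc; });
        filters.add((doc, u) -> u.endsWith(".tmp") ? null : doc);
        filters.add((doc, u) -> { doc.put("length", u.length()); return doc; });
        return runAll(filters, new LinkedHashMap<>(), url);
    }

    public static void main(String[] args) {
        System.out.println(demo("http://nutch.apache.org/"));
        System.out.println(demo("http://example.com/scratch.tmp"));
    }
}
```

In the second call the middle filter returns null, so the last filter never runs and nothing would be indexed for that URL.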

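3. Writing your own indexing filter plugin

Suppose we need custom values in the index document, for example two additional fields, metadata and fetchTime, that can be queried and displayed in Solr. The plugin code and some notes on it follow.

The first step is to write an indexing filter. The code is as follows: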

package org.apache.nutch.indexer.metadata;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

import java.util.Date;
import org.apache.hadoop.conf.Configuration;

/**
 * Adds the parse metadata and the fetch time of a page to the document as
 * the index fields "metadata" and "fetchTime".
 * 
 * @author Lemo lu
 */
public class MetaDataIndexingFilter implements IndexingFilter {
    public static final Logger LOG =
        LoggerFactory.getLogger(MetaDataIndexingFilter.class);

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // add metadata field
        addMetaData(doc, parse.getData(), datum);
        // add fetch time field
        addFetchTime(doc, parse.getData(), datum);
        return doc;
    }

    /** Adds the fetch time stored in the crawl datum as a Date value. */
    private NutchDocument addFetchTime(NutchDocument doc, ParseData data,
            CrawlDatum datum) {
        long fetchTime = datum.getFetchTime();
        doc.add("fetchTime", new Date(fetchTime));
        return doc;
    }

    /** Adds the string form of the parse metadata as a single field. */
    private NutchDocument addMetaData(NutchDocument doc, ParseData data,
            CrawlDatum datum) {
        String metadata = data.getParseMeta().toString();
        doc.add("metadata", metadata);
        return doc;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return this.conf;
    }
}
           

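The plugin code is now written. Package it as a jar and put it into an index-metadata directory under the plugins directory; you have to create the index-metadata directory yourself. Then write a matching plugin.xml descriptor so that Nutch's plugin system can load the module dynamically. The plugin.xml looks like this: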

<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<plugin
   id="index-metadata"
   name="Metadata Indexing Filter"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="index-metadata.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.indexer.more"
              name="Nutch MetaData Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="MetaDataIndexingFilter"
                      class="org.apache.nutch.indexer.metadata.MetaDataIndexingFilter"/>
   </extension>
</plugin>
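Before deploying, it can help to confirm that the descriptor is well-formed and names the intended jar, extension point and class. The following is a rough stand-alone check using the JDK's DOM parser; PluginXmlCheck and describe are made-up names, and this is not how Nutch itself reads plugin.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Stand-alone sanity check; not part of Nutch's plugin loader. */
class PluginXmlCheck {
    // Trimmed copy of the descriptor above (license comment omitted).
    static final String PLUGIN_XML =
        "<plugin id=\"index-metadata\" name=\"Metadata Indexing Filter\""
      + " version=\"1.0.0\" provider-name=\"nutch.org\">"
      + "<runtime><library name=\"index-metadata.jar\">"
      + "<export name=\"*\"/></library></runtime>"
      + "<requires><import plugin=\"nutch-extensionpoints\"/></requires>"
      + "<extension id=\"org.apache.nutch.indexer.more\""
      + " name=\"Nutch MetaData Indexing Filter\""
      + " point=\"org.apache.nutch.indexer.IndexingFilter\">"
      + "<implementation id=\"MetaDataIndexingFilter\""
      + " class=\"org.apache.nutch.indexer.metadata.MetaDataIndexingFilter\"/>"
      + "</extension></plugin>";

    /** Returns {plugin id, jar name, extension point, implementation class}. */
    static String[] describe(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element plugin = dom.getDocumentElement();
            Element library = (Element) plugin.getElementsByTagName("library").item(0);
            Element extension = (Element) plugin.getElementsByTagName("extension").item(0);
            Element impl = (Element) extension.getElementsByTagName("implementation").item(0);
            return new String[] {
                plugin.getAttribute("id"),
                library.getAttribute("name"),
                extension.getAttribute("point"),
                impl.getAttribute("class")
            };
        } catch (Exception e) {
            // Malformed XML or a missing element ends up here.
            throw new IllegalStateException("plugin.xml could not be read", e);
        }
    }

    public static void main(String[] args) {
        for (String value : describe(PLUGIN_XML)) {
            System.out.println(value);
        }
    }
}
```

The class attribute has to match the package and class name of the filter exactly, and the library name has to match the jar placed in the plugin directory.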
           

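The custom indexing filter itself is now complete. Keep in mind that Nutch only loads plugins whose id matches the plugin.includes property, so index-metadata has to be covered by that property (for example by overriding plugin.includes in conf/nutch-site.xml).

Two more files need to be configured. The first is schema.xml: add the following inside its fields element: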

<!-- metadata fields -->
 <field name="fetchTime" type="date" stored="true" indexed="true"/>
 <field name="metadata" type="string" stored="true" indexed="true"/>
           

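Notes:

stored means that the value of the field is stored in the Lucene index, so it can be returned in search results.

indexed means that the field is indexed and can therefore be searched. Whether the value is tokenized depends on the field type; a string field is indexed as a single token.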
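The fetchTime field is declared with type date, while the filter builds its value from the epoch milliseconds returned by getFetchTime(). As a small illustration of how such a millisecond value relates to the ISO-8601 UTC text that a date field shows in a Solr response, here is a stand-alone sketch; FetchTimeDemo is a made-up name and the code does not reproduce the conversion Nutch or Solr perform internally.

```java
import java.time.Instant;
import java.util.Date;

/** Illustration only; not the conversion code used by Nutch or Solr. */
class FetchTimeDemo {
    /** Formats epoch milliseconds as ISO-8601 text in UTC. */
    static String toIsoUtc(long epochMillis) {
        return new Date(epochMillis).toInstant().toString();
    }

    /** Parses ISO-8601 UTC text back into epoch milliseconds. */
    static long toEpochMillis(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    public static void main(String[] args) {
        long millis = toEpochMillis("2012-04-11T03:19:33.088Z");
        System.out.println(millis);
        System.out.println(toIsoUtc(millis));
    }
}
```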

The second file is solrindex-mapping.xml. It maps the field names produced by the indexing filters to the field names defined in schema.xml. Add the following inside its fields element:

<field dest="fetchTime" source="fetchTime"/>
 <field dest="metadata" source="metadata"/>
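Conceptually, the mapping is a lookup from a source field name (as produced by an indexing filter) to a destination field name (as defined in schema.xml). The sketch below models that idea with a plain map. FieldMappingDemo and its methods are hypothetical names, and passing unmapped names through unchanged is an assumption made here, not something the text above states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only; the names here are hypothetical, not Nutch classes. */
class FieldMappingDemo {
    /** source -> dest pairs, mirroring the two field entries above. */
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("fetchTime", "fetchTime");
        MAPPING.put("metadata", "metadata");
    }

    /** Assumption: a source name without a mapping is passed through unchanged. */
    static String destFor(String source) {
        return MAPPING.getOrDefault(source, source);
    }

    /** Renames every field of a document according to the mapping. */
    static Map<String, Object> toSolrFields(Map<String, Object> nutchFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : nutchFields.entrySet()) {
            out.put(destFor(e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("metadata", "OriginalCharEncoding=utf-8");
        doc.put("title", "Welcome to Apache Nutch");
        System.out.println(toSolrFields(doc));
    }
}
```

Here source and dest happen to be identical, but the same mechanism lets a filter's field be stored under a different name in Solr.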
           

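With that, the custom indexing filter plugin is finished. Remember that the schema.xml in question is the one in the solr/conf directory, and that Solr has to be restarted after it is changed. I do not know whether Solr can pick up configuration changes without a restart.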

4. Building the index

Rebuild the index with the same command as before:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
           

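The index should now be built. Incidentally, Solr keeps its index files in solr/data/index. You can open them with the Luke tool to inspect the index metadata, and at this point you should see the two fields fetchTime and metadata.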

[Figure: the index opened in Luke, showing the fetchTime and metadata fields]

5. Querying

Now open a browser, go to http://localhost:8983/solr/admin/ and enter a query. The result looks roughly like this:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">a</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<float name="boost">1.1090536</float>
<str name="digest">da3aefc69d8a5a7c1ea5447f9680d66d</str>
<date name="fetchTime">2012-04-11T03:19:33.088Z</date>
<str name="id">http://nutch.apache.org/</str>
<str name="metadata">
CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
</str>
<str name="segment">20120410231836</str>
<str name="title">Welcome to Apache Nutch®</str>
<date name="tstamp">2012-04-11T03:19:33.088Z</date>
<str name="url">http://nutch.apache.org/</str>
</doc>
</result>
</response>
           

The values of the two fields we defined, fetchTime and metadata, appear in the result.
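The same kind of query can also be sent straight to Solr's select handler instead of going through the admin page. The helper below only assembles such a URL; SolrQueryUrl and selectUrl are made-up names, and the query title:nutch and the field list are example values rather than anything taken from the run above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a query URL for Solr's select handler; the helper names are hypothetical. */
class SolrQueryUrl {
    /**
     * @param base   Solr base URL ending in a slash, e.g. http://localhost:8983/solr/
     * @param query  value of the q parameter
     * @param fields comma-separated field list for the fl parameter
     * @param rows   maximum number of documents to return
     */
    static String selectUrl(String base, String query, String fields, int rows) {
        try {
            return base + "select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=" + URLEncoder.encode(fields, "UTF-8")
                + "&rows=" + rows
                + "&indent=on";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr/",
            "title:nutch", "id,fetchTime,metadata", 10));
    }
}
```

Restricting fl to id, fetchTime and metadata makes it easy to see whether the two custom fields were actually stored.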

6. References

http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search