Nutch 1.3 Study Notes: Plugin Extension 10-2
---------------------------------
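1. Writing a simple plugin of your own
This section writes a Nutch URLFilter plugin called MyURLFilter.
1.1 Create a package
First create a package structure similar to that of urlfilter-regex,
for example org.apache.nutch.urlfilter.my
1.2 Create the extension source file in the package
Then create a file named MyURLFilter.java with the following content: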
package org.apache.nutch.urlfilter.my;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Implements Nutch's URLFilter extension point.
public class MyURLFilter implements URLFilter {

    private Configuration conf;

    public MyURLFilter() {
    }

    // Filters a URL string. This demo only prepends a label so that the
    // plugin's output is easy to recognise; a filter used in a real crawl
    // returns the URL to accept it, or null to reject it.
    @Override
    public String filter(String urlString) {
        return "My Filter:" + urlString;
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    // Reads URLs from standard input, one per line, and prints the filter result.
    public static void main(String[] args) throws IOException {
        MyURLFilter filter = new MyURLFilter();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String out = filter.filter(line);
            if (out != null) {
                System.out.println(out);
            }
        }
    }
}
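The demo filter labels every URL it receives. As a rough sketch of what a filter that actually accepts or rejects URLs might look like, the example below follows the URLFilter convention of returning the URL to keep it and null to drop it. So that it runs without the Nutch jars, it uses a stand-in interface, UrlFilterLike, in place of org.apache.nutch.net.URLFilter; that interface, the class names and the example.com prefix are all assumptions made for illustration.

```java
class PrefixOnlyFilterSketch {

    // Stand-in for org.apache.nutch.net.URLFilter (hypothetical name), so the
    // sketch compiles without Nutch on the classpath.
    interface UrlFilterLike {
        String filter(String urlString);
    }

    // Accepts only URLs that start with the configured prefix.
    static class PrefixOnlyFilter implements UrlFilterLike {
        private final String allowedPrefix;

        PrefixOnlyFilter(String allowedPrefix) {
            this.allowedPrefix = allowedPrefix;
        }

        @Override
        public String filter(String urlString) {
            if (urlString == null) {
                return null;
            }
            // Return the URL unchanged to accept it, null to reject it.
            return urlString.startsWith(allowedPrefix) ? urlString : null;
        }
    }

    public static void main(String[] args) {
        UrlFilterLike f = new PrefixOnlyFilter("http://example.com/");
        System.out.println(f.filter("http://example.com/index.html")); // accepted, printed unchanged
        System.out.println(f.filter("http://other.example.org/"));     // rejected, prints null
    }
}
```

To turn this into a real plugin, the class would implement URLFilter itself and keep the getConf/setConf methods shown in MyURLFilter.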
1.3 Package the jar and write the plugin.xml descriptor
You can build the jar with ivy or with Eclipse. Every plugin has a descriptor file, plugin.xml, with the following content:
<plugin
   id="urlfilter-my"
   name="My URL Filter"
   version="1.0.0"
   provider-name="nutch.org">
   <runtime>
      <library name="urlfilter-my.jar">
         <export name="*"/>
      </library>
      <!-- If your plugin depends on third-party libraries, list them like this:
      <library name="fontbox-1.4.0.jar"/>
      <library name="geronimo-stax-api_1.0_spec-1.jar"/>
      -->
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.apache.nutch.net.urlfilter.my"
              name="Nutch My URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="MyURLFilter"
                      class="org.apache.nutch.urlfilter.my.MyURLFilter"/>
      <!-- by default, attribute "file" is undefined, to keep classic behavior.
      <implementation id="PrefixURLFilter"
                      class="org.apache.nutch.net.PrefixURLFilter">
         <parameter name="file" value="urlfilter-prefix.txt"/>
      </implementation>
      -->
   </extension>
</plugin>
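The class attribute on the implementation element has to be the fully qualified name of the class packaged in the jar, which for the filter from section 1.2 is org.apache.nutch.urlfilter.my.MyURLFilter. A mismatch is easy to miss when the descriptor is copied from another plugin. The following is a small sketch, not part of Nutch, that lists the class names a descriptor declares so they can be compared by eye; the class name PluginXmlCheck and the inline XML are assumptions made for illustration, and only the JDK's built-in XML parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PluginXmlCheck {

    // Returns the class attribute of every <implementation> element in the
    // descriptor. Elements inside XML comments are not part of the DOM tree
    // and are therefore skipped.
    static List<String> implementationClasses(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("implementation");
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        // A trimmed-down descriptor; in practice you would read plugin.xml from disk.
        String xml = "<plugin id=\"urlfilter-my\">"
                + "<extension point=\"org.apache.nutch.net.URLFilter\">"
                + "<implementation id=\"MyURLFilter\""
                + " class=\"org.apache.nutch.urlfilter.my.MyURLFilter\"/>"
                + "</extension></plugin>";
        for (String cls : implementationClasses(xml)) {
            System.out.println(cls);
        }
    }
}
```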
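1.4 Put the jar and the descriptor into the plugins directory
Finally, put the built jar and plugin.xml into a folder named urlfilter-my, then copy that folder into Nutch's plugins directory.
2. Testing with bin/nutch plugin
Before running the bin/nutch plugin command, edit the nutch-site.xml configuration file and add the new plugin to plugin.includes, as follows: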
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
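Because plugin.includes is a regular expression over plugin names, adding my to the urlfilter-(...) group is what lets urlfilter-my through. The sketch below checks a plugin id against that value with java.util.regex; it assumes Nutch applies the expression as a whole-string match, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {

    // True when the whole plugin id matches the plugin.includes expression.
    static boolean included(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
                + "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)";
        System.out.println(included(includes, "urlfilter-my"));     // true
        System.out.println(included(includes, "urlfilter-domain")); // false
    }
}
```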
The test output on my machine looks like this:
~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch plugin urlfilter-my org.apache.nutch.urlfilter.my.MyURLFilter
urlString1
My Filter:urlString1
urlString2
My Filter:urlString2
3. Summary
This is only a simple plugin; you can of course write more complex ones to suit your own needs.
4. References
http://wiki.apache.org/nutch/WritingPluginExample#The_Example