Installing Nutch 0.9 on Windows

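I. Environment:

1. Operating system: Windows XP, or Windows 2000 and later

2. Java 1.6, with JAVA_HOME set as an environment variable

3. Cygwin. It is not strictly required, but the scripts that ship with Nutch only run in a shell environment, so Cygwin is used to provide the shell commands.

4. Nutch version: 0.9

5. Tomcat: 6.0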

II. Installing and configuring Nutch:

1. Install Cygwin 1.5.5 (I installed it to F:\cygSys), then unpack Nutch into a directory under cygSys\home\<user name> (mine is F:\cygSys\home\dyk\nutch), as shown:

2007112001.jpg

2. In Cygwin, change to the nutch-0.9 directory and run bin/nutch as a test. When everything is set up correctly, the output looks like this:

2007112002.jpg

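3. Run a test crawl, using http://www.163.com/ as the example.

1) Create a file named myurl that contains the line http://www.163.com/ and save it. The file can live anywhere (mine is F:\cygSys\home\dyk\nutch\myurl). Also create a directory named logs for the crawler logs (mine is F:\cygSys\home\dyk\nutch\logs).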
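From the Cygwin shell, the seed file and the log directory can be created with two commands. The following is a minimal sketch: make_seed is a hypothetical helper name (not part of Nutch), and the demonstration runs in a scratch directory from mktemp so that it does not touch a real installation.

```shell
# Sketch: create the seed file and the log directory for a crawl.
# make_seed is a hypothetical helper, not something Nutch provides.
make_seed() {
    base_dir="$1"    # directory that will hold myurl and logs
    seed_url="$2"    # the single URL the crawl starts from
    mkdir -p "$base_dir/logs"
    printf '%s\n' "$seed_url" > "$base_dir/myurl"
}

# Demonstrate in a throwaway directory rather than the real ~/nutch.
work_dir="$(mktemp -d)"
make_seed "$work_dir" "http://www.163.com/"
cat "$work_dir/myurl"
```

For the layout used in this post, the call would be make_seed ~/nutch "http://www.163.com/", assuming ~ is the Cygwin home directory under F:\cygSys\home.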

2) Open nutch-0.9\conf\nutch-site.xml and insert the following inside <configuration></configuration>:

<property>
  <name>http.agent.name</name>
  <value>XXX</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>XXX</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>XXX</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>XXX</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

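Replace the XXX inside each <value>XXX</value> with text of your own. The exact wording matters little, as long as http.agent.name is not empty. These settings exist because Nutch follows the robots protocol: when it requests a page, it sends this information about itself to the site being crawled so that the site can identify the crawler.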
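For reference, a complete entry wraps a name and a value (plus an optional description) in a property element. The sketch below prints such entries from the shell. agent_property is a hypothetical helper, and the values mybot and the example.com URL are placeholders made up for illustration, not settings from the original post.

```shell
# Sketch: print one <property> entry for nutch-site.xml.
# agent_property is a hypothetical helper; the values passed to it
# below are illustrative placeholders only. Values are not XML-escaped,
# so avoid characters such as & and < in them.
agent_property() {
    prop_name="$1"
    prop_value="$2"
    cat <<EOF
<property>
  <name>$prop_name</name>
  <value>$prop_value</value>
</property>
EOF
}

agent_property "http.agent.name" "mybot"
agent_property "http.agent.url" "http://www.example.com/bot.html"
```

The printed entries still have to be pasted between the configuration tags of nutch-site.xml by hand.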

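3) Open nutch-0.9\conf\crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the domain from myurl. I changed the line to "+^http://([a-z0-9]*\.)*163.com/". A simpler option is to delete MY.DOMAIN.NAME together with the slash that follows it, so that only +^http://([a-z0-9]*\.)* remains. That pattern accepts every http site for crawling.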
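To see what the filter line accepts, the pattern can be tried against a few URLs before crawling. The sketch below uses grep -E as a stand-in for the regular expression matching that Nutch performs. For a pattern this simple the two should behave the same, but that equivalence is an assumption, and url_allowed is a hypothetical helper name.

```shell
# Sketch: try the crawl-urlfilter.txt pattern against sample URLs.
# The leading "+" in the filter file means "accept" and is not part
# of the regular expression, so it is left out here.
pattern='^http://([a-z0-9]*\.)*163.com/'

# url_allowed is a hypothetical helper: succeeds if the URL matches.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in "http://www.163.com/" \
           "http://news.163.com/index.html" \
           "http://www.example.com/"; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Because the dot in 163.com is not escaped, it matches any single character. Writing 163\.com in the filter line is stricter.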

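4) Run the crawler. In Cygwin, enter:

bin/nutch crawl ../myurl -dir ../mydir -depth 2 >&../logs/crawl1.log

Here -dir names the directory where the crawl output is stored, -depth sets how many levels of links to follow from the seed URL, and the redirection at the end names the log file.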
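In bash, >&file sends both standard output and standard error to the file, which is why the whole trace of the run ends up in crawl1.log. Here is a small sketch of the same redirection, written in the portable >file 2>&1 form. noisy is a made-up stand-in for the crawler command.

```shell
# Sketch: capture stdout and stderr in one log file.
# noisy is a made-up stand-in for a long-running command.
noisy() {
    echo "progress message on stdout"
    echo "warning message on stderr" >&2
}

log_file="$(mktemp)"
# Equivalent to the bash shorthand:  noisy >& "$log_file"
noisy > "$log_file" 2>&1
cat "$log_file"
```

Both lines land in the log file, and nothing is printed to the terminal while the command runs.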

2007112003.jpg

When the run finishes, open the log file to see the details of the crawl.
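Besides reading the log, the output directory itself can be checked before it is handed to the search web application. The sketch below assumes that a completed bin/nutch crawl run in this version leaves the subdirectories crawldb, linkdb, segments, indexes and index. Treat that list as an assumption and adjust it if your output differs. check_crawl_dir is a hypothetical helper.

```shell
# Sketch: check that a crawl output directory looks complete.
# check_crawl_dir is a hypothetical helper; the subdirectory names are
# an assumption about what "bin/nutch crawl" produces in this version.
check_crawl_dir() {
    crawl_dir="$1"
    status=0
    for sub in crawldb linkdb segments indexes index; do
        if [ -d "$crawl_dir/$sub" ]; then
            echo "ok       $sub"
        else
            echo "missing  $sub"
            status=1
        fi
    done
    return "$status"
}

# Usage (from the nutch-0.9 directory, after the crawl has finished):
#   check_crawl_dir ../mydir
```

A non-zero return status means at least one expected subdirectory is absent, which usually points to a crawl that stopped early.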

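5. Running Nutch on Tomcat

Copy nutch-0.9.war into Tomcat\webapps\.

Open http://localhost:8080/nutch-0.9/ in a browser. This step makes Tomcat unpack nutch-0.9.war. Then edit webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml so that searcher.dir points at the directory the crawl wrote to (the -dir argument):

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>F:\\cygSys\\home\\dyk\\nutch\\mydir4</value>
  </property>
</configuration>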

To support searching in Chinese, edit Tomcat\conf\server.xml. Find the Connector element for port 8080 and change it to:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>

Open http://localhost:8080/nutch-0.9 in a browser and search for "nba" to see the results.

This article is reposted from the cnblogs blog of Phinecos (洞庭散人). Original link: http://www.cnblogs.com/phinecos/archive/2007/11/20/965835.html. To repost it, please contact the original author.
