
A Best-Practice Guide to Installing and Using Nutch and Related Frameworks

1. nutch1.2

2. nutch1.5.1

3. nutch2.0

4. Configuring SSH

5. Installing a Hadoop cluster (pseudo-distributed mode) and running Nutch

6. Installing a Hadoop cluster (fully distributed mode) and running Nutch

7. Configuring Ganglia to monitor the Hadoop and HBase clusters

8. Configuring Snappy compression for Hadoop

9. Configuring LZO compression for Hadoop

10. Configuring a ZooKeeper cluster to run HBase

11. Configuring an HBase cluster to run nutch-2.1 (region servers can crash because of memory problems)

12. Configuring an Accumulo cluster to run nutch-2.1 (Gora has a bug here)

13. Configuring a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)

14. Configuring a standalone MySQL server to run nutch-2.1

15. Using DataFileAvroStore as the nutch2.1 data store

16. Using AvroStore as the nutch2.1 data store

17. Configuring SOLR

18. Nagios monitoring

19. Configuring Splunk

20. Configuring Pig

21. Configuring Hive

22. Configuring a Hadoop 2.x cluster

1. nutch1.2

 The steps are much the same as in Part 2; in step 5 (configuring the build path) two extra operations are needed: in the left-hand Package Explorer, right-click the nutch1.2 folder > Build Path > Configure Build Path... > select the Source tab > change Default output folder from nutch1.2/bin to nutch1.2/_bin; then, in the Package Explorer, right-click the bin folder under the nutch1.2 folder > Team > Revert

 Relative to Part 2, the parts on a yellow background are version-number differences, the parts in red do not exist in version 1.2, and the parts in green differ, as follows:

 1、Add JARs... > nutch1.2 > lib, select all the .jar files > OK

 2、crawl-urlfilter.txt

 3、Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt

 4、Edit crawl-urlfilter.txt, adjusting the following lines (MY.DOMAIN.NAME is a placeholder for the domain you want to crawl; see the sketch below):

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else

-.
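 For example, to restrict the crawl to news.163.com (the same site used later in this guide; the exact domain is only an illustration, substitute your own), the accept/skip rules would read:

+^http://([a-z0-9]*\.)*news.163.com/

-.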

 5、cd /home/ysc/workspace/nutch1.2

 nutch1.2 is a complete search engine, whereas nutch1.5.1 is only a crawler. nutch1.2 can either submit its index to SOLR or generate a Lucene index directly; nutch1.5.1 can only submit its index to SOLR:

 1、cd /home/ysc

 2、wget  http://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-7/v7.0.29/bin/apache-tomcat-7.0.29.tar.gz

 3、tar -xvf apache-tomcat-7.0.29.tar.gz

 4、In the left-hand Package Explorer, right-click the build.xml file under the nutch1.2 folder > Run As > Ant Build... > select the war target > Run

 5、cd /home/ysc/workspace/nutch1.2/build

 6、unzip nutch-1.2.war -d nutch-1.2

 7、cp -r nutch-1.2 /home/ysc/apache-tomcat-7.0.29/webapps

 8、vi /home/ysc/apache-tomcat-7.0.29/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml

 Add the following configuration:

 <property>

  <name>searcher.dir</name>

  <value>/home/ysc/workspace/nutch1.2/data</value>

  <description>

  Path to root of crawl.  This directory is searched (in

  order) for either the file search-servers.txt, containing a list of

  distributed search servers, or the directory "index" containing

  merged indexes, or the directory "segments" containing segment

  indexes.

  </description>

</property>

9、vi /home/ysc/apache-tomcat-7.0.29/conf/server.xml

<Connector port="8080" protocol="HTTP/1.1"

               connectionTimeout="20000"

               redirectPort="8443"/>

Change it to

<Connector port="8080" protocol="HTTP/1.1"

               connectionTimeout="20000"

               redirectPort="8443" URIEncoding="utf-8"/>

10、cd /home/ysc/apache-tomcat-7.0.29/bin

11、./startup.sh

12、Visit: http://localhost:8080/nutch-1.2/

 For more nutch1.2 bug fixes and related material, see the resources I have published on CSDN: http://download.csdn.net/user/yangshangchuan

2. nutch1.5.1

1、Download and unpack Eclipse (the IDE)

 Download from: http://www.eclipse.org/downloads/ and choose Eclipse IDE for Java EE Developers

2、Install the Subclipse plug-in (an SVN client)

 Update site: http://subclipse.tigris.org/update_1.8.x

3、Install the IvyDE plug-in (downloads the dependency JARs)

 Update site: http://www.apache.org/dist/ant/ivyde/updatesite/

4、Check out the code

 File > New > Project > SVN > Checkout Projects from SVN

 Create a new repository location > URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.5.1/ > select the URL > Finish

 In the New Project wizard that pops up, choose Java Project > Next, enter Project name: nutch1.5.1 > Finish

5、Configure the build path

 In the left-hand Package Explorer, right-click the nutch1.5.1 folder > Build Path > Configure Build Path...

> select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources (for the plugins, also select the src/java and src/test folders under each plugin directory in src/plugin) > OK

 Switch to the Libraries tab >

 Add Class Folder... > select nutch1.5.1/conf > OK

 Add JARs... > select the jar files in the lib directory of each plugin directory under src/plugin > OK

 Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish

 Switch to the Order and Export tab >

 Select conf > Top

6、Run ANT

 In the left-hand Package Explorer, right-click the build.xml file under the nutch1.5.1 folder > Run As > Ant Build

 In the left-hand Package Explorer, right-click the nutch1.5.1 folder > Refresh

 In the left-hand Package Explorer, right-click the nutch1.5.1 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build > OK

7、Edit the configuration files nutch-site.xml and regex-urlfilter.txt

 Rename nutch-site.xml.template to nutch-site.xml

 Rename regex-urlfilter.txt.template to regex-urlfilter.txt

 In the left-hand Package Explorer, right-click the nutch1.5.1 folder > Refresh

 Add the following configuration items to nutch-site.xml:

<property>

  <name>http.agent.name</name>

  <value>nutch</value>

</property>

<property>

  <name>http.content.limit</name>

  <value>-1</value>

</property>

 Edit regex-urlfilter.txt, replacing

# accept anything else 

+.

 with:

+^http://([a-z0-9]*\.)*news.163.com/ 

-.

8、Development and debugging

 In the left-hand Package Explorer, right-click the nutch1.5.1 folder > New > Folder > Folder name: urls

 In the newly created urls directory, create a text file named url whose content is: http://news.163.com

 Open the org.apache.nutch.crawl.Crawl.java class under src/java, right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: urls -dir data -depth 3 > Run

 Set breakpoints wherever you need to debug, then Debug As > Java Application
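 For reference, the same crawl can also be run outside Eclipse from the local runtime produced by the ant build (a rough equivalent of the run configuration above, assuming a urls directory has also been created under runtime/local):

 cd /home/ysc/workspace/nutch1.5.1/runtime/local

 bin/nutch crawl urls -dir data -depth 3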

9、Inspect the results

 Inspect the segments directory:

 Open the org.apache.nutch.segment.SegmentReader.java class under src/java

 Right-click Run As > Java Application; the console prints the command's usage

 Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: -dump data/segments/*  data/segments/dump

 Open the file data/segments/dump/dump in a text editor to see the information stored in the segments

 Inspect the crawldb directory:

 Open the org.apache.nutch.crawl.CrawlDbReader.java class under src/java

 Right-click Run As > Java Application; the console prints the command's usage

 Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: data/crawldb -stats

 The console prints the crawldb statistics

 Inspect the linkdb directory:

 Open the org.apache.nutch.crawl.LinkDbReader.java class under src/java

 Right-click Run As > Java Application; the console prints the command's usage

 Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: data/linkdb -dump data/linkdb_dump

 Open the file data/linkdb_dump/part-00000 in a text editor to see the information stored in the linkdb

10、Whole-web crawling, step by step

 In the left-hand Package Explorer, right-click the build.xml file under the nutch1.5.1 folder > Run As > Ant Build

 cd  /home/ysc/workspace/nutch1.5.1/runtime/local

 #Prepare the URL list

 wget  http://rdf.dmoz.org/rdf/content.rdf.u8.gz

 gunzip content.rdf.u8.gz

 mkdir dmoz

 bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/url

 #Inject the URLs

 bin/nutch inject crawl/crawldb dmoz

 #Generate a fetch list

 bin/nutch generate crawl/crawldb crawl/segments

 #First round of fetching

 s1=`ls -d crawl/segments/2* | tail -1`

 echo $s1

 #Fetch the pages

 bin/nutch fetch $s1

 #Parse the pages

 bin/nutch parse $s1

 #Update the URL statuses

 bin/nutch updatedb crawl/crawldb $s1

 #Second round of fetching

 bin/nutch generate crawl/crawldb crawl/segments -topN 1000

 s2=`ls -d crawl/segments/2* | tail -1`

 echo $s2

 bin/nutch fetch $s2

 bin/nutch parse $s2

 bin/nutch updatedb crawl/crawldb $s2

 #Third round of fetching

 bin/nutch generate crawl/crawldb crawl/segments -topN 1000

 s3=`ls -d crawl/segments/2* | tail -1`

 echo $s3

 bin/nutch fetch $s3

 bin/nutch parse $s3

 bin/nutch updatedb crawl/crawldb $s3

 #Build the inverted link database

 bin/nutch invertlinks crawl/linkdb -dir crawl/segments

11、Indexing and searching

 cd  /home/ysc/ 

 wget  http://mirror.bjtu.edu.cn/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz

 tar -xvf apache-solr-3.6.1.tgz

 cd apache-solr-3.6.1/example

 NUTCH_RUNTIME_HOME=/home/ysc/workspace/nutch1.5.1/runtime/local

 APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1

 cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

 If you want the page content stored in the index, change the following line in schema.xml

 <field name="content" type="text" stored="false" indexed="true"/>

 to

 <field name="content" type="text" stored="true" indexed="true"/>

 Edit ${APACHE_SOLR_HOME}/example/solr/conf/solrconfig.xml and replace every <str name="df">text</str> with <str name="df">content</str>

 In ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml, change <schema name="nutch" version="1.5.1"> to <schema name="nutch" version="1.5">

 #Start the SOLR server

 java -jar start.jar

  http://127.0.0.1:8983/solr/admin/

  http://127.0.0.1:8983/solr/admin/stats.jsp

 cd  /home/ysc/workspace/nutch1.5.1/runtime/local

 #Submit the index

 bin/nutch solrindex  http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

 Run a complete crawl:

 bin/nutch crawl urls -dir data -depth 2 -topN 100 -solr  http://127.0.0.1:8983/solr/

 Use the following URL to page through all indexed documents:

  http://127.0.0.1:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on

 Documents whose title contains "网易" (NetEase):

  http://127.0.0.1:8983/solr/select/?q=title%3A%E7%BD%91%E6%98%93&version=2.2&start=0&rows=10&indent=on

12、Inspect the index information

 cd  /home/ysc/

 wget  http://luke.googlecode.com/files/lukeall-3.5.0.jar

 java -jar lukeall-3.5.0.jar 

 Path: /home/ysc/apache-solr-3.6.1/example/solr/data

13、Configure Chinese word segmentation for SOLR

 cd  /home/ysc/

 wget  http://mmseg4j.googlecode.com/files/mmseg4j-1.8.5.zip

 unzip mmseg4j-1.8.5.zip -d  mmseg4j-1.8.5

 APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1

 mkdir $APACHE_SOLR_HOME/example/solr/lib

 mkdir $APACHE_SOLR_HOME/example/solr/dic

 cp mmseg4j-1.8.5/mmseg4j-all-1.8.5.jar $APACHE_SOLR_HOME/example/solr/lib

 cp mmseg4j-1.8.5/data/*.dic $APACHE_SOLR_HOME/example/solr/dic

 将${APACHE_SOLR_HOME}/example/solr/conf/schema.xml檔案中的

 <tokenizer class="solr.WhitespaceTokenizerFactory"/>

 和

 <tokenizer class="solr.StandardTokenizerFactory"/>

 替換為

 <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/apache-solr-3.6.1/example/solr/dic"/>

 #重新啟動SOLR伺服器

 java -jar start.jar  #重建索引,示範在開發環境中如何操作

 Open the org.apache.nutch.indexer.solr.SolrIndexer.java class under src/java

 Right-click Run As > Java Application; the console prints the command's usage

 Right-click Run As > Run Configurations > Arguments > in the Program arguments box enter: http://127.0.0.1:8983/solr/ data/crawldb -linkdb data/linkdb data/segments/*

 Reopen the index with Luke and you will see that the Chinese segmentation has taken effect

3. nutch2.0

 The steps for nutch2.0 are the same as those for nutch1.5.1 in Part 2, but the following configuration is needed before step 8 (development and debugging):

 In the left-hand Package Explorer, right-click the nutch2.0 folder > New > Folder > Folder name: data, then choose one of the following data stores:

 1、Use MySQL as the data store

  1)、Add the following configuration to nutch2.0/conf/nutch-site.xml:

 <property>

  <name>storage.data.store.class</name>

  <value>org.apache.gora.sql.store.SqlStore</value>

</property>

  2)、将nutch2.0/conf/gora.properties檔案中的  

  gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver

gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest

gora.sqlstore.jdbc.user=sa

gora.sqlstore.jdbc.password=

  to

  gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver

gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch2

gora.sqlstore.jdbc.user=root

gora.sqlstore.jdbc.password=ROOT

  3)、Enable the mysql-connector-java dependency in nutch2.0/ivy/ivy.xml

  4)、sudo apt-get install mysql-server

 2、Use HBase as the data store

  1)、Add the following configuration to nutch2.0/conf/nutch-site.xml:

 <property>

  <name>storage.data.store.class</name>

  <value>org.apache.gora.hbase.store.HBaseStore</value>

</property>

  2)、Enable the gora-hbase dependency in nutch2.0/ivy/ivy.xml

  3)、cd /home/ysc

  4)、wget  http://mirror.bit.edu.cn/apache/hbase/hbase-0.90.5/hbase-0.90.5.tar.gz

  5)、tar -xvf hbase-0.90.5.tar.gz

  6)、vi  hbase-0.90.5/conf/hbase-site.xml

   Add the following configuration:

  <property>

    <name>hbase.rootdir</name>

    <value>file:///home/ysc/hbase-0.90.5-database</value>

  </property>

7)、hbase-0.90.5/bin/start-hbase.sh

8)、将/home/ysc/hbase-0.90.5/hbase-0.90.5.jar加入開發環境eclipse的build path 四、配置SSH

 There are three machines, devcluster01, devcluster02 and devcluster03; perform the following steps on each of them:

 1、sudo vi /etc/hosts

 Add the following entries:

 192.168.1.1 devcluster01

 192.168.1.2 devcluster02

 192.168.1.3 devcluster03

 2、Install the SSH service:

  sudo apt-get install openssh-server

 3、(press Enter to accept each prompt)

  ssh-keygen -t rsa

  This command creates the .ssh directory under the user's home directory and two files inside it: id_rsa, the private key, generated with the RSA algorithm, which must be kept safe and never disclosed; and id_rsa.pub, the public key, which pairs with id_rsa and may be shared freely.
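  A quick way to confirm that the key pair was created (a simple check, not part of the original steps):

  ls -l ~/.ssh   # should list id_rsa (private key) and id_rsa.pub (public key)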

 4、cp .ssh/id_rsa.pub .ssh/authorized_keys

 Copy out the contents of /home/ysc/.ssh/authorized_keys from the three machines devcluster01, devcluster02 and devcluster03, merge them into one file, and use it to replace /home/ysc/.ssh/authorized_keys on every machine.

 When run on devcluster01, the two commands below target hosts 02 and 03

 When run on devcluster02, the two commands below target hosts 01 and 03

 When run on devcluster03, the two commands below target hosts 01 and 02

 5、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster02

 6、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster03
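 If ssh-copy-id is not available, a rough manual equivalent (a sketch only, assuming user ysc and the default key path) is:

 cat ~/.ssh/id_rsa.pub | ssh ysc@devcluster02 'cat >> ~/.ssh/authorized_keys'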

 以上兩條指令實際上是将 .ssh/id_rsa.pub 公鑰檔案追加到遠端主機 server 的 user 主目錄下的 .ssh/authorized_keys 檔案中。 五、安裝Hadoop Cluster(僞分布式運作模式)并運作Nutch

 The steps are much the same as for the fully distributed setup (Part 6); only one machine, devcluster01, is needed, so every highlighted hostname is set to devcluster01, and step 11 is not required.

6. Installing a Hadoop cluster (fully distributed mode) and running Nutch

 Three machines: devcluster01, devcluster02, devcluster03 (set the names via vi /etc/hostname)

 Log in to devcluster01 as user ysc:

 1、cd /home/ysc

 2、wget  http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1-bin.tar.gz

 3、tar -xvf hadoop-1.1.1-bin.tar.gz

 4、cd  hadoop-1.1.1

 5、vi conf/masters

  Replace its contents with:

  devcluster01

 6、vi conf/slaves

  Replace its contents with:

  devcluster02

  devcluster03

 7、vi conf/core-site.xml

  Add the configuration:

  <property>

    <name>fs.default.name</name>

    <value>hdfs://devcluster01:9000</value>

    <description>

       Where to find the Hadoop Filesystem through the network. 

       Note 9000 is not the default port.

       (This is slightly changed from previous versions which didnt have "hdfs")

    </description>

  </property>

    <property> 

     <name>hadoop.security.authorization</name> 

      <value>true</value> 

    </property>

Edit conf/hadoop-policy.xml accordingly (it defines the service-level ACLs used when hadoop.security.authorization is enabled)

 8、vi conf/hdfs-site.xml

  Add the configuration:

<property>

  <name>dfs.name.dir</name>

  <value>/home/ysc/dfs/filesystem/name</value>

</property> <property>

  <name>dfs.data.dir</name>

  <value>/home/ysc/dfs/filesystem/data</value>

</property> <property>

  <name>dfs.replication</name>

  <value>1</value>

</property>  <property>

  <name>dfs.block.size</name>

  <value>671088640</value>

  <description>The default block size for new files.</description>

</property>

 9、vi conf/mapred-site.xml

  Add the configuration:

<property>

  <name>mapred.job.tracker</name>

  <value>devcluster01:9001</value>

  <description>

    The host and port that the MapReduce job tracker runs at. If 

    "local", then jobs are run in-process as a single map and 

    reduce task.

    Note 9001 is not the default port.

  </description>

</property> <property>

  <name>mapred.reduce.tasks.speculative.execution</name>

  <value>false</value>

  <description>If true, then multiple instances of some reduce tasks 

               may be executed in parallel.</description>

</property> <property>

  <name>mapred.map.tasks.speculative.execution</name>

  <value>false</value>

  <description>If true, then multiple instances of some map tasks 

               may be executed in parallel.</description>

</property> <property> 

  <name>mapred.child.java.opts</name>

  <value>-Xmx2000m</value>

</property> <property> 

  <name>mapred.tasktracker.map.tasks.maximum</name>

  <value>4</value>

  <description>

    the core number of host

  </description>

</property> <property> 

  <name>mapred.map.tasks</name>

  <value>4</value>

</property> <property> 

  <name>mapred.tasktracker.reduce.tasks.maximum</name>

  <value>4</value>

    <description>

    define mapred.map tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host

    </description> 

</property> <property> 

  <name>mapred.reduce.tasks</name>

  <value>4</value>

  <description>

    define mapred.reduce tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host

  </description> 

</property> <property>

  <name>mapred.output.compression.type</name>

  <value>BLOCK</value>

  <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.

  </description>

</property> <property>

  <name>mapred.output.compress</name>

  <value>true</value>

  <description>Should the job outputs be compressed?

  </description>

</property> <property>

  <name>mapred.compress.map.output</name>

  <value>true</value>

  <description>Should the outputs of the maps be compressed before being                sent across the network. Uses SequenceFile compression.

  </description>

</property> <property>

  <name>mapred.system.dir</name>

  <value>/home/ysc/mapreduce/system</value>

</property> <property>

  <name>mapred.local.dir</name>

  <value>/home/ysc/mapreduce/local</value>

</property>

 10、vi conf/hadoop-env.sh

  Append:

export JAVA_HOME=/home/ysc/jdk1.7.0_05

  export HADOOP_HEAPSIZE=2000

  #Replace the default garbage collector; under heavy multi-threading the default collector spends more time waiting

  export HADOOP_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

 11、Copy the Hadoop files

  scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster02:/home/ysc/hadoop-1.1.1

  scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster03:/home/ysc/hadoop-1.1.1

 12、sudo vi /etc/profile

  Append the following and reboot the system:

  export PATH=/home/ysc/hadoop-1.1.1/bin:$PATH

 13、Format the namenode and start the cluster

  hadoop namenode -format

  start-all.sh
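  To confirm that the daemons came up (a quick check using the JDK's jps tool; not part of the original steps):

  jps   # on the master this should list NameNode, SecondaryNameNode and JobTracker; on the slaves, DataNode and TaskTracker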

 14、cd /home/ysc/workspace/nutch1.5.1/runtime/deploy

  mkdir urls

  echo  http://news.163.com > urls/url

  hadoop dfs -put urls urls

  bin/nutch crawl urls -dir data -depth 2 -topN 100 
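  To confirm that the crawl output landed in HDFS (a quick check, not part of the original steps):

  hadoop dfs -ls data   # should list crawldb, linkdb and segments once the crawl finishes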

 15、Visit http://localhost:50030 to check the state of the JobTracker, http://localhost:50060 to check the state of the TaskTracker, and http://localhost:50070 to check the NameNode and the state of the whole distributed file system, browse its files, view the logs, and so on

 16、Stop the cluster with stop-all.sh

 17、If the NameNode and the SecondaryNameNode are not on the same machine, add the following configuration to conf/hdfs-site.xml on the SecondaryNameNode:

   <property>

     <name>dfs.http.address</name>

     <value>namenode:50070</value>

   </property>

7. Configuring Ganglia to monitor the Hadoop and HBase clusters

 1、Server side (installed on the master, devcluster01)

  1)、ssh devcluster01

  2)、addgroup ganglia

           adduser --ingroup ganglia ganglia 

  3)、sudo apt-get install  ganglia-monitor ganglia-webfront gmetad

   //Note: on Ubuntu 10.04 the ganglia-webfront package is called ganglia-webfrontend

   //If the install fails, run sudo apt-get update; if the update fails, remove the offending source entry

  4)、vi /etc/ganglia/gmond.conf

   First find setuid = yes and change it to setuid = no;

   then find name in the cluster block and change it to name = "hadoop-cluster";

  5)、sudo apt-get install rrdtool

  6)、vi /etc/ganglia/gmetad.conf

   Add the data sources, i.e. the monitored nodes, to this configuration file by adding the following:

   data_source "hadoop-cluster" devcluster01:8649 devcluster02:8649 devcluster03:8649

   gridname "Hadoop"

 2、Data-source side (installed on all the slaves)

  1)、ssh devcluster02

   addgroup ganglia

   adduser --ingroup ganglia ganglia 

   sudo apt-get install  ganglia-monitor

  2)、ssh devcluster03

   addgroup ganglia

   adduser --ingroup ganglia ganglia 

   sudo apt-get install  ganglia-monitor

  3)、ssh devcluster01

   scp /etc/ganglia/gmond.conf devcluster02:/etc/ganglia/gmond.conf

   scp /etc/ganglia/gmond.conf devcluster03:/etc/ganglia/gmond.conf

 3、Configure the web front end

  1)、ssh devcluster01

  2)、sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia

  3)、vi /etc/apache2/apache2.conf

   Add:

   ServerName devcluster01

 4、Restart the services

  1)、ssh devcluster02

   sudo /etc/init.d/ganglia-monitor restart

   ssh devcluster03

   sudo /etc/init.d/ganglia-monitor restart

  2)、ssh devcluster01

   sudo /etc/init.d/ganglia-monitor restart

   sudo /etc/init.d/gmetad restart

   sudo /etc/init.d/apache2 restart

 5、Visit the page

  http://devcluster01/ganglia

 6、Integrate Hadoop

  1)、ssh devcluster01

  2)、cd /home/ysc/hadoop-1.1.1

  3)、vi conf/hadoop-metrics2.properties

  # Versions newer than 0.20 use ganglia31

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

  *.sink.ganglia.period=10

  # default for supportsparse is false

  *.sink.ganglia.supportsparse=true

 *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both

 *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

  #Multicast IP address; this is the default, set the same value everywhere (only the multicast address 239.2.11.71 works here)

  namenode.sink.ganglia.servers=239.2.11.71:8649

  datanode.sink.ganglia.servers=239.2.11.71:8649

  jobtracker.sink.ganglia.servers=239.2.11.71:8649

  tasktracker.sink.ganglia.servers=239.2.11.71:8649

  maptask.sink.ganglia.servers=239.2.11.71:8649

  reducetask.sink.ganglia.servers=239.2.11.71:8649

  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

  dfs.period=10

  dfs.servers=239.2.11.71:8649

  mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

  mapred.period=10

  mapred.servers=239.2.11.71:8649

  jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

  jvm.period=10

  jvm.servers=239.2.11.71:8649

  4)、scp conf/hadoop-metrics2.properties ysc@devcluster02:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties

  5)、scp conf/hadoop-metrics2.properties ysc@devcluster03:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties

  6)、stop-all.sh

  7)、start-all.sh

 7、Integrate HBase

  1)、ssh devcluster01

  2)、cd /home/ysc/hbase-0.92.2

  3)、vi conf/hadoop-metrics.properties (only the multicast address 239.2.11.71 works here)

   hbase.extendedperiod = 3600

   hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

   hbase.period=10

   hbase.servers=239.2.11.71:8649

   jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

   jvm.period=10

   jvm.servers=239.2.11.71:8649

   rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

   rpc.period=10

   rpc.servers=239.2.11.71:8649

  4)、scp conf/hadoop-metrics.properties ysc@devcluster02:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties

  5)、scp conf/hadoop-metrics.properties ysc@devcluster03:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties

  6)、stop-hbase.sh

  7)、start-hbase.sh

8. Configuring Snappy compression for Hadoop

 1、wget  http://snappy.googlecode.com/files/snappy-1.0.5.tar.gz

 2、tar -xzvf snappy-1.0.5.tar.gz

 3、cd snappy-1.0.5

 4、./configure

 5、make

 6、make install

 7、scp /usr/local/lib/libsnappy* devcluster01:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/

 scp /usr/local/lib/libsnappy* devcluster02:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/

 scp /usr/local/lib/libsnappy* devcluster03:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/

 8、vi /etc/profile

  Append:

  export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64

 9、Edit mapred-site.xml

  <property>

    <name>mapred.output.compression.type</name>

    <value>BLOCK</value>

    <description>If the job outputs are to compressed as SequenceFiles, how should

        they be compressed? Should be one of NONE, RECORD or BLOCK.

    </description>

  </property>   <property>

    <name>mapred.output.compress</name>

    <value>true</value>

    <description>Should the job outputs be compressed?

    </description>

  </property>   <property>

    <name>mapred.compress.map.output</name>

    <value>true</value>

    <description>Should the outputs of the maps be compressed before being

        sent across the network. Uses SequenceFile compression.

    </description>

  </property>   <property>

    <name>mapred.map.output.compression.codec</name>

    <value>org.apache.hadoop.io.compress.SnappyCodec</value>

    <description>If the map outputs are compressed, how should they be 

        compressed?

    </description>

  </property>   <property>

    <name>mapred.output.compression.codec</name>

    <value>org.apache.hadoop.io.compress.SnappyCodec</value>

    <description>If the job outputs are compressed, how should they be compressed?

    </description>

  </property>

9. Configuring LZO compression for Hadoop

 1、wget  http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz

 2、tar -zxvf lzo-2.06.tar.gz

 3、cd lzo-2.06

 4、./configure --enable-shared

 5、make

 6、make install

 7、scp /usr/local/lib/liblzo2.* devcluster01:/lib/x86_64-linux-gnu

 scp /usr/local/lib/liblzo2.* devcluster02:/lib/x86_64-linux-gnu

 scp /usr/local/lib/liblzo2.* devcluster03:/lib/x86_64-linux-gnu

 8、wget  http://hadoop-gpl-compression.apache-extras.org.codespot.com/files/hadoop-gpl-compression-0.1.0-rc0.tar.gz

 9、tar -xzvf hadoop-gpl-compression-0.1.0-rc0.tar.gz

 10、cd hadoop-gpl-compression-0.1.0

 11、cp lib/native/Linux-amd64-64/* /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/

 12、cp hadoop-gpl-compression-0.1.0.jar /home/ysc/hadoop-1.1.1/lib/ (the Hadoop version used by the cluster must match the Hadoop version this compression library was built against)

 13、scp -r /home/ysc/hadoop-1.1.1/lib devcluster02:/home/ysc/hadoop-1.1.1/

 scp -r /home/ysc/hadoop-1.1.1/lib devcluster03:/home/ysc/hadoop-1.1.1/

 14、vi /etc/profile

  Append:

  export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64

 15、Edit core-site.xml

  <property>

    <name>io.compression.codecs</name>

    <value>com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>

    <description>A list of the compression codec classes that can be used 

        for compression/decompression.</description>

  </property>   <property>

    <name>io.compression.codec.lzo.class</name>

    <value>com.hadoop.compression.lzo.LzoCodec</value>

  </property>   <property>

    <name>fs.trash.interval</name>

    <value>1440</value>

    <description>Number of minutes between trash checkpoints.

    If zero, the trash feature is disabled.

    </description>

  </property>

 16、Edit mapred-site.xml

  <property>

    <name>mapred.output.compression.type</name>

    <value>BLOCK</value>

    <description>If the job outputs are to compressed as SequenceFiles, how should

        they be compressed? Should be one of NONE, RECORD or BLOCK.

    </description>

  </property>   <property>

    <name>mapred.output.compress</name>

    <value>true</value>

    <description>Should the job outputs be compressed?

    </description>

  </property>   <property>

    <name>mapred.compress.map.output</name>

    <value>true</value>

    <description>Should the outputs of the maps be compressed before being

        sent across the network. Uses SequenceFile compression.

    </description>

  </property>   <property>

    <name>mapred.map.output.compression.codec</name>

    <value>com.hadoop.compression.lzo.LzoCodec</value>

    <description>If the map outputs are compressed, how should they be 

        compressed?

    </description>

  </property>   <property>

    <name>mapred.output.compression.codec</name>

    <value>com.hadoop.compression.lzo.LzoCodec</value>

    <description>If the job outputs are compressed, how should they be compressed?

    </description>

  </property>

10. Configuring a ZooKeeper cluster to run HBase

 1、ssh devcluster01

 2、cd /home/ysc

 3、wget  http://mirror.bjtu.edu.cn/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz

 4、tar -zxvf  zookeeper-3.4.5.tar.gz

 5、cd zookeeper-3.4.5

 6、cp conf/zoo_sample.cfg  conf/zoo.cfg

 7、vi conf/zoo.cfg

  Change: dataDir=/home/ysc/zookeeper

  Add:

   server.1=devcluster01:2888:3888

   server.2=devcluster02:2888:3888 

   server.3=devcluster03:2888:3888

   maxClientCnxns=100

 8、scp -r  zookeeper-3.4.5  devcluster01:/home/ysc

 scp -r  zookeeper-3.4.5  devcluster02:/home/ysc

 scp -r  zookeeper-3.4.5  devcluster03:/home/ysc

 9、On each of the three machines run:

  ssh devcluster01

  mkdir /home/ysc/zookeeper (note: dataDir is ZooKeeper's data directory and has to be created manually)

  echo 1 > /home/ysc/zookeeper/myid

  ssh devcluster02

  mkdir /home/ysc/zookeeper

  echo 2 > /home/ysc/zookeeper/myid

  ssh devcluster03

  mkdir /home/ysc/zookeeper

  echo 3 > /home/ysc/zookeeper/myid

 10、On each of the three machines run:

  cd /home/ysc/zookeeper-3.4.5

  bin/zkServer.sh start

  bin/zkCli.sh -server devcluster01:2181 

  bin/zkServer.sh status

11. Configuring an HBase cluster to run nutch-2.1 (region servers can crash because of memory problems)

1、nutch-2.1 uses gora-0.2.1, and gora-0.2.1 uses hbase-0.90.4; hbase-0.90.4 is not compatible with hadoop-1.1.1, and hbase-0.94.4 is not compatible with gora-0.2.1, but hbase-0.92.2 works. HBase also requires the system clocks to be synchronized, with a skew of no more than 30 s.

 sudo apt-get install ntp

 sudo ntpdate -u 210.72.145.44
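 To keep the clocks within the 30 s tolerance over time, the sync can also be scheduled periodically, for example with a cron entry like the following (an illustrative sketch, not part of the original steps):

 0 * * * * /usr/sbin/ntpdate -u 210.72.145.44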

2、HBase is a database and uses a large number of file handles at the same time. The default limit of 1024 used by most Linux systems is not sufficient. The nproc limit of the hbase user must also be raised; if it is too low, OutOfMemoryError exceptions occur under load.

 vi /etc/security/limits.conf

 Add:

   ysc soft nproc 32000

   ysc hard nproc 32000

   ysc soft nofile 32768

   ysc hard nofile 32768

 vi /etc/pam.d/common-session

 Add:

   session required  pam_limits.so
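 After logging in again, the new limits can be verified (a quick sanity check, not part of the original steps):

   ulimit -n   # open files, should report 32768

   ulimit -u   # max user processes, should report 32000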

 3、Log in to the master, then download and unpack HBase

  ssh devcluster01

  cd /home/ysc

  wget  http://apache.etoak.com/hbase/hbase-0.92.2/hbase-0.92.2.tar.gz

  tar -zxvf hbase-0.92.2.tar.gz

  cd hbase-0.92.2

 4、Edit the configuration file hbase-env.sh

  vi conf/hbase-env.sh

  Append:

  export JAVA_HOME=/home/ysc/jdk1.7.0_05

  export HBASE_MANAGES_ZK=false

  export HBASE_HEAPSIZE=10000

  #Replace the default garbage collector; under heavy multi-threading the default collector spends more time waiting

  export HBASE_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"

 5、Edit the configuration file hbase-site.xml

  vi conf/hbase-site.xml

  <property>  

   <name>hbase.rootdir</name>  

   <value>hdfs://devcluster01:9000/hbase</value>     

  </property> 

  <property>  

   <name>hbase.cluster.distributed</name>  

   <value>true</value>  

  </property>  

  <property>   

   <name>hbase.zookeeper.quorum</name>        

   <value>devcluster01,devcluster02,devcluster03</value>   

  </property>

  <property>

   <name>hfile.block.cache.size</name>

   <value>0.25</value>

   <description>

    Percentage of maximum heap (-Xmx setting) to allocate to block cache

    used by HFile/StoreFile. Default of 0.25 means allocate 25%.

    Set to 0 to disable but it's not recommended.

   </description>

  </property>

  <property>

   <name>hbase.regionserver.global.memstore.upperLimit</name>

   <value>0.4</value>

   <description>Maximum size of all memstores in a region server before new

     updates are blocked and flushes are forced. Defaults to 40% of heap

   </description>

  </property>

    <property>

   <name>hbase.regionserver.global.memstore.lowerLimit</name>

   <value>0.35</value>

   <description>When memstores are being forced to flush to make room in

    memory, keep flushing until we hit this mark. Defaults to 35% of heap.

    This value equal to hbase.regionserver.global.memstore.upperLimit causes

    the minimum possible flushing to occur when updates are blocked due to

    memstore limiting.

   </description>

    </property>

  <property>

   <name>hbase.hregion.majorcompaction</name>

   <value>0</value>

   <description>The time (in miliseconds) between 'major' compactions of all

    HStoreFiles in a region.  Default: 1 day.

    Set to 0 to disable automated major compactions.

   </description>

  </property>

 6、Edit the regionservers configuration file

  vi conf/regionservers

  devcluster01

  devcluster02

  devcluster03

 7、Because HBase runs on top of Hadoop, the hadoop*.jar used by HBase must match the one used by Hadoop. Replace the hadoop*.jar in HBase's lib directory with the jar from the Hadoop installation to avoid a version conflict.

  cp  /home/ysc/hadoop-1.1.1/hadoop-core-1.1.1.jar  /home/ysc/hbase-0.92.2/lib

  rm  /home/ysc/hbase-0.92.2/lib/hadoop-core-1.0.3.jar

 8、Copy the files to the region servers

  scp -r /home/ysc/hbase-0.92.2 devcluster01:/home/ysc

  scp -r /home/ysc/hbase-0.92.2 devcluster02:/home/ysc

  scp -r /home/ysc/hbase-0.92.2 devcluster03:/home/ysc 

 9、Start Hadoop and create the directory

  hadoop fs -mkdir /hbase

 10、Managing the HBase cluster:

  Start the initial HBase cluster:

   bin/start-hbase.sh

  Stop the HBase cluster:

   bin/stop-hbase.sh

  Start additional backup masters; up to 9 backup servers (10 in total) can be started:

   bin/local-master-backup.sh start 1

   bin/local-master-backup.sh start 2 3

  Start more region servers; up to 99 additional region servers (100 in total) are supported:

   bin/local-regionservers.sh start 1

   bin/local-regionservers.sh start 2 3 4 5

  Stop a backup master:

   cat /tmp/hbase-ysc-1-master.pid |xargs kill -9

  Stop an individual region server:

   bin/local-regionservers.sh stop 1

  Use the HBase command-line shell:

   bin/hbase shell

 11、Web interfaces

   http://devcluster01:60010

   http://devcluster01:60030

 12、To run nutch2.1, method one:

  cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf

  cd /home/ysc/nutch-2.1

  ant

  cd runtime/deploy

  unzip -d apache-nutch-2.1 apache-nutch-2.1.job

  rm  apache-nutch-2.1.job

  cd apache-nutch-2.1

  rm lib/hbase-0.90.4.jar

  cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar  lib

  zip -r ../apache-nutch-2.1.job ./*

  cd ..

  rm -r apache-nutch-2.1

 13、To run nutch2.1, method two:

  cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf

  cd /home/ysc/nutch-2.1

  cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar  lib

  ant

  cd runtime/deploy

  zip -d apache-nutch-2.1.job lib/hbase-0.90.4.jar

 Enabling Snappy compression:

 1、vi conf/gora-hbase-mapping.xml

  Add the attribute compression="SNAPPY" to the family element

 2、mkdir /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64

 3、cp /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/* /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64

 4、vi /home/ysc/hbase-0.92.2/conf/hbase-site.xml

  Add:

                <property>

                        <name>hbase.regionserver.codecs</name>

                        <value>snappy</value>

                </property>

12. Configuring an Accumulo cluster to run nutch-2.1 (Gora has a bug here)

 1、wget  http://apache.etoak.com/accumulo/1.4.2/accumulo-1.4.2-dist.tar.gz

 2、tar -xzvf accumulo-1.4.2-dist.tar.gz

 3、cd accumulo-1.4.2

 4、cp conf/examples/3GB/standalone/* conf

 5、vi conf/accumulo-env.sh

  export HADOOP_HOME=/home/ysc/cluster3

  export ZOOKEEPER_HOME=/home/ysc/zookeeper-3.4.5

  export JAVA_HOME=/home/jdk1.7.0_01

  export ACCUMULO_HOME=/home/ysc/accumulo-1.4.2

 6、vi conf/slaves

  devcluster01

  devcluster02

  devcluster03

 7、vi conf/masters

  devcluster01

 8、vi conf/accumulo-site.xml

  <property>

    <name>instance.zookeeper.host</name>

    <value>host6:2181,host8:2181</value>

    <description>comma separated list of zookeeper servers</description>

  </property>   <property>

    <name>logger.dir.walog</name>

    <value>walogs</value>

    <description>The directory used to store write-ahead logs on the local filesystem. It is possible to specify a comma-separated list of directories.</description>

  </property>   <property>

    <name>instance.secret</name>

    <value>ysc</value>

    <description>A secret unique to a given instance that all servers must know in order to communicate with one another.

        Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret [oldpasswd] [newpasswd],

        and then update this file.

    </description>

  </property>   <property>

    <name>tserver.memory.maps.max</name>

    <value>3G</value>

  </property>   <property>

    <name>tserver.cache.data.size</name>

    <value>50M</value>

  </property>   <property>

    <name>tserver.cache.index.size</name>

    <value>512M</value>

  </property>   <property>

    <name>trace.password</name>

    <!--

   change this to the root user's password, and/or change the user below

     -->

    <value>ysc</value>

  </property>   <property>

    <name>trace.user</name>

    <value>root</value>

  </property>

 9、bin/accumulo init

 10、bin/start-all.sh

 11、bin/stop-all.sh

 12、Web UI: http://devcluster01:50095/

 Changes to nutch2.1:

 1、cd  /home/ysc/nutch-2.1

 2、vi  conf/gora.properties

  Add:

  gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore

  gora.datastore.accumulo.mock=false

  gora.datastore.accumulo.instance=accumulo

  gora.datastore.accumulo.zookeepers=host6,host8

  gora.datastore.accumulo.user=root

  gora.datastore.accumulo.password=ysc

 3、vi  conf/nutch-site.xml

  Add:

  <property>

    <name>storage.data.store.class</name>

    <value>org.apache.gora.accumulo.store.AccumuloStore</value>

  </property>

 4、vi ivy/ivy.xml

  Add:

  <dependency org="org.apache.gora" name="gora-accumulo" rev="0.2.1" conf="*->default" />

 5、Update the Accumulo jars

  cp /home/ysc/accumulo-1.4.2/lib/accumulo-core-1.4.2.jar  /home/ysc/nutch-2.1/lib

  cp /home/ysc/accumulo-1.4.2/lib/accumulo-start-1.4.2.jar  /home/ysc/nutch-2.1/lib

  cp /home/ysc/accumulo-1.4.2/lib/cloudtrace-1.4.2.jar  /home/ysc/nutch-2.1/lib

 6、ant

 7、cd runtime/deploy

 8、Remove the old jars

  zip -d apache-nutch-2.1.job lib/accumulo-core-1.4.0.jar

  zip -d apache-nutch-2.1.job lib/accumulo-start-1.4.0.jar

  zip -d apache-nutch-2.1.job lib/cloudtrace-1.4.2.jar

13. Configuring a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)

 1、vi /etc/hosts (note: log in to every machine and make localhost resolve to the machine's actual address)

  192.168.1.1       localhost

 2、wget  http://labs.mop.com/apache-mirror/cassandra/1.2.0/apache-cassandra-1.2.0-bin.tar.gz

 3、tar -xzvf  apache-cassandra-1.2.0-bin.tar.gz

 4、cd apache-cassandra-1.2.0

 5、vi conf/cassandra-env.sh

  Add:

  MAX_HEAP_SIZE="4G"

  HEAP_NEWSIZE="800M"

 6、vi conf/log4j-server.properties

  Change:

  log4j.appender.R.File=/home/ysc/cassandra/system.log

 7、vi conf/cassandra.yaml

  Change:

  cluster_name: 'Cassandra  Cluster'

  data_file_directories:

      - /home/ysc/cassandra/data

  commitlog_directory: /home/ysc/cassandra/commitlog

  saved_caches_directory: /home/ysc/cassandra/saved_caches

  - seeds: "192.168.1.1"

  listen_address: 192.168.1.1

  rpc_address: 192.168.1.1

  thrift_framed_transport_size_in_mb: 1023

  thrift_max_message_length_in_mb: 1024

 8、vi bin/stop-server

  Add:

  user=`whoami`

  pgrep -u $user -f cassandra | xargs kill -9

 9、Copy Cassandra to the other nodes:

  cd ..

  scp -r apache-cassandra-1.2.0 devcluster02:/home/ysc

  scp -r apache-cassandra-1.2.0 devcluster03:/home/ysc

  On devcluster02 and devcluster03 respectively, change:

  vi conf/cassandra.yaml

   listen_address: 192.168.1.2

   rpc_address: 192.168.1.2

  vi conf/cassandra.yaml

   listen_address: 192.168.1.3

   rpc_address: 192.168.1.3

 10、Run on each of the three nodes

  bin/cassandra

  bin/cassandra -f

  The -f flag runs Cassandra as a foreground process, which makes debugging and watching the log output easier; in a real production environment the flag is not needed (Cassandra then runs as a daemon)

 11、bin/nodetool -host devcluster01 ring

        bin/nodetool -host devcluster01 info

 12、bin/stop-server

 13、bin/cassandra-cli

 Changes to nutch2.1:

 1、cd  /home/ysc/nutch-2.1

 2、vi  conf/gora.properties

  Add:

  gora.cassandrastore.servers=host2:9160,host6:9160,host8:9160

 3、vi  conf/nutch-site.xml

  Add:

  <property>

    <name>storage.data.store.class</name>

    <value>org.apache.gora.cassandra.store.CassandraStore</value>

  </property>

 4、vi ivy/ivy.xml

  Add:

  <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2.1" conf="*->default" />

 5、Update the Cassandra jars

  cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-1.2.0.jar  /home/ysc/nutch-2.1/lib

  cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-thrift-1.2.0.jar  /home/ysc/nutch-2.1/lib

  cp /home/ysc/apache-cassandra-1.2.0/lib/jline-1.0.jar  /home/ysc/nutch-2.1/lib

 6、ant

 7、cd runtime/deploy

 8、Remove the old jars

  zip -d apache-nutch-2.1.job lib/cassandra-thrift-1.1.2.jar

  zip -d apache-nutch-2.1.job lib/jline-0.9.1.jar

14. Configuring a standalone MySQL server to run nutch-2.1

 1、apt-get install mysql-server mysql-client

 2、vi /etc/mysql/my.cnf

  Change:

  bind-address            = 221.194.43.2

  Under [client] add:

  default-character-set=utf8

  Under [mysqld] add:

  default-character-set=utf8

 3、mysql -uroot -pysc

  SHOW VARIABLES LIKE '%character%';

 4、service mysql restart

 5、mysql -uroot -pysc

  GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "ysc";
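  A quick way to confirm that remote access works from another machine (a simple check, not part of the original steps; it assumes the bind-address above is reachable from the client):

  mysql -h221.194.43.2 -uroot -pysc -e "SHOW DATABASES;"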

 6、vi conf/gora-sql-mapping.xml

  Adjust the column lengths

  <primarykey column="id" length="333"/>

  <field name="content" column="content" />

  <field name="text" column="text" length="19892"/>

 7、After starting Nutch, log in to MySQL and run

   ALTER TABLE webpage MODIFY COLUMN content MEDIUMBLOB;

   ALTER TABLE webpage MODIFY COLUMN text MEDIUMTEXT;

   ALTER TABLE webpage MODIFY COLUMN title MEDIUMTEXT;

   ALTER TABLE webpage MODIFY COLUMN reprUrl MEDIUMTEXT;

   ALTER TABLE webpage MODIFY COLUMN baseUrl MEDIUMTEXT;

   ALTER TABLE webpage MODIFY COLUMN typ MEDIUMTEXT;

   ALTER TABLE webpage MODIFY COLUMN inlinks MEDIUMBLOB;

   ALTER TABLE webpage MODIFY COLUMN outlinks MEDIUMBLOB;

 Changes to nutch2.1:

 1、cd  /home/ysc/nutch-2.1

 2、vi  conf/gora.properties

  Add:

   gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver

 gora.sqlstore.jdbc.url=jdbc:mysql://host2:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8

  gora.sqlstore.jdbc.user=root

  gora.sqlstore.jdbc.password=ysc

 3、vi  conf/nutch-site.xml

  Add:

  <property>

    <name>storage.data.store.class</name>

    <value>org.apache.gora.sql.store.SqlStore</value>

  </property>   <property>

    <name>encodingdetector.charset.min.confidence</name>

    <value>1</value>

    <description>A integer between 0-100 indicating minimum confidence value

    for charset auto-detection. Any negative value disables auto-detection.

    </description>

  </property>

 4、vi ivy/ivy.xml

  Add:

  <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

15. Using DataFileAvroStore as the nutch2.1 data store

 1、cd  /home/ysc/nutch-2.1

 2、vi  conf/gora.properties

  Add:

  gora.datafileavrostore.output.path=datafileavrostore

  gora.datafileavrostore.input.path=datafileavrostore

 3、vi  conf/nutch-site.xml

  Add:

  <property>

    <name>storage.data.store.class</name>

    <value>org.apache.gora.avro.store.DataFileAvroStore</value>

  </property>   <property>

    <name>encodingdetector.charset.min.confidence</name>

    <value>1</value>

    <description>A integer between 0-100 indicating minimum confidence value

    for charset auto-detection. Any negative value disables auto-detection.

    </description>

  </property>

16. Using AvroStore as the nutch2.1 data store

 1、cd  /home/ysc/nutch-2.1

 2、vi  conf/gora.properties

  Add:

  gora.avrostore.codec.type=BINARY

  gora.avrostore.input.path=avrostore

  gora.avrostore.output.path=avrostore

 3、vi  conf/nutch-site.xml

  Add:

  <property>

    <name>storage.data.store.class</name>

    <value>org.apache.gora.avro.store.AvroStore</value>

  </property>   <property>

    <name>encodingdetector.charset.min.confidence</name>

    <value>1</value>

    <description>A integer between 0-100 indicating minimum confidence value

    for charset auto-detection. Any negative value disables auto-detection.

    </description>

  </property>

17. Configuring SOLR

 Configure Tomcat:

 1、wget  http://www.fayea.com/apache-mirror/tomcat/tomcat-7/v7.0.35/bin/apache-tomcat-7.0.35.tar.gz

 2、tar -xzvf apache-tomcat-7.0.35.tar.gz

 3、cd apache-tomcat-7.0.35

 4、vi conf/server.xml

 Add URIEncoding="UTF-8":

  <Connector port="8080" protocol="HTTP/1.1"

       connectionTimeout="20000"

       redirectPort="8443" URIEncoding="UTF-8"/>

 5、mkdir conf/Catalina

 6、mkdir conf/Catalina/localhost

 7、vi conf/Catalina/localhost/solr.xml

 Add:

  <Context path="/solr">

   <Environment name="solr/home" type="java.lang.String" value="/home/ysc/solr/configuration/" override="false"/>

  </Context>

 8、cd ..

 Download SOLR:

 1、wget  http://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/4.1.0/solr-4.1.0.tgz

 2、tar -xzvf solr-4.1.0.tgz

 Copy the resources:

 1、mkdir /home/ysc/solr

 2、cp -r solr-4.1.0/example/solr  /home/ysc/solr/configuration

 3、unzip solr-4.1.0/example/webapps/solr.war -d /home/ysc/apache-tomcat-7.0.35/webapps/solr

 Configure Nutch:

 1、Copy the schema:

  cp /home/ysc/nutch-1.6/conf/schema-solr4.xml /home/ysc/solr/configuration/collection1/conf/schema.xml

 2、vi /home/ysc/solr/configuration/collection1/conf/schema.xml

  Under <fields> add:

  <field name="_version_" type="long" indexed="true" stored="true"/>

 Configure Chinese word segmentation:

 1、wget  http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.v20130120-SNAPSHOT.zip

 2、unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip

 3、cp mmseg4j-1.9.1-SNAPSHOT/dist/* /home/ysc/apache-tomcat-7.0.35/webapps/solr/WEB-INF/lib

 4、unzip mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT.jar -d  mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT

 5、mkdir /home/ysc/dic

 6、cp   mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT/data/* /home/ysc/dic

 7、vi /home/ysc/solr/configuration/collection1/conf/schema.xml

  将檔案中的

  <tokenizer class="solr.WhitespaceTokenizerFactory"/>

  和

  <tokenizer class="solr.StandardTokenizerFactory"/>

  替換為

  <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>

 Configure the Tomcat native library:

 1、wget  http://apache.spd.co.il/apr/apr-1.4.6.tar.gz

 2、tar -xzvf apr-1.4.6.tar.gz

 3、cd apr-1.4.6

 4、./configure

 5、make

 6、make  install

 1、wget  http://mirror.bjtu.edu.cn/apache/apr/apr-util-1.5.1.tar.gz

 2、tar -xzvf apr-util-1.5.1.tar.gz

 3、cd apr-util-1.5.1

 4、./configure --with-apr=/usr/local/apr

 5、make

 6、make  install

 1、wget  http://mirror.bjtu.edu.cn/apache//tomcat/tomcat-connectors/native/1.1.24/source/tomcat-native-1.1.24-src.tar.gz

 2、tar -zxvf tomcat-native-1.1.24-src.tar.gz

 3、cd tomcat-native-1.1.24-src/jni/native

 4、./configure --with-apr=/usr/local/apr \

                --with-java-home=/home/ysc/jdk1.7.0_01 \

                --with-ssl=no \

                --prefix=/home/ysc/apache-tomcat-7.0.35

 5、make

 6、make  install

 7、vi /etc/profile

 Add:

 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/ysc/apache-tomcat-7.0.35/lib:/usr/local/apr/lib

 8、source /etc/profile

 Start Tomcat:

 cd apache-tomcat-7.0.35

 bin/catalina.sh start

  http://devcluster01:8080/solr/

18. Nagios monitoring

 Server side:

 1、apt-get install apache2 nagios3 nagios-nrpe-plugin

  Enter the password: nagiosadmin

 2、apt-get install nagios3-doc

 3、vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg

   define hostgroup {

     hostgroup_name  nagios-servers

     alias           nagios servers

     members         devcluster01,devcluster02,devcluster03

   }

 4、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster01_nagios2.cfg

  vi /etc/nagios3/conf.d/devcluster01_nagios2.cfg

  Substitute:

   g/localhost/s//devcluster01/g

   g/127.0.0.1/s//192.168.1.1/g

 5、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster02_nagios2.cfg

  vi /etc/nagios3/conf.d/devcluster02_nagios2.cfg

  Substitute:

   g/localhost/s//devcluster02/g

   g/127.0.0.1/s//192.168.1.2/g

 6、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster03_nagios2.cfg

  vi /etc/nagios3/conf.d/devcluster03_nagios2.cfg

  Substitute:

   g/localhost/s//devcluster03/g

   g/127.0.0.1/s//192.168.1.3/g

 7、vi /etc/nagios3/conf.d/services_nagios2.cfg

  Change hostgroup_name to nagios-servers

  Add:

   # check that web services are running

   define service {

     hostgroup_name                  nagios-servers

     service_description             HTTP

     check_command                   check_http

     use                             generic-service

     notification_interval           0 ; set > 0 if you want to be renotified

   }

   # check that ssh services are running

   define service {

     hostgroup_name                  nagios-servers

     service_description             SSH

     check_command                   check_ssh

     use                             generic-service

     notification_interval           0 ; set > 0 if you want to be renotified

   }

 8、vi /etc/nagios3/conf.d/extinfo_nagios2.cfg

  将hostgroup_name改為nagios-servers

  增加:

   define hostextinfo{

     hostgroup_name   nagios-servers

     notes            nagios-servers

   #       notes_url         http://webserver.localhost.localdomain/hostinfo.pl?host=netware1

     icon_image       base/debian.png

     icon_image_alt   Debian GNU/Linux

     vrml_image       debian.png

     statusmap_image  base/debian.gd2

     }

 9、sudo /etc/init.d/nagios3 restart

 10、Visit http://devcluster01/nagios3/

  Username: nagiosadmin  Password: nagiosadmin

 Monitored hosts:

 1、apt-get install nagios-nrpe-server

 2、vi /etc/nagios/nrpe.cfg

  Substitute:

  g/127.0.0.1/s//192.168.1.1/g

 3、sudo /etc/init.d/nagios-nrpe-server restart

19. Configuring Splunk

 1、wget  http://download.splunk.com/releases/5.0.2/splunk/linux/splunk-5.0.2-149561-Linux-x86_64.tgz

 2、tar -zxvf splunk-5.0.2-149561-Linux-x86_64.tgz

 3、cd splunk

 4、bin/splunk start --answer-yes --no-prompt --accept-license

 5、Visit http://devcluster01:8000

  Username: admin  Password: changeme

 6、Add data -> From a UDP port -> UDP port *: 1688 -> Source type: from list, log4j -> Save

 7、Configure Hadoop

  vi /home/ysc/hadoop-1.1.1/conf/log4j.properties

  Change:

   log4j.rootLogger=${hadoop.root.logger}, EventCounter, SYSLOG

  Add:

   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender  

   log4j.appender.SYSLOG.facility=local1  

   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout  

   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n  

   log4j.appender.SYSLOG.SyslogHost=host6:1688 

   log4j.appender.SYSLOG.threshold=INFO  

   log4j.appender.SYSLOG.Header=true 

   log4j.appender.SYSLOG.FacilityPrinting=true  

 8、Configure HBase

  vi /home/ysc/hbase-0.92.2/conf/log4j.properties

  Change:

   log4j.rootLogger=${hbase.root.logger},SYSLOG

  Add:

   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender  

   log4j.appender.SYSLOG.facility=local1  

   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout  

   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n  

   log4j.appender.SYSLOG.SyslogHost=host6:1688 

   log4j.appender.SYSLOG.threshold=INFO  

   log4j.appender.SYSLOG.Header=true 

   log4j.appender.SYSLOG.FacilityPrinting=true

 9、Configure Nutch

  vi /home/lanke/ysc/nutch-2.1-hbase/conf/log4j.properties

  Change:

   log4j.rootLogger=INFO,DRFA,SYSLOG

  Add:

   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender  

   log4j.appender.SYSLOG.facility=local1  

   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout  

   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n  

   log4j.appender.SYSLOG.SyslogHost=host6:1688 

   log4j.appender.SYSLOG.threshold=INFO  

   log4j.appender.SYSLOG.Header=true 

   log4j.appender.SYSLOG.FacilityPrinting=true

 10、Start Hadoop and HBase

  start-all.sh

  start-hbase.sh

20. Configuring Pig

 1、wget  http://labs.mop.com/apache-mirror/pig/pig-0.11.0/pig-0.11.0.tar.gz

 2、tar -xzvf pig-0.11.0.tar.gz

 3、cd pig-0.11.0

 4、vi /etc/profile

  Add:

  export PIG_HOME=/home/ysc/pig-0.11.0

  export PATH=$PIG_HOME/bin:$PATH

 5、source /etc/profile

 6、cp conf/log4j.properties.template conf/log4j.properties

 7、vi conf/log4j.properties

 8、pig

21. Configuring Hive

 1、wget  http://mirrors.cnnic.cn/apache/hive/hive-0.10.0/hive-0.10.0.tar.gz

 2、tar -xzvf hive-0.10.0.tar.gz

 3、cd hive-0.10.0

 4、vi /etc/profile

  Add:

  export HIVE_HOME=/home/ysc/hive-0.10.0

  export PATH=$HIVE_HOME/bin:$PATH

 5、source /etc/profile

 6、cp conf/hive-log4j.properties.template conf/hive-log4j.properties

 7、vi conf/hive-log4j.properties

  Replace:

  log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter

  with:

  log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
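 A quick smoke test once Hadoop is running (a simple check, not part of the original steps; it assumes HIVE_HOME and PATH are set as above):

 hive -e "SHOW TABLES;"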

22. Configuring a Hadoop 2.x cluster

 1、wget  http://labs.mop.com/apache-mirror/hadoop/common/hadoop-2.0.2-alpha/hadoop-2.0.2-alpha.tar.gz

 2、tar -xzvf hadoop-2.0.2-alpha.tar.gz

 3、cd hadoop-2.0.2-alpha

 4、vi etc/hadoop/hadoop-env.sh

  Append:

export JAVA_HOME=/home/ysc/jdk1.7.0_05

  export HADOOP_HEAPSIZE=2000

 5、vi etc/hadoop/core-site.xml

  <property>

   <name>fs.defaultFS</name>

   <value>hdfs://devcluster01:9000</value>

   <description>

      Where to find the Hadoop Filesystem through the network. 

      Note 9000 is not the default port.

      (This is slightly changed from previous versions which didnt have "hdfs")

   </description>

   </property>

   <property>

    <name>io.file.buffer.size</name>

    <value>131072</value>

    <description>The size of buffer for use in sequence files.

    The size of this buffer should probably be a multiple of hardware

    page size (4096 on Intel x86), and it determines how much data is

    buffered during read and write operations.</description>

  </property>

 6、vi etc/hadoop/mapred-site.xml

  <property>

    <name>mapreduce.framework.name</name>

    <value>yarn</value>

  </property>   <property>

    <name>mapred.job.reduce.input.buffer.percent</name>

    <value>1</value>

    <description>The percentage of memory- relative to the maximum heap size- to

    retain map outputs during the reduce. When the shuffle is concluded, any

    remaining map outputs in memory must consume less than this threshold before

    the reduce can begin.

    </description>

  </property>   <property>

    <name>mapred.job.shuffle.input.buffer.percent</name>

    <value>1</value>

    <description>The percentage of memory to be allocated from the maximum heap

    size to storing map outputs during the shuffle.

    </description>

  </property>   <property>

    <name>mapred.inmem.merge.threshold</name>

    <value>0</value>

    <description>The threshold, in terms of the number of files 

    for the in-memory merge process. When we accumulate threshold number of files

    we initiate the in-memory merge and spill to disk. A value of 0 or less than

    0 indicates we want to DON'T have any threshold and instead depend only on

    the ramfs's memory consumption to trigger the merge.

    </description>

  </property>   <property>

    <name>io.sort.factor</name>

    <value>100</value>

    <description>The number of streams to merge at once while sorting

    files.  This determines the number of open file handles.</description>

  </property>   <property>

    <name>io.sort.mb</name>

    <value>240</value>

    <description>The total amount of buffer memory to use while sorting 

    files, in megabytes.  By default, gives each merge stream 1MB, which

    should minimize seeks.</description>

  </property>

    <property>

      <name>mapred.map.output.compression.codec</name>

      <value>org.apache.hadoop.io.compress.SnappyCodec</value>

      <description>If the map outputs are compressed, how should they be 

          compressed?

      </description>

    </property>     <property>

      <name>mapred.output.compression.codec</name>

      <value>org.apache.hadoop.io.compress.SnappyCodec</value>

      <description>If the job outputs are compressed, how should they be compressed?

      </description>

    </property>

  <property>

    <name>mapred.output.compression.type</name>

    <value>BLOCK</value>

    <description>If the job outputs are to compressed as SequenceFiles, how should

        they be compressed? Should be one of NONE, RECORD or BLOCK.

    </description>

  </property>

  <property> 

    <name>mapred.child.java.opts</name>

    <value>-Xmx2000m</value>

  </property>   <property>

    <name>mapred.output.compress</name>

    <value>true</value>

    <description>Should the job outputs be compressed?

    </description>

  </property>   <property>

    <name>mapred.compress.map.output</name>

    <value>true</value>

    <description>Should the outputs of the maps be compressed before being

        sent across the network. Uses SequenceFile compression.

    </description>

  </property>   <property> 

    <name>mapred.tasktracker.map.tasks.maximum</name>

    <value>5</value>

  </property>   <property> 

    <name>mapred.map.tasks</name>

    <value>15</value>

  </property>   <property> 

    <name>mapred.tasktracker.reduce.tasks.maximum</name>

    <value>5</value>

   <description>

   define mapred.map tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host

   </description> 

  </property>   <property> 

    <name>mapred.reduce.tasks</name>

    <value>15</value>

    <description>

   define mapred.reduce tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host

    </description> 

  </property> 

  <property>

    <name>mapred.system.dir</name>

    <value>/home/ysc/mapreduce/system</value>

  </property>   <property>

    <name>mapred.local.dir</name>

    <value>/home/ysc/mapreduce/local</value>

  </property>   <property>

    <name>mapreduce.job.counters.max</name>

    <value>12000</value>

    <description>Limit on the number of counters allowed per job.

    </description>

  </property>

 7、vi etc/hadoop/yarn-site.xml

  <property>    

    <name>yarn.resourcemanager.resource-tracker.address</name>   

    <value>devcluster01:8031</value> 

   </property>   

   <property>  

    <name>yarn.resourcemanager.address</name>     

    <value>devcluster01:8032</value>  

   </property> 

   <property>    

    <name>yarn.resourcemanager.scheduler.address</name>  

    <value>devcluster01:8030</value> 

   </property>

   <property>  

    <name>yarn.resourcemanager.admin.address</name>  

    <value>devcluster01:8033</value>   

   </property>   

   <property>    

    <name>yarn.resourcemanager.webapp.address</name>    

    <value>devcluster01:8088</value>  

   </property>  

   <property>   

    <description>Classpath for typical applications.</description> 

    <name>yarn.application.classpath</name>  

    <value>       

    $HADOOP_CONF_DIR,      

    $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,    

    $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,       

    $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,   

    $YARN_HOME/*,$YARN_HOME/lib/*   

    </value>  

   </property>

   <property>  

    <name>yarn.nodemanager.aux-services</name>  

    <value>mapreduce.shuffle</value>  

   </property>   

   <property>    

    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>  

    <value>org.apache.hadoop.mapred.ShuffleHandler</value>  

   </property>  

   <property>   

    <name>yarn.nodemanager.local-dirs</name>     <value>/home/ysc/h2/data/1/yarn/local,/home/ysc/h2/data/2/yarn/local,/home/ysc/h2/data/3/yarn/local</value> 

   </property>

   <property> 

    <name>yarn.nodemanager.log-dirs</name>      <value>/home/ysc/h2/data/1/yarn/logs,/home/ysc/h2/data/2/yarn/logs,/home/ysc/h2/data/3/yarn/logs</value> 

   </property>  

   <property>   

    <description>Where to aggregate logs</description> 

    <name>yarn.nodemanager.remote-app-log-dir</name>    

    <value>/home/ysc/h2/var/log/hadoop-yarn/apps</value> 

   </property>    

   <property>    

    <name>mapreduce.jobhistory.address</name>   

    <value>devcluster01:10020</value> 

   </property>   

   <property>    

    <name>mapreduce.jobhistory.webapp.address</name>   

    <value>devcluster01:19888</value> 

   </property>   

 8、vi etc/hadoop/hdfs-site.xml

  <property>  

   <name>dfs.permissions.superusergroup</name>  

   <value>root</value> 

  </property>

  <property>

    <name>dfs.name.dir</name>

    <value>/home/ysc/dfs/filesystem/name</value>

  </property>

  <property>

    <name>dfs.data.dir</name>

    <value>/home/ysc/dfs/filesystem/data</value>

  </property>

  <property>

    <name>dfs.replication</name>

    <value>3</value>

  </property>

  <property>

    <name>dfs.block.size</name>

    <value>6710886400</value>

    <description>The default block size for new files.</description>

  </property>

 9、Start Hadoop

  bin/hdfs namenode -format

  sbin/start-dfs.sh

  sbin/start-yarn.sh
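  To confirm that HDFS is up and the datanodes have registered (a quick check, not part of the original steps):

  bin/hdfs dfsadmin -report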

 10、Visit the management pages

   http://devcluster01:8088

   http://devcluster01:50070
