Nutch-2.2.1 Study Notes, Part 9: Filtering URLs in Practice

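After analyzing Nutch's configuration file nutch-default.xml and reading part of the source code, I came to understand Nutch's plugin mechanism and how to filter crawled data by editing the files under conf. By default, URL filtering is implemented by the RegexURLFilter class, and its rule file is regex-urlfilter.txt. With that file left unmodified, Nutch rejects URLs whose suffix is one of gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS, rejects URLs containing any of the characters ?*!@=, rejects URLs in which a slash-delimited segment (/SameSomething/) repeats three times, and accepts every other URL. The rest of this post uses http://hadoop.apache.org as the seed URL and compares two cases: a crawl with the default rules, and a crawl restricted to URLs that contain hadoop.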
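As a rough illustration of how such a rule file behaves, the Python sketch below re-creates the four default rules described above and applies them in order. It assumes that the first matching rule decides the outcome and that a URL matching no rule is rejected. The function name url_filter and the sample URLs are hypothetical, and Nutch itself evaluates the patterns with Java regular expressions, so this is an approximation rather than the plugin's actual code.

```python
import re

# Rules written out from the description above (not copied from a Nutch install).
# '-' means reject, '+' means accept.
DEFAULT_RULES = [
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
          r"|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ"
          r"|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),                        # probable query URLs
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),    # same path segment repeated three times
    ("+", r"."),                              # accept anything else
]


def url_filter(url, rules=DEFAULT_RULES):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


if __name__ == "__main__":
    samples = [
        "http://hadoop.apache.org/index.html",
        "http://hadoop.apache.org/index.pdf",
        "http://hadoop.apache.org/images/logo.png",   # hypothetical URL
        "http://example.com/search?q=hadoop",         # hypothetical URL
        "http://example.com/a/b/a/c/a/",              # hypothetical URL
    ]
    for url in samples:
        print("accept" if url_filter(url) else "reject", url)
```

Under these assumptions the two index pages are accepted, while the .png file, the query URL, and the looping path are rejected.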

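First, the default case: regex-urlfilter.txt is left untouched. The command and its result are shown below. As the result shows, the crawl produced 38 records in total.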

[deploy]$ bin/crawl urls hadoop http://localhost:8983/solr/ 1

hbase(main):012:0> scan 'hadoop_webpage', {COLUMNS=>'f:ts'}
ROW                                      COLUMN+CELL                                                                                                         
 com.apachecon.eu.www:http/c/aceu2009/   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1D1                                                    
 com.apachecon.us:http/c/acus2008/       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Ft                                                    
 com.cafepress.www:http/hadoop/          column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw                                                    
 com.yahoo.developer:http/blogs/hadoop/2 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1D2                                                    
 008/07/apache_hadoop_wins_terabyte_sort                                                                                                                     
 _benchmark.html                                                                                                                                             
 org.apache.avro:http/                   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw                                                    
 org.apache.cassandra:http/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fx                                                    
 org.apache.forrest:http/                column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1F\xAC                                                 
 org.apache.hadoop:http/                 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01C\xE1\xD1\xB2\xFC                                           
 org.apache.hadoop:http/bylaws.html      column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy                                                    
 org.apache.hadoop:http/docs/current/    column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x06                                                    
 org.apache.hadoop:http/docs/r0.23.10/   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy                                                    
 org.apache.hadoop:http/docs/r1.2.1/     column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fz                                                    
 org.apache.hadoop:http/docs/r2.1.1-beta column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07                                                    
 /                                                                                                                                                           
 org.apache.hadoop:http/docs/r2.2.0/     column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F{                                                    
 org.apache.hadoop:http/docs/stable/     column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07                                                    
 org.apache.hadoop:http/index.html       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|                                                    
 org.apache.hadoop:http/index.pdf        column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08                                                    
 org.apache.hadoop:http/issue_tracking.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08                                                    
 tml                                                                                                                                                         
 org.apache.hadoop:http/mailing_lists.ht column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08                                                    
 ml                                                                                                                                                          
 org.apache.hadoop:http/privacy_policy.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09                                                    
 tml                                                                                                                                                         
 org.apache.hadoop:http/releases.html    column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|                                                    
 org.apache.hadoop:http/who.html         column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F}                                                    
 org.apache.hbase:http/                  column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09                                                    
 org.apache.hive:http/                   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F~                                                    
 org.apache.incubator:http/ambari/       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F                                                 
 org.apache.incubator:http/chukwa/       column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09                                                    
 org.apache.incubator:http/hama/         column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A                                                    
 org.apache.mahout:http/                 column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F                                                 
 org.apache.pig:http/                    column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x82                                                 
 org.apache.wiki:http/hadoop             column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A                                                    
 org.apache.wiki:http/hadoop/PoweredBy   column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A                                                    
 org.apache.www:http/                    column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12                                                    
 org.apache.www:http/foundation/sponsors column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83                                                 
 hip.html                                                                                                                                                    
 org.apache.www:http/foundation/thanks.h column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12                                                    
 tml                                                                                                                                                         
 org.apache.www:http/licenses/           column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83                                                 
 org.apache.zookeeper:http/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x85                                                 
 org.sortbenchmark:http/                 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x13                                                    
 uk.co.guardian.www:http/technology/2011 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x14                                                    
 /mar/25/media-guardian-innovation-award                                                                                                                     
 s-apache-hadoop                                                                                                                                             
38 row(s) in 0.2590 seconds
           

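In the second case, regex-urlfilter.txt is modified: its last line is changed to +^http://.*hadoop.*, so that only URLs containing hadoop are crawled. The result is shown below. It contains only 20 rows, and every row key is a URL that contains hadoop.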

[deploy]$ bin/crawl urls hadoopWithFilter http://localhost:8983/solr/ 1
hbase(main):016:0> scan 'hadoopWithFilter_webpage',{COLUMNS=>'f:ts'}
ROW                                      COLUMN+CELL                                                                                                         
 com.cafepress.www:http/hadoop/          column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtGl                                                       
 com.yahoo.developer:http/blogs/hadoop/2 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtG\x88                                                    
 008/07/apache_hadoop_wins_terabyte_sort                                                                                                                     
 _benchmark.html                                                                                                                                             
 org.apache.hadoop:http/                 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01C\xE1\xF0\xEB\xCB                                           
 org.apache.hadoop:http/bylaws.html      column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHv                                                       
 org.apache.hadoop:http/docs/current/    column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA7                                                    
 org.apache.hadoop:http/docs/r0.23.10/   column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw                                                       
 org.apache.hadoop:http/docs/r1.2.1/     column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw                                                       
 org.apache.hadoop:http/docs/r2.1.1-beta column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8                                                    
 /                                                                                                                                                           
 org.apache.hadoop:http/docs/r2.2.0/     column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx                                                       
 org.apache.hadoop:http/docs/stable/     column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8                                                    
 org.apache.hadoop:http/index.html       column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx                                                       
 org.apache.hadoop:http/index.pdf        column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8                                                    
 org.apache.hadoop:http/issue_tracking.h column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9                                                    
 tml                                                                                                                                                         
 org.apache.hadoop:http/mailing_lists.ht column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9                                                    
 ml                                                                                                                                                          
 org.apache.hadoop:http/privacy_policy.h column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9                                                    
 tml                                                                                                                                                         
 org.apache.hadoop:http/releases.html    column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHy                                                       
 org.apache.hadoop:http/who.html         column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHz                                                       
 org.apache.wiki:http/hadoop             column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA                                                    
 org.apache.wiki:http/hadoop/PoweredBy   column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA                                                    
 uk.co.guardian.www:http/technology/2011 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAB                                                    
 /mar/25/media-guardian-innovation-award                                                                                                                     
 s-apache-hadoop                                                                                                                                             
20 row(s) in 0.3090 seconds
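The row keys in both scans store the host name in reversed order, followed by the protocol and the path. To check the modified rule against a few of them offline, a minimal Python sketch can turn row keys back into URLs and test them against ^http://.*hadoop.*. The helper row_key_to_url is hypothetical (it is not a Nutch API), its port handling is an assumption about the key layout, and the sample keys are taken from the first scan.

```python
import re


def row_key_to_url(key):
    """Turn a row key such as 'org.apache.hadoop:http/docs/' back into a URL."""
    reversed_host, rest = key.split(":", 1)
    scheme_port, _, path = rest.partition("/")
    scheme, _, port = scheme_port.partition(":")   # assumed layout: protocol[:port]
    host = ".".join(reversed(reversed_host.split(".")))
    netloc = f"{host}:{port}" if port else host
    return f"{scheme}://{netloc}/{path}"


# The accept rule placed on the last line of regex-urlfilter.txt.
ACCEPT = re.compile(r"^http://.*hadoop.*")

# A few row keys copied from the unfiltered scan.
sample_keys = [
    "com.apachecon.eu.www:http/c/aceu2009/",
    "com.cafepress.www:http/hadoop/",
    "org.apache.avro:http/",
    "org.apache.hadoop:http/docs/current/",
    "org.apache.hbase:http/",
    "org.apache.wiki:http/hadoop/PoweredBy",
    "org.sortbenchmark:http/",
]

# Keep only the keys whose URL matches the accept rule.
kept = [k for k in sample_keys if ACCEPT.search(row_key_to_url(k))]

if __name__ == "__main__":
    for key in kept:
        print(key, "->", row_key_to_url(key))
```

Of these seven keys, only the cafepress, hadoop.apache.org, and wiki.apache.org entries survive, which agrees with the rows listed in the filtered scan.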
           

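These results show that editing the regular expressions in regex-urlfilter.txt is enough to customize which URLs are crawled, so that Nutch fetches only the content you are interested in.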
