
[Python Basics] -- Simple page scraping with Python, and opening/closing the browser from code

Disclaimer: the examples in this article are for learning only, to illustrate what Python can do; do not use them for improper purposes!

1. Python can open a browser, visit a web page, and save the page content to the local disk

Implementation:

import urllib
import webbrowser as web   # "web" is just an alias for the webbrowser module

url = "http://www.jd.com"
content = urllib.urlopen(url).read()    # fetch the page (Python 2 urllib)
open('data.html', 'w').write(content)   # save the HTML to the local disk
# open the file data.html we just wrote, in a new browser tab
web.open_new_tab("data.html")
 
2. Calling an operating-system command to close the browser
On Windows the command is: taskkill /F /IM <image name>, e.g. taskkill /F /IM qq.exe kills QQ.
On Linux the equivalent is killall <process name>, e.g. killall firefox; killall takes just the process name, not the Windows /F /IM flags (plain kill by PID also works, but you have to look the PID up first).
The Python code:
import os

os.system('taskkill /F /IM qq.exe')   # Windows
# on Linux: os.system('killall qq')   # killall only needs the process name
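
If the same script has to run on both systems, a small helper can choose the command at run time. This is only a sketch; kill_process and the process names below are illustrative, not part of the original article:

import os
import sys

def kill_process(windows_image, linux_name):
    # taskkill wants the image name on Windows; killall wants the process name on Linux
    if sys.platform.startswith('win'):
        os.system('taskkill /F /IM %s' % windows_image)
    else:
        os.system('killall %s' % linux_name)

kill_process('qq.exe', 'qq')   # example call; replace with the browser you actually use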
 
3. Opening a page a given number of times and closing it, i.e. only closing the browser after the page has been opened several times
The Python code below closes the browser once after every 10 page opens; in total the page is opened at least 10*count times:
import webbrowser as web
import time
import os
import random

# random integer in [1, 10]: how many open/close rounds to run
count = random.randint(1, 10)
# counter for the outer loop
j = 0
while j <= count:
    # counter for the opens in this round
    i = 0
    # open the browser 10 times
    while i <= 9:
        # open the page in a new tab
        web.open_new_tab("URL to open")   # replace with the real address
        # give the browser time to react: 0.8 s
        time.sleep(0.8)
        i += 1
    else:
        # kill the browser process; I open the pages with the 360 browser here
        os.system('taskkill /F /IM 360se.exe')
        # advance the outer loop
        j += 1
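
The same behaviour reads a little more clearly with for loops; a sketch under the same assumptions (360 browser, placeholder URL):

import webbrowser as web
import time
import os
import random

count = random.randint(1, 10)               # number of open/close rounds
for j in range(count):
    for i in range(10):                     # open the page 10 times per round
        web.open_new_tab("URL to open")     # replace with the real address
        time.sleep(0.8)                     # give the browser time to react
    os.system('taskkill /F /IM 360se.exe')  # then kill the browser once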

Note: the examples in this article are based on Python 2.7 and were written in PyCharm.

On Python 3.0 and later they may not run as-is; a few calls need small changes.
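
For reference, a minimal sketch of what the first example might look like on Python 3, where urlopen lives in urllib.request and read() returns bytes:

import urllib.request
import webbrowser as web

url = "http://www.jd.com"
content = urllib.request.urlopen(url).read()   # bytes in Python 3
with open('data.html', 'wb') as f:             # write the raw bytes back out
    f.write(content)
web.open_new_tab("data.html")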

Extra 1: References

http://justcoding.iteye.com/blog/1940717

http://www.open-open.com/lib/view/open1419163083058.html

https://www.douban.com/note/572528169/

Steps for crawling page content on Linux

1. Install wget: yum install -y wget

2. Run the command

#wget -o /tmp/wget.log  -P /opt/testdata  --no-parent --no-verbose -m -D mydata -N --convert-links --random-wait -A html,HTML,JSP http://www.***.com

#wget -r -np -d -o /itcast --accept=iso,html,HTML,ASP,asp  http://www.itcast.cn/ 

3. Follow the crawl log

#tail -F  /tmp/wget.log

4. After the download succeeds, compress the files

#yum -y install zip

#zip -r mydata.zip  mydata
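
If you would rather do the packaging from Python (2.7 or later) instead of the zip command, shutil.make_archive is roughly equivalent; a sketch assuming the crawl landed in a mydata directory under the current path:

import shutil

# creates mydata.zip from the mydata directory, like `zip -r mydata.zip mydata`
shutil.make_archive('mydata', 'zip', '.', 'mydata')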

Extra 2: crawling a site's content (watch the indentation)

# -*- coding: utf-8 -*-
from urlparse import urlparse, urljoin
from os.path import splitext, dirname, isdir, exists
from os import sep, unlink, makedirs
from string import replace, find, lower
from urllib import urlretrieve
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
from cStringIO import StringIO

class Retriever(object): # download web pages
    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url, deffile='index.html'):
        parsedurl = urlparse(url, 'http:', 0)  # parse path
        path = parsedurl[1] + parsedurl[2]
        print path
        ext = splitext(path)
        print ext
        if ext[1] == '': # no file, use default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path) # local directory
        print path
        print ldir
        if sep != '/': # os-indep. path separator
            ldir = replace(ldir, '/', sep)
        if not isdir(ldir): # create archive dir if nec.
            if exists(ldir): unlink(ldir)
            makedirs(ldir)
        return path

    def download(self): # download web page
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url)
        return retval

    def parseAndGetLinks(self): # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

class Crawler(object): # manage entire crawling process

    count = 0 # static downloaded page counter

    def __init__(self, url):
        self.q = [url]
        self.seen = []
        self.dom = urlparse(url)[1]
        print 'self.dom: ', self.dom

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0] == '*': # error string returned by download() above, do not parse
            print retval, '... skipping parse'
            return
        Crawler.count += 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)

        links = r.parseAndGetLinks() # get and process links
        for eachLink in links:
            if eachLink[:4] != 'http' and find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)
            print '* ', eachLink

            # if the link is a mailto address, skip it
            if find(lower(eachLink), 'mailto:') != -1:
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'

    def go(self): # process links in queue
        while self.q:
            url = self.q.pop()
            self.getPage(url)

def main():
    try:
        url = raw_input('Enter starting URL: ')
    except (KeyboardInterrupt, EOFError):
        url = ''

    if not url: return
    # robot = Crawler('http://baike.bd.com/subview/2202550/11243904.htm')
    robot = Crawler(url)
    robot.go()
    print 'Done!'

if __name__ == '__main__':
    main()
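
If you do not want the interactive prompt, the Crawler class above can also be driven directly, as the commented-out line in main() hints; the URL here is only a placeholder:

robot = Crawler('http://www.example.com/')   # seed URL, replace with a real site
robot.go()                                   # crawl until the queue is empty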

Writing a shell script:

#!/bin/sh

URL="$2"
PATH="$1"

echo "download url: $URL"
echo "download dir: $PATH"

/usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m  -t 1 -P "$PATH" "$URL"

echo "success to download"      

-x: create the directory structure that mirrors the site

-q: quiet mode, i.e. no download progress output; drop this option if you want to see what wget is currently fetching

-m: turn on mirroring options, such as recursive download of subdirectories with unlimited depth

-t <times>: number of retries after a resource fails to download

-w <seconds>: wait time between requests (to reduce load on the server)
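
Since the article already uses os.system to run OS commands, the same wget invocation can be launched from Python as well; a sketch with a placeholder path and URL, assuming wget is installed:

import os

url = "http://www.example.com"
dest = "/opt/testdata"
# same options as the shell script above: mirror the site into dest
os.system('/usr/bin/wget -e robots=off -w 1 -xq -np -nH -pk -m -t 1 -P %s %s' % (dest, url))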