Simple Crawler Architecture
Topics covered in these notes:
- The dynamic run flow of the crawler
- The role of the URL manager and the three ways it can be implemented (a minimal sketch follows this list)
- The role of the web page downloader, the kinds of downloaders available in Python, and the three ways urllib2 can download a page (also sketched below)
- The role of the web page parser, the several parsers available in Python, and why structured parsing relies on the DOM tree
- Beautiful Soup syntax, with code examples (the numbered walkthrough further down)
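The outline only names the URL manager; as an illustration of the simplest of the three implementation approaches it alludes to (an in-memory structure, as opposed to something like a database table or a cache for larger crawls), a minimal sketch might look like the following. The class and method names here are my own, not from the source:

class UrlManager(object):
    """In-memory URL manager: one set of URLs waiting to be crawled,
    one set of URLs that have already been crawled."""

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        # Ignore empty URLs and URLs we have already seen
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the "to crawl" set into the "crawled" set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

The outline also mentions three ways of downloading a page with urllib2 (the Python 2 module; in Python 3 the equivalent calls live in urllib.request). A hedged sketch of the three variants commonly taught, using a placeholder URL:

import urllib2
import cookielib

url = 'http://www.example.com'   # placeholder URL, for illustration only

# Method 1: plain urlopen with a URL string
response1 = urllib2.urlopen(url)
print(response1.getcode())       # HTTP status code
print(len(response1.read()))     # size of the downloaded page

# Method 2: build a Request object so request headers can be added
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print(response2.getcode())

# Method 3: install an opener that handles cookies
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print(response3.getcode())
print(cj)                        # the cookies the server set, if any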
1. Creating a Beautiful Soup object
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    html_doc,               # the HTML document as a string
    'html.parser',          # the HTML parser to use
    from_encoding='utf-8'   # encoding of the document (only used when html_doc is bytes)
)
2. Using the find_all and find methods
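These notes leave this section without a body; a small, self-contained sketch of the usual find_all / find call patterns follows (the tag and href values are made up for illustration):

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<a href="/view/123.htm" class="sister">Elsie</a>',
                     'html.parser')

all_links  = soup.find_all('a')                                  # every <a> node, as a list
exact_link = soup.find('a', href='/view/123.htm')                # first <a> with this exact href
class_link = soup.find('a', class_='sister')                     # class_ avoids clashing with the keyword class
regex_link = soup.find('a', href=re.compile(r'/view/\d+\.htm'))  # href matched by a regular expression

print(all_links, exact_link, class_link, regex_link)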
3. Accessing node information
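This section is also just a heading here; the node returned by find exposes its information roughly like this (same made-up snippet as above):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/view/123.htm" class="sister">Elsie</a>',
                     'html.parser')
node = soup.find('a')

print(node.name)        # the tag name:       a
print(node['href'])     # one attribute:      /view/123.htm
print(node.get_text())  # the text content:   Elsie
print(node.attrs)       # all attributes as a dict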
4. A worked example of processing an HTML document with Beautiful Soup
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(
    html_doc,               # the HTML document as a string
    'html.parser',          # the HTML parser to use
    from_encoding='utf-8'   # encoding of the document (only used when html_doc is bytes)
)

print('Get all the links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the tillie link')
link_node = soup.find('a', href='http://example.com/tillie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regular expression match')
link_node2 = soup.find('a', href=re.compile(r'lsi'))   # matches any href containing 'lsi', e.g. .../elsie
print(link_node2.name, link_node2['href'], link_node2.get_text())

print('Get the text of the p paragraph')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())
Console output:
Get all the links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
Get the tillie link
a http://example.com/tillie Tillie
Regular expression match
a http://example.com/elsie Elsie
Get the text of the p paragraph
p The Dormouse's story
More advanced crawlers also have to deal with situations such as required logins, CAPTCHAs, Ajax-rendered content, server-side anti-crawling measures, multi-threading, and distributed crawling.