Scraping Baidu search results with Python

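A Python learning exercise: scrape the first eight pages of Baidu search results for a keyword and save the title, link, and description of every entry.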

Environment:

1. python-3.3.2, with UTF-8 as the environment encoding

2. beautifulsoup4-4.1.0

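Notes:

1. Put the keywords to search for in searchfile.txt, in the same directory as the script, one keyword per line.

2. The search results are written to the data folder in that same directory, one file per keyword.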
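
The file layout described in these notes can be sketched as follows. This is only an illustration: the helper names load_keywords and output_path are hypothetical and are not part of the script, and the sketch assumes searchfile.txt is saved as UTF-8.

```python
import os


def load_keywords(search_file="searchfile.txt"):
    """Return the keywords from the file, one per line, skipping blank lines."""
    # Assumption: the keyword file is UTF-8 encoded.
    with open(search_file, encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]


def output_path(keyword, data_dir="data"):
    """Return the path of the result file for one keyword: <data_dir>/<keyword>.txt."""
    return os.path.join(data_dir, keyword + ".txt")


if __name__ == "__main__":
    # Print which result file each keyword would map to.
    for kw in load_keywords():
        print(kw, "->", output_path(kw))
```

Building the path with os.path.join keeps the layout portable, whereas a hard-coded backslash separator only works on Windows.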

Script:

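#coding:utf-8

import os
import time
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup  # BeautifulSoup 3 used: from BeautifulSoup import BeautifulSoup

# Baidu's pn parameter is the offset of the first result, not a page index,
# so page n starts at n * 10 with the default of ten results per page.
RESULTS_PER_PAGE = 10

# Function 1: fetch one page of search results for a keyword
def baidu_search(key_words, pagenum):
    # Percent-encode the keyword so non-ASCII text and spaces are valid in the URL
    query = urllib.parse.quote(key_words)
    url = 'http://www.baidu.com/s?wd=' + query + '&pn=' + str(pagenum * RESULTS_PER_PAGE)
    html = urllib.request.urlopen(url).read()
    return html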

# Function 2: process one search keyword
def deal_key(key_words):
    if not os.path.exists('data'):
        os.mkdir('data')
    filename = os.path.join('data', key_words + '.txt')
    # Open in binary mode and write UTF-8 bytes explicitly. In text mode ('w')
    # the writes would need str values and the page text would need an encoding
    # conversion, which fails on some irregular whitespace characters.
    try:
        fp = open(filename, 'wb')
    except OSError:
        print('Failed to open file: ' + filename)
        return
    with fp:
        for x in range(8):  # the first eight result pages
            htmlpage = baidu_search(key_words, x)
            soup = BeautifulSoup(htmlpage, 'html.parser')
            # These selectors follow the layout of Baidu's result page
            for item in soup.find_all('div', {'class': 'result'}):
                a_click = item.find('a')
                if a_click:
                    fp.write(a_click.get_text().encode('utf-8'))  # title
                fp.write(b'#')
                if a_click:
                    fp.write((a_click.get('href') or '').encode('utf-8'))  # link
                fp.write(b'#')
                c_abstract = item.find('div', {'class': 'c-abstract'})
                if c_abstract:
                    fp.write(c_abstract.get_text().encode('utf-8'))  # description
                fp.write(b'#')
                fp.write(b'\n')  # one record per line

# Function 3: read the search file and process each keyword in turn
def search_file():
    with open('searchfile.txt', encoding='utf-8') as fp:
        i = 0
        for line in fp:
            keyword = line.strip()  # drop the trailing newline
            if not keyword:
                continue  # skip blank lines
            i = i + 1
            if i == 5:
                # Pause for 15 seconds before every fifth keyword
                print('sleep...')
                time.sleep(15)
                print('end...')
                i = 0
            deal_key(keyword)

# Script entry point
if __name__ == '__main__':
    print('Start:')
    search_file()
    print('End!')
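
Reading the saved files back is not covered by the script. As a minimal sketch, assuming each result ends up on its own line in the form title#link#description#, the records could be loaded as shown here. The names Result, parse_record, and read_results are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type for one saved search result.
Result = namedtuple("Result", ["title", "link", "description"])


def parse_record(line):
    """Split one 'title#link#description#' line into a Result."""
    line = line.rstrip("\r\n")
    if line.endswith("#"):
        line = line[:-1]  # drop the single trailing separator
    # Split on the first two separators only, so a '#' inside the
    # description stays part of the description.
    fields = line.split("#", 2)
    fields += [""] * (3 - len(fields))  # pad missing fields with empty strings
    return Result(*fields)


def read_results(path):
    """Read every non-blank line of a result file as a Result."""
    with open(path, encoding="utf-8") as fp:
        return [parse_record(line) for line in fp if line.strip()]
```

Because '#' is an ordinary character, a title or link that contains it would be split in the wrong place; a tab-separated or CSV format would be more robust if the output is meant to be parsed later.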
