Using bs4 for parsing in practice

2023-05-27 08:13:29

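from bs4 import BeautifulSoup
Instantiating the object:
1. Load from a local HTML file
fp=open('./test.html','r',encoding='utf-8')
soup=BeautifulSoup(fp,'lxml')
2. Load a page fetched from the internet (response is the object returned by requests.get)
page_text=response.text
soup=BeautifulSoup(page_text,'lxml')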
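Both ways of building the object can be tried offline with a minimal sketch like the one below. The markup, the temporary file, and the helper names soup_from_file and soup_from_text are assumptions made for illustration, and the built-in 'html.parser' is swapped in for 'lxml' so that no extra package is needed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup standing in for ./test.html or response.text.
html = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"


def soup_from_file(path):
    # Pass an open file handle; the with block closes it after parsing.
    with open(path, "r", encoding="utf-8") as fp:
        return BeautifulSoup(fp, "html.parser")


def soup_from_text(page_text):
    # Pass a string, for example the .text of a requests response.
    return BeautifulSoup(page_text, "html.parser")


# Write the markup to a temporary file so both loading paths can be compared.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)
    file_soup = soup_from_file(path)

text_soup = soup_from_text(html)
print(file_soup.title.string, text_soup.p.string)
```

Either way the result is the same kind of BeautifulSoup object, so everything that follows applies to both.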
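Methods and attributes for parsing the data:
1. soup.tagname: returns the first tag with that name in the document
2. soup.find('tagname'): also returns the first match, same as soup.tagname
  soup.find('div',class_='song'): the first div whose class is song (class_ has a trailing underscore because class is a Python keyword)
3. soup.find_all('tagname'): returns a list of all matching tags
4. soup.select('selector'): takes a CSS selector (id, class, tag and so on) and returns a list of all matches
  soup.select('.tang>ul>li>a'): > steps down exactly one level
  soup.select('.tang>ul a'): a space spans any number of levels
5. Getting the text inside a tag:
  soup.a.text/string/get_text()
  text and get_text() return all the text inside the tag, including text in nested tags; string returns the text only when the tag contains a single piece of text, and None when it has several children
6. Getting a tag attribute:
soup.a['href']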
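The calls listed above can be compared side by side in a small sketch. The markup is hypothetical (only the class names song and tang come from the notes), and 'html.parser' is used in place of 'lxml' to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names echo the ones used in the notes.
html = """
<div class="song"><p>first</p><p>second</p></div>
<div class="tang">
  <ul>
    <li><a href="/poem/1">Li Bai</a></li>
    <li><i><a href="/poem/2">Du Fu</a></i></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                             # first <p> in the document
song_div = soup.find("div", class_="song")   # first div with class "song"
all_p = soup.find_all("p")                   # list of every <p>

direct_links = soup.select(".tang>ul>li>a")  # <a> must be a direct child of <li>
nested_links = soup.select(".tang>ul a")     # <a> at any depth below <ul>

song_text = song_div.get_text()              # text of all descendants, joined
song_string = song_div.string                # None: the div has two children
first_href = soup.a["href"]                  # attribute lookup by key

print(first_p.string, len(all_p), len(direct_links), len(nested_links))
print(repr(song_text), song_string, first_href)
```

The second link is wrapped in an extra tag, so only the selector with a space finds it, and the div with two paragraphs shows why string can come back as None while get_text() still returns the text.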

A complete example:

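# Goal: scrape the title and text of every chapter of Romance of the Three Kingdoms
from bs4 import BeautifulSoup
import requests


if __name__ == '__main__':
    headers = {
        'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    url='https://www.shicimingju.com/book/sanguoyanyi.html'
    # Fetch the table-of-contents page
    response=requests.get(url=url,headers=headers)
    # Assumption: the pages are UTF-8; set it explicitly in case requests guesses wrong
    response.encoding='utf-8'
    page_text=response.text
    soup=BeautifulSoup(page_text,'lxml')
    # Each <li> under .book-mulu holds one chapter link
    li_title=soup.select('.book-mulu>ul>li')
    # The with block closes the file even if a request fails midway
    with open('./sanguo.txt','w',encoding='utf-8') as fp:
        for i in li_title:
            title=i.a.string
            detail_url='https://www.shicimingju.com'+i.a['href']
            detail_response=requests.get(url=detail_url,headers=headers)
            detail_response.encoding='utf-8'
            detail_content=detail_response.text
            detail_soup=BeautifulSoup(detail_content,'lxml')
            # The chapter body sits in <div class="chapter_content">
            div_tag=detail_soup.find('div',class_='chapter_content')
            content=div_tag.text
            fp.write(title+":"+content+'\n')
            print(title,'scraped successfully')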
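The step that turns each chapter's href into a full address can also be sketched offline. This version swaps the string concatenation for urllib.parse.urljoin and skips entries without a link. The HTML fragment, the chapter paths, and the helper name chapter_links are assumptions for illustration, not the site's actual markup.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical fragment shaped the way the selector '.book-mulu>ul>li' expects.
toc_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
    <li>no link here</li>
  </ul>
</div>
"""
base_url = "https://www.shicimingju.com/book/sanguoyanyi.html"


def chapter_links(page_text, page_url):
    """Return (title, absolute_url) pairs for every entry that has a link."""
    soup = BeautifulSoup(page_text, "html.parser")
    links = []
    for li in soup.select(".book-mulu>ul>li"):
        a_tag = li.a
        if a_tag is None or not a_tag.get("href"):
            continue  # skip entries without a usable link
        title = a_tag.get_text(strip=True)
        # urljoin resolves the href against the page it was found on.
        links.append((title, urljoin(page_url, a_tag["href"])))
    return links


for title, link in chapter_links(toc_html, base_url):
    print(title, link)
```

Resolving against the page URL keeps working if the site ever uses relative paths or full URLs in href, which plain concatenation would not handle.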
