How to Fix UnicodeEncodeError

2023-06-29 07:55:54

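Today, while working through Chapter 4 (Searching and Ranking) of Programming Collective Intelligence, I tried porting the book's Python 2 example programs to Python 3. The first crawler I wrote looks like this: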

    # Starting from a small set of pages, do a breadth-first search to the given depth,
    # indexing pages along the way
    def crawl(self,pages,depth=2):
        print('searching %s'% pages)

        for i in range(depth):
            newpages=set()
            for page in pages:
                try:
                    c=urllib.request.urlopen(page)
                except:
                    print("Could not open %s "%page)
                    continue

                try:
                    soup=BeautifulSoup(c.read(),'html.parser')
                    self.addtoindex(page,soup)

                    links=soup.find_all('a')
                    for link in links:
                        if('href' in link.attrs):
                            url=urljoin(page,link['href'])
                            if url.find("'")!=-1:
                                continue
                            url=url.split('#')[0]  # strip the fragment (location part)
                            if url[0:4]=='http' and not self.isindexed(url):
                                newpages.add(url)
                                print('the link is %s'%url)
                            linkText=self.gettextonly(link)
                            self.addlinkref(page,url,linkText)
                    self.dbcommit()
                except:
                    print("Could not parse %s "% page)
            pages=newpages

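Running the program against English-language sites was slow but error-free. Against Chinese sites such as Baidu and NetEase, however, it always failed with the error shown below: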

[Figure: screenshot of the UnicodeEncodeError message]

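The message points to an encoding problem. I tried encoding the page string as Unicode, but that did not help. After a lot of searching, I finally found the answer online: http://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/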

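It turns out that when Python prints a Unicode string, the string must first be encoded in the console's encoding. My system is the Windows 7 cmd console, whose default encoding is GBK, so the string has to be encoded as GBK before it can be displayed. The string contained characters that GBK cannot represent, which is what produces the "'gbk' codec can't encode" error.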
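
The failure can be reproduced without a console by encoding a string by hand, which is roughly what printing to a GBK terminal has to do. This is only a minimal sketch: the helper name `can_encode` and both sample strings are made up for illustration and do not come from the crawler.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample values, not real crawler output.
plain = "http://www.baidu.com/百度"  # common Chinese characters exist in GBK
tricky = "news\xa0title"            # U+00A0 (no-break space) has no GBK mapping

print(can_encode(plain, "gbk"))
print(can_encode(tricky, "gbk"))
```

A check like this makes it easier to see which strings would trip the console before they are printed.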

There are two ways to fix this:

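Option 1: When printing, encode the Unicode string as GBK and pass the 'ignore' argument so that characters that cannot be encoded are skipped. The string then encodes to GBK without error. The code: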

                try:
                    c=urllib.request.urlopen(page)
                except:
                    print("Could not open %s "%page.encode('GBK','ignore'))
                    continue
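
One detail worth noting under Python 3: `str.encode` returns `bytes`, so formatting the result with `%s` prints its `b'...'` form rather than the text itself. A possible variation, sketched here, decodes the bytes back so that a plain string is printed. The helper name `gbk_safe` and the sample text are hypothetical.

```python
def gbk_safe(text, errors="ignore"):
    """Return text with characters GBK cannot represent dropped or replaced."""
    # Encode with a lenient error handler, then decode back to str for printing.
    return text.encode("gbk", errors).decode("gbk")


sample = "news\xa0title"  # hypothetical text; U+00A0 is not in GBK

print("Could not open %s" % gbk_safe(sample))             # U+00A0 is dropped
print("Could not open %s" % gbk_safe(sample, "replace"))  # U+00A0 becomes '?'

# Characters that GBK supports pass through unchanged.
assert gbk_safe("百度") == "百度"
```

Using `'replace'` instead of `'ignore'` leaves a visible `?` where a character was lost, which can help when reading the crawler's log output.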

Option 2: Convert the Unicode string to GB18030, a superset of GBK (that is, GBK is a subset of GB18030). The code:

                try:
                    c=urllib.request.urlopen(page)
                except:
                    print("Could not open %s "%page.encode('GB18030'))
                    continue
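
To illustrate why GB18030 avoids the error, the sketch below encodes a string that GBK rejects and then prints it to an in-memory stream configured for GB18030. The in-memory stream is only a stand-in for a console; whether a real cmd window displays the result correctly depends on it actually decoding GB18030, which is an assumption here. The sample text is hypothetical.

```python
import io

# Hypothetical sample; neither U+00A0 nor the emoji exists in GBK.
text = "news\xa0title \U0001F600"

# GBK rejects the string, while GB18030 can encode it.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
encoded = text.encode("gb18030")

# Stand-in for a console whose encoding is GB18030 (an in-memory buffer).
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding="gb18030", newline="\n")
print("Could not open %s" % text, file=console)
console.flush()

print(gbk_ok)
print(buffer.getvalue().decode("gb18030") == "Could not open %s\n" % text)
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="gb18030")` may be one way to apply the same idea to the real standard output, though I have not verified it against the Win7 console described above.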

With either change in place, the crawler runs without raising the error.

