Batch-converting file encodings with Python

2023-03-10 05:28:50

via: http://www.g2w.me/2012/02/python-batch-convert-file-encodings/

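Today I imported an old Swing project into Eclipse, and when I ran it the text came out garbled. Checking the code, I found that some of the .java files were encoded in UTF-8 while others were in GB2312. There were too many files to inspect one by one, so it was time for Python to step in.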

This needs a Python library, chardet 1.0.1, which detects a file's encoding automatically and is very easy to use.

>>> import chardet
>>> chardet.detect(open(r'E:\Workspaces\java\GCHMCreator\main\g2w\app\gchm\gui\ContentElement.java').read())
{'confidence': 0.99, 'encoding': 'GB2312'}

detect returns a dictionary in which encoding is the detected encoding and confidence is how confident the detector is in that guess (a value between 0 and 1).
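
Since detection is a heuristic guess, it can be worth checking confidence before trusting encoding. The following is a minimal sketch, assuming Python 3 (the listing in this post is Python 2). The helper name pick_encoding, the 0.8 threshold and the utf-8 fallback are illustrative assumptions, not part of chardet.

```python
def pick_encoding(result, min_confidence=0.8, fallback="utf-8"):
    """Return the detected encoding, or `fallback` if the guess looks weak.

    `result` is a dict shaped like the one chardet.detect() returns,
    e.g. {'confidence': 0.99, 'encoding': 'GB2312'}.
    The threshold and fallback are assumptions chosen for illustration.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # chardet reports encoding=None when it cannot make a guess.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


if __name__ == "__main__":
    print(pick_encoding({"confidence": 0.99, "encoding": "GB2312"}))
    print(pick_encoding({"confidence": 0.30, "encoding": "windows-1252"}))
    print(pick_encoding({"confidence": 0.0, "encoding": None}))
```

With chardet installed, the dict would come from chardet.detect(raw_bytes), where raw_bytes is read from a file opened in binary mode.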

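Here is what I need to do:

1. Walk through all the .java files in the project and detect the encoding of each one.
2. Back up each .java file as .java.bak so that it can be restored.
3. Convert each .java file from its detected encoding to UTF-8.
4. Provide a restore tool that can bring back the original files if a conversion goes wrong.
5. Provide a cleanup tool that removes the backup files once the conversion is confirmed to be correct.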
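
One thing to watch with the backup step: if the script is run twice, a plain copy would overwrite the first backup with an already-converted file. The sketch below is one possible guard, assuming Python 3; the helper name backup_once and the sample file name are hypothetical.

```python
import os
import shutil
import tempfile


def backup_once(filename, suffix=".bak"):
    """Copy `filename` to `filename + suffix` unless that backup already exists.

    Returns True if a new backup was written, False if one was already there.
    """
    backup = filename + suffix
    if os.path.exists(backup):
        # Keep the first backup: it holds the untouched original.
        return False
    shutil.copyfile(filename, backup)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Demo.java")
        with open(path, "wb") as f:
            f.write(b"class Demo {}")
        print(backup_once(path))  # first call writes Demo.java.bak
        print(backup_once(path))  # second call leaves the backup alone
```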

The key part is the conversion step: use chardet to detect the file's encoding, source_encoding, decode the text from source_encoding into unicode, and then use codecs to write the file out in the correct encoding.
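
The decode-then-encode idea can be shown on a byte string alone. This is a minimal sketch, assuming Python 3, where bytes.decode returns a str (the counterpart of Python 2's unicode). The function name transcode and the sample text are made up for illustration.

```python
def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Decode `raw` bytes from `source_encoding` and re-encode them."""
    text = raw.decode(source_encoding)   # bytes -> str (Unicode)
    return text.encode(target_encoding)  # str -> bytes in the target encoding


if __name__ == "__main__":
    gb_bytes = "中文".encode("gb2312")
    utf8_bytes = transcode(gb_bytes, "gb2312")
    print(utf8_bytes.decode("utf-8"))
```

In the real script, source_encoding would be whatever chardet reports for the file rather than a hard-coded name.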

Full code

#-*- coding: utf-8 -*-

import codecs
import os
import shutil
import re
import chardet

def convert_encoding(filename, target_encoding):
    # Back up the original file.
    shutil.copyfile(filename, filename + '.bak')

    # Read the raw bytes so that chardet can inspect them.
    with open(filename, 'rb') as f:
        content = f.read()
    source_encoding = chardet.detect(content)['encoding']
    print source_encoding, filename
    if source_encoding is None:
        raise ValueError('could not detect the encoding')

    # Decode to unicode, then write the file back in the target encoding.
    content = content.decode(source_encoding)
    with codecs.open(filename, 'w', encoding=target_encoding) as f:
        f.write(content)

def main():
    for root, dirs, files in os.walk(os.getcwd()):
        for f in files:
            if f.lower().endswith('.java'):
                filename = os.path.join(root, f)
                try:
                    convert_encoding(filename, 'utf-8')
                except Exception as e:
                    # Report the failure and carry on with the next file.
                    print filename, e

def process_bak_files(action='restore'):
    for root, dirs, files in os.walk(os.getcwd()):
        for f in files:
            if f.lower().endswith('.java.bak'):
                source = os.path.join(root, f)
                target = os.path.join(root, re.sub(r'\.java\.bak$', '.java', f, flags=re.IGNORECASE))
                try:
                    if action == 'restore':
                        shutil.move(source, target)
                    elif action == 'clear':
                        os.remove(source)
                except Exception as e:
                    # Report the failure and carry on with the next backup.
                    print source, e

if __name__ == '__main__':
    # process_bak_files(action='clear')
    main()
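
Before clearing the backups, it may help to confirm that every converted file really is well-formed UTF-8. The sketch below is one way to check, assuming Python 3; the function name find_non_utf8 and the sample file names are hypothetical.

```python
import os
import tempfile


def find_non_utf8(root, suffix=".java"):
    """Return paths under `root` ending in `suffix` that are not valid UTF-8."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                raw = f.read()
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "Good.java"), "wb") as f:
            f.write("// 中文".encode("utf-8"))
        with open(os.path.join(tmp, "Bad.java"), "wb") as f:
            f.write("// 中文".encode("gb2312"))
        print([os.path.basename(p) for p in find_non_utf8(tmp)])
```

Passing this check only shows that the files decode as UTF-8, not that the source encoding was detected correctly, so a quick visual check of a few files is still worthwhile.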
