Batch-converting file encodings with Python

2023-03-10 05:28:50

via: http://www.g2w.me/2012/02/python-batch-convert-file-encodings/

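Today I imported an old Swing project into Eclipse, and when I ran it the text came out garbled. Checking the code, I found that some of the Java files were encoded in UTF-8 while others were in GB2312. There were too many files to inspect one by one, so it was time for Python to step in.

This needs a Python library, chardet 1.0.1, which automatically detects a file's encoding and is very easy to use.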

>>> import chardet
>>> chardet.detect(open(r'E:\Workspaces\java\GCHMCreator\main\g2w\app\gchm\gui\ContentElement.java', 'rb').read())
{'confidence': 0.99, 'encoding': 'GB2312'}

The detect function returns a dictionary, in which encoding is the detected encoding and confidence is how confident the detector is in that result.
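Since confidence is only an estimate, it may be worth skipping files whose detection is uncertain. A minimal sketch, assuming a hypothetical helper trusted_encoding and an arbitrary threshold of 0.8 (neither comes from chardet or from the original script); it takes a dictionary shaped like the one detect returns:

```python
def trusted_encoding(result, min_confidence=0.8):
    """Return the detected encoding, or None if the detection is too uncertain.

    `result` is a dict shaped like the return value of chardet.detect().
    Both this helper and the 0.8 threshold are illustrative assumptions.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    # chardet reports encoding=None when it cannot make a guess.
    if encoding is None or confidence < min_confidence:
        return None
    return encoding


print(trusted_encoding({'confidence': 0.99, 'encoding': 'GB2312'}))
print(trusted_encoding({'confidence': 0.35, 'encoding': 'windows-1252'}))
```

Files for which this returns None could be listed for manual inspection instead of being converted automatically.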

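Here is what I need to do:

1. Walk through all the .java files in the project and detect each one's encoding
2. Back up each .java file as .java.bak so it can be restored
3. Convert each .java file from the detected encoding to UTF-8
4. Provide a restore tool that can bring back the original files if a conversion goes wrong
5. Provide a cleanup tool that removes the backup files once the conversions have been verified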
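Before changing anything, a read-only pass can show which files are affected. The sketch below is not part of the original script: find_non_utf8 is a hypothetical helper that lists the .java files whose bytes are not valid UTF-8. It does not use chardet, so it only tells you that a file is not UTF-8, not what it actually is, and a non-UTF-8 file could in rare cases happen to pass the check.

```python
import os


def find_non_utf8(top, suffix='.java'):
    """Yield paths under `top` ending in `suffix` whose bytes are not valid UTF-8.

    Hypothetical helper for a dry run; it never modifies any file.
    """
    for root, dirs, files in os.walk(top):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(root, name)
            with open(path, 'rb') as f:
                raw = f.read()
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError:
                yield path
```

Calling list(find_non_utf8('.')) from the project root gives the files that still need converting; pure ASCII files are valid UTF-8 and are not reported.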

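The key part is the conversion in the third item: use chardet to detect the file's encoding, source_encoding, decode the text content with source_encoding into unicode, and then use codecs to write the file out in the correct encoding.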
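The decode-then-encode idea can be seen on a small in-memory example. This is only an illustration with a hard-coded sample string and a known source encoding, not the detection logic of the script:

```python
# The text "编码" stored as GB2312 bytes, as it would be read from disk.
raw = '编码'.encode('gb2312')

# Step 1: decode with the source encoding to get a Unicode string.
text = raw.decode('gb2312')

# Step 2: encode the same text with the target encoding.
converted = text.encode('utf-8')

# GB2312 uses 2 bytes per Chinese character here, UTF-8 uses 3.
print(len(raw), len(converted))
```

The text is the same before and after; only its byte representation changes.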

Full code

# -*- coding: utf-8 -*-

import codecs
import os
import re
import shutil

import chardet


def convert_encoding(filename, target_encoding):
    # Back up the original file.
    shutil.copyfile(filename, filename + '.bak')

    # Read the raw bytes and let chardet guess the source encoding.
    with open(filename, 'rb') as f:
        content = f.read()
    source_encoding = chardet.detect(content)['encoding']
    print(source_encoding, filename)
    if source_encoding is None:
        raise ValueError('could not detect encoding')

    # Decode to unicode, then write the text back in the target encoding.
    content = content.decode(source_encoding)
    with codecs.open(filename, 'w', encoding=target_encoding) as f:
        f.write(content)


def main():
    for root, dirs, files in os.walk(os.getcwd()):
        for f in files:
            if f.lower().endswith('.java'):
                filename = os.path.join(root, f)
                try:
                    convert_encoding(filename, 'utf-8')
                except Exception as e:
                    # Report the file that failed and keep going.
                    print('failed:', filename, e)


def process_bak_files(action='restore'):
    # action='restore' moves each backup over its .java file;
    # action='clear' deletes the backups.
    for root, dirs, files in os.walk(os.getcwd()):
        for f in files:
            if f.lower().endswith('.java.bak'):
                source = os.path.join(root, f)
                target = os.path.join(root, re.sub(r'\.java\.bak$', '.java', f, flags=re.IGNORECASE))
                try:
                    if action == 'restore':
                        shutil.move(source, target)
                    elif action == 'clear':
                        os.remove(source)
                except Exception as e:
                    print('failed:', source, e)


if __name__ == '__main__':
    # process_bak_files(action='clear')
    main()
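
Before clearing the backups, one possible check is to compare the decoded text of a converted file with that of its backup. The sketch below is an addition, not part of the script above: same_text is a hypothetical helper, and it assumes you already know the source encoding (for example from the encoding the script prints for each file).

```python
def same_text(converted_path, backup_path, source_encoding, target_encoding='utf-8'):
    """Return True if the converted file and its backup decode to the same text.

    Hypothetical helper: `source_encoding` is assumed to be known, for
    example from the encoding printed during conversion.
    """
    with open(backup_path, 'rb') as f:
        original = f.read().decode(source_encoding)
    with open(converted_path, 'rb') as f:
        converted = f.read().decode(target_encoding)
    return original == converted
```

If this returns False for a file, restoring it from the backup is probably safer than clearing.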
