Converting from UTF-16 to UTF-8 in Python 3

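In Python 3, two data types matter when you work with strings. The first is the str class, an object that represents Unicode code points. The important point is that a string is not bytes but a sequence of characters. The second is the bytes class, which is simply a sequence of bytes, often representing a string stored in some encoding (such as utf-8 or iso-8859-15).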
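To make the distinction concrete, here is a minimal sketch. The sample text is an illustrative value, not something from the question.

```python
# Illustrative value only; the sample text is an assumption.
text = "na\u00efve"               # str: a sequence of Unicode code points
data = text.encode("utf-8")       # bytes: the same text stored as UTF-8

print(type(text).__name__, len(text))   # length counted in characters
print(type(data).__name__, len(data))   # length counted in bytes
print(repr(text[2]), data[2])           # indexing str gives a character, indexing bytes gives an int
```

The two lengths differ because the accented character is one code point but takes two bytes in UTF-8.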

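What does this mean for you? As far as I understand, you want to read and write utf-8 files. Let's write a program that replaces every '?' character with a '?' character.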

def main():
    # Open the output file first. Passing an encoding tells Python that
    # whatever we print to the file should be encoded as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so that it returns
        # Unicode strings.
        with open('input_file', encoding='utf-8') as in_file:
            for line in in_file:
                # Replace the characters we want. A string literal in Python
                # is automatically a Unicode string, so there are no encoding
                # worries there. Because the output file was opened with the
                # utf-8 encoding, print() encodes the whole string to utf-8.
                # Each line keeps its own newline, hence end=''.
                print(line.replace('?', '?'), end='', file=out_file)

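So when should you use bytes? Not often. One example I can think of is when you read something from a socket. If you have that data in a bytes object, you can turn it into a Unicode string with bytes.decode('encoding'), and go the other way with str.encode('encoding'). But as said, you will probably not need it.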
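As a rough sketch of that round trip, the snippet below uses a hard-coded bytes literal in place of real socket data. The variable names and the sample payload are assumptions made for illustration.

```python
# Hypothetical chunk of UTF-8 bytes, standing in for what a socket read might return.
received = b"caf\xc3\xa9\n"

# bytes -> str: decode with the encoding the sender used.
text = received.decode("utf-8")
print(repr(text))

# str -> bytes: work on the text, then encode it again before sending it back.
payload = text.upper().encode("utf-8")
print(payload)
```

The string operations happen on text, and the encoding only matters at the two boundaries where bytes come in or go out.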

Still, because it is interesting, here is the hard way, where you encode everything yourself:

def main():
    # Open the file in binary mode, so we write bytes to it instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again we open in binary mode, so we get bytes.
        with open('input_file', 'rb') as in_file:
            for line_bytes in in_file:
                # Convert the bytes to a string.
                line_string = line_bytes.decode('utf-8')
                # Replace the characters we want.
                line_string = line_string.replace('?', '?')
                # Make the bytes to write.
                out_bytes = line_string.encode('utf-8')
                # Write the bytes. A binary file only accepts bytes, so we
                # call write() rather than print().
                out_file.write(out_bytes)

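(PS: as you can see, I have not mentioned utf-16 in this post. I actually don't know whether Python uses it as its internal encoding, but that is completely irrelevant. As long as you are working with strings, you work with characters (code points), not bytes.)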
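For the utf-16 input that the title asks about, the text-mode approach shown earlier should carry over by changing only the input encoding. The sketch below assumes the source file starts with a byte order mark, which Python's 'utf-16' codec uses to detect endianness. The function name and file names are hypothetical.

```python
import os
import tempfile


def convert_utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file as UTF-8, line by line (hypothetical helper)."""
    # newline='' on both sides leaves the original line endings untouched.
    with open(src_path, encoding='utf-16', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)


# Small self-contained demo using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'input_utf16.txt')
    dst_path = os.path.join(tmp, 'output_utf8.txt')

    # Create a UTF-16 sample file; the codec writes a byte order mark first.
    with open(src_path, 'w', encoding='utf-16', newline='') as f:
        f.write('h\u00e9llo\n')

    convert_utf16_to_utf8(src_path, dst_path)

    # Read the result back as raw bytes to inspect the actual encoding.
    with open(dst_path, 'rb') as f:
        result = f.read()
    print(result)
```

Inside the loop the program only ever sees str objects, so any replacement logic from the earlier examples could be dropped in unchanged.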