天天看点

PostgreSQL sql文件编码引起的数据导入乱码或查询字符集异常报错(invalid byte sequence)

背景

当用户客户端字符集与服务端字符集不匹配时,写入的多字节字符(例如中文)可能出现乱码。

例子

数据库字符集为sql_ascii,允许存储任意编码字符。

digoal@pg11-320tb-zfs-> psql

psql (11beta4)

Type "help" for help.

postgres=# l+

List of databases             
Name Owner Encoding Collate Ctype Access privileges Size Tablespace Description
postgres SQL_ASCII C en_US.utf8 140 TB pg_default default administrative connection database
template0 =c/postgres + 15 MB unmodifiable empty database
postgres=CTc/postgres
template1 default template for new databases

(3 rows)

客户端为utf8编码

digoal@pg11-320tb-zfs-> echo $LANG

en_US.utf8

编辑一个文件,以UTF8编码

vi test.sql

insert into tbl values (1, '你好');

内容如下

digoal@pg11-320tb-zfs-> cat test.sql

编码如下

digoal@pg11-320tb-zfs-> file test.sql

test.sql: UTF-8 Unicode text

转换为GBK,写入数据库

digoal@pg11-320tb-zfs-> iconv --help

Usage: iconv [OPTION...] [FILE...]

Convert encoding of given files from one encoding to another.

Input/Output format specification:

-f, --from-code=NAME encoding of original text

-t, --to-code=NAME encoding for output

Information:

-l, --list list all known coded character sets

Output control:

-c omit invalid characters from output

-o, --output=FILE output file

-s, --silent suppress warnings

--verbose              print progress information  
           

-?, --help Give this help list

--usage                Give a short usage message             

-V, --version Print program version

Mandatory or optional arguments to long options are also mandatory or optional

for any corresponding short options.

For bug reporting instructions, please see:

http://www.gnu.org/software/libc/bugs.html

.

digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql|psql -f -

INSERT 0 1

或者这样写入(client_encoding和server都设置为sql_ascii时,不检查编码合法性,直接存入数据库)

digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql -o test.sql.gbk

postgres=# set client_encoding =sql_ascii;

SET

postgres=# i ./test.sql.gbk

当设置客户端client_encoding为utf8编码时,由于存入的数据编码不合法,导致查询异常

postgres=# set client_encoding =utf8;

postgres=# select * from tbl;

ERROR: invalid byte sequence for encoding "UTF8": 0xc4 0xe3

当client_encoding设置为GBK编码,查询为乱码

postgres=# set client_encoding =gbk;

postgres=# select * from tbl;

id info
1 ?oí

参考

《PostgreSQL 多字节字符集合法性检测》

《[转] SqlServe到PG迁移错误:无效的编码序列"UTF8": 0x00》

《PostgreSQL UTF8 和 GB18030编码map文件不完整的问题》

《PostgreSQL WHY ERROR: invalid byte sequence for encoding "UTF8"》

《PostgreSQL SQL_ASCII encoding introduce》

《PostgreSQL Server Encoding sql_ascii attention》

转自阿里云德哥