昨天一同事在线上做变更,为了保证主库的稳定性,先在备库把binlog关闭,然后在进行ddl变更,在通过切换ha,把备库切换为主库,在老的主库上做ddl变更
看上去这样做法没有太大的问题,但是当备库变更一做完,ha切换到备库,开始老主库变更的时候,备库就出现复制出现错误:
last_error: table definition on master and slave does not match: column 10 type mismatch – received type 3, dbname.table_name has type 8
skip_counter: 0
exec_master_log_pos: 1046252634
relay_log_space: 2910773181
until_condition: none
until_log_file:
until_log_pos: 0
master_ssl_allowed: no
master_ssl_ca_file:
master_ssl_ca_path:
master_ssl_cert:
master_ssl_cipher:
master_ssl_key:
seconds_behind_master: null
master_ssl_verify_server_cert: no
last_io_errno: 0
last_io_error:
last_sql_errno: 1535
last_sql_error: table definition on master and slave does not match: column 10 type mismatch – received type 3, dbname.table_name has type 8
1 row in set (0.00 sec)
<1>从这个错误上来看,是主备的表结构不一致导致的,但是之前的复制都是好好的,为什么做完变更后就会出现这个问题,应该是在ddl变更后导致的问题;
master:
mysql -uroot dbname -e “show create table table_name”>master.sql
slave:
mysql -uroot dbname -e “show create table table_name”>slave.sql
diff -u master.sql slave.sql没有找到两个表结构有什么大问题;
<2>查看出问题的数据字段:
[email protected] : dbname 17:46:35> desc table_name;
+—————-+———————+——+—–+——————-+—————————–+
| field | type | null | key | default | extra |
| url | varchar(333) | no | uni | null | |
| description | varchar(255) | yes | | null | |
| httpstatus | int(11) | yes | | null | |
| addtime | timestamp | no | | current_timestamp | on update current_timestamp |
| serverbanner | varchar(255) | yes | | null | |
| tasktag | varchar(255) | no | mul | null | |
| tasktag2 | varchar(255) | yes | | null | |
| domainname | varchar(255) | no | mul | null | |
| r_domainname | varchar(255) | yes | mul | null | |
| scantaskid | int(11) | yes | | null | |
| subtaskid | bigint(20) unsigned | no | pri | null | auto_increment |
| urlhash | varchar(32) | yes | uni | | |
| duplicatehash | varchar(32) | yes | | | |
| enable | tinyint(1) | yes | | 0 | |
| webappid | int(11) | yes | | null | |
| crc_domainname | int(10) unsigned | yes | mul | null | |
| wapscore | int(11) | yes | | 0 | |
| ip | varchar(45) | yes | | null | |
[email protected] : information_schema 14:59:40> select * from columns where table_schema=”dbname” and table_name=”table_name” and ordinal_position= 10\g;
*************************** 1. row ***************************
table_catalog: null
table_schema: dbname
table_name: table_name
column_name: scantaskid
ordinal_position: 10
column_default: null
is_nullable: yes
data_type: int
character_maximum_length: null
character_octet_length: null
numeric_precision: 10
numeric_scale: 0
character_set_name: null
collation_name: null
column_type: int(11)
column_key:
extra:
privileges: select,insert,update,references
column_comment:
[email protected] : information_schema 14:59:19> select * from information_schema.columns where table_schema=”dbname” and table_name=”table_name” and ordinal_position= 10\g;
查看数据字段,主备库还是一致的,这个时候好像到了穷途;
<3>查看同事昨天的ddl变更脚本,涉及到加字段,调整字段的长度,看上去很平常,
我们是先在备库做的变更,然后在到主库的变更,期间的binlog是关闭的,这时候,印风同学想到如果在备库变更的时候,主库的业务是没有停止的,
如果主库变更的数据同步到备库,备库的变更做完,主备已经不一致了,这样的话,就会造成复制失败了,看了看脚本中有字段长度调长的,这下就迎刃而解了;
问了一下b2b的plinux,他们只有加字段的时候,才放到备库上去做,其他的还是在主库上直接做的;
<4>刚才看到从 information_schema.columns 中查询有问题的列的时候,直接代入ordinal_position= 10得到的是scantaskid
字段,但出问题的字段是第11为字段(即我们调整长度的字段),所以binlog中是从0开始计算字段的位置的;
<5>5.5中报错显得更加人性了:
column 0 of table ‘test.t3’ cannot be converted from type ‘int’ to type ‘bigint(20)’;
<6>.那如何避免这样的问题喃,由于我们的库采用的是row模式,只要把slave的复制改为statement就可以了,将主库的binlog_format由row改为statement,这样达到备库的binlog就不会出现错误;