Implementing High Availability for MySQL Master-Slave Replication with MHA

Article link: https://blog.csdn.net/wzy0623/article/details/81304654

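I. Introduction to MHA

II. Experimental Architecture Design

1. Basic environment

2. Architecture design

III. Installing and Configuring MHA

1. Configure master-slave replication

2. Install Perl and other dependency modules

3. Configure passwordless SSH login

4. Install MHA Node

5. Install MHA Manager

6. Configure MHA

7. Create the related scripts

IV. Checking the MHA Configuration

1. Check the SSH configuration

2. Check the overall replication environment

3. Check the MHA Manager status

4. View the startup log

V. Functional Testing

1. Initial VIP binding

2. Test automatic failover

3. Test manual failover

4. Test online switchover

5. Repair the failed master

References: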

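MHA (Master High Availability) is currently a relatively mature solution for MySQL high availability. It was developed by Yoshinori Matsunobu of DeNA in Japan (who later joined Facebook) and is a well-regarded tool for failover and master promotion in MySQL high-availability environments. When a MySQL master fails, MHA can complete the database failover automatically within 0 to 30 seconds, and during the failover it preserves data consistency to the greatest extent possible, delivering high availability in the true sense.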

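The software consists of two parts: MHA Manager (the management node) and MHA Node (the data node). MHA Manager can be deployed on a dedicated machine to manage multiple master-slave clusters, or it can run on one of the slave nodes. MHA Node runs on every MySQL server. MHA Manager periodically probes the master of each cluster; when the master fails, it automatically promotes the slave with the most recent data to be the new master and then repoints all other slaves to it. The entire failover process is completely transparent to applications.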

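During automatic failover, MHA tries to save the binary logs from the crashed master so that as little data as possible is lost, but this is not always feasible. For example, if the master has a hardware failure or cannot be reached over SSH, MHA cannot save the binary logs; it then performs the failover anyway and the most recent data is lost. Semi-synchronous replication, available since MySQL 5.5, greatly reduces the risk of data loss, and MHA can be combined with it. Even if only one slave has received the latest binary log events, MHA can apply them to all other slaves, which keeps the data on all nodes consistent.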

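At present MHA mainly supports a one-master, multiple-slave topology. Building MHA requires at least three database servers in a replication cluster, one master and two slaves: one acts as the master, one as the standby master, and one as a plain slave. Because three servers are the minimum, Taobao modified MHA with machine cost in mind, and Taobao's TMHA already supports one master with one slave (source: 《深入淺出MySQL(第二版)》). At the code level MHA is just a set of Perl scripts, so given the technical strength of the Alibaba group, adapting it to support one master and one slave was presumably not difficult.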

Figure 1 shows the MHA architecture:

Figure 1

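MHA's working principle can be summarized as follows:

Save binary log events (binlog events) from the crashed master;
Identify the slave that has the most recent updates;
Apply the differential relay logs to the other slaves;
Apply the binary log events saved from the master;
Promote one slave to be the new master;
Point the other slaves at the new master and resume replication.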
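
The second step, identifying the slave with the most recent updates, comes down to comparing how far each slave has read in the master's binary log. The following is a minimal shell sketch of that comparison only, not MHA's actual code. It assumes a hypothetical input format of one line per slave, `host Master_Log_File Read_Master_Log_Pos` (values as shown by SHOW SLAVE STATUS), and binlog file names with zero-padded numeric suffixes of equal width. The positions in the example are made up.

```shell
#!/bin/bash
# Hypothetical helper: print the host whose I/O thread has read furthest
# into the master's binary log.
# Input (stdin): one line per slave, "host Master_Log_File Read_Master_Log_Pos".
pick_latest_slave() {
  # Sort by binlog file name, then numerically by position; the last line wins.
  LC_ALL=C sort -k2,2 -k3,3n | tail -n 1 | awk '{print $1}'
}

# Example with made-up positions for the two slaves in this setup.
printf '%s\n' \
  "172.16.1.126 mysql-bin.000001 560" \
  "172.16.1.125 mysql-bin.000001 120" | pick_latest_slave
```

With these sample values the function prints 172.16.1.126, the slave that would then serve as the source of the differential relay logs.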

Official introduction: https://code.google.com/archive/p/mysql-master-ha/

Operating system version: CentOS Linux release 7.2.1511 (Core)
MySQL version: 5.6.14
VIP (virtual IP): 172.16.1.100
Host information: see Table 1

Role	IP	Hostname	NIC	server_id	Function
Monitor Host	172.16.1.124	hdp1	-		Monitors the replication group
Master	172.16.1.127	hdp4	ens160	127	Serves write requests
Candidate Master	172.16.1.126	hdp3	ens32	126	Serves read requests
Slave	172.16.1.125	hdp2		125

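Table 1

The experimental architecture is shown in Figure 2.

Figure 2

hdp1 serves as the MHA Manager; the other three hosts form a MySQL one-master, two-slave replication cluster and act as MHA Nodes.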

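Configuring MySQL master-slave replication is fairly simple; see the official MySQL documentation for the detailed procedure, which is omitted here. For a brand-new replication setup, it is enough to enable the binlog on the master, run change master on the slave with the appropriate file and pos, and then run start slave. When adding a slave to a database that already exists and is in use, there are two approaches. One is to use mysqldump with the --master-data option to record the master's file and pos, but this may stall the database. A better way is to build the slave online with innobackupex, as follows:

(1) Prerequisites

Install the dependency packages on both master and slave: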

yum install perl perl-DBI perl-DBD-MySQL perl-IO-Socket-SSL perl-Time-HiRes

Install percona-xtrabackup on both master and slave.
Set the PATH environment variable, for example:

.:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/home/mysql/mysql-5.6.14/bin:/home/mysql/percona-xtrabackup-2.2.4-Linux-x86_64/bin:/home/mysql/bin

(2) Configure passwordless SSH from the master to the slave

On the master, run as the mysql user:

  ssh-keygen
  ... press Enter at every prompt ...
  ssh-copy-id <slave IP or hostname>

(3) Back up and transfer

For example, on the master, run as the mysql user:

innobackupex --user root --password 123456 --defaults-file=/home/mysql/mysql-5.6.14/my.cnf --no-lock --socket=/home/mysql/mysql-5.6.14/mysql.sock --port 3306 --stream=tar ./ | ssh [email protected] \ "cat - > /home/mysql/backup.tar"

(4) Restore the backup

On the slave, run as the mysql user:

  # Extract the archive
  tar -ixvf backup.tar -C /home/mysql/mysql-5.6.14/data
  # Apply the logs
  innobackupex --apply-log /home/mysql/mysql-5.6.14/data/
  # Check the binlog file name and position
  cat /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info
  # Edit my.cnf
  vi /etc/my.cnf
  # Start MySQL; the directories must match those on the master
  service mysql start
  mysql -uroot -p123456 -P3306 -h127.0.0.1
  # Configure replication
  reset master;
  reset slave all;
  change master to
  master_host='172.16.1.127',
  master_port=3306,
  master_user='repl',
  master_password='123456',
  master_log_file='mysql-bin.000001',
  master_log_pos=120;
  # master_log_file and master_log_pos take the values found in /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info.
  # Start the slave
  start slave;
  # Check the slave status
  show slave status\G
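
Copying the file name and position by hand from xtrabackup_binlog_info is easy to get wrong. As a hedged sketch, the statement can be generated from the file instead. The function name `change_master_sql` is made up for this illustration, and it assumes that the first two whitespace-separated fields of the file are the binlog file name and the position.

```shell
#!/bin/bash
# Hypothetical helper: build the CHANGE MASTER TO statement from
# xtrabackup_binlog_info (assumed layout: "<binlog file> <position> ...").
# Arguments: info file, master host, replication user, replication password.
change_master_sql() {
  local info_file=$1 host=$2 user=$3 password=$4
  local log_file="" log_pos=""
  # read returns non-zero on a missing final newline, so also accept
  # the case where the variables were filled anyway.
  read -r log_file log_pos _ < "$info_file" || [ -n "$log_file" ] || return 1
  cat <<EOF
change master to
master_host='$host',
master_port=3306,
master_user='$user',
master_password='$password',
master_log_file='$log_file',
master_log_pos=$log_pos;
EOF
}

# Possible usage on the slave (values taken from this article):
# change_master_sql /home/mysql/mysql-5.6.14/data/xtrabackup_binlog_info \
#   172.16.1.127 repl 123456 | mysql -uroot -p123456
```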

(5) Follow-up work

Back up my.cnf, the batch script files, crontab, and so on.

As root, perform the following on all four nodes.

  # Install the EPEL repository
  wget -O /etc/yum.repos.d/epel-7.repo http://mirrors.aliyun.com/repo/epel-7.repo
  # Install the dependency packages with yum
  yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
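
After the installation it can be worth confirming that Perl is actually able to load the modules. The sketch below is one possible check; the helper name `missing_perl_modules` is made up here, and the module names are the ones provided by the yum packages listed above.

```shell
#!/bin/bash
# Print every Perl module from the argument list that fails to load.
missing_perl_modules() {
  local mod
  for mod in "$@"; do
    perl -M"$mod" -e 1 >/dev/null 2>&1 || printf '%s\n' "$mod"
  done
}

# Modules provided by the packages installed above.
missing_perl_modules DBD::mysql Config::Tiny Log::Dispatch \
  Parallel::ForkManager Time::HiRes
```

An empty result means all five modules load; any name that is printed points at a package that still needs to be installed on that node.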

On hdp1 172.16.1.124 (Monitor), run as root:

  ssh-keygen -t rsa       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]

On hdp4 172.16.1.127 (Master), run as root:

  ssh-keygen -t rsa       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]

On hdp3 172.16.1.126 (slave1), run as root:

  ssh-keygen -t rsa       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]

On hdp2 172.16.1.125 (slave2), run as root:

  ssh-keygen -t rsa       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]       
       ssh-copy-id -i /root/.ssh/id_rsa.pub [email protected]
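
The pattern in the four blocks above is that every host pushes its public key to each MySQL node other than itself. A small sketch makes that rule explicit; `copy_targets` and `DB_NODES` are names invented for this illustration, and the function only prints the targets without contacting any host.

```shell
#!/bin/bash
# The three MySQL nodes from Table 1.
DB_NODES="172.16.1.127 172.16.1.126 172.16.1.125"

# Print the ssh-copy-id targets for the host given as $1:
# every MySQL node except the host itself.
copy_targets() {
  local self=$1 node
  for node in $DB_NODES; do
    [ "$node" = "$self" ] || printf '%s\n' "$node"
  done
}

# Dry run: the nodes that hdp1 (the monitor) must be able to reach.
copy_targets 172.16.1.124
```

Once the keys are in place, a non-interactive login such as `ssh -o BatchMode=yes root@<target> true` against each printed target is one way to confirm that no password prompt remains.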

Download address: https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads

On hdp2, hdp3, and hdp4, perform the following as root.

rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm

After installation, the following MHA-related files are present in /usr/bin/:

  apply_diff_relay_logs       
       filter_mysqlbinlog       
       purge_relay_logs       
       save_binary_logs

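These tools are normally invoked by the MHA Manager scripts and need no manual operation. Script descriptions:

apply_diff_relay_logs: identifies the differential relay log events and applies them to the other slaves.
filter_mysqlbinlog: strips unnecessary ROLLBACK events (MHA no longer uses this tool).
purge_relay_logs: purges relay logs (without blocking the SQL thread).
save_binary_logs: saves and copies the master's binary logs.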

On hdp1, perform the following as root.

rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm

  masterha_check_repl       
       masterha_check_ssh       
       masterha_check_status       
       masterha_conf_host       
       masterha_manager       
       masterha_master_monitor       
       masterha_master_switch       
       masterha_secondary_check       
       masterha_stop       
       apply_diff_relay_logs       
       filter_mysqlbinlog       
       purge_relay_logs       
       save_binary_logs

On hdp1, perform steps (1), (2), and (3) below as root.

(1) Create the configuration directory

mkdir -p /etc/masterha

(2) Create the configuration file /etc/masterha/app1.cnf with the following content:

  [server default]
  manager_log=/var/log/masterha/app1/manager.log
  manager_workdir=/var/log/masterha/app1.log
  master_binlog_dir=/data
  master_ip_failover_script=/usr/bin/master_ip_failover
  master_ip_online_change_script=/usr/bin/master_ip_online_change
  password=123456
  ping_interval=1
  remote_workdir=/tmp
  repl_password=123456
  repl_user=repl
  secondary_check_script=/usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306
  shutdown_script=""
  ssh_user=root
  user=root
  [server1]
  hostname=172.16.1.127
  port=3306
  [server2]
  candidate_master=1
  check_repl_delay=0
  hostname=172.16.1.126
  port=3306
  [server3]
  hostname=172.16.1.125
  port=3306

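The server default section holds the basic manager parameters; server1, server2, and server3 correspond to the master, the first slave, and the second slave in the replication setup. The syntax of this file is strict: do not leave extra spaces after a value. The main options are described below.

manager_log: the manager's log file.
manager_workdir: the manager's working directory.
master_binlog_dir: the location where the master stores its binlogs, so that MHA can find the master's logs; here it is the MySQL data directory.
master_ip_failover_script: the switch script used during automatic failover.
master_ip_online_change_script: the switch script used during a manual (online) switchover.
password: the password of the MySQL root user.
ping_interval: the interval at which ping packets are sent to monitor the master. The default is 3 seconds; after three attempts without a response, failover is triggered automatically.
remote_workdir: the location on the remote MySQL hosts where binlogs are saved during a switch.
repl_password: the password of the replication user.
repl_user: the name of the replication user in the replication environment.
secondary_check_script: if monitoring from the MHA Manager to hdp4 runs into a problem, the MHA Manager tries to log in to hdp4 from hdp3.
shutdown_script: the script that shuts down the failed host after a failure. Its main purpose is to power off the host to prevent split-brain; it is not used here.
ssh_user: the SSH login user.
user: the monitoring user, set to root.
candidate_master: marks the host as a candidate master. With this option set, this slave is promoted to master after a failover even if it does not hold the latest events in the cluster.
check_repl_delay: by default, MHA does not choose a slave as the new master if it is more than 100 MB of relay logs behind the master, because recovering such a slave takes a long time. With check_repl_delay=0, MHA ignores replication delay when selecting the new master. This option is very useful for a host with candidate_master=1, because it guarantees that the candidate becomes the new master during the switch.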
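
Because stray spaces after a value are easy to miss in an editor, a quick check of the file before starting the manager can help. The following is a minimal sketch; the function name `check_trailing_ws` is made up for this illustration.

```shell
#!/bin/bash
# Report lines in a config file that end with spaces or tabs.
# Prints the offending lines with their line numbers.
# Returns 0 when the file is clean, 1 when offending lines were found.
check_trailing_ws() {
  local conf=$1
  if grep -nE '[[:space:]]+$' "$conf"; then
    return 1
  fi
  return 0
}

# Possible usage on hdp1:
# check_trailing_ws /etc/masterha/app1.cnf && echo "no trailing whitespace"
```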

(3) Create symbolic links

  ln -s /home/mysql/mysql-5.6.14/bin/mysqlbinlog /usr/bin/mysqlbinlog       
       ln -s /home/mysql/mysql-5.6.14/bin/mysql /usr/bin/mysql

(4) Set the relay_log_purge parameter on the slaves

On hdp3 and hdp2, run as the mysql user:

mysql -uroot -p123456 -e "set global relay_log_purge=0"

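Note that during a switch, slave recovery in MHA depends on relay log information, so automatic relay log purging must be set to OFF and the relay logs purged manually instead. By default, a slave's relay logs are deleted automatically once the SQL thread has finished executing them. In an MHA environment, however, these relay logs may be needed to recover other slaves, so automatic deletion has to be disabled. Periodic purging must take replication delay into account: on an ext3 filesystem, deleting a large file takes some time and can cause serious replication delay. To avoid this, a hard link to the relay log is created temporarily, because on Linux deleting a large file through a hard link is fast. (In MySQL, the same hard-link approach is commonly used when dropping large tables.)

(1) Create the script that purges relay logs periodically

On the two slaves hdp3 and hdp2, create the file /root/purge_relay_log.sh with the following content: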
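
The hard-link idea can be tried out without MySQL. The sketch below uses a scratch directory and a tiny stand-in file (the file names are made up): once a second name exists for the same file, removing the original name only drops a directory entry, and the data is released later when the last name is removed.

```shell
#!/bin/bash
# Demonstrate the hard-link trick in a scratch directory (no MySQL involved).
hardlink_demo() {
  local demo_dir relay link
  demo_dir=$(mktemp -d) || return 1
  relay="$demo_dir/relay-bin.000001"      # stand-in for a large relay log
  link="$demo_dir/relay-bin.000001.link"  # hard link on the same filesystem

  printf 'dummy relay log contents\n' > "$relay"
  ln "$relay" "$link"   # second name for the same inode

  rm -f "$relay"        # drops one name only; the data stays allocated
  cat "$link"           # contents are still reachable through the link

  rm -f "$link"         # last name removed: the data is released here
  rmdir "$demo_dir"
}

hardlink_demo
```

In the purge procedure the fast step is the one that corresponds to the first `rm`, which is the step the SQL thread has to wait for; the slow release of the blocks happens outside of it.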

  #!/bin/bash       
           . /home/mysql/.bashrc       
           user=root       
       passwd=123456       
       port=3306       
       log_dir='/data'       
       work_dir='/data'       
       purge='/usr/bin/purge_relay_logs'       
           if [ ! -d $log_dir ]       
       then       
          mkdir $log_dir -p       
       fi       
           $purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

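purge_relay_logs parameters:

user: MySQL user name.
password: MySQL user password.
port: MySQL port number.
workdir: the location where the hard links to the relay logs are created; the default is /var/tmp. Because creating a hard link across different partitions fails, the location must be specified explicitly. After the script completes successfully, the hard-linked relay log files are deleted.
disable_relay_log_purge: by default, if relay_log_purge=1 the script purges nothing and exits. With this option, when relay_log_purge=1 the script sets relay_log_purge to 0, purges the relay logs, and finally leaves the parameter set to OFF.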
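
Since hard links cannot cross filesystems, it may help to confirm that the chosen workdir and the relay log directory sit on the same one before scheduling the script. A minimal sketch, assuming GNU stat as shipped with CentOS; the function name `same_filesystem` is made up for this illustration.

```shell
#!/bin/bash
# Return 0 if both paths are on the same filesystem (same device number),
# 1 if they are not, and 2 if either path cannot be examined.
# Relies on GNU stat (-c %d prints the device number).
same_filesystem() {
  local a b
  a=$(stat -c %d "$1") || return 2
  b=$(stat -c %d "$2") || return 2
  [ "$a" = "$b" ]
}

# Possible usage on a slave, comparing the default workdir with the
# relay log directory used in this article:
# same_filesystem /var/tmp /data || echo "use --workdir on the /data filesystem"
```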

Make it executable:

chmod 755 purge_relay_log.sh

Run /root/purge_relay_log.sh manually; the console output is:

  2018-07-31 12:45:20: purge_relay_logs script started.       
        Found relay_log.info: /data/relay-log.info       
        Opening /data/hdp2-relay-bin.000001 ..       
        Opening /data/hdp2-relay-bin.000002 ..       
        Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log       
        files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.       
       2018-07-31 12:45:23: All relay log purging operations succeeded.

Add it to crontab:

0 4 * * * /bin/bash /root/purge_relay_log.sh

(2) Create the automatic failover script

On hdp1, create the file /usr/bin/master_ip_failover with the following content:

  #!/usr/bin/env perl       
       use strict;       
       use warnings FATAL => 'all';       
           use Getopt::Long;       
           my (       
           $command,          $ssh_user,        $orig_master_host, $orig_master_ip,       
           $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port       
       );       
           my $vip = '172.16.1.100';  # Virtual IP        
       my $key = "1";        
       my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";       
       my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";       
           GetOptions(       
           'command=s'          => \$command,       
           'ssh_user=s'         => \$ssh_user,       
           'orig_master_host=s' => \$orig_master_host,       
           'orig_master_ip=s'   => \$orig_master_ip,       
           'orig_master_port=i' => \$orig_master_port,       
           'new_master_host=s'  => \$new_master_host,       
           'new_master_ip=s'    => \$new_master_ip,       
           'new_master_port=i'  => \$new_master_port,       
       );       
           exit &main();       
           sub main {       
               print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";        
               if ( $command eq "stop" || $command eq "stopssh" ) {       
                   # $orig_master_host, $orig_master_ip, $orig_master_port are passed.       
               # If you manage master ip address at global catalog database,       
               # invalidate orig_master_ip here.       
               my $exit_code = 1;       
               eval {       
                   print "Disabling the VIP on old master: $orig_master_host \n";       
                   &stop_vip();       
                   $exit_code = 0;       
               };       
               if ($@) {       
                   warn "Got Error: $@\n";       
                   exit $exit_code;       
               }       
               exit $exit_code;       
           }       
           elsif ( $command eq "start" ) {       
                   # all arguments are passed.       
               # If you manage master ip address at global catalog database,       
               # activate new_master_ip here.       
               # You can also grant write access (create user, set read_only=0, etc) here.       
               my $exit_code = 10;       
               eval {       
                   print "Enabling the VIP - $vip on the new master - $new_master_host \n";       
                   &start_vip();       
                   $exit_code = 0;       
               };       
               if ($@) {       
                   warn $@;       
                   exit $exit_code;       
               }       
               exit $exit_code;       
           }       
           elsif ( $command eq "status" ) {       
               print "Checking the Status of the script.. OK \n";        
               `ssh $ssh_user\@$orig_master_host \" $ssh_start_vip \"`;       
               exit 0;       
           }       
           else {       
               &usage();       
               exit 1;       
           }       
       }       
           # A simple system call that enable the VIP on the new master        
       sub start_vip() {       
           `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;       
       }       
       # A simple system call that disable the VIP on the old_master       
       sub stop_vip() {       
           `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;       
       }       
           sub usage {       
           print       
           "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";       
       }

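Pay attention to the part of the script that moves the VIP.

(3) Create the manual failover script

On hdp1, create the file /usr/bin/master_ip_online_change with the following content: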
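
The VIP handling boils down to two ifconfig commands on an alias interface named `<NIC>:<key>`. The sketch below only prints those commands so the naming is easy to see; the helper names `vip_start_cmd` and `vip_stop_cmd` are made up for this illustration. In the Perl script the interface names are hard-coded for the hdp4 (ens160) to hdp3 (ens32) direction.

```shell
#!/bin/bash
VIP=172.16.1.100   # virtual IP from the environment description
KEY=1              # alias number, giving interfaces such as ens32:1

# Print the ifconfig command that brings the VIP up on interface $1.
vip_start_cmd() { printf '/sbin/ifconfig %s:%s %s\n' "$1" "$KEY" "$VIP"; }

# Print the ifconfig command that takes the VIP alias on interface $1 down.
vip_stop_cmd() { printf '/sbin/ifconfig %s:%s down\n' "$1" "$KEY"; }

vip_stop_cmd ens160   # to be run on the old master hdp4
vip_start_cmd ens32   # to be run on the new master hdp3
```

Because the NIC names differ between hdp4 and hdp3 in this environment, switching in the opposite direction would need the two interface names swapped.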

  #!/usr/bin/env perl       
           ## Note: This is a sample script and is not complete. Modify the script based on your environment.
           use strict;       
       use warnings FATAL => 'all';       
           use Getopt::Long;       
       use MHA::DBHelper;       
       use MHA::NodeUtil;       
       # use Time::HiRes qw( sleep gettimeofdaytv_interval );       
       use Time::HiRes qw(sleep gettimeofday tv_interval);       
       use Data::Dumper;       
           my $_tstart;       
       my $_running_interval = 0.1;       
       my (       
        $command,         $orig_master_host, $orig_master_ip,       
        $orig_master_port, $orig_master_user,       
        $new_master_host, $new_master_ip,   $new_master_port,       
        $new_master_user,        
       );       
           my $vip = '172.16.1.100';  # Virtual IP        
       my $key = "1";        
       my $ssh_start_vip = "/sbin/ifconfig ens32:$key $vip";       
       my $ssh_stop_vip = "/sbin/ifconfig ens160:$key down";       
       my $ssh_user = "root";       
       my $new_master_password = "123456";       
       my $orig_master_password = "123456";       
           GetOptions(       
        'command=s'              =>\$command,       
        #'ssh_user=s'             => \$ssh_user,        
        'orig_master_host=s'     =>\$orig_master_host,       
        'orig_master_ip=s'       =>\$orig_master_ip,       
        'orig_master_port=i'     =>\$orig_master_port,       
        'orig_master_user=s'     =>\$orig_master_user,       
        #'orig_master_password=s' => \$orig_master_password,       
        'new_master_host=s'      =>\$new_master_host,       
        'new_master_ip=s'        =>\$new_master_ip,       
        'new_master_port=i'      =>\$new_master_port,       
        'new_master_user=s'      =>\$new_master_user,       
        #'new_master_password=s'  =>\$new_master_password,       
       );       
           exit &main();       
           sub current_time_us {       
         my ($sec, $microsec ) = gettimeofday();       
         my$curdate = localtime($sec);       
        return $curdate . " " . sprintf( "%06d", $microsec);       
       }       
           sub sleep_until {       
         my$elapsed = tv_interval($_tstart);       
         if ($_running_interval > $elapsed ) {       
          sleep( $_running_interval - $elapsed );       
         }       
       }       
           sub get_threads_util {       
         my$dbh                    = shift;       
         my$my_connection_id       = shift;       
         my$running_time_threshold = shift;       
         my$type                   = shift;       
        $running_time_threshold = 0 unless ($running_time_threshold);       
        $type                   = 0 unless($type);       
         my@threads;       
             my$sth = $dbh->prepare("SHOW PROCESSLIST");       
        $sth->execute();       
            while ( my $ref = $sth->fetchrow_hashref() ) {       
           my$id         = $ref->{Id};       
           my$user       = $ref->{User};       
           my$host       = $ref->{Host};       
           my$command    = $ref->{Command};       
           my$state      = $ref->{State};       
           my$query_time = $ref->{Time};       
           my$info       = $ref->{Info};       
          $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);       
          next if ( $my_connection_id == $id );       
          next if ( defined($query_time) && $query_time <$running_time_threshold );       
          next if ( defined($command)   && $command eq "Binlog Dump" );       
          next if ( defined($user)      && $user eq "system user" );       
          next       
            if ( defined($command)       
            && $command eq "Sleep"       
            && defined($query_time)       
            && $query_time >= 1 );       
               if( $type >= 1 ) {       
            next if ( defined($command) && $command eq "Sleep" );       
             next if ( defined($command) && $command eq "Connect" );
           }       
               if( $type >= 2 ) {       
            next if ( defined($info) && $info =~ m/^select/i );       
            next if ( defined($info) && $info =~ m/^show/i );       
           }       
              push @threads, $ref;       
         }       
        return @threads;       
       }       
           sub main {       
         if ($command eq "stop" ) {       
           ##Gracefully killing connections on the current master       
           #1. Set read_only= 1 on the new master       
           #2. DROP USER so that no app user can establish new connections       
           #3. Set read_only= 1 on the current master       
           #4. Kill current queries       
           #* Any database access failure will result in script die.       
           my$exit_code = 1;       
          eval {       
            ## Setting read_only=1 on the new master (to avoid accident)       
            my $new_master_handler = new MHA::DBHelper();       
                # args: hostname, port, user, password, raise_error(die_on_error)_or_not       
            $new_master_handler->connect( $new_master_ip, $new_master_port,       
              $new_master_user, $new_master_password, 1 );       
            print current_time_us() . " Set read_only on the new master..";       
            $new_master_handler->enable_read_only();       
            if ( $new_master_handler->is_read_only() ) {       
              print "ok.\n";       
            }       
            else {       
              die "Failed!\n";       
            }       
            $new_master_handler->disconnect();       
                # Connecting to the orig master, die if any database error happens       
            my $orig_master_handler = new MHA::DBHelper();       
            $orig_master_handler->connect( $orig_master_ip, $orig_master_port,       
              $orig_master_user, $orig_master_password, 1 );       
                 ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
             #$orig_master_handler->disable_log_bin_local();
             #print current_time_us() . " Dropping app user on the orig master..\n";
            #FIXME_xxx_drop_app_user($orig_master_handler);       
                ## Waiting for N * 100 milliseconds so that current connections can exit       
            my $time_until_read_only = 15;       
            $_tstart = [gettimeofday];       
            my @threads = get_threads_util( $orig_master_handler->{dbh},       
              $orig_master_handler->{connection_id} );       
            while ( $time_until_read_only > 0 && $#threads >= 0 ) {       
              if ( $time_until_read_only % 5 == 0 ) {       
                 printf "%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
                  current_time_us(), $#threads + 1, $time_until_read_only * 100;       
                if ( $#threads < 5 ) {       
                  print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump ."\n"       
                    foreach (@threads);       
                }       
              }       
              sleep_until();       
              $_tstart = [gettimeofday];       
              $time_until_read_only--;       
              @threads = get_threads_util( $orig_master_handler->{dbh},       
                $orig_master_handler->{connection_id} );       
            }       
                 ## Setting read_only=1 on the current master so that nobody (except SUPER) can write
            print current_time_us() . " Set read_only=1 on the orig master..";       
            $orig_master_handler->enable_read_only();       
            if ( $orig_master_handler->is_read_only() ) {       
              print "ok.\n";       
            }       
            else {       
              die "Failed!\n";       
            }       
                 ## Waiting for M * 100 milliseconds so that current update queries can complete
            my $time_until_kill_threads = 5;       
            @threads = get_threads_util( $orig_master_handler->{dbh},       
              $orig_master_handler->{connection_id} );       
            while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {       
              if ( $time_until_kill_threads % 5 == 0 ) {       
                 printf "%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
                  current_time_us(), $#threads + 1, $time_until_kill_threads * 100;       
                if ( $#threads < 5 ) {       
                  print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump ."\n"       
                    foreach (@threads);       
                }       
              }       
              sleep_until();       
              $_tstart = [gettimeofday];       
              $time_until_kill_threads--;       
              @threads = get_threads_util( $orig_master_handler->{dbh},       
                $orig_master_handler->{connection_id} );       
            }       
                           print "Disabling the VIP on old master: $orig_master_host \n";
                       &stop_vip();           
                ## Terminating all threads       
             print current_time_us() . " Killing all application threads..\n";
            $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0);       
            print current_time_us() . " done.\n";       
            #$orig_master_handler->enable_log_bin_local();       
            $orig_master_handler->disconnect();       
                ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK       
            $exit_code = 0;       
           };       
           if($@) {       
            warn "Got Error: $@\n";       
            exit $exit_code;       
           }       
          exit $exit_code;       
         }       
        elsif ( $command eq "start" ) {       
           ##Activating master ip on the new master       
           #1. Create app user with write privileges       
           #2. Moving backup script if needed       
           #3. Register new master's ip to the catalog database       
            # We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
            # If exit code is 0 or 10, MHA does not abort
           my$exit_code = 10;       
           eval{       
            my $new_master_handler = new MHA::DBHelper();       
                # args: hostname, port, user, password, raise_error_or_not       
            $new_master_handler->connect( $new_master_ip, $new_master_port,       
              $new_master_user, $new_master_password, 1 );       
                ## Set read_only=0 on the new master       
            #$new_master_handler->disable_log_bin_local();       
             print current_time_us() . " Set read_only=0 on the new master.\n";
            $new_master_handler->disable_read_only();       
                ## Creating an app user on the new master       
             #print current_time_us() . " Creating app user on the new master..\n";
            #FIXME_xxx_create_app_user($new_master_handler);       
            #$new_master_handler->enable_log_bin_local();       
            $new_master_handler->disconnect();       
                ## Update master ip on the catalog database, etc       
                        print "Enabling the VIP - $vip on the new master - $new_master_host \n";
                       &start_vip();       
                       $exit_code = 0;       
           };       
           if($@) {       
            warn "Got Error: $@\n";       
            exit $exit_code;       
           }       
          exit $exit_code;       
         }       
        elsif ( $command eq "status" ) {       
               #do nothing       
          exit 0;       
         }       
         else{       
          &usage();       
          exit 1;       
         }       
       }       
           # A simple system call that enables the VIP on the new master
       sub start_vip() {       
          `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;       
       }       
       # A simple system call that disables the VIP on the old_master
       sub stop_vip() {       
          `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;       
       }       
           sub usage {       
         print
       "Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
         die;       
       }
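
The two waiting phases in the stop branch (up to 15 ticks before setting read_only, then up to 5 ticks before killing threads, at 100 ms per tick) follow the same pattern: poll a condition at a fixed interval and give up after a bounded number of tries. Below is a hedged shell sketch of that pattern only; `wait_until` is a name made up for this illustration and is not part of MHA.

```shell
#!/bin/bash
# Poll a check command every 0.1 s until it succeeds or max_ticks is used up.
# Arguments: max_ticks, then the command (with arguments) to run as the check.
# Returns 0 if the check succeeded in time, 1 otherwise.
wait_until() {
  local max_ticks=$1
  shift
  while [ "$max_ticks" -gt 0 ]; do
    "$@" && return 0
    sleep 0.1
    max_ticks=$((max_ticks - 1))
  done
  return 1
}

# Trivial demonstration: "true" succeeds on the first try.
wait_until 15 true && echo "condition met"
```

In terms of the script, the first phase would correspond to something like `wait_until 15 no_app_threads`, where `no_app_threads` stands for a hypothetical check that the process list contains no application connections. Unlike this sketch, the Perl script proceeds to the next step whether or not the wait succeeded.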

Operate on hdp1 as root.

  [root@hdp1~]#masterha_check_ssh --conf=/etc/masterha/app1.cnf       
       Tue Jul 31 12:50:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.       
       Tue Jul 31 12:50:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..       
       Tue Jul 31 12:50:22 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..       
       Tue Jul 31 12:50:22 2018 - [info] Starting SSH connection tests..       
       Tue Jul 31 12:50:23 2018 - [debug]        
       Tue Jul 31 12:50:22 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.127:22) to [email protected](172.16.1.126:22)..       
       Tue Jul 31 12:50:22 2018 - [debug]   ok.       
       Tue Jul 31 12:50:22 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.127:22) to [email protected](172.16.1.125:22)..       
       Tue Jul 31 12:50:23 2018 - [debug]   ok.       
       Tue Jul 31 12:50:24 2018 - [debug]        
       Tue Jul 31 12:50:23 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.126:22) to [email protected](172.16.1.127:22)..       
       Tue Jul 31 12:50:23 2018 - [debug]   ok.       
       Tue Jul 31 12:50:23 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.126:22) to [email protected](172.16.1.125:22)..       
       Tue Jul 31 12:50:23 2018 - [debug]   ok.       
       Tue Jul 31 12:50:25 2018 - [debug]        
       Tue Jul 31 12:50:23 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.125:22) to [email protected](172.16.1.127:22)..       
       Tue Jul 31 12:50:23 2018 - [debug]   ok.       
       Tue Jul 31 12:50:23 2018 - [debug]  Connecting via SSH from [email protected](172.16.1.125:22) to [email protected](172.16.1.126:22)..       
       Tue Jul 31 12:50:24 2018 - [debug]   ok.       
       Tue Jul 31 12:50:25 2018 - [info] All SSH connection tests passed successfully.       
       [root@hdp1~]#

As shown, SSH verification is OK between all nodes.
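
When the check is run from a script rather than read by eye, one simple approach is to look for the final success line shown in the output above. A minimal sketch; the function name `ssh_check_passed` is made up for this illustration.

```shell
#!/bin/bash
# Read masterha_check_ssh output on stdin; succeed only if the final
# success message from the log above is present.
ssh_check_passed() {
  grep -q 'All SSH connection tests passed successfully'
}

# Possible usage on hdp1:
# masterha_check_ssh --conf=/etc/masterha/app1.cnf 2>&1 | ssh_check_passed \
#   && echo "SSH check OK"
```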

  [root@hdp1~]#masterha_check_repl --conf=/etc/masterha/app1.cnf       
       Tue Jul 31 12:52:19 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.       
       Tue Jul 31 12:52:19 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..       
       Tue Jul 31 12:52:19 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..       
       Tue Jul 31 12:52:19 2018 - [info] MHA::MasterMonitor version 0.56.       
       Tue Jul 31 12:52:21 2018 - [info] GTID failover mode = 0       
       Tue Jul 31 12:52:21 2018 - [info] Dead Servers:       
       Tue Jul 31 12:52:21 2018 - [info] Alive Servers:       
       Tue Jul 31 12:52:21 2018 - [info]   172.16.1.127(172.16.1.127:3306)       
       Tue Jul 31 12:52:21 2018 - [info]   172.16.1.126(172.16.1.126:3306)       
       Tue Jul 31 12:52:21 2018 - [info]   172.16.1.125(172.16.1.125:3306)       
       Tue Jul 31 12:52:21 2018 - [info] Alive Slaves:       
       Tue Jul 31 12:52:21 2018 - [info]   172.16.1.126(172.16.1.126:3306)  Version=5.6.14-log (oldest major version between slaves) log-bin:enabled       
       Tue Jul 31 12:52:21 2018 - [info]     Replicating from 172.16.1.127(172.16.1.127:3306)       
       Tue Jul 31 12:52:21 2018 - [info]     Primary candidate for the new Master (candidate_master is set)       
       Tue Jul 31 12:52:21 2018 - [info]   172.16.1.125(172.16.1.125:3306)  Version=5.6.14-log (oldest major version between slaves) log-bin:enabled       
       Tue Jul 31 12:52:21 2018 - [info]     Replicating from 172.16.1.127(172.16.1.127:3306)       
       Tue Jul 31 12:52:21 2018 - [info] Current Alive Master: 172.16.1.127(172.16.1.127:3306)       
       Tue Jul 31 12:52:21 2018 - [info] Checking slave configurations..       
       Tue Jul 31 12:52:21 2018 - [info]  read_only=1 is not set on slave 172.16.1.126(172.16.1.126:3306).       
       Tue Jul 31 12:52:21 2018 - [info] Checking replication filtering settings..       
       Tue Jul 31 12:52:21 2018 - [info]  binlog_do_db= , binlog_ignore_db=        
       Tue Jul 31 12:52:21 2018 - [info]  Replication filtering check ok.       
       Tue Jul 31 12:52:21 2018 - [info] GTID (with auto-pos) is not supported       
       Tue Jul 31 12:52:21 2018 - [info] Starting SSH connection tests..       
       Tue Jul 31 12:52:23 2018 - [info] All SSH connection tests passed successfully.       
       Tue Jul 31 12:52:23 2018 - [info] Checking MHA Node version..       
       Tue Jul 31 12:52:24 2018 - [info]  Version check ok.       
       Tue Jul 31 12:52:24 2018 - [info] Checking SSH publickey authentication settings on the current master..       
       Tue Jul 31 12:52:24 2018 - [info] HealthCheck: SSH to 172.16.1.127 is reachable.       
       Tue Jul 31 12:52:24 2018 - [info] Master MHA Node version is 0.56.       
       Tue Jul 31 12:52:24 2018 - [info] Checking recovery script configurations on 172.16.1.127(172.16.1.127:3306)..       
       Tue Jul 31 12:52:24 2018 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000001        
       Tue Jul 31 12:52:24 2018 - [info]   Connecting to root@172.16.1.127(172.16.1.127:22)..
         Creating /tmp if not exists..    ok.       
         Checking output directory is accessible or not..       
          ok.       
         Binlog found at /data, up to mysql-bin.000001       
       Tue Jul 31 12:52:25 2018 - [info] Binlog setting check done.       
       Tue Jul 31 12:52:25 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..       
       Tue Jul 31 12:52:25 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.126 --slave_ip=172.16.1.126 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info  --relay_dir=/data/  --slave_pass=xxx       
       Tue Jul 31 12:52:25 2018 - [info]   Connecting to root@172.16.1.126(172.16.1.126:22)..
         Checking slave recovery environment settings..       
           Opening /data/relay-log.info ... ok.       
           Relay log found at /data, up to hdp3-relay-bin.000003       
           Temporary relay log file is /data/hdp3-relay-bin.000003       
           Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.       
        done.       
           Testing mysqlbinlog output.. done.       
           Cleaning up test file(s).. done.       
       Tue Jul 31 12:52:25 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=172.16.1.125 --slave_ip=172.16.1.125 --slave_port=3306 --workdir=/tmp --target_version=5.6.14-log --manager_version=0.56 --relay_log_info=/data/relay-log.info  --relay_dir=/data/  --slave_pass=xxx       
       Tue Jul 31 12:52:25 2018 - [info]   Connecting to root@172.16.1.125(172.16.1.125:22)..
         Checking slave recovery environment settings..       
           Opening /data/relay-log.info ... ok.       
           Relay log found at /data, up to hdp2-relay-bin.000003       
           Temporary relay log file is /data/hdp2-relay-bin.000003       
           Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.       
        done.       
           Testing mysqlbinlog output.. done.       
           Cleaning up test file(s).. done.       
       Tue Jul 31 12:52:25 2018 - [info] Slaves settings check done.       
       Tue Jul 31 12:52:25 2018 - [info]        
       172.16.1.127(172.16.1.127:3306) (current master)       
        +--172.16.1.126(172.16.1.126:3306)       
        +--172.16.1.125(172.16.1.125:3306)       
           Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.126..       
       Tue Jul 31 12:52:25 2018 - [info]  ok.       
       Tue Jul 31 12:52:25 2018 - [info] Checking replication health on 172.16.1.125..       
       Tue Jul 31 12:52:25 2018 - [info]  ok.       
       Tue Jul 31 12:52:25 2018 - [info] Checking master_ip_failover_script status:       
       Tue Jul 31 12:52:25 2018 - [info]   /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306        
               IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===       
           Checking the Status of the script.. OK        
       SIOCSIFADDR: No such device       
       ens32:1: ERROR while getting interface flags: No such device       
       Tue Jul 31 12:52:25 2018 - [info]  OK.       
       Tue Jul 31 12:52:25 2018 - [warning] shutdown_script is not defined.       
       Tue Jul 31 12:52:25 2018 - [info] Got exit code 0 (Not master dead).       
           MySQL Replication Health is OK.

There are no obvious errors, only a few warnings, and replication is reported as healthy.

  [root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf       
       app1 is stopped(2:NOT_RUNNING).       
       [root@hdp1~]#

The status is "NOT_RUNNING", which means MHA monitoring has not been started. Run the following commands to start MHA in the background.

  mkdir -p  /var/log/masterha/app1/       
       nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

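Startup parameters:

remove_dead_master_conf: after a master failover, the IP of the old master is removed from the configuration file.
manager_log: where the log is stored.
ignore_last_failover: by default, if MHA detects two consecutive failures less than 8 hours apart, it does not perform a failover; the restriction exists to avoid a ping-pong effect. This parameter tells MHA to ignore the file left behind by the previous failover. By default, after a failover MHA creates the file app1.failover.complete in the log directory (the /data set above); if that file is still there when the next failover is attempted, the failover is refused unless the file was deleted manually after the first one. For convenience, --ignore_last_failover is set here.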
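The 8-hour guard described above can be sketched as a small shell function. This only illustrates the rule as stated in the text, it is not MHA's own code; the function name `failover_allowed` and the idea of passing timestamps as arguments are assumptions made for the example.

```shell
# Sketch of the "last failover" guard described above (not MHA's real code).
# Hypothetical helper: failover_allowed LAST_FAILOVER_EPOCH NOW_EPOCH [IGNORE]
#   LAST_FAILOVER_EPOCH  time of the previous failover in epoch seconds,
#                        or "" when no app1.failover.complete file exists
#   IGNORE               "yes" mimics --ignore_last_failover
MIN_GAP_SECONDS=$((8 * 60 * 60))   # 8 hours, as stated in the text

failover_allowed() {
    last=$1
    now=$2
    ignore=${3:-no}
    # No marker file, or the guard is explicitly ignored: allow the failover.
    if [ -z "$last" ] || [ "$ignore" = "yes" ]; then
        echo allowed
        return 0
    fi
    # Otherwise allow it only when the previous failover is old enough.
    if [ $((now - last)) -ge "$MIN_GAP_SECONDS" ]; then
        echo allowed
    else
        echo blocked
    fi
}

# Example: a failover one hour after the previous one.
failover_allowed 0 3600        # prints "blocked"
failover_allowed 0 3600 yes    # prints "allowed"
```

Passing the timestamps in explicitly keeps the sketch independent of where the marker file actually lives on a given installation.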

Check the MHA Manager status again:

  [root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf       
       app1 (pid:298237) is running(0:PING_OK), master:172.16.1.127       
       [root@hdp1~]#

Monitoring is now running, and the master host is 172.16.1.127.
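For scripted checks, the two status lines shown so far can be parsed mechanically. The following is a minimal sketch that assumes the output keeps the exact shape seen above; `parse_mha_status` is a hypothetical helper name, not part of MHA.

```shell
# Hypothetical helper: reads masterha_check_status output on stdin and prints
# "<STATE> <master>", with "-" as the master when none is reported.
parse_mha_status() {
    awk '
    match($0, /\([0-9]+:[A-Z_]+\)/) {
        # Strip the parentheses, e.g. "(0:PING_OK)" -> "0:PING_OK"
        code_state = substr($0, RSTART + 1, RLENGTH - 2)
        split(code_state, parts, ":")
        master = "-"
        if (match($0, /master:[^ ,]+/)) {
            # Skip the 7 characters of "master:"
            master = substr($0, RSTART + 7, RLENGTH - 7)
        }
        print parts[2], master
    }'
}

# Example with the two lines captured in this article:
echo 'app1 (pid:298237) is running(0:PING_OK), master:172.16.1.127' | parse_mha_status
echo 'app1 is stopped(2:NOT_RUNNING).' | parse_mha_status
```

In practice the input would come from `masterha_check_status --conf=/etc/masterha/app1.cnf`, piped into the function.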

  [root@hdp1~]#tail -n20 /var/log/masterha/app1/manager.log       
       Tue Jul 31 12:57:06 2018 - [info]        
       172.16.1.127(172.16.1.127:3306) (current master)       
        +--172.16.1.126(172.16.1.126:3306)       
        +--172.16.1.125(172.16.1.125:3306)       
           Tue Jul 31 12:57:06 2018 - [info] Checking master_ip_failover_script status:       
       Tue Jul 31 12:57:06 2018 - [info]   /usr/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=172.16.1.127 --orig_master_ip=172.16.1.127 --orig_master_port=3306        
               IN SCRIPT TEST====/sbin/ifconfig ens160:1 down==/sbin/ifconfig ens32:1 172.16.1.100===       
           Checking the Status of the script.. OK        
       SIOCSIFADDR: No such device       
       ens32:1: ERROR while getting interface flags: No such device       
       Tue Jul 31 12:57:06 2018 - [info]  OK.       
       Tue Jul 31 12:57:06 2018 - [warning] shutdown_script is not defined.       
       Tue Jul 31 12:57:06 2018 - [info] Set master ping interval 1 seconds.       
       Tue Jul 31 12:57:06 2018 - [info] Set secondary check script: /usr/bin/masterha_secondary_check -s hdp4 -s hdp3 --user=root --master_host=hdp4 --master_ip=172.16.1.127 --master_port=3306       
       Tue Jul 31 12:57:06 2018 - [info] Starting ping health check on 172.16.1.127(172.16.1.127:3306)..       
       Tue Jul 31 12:57:06 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..       
       [root@hdp1~]#

On hdp4 172.16.1.127 (the master), run as root:

/sbin/ifconfig ens160:1 172.16.1.100/24

Check the VIP:

  [root@hdp4~]#ip a       
       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN        
           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00       
           inet 127.0.0.1/8 scope host lo       
              valid_lft forever preferred_lft forever       
           inet6 ::1/128 scope host        
              valid_lft forever preferred_lft forever       
       2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000       
           link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff       
           inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160       
              valid_lft forever preferred_lft forever       
           inet 172.16.1.100/16 brd 172.16.255.255 scope global ens160:1       
              valid_lft forever preferred_lft forever       
           inet6 fe80::250:56ff:fea5:497f/64 scope link        
              valid_lft forever preferred_lft forever       
       [root@hdp4~]#
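Reading the `ip a` output by eye works for a one-off check, but the same test is repeated after every switch in the sections that follow. A minimal sketch of automating it, assuming `ip a` output in the format shown above; `vip_interface` is a hypothetical helper name.

```shell
# Hypothetical helper: vip_interface VIP
# Reads `ip a` output on stdin and prints the label of the interface that
# carries VIP. Prints nothing and returns non-zero if the VIP is not bound.
vip_interface() {
    awk -v vip="$1" '
    $1 == "inet" {
        split($2, addr, "/")        # "172.16.1.100/16" -> "172.16.1.100"
        if (addr[1] == vip) {
            print $NF               # last field is the interface label
            found = 1
        }
    }
    END { exit (found ? 0 : 1) }'
}

# Example with two address lines taken from the hdp4 output above:
printf '%s\n' \
    '    inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160' \
    '    inet 172.16.1.100/16 brd 172.16.255.255 scope global ens160:1' |
    vip_interface 172.16.1.100      # prints "ens160:1"
```

On a live host the call would be `ip a | vip_interface 172.16.1.100`.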

(1) On the slave1 database (172.16.1.126), stop the slave IO thread to simulate replication lag:

mysql -uroot -p123456 -e "stop slave io_thread;"

(2) On the master (172.16.1.127), install sysbench and generate test data: 10 sbtest tables in the sbtest database, 100,000 rows in total.

  # Install sysbench as root
       yum install sysbench -y
       # Create the sbtest database as the mysql user
       mysql -uroot -p123456 -e "create database sbtest;"
       # Generate the data with sysbench as the mysql user
       sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=root --mysql-password=123456 --oltp-test-mode=complex --oltp-tables-count=10 --oltp-table-size=10000 --threads=10 --time=120 --report-interval=10 --db-driver=mysql prepare

(3) As root, stop the MySQL service on the master.

service mysql stop

(4) Verify that the VIP has moved.

On hdp3, as root:

  [root@hdp3~]#ip a       
       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN        
           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00       
           inet 127.0.0.1/8 scope host lo       
              valid_lft forever preferred_lft forever       
           inet6 ::1/128 scope host        
              valid_lft forever preferred_lft forever       
       2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000       
           link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff       
           inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32       
              valid_lft forever preferred_lft forever       
           inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1       
              valid_lft forever preferred_lft forever       
           inet6 fe80::250:56ff:fea5:f77/64 scope link        
              valid_lft forever preferred_lft forever       
       [root@hdp3~]#

On hdp4, as root:

  [root@hdp4~]#ip a       
       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN        
           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00       
           inet 127.0.0.1/8 scope host lo       
              valid_lft forever preferred_lft forever       
           inet6 ::1/128 scope host        
              valid_lft forever preferred_lft forever       
       2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000       
           link/ether 00:50:56:a5:49:7f brd ff:ff:ff:ff:ff:ff       
           inet 172.16.1.127/24 brd 172.16.1.255 scope global ens160       
              valid_lft forever preferred_lft forever       
           inet6 fe80::250:56ff:fea5:497f/64 scope link        
              valid_lft forever preferred_lft forever       
       [root@hdp4~]#

The VIP has moved from hdp4 172.16.1.127 (the master) to hdp3 172.16.1.126 (slave1).

(5) Access the database from a client through the VIP

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show databases; use sbtest; show tables; select count(*) from sbtest1; select count(*) from sbtest10;"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       +--------------------+       
       | Database           |       
       +--------------------+       
       | information_schema |       
       | mysql              |       
       | performance_schema |       
       | sbtest             |       
       | source             |       
       | test               |       
       +--------------------+       
       +------------------+       
       | Tables_in_sbtest |       
       +------------------+       
       | sbtest1          |       
       | sbtest10         |       
       | sbtest2          |       
       | sbtest3          |       
       | sbtest4          |       
       | sbtest5          |       
       | sbtest6          |       
       | sbtest7          |       
       | sbtest8          |       
       | sbtest9          |       
       +------------------+       
       +----------+       
       | count(*) |       
       +----------+       
       |    10000 |       
       +----------+       
       +----------+       
       | count(*) |       
       +----------+       
       |    10000 |       
       +----------+       
           C:\WINDOWS\system32>

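The slave IO thread on 172.16.1.126 was stopped before the sbtest database was even created. Querying the new master 172.16.1.126 shows that the data it was missing has been applied as well, so no data was lost.

(6) Check the replication topology after the switch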

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       *************************** 1. row ***************************       
                      Slave_IO_State: Waiting for master to send event       
                         Master_Host: 172.16.1.126       
                         Master_User: repl       
                         Master_Port: 3306       
                       Connect_Retry: 60       
                     Master_Log_File: mysql-bin.000001       
                 Read_Master_Log_Pos: 19093607       
                      Relay_Log_File: hdp2-relay-bin.000002       
                       Relay_Log_Pos: 283       
               Relay_Master_Log_File: mysql-bin.000001       
                    Slave_IO_Running: Yes       
                   Slave_SQL_Running: Yes       
                     Replicate_Do_DB:       
                 Replicate_Ignore_DB:       
                  Replicate_Do_Table:       
              Replicate_Ignore_Table:       
             Replicate_Wild_Do_Table:       
         Replicate_Wild_Ignore_Table:       
                          Last_Errno: 0       
                          Last_Error:       
                        Skip_Counter: 0       
                 Exec_Master_Log_Pos: 19093607       
                     Relay_Log_Space: 455       
                     Until_Condition: None       
                      Until_Log_File:       
                       Until_Log_Pos: 0       
                  Master_SSL_Allowed: No       
                  Master_SSL_CA_File:       
                  Master_SSL_CA_Path:       
                     Master_SSL_Cert:       
                   Master_SSL_Cipher:       
                      Master_SSL_Key:       
               Seconds_Behind_Master: 0       
       Master_SSL_Verify_Server_Cert: No       
                       Last_IO_Errno: 0       
                       Last_IO_Error:       
                      Last_SQL_Errno: 0       
                      Last_SQL_Error:       
         Replicate_Ignore_Server_Ids:       
                    Master_Server_Id: 126       
                         Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56       
                    Master_Info_File: /data/master.info       
                           SQL_Delay: 0       
                 SQL_Remaining_Delay: NULL       
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it       
                  Master_Retry_Count: 86400       
                         Master_Bind:       
             Last_IO_Error_Timestamp:       
            Last_SQL_Error_Timestamp:       
                      Master_SSL_Crl:       
                  Master_SSL_Crlpath:       
                  Retrieved_Gtid_Set:       
                   Executed_Gtid_Set:       
                       Auto_Position: 0       
           C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
           C:\WINDOWS\system32>

172.16.1.126 has become the new master (its show slave status output is empty), and 172.16.1.125 now points to this new master.

(7) Check the MHA Manager status

  [root@hdp1~]#masterha_check_status --conf=/etc/masterha/app1.cnf       
       app1 is stopped(2:NOT_RUNNING).       
       [1]+  Done                    nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1       
       [root@hdp1~]#

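After one automatic failover, the MHA Manager process has stopped. The official site's explanation of this behavior (the quoted text is not reproduced here) amounts to this: install a process-management tool and use it together with a script to supervise the process.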
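The idea of supervising the manager can be illustrated with a plain restart loop. This is a rough sketch only: `supervise` is a hypothetical helper, and a real process supervisor handles logging, signals and back-off far better. Note also that after a real failover the configuration may need attention before the manager can usefully start again.

```shell
# Hypothetical helper: supervise MAX_RESTARTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND and starts it again each time it exits, for at most
# MAX_RESTARTS restarts (so MAX_RESTARTS + 1 runs in total).
supervise() {
    max_restarts=$1
    delay=$2
    shift 2
    runs=0
    while :; do
        "$@"
        status=$?
        runs=$((runs + 1))
        echo "run $runs exited with status $status"
        if [ "$runs" -gt "$max_restarts" ]; then
            break
        fi
        sleep "$delay"
    done
}

# Intended use (not executed here), with the flags used earlier in this article:
#   supervise 5 60 masterha_manager --conf=/etc/masterha/app1.cnf \
#       --remove_dead_master_conf --ignore_last_failover
```

The restart cap is there so that a manager that fails immediately, for example because of a broken configuration, does not loop forever.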

First, restore the environment.

Restore database replication:

  -- Reset master and slave state on hdp4, hdp3 and hdp2
       stop slave;
       drop database sbtest;
       reset master;
       reset slave all;
       -- On hdp3 and hdp2, point replication back at hdp4 as the master
       change master to
       master_host='172.16.1.127',
       master_port=3306,
       master_user='repl',
       master_password='123456',
       master_log_file='mysql-bin.000001',
       master_log_pos=120;
       start slave;
       show slave status\G

Restore the VIP binding:

  # Run as root on hdp3
       /sbin/ifconfig ens32:1 down
       # Run as root on hdp4
       /sbin/ifconfig ens160:1 172.16.1.100

Restore the configuration file:

On hdp1, edit /etc/masterha/app1.cnf and add the [server1] section back.

Start MHA Manager:

  # Run as root on hdp1
       nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

The environment is now restored and manual failover can be tested. When the master fails, MHA is invoked by hand to perform the failover, as follows.

(1) Stop MHA Manager

masterha_stop --conf=/etc/masterha/app1.cnf

(2) Shut down the master

service mysql stop

(3) Run the manual failover

masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.16.1.127 --dead_master_port=3306 --new_master_host=172.16.1.126 --new_master_port=3306 --ignore_last_failover

(4) Verify that the VIP has moved to 172.16.1.126

  [root@hdp3~]#ip a       
       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN        
           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00       
           inet 127.0.0.1/8 scope host lo       
              valid_lft forever preferred_lft forever       
           inet6 ::1/128 scope host        
              valid_lft forever preferred_lft forever       
       2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000       
           link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff       
           inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32       
              valid_lft forever preferred_lft forever       
           inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1       
              valid_lft forever preferred_lft forever       
           inet6 fe80::250:56ff:fea5:f77/64 scope link        
              valid_lft forever preferred_lft forever       
       [root@hdp3~]#

(5) Verify the replication topology

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       *************************** 1. row ***************************       
                      Slave_IO_State: Waiting for master to send event       
                         Master_Host: 172.16.1.126       
                         Master_User: repl       
                         Master_Port: 3306       
                       Connect_Retry: 60       
                     Master_Log_File: mysql-bin.000001       
                 Read_Master_Log_Pos: 120       
                      Relay_Log_File: hdp2-relay-bin.000002       
                       Relay_Log_Pos: 283       
               Relay_Master_Log_File: mysql-bin.000001       
                    Slave_IO_Running: Yes       
                   Slave_SQL_Running: Yes       
                     Replicate_Do_DB:       
                 Replicate_Ignore_DB:       
                  Replicate_Do_Table:       
              Replicate_Ignore_Table:       
             Replicate_Wild_Do_Table:       
         Replicate_Wild_Ignore_Table:       
                          Last_Errno: 0       
                          Last_Error:       
                        Skip_Counter: 0       
                 Exec_Master_Log_Pos: 120       
                     Relay_Log_Space: 455       
                     Until_Condition: None       
                      Until_Log_File:       
                       Until_Log_Pos: 0       
                  Master_SSL_Allowed: No       
                  Master_SSL_CA_File:       
                  Master_SSL_CA_Path:       
                     Master_SSL_Cert:       
                   Master_SSL_Cipher:       
                      Master_SSL_Key:       
               Seconds_Behind_Master: 0       
       Master_SSL_Verify_Server_Cert: No       
                       Last_IO_Errno: 0       
                       Last_IO_Error:       
                      Last_SQL_Errno: 0       
                      Last_SQL_Error:       
         Replicate_Ignore_Server_Ids:       
                    Master_Server_Id: 126       
                         Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56       
                    Master_Info_File: /data/master.info       
                           SQL_Delay: 0       
                 SQL_Remaining_Delay: NULL       
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it       
                  Master_Retry_Count: 86400       
                         Master_Bind:       
             Last_IO_Error_Timestamp:       
            Last_SQL_Error_Timestamp:       
                      Master_SSL_Crl:       
                  Master_SSL_Crlpath:       
                  Retrieved_Gtid_Set:       
                   Executed_Gtid_Set:       
                       Auto_Position: 0       
           C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
           C:\WINDOWS\system32>

(6) Verify client access through the VIP

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show variables like 'server_id'; show databases;"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       +---------------+-------+       
       | Variable_name | Value |       
       +---------------+-------+       
       | server_id     | 126   |       
       +---------------+-------+       
       +--------------------+       
       | Database           |       
       +--------------------+       
       | information_schema |       
       | mysql              |       
       | performance_schema |       
       | source             |       
       | test               |       
       +--------------------+       
           C:\WINDOWS\system32>

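There are many situations in which the current master has to be migrated to another server, for example when the master has a hardware fault, a RAID controller needs to be rebuilt, or the master is being moved to a faster machine. Master maintenance degrades service, and at the very least writes are unavailable during the downtime. In addition, blocking or killing the currently running sessions can lead to data inconsistency between the old and new masters. MHA provides a fast switch with graceful write blocking: the switch takes only 0.5 to 2 seconds, during which data cannot be written. In many cases a write block of 0.5 to 2 seconds is acceptable, so switching the master does not require scheduling a maintenance window.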

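The MHA online switch roughly proceeds as follows:

Check the replication settings and identify the current master
Identify the new master
Block writes on the current master
Wait for all slaves to catch up with replication
Grant writes on the new master
Reconfigure the slaves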

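Note that for online switching the application architecture has to address two issues:

Automatically identifying the master and the slaves (the master may move to another machine). Using a VIP largely solves this.
Load balancing (define the approximate read/write ratio and the share of load each machine can carry; this has to be reconsidered whenever a machine leaves the cluster).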

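To keep the data fully consistent and finish the switch as quickly as possible, an MHA online switch succeeds only if all of the following conditions hold; otherwise it fails.

The IO thread is running on every slave
The SQL thread is running on every slave
Seconds_Behind_Master in the show slave status output of every slave is less than or equal to running_updates_limit seconds. If running_updates_limit is not specified for the switch, it defaults to 1 second.
On the master, show processlist shows no update that has been running for longer than running_updates_limit seconds.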
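The three slave-side conditions above can be pre-checked from `show slave status\G` output before attempting a switch. The following is a sketch under the assumption that the output looks like the listings in this article; `slave_ready` is a hypothetical helper, and the fourth condition (long-running updates on the master) is not covered.

```shell
# Hypothetical helper: slave_ready [LIMIT]
# Reads `show slave status\G` output on stdin. Prints "ok" when both
# replication threads are running and Seconds_Behind_Master <= LIMIT
# (default 1, like running_updates_limit); otherwise prints the reason.
slave_ready() {
    awk -v limit="${1:-1}" '
    $1 == "Slave_IO_Running:"      { io  = $2 }
    $1 == "Slave_SQL_Running:"     { sql = $2 }
    $1 == "Seconds_Behind_Master:" { lag = $2 }
    END {
        if (io != "Yes")
            print "IO thread not running"
        else if (sql != "Yes")
            print "SQL thread not running"
        else if (lag == "" || lag == "NULL" || lag + 0 > limit + 0)
            print "lag exceeds limit"
        else
            print "ok"
    }'
}

# Example with the three relevant lines from a healthy slave:
printf '%s\n' \
    'Slave_IO_Running: Yes' \
    'Slave_SQL_Running: Yes' \
    'Seconds_Behind_Master: 0' | slave_ready 1    # prints "ok"
```

Against a live slave the input would come from something like `mysql -uroot -p -h172.16.1.125 -e "show slave status\G"`.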

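Before testing, restore the environment with the steps listed before the manual failover test above (after a manual failover, /etc/masterha/app1.cnf does not need to be modified), then test the online switch as follows:

(1) Stop MHA Manager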

masterha_stop --conf=/etc/masterha/app1.cnf

(2) Run the online switch command

masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.16.1.126 --new_master_port=3306  --orig_master_is_new_slave --running_updates_limit=10000

(3) Verify the replication topology

Check the slave status on hdp2, hdp3 and hdp4:

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.125 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       *************************** 1. row ***************************       
                      Slave_IO_State: Waiting for master to send event       
                         Master_Host: 172.16.1.126       
                         Master_User: repl       
                         Master_Port: 3306       
                       Connect_Retry: 60       
                     Master_Log_File: mysql-bin.000001       
                 Read_Master_Log_Pos: 120       
                      Relay_Log_File: hdp2-relay-bin.000002       
                       Relay_Log_Pos: 283       
               Relay_Master_Log_File: mysql-bin.000001       
                    Slave_IO_Running: Yes       
                   Slave_SQL_Running: Yes       
                     Replicate_Do_DB:       
                 Replicate_Ignore_DB:       
                  Replicate_Do_Table:       
              Replicate_Ignore_Table:       
             Replicate_Wild_Do_Table:       
         Replicate_Wild_Ignore_Table:       
                          Last_Errno: 0       
                          Last_Error:       
                        Skip_Counter: 0       
                 Exec_Master_Log_Pos: 120       
                     Relay_Log_Space: 455       
                     Until_Condition: None       
                      Until_Log_File:       
                       Until_Log_Pos: 0       
                  Master_SSL_Allowed: No       
                  Master_SSL_CA_File:       
                  Master_SSL_CA_Path:       
                     Master_SSL_Cert:       
                   Master_SSL_Cipher:       
                      Master_SSL_Key:       
               Seconds_Behind_Master: 0       
       Master_SSL_Verify_Server_Cert: No       
                       Last_IO_Errno: 0       
                       Last_IO_Error:       
                      Last_SQL_Errno: 0       
                      Last_SQL_Error:       
         Replicate_Ignore_Server_Ids:       
                    Master_Server_Id: 126       
                         Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56       
                    Master_Info_File: /data/master.info       
                           SQL_Delay: 0       
                 SQL_Remaining_Delay: NULL       
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it       
                  Master_Retry_Count: 86400       
                         Master_Bind:       
             Last_IO_Error_Timestamp:       
            Last_SQL_Error_Timestamp:       
                      Master_SSL_Crl:       
                  Master_SSL_Crlpath:       
                  Retrieved_Gtid_Set:       
                   Executed_Gtid_Set:       
                       Auto_Position: 0       
           C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.126 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
           C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.127 -e "show slave status\G"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       *************************** 1. row ***************************       
                      Slave_IO_State: Waiting for master to send event       
                         Master_Host: 172.16.1.126       
                         Master_User: repl       
                         Master_Port: 3306       
                       Connect_Retry: 60       
                     Master_Log_File: mysql-bin.000001       
                 Read_Master_Log_Pos: 120       
                      Relay_Log_File: hdp4-relay-bin.000002       
                       Relay_Log_Pos: 283       
               Relay_Master_Log_File: mysql-bin.000001       
                    Slave_IO_Running: Yes       
                   Slave_SQL_Running: Yes       
                     Replicate_Do_DB:       
                 Replicate_Ignore_DB:       
                  Replicate_Do_Table:       
              Replicate_Ignore_Table:       
             Replicate_Wild_Do_Table:       
         Replicate_Wild_Ignore_Table:       
                          Last_Errno: 0       
                          Last_Error:       
                        Skip_Counter: 0       
                 Exec_Master_Log_Pos: 120       
                     Relay_Log_Space: 455       
                     Until_Condition: None       
                      Until_Log_File:       
                       Until_Log_Pos: 0       
                  Master_SSL_Allowed: No       
                  Master_SSL_CA_File:       
                  Master_SSL_CA_Path:       
                     Master_SSL_Cert:       
                   Master_SSL_Cipher:       
                      Master_SSL_Key:       
               Seconds_Behind_Master: 0       
       Master_SSL_Verify_Server_Cert: No       
                       Last_IO_Errno: 0       
                       Last_IO_Error:       
                      Last_SQL_Errno: 0       
                      Last_SQL_Error:       
         Replicate_Ignore_Server_Ids:       
                    Master_Server_Id: 126       
                         Master_UUID: fadd5b7d-7d9f-11e8-90b4-13ccc7802b56       
                    Master_Info_File: /data/master.info       
                           SQL_Delay: 0       
                 SQL_Remaining_Delay: NULL       
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it       
                  Master_Retry_Count: 86400       
                         Master_Bind:       
             Last_IO_Error_Timestamp:       
            Last_SQL_Error_Timestamp:       
                      Master_SSL_Crl:       
                  Master_SSL_Crlpath:       
                  Retrieved_Gtid_Set:       
                   Executed_Gtid_Set:       
                       Auto_Position: 0       
           C:\WINDOWS\system32>

hdp3 172.16.1.126 has become the new master, while hdp2 172.16.1.125 and hdp4 172.16.1.127 are now slaves pointing to it.

(4) Verify that the VIP has moved automatically

  [root@hdp3~]#ip a       
       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN        
           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00       
           inet 127.0.0.1/8 scope host lo       
              valid_lft forever preferred_lft forever       
           inet6 ::1/128 scope host        
              valid_lft forever preferred_lft forever       
       2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000       
           link/ether 00:50:56:a5:0f:77 brd ff:ff:ff:ff:ff:ff       
           inet 172.16.1.126/24 brd 172.16.1.255 scope global ens32       
              valid_lft forever preferred_lft forever       
           inet 172.16.1.100/16 brd 172.16.255.255 scope global ens32:1       
              valid_lft forever preferred_lft forever       
           inet6 fe80::250:56ff:fea5:f77/64 scope link        
              valid_lft forever preferred_lft forever       
       [root@hdp3~]#

(5) Verify client access to the database through the VIP

  C:\WINDOWS\system32>mysql -uroot -p123456 -h172.16.1.100 -e "show variables like 'server_id'"       
       mysql: [Warning] Using a password on the command line interface can be insecure.       
       +---------------+-------+       
       | Variable_name | Value |       
       +---------------+-------+       
       | server_id     | 126   |       
       +---------------+-------+       
           C:\WINDOWS\system32>

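After an automatic failover the original master has usually been taken out of service. Once the original master host is repaired, and provided its data is intact, you may want to bring it back as a slave of the new master. The MHA log written at the time of the automatic failover can be used for this. The following command extracts the relevant log entry: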

grep -i "All other slaves should start" /var/log/masterha/app1/manager.log

The output looks similar to this:

All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.1.126', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='123456';

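In other words, once the master host has been repaired, run this CHANGE MASTER statement on it and then start slave, so that it rejoins as a slave of the new master.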
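Pulling just the SQL statement out of that log line can be scripted. A minimal sketch, assuming the log line has the shape shown above; `extract_change_master` is a hypothetical helper name. If the password in your log appears masked, substitute the real replication password before running the statement.

```shell
# Hypothetical helper: reads manager.log lines on stdin and prints the most
# recent CHANGE MASTER statement logged for the remaining slaves.
extract_change_master() {
    sed -n 's/^.*Statement should be: \(CHANGE MASTER TO .*;\).*$/\1/p' |
        tail -n 1
}

# Example with the log line quoted above:
echo "All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.1.126', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='123456';" |
    extract_change_master
```

On the manager host the input would be the log file itself: `extract_change_master < /var/log/masterha/app1/manager.log`.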

