A Classic RAC Failure Caused by DNS


Author: 吳偉龍 (PrudentWoo)

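This is a RAC system that was deployed four years ago. It had always run well without any problems, and in day-to-day operation it was essentially left unattended.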

OS: Red Hat Enterprise Linux 5.8 x86_64

DB: Oracle Database Enterprise 11.2.0.4

GI: Oracle Grid Infrastructure 11.2.0.4

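Yesterday, shortly before the end of the workday, I received a fault report from the on-site staff: the database could not be reached and connections failed with ORA-12547: TNS:lost contact. My first thought was a network or listener failure, but when I had the on-site staff run tnsping and ping, both came back normal.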

After arriving on site, I first checked the state of the database. The database instances were stopped, and the log showed no obvious error:

Starting up:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP, Data Mining

and Real Application Testing options.

ORACLE_HOME = /u01/app/oracle/11.2.0.4/product/db_1

System name: Linux

Node name: db01

Release: 3.8.13-44.1.1.el6uek.x86_64

Version: #2 SMP Wed Sep 10 06:10:25 PDT 2014

Machine: x86_64

VM name: VMWare Version: 6

Using parameter settings in server-side pfile /u01/app/oracle/11.2.0.4/product/db_1/dbs/initwoo1.ora

System parameters with non-default values:

processes = 600

sessions = 922

spfile = "+DATA/woo/spfilewoo.ora"

nls_language = "SIMPLIFIED CHINESE"

nls_territory = "CHINA"

memory_target = 1584M

control_files = "+DATA/woo/controlfile/current.260.930748953"

control_files = "+FRA01/woo/controlfile/current.256.930748953"

db_block_size = 8192

compatible = "11.2.0.4.0"

cluster_database = TRUE

db_create_file_dest = "+DATA"

db_recovery_file_dest = "+FRA01"

db_recovery_file_dest_size= 4407M

thread = 1

undo_tablespace = "UNDOTBS1"

instance_number = 1

remote_login_passwordfile= "EXCLUSIVE"

db_domain = ""

dispatchers = "(PROTOCOL=TCP) (SERVICE=wooXDB)"

remote_listener = "scan.prudentwoo.com:1521"

audit_file_dest = "/u01/app/oracle/admin/woo/adump"

audit_trail = "DB"

db_name = "woo"

open_cursors = 300

diagnostic_dest = "/u01/app/oracle"

Cluster communication is configured to use the following interface(s) for this instance

169.254.51.38

169.254.243.157

cluster interconnect IPC version:Oracle UDP/IP (generic)

IPC Vendor 1 proto 2

Fri Dec 16 15:24:55 2016

USER (ospid: 4044): terminating the instance due to error 119

Instance terminated by USER, pid = 4044

[oracle@db01 ~]$ crsctl status res -t

--------------------------------------------------------------------------------

NAME TARGET STATE SERVER STATE_DETAILS

Local Resources

ora.BAK01.dg

ONLINE ONLINE db01

ONLINE ONLINE db02

ora.DATA.dg

ora.FRA01.dg

ora.LISTENER.lsnr

ora.OCR_VOT.dg

ora.asm

ONLINE ONLINE db01 Started

ONLINE ONLINE db02 Started

ora.gsd

OFFLINE OFFLINE db01

OFFLINE OFFLINE db02

ora.net1.network

ora.ons

Cluster Resources

ora.LISTENER_SCAN1.lsnr

1 ONLINE ONLINE db02

ora.LISTENER_SCAN2.lsnr

1 ONLINE ONLINE db01

ora.LISTENER_SCAN3.lsnr

ora.cvu

ora.db01.vip

ora.db02.vip

ora.oc4j

ora.scan1.vip

ora.scan2.vip

ora.scan3.vip

ora.woo.db

1 ONLINE OFFLINE Instance Shutdown

2 ONLINE OFFLINE Instance Shutdown

[oracle@db01 ~]$ srvctl status database -d woo

Instance woo1 is not running on node db01

Instance woo2 is not running on node db02

[oracle@db01 trace]$ srvctl start database -d woo

PRCR-1079 : Failed to start resource ora.woo.db

CRS-5017: The resource action "ora.woo.db start" encountered the following error:

ORA-00119: invalid specification for system parameter REMOTE_LISTENER

ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'

. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db02/agent/crsd/oraagent_oracle/oraagent_oracle.log".

. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0.4/product/grid/log/db01/agent/crsd/oraagent_oracle/oraagent_oracle.log".

CRS-2674: Start of 'ora.woo.db' on 'db02' failed

CRS-2674: Start of 'ora.woo.db' on 'db01' failed

CRS-2632: There are no more servers to try to place resource 'ora.woo.db' on that would satisfy its placement policy

alert.log:

[oracle@db01 trace]$ tail -0f alert_woo1.log

Fri Dec 16 15:37:08 2016

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Initial number of CPU is 2

Private Interface 'eth1:1' configured from GPnP for use as a private interconnect.

[name='eth1:1', type=1, ip=169.254.51.38, mac=00-0c-29-7c-44-ca, net=169.254.0.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]

Private Interface 'eth2:1' configured from GPnP for use as a private interconnect.

[name='eth2:1', type=1, ip=169.254.243.157, mac=00-0c-29-7c-44-d4, net=169.254.128.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]

Public Interface 'eth0' configured from GPnP for use as a public interface.

[name='eth0', type=1, ip=192.168.84.11, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]

Public Interface 'eth0:1' configured from GPnP for use as a public interface.

[name='eth0:1', type=1, ip=192.168.84.22, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]

Public Interface 'eth0:3' configured from GPnP for use as a public interface.

[name='eth0:3', type=1, ip=192.168.84.20, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]

Public Interface 'eth0:5' configured from GPnP for use as a public interface.

[name='eth0:5', type=1, ip=192.168.84.13, mac=00-0c-29-7c-44-c0, net=192.168.84.0/24, mask=255.255.255.0, use=public/1]

CELL communication is configured to use 0 interface(s):

CELL IP affinity details:

NUMA status: non-NUMA system

cellaffinity.ora status: N/A

CELL communication will use 1 IP group(s):

Grp 0:

Picked latch-free SCN scheme 3

Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST

Autotune of undo retention is turned on.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

Fri Dec 16 15:37:49 2016

USER (ospid: 6043): terminating the instance due to error 119

Instance terminated by USER, pid = 6043

Trying to start the database shows that it cannot start normally: the start attempt fails with ORA-00132, and the alert log reports error 119.
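The "error 119" in the alert log is the numeric form of ORA-00119. As a rough sketch that was not part of the original troubleshooting, a few lines of Python can pull such termination events out of an alert log. The function name `instance_terminations` and both regular expressions are my own assumptions, based only on the log layout shown above.

```python
import re
from typing import List, Optional, Tuple

# Assumed layout of an alert log timestamp line, e.g. "Fri Dec 16 15:37:49 2016".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
# Assumed layout of the termination message seen in the alert log above.
TERMINATION = re.compile(r"\(ospid: (\d+)\): terminating the instance due to error (\d+)")


def instance_terminations(alert_text: str) -> List[Tuple[Optional[str], int, str]]:
    """Return (timestamp, ospid, ORA code) for every instance termination found."""
    events = []
    last_stamp = None
    for raw in alert_text.splitlines():
        line = raw.strip()
        if TIMESTAMP.match(line):
            last_stamp = line  # remember the most recent timestamp line
            continue
        match = TERMINATION.search(line)
        if match:
            ospid = int(match.group(1))
            ora_code = "ORA-{:05d}".format(int(match.group(2)))
            events.append((last_stamp, ospid, ora_code))
    return events


SAMPLE_ALERT = """\
Fri Dec 16 15:37:49 2016
USER (ospid: 6043): terminating the instance due to error 119
Instance terminated by USER, pid = 6043
"""

for stamp, ospid, code in instance_terminations(SAMPLE_ALERT):
    print(stamp, ospid, code)
```

Turning the bare number into an ORA code makes it easier to line the alert log up with the ORA-00119 / ORA-00132 pair that srvctl prints.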

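Based on the startup messages, the problem can be narrowed down to the SCAN: a fault with the SCAN is what prevents the database from starting.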
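To show how the startup messages point at the SCAN, here is a minimal sketch that extracts the unresolved network name from ORA-00132 lines. The helper name `unresolved_names` and the regular expression are hypothetical and only assume the message format that appears in the srvctl output above.

```python
import re
from typing import List, Optional, Tuple

# Assumed format: ORA-00132: ... unresolved network name 'host:port'
ORA_00132 = re.compile(r"ORA-00132:[^']*'([^']+)'")


def unresolved_names(output: str) -> List[Tuple[str, Optional[int]]]:
    """Return the distinct (host, port) pairs named in ORA-00132 messages."""
    found = []
    for match in ORA_00132.finditer(output):
        host, _, port = match.group(1).partition(":")
        entry = (host, int(port) if port.isdigit() else None)
        if entry not in found:  # srvctl repeats the message once per node
            found.append(entry)
    return found


SAMPLE_OUTPUT = """\
PRCR-1079 : Failed to start resource ora.woo.db
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan.prudentwoo.com:1521'
CRS-2674: Start of 'ora.woo.db' on 'db02' failed
"""

print(unresolved_names(SAMPLE_OUTPUT))
```

The host part of the result is the name worth testing next, since it is the value of REMOTE_LISTENER that the instance could not resolve.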

#check scan info:

[oracle@db01 ~]$ srvctl config scan

SCAN name: scan.prudentwoo.com, Network: 1/192.168.84.0/255.255.255.0/eth0

SCAN VIP name: scan1, IP: /scan.prudentwoo.com/192.168.84.21

SCAN VIP name: scan2, IP: /scan.prudentwoo.com/192.168.84.22

SCAN VIP name: scan3, IP: /scan.prudentwoo.com/192.168.84.20

[oracle@db01 ~]$ ping 192.168.84.20 -c 2

PING 192.168.84.20 (192.168.84.20) 56(84) bytes of data.

64 bytes from 192.168.84.20: icmp_seq=1 ttl=64 time=0.032 ms

64 bytes from 192.168.84.20: icmp_seq=2 ttl=64 time=0.039 ms

--- 192.168.84.20 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1000ms

rtt min/avg/max/mdev = 0.032/0.035/0.039/0.006 ms

[oracle@db01 ~]$ ping 192.168.84.21 -c 2

PING 192.168.84.21 (192.168.84.21) 56(84) bytes of data.

64 bytes from 192.168.84.21: icmp_seq=1 ttl=64 time=0.231 ms

64 bytes from 192.168.84.21: icmp_seq=2 ttl=64 time=0.292 ms

--- 192.168.84.21 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1001ms

rtt min/avg/max/mdev = 0.231/0.261/0.292/0.034 ms

[oracle@db01 ~]$ ping 192.168.84.22 -c 2

PING 192.168.84.22 (192.168.84.22) 56(84) bytes of data.

64 bytes from 192.168.84.22: icmp_seq=1 ttl=64 time=0.024 ms

64 bytes from 192.168.84.22: icmp_seq=2 ttl=64 time=0.034 ms

--- 192.168.84.22 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 999ms

rtt min/avg/max/mdev = 0.024/0.029/0.034/0.005 ms

[oracle@db01 ~]$ ping scan.prudentwoo.com -c 2

ping: unknown host scan.prudentwoo.com

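We can see that all three addresses behind the SCAN are reachable, which suggests the SCAN service itself is fine. Pinging the domain name of the SCAN, however, reports an unknown host: the name cannot be resolved. The next step, then, is to suspect the name service.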
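The distinction between "the addresses answer" and "the name resolves" can be sketched in code. The following is an illustration under my own assumptions, not the procedure used on site: `resolve_ipv4` and `diagnose_scan` are hypothetical helpers, and the resolver is passed in as a parameter so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(name: str) -> List[str]:
    """Resolve a host name through the system resolver; empty list on failure."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def diagnose_scan(
    name: str,
    expected_ips: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> str:
    """Classify name resolution of a SCAN as 'unresolved', 'partial' or 'ok'."""
    resolved = set(resolver(name))
    if not resolved:
        return "unresolved"  # the situation behind ORA-00132 in this incident
    missing = set(expected_ips) - resolved
    return "partial" if missing else "ok"


# The three SCAN VIPs reported by "srvctl config scan" above.
SCAN_IPS = ["192.168.84.20", "192.168.84.21", "192.168.84.22"]

# Stand-in resolver that imitates the broken state (no answer for any name).
print(diagnose_scan("scan.prudentwoo.com", SCAN_IPS, resolver=lambda name: []))
```

Called without the `resolver` argument, `diagnose_scan` uses the system resolver, so on a node in the state described here it would be expected to report "unresolved" even though every VIP answers ping.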

#check dns client and server:

[oracle@db01 ~]$ /sbin/chkconfig --list|grep named

[oracle@db01 ~]$ ssh db02 '/sbin/chkconfig --list|grep named'

[oracle@db01 ~]$

check dns client:

[oracle@db01 ~]$ cat /etc/resolv.conf

search prudentwoo.com

nameserver 192.168.84.15

[oracle@db01 ~]$ ping 192.168.84.15 -c 2

PING 192.168.84.15 (192.168.84.15) 56(84) bytes of data.

From 192.168.84.11 icmp_seq=1 Destination Host Unreachable

From 192.168.84.11 icmp_seq=2 Destination Host Unreachable

--- 192.168.84.15 ping statistics ---

2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3007ms

pipe 2
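The client-side check above amounts to reading the nameserver entries from /etc/resolv.conf and testing whether each one answers. A minimal sketch of that idea follows. The function names are hypothetical, and the TCP probe on port 53 is an assumption of mine: it is a different test from the ICMP ping shown above, and a DNS server that only accepts UDP would fail it while still being healthy.

```python
import socket
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the addresses listed on 'nameserver' lines of resolv.conf content."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def dns_tcp_reachable(ip: str, timeout: float = 2.0) -> bool:
    """Best-effort probe: can a TCP connection to port 53 be opened?"""
    try:
        with socket.create_connection((ip, 53), timeout=timeout):
            return True
    except OSError:
        return False


# Content of /etc/resolv.conf as shown on db01 above.
SAMPLE_RESOLV_CONF = """\
search prudentwoo.com
nameserver 192.168.84.15
"""

print(nameservers(SAMPLE_RESOLV_CONF))
```

On a real node one would read /etc/resolv.conf and call `dns_tcp_reachable` for each returned address; a False result for the only configured server would match the "Destination Host Unreachable" seen in the ping output.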

PING scan.prudentwoo.com (192.168.84.21) 56(84) bytes of data.

64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=1 ttl=64 time=0.494 ms

64 bytes from scan.prudentwoo.com (192.168.84.21): icmp_seq=2 ttl=64 time=0.289 ms

--- scan.prudentwoo.com ping statistics ---

rtt min/avg/max/mdev = 0.289/0.391/0.494/0.104 ms

[oracle@db01 ~]$ srvctl start database -d woo

Instance woo1 is running on node db01

Instance woo2 is running on node db02

[oracle@db01 ~]$ srvctl config database -d woo

Database unique name: woo

Database name: woo

Oracle home: /u01/app/oracle/11.2.0.4/product/db_1

Oracle user: oracle

Spfile: +DATA/woo/spfilewoo.ora

Domain:

Start options: open

Stop options: immediate

Database role: PRIMARY

Management policy: AUTOMATIC

Server pools: woo

Database instances: woo1,woo2

Disk Groups: DATA,FRA01

Mount point paths:

Services:

Type: RAC

Database is administrator managed

The database now starts normally, and the fault is fixed.

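I was lucky, of course: out of professional instinct I spotted the root cause, ORA-00132, in the pile of errors right away, instead of starting from the other error messages.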
