When RAC boots, OCFS2 must be started first. After modifying the configuration in /etc/sysconfig/o2cb, only one of the two machines could automatically mount the ocfs2 partition at boot; the other could not. Once boot completed, however, mounting it manually worked fine.
I. Details
The two machines are dbsrv-1 and dbsrv-2, connected with a crossover cable for the network heartbeat; cluster.conf uses the private heartbeat IPs, not the public addresses.
1. Check the o2cb status
After boot, the o2cb service started normally and the ocfs2 module loaded fine, but the heartbeat was Not Active:
Quote:
Checking heartbeat: Not Active
2. Check /etc/fstab
Quote:
# cat /etc/fstab | grep ocfs2
/dev/sdc1 /oradata ocfs2 _netdev,datavolume,nointr 0 0
The configuration is correct.
3. Check /etc/ocfs2/cluster.conf on both machines
Quote:
# more /etc/ocfs2/cluster.conf
node:
	ip_port = 7777
	ip_address = 172.20.3.2
	number = 0
	name = dbsrv-2
	cluster = ocfs2

node:
	ip_port = 7777
	ip_address = 172.20.3.1
	number = 1
	name = dbsrv-1
	cluster = ocfs2

cluster:
	node_count = 2
	name = ocfs2
It has been confirmed that this file is identical on both machines.
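One way to confirm the two copies really are identical is a byte-for-byte comparison. This is only a sketch: the scp step and the temporary path are assumptions, with hostnames taken from the article.

```shell
# Compare the local cluster.conf against a copy fetched from the peer node.
# cmp -s is silent and exits 0 only when the two files are byte-identical.
same_conf() {
    cmp -s "$1" "$2"
}

# Usage (hypothetical paths):
#   scp dbsrv-1:/etc/ocfs2/cluster.conf /tmp/cluster.conf.peer
#   same_conf /etc/ocfs2/cluster.conf /tmp/cluster.conf.peer && echo identical
```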
4. Check the system logs
The errors were as follows:
Quote:
Jul 20 19:33:18 dbsrv-2 kernel: OCFS2 1.2.3
Jul 20 19:33:24 dbsrv-2 kernel: (4452,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_request_join:786 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_try_to_join_domain:934 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_join_domain:1186 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):dlm_register_domain:1379 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_dlm_init:2009 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: (4478,2):ocfs2_mount_volume:1062 ERROR: status = -107
Jul 20 19:33:24 dbsrv-2 kernel: ocfs2: Unmounting device (8,33) on (node 0)
Jul 20 19:33:26 dbsrv-2 mount: mount.ocfs2: Transport endpoint is not connected
Jul 20 19:33:26 dbsrv-2 mount:
Jul 20 19:33:26 dbsrv-2 netfs: Mounting other filesystems: failed
II. Analysis
1. Node startup order
A Google search turned up the following:
Quote:
Mount triggers the heartbeat thread which triggers the o2net
to make a connection to all heartbeating nodes. If this connection
fails, the mount fails. (The larger node number initiates the connection
to the lower node number.)
This means that when o2cb starts, connections are made according to the node numbers: the higher-numbered node initiates the connection to the lower-numbered one.
In cluster.conf, node 0 is dbsrv-2 and node 1 is dbsrv-1. When dbsrv-1 boots it can immediately reach its own IP and then mount the ocfs2 partition; but when dbsrv-2 boots it cannot yet reach the peer's IP address in time, so the mount fails.
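To see at a glance which machine holds which node number (and hence which side initiates the connection), the node stanzas can be listed. This parser is only a sketch, not an official ocfs2 tool:

```shell
# Print "number name" for every node: stanza of a cluster.conf, read from
# the files given as arguments or from stdin if none are given.
list_nodes() {
    awk '
        $1 == "number" { num = $3 }
        $1 == "name" && num != "" { print num, $3; num = "" }
    ' "$@"
}

# Usage:
#   list_nodes /etc/ocfs2/cluster.conf
# For the file shown above, this prints:
#   0 dbsrv-2
#   1 dbsrv-1
```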
2. Try adjusting the HEARTBEAT_THRESHOLD parameter
A Google search turned up another item:
Quote:
After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After encountering this myself and having confirmed with a couple of other people in the list that it has caused problems, it seems that the default threshold of 7 is possibly too short, even in reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.
Does the OCFS2 development team also consider this to be too short, or is altering the parameter just a workaround that shouldn't be used? If this is the case then how should we approach the problem of self-fencing nodes?
Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?
Finally, if altering the threshold is a valid solution, could it please be added to the FAQs and the user guide so that people know to adjust it as a first step on encountering the problem, rather than having to post to the list and wait for replies.
Following material found online, I changed the HEARTBEAT_THRESHOLD parameter in /etc/sysconfig/o2cb to 301. After restarting, the log reported:
Quote:
Jul 23 13:59:50 dbsrv-2 kernel: (4477,0):o2hb_check_slot:883 ERROR: Node 1 on device sdc1 has a dead count of 14000 ms, but our count is 602000 ms.
Jul 23 13:59:50 dbsrv-2 kernel: Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
Jul 23 13:59:54 dbsrv-2 kernel: OCFS2 1.2.3
Jul 23 14:00:00 dbsrv-2 kernel: (4449,0):o2net_connect_expired:1446 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
Jul 23 14:00:00 dbsrv-2 kernel: (4475,2):dlm_request_join:786 ERROR: status = -107
The problem remained.
※ Note
Quote:
[fence time (seconds)] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
(301 - 1) * 2 = 600 seconds
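The note's formula can be checked with a line of shell arithmetic. It also explains the "dead count" values in the log above: they are simply the two thresholds multiplied by the 2000 ms heartbeat interval (7 × 2000 = 14000 ms, 301 × 2000 = 602000 ms).

```shell
# Fencing window implied by O2CB_HEARTBEAT_THRESHOLD, per the formula above.
fence_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

fence_seconds 7     # default threshold: 12 seconds
fence_seconds 301   # value tried here: 600 seconds
```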
To sum up, it is now clear that all of the configuration is correct.
The cause of the failure is this:
Before the o2cb service starts, the IP address it depends on is, for some reason, not yet reachable; the connection attempt exceeds its time limit and startup fails. Once the machine has fully booted, the network is up, so mounting the ocfs2 partition manually works.
III. Solution
1. Information from Oracle Metalink
Quote:
The problem here is that the network layer has not become fully functional even after the /etc/init.d/network script is done executing. The proposed patch is a workaround and is not fixing a problem in the o2cb script.
2. Fixes
Quote:
a) Make sure all configuration files are correct and identical on both nodes;
b) Make sure the two servers' clocks are not too far apart (use time synchronization);
c) In the cluster.conf used by o2cb, use the heartbeat IPs, not the public IPs;
d) Modify the /etc/init.d/o2cb script to sleep at the very beginning, to give the network time to come up;
e) If it still does not work, put the startup commands in /etc/rc.local:
mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
/etc/init.d/init.crs start
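Options d) and e) can be combined into one rc.local sketch: rather than a fixed sleep, poll until the peer's heartbeat IP answers before mounting. This is only a sketch; the peer IP, the timeout, and the ping flags are assumptions.

```shell
# Wait until a probe command succeeds, polling once per second,
# up to a maximum number of seconds. Returns 1 on timeout.
wait_for_peer() {
    probe=$1
    max=$2
    waited=0
    until $probe >/dev/null 2>&1; do
        sleep 1
        waited=$((waited + 1))
        [ "$waited" -ge "$max" ] && return 1
    done
    return 0
}

# In /etc/rc.local (hypothetical values):
#   wait_for_peer "ping -c 1 -W 1 172.20.3.1" 60
#   mount -t ocfs2 -o datavolume,nointr /dev/sdc1 /oradata
#   /etc/init.d/init.crs start
```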