
After OS reboot, Ohasd(cssd) start fail is due to

ANBOB

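A few days ago I helped a colleague with a case: after the host rebooted unexpectedly, the database could not start. The environment was 11.2.0.3 standalone on AIX, using ASM. We eventually confirmed that the OLR was corrupted. I did not record the exact steps at the time, so this is only a brief note.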

# check init.ohasd process is running
ps -ef | grep init.ohasd
# if not, then run as root
/etc/init.d/init.ohasd run >/dev/null 2>&1
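
The process check above can be wrapped in a small function so that the grep process does not match its own entry in the listing. This is a minimal sketch, assuming a POSIX shell. The function name `ohasd_running` is hypothetical, and it reads `ps -ef` output on standard input so it can also be tried on captured output.

```shell
# Hypothetical helper: succeeds if a ps listing on stdin shows init.ohasd.
# The bracket in the pattern keeps this grep's own entry in the ps listing
# from matching, because its command line reads "[i]nit\.ohasd".
ohasd_running() {
    grep -q '[i]nit\.ohasd'
}

# Usage on a live system:
#   ps -ef | ohasd_running && echo "init.ohasd is running"
```

If the function returns non-zero, init.ohasd is not running and the `/etc/init.d/init.ohasd run` step above applies.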

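At this point HAS had started, but cssd had not. cssd and some local OCR resources are started together with HAS as resources; if no database uses ASM, cssd and ASM are not started automatically by default. However, crsctl start res ora.cssd -init reported that the resource did not exist, so I checked the GI alert log, shown below.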

# gi alert
[ohasd(7210)]CRS-8017:location: /etc/oracle/lastgasp has 4 reboot advisory log files, 0 were announced and 0 errors occurred
-10-18 12:51:23.952
[ohasd(7210)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
-10-18 13:52:40.249
[ohasd(8085)]CRS-2112:The OLR service started on node anbob.
-10-18 13:52:40.314
[ohasd(8085)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
-10-18 14:01:57.696
[ohasd(8668)]CRS-2112:The OLR service started on node anbob.
-10-18 14:01:57.761
[ohasd(8668)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
-10-18 15:03:27.157
[ohasd(11026)]CRS-2112:The OLR service started on node anbob.
-10-18 15:03:27.202
[ohasd(11296)]CRS-1339:Oracle High Availability Service aborted due to an unexpected error 
[Failed to initialize Oracle Local Registry]. Details at (:OHAS00106:) in 
/oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log.
-10-18 15:03:27.220
[ohasd(11026)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
-10-18 15:23:08.435
[ohasd(12135)]CRS-2112:The OLR service started on node anbob.
-10-18 15:23:08.499
[ohasd(12135)]CRS-2772:Server 'anbob' has been assigned to pool 'Free'.
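
In a long alert log the abort is easy to miss among the repeated CRS-2112 and CRS-2772 entries. The following is a minimal sketch for pulling out only the CRS-1339 entries. The function name `show_ohasd_aborts` is hypothetical, and the number of continuation lines (two) is an assumption based on the excerpt above.

```shell
# Hypothetical helper: print each CRS-1339 entry plus the two lines after it,
# since the alert log wraps this message across several lines.
# awk is used instead of "grep -A", which not every Unix grep supports.
show_ohasd_aborts() {
    awk '/CRS-1339/ { n = 3 } n > 0 { print; n-- }' "$1"
}

# Usage (the path is a placeholder, pass your own GI alert log):
#   show_ohasd_aborts /path/to/gi_alert.log
```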

# OHASD.LOG  

[grid@anbob anbob]$ vi /oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log
-10-18 15:03:27.159: [  OCRSRV][1032845088]th_init: Local listener did not reach valid state
-10-18 15:03:27.159: [  CRSOCR][555579168] CAAOCR GET Debug sblevel Level[default]: 0
...
[  clsdmt][4118787840]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=anbobDBG_OHASD))
-10-18 15:03:27.170: [  clsdmt][4118787840]PID for the Process [11026], connkey 8
-10-18 15:03:27.176: [ default][555579168] Ohasd Daemon Started.
-10-18 15:03:27.176: [   CLSLG][3747604224] Last Gasp Monitor thread started
-10-18 15:03:27.176: [   CLSLG][3747604224] processing Last Gasp disk location /etc/oracle/lastgasp
-10-18 15:03:27.177: [    CLSE][555579168]clse_get_auth_loc: Returning default authloc: /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob
-10-18 15:03:27.177: [ default][555579168] AuthLoc /oracle/app/oracle/product/11.2.0/grid/auth/ohasd/anbob
-10-18 15:03:27.177: [ default][555579168] PE Engine: NEW
-10-18 15:03:27.177: [ default][555579168] Using OCR batch ops : ENABLED
-10-18 15:03:27.177: [ default][555579168] RD registrations and Clusterization disabled.
-10-18 15:03:27.177: [   CLSLG][3747604224] monitoring new interface 0.0.0.0
-10-18 15:03:27.178: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
-10-18 15:03:27.178: [CLSFRAME][555579168] Inited lsf context 0x220efe0
-10-18 15:03:27.178: [CLSFRAME][555579168] Initing CLS Framework messaging
-10-18 15:03:27.179: [  CLSVER][555579168] Static Version 11.2.0.1.0
-10-18 15:03:27.179: [ default][555579168][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
-10-18 15:03:27.180: [UiServer][555579168] UI Comms initalize() 1
-10-18 15:03:27.181: [CLSFRAME][555579168] New Framework state: 2
-10-18 15:03:27.181: [CLSFRAME][555579168] M2M is starting...
-10-18 15:03:27.182: [ CRSCOMM][555579168] m_pClscCtx=0x22c8030m_pUgblm=0x22d32f0
-10-18 15:03:27.182: [ CRSCOMM][555579168] Starting send thread
-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener instantiated for: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_IPC_SOCKET_11))
-10-18 15:03:27.182: [ CRSCOMM][205494016] clsIpc: sendWork thread started.
-10-18 15:03:27.182: [ CRSCOMM][555579168] IPC Listener started listening.
-10-18 15:03:27.183: [ CRSCOMM][4097808128] IPCL thread started listening
-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: AgfwProxySrvTM
-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: OcrModuleTM
-10-18 15:03:27.183: [CLSFRAME][555579168] Starting thread model named: PolicyEngineTM
-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: SharedThreadTM
-10-18 15:03:27.184: [CLSFRAME][555579168] Starting thread model named: UiServerTM
-10-18 15:03:27.184: [CLSFRAME][555579168] New Framework state: 3
-10-18 15:03:27.185: [  CRSRPT][3699324672] Enabled
-10-18 15:03:27.185: [   CRSPE][3701425920] PE Role|State Update: old role [INVALID] new [INVALID]; old state [Not yet initialized] new [Enabling: waiting for role]
-10-18 15:03:27.185: [   CRSSE][3699324672] SE module master election disabled
-10-18 15:03:27.185: [   CRSSE][3699324672] Master Change Event; New Master Node ID:0 This Node's ID:0
-10-18 15:03:27.189: [   CRSPE][3701425920] Sent request to write event sequence number 1400000 to repository
-10-18 15:03:27.189: [   CRSPE][3701425920] Reading (1) servers
-10-18 15:03:27.189: [   CRSPE][3701425920] There are no resource types to read.
-10-18 15:03:27.189: [   CRSPE][3701425920] There are no resources to read.
-10-18 15:03:27.191: [   CRSPE][3701425920] Wrote new event sequence to repository
-10-18 15:03:27.192: [   CRSPE][3701425920] Reading (1) server pools
-10-18 15:03:27.196: [   CRSPE][3701425920] Finished reading configuration. Parsing...
-10-18 15:03:27.202: [   CRSPE][3701425920] Parsing server pools...
-10-18 15:03:27.202: [  CRSOCR][1032845088] OCR context init failure.  Error: PROCL-24: Error in the messaging layer Messaging error [18]
-10-18 15:03:27.202: [ default][1032845088] OLR initalization failured, rc=24
-10-18 15:03:27.202: [   CRSPE][3701425920] Parsed and validated SERVERPOOL: Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
-10-18 15:03:27.202: [ default][1032845088]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
-10-18 15:03:27.202: [   CRSPE][3701425920] Server pools parsed
-10-18 15:03:27.202: [ default][1032845088][PANIC] OHASD exiting; Could not init OLR
-10-18 15:03:27.202: [   CRSPE][3701425920] Server Pool Free has been registered
-10-18 15:03:27.202: [ default][1032845088] Done.
-10-18 15:03:27.202: [   CRSPE][3701425920] Cluster reboot took place.
-10-18 15:03:27.203: [   CRSPE][3701425920] Configuration has been parsed
-10-18 15:03:27.203: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET))
-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=OHASD_UI_SOCKET))
-10-18 15:03:27.204: [ default][3697223424][F-ALGO] getIpcPath returning (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
-10-18 15:03:27.204: [UiServer][3697223424] UI socket on: (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
-10-18 15:03:27.204: [UiServer][3695122176] UI comms listening for events.
-10-18 15:03:27.204: [CLSFRAME][3711932160] Module Enabling is complete

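The log above shows that OLR initialization failed, so the OLR is probably missing or corrupted. Because this is a standalone installation, restoring a local OLR backup should be enough to fix it.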
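
The relevant lines are interleaved with normal startup messages from other threads, so filtering helps. This is a minimal sketch that keeps only the failure markers seen in the excerpt above (PROCL error codes, the PANIC line, and the OHAS00106 alert). The function name `olr_failure_lines` is hypothetical, and other failures may use markers not covered here.

```shell
# Hypothetical helper: show only the lines of an ohasd.log that carry the
# failure markers seen in this case.
olr_failure_lines() {
    grep -E 'PROCL-[0-9]+|\[PANIC\]|OHAS00106' "$1"
}

# Usage, with the log path from this case:
#   olr_failure_lines /oracle/app/oracle/product/11.2.0/grid/log/anbob/ohasd/ohasd.log
```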

[grid@anbob ~]$ cd $ORACLE_HOME
[grid@anbob grid]$ cd cdata
[grid@anbob cdata]$ ls
anbob  localhost
[grid@anbob cdata]$ cd anbob
[grid@anbob anbob]$ ls
backup_20140818_113812.olr
[grid@anbob anbob]$ ls -lrt
total 5260
-rwxr-xr-x 1 grid oinstall 5386240 Aug 18  2014 backup_20140818_113812.olr

[grid@anbob anbob]$ ocrconfig -local -showbackup
PROTL-25: Manual backups for the Oracle Local Registry are not available

There is a single OLR backup from 2014 (the backups are stored under $GRID_HOME/cdata), and there are no manual backups.
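
When a directory holds several backups, the newest one can be picked by name. This is a minimal sketch, assuming the files follow the `backup_YYYYMMDD_HHMMSS.olr` pattern seen in the listing above. The function name `latest_olr_backup` is hypothetical.

```shell
# Hypothetical helper: print the newest backup_*.olr file in a directory.
# The timestamp is embedded in the file name, so a plain lexical sort
# orders the files by date. Prints nothing if no backup is found.
latest_olr_backup() {
    ls "$1"/backup_*.olr 2>/dev/null | sort | tail -n 1
}

# Usage, with the directory from this case:
#   latest_olr_backup /oracle/app/oracle/product/11.2.0/grid/cdata/anbob
```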

[grid@anbob anbob]$ ls -lrt /etc/oracle/
total 24
drwxr-xr-x 3 root oinstall 4096 Oct 18 04:58 scls_scr
drwxrwxr-x 5 root oinstall 4096 Oct 18 04:58 oprocd
-rw-r--r-- 1 root root        0 Oct 18 04:58 olr.loc.orig
-rw-r--r-- 1 root oinstall  130 Oct 18 04:58 olr.loc
-rw-r--r-- 1 root root       16 Oct 18 04:58 ocr.loc.orig
-rw-r----- 1 grid oinstall   95 Oct 18 04:58 ocr.loc
drwxrwx--- 3 root oinstall 4096 Oct 18 13:33 lastgasp
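
The olr.loc file in this listing records where the OLR file itself lives. This is a minimal sketch for reading that path, assuming the file uses an `olrconfig_loc=` key. The function name `olr_location` is hypothetical.

```shell
# Hypothetical helper: print the OLR file path recorded in an olr.loc file.
# Assumes a line of the form "olrconfig_loc=/path/to/file.olr".
olr_location() {
    sed -n 's/^olrconfig_loc=//p' "$1"
}

# Usage: check that the file the registry points at exists and is not empty.
#   ls -l "$(olr_location /etc/oracle/olr.loc)"
```

On a system where the Grid Infrastructure tools are available, `ocrcheck -local` can also be used to check the OLR itself.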

# To restore OLR

crsctl stop has
ocrconfig -local -restore  /oracle/app/oracle/product/11.2.0/grid/cdata/anbob/backup_20140818_113812.olr

crsctl start has
crsctl start res ora.cssd -init
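
Before stopping HAS it is worth confirming that the backup file is actually usable. This is a minimal dry-run sketch that checks the file and then prints the restore commands from the steps above instead of running them. The function name `olr_restore_plan` is hypothetical, and the readability and size checks are only a basic sanity test, not a validation of the backup contents.

```shell
# Hypothetical dry-run helper: verify the backup file is readable and not
# empty, then print the restore commands rather than executing them.
olr_restore_plan() {
    backup=$1
    if [ ! -r "$backup" ] || [ ! -s "$backup" ]; then
        echo "backup not readable or empty: $backup" >&2
        return 1
    fi
    echo "crsctl stop has"
    echo "ocrconfig -local -restore $backup"
    echo "crsctl start has"
}

# Usage:
#   olr_restore_plan /oracle/app/oracle/product/11.2.0/grid/cdata/anbob/backup_20140818_113812.olr
```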

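The resources are finally listed, and cssd started automatically. What remains is to start ASM manually, mount the ASM disk group, and start the database.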

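su - grid
sqlplus / as sysasm
# start the ASM instance (startup)
# mount the disk group; its name is missing in the original, so a placeholder is shown
alter diskgroup <diskgroup_name> mount;
su - oracle
# start the database (startup)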

In this case an old backup happened to be available. If there is no backup, refer to NOTE 1539020.1 to rebuild the OLR.

In an environment with fresh GI installation:
1. Deconfig the existing clusterware. This is done only on the problematic node; the other nodes should have Clusterware up and running. The command should deconfig only the clusterware configuration for the local node and should not touch or change the OCR.
# /crs/install/rootcrs.pl -deconfig -force
2. Rerun root.sh
# /root.sh

Author: ANBOB
A No Bad Oracle Blog
Original post: After OS reboot, Ohasd(cssd) start fail is due to. Thanks to the original author for sharing.
