ORA-00494: enqueue [CF] held for too long (more than 900 seconds) cause instance crash_综合

故障现象：alert 日志
Restarting dead background process MMON
Tue May 16 17:24:51 2017
Process 0x0x6b0f14da8 appears to be hung while dumping
Current time = 465593890, process death time = 465533027 interval = 60000
Attempting to kill process 0x0x6b0f14da8 with OS pid = 61258
OSD kill succeeded for process 0x6b0f14da8

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.1.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production
Time: 16-5?? -2017 17:24:51
Tracing not turned on.
Tns error struct:
ns main err code: 12535

Tue May 16 17:24:52 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_m000_54933.trc:
ORA-12751: cpu time or run time policy violation
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.17.0.52)(PORT=65434))
Restarting dead background process MMON
Tue May 16 17:24:51 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc0_16428.trc (incident=258198):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Tue May 16 17:24:51 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc2_50083.trc (incident=258214):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258198/LHDW_arc0_16428_i258198.trc
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258214/LHDW_arc2_50083_i258214.trc
Tue May 16 17:24:51 2017
Thread 1 advanced to log sequence 86188 (LGWR switch)
Current log# 10 seq# 86188 mem# 0: /oracle/ora11g/oradata/LHDW/redo10.log
Current log# 10 seq# 86188 mem# 1: /oracle/ora11g/flash_recovery_area/LHDW/redo10a.log
Tue May 16 17:04:56 2017
LNS: Standby redo logfile selected for thread 1 sequence 86188 for destination LOG_ARCHIVE_DEST_2
Tue May 16 17:12:23 2017

***********************************************************************
Tue May 16 17:21:50 2017
AUD: Audit Commit Delay exceeded, written a copy to OS Audit Trail

Fatal NI connect error 12170.
Tue May 16 17:24:51 2017
Process 0x0x6b0f14da8 appears to be hung while dumping
Current time = 465593890, process death time = 465533027 interval = 60000
Attempting to kill process 0x0x6b0f14da8 with OS pid = 61258
OSD kill succeeded for process 0x6b0f14da8

Tue May 16 17:24:52 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_m000_54933.trc:
ORA-12751: cpu time or run time policy violation
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

Restarting dead background process MMON
Tue May 16 17:24:51 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc0_16428.trc (incident=258198):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Tue May 16 17:24:51 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc2_50083.trc (incident=258214):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258198/LHDW_arc0_16428_i258198.trc
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258214/LHDW_arc2_50083_i258214.trc
Tue May 16 17:24:51 2017
Tue May 16 17:24:52 2017
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc1_50394.trc (incident=258205):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Errors in file /oracle/ora11g/diag/rdbms/lhdw/LHDW/trace/LHDW_arc3_16457.trc (incident=258222):
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 4595'
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258222/LHDW_arc3_16457_i258222.trc
Incident details in: /oracle/ora11g/diag/rdbms/lhdw/LHDW/incident/incdir_258205/LHDW_arc1_50394_i258205.trc
Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=16428)
Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=50083) by killing session 586.1

Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=16428)
by killing session 586.1
by terminating the process
Killing enqueue blocker (pid=4595) on resource CF-00000000-00000000 by (pid=50083)ARC0 (ospid: 16428): terminating the instance due to error 2103 by terminating the process
Tue May 16 17:25:04 2017
Instance terminated by ARC0, pid = 16428

[oracle@hadoop-21 trace]$ ls -trl *4595*
-rw-r----- 1 oracle oinstall 894 May 16 17:00 LHDW_ckpt_4595.trm
-rw-r----- 1 oracle oinstall 62781 May 16 17:00 LHDW_ckpt_4595.trc

##CAUSE
I/O contention or slowness leads to control file enqueue timeout.
One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :

From the AWR:
1) high "log file parallel write" and "control file sequential read" waits
2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)
3) very high %iowait : 98.57%.
4) confirmed IO peak during that time 本次故障原因

Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.

###SOLUTION
Since the problem is caused by I/O contention/slow, improving the I/O performance will avoid such issue.

It is an expected behavior that the control file enqueue requestor terminates itself with ORA-239 if the state machine failed to kill the blocker. In most cases, crashing the instance will prevent cluster-wide hang. This is the default setting.

There is a workaround to resolve the issue:
Disable kill blocker code via setting _kill_controlfile_enqueue_blocker=false.

alter system set "_kill_controlfile_enqueue_blocker"=false;

_kill_controlfile_enqueue_blocker is NOT available in 10.2.0.1, it's available from 10.2.0.2.

[oracle@hadoop-21 trace]$ dmesg|grep oracle
dracut: inactive '/dev/lhdw_oracle/lvm_sdb1' [9.91 TiB] inherit
EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283886 by oracle; performance will be poor.
INFO: task oracle:4583 blocked for more than 120 seconds.
oracle D 000000000000001d 0 4583 1 0x00000080
INFO: task oracle:4585 blocked for more than 120 seconds.
oracle D 0000000000000001 0 4585 1 0x00000080---------------->>>>ckpt 进程挂起 120秒，无法对控制文件进行io操作，超时持有 cf enqueue
INFO: task oracle:4587 blocked for more than 120 seconds.
oracle D 000000000000000a 0 4587 1 0x00000080
INFO: task oracle:4589 blocked for more than 120 seconds.
oracle D 0000000000000010 0 4589 1 0x00000080
INFO: task oracle:4591 blocked for more than 120 seconds.
oracle D 0000000000000000 0 4591 1 0x00000080
INFO: task oracle:56454 blocked for more than 120 seconds.
oracle D 0000000000000000 0 56454 1 0x00000080
INFO: task oracle:2323 blocked for more than 120 seconds.
oracle D 0000000000000000 0 2323 1 0x00000080
INFO: task oracle:4835 blocked for more than 120 seconds.
oracle D 0000000000000008 0 4835 1 0x00000080
EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283837 by oracle; performance will be poor.
EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283406 by oracle; performance will be poor.
EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283408 by oracle; performance will be poor.
EXT4-fs (dm-3): Unaligned AIO/DIO on inode 49283859 by oracle; performance will be poor.

[oracle@hadoop-21 trace]$ find / -inum 49283837 2 > /dev/null

##_controlfile_enqueue_timeout
This parameter is used to set the timeout limit for CF enqueue. By default, it is 900 seconds.
This timeout can be reduced using the above parameter so that the process dont have to wait for 15 minutes to get the enqueue.

##_kill_controlfile_enqueue_blocker

_kill_controlfile_enqueue_blocker FALSE
enable killing controlfile enqueue blocker on timeout

This parameter enables killing controlfile enqueue blocker on timeout.

TRUE. Default value. Enables this mechanism and kills blocker process in CF enqueue.
FALSE. Disables this mechanism and no blocker process in CF enqueue will be killed.

This kill blocker interface / ORA-494 was introduced in 10.2.0.4. This new mechanism will kill *any* kind of blocking process, non-background or background.

The difference will be that if the enqueue holder is a non-background process, even if it is killed, the instance can function without it.
In case the holder is a background process, for example the LGWR, the kill of the holder leads to instance crash.

If you want to avoid the kill of the blocker (background or non-background process), you can set _kill_controlfile_enqueue_blocker=false.
This means that no type of blocker will be killed anymore although the resolution to this problem should focus on why the process is holding the enqueue for so long.
Also, you may prefer to only avoid killing background processes, since they are vital to the instance, and you may want to allow the killing of non-background blockers.

In order to prevent a background blocker from being killed, the following init.ora parameter can be set to 1 (default is 2).

_kill_enqueue_blocker=1

#for 11g：
_kill_enqueue_blocker 2
if greater than 0 enables killing enqueue blocker

_kill_enqueue_blocker = { 0 | 1 | 2 | 3 }
0. Disables this mechanism and no foreground or background blocker process in enqueue will be killed.
1. Enables this mechanism and only kills foreground blocker process in enqueue while background process is not affected.
2. Enables this mechanism and only kills background blocker process in enqueue.
3. Default value. Enables this mechanism and kills blocker processes in enqueue.

alter system set "_kill_enqueue_blocker"=1 scope=spfile;

请勿在rac 环境修改这个参数：因为：The reason why the blocker interface with ORA-494 is kept is because, in most cases, customers would prefer crashing the instance than having a cluster-wide hang.

With this parameter, if the enqueue holder is a background process, then it will not be killed, therefore the instance will not crash. If the enqueue holder is not a background process, the new 10.2.0.4 mechanism will still try to kill it.

可以杀掉任何oracle的后台进程

总结：

os 调整，调整 sysctl.conf 文件

sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5

db 调整

alter system set "_kill_controlfile_enqueue_blocker"=false;

alter system set "_kill_enqueue_blocker"=1 scope=spfile;