β

Performance Tunning: cursor: mutex X & cursor: mut

ANBOB 43 阅读

朋友那一套12C R2多组户架构3节点RAC数据库反应数据库响应缓慢,找到我,因远程不方便,只拿到事后一份AWR,数据库出现较高的cursor: mutex X 和cursor: mutex S, 12.2目前的资料较少,这个问题个人感觉BUG的可能性大,但MOS中未发现匹配的NOTE.这里仅记录一下。

DB Name	DB Id	Unique Name	Role	Edition	Release	RAC	CDB
ANBOB	3056822707	ANBOB	PRIMARY	EE	12.2.0.1.0	YES	YES


			Snap Id	Snap Time	Sessions	Cursors/Session	Instances	Pluggable Databases Open
Begin Snap:	3264	03-Aug-17 11:00:16	230	1.9				3			4
End Snap:	3265	03-Aug-17 12:00:04	1034	2.1			3			4
Elapsed:	 	59.79 (mins)	 	 	 	 
DB Time:	 	57,548.34 (mins)	 


	Per Second	Per Transaction	Per Exec	Per Call
DB Time(s):	962.4	248.5	9.51	9.73
DB CPU(s):	5.1	1.3	0.05	0.05
Background CPU(s):	0.6	0.1	0.01	0.00
Redo size (bytes):	190,081.8	49,081.9	 	 
Logical read (blocks):	369,483.3	95,406.1	 	 
Block changes:	1,458.0	376.5	 	 
Physical read (blocks):	3,537.2	913.4	 	 
Physical write (blocks):	1,765.4	455.8	 	 
Read IO requests:	126.6	32.7	 	 
Write IO requests:	89.7	23.2	 	 
Read IO (MB):	27.6	7.1	 	 
Write IO (MB):	13.8	3.6	 	 
IM scan rows:	0.0	0.0	 	 
Session Logical Read IM:	0.0	0.0	 	 
Global Cache blocks received:	3,082.4	795.9	 	 
Global Cache blocks served:	2,789.8	720.4	 	 
User calls:	98.9	25.5	 	 
Parses (SQL):	53.2	13.7	 	 
Hard parses (SQL):	10.3	2.7	 	 
SQL Work Area (MB):	11.0	2.9	 	 
Logons:	1.7	0.4	 	 
Executes (SQL):	101.2	26.1	 	 
Rollbacks:	0.1	0.0	 	 
Transactions:	3.9	 	 


Event Waits Total Wait Time (sec) Avg Wait % DB time Wait Class
cursor: mutex X 20,558,713 3.3M 159.05ms 94.7 Concurrency
cursor: mutex S 14,188,123 85.5K 6.03ms 2.5 Concurrency
enq: TX – row lock contention 218 29.9K 137.16 s 0.9 Application
DB CPU 18.4K 0.5
gc buffer busy acquire 4,853,464 3703.3 763.02us 0.1 Cluster
gc cr block busy 2,036,172 2445.8 1.20ms 0.1 Cluster
gc cr block 2-way 4,170,540 1718.3 412.01us 0 Cluster
inactive session 1,379 1379.8 1000.60ms 0 Other
gc cr block 3-way 2,071,664 1102.1 531.98us 0 Cluster
gc cr multi block mixed 551,888 554.5 1.00ms 0 Cluster

time model

Statistic Name Time (s) % of DB Time % of Total CPU Time
parse time elapsed 3,401,559.92 98.51
failed parse elapsed time 2,578,581.09 74.68
sql execute elapsed time 49,749.93 1.44
DB CPU 18,373.74 0.53 90.26
connection management call elapsed time 393.19 0.01
hard parse elapsed time 233.07 0.01
hard parse (sharing criteria) elapsed time 75.11 0
PL/SQL compilation elapsed time 11.99 0
sequence load elapsed time 4.25 0

SQL ordered by Version Count

Version Count Executions SQL Id SQL Module PDB Name SQL Text
7,873 20 f0h5rpzmhju11 QZDB3PDB select SYS_CONTEXT(‘USERENV’, …
3,225 awxuv71j1wzbh XXHCPDB select distinct s.sowner, s.vn…
2,666 f0h5rpzmhju11 XXHCPDB select SYS_CONTEXT(‘USERENV’, …
2,187 gtuz08rcbk69y XXHCPDB select distinct s.sowner, s.vn…
1,466 7n9pz49qw4f39 XXHCPDB select decode(upper(failover_m…
1,038 9zg9qd9bm4spu XXHCPDB update user$ set spare6=DECODE…
686 6nas5twtqzkk0 CDB$ROOT select /*+ no_monitor no_state…
207 25 121ffmrc95v7g CDB$ROOT select i.obj#, i.ts#, i.file#,…
93 6 a1xgxtssv5rrp sqlplus@node1 (TNS V1-V3) QZDB3PDB select sum(used_blocks), ts.ts…
93 6 a1xgxtssv5rrp sqlplus@node1 (TNS V1-V3) XXHCPDB select sum(used_blocks), ts.ts…
93 6 a1xgxtssv5rrp sqlplus@node1 (TNS V1-V3) AJZPDB select sum(used_blocks), ts.ts…

Mutex Sleep Summary
ordered by number of sleeps desc

Mutex Type Location Sleeps Wait Time (ms)
Cursor Parent kkscsAddChildNode [KKSPRTLOC28] 334,812,274 2,336,374,941
Cursor Parent kkscsPruneChild [KKSPRTLOC27] 81,186,150 30,246,997
Library Cache kgllkc1 57 722,195 110,128
Library Cache kgllkdl1 85 211,650 31,718
Cursor Pin kkslce [KKSCHLPIN2] 75,333 63,530
Cursor Parent kksheqd [KKSPRTLOC12] 1,966 12,282
Cursor Parent kksfbc [KKSPRTLOC3] 1,214 5,765
Library Cache kglReleaseHandleReference 124 1,114 148
f0h5rpzmhju11
select SYS_CONTEXT('USERENV', 'SERVER_HOST'), SYS_CONTEXT('USERENV', 'DB_UNIQUE_NAME'), SYS_CONTEXT('USERENV', 'INSTANCE_NAME'), SYS_CONTEXT('USERENV', 'INSTANCE'), STARTUP_TIME, SYS_CONTEXT('USERENV', 'DB_DOMAIN'), SYS_CONTEXT('USERENV', 'SERVICE_NAME') from v$instance
awxuv71j1wzbh
select distinct s.sowner, s.vname from sys.snap$ s, sys.snap_reftime$ r where r.masobj# in (73811) and s.sowner = r.sowner and s.vname = r.vname and bitand(s.flag3, 67108864) = 67108864

gtuz08rcbk69y
select distinct s.sowner, s.vname from sys.snap$ s, sys.snap_reftime$ r where r.masobj# in (73811) and s.sowner = r.sowner and s.vname = r.vname and s.mowner = r.mowner and s.master = r.master and bitand(s.flag3, 67108864) = 67108864

note:
从AWR可以看出数据库的解析出现了严重的问题,cursor: mutex X占了94%的DB TIME同时cursor: mutex S伴随出现。同时mutex 函数 kkscsAddChildNode [KKSPRTLOC28]和kkscsPruneChild [KKSPRTLOC27] 增加和裁剪子游标时产生了问题,数据库存在high version SQL, version高于200的都是数据库的 递归SQL ,最大的达到了7000多。同时ASH 中显示也是该上面的3个SQL.

关于MUTEX

mutex很像latch,mutex 是一种低级的粒度串行机制控制访问SGA中的共享数据结构, 但是相比mutex更小更轻量并且更快, get时少几倍的指令调用, 在10g r2 版本引入,属于KGX (Kernel Generic Mutex Module). 每个结构对象都有自己的mutex保护,父游标和子游标都有各自的mutex; 每个结构又被多个的mutex保护,在11G R2中有超过60种原因会导致不能重用已存在的child cursor,hard parse产生新的child cursor;soft parse选择已存在的child cursor,查找已存在的child cursor时,持有的是cursor: mutex S,  而增加新的Child Cursor持有的是cursor: mutex X.

分析此类问题需要收集的信息:

• AWR Report
• ASH Report
• Hanganalyze trace
• Systemstate dump

What are Wait Events (on Mutexes)?
•Somebody requests a Mutex
•Can not get it by spinning
•And thus, sleeps
•Sleeping is recorded as wait time
•Spinning is not recorded as wait-, but as CPU time

Cursor: Mutex X情景:

Wants: Exclusive mode on Mutex of Parent / Child
• Building a new cursor under a parent
• Although this operation is cheaper, building many cursors under a parent cursor is not recommended.
• Capture SQL bind data(PEEK)
• Build or Update statistics blocks,Modify cursor-related statistics
• Mutex is in the parent cursor.

Cursor: mutex S情景:

Wants: Shared mode on Mutex of Parent / Child

• Change the reference count (“in flux”)= “new guy is interested / spinning”
• Parent examination
• When finding a cursor to execute, the parent must be examined. The examination of the parent is performed using the mutex, cursor: mutex S.
• When the parent cursor has many child cursors involved, this waits will come as the server process has to traverse the entire list of child cursors under the parent to find a match.
• Mutex is in the parent cursor.

Complex – cursor: mutex S/X情景:
•Root-caused by invalidated child cursor(s) => Too many cursor objects in Library Cache
Diagnosis:
•Oracle Wait Interface
•10046 Level 12 session trace (=sql_trace event)  根据bind oacdty检查绑定变量的类型是否存在bind mismatch
•v$sql_shared_cursor plus cursor trace [296377.1]

SQL不能共享的原因 :

 select * from
  (select sql_id, nonshared_reason, count(*) from v$sql_shared_cursor
  unpivot
  (nonshared_value for nonshared_reason in (
  UNBOUND_CURSOR as 'UNBOUND_CURSOR',
  SQL_TYPE_MISMATCH as 'SQL_TYPE_MISMATCH',
  OPTIMIZER_MISMATCH as 'OPTIMIZER_MISMATCH',
  OUTLINE_MISMATCH as 'OUTLINE_MISMATCH',
  STATS_ROW_MISMATCH as 'STATS_ROW_MISMATCH',
  LITERAL_MISMATCH as 'LITERAL_MISMATCH',
  FORCE_HARD_PARSE as 'FORCE_HARD_PARSE',
  EXPLAIN_PLAN_CURSOR as 'EXPLAIN_PLAN_CURSOR',
  BUFFERED_DML_MISMATCH as 'BUFFERED_DML_MISMATCH',
  PDML_ENV_MISMATCH as 'PDML_ENV_MISMATCH',
  INST_DRTLD_MISMATCH as 'INST_DRTLD_MISMATCH',
  SLAVE_QC_MISMATCH as 'SLAVE_QC_MISMATCH',
  TYPECHECK_MISMATCH as 'TYPECHECK_MISMATCH',
  AUTH_CHECK_MISMATCH as 'AUTH_CHECK_MISMATCH',
  BIND_MISMATCH as 'BIND_MISMATCH',
  DESCRIBE_MISMATCH as 'DESCRIBE_MISMATCH',
  LANGUAGE_MISMATCH as 'LANGUAGE_MISMATCH',
  TRANSLATION_MISMATCH as 'TRANSLATION_MISMATCH',
  BIND_EQUIV_FAILURE as 'BIND_EQUIV_FAILURE',
  INSUFF_PRIVS as 'INSUFF_PRIVS',
  INSUFF_PRIVS_REM as 'INSUFF_PRIVS_REM',
  REMOTE_TRANS_MISMATCH as 'REMOTE_TRANS_MISMATCH',
  LOGMINER_SESSION_MISMATCH as 'LOGMINER_SESSION_MISMATCH',
  INCOMP_LTRL_MISMATCH as 'INCOMP_LTRL_MISMATCH',
  OVERLAP_TIME_MISMATCH as 'OVERLAP_TIME_MISMATCH',
  EDITION_MISMATCH as 'EDITION_MISMATCH',
  MV_QUERY_GEN_MISMATCH as 'MV_QUERY_GEN_MISMATCH',
  USER_BIND_PEEK_MISMATCH as 'USER_BIND_PEEK_MISMATCH',
  TYPCHK_DEP_MISMATCH as 'TYPCHK_DEP_MISMATCH',
  NO_TRIGGER_MISMATCH as 'NO_TRIGGER_MISMATCH',
  FLASHBACK_CURSOR as 'FLASHBACK_CURSOR',
  ANYDATA_TRANSFORMATION as 'ANYDATA_TRANSFORMATION',
  PDDL_ENV_MISMATCH as 'PDDL_ENV_MISMATCH',
  TOP_LEVEL_RPI_CURSOR as 'TOP_LEVEL_RPI_CURSOR',
  DIFFERENT_LONG_LENGTH as 'DIFFERENT_LONG_LENGTH',
  LOGICAL_STANDBY_APPLY as 'LOGICAL_STANDBY_APPLY',
  DIFF_CALL_DURN as 'DIFF_CALL_DURN',
  BIND_UACS_DIFF as 'BIND_UACS_DIFF',
  PLSQL_CMP_SWITCHS_DIFF as 'PLSQL_CMP_SWITCHS_DIFF',
  CURSOR_PARTS_MISMATCH as 'CURSOR_PARTS_MISMATCH',
  STB_OBJECT_MISMATCH as 'STB_OBJECT_MISMATCH',
  CROSSEDITION_TRIGGER_MISMATCH as 'CROSSEDITION_TRIGGER_MISMATCH',
  PQ_SLAVE_MISMATCH as 'PQ_SLAVE_MISMATCH',
  TOP_LEVEL_DDL_MISMATCH as 'TOP_LEVEL_DDL_MISMATCH',
  MULTI_PX_MISMATCH as 'MULTI_PX_MISMATCH',
  BIND_PEEKED_PQ_MISMATCH as 'BIND_PEEKED_PQ_MISMATCH',
  MV_REWRITE_MISMATCH as 'MV_REWRITE_MISMATCH',
  ROLL_INVALID_MISMATCH as 'ROLL_INVALID_MISMATCH',
  OPTIMIZER_MODE_MISMATCH as 'OPTIMIZER_MODE_MISMATCH',
  PX_MISMATCH as 'PX_MISMATCH',
  MV_STALEOBJ_MISMATCH as 'MV_STALEOBJ_MISMATCH',
  FLASHBACK_TABLE_MISMATCH as 'FLASHBACK_TABLE_MISMATCH',
  LITREP_COMP_MISMATCH as 'LITREP_COMP_MISMATCH',
  PLSQL_DEBUG as 'PLSQL_DEBUG',
  LOAD_OPTIMIZER_STATS as 'LOAD_OPTIMIZER_STATS',
  ACL_MISMATCH as 'ACL_MISMATCH',
  FLASHBACK_ARCHIVE_MISMATCH as 'FLASHBACK_ARCHIVE_MISMATCH',
  LOCK_USER_SCHEMA_FAILED as 'LOCK_USER_SCHEMA_FAILED',
  REMOTE_MAPPING_MISMATCH as 'REMOTE_MAPPING_MISMATCH',
  LOAD_RUNTIME_HEAP_FAILED as 'LOAD_RUNTIME_HEAP_FAILED',
  HASH_MATCH_FAILED as 'HASH_MATCH_FAILED',
  PURGED_CURSOR as 'PURGED_CURSOR',
  BIND_LENGTH_UPGRADEABLE as 'BIND_LENGTH_UPGRADEABLE',
  USE_FEEDBACK_STATS as 'USE_FEEDBACK_STATS'
  ))
  where nonshared_value = 'Y'
  group by sql_id, nonshared_reason
  )
where sql_id = '&sqlid' ;

模拟一种 cursor: mutex S/X wait 情况:
•Generating 64 child cursors for one SQL_ID
•Accessing them 20x parallel
➔ Delay to create new children (X)
➔ Delay to select good child (S)

Similar Problem with CHAR binds
•Bind buffers are size 32, 128, 2000 or 4000 bytes
•Changing CHAR bind length invalidates
•Reason BIND_LENGTH_UPGRADEABLE= 4^n cursor versions

情况较复杂应该根据自己的实际情况分析,可能的一些解决方案如下:

• One Quick Fix for cursor: mutex S/X System is loaded with heavy mutex waits due to high number of cursors (=version count)
=> frequently flush this cursor with dbms_shared_pool.purge (look out for new parsing issues = CPU)
•  One solution for cursor: mutex S/X Application uses jdbc setter methods now properly on INTEGER column (=2)
#•setNUMBER(2) => Bind Var. is NUMBER
#setNULL(2, java.sql.Types.INTEGER)
=> Bind Var. is NUMBER
•Check how the application does handle bind variables Avoid BIND MISMATCH at (nearly) any cost
•Reduce the number of cursor versions below 100 More will lead to overhead
•Look for matching Oracle bugs in your RDBMS release
•Upgrade to 11.2.0.3 or higher 11.2.0.2 is worst version for cursor issues IMHO
• Increase the SGA if the shared pool is too small and heavy hard parse occurs due to reloads.
• For 11g, make sure cursor_sharing is not similar, as it has been deprecated. This may also cause cursor: mutex S waits:Recommended to use FORCE.
• Increase SESSION_CACHED_CURSORS to higher value if there are too many parses and waits on cursor: pin S.
• Use CURSOR_SHARING to FORCE to avoid waits on cursor: mutex S & cursor: mutex X due to huge version counts.
• If possible run with _LIBRARY_CACHE_ADVICE = FALSE and see if that helps
• On HP platforms ensure HPUX_SCHED_NOAGE parameter is set to 178.
• Obsoleting parent cursors if VERSION_COUNT exceeds a threshold (cursor: mutex S).
• When the child cursors grow beyond certain count be it 20 or 100, it obsoletes the parent cursors. In order to activate this enhancement bug set following:

For 11.2.0.2, apply the 11.2.0.2.2 PSU 12431716 atleast and do the following.
• _CURSOR_FEATURES_ENABLED to 1026
• event 106001 with value 100 (as the parameter _CURSOR_OBSOLETE_THRESHOLD is not present)
• Set , event= 106001 trace name context forever, level 100 in init.ora
For 11.2.0.3 and above, do the following.
• _CURSOR_OBSOLETE_THRESHOLD to 100 (default. this is the number of child cursor after which we obsolete it)

在有些情况下是存在真的hot对象,可以上面的方法都无法解决,需要手动打散:

To identify if the issue is a hot mutex use the following SQL :

SQL> select KGLNAOBJ, KGLNAOWN, KGLHDNSP,
KGLOBTYP, KGLNAHSV, KGLOBT23, KGLOBT24 
from X$KGLOB
where KGLOBT23 > 100000 or KGLOBT24 > 100000 
order by KGLOBT24;

Note:
相隔几分钟运行上面的SQL,如果KGLOBT23 或KGLOBT24 短时间内暴涨就说明该对象是HOT OBJ.或者从ASH或v$session数据根据当时的p1值得到。

Mutex wait Parameter Description
P1 Hash value of cursor
P2 Mutex value (top 2 bytes contain SID holding mutex in exclusive mode, and bottom two bytes usually hold the value 0)
P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps

-- find hot obj
select KGLNAOBJ, KGLNAOWN, KGLHDNSP, KGLOBTYP,
KGLNAHSV 
from X$KGLOB
 where KGLNAHSH= {value of P1};

-- find blocker
SELECT count (*),
to_number(substr(to_char(rawtohex(p2raw)), 1, 8),'XXXXXXXX') blocking_sid FROM v$session
WHERE event = 'library cache: mutex X‘
group by to_number(substr(to_char(rawtohex(p2raw)),1, 8), 'XXXXXXXX');

-- hot copy
 You can apply the hot copy fix from 11.2.0.2 onwards w/o anyinterim patches.
• Set the following parameters in init.ora:-
_KGL_HOT_OBJECT_COPIES={a}


_KGL_DEBUG="name='{b}' schema='{c}' namespace={d} debug=33554432"
or (where the object is a cursor)
_KGL_DEBUG="hash='{e}' debug=33554432"

NOTE: Only with Oracle Support Guidance

Where:
{a} - half the CPU count
{b} - KGLNAOBJ from SQL
{c} - KGLNAOWN from SQL
{d} - KGLHDNSP from SQL
{e} - KGLNAHSV from SQL.

The same can be performed using dbms_shared_pool APIs in 11.2.0.2+.
SQL> EXEC DBMS_SHARED_POOL.MARKHOT('username','object_name',KGLHDNSP);
<or, for a cursor>
SQL> EXEC DBMS_SHARED_POOL.MARKHOT('KGLNAHSV', 0);

NOTE: dbms_shared_pool.markhot只是实例周期的改变,重启后失效,需要每次重启后重复手动配置.

12.2 这个案例的临时解决方案

上面介绍了那么多,再回到这个12.2的案例, 这么高的mutex等待与跟high version脱离不了关系,简单咨询客户后确认了没有对12.2默认参数做调整,sql verison能达到7000多,我查询了12.2的隐藏参数_CURSOR_OBSOLETE_THRESHOLD原来现在默认已加大到了8192, 该参数在11G中也只是100,12C R1版本是也在1024. 并且high version的SQL都是递归内部SQL,无法调应用解决,于是让客户调整_CURSOR_OBSOLETE_THRESHOLD参数调到200,并重启实例生效。问题消失了,并且上面的那个high version的SQL从TOP中消失。

当然如果再深入查,至少要在当时做SSD或ash 中找到进程信息,或尝试禁用_optim_peek_user_binds, sql ACS\ECS这些特性,或者是个未知的BUG, 我相信未来肯定会有客户同样遇到,希望提供一个分析的方向。

作者:ANBOB
A No Bad Oracle Blog
原文地址:Performance Tunning: cursor: mutex X & cursor: mut, 感谢原作者分享。

发表评论