Saturday, August 3, 2013

Oracle 11g RAC Performance

For quite some time, Oracle's kernel has been well-instrumented to capture detailed wait event information.  Google “Oracle wait events” and you’ll come up with a wide range of great papers and publications about Oracle wait events, and also see http://docs.oracle.com/cd/E11882_01/server.112/e16638/toc.htm.
 
Several cluster-specific wait events are populated in RAC environments.  
 
Wait class “Cluster” indicates cluster-related waits related to Cache Fusion:
 
SQL> desc v$system_wait_class
 Name   Null?    Type
 ----------------------------------------- -------- ----------------------------
 WAIT_CLASS_ID    NUMBER
 WAIT_CLASS#    NUMBER
 WAIT_CLASS    VARCHAR2(64)
 TOTAL_WAITS    NUMBER
 TIME_WAITED    NUMBER
 TOTAL_WAITS_FG    NUMBER
 TIME_WAITED_FG    NUMBER
 
SQL> select wait_class from v$system_wait_class
  2  /
 
WAIT_CLASS
----------------------------------------------------------------
Other
Application
Configuration
Concurrency
Commit
Idle
Network
User I/O
System I/O
Cluster
 
The views V$WAITCLASSMETRIC and V$WAITCLASSMETRIC_HISTORY show wait information for all types of waits; if you want to show Cluster-related wait metrics, join the views on WAIT_CLASS_ID.   Additionally, V$SYSTEM_EVENT and V$EVENT_NAME also have a WAIT_CLASS_ID column.  In 11g, there are 50 Cluster-related RAC waits.
In addition to Cache Fusion waits, Global Enqueue service waits are in the “Other” wait class and typically start with “ges%”.
 
 
RAC Wait Event Naming Conventions
 
- Events that start with “GCS%” and “gc%” are Cache Fusion-related waits.  In other word, they have to do with waits experienced in shipping current or consistent-read versions of blocks across instances in a RAC cluster
- Events that start with “ges%’ are related to Global Enqueue Services
 
Block-Oriented RAC Waits
 
- “gc current block 2-way” -  Indicates that a current block was sent from either the resource master requiring 1 message and 1 transfer
- “gc current block 3-way” – Indicates that a current block was forwarded to a 3rd node from which it was sent,  requiring 2 messages and one block transfer.
- “gc cr block 2-way” -  Indicates that a cr block was sent from either the resource master requiring 1 message and 1 transfer
- “gc cr block 3-way”  -Indicates that a current block was forwarded to a 3rd node from which it was sent,  requiring 2 messages and one block transfer
 
 
The block-oriented wait event statistics indicate that a block was received as either the result of a 2-way or a 3-way message, that is, the block was sent from either the resource master requiring 1 message and 1 transfer, or was forwarded to a third node from which it was sent, requiring 2 messages and 1 block transfer.
 
Message-Oriented RAC Waits
 
- “gc current grant 2-way” – Indicates that no current block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
- “gc cr grant 2-way” - – Indicates that no current block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
 
The message-oriented wait event statistics indicate that no block was received because it was not cached in any instance. Instead a global grant was given, enabling the requesting instance to read the block from disk or modify it.
 
Contention-Oriented RAC Waits
 
- gc current block busy” – Indicates a local instance made a request for a current version of a block and did not immediately receive the block.  This indicates that the block shipping was delayed on the remote instance
- “gc cr block busy” – Indicates a local instance made a request for a CR version of a block and did not immediately receive the block
- “gc buffer busy acquire” – Indicates that the local instance cannot grant access to data in the local buffer cache because a global operation on the buffer is pending and not yet completed
- “gc buffer busy release” – Indicates that a remote instance is waiting to complete access on a block to prep for shipping to a remote instance.  
 
The “gc buffer busy” events mean that there is block contention that is resulting from multiple local requests for the same block, and Oracle must queue these requests.
 
The contention-oriented wait event statistics indicate that a block was received which was pinned by a session on another node, was deferred because a change had not yet been flushed to disk or because of high concurrency, and therefore could not be shipped immediately. A buffer may also be busy locally when a session has already initiated a cache fusion operation and is waiting for its completion when another session on the same node is trying to read or modify the same data. High service times for blocks exchanged in the global cache may exacerbate the contention, which can be caused by frequent concurrent read and write accesses to the same data.
 
Load-Oriented Waits
 
- “gc current block congested”
- “gc cr block congested”
 
The load-oriented wait events indicate that a delay in processing has occurred in the GCS, which is usually caused by high load, CPU saturation and would have to be solved by additional CPUs, load-balancing, off loading processing to different times or a new cluster node.  For the events mentioned, the wait time encompasses the entire round trip from the time a session starts to wait after initiating a block request until the block arrives.
 
Measuring RAC Waits from V$SESSION_WAIT
 
set linesize 200
set pages 55
col sid format 99999
col name format a36
col p1 format 999999999 Head 'P1'
col program format a25
col p2 format 999999999 Head 'P2'
col p3 format 999999999 Head 'P3'
col pgm format a15 head 'What'
col state format a15
col wt format 9999999 head 'Wait|Time'
col WaitEvent format a38 head 'Wait Event'
 
col lc format 99999999999.99 head 'last call'
 
select  A.sid,
decode(A.event,'null event','CPU Exec',A.event) WaitEvent,
A.p1,A.p2,A.p3,
decode(A.state,'WAITING','WTG',
'WAITED UNKNOWN TIME','UNK',
'WAITED SHORT TIME','WST',
'WAITED KNOWN TIME','WKT') wait_type,
decode(A.state,'WAITING',A.seconds_in_wait,
'WAITED UNKNOWN TIME',-999,
'WAITED SHORT TIME',A.wait_time,
'WAITED KNOWN TIME',A.WAIT_TIME) wt,
round((last_call_et/60),2) lc,
substr(nvl(b.module,b.program),1,15) pgm
from    v$session_wait A,
        v$session B
where 1=1
and (A.event  like 'gc%'
or A.event like 'GC%'
or A.event like 'ge%')
and A.event not like '%remote message'
and A.event not like '%sleep'
and A.sid=B.sid
and B.status='ACTIVE'
order by 1
/
 
Sample output:
 
930
 
 
 
 
 
Measuring RAC Waits from  V$ACTIVE_SESSION_HISTORY
 
set lines 200
col samplestart format a30 head 'Begin Time'
col sampleend format a30 head 'End Time'
col aas format 999.99 head 'AAS'
col aas_event format 999.99 head 'AAS per |Event'
col dbt format 999999 head 'DB Time'
col event format a30
col time_waited format 999999 head 'Time (ms)'
col wait_pct format 999.99 head '% of DB Time'
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT
select min(cast(sample_time as date)) bt,
sysdate et
from v$active_session_history;
 
with xtimes (xdate) as
(select to_date('&_bt') xdate
from dual
union all
select xdate+(&&interval_mins/1440)
from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,'DD-MON-RR HH24:MI:SS') samplestart,
to_char(s1.xdate+(&&interval_mins/1440),'DD-MON-RR HH24:MI:SS') sampleend,
s2.event,
.001*((sum(s2.time_waited))) time_waited,
(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/count(s3.sample_id)),2) as wait_pct
from xtimes s1,
     v$active_session_history s2,
     v$active_session_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like 'gc%'
or s2.event like 'GC%'
or s2.event like 'ge%')
and s2.event not like '%remote message'
and s2.event not like '%sleep'
group by s1.xdate,s2.event
order by s1.xdate
/
 
undefine interval_mins
 
Sample output:
 
931
 

Measuring RAC Waits from DBA_HIST_ACTIVE_SESS_HISTORY
 
set lines 200
col samplestart format a30 head 'Begin Time'
col sampleend format a30 head 'End Time'
col aas format 999.99 head 'AAS'
col aas_event format 999.99 head 'AAS per |Event'
col event format a30
col time_waited format 999999 head 'Time (ms)'
col wait_pct format 999.99 head 'Wait %'
COLUMN bt NEW_VALUE _bt NOPRINT
COLUMN et NEW_VALUE _et NOPRINT
select min(cast(sample_time as date)) bt,
sysdate et
from v$active_session_history;
 
with xtimes (xdate) as
(select to_date('&_bt') xdate
from dual
union all
select xdate+(&&interval_mins/1440)
from xtimes
where xdate+(&&interval_mins/1440) < sysdate)
select to_char(s1.xdate,'DD-MON-RR HH24:MI:SS') samplestart,
to_char(s1.xdate+(&&interval_mins/1440),'DD-MON-RR HH24:MI:SS') sampleend,
s2.event,
.001*((sum(s2.time_waited))) time_waited,
10*(count(s2.sample_id)/(60*&&interval_mins)) aas_event,
10*(count(s3.sample_id)) dbt,
round(100*(sum(s2.time_waited)/1000000/(10*(count(s3.sample_id)))),2) as wait_pct
from xtimes s1,
     dba_hist_active_sess_history s2,
     dba_hist_active_sess_history s3
where s2.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s3.sample_time between s1.xdate and s1.xdate+(&&interval_mins/1440)
and s2.sample_id=s3.sample_id
and (s2.event  like 'gc%'
or s2.event like 'GC%'
or s2.event like 'ge%')
and s2.event not like '%remote message'
and s2.event not like '%sleep'
group by s1.xdate,s2.event
order by s1.xdate
/
 
undefine interval_mins
 
Sample output:
 
932
Measuring RAC Waits from Enterprise Manager
 
From the main performance page in Enterprise Manager, you can see whether you’ve a RAC-related cluster load as indicated below:
 
933
 
Drilling down on the details shows some additional information:
 
934
 
Measuring RAC Waits with ADDM
 
Like anything performance-related, ADDM provides great information about RAC-related performance issues, if they exist.  For some reason DBAs don’t use ADDM as much as they should – many think (including me) it should be one of your first tools for diagnosing historical performance problems:
 
935
 
936
 
937
 
938
 
Measuring RAC Waits with AWR
 
AWR reports show detailed cluster and RAC wait related information as well:
 
939
 
 

 940
 
 
941

No comments: