Monday, May 3, 2010

Diagnosing a hung Oracle database

I'm not sure where I got this - maybe Metalink, maybe a forum post or possibly another blog.

Spool out relevant V$ VIEWS

WHILE the database is hanging, run the following queries:
SPOOL v_views.log;

SELECT *
FROM v$parameter;

SELECT class, value, name
FROM v$sysstat;

SELECT sid, id1, id2, type, lmode, request
FROM v$lock;

SELECT l.latch#, n.name, h.pid, l.gets, l.misses,
l.immediate_gets, l.immediate_misses, l.sleeps
FROM v$latchname n, v$latchholder h, v$latch l
WHERE l.latch# = n.latch#
AND l.addr = h.laddr(+);

SELECT *
FROM v$session_wait
ORDER BY sid;

/* repeat last query 3 times - we want to see who's repeatedly waiting*/

SPOOL OFF;

If a specific session hangs do these queries:

SPOOL session.log;

- Get the sid from the pid or vice versa:
SELECT s.sid, p.spid
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND ... < p.spid = or perhaps
s.sid = >

- What SQL statement is being executed?
SELECT s.sid, s.status, q.sql_text
FROM v$session s, v$sqltext q
WHERE s.sql_hash_value = q.hash_value
AND s.sql_address = q.address
AND s.sid =
order by q.piece;

-Get output from V$SESSION_WAIT
column sid format 990
column seq# format 99990
column wait_time heading 'WTime' format 99990
column event format a30
column p1 format 9999999990
column p2 format 9999999990
column p3 format 9990
select sid,event,seq#,p1,p2,p3,wait_time from V$session_wait
where sid=
order by sid;
*** It is needed to run this statement a few times to see if the wait events actually change

- What is LOCKWAIT column set to in V$SESSION?
SELECT lockwait
FROM v$session
WHERE sid = ;

- If not null, who's holding what lock on which object?
col Username format A15
col Sid format 9990 heading SID
col Type format A4
col Lmode format 990 heading 'HELD'
col Request format 990 heading 'REQ'
col Id1 format 9999990
col Id2 format 9999990

select SN.Username, M.Sid, M.Type,
DECODE(M.Lmode, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row
Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',
LTRIM(TO_CHAR(Lmode,'990'))) Lmode,
DECODE(M.Request, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row
Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',
LTRIM(TO_CHAR(M.Request, '990'))) Request,
M.Id1, M.Id2 from V$SESSION SN, V$LOCK M
WHERE (SN.Sid = M.Sid and M.Request ! = 0)
or (SN.Sid = M.Sid and M.Request = 0 and Lmode != 4 and (id1, id2)
in (select S.Id1, S.Id2 from V$LOCK S where Request != 0 and S.Id1
= M.Id1 and S.Id2 = M.Id2) ) order by Id1, Id2, M.Request;

- What is LATCHWAIT column set to in V$PROCESS?
SELECT latchwait
FROM v$process
WHERE spid = ;SELECT latchwait
FROM v$process
WHERE spid = ;

- If not null, who's holding what latch?
column name format a32 heading 'LATCH NAME'
column pid heading 'HOLDER PID'
select c.name,a.addr,a.gets,a.misses,a.sleeps,
a.immediate_gets,a.immediate_misses,b.pid
from v$latch a, v$latchholder b, v$latchname c
where a.addr = b.laddr(+) and a.latch# = c.latch#
and c.name like '&latch_name%' order by a.latch#;
- Did the lock holder or latch holder shut off his/her terminal without logging off? This can leave a zombie shadow
process holding the resource.
Kill it with "alter system kill session ''".
If on sqlnet 2.1 or above, try setting sqlnet.expire_time in the sqlnet.ora on the server side. (For more info, see page A-16 of "Understanding SQL*NET" v2.1).

- If session is not frozen but slow, get session-specific statistics:
SELECT s.sid, s.value, t.name
FROM v$sesstat s, v$statname t
WHERE s.statistic# = t.statistic#
AND s.sid = ;

- If session is extremely slow or truly frozen, get session-specific wait info:
SELECT *
FROM v$session_wait
where sid = ;


- repeat above query 3 times.

SPOOL OFF;

- If this is distributed transaction run the following statements from EVERY NODE involved in the transaction:
SPOOL 2pc.log;
SELECT * FROM dba_2pc_pending;
SELECT * FROM pending_sessions$;
SELECT * FROM pending_sub_sessions$;
SELECT * FROM dba_2pc_neighbors;
SPOOL OFF;

The following spool will help determine the cause in MTS.
SPOOL mts.log
- How busy the dispatchers are:
select name,network,status,
(busy /(busy + idle)) * 100 "% of time busy" from v$dispatchers
- query the shared servers view V$SHARED_SERVERS:
select name,status,requests, (busy /(busy + idle)) * 100 "% of time busy"
from v$shared_servers
Spool off
4. Gather OS-Level Info

***RDA output gives the most comprehensive data on OS and database configuration

For information about how to download, install and run the RDA scripts refer to Note 314422.1 "Remote Diagnostic Agent (RDA) 4.0 - Overview"

a. Briefly describe your architecture. Include number of CPUs / disks. Are you using raw devices, NFS mounted
files, shared disks, RAM disks...Are you mirroring? etc.

b. Measure unusual OS level activity: excessive CPU or i/o, paging, swapping, etc. There are other monitor tools such
as TOP. Sun has utilities on their webpage that can be downloaded for monitoring the system.
If you're not comfortable with the utilities listed below, be sure to do step 3 and, if you have isolated a troubled process, the v$sesstat output in step.
- On unix: use SAR, VMSTAT and NETSTAT (man sar)
- You can use truss, trace (man truss) or oradbx (call support) for tracing specific processes.
- You can also unix debuggers such as dbx, xdb, adb or truss to dump process activity. Example:
dbx -a /*attach to hanging process*/
where /*dump Oracle stack */
detach /*detach from process*/

truss -o /tmp/mytruss -fae -p
- On vms: use MONITOR (help monitor)
- you can use analyze/process (help analyze) or orambx (call support) for tracing specific processes.

- On novl: use LOAD MONITOR (gui - no online documentation)

- On Windows NT: use Performance Monitor, Event Monitor, Dr. Watson (see Resource kit for documentation).
- use pview for monitoring specific processes (resource kit).

c. Check the system log for additional information. On many platforms it is at /var/adm the files are the message files.
5. Get SYSTEMSTATE and HANGANALYZE Dumps
This creates a large trace file in the user_dump_dest (30M or more is not unusual).
Note: the init.ora parameter MAX_DUMP_FILE_SIZE controls the maximum trace file size. Using Oradebug and setting unlimit will allow a complete dump which we will need.

Do this step for sure if the entire database is frozen or nearly frozen and if this condition came on suddenly and there are no archive errors in the alert log. Please note: As systemstate dumps are instance specific, they tend to be inconclusive with hanging problems involving Oracle Parallel Server (OPS) unless you get them from each node. You will need 3 system state dumps from each node for OPS.

NOTE: Do systemstate dump 3 times in a row, so we can determine whether
the processes are hung or active.

For Oracle 8.0.5.x to 8.1.7.x
$ svrmgrl
svrmgr> connect internal

svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';

EXIT ... then reconnect

svrmgr>ALTER SESSION SET MAX_DUMP_FILE_SIZE=UNLIMITED;
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';