Thursday, May 27, 2010

Testing a DR with DataGuard

I recently performed a DR test using a DataGuard database, but because it was a test the users wanted to leave the Primary (PROD) database active. I explained that this wasn't a DR test in that case, and that the process and commands we would use for the test would not be relevant for a real DR. They accepted this, and I decided that I'd set up my own DataGuard environment so I could test a 'real' DR scenario. In the meantime, these are the steps I performed to allow the standby database to be opened for the users to be able to connect while leaving the primary production database available for normal business.

This DataGuard environment is using a DataGuard Broker.

For brevity, I'll just include the steps, and not all of the commands.

Set the Oracle environmment and connect to the DataGuard Broker on the primary server:
# dgmgrl
DGMGRL> connect /
Connected

DGMGRL> show configuration

The last line of this command should show this:

Current status for "DGPRI_DATAGUARD":
SUCCESS

Disable the configuration (disable configuration;).

Once this is done, alert log messages will appear regarding loss of connectivity to the standby, this is expected and nothing to worry about:

ARC1: Attempting destination LOG_ARCHIVE_DEST_1 network reconnect (3113)
ARC1: Destination LOG_ARCHIVE_DEST_1 network reconnect abandoned
PING[ARC1]: Error 3113 when pinging standby (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.10.10.10)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=DGSBY_XPT.SYSDOMAIN.LOCAL)(INSTANCE_NAME=DGSBY)(SERVER=dedicated))).

I then checked that the Primary database was OK and that users were still logged in.

On the standby database, you need to issue a command to stop the log apply:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

To allow the database to open in READ WRITE mode, issue this command:

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

You then need to stop and restart the standby database.

I checked the primary database to make sure that this command hadn't done something mad like stop and restart the database in standby mode (always best to check), it was fine.

That was basically it - disable the configuration, issue a couple of commands on the standby and then stop and restart the standby. This allowed the testers to pretend the standby was now the Production database for DR purposes, but also allowed the real production database to remain active.

After the test was complete I rebuilt the standby (basically a hot backup and restore of the primary, then running the relevant commands and enabling the broker configuration).

My next test will be to stop my primary test database untidily and start my standby, pretending it is a real loss of Production. I'll update data, then switch it back. Hopefully it all hangs together and this is what would happen in a 'real' DR.

Wednesday, May 19, 2010

I Hate ORA-12514

Whenever a user calls me and tells me they're getting an ORA-12514 when trying to connect I get a little depressed. I find this one of the hardest errors to solve because there are so many places for it to go wrong. Here's a recent example:

I checked the database and it was set to archive log mode, but automatic archiving was not enabled. This usually causes a database to hang, but can also prevent connections (but with a different error). I stopped and restarted the database in archive log mode, enabled automatic archiving, tied reconnecting and voila! It didn’t work.

After a recent migration we had issues with some of the ORACLE_HOMEs on some databases. Back in the dim and distant past the databases used /u01/app/oracle/product/8.1.7 as the ORACLE_HOME, then they were upgraded and moved to /u01/app/oracle/product/8.1.7.4. However, I found that the listener.ora had this database hard-coded in the SID list, but with the old ORACLE_HOME.
I changed the entry for the TST2 database to reflect the new home, reloaded the listener, restarted the database, tried reconnecting and voila! It didn’t work.
I then checked the listener and tnsnames.ora files and saw that the service name had “sysdomain.local’ in it, so I removed that, reloaded the listener, tried to connect again and voila! It didn’t work.

I then checked the parameters in the database, and saw that there was a reference to TST1 in the SERVICE_NAME parameter, so I changed that to TST2, stopped and restarted the database and voila! It worked.

It looks as if this database was created by copying TST1 and recreating the controlfile, but someone left the reference to TST1 in the init.ora file. So even though the listener and tnsnames ora files looked OK, this parameter was over-riding them. If I hadn't have spotted it I may have been still trying to solve this one.

Monday, May 17, 2010

Checking Netbackup logs

A number of sites we look after use Symantec Veritas Netbackup as the Media Manager for RMAN backups. Many times I'll kick off a restore and have it sitting there without knowing what's going on.

Netbackup has logs which detail the process so you can see exactly what the restore is doing. The logs are stored here:

/usr/openv/netbackup/logs/user_ops/dbext/logs

The file name format is a bit cryptic, but you generally want the most recent:

2355 May 17 19:32 9613.0.1274086925
2355 May 17 21:30 9613.0.1274094025
2355 May 17 22:23 9613.0.1274097219
2355 May 18 07:40 28094.0.1274130593

The latest one is a log from a restore I was running that was hanging on a controlfile restore. Our backups are put to disk, then written to tape after 2 weeks, so I couldn't understand why this was taking so long (the file was from a week before). This is the log:

Restore started Tue May 18 07:09:53 2010


07:09:54 (4538920.xxx) Restore job id 4538920 will require 1 image.
07:09:54 (4538920.xxx) Media id BD0078 is needed for the restore.

07:09:58 (4538920.001) Restoring from image created Mon May 10 11:42:13 2010
07:10:00 (4538920.001) INF - Data socket = nbserver.sysdomain.local.IPC:/tmp/vnet-28452274130599746506000000000-c9aGK3;d86e0e6cce7b3e
71ed1e7b93ed529939;4;3600
07:10:00 (4538920.001) INF - Name socket = nbserver.sysdomain.local.IPC:/tmp/vnet-28453274130599795954000000000-Z9aOK3;7dd95d7b3700c6
50e74388642ac3522f;4;3600
07:10:00 (4538920.001) INF - If Media id BD0078 is not in a robotic library administrative interaction may be required to satisfy this
mount request.
07:10:00 (4538920.001) INF - Job id = 4538920
07:10:00 (4538920.001) INF - Backup id = dbserver_1273455733
07:10:00 (4538920.001) INF - Backup time = 1273455733
07:10:00 (4538920.001) INF - Policy name = CIS_oracle_policy
07:10:01 (4538920.001) INF - Snapshot = 0
07:10:01 (4538920.001) INF - Frozen image = 0
07:10:01 (4538920.001) INF - Backup copy = 0
07:10:01 (4538920.001) INF - No drives available or requesting resources
07:10:01 (4538920.001) INF - Master server = master
07:10:01 (4538920.001) INF - Media server = nbserver
07:10:01 (4538920.001) INF - New data socket = nbserver.sysdomain.local.IPC:/tmp/vnet-28451274130599694192000000000-o8ayK3;550ef0e07a
989118940acf8578bb2e6f;4;3600
07:10:01 (4538920.001) INF - Encrypt = 0
07:10:01 (4538920.001) INF - Use shared memory = 0
07:10:02 (4538920.001) INF - Restore id = 4538920.001
07:10:02 (4538920.001) INF - Encrypt = 0
07:10:02 (4538920.001) INF - Client read timeout = 3600
07:10:02 (4538920.001) INF - Media mount timeout = 1800
07:10:02 (4538920.001) INF - client = dbserver
07:10:02 (4538920.001) INF - requesting_client = dbserver
07:10:02 (4538920.001) INF - browse_client = dbserver
07:39:59 (4538920.001) ERR - Timed out waiting for the media to be mounted and positioned.
07:40:05 (4538920.001) The following files/folders were not restored:
07:40:05 (4538920.001) UTF - /s7ldas09_1_1
07:40:06 (4538920.001) Status of restore from image created Mon May 10 11:42:13 2010 = timed out waiting for media manager to mount vo
lume

07:40:07 INF - Server status = 5
07:40:07 (4538920.xxx) INF - Status = the restore failed to recover the requested files.

It's fairly obvious what happened - the tape isn't in the library so the restore timed out. This is a job for the team that looks after Netbackup - I have to ask why the file went to tape after just 1 week, and why the tape was not in the library.

Monday, May 10, 2010

Using SCP without prompting for a password

I often have to scp several files from one server to another, but if it's not set up for it, every file prompts for a password. This is fine if you're only doing 1 or 2, but not if you're copying dozens.

You need to generate a public key and copy it to the destination server.

For example, I have to copy several files from server a (call it source) to server b (call it target).

On source, go to the ~/.ssh directory. If it doesn't exist, create it. If it does exist and there are files in there, it's probably a good idea to take a copy first in case you break something.

Generate a public key like this:

ssh-keygen -t rsa

You should see this:

Generating public/private rsa key pair
Enter file in which to save the key ...

Press Enter, and you should see this:

Enter passphrase (empty for no passphrase):

Press Enter twice.

You should see this:

Your identification has been saved in id_rsa.pub
Your public key has been saved in id_rsa.pub

Copy this file (you can scp it) to the target server, and put it in the /tmp folder there.

Log on to the target server, and cd to the ~/.ssh directory. If there is already an id_rsa.pub file you should rename the one you copied to the /tmp directory, and then move it to the ./ssh directory. If one doesn't already exist it's OK to just move it.

If there is an "authorized_keys" file, take a backup and edit it. Copy the contents of file you copied to the end of this. If there is no "authorized_keys" file in the ~/.ssh directory, just rename your file to "authorized_keys".

Log back onto the source server and try to scp a file - it should copy without prompting for a password.

Thursday, May 6, 2010

Some Oracle Statspack tips

Although AWWR is now available for V10, and is (probably) better than statspack, it is an option that requires a license. I tend to install statspack onto a database so that I can still get performance information. There are a couple of things I've picked up over the years about statspack that are worth looking at.

The first is the statspack level - by default it's level 5, but I prefer level 7, mainly because it lists the top 5 objects.

To see the statspack levels available:

SELECT * FROM stats$level_description ORDER BY snap_level;

Note that V8i has levels 0,5 and 10; V9 has levels 0,5,6,7 and 10.

To change the level you run this command:

exec statspack.snap(i_snap_level => 7, i_modify_parameter => 'true');



If there are a large number of snapshots you may see this when you run an spreport.sql:

Instance DB Name Id Snap Started Level Comment
------------ ------------ ----- ----------------- ----- ----------------------
PRD PRD ##### 02 May 2010 22:00 10


In order to see the snap_ids you will need to change the column format in the spreport.sql (or sprepins.sql) . I tend to copy the $ORACLE_HOME/rdbms/admin/spreport.sql (or sprepins.sql) to a backup copy and make a couple of changes.

Change this line in spreport.sql

column snap_id heading 'Snap|Id' format 9990;

to

column snap_id heading 'Snap|Id' format 9999999990;

Running the spreport.sql again shows this:

Instance DB Name Id Snap Started Level
------------ ------------ ----------- ----------------- -----
Comment
----------------------

PRD PRD 33739 06 May 2010 10:00 10


It's wrapped the columns, but at least the snap_id is now visible.



Another change I like to make is the number of lines in the SQL Statements. By default this is only set to 5. This means that the top SQL sections only show the first 5 lines of the statement, which doesn't really show enough.

Change this line:

define top_n_sql = 5;

to

define top_n_sql = 65;

or how ever many lines you think you want. This will now display more, or all, of the SQL statement which you can copy to run an explain on.

Wednesday, May 5, 2010

How to clone an Oracle Home (10g)

If you need to install a V10 (or above) environment and already have the version and patches somewhere else (but the same OS and level), instead of going through an install you can simply clone an existing ORACLE_HOME.

Rather than re-installing Oracle 10g and it's associated patches each time, it is possible to perform a clone of an exisiting Oracle Home.


1. Take a "tar" of the existing Oracle Home. Using tar will ensure that file permissions and symbolic links (internal to the Oracle Home) are maintained.
tar -cvf oracle_home.tar $ORACLE_HOME
2. Extract oracle_home.tar to the required location (same or different server)
3. Run the clone utility
cd $ORACLE_HOME/clone/bin
perl clone.pl ORACLE_HOME="" ORACLE_HOME_NAME=""

This also saves having to copy the install media onto a server if it's at a remote location.

Note - I haven't tested this, these are instructions from a colleague.

Oracle DataGuard Trouble Shooting

I have implemented a couple of DataGuard environments, and most of the time they tick along just fine. However they do sometimes break, and this is a brief guide to some of the errors I've encountered and how to fix them.

Data Guard Trouble shooting.

Dataguard Broker errors (I’ve removed the text to just show the errors):

DGMGRL> show configuration;
.
.
Warning: ORA-16607: one or more databases have failed

This is a general error showing there is a problem with either the primary or secondary database.

To determine which is causing the issue:

DGMGRL> show database verbose PROD_PRIM;
.
.
Error: ORA-16778: redo transport error for one or more databases

This indicates the logs are not getting to the secondary database.



DGMGRL> show database PROD_SBY;
.
.
Error: ORA-12541: TNS:no listener

This shows the listener on the secondary server is not started.



DGMGRL> show database PROD_SBY;
.
.
Error: ORA-16766: Redo Apply unexpectedly offline

This shows that the redo apply has not been enabled. You need to connect to the PROD_SBY database and issue the
“ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY DISCONNECT FROM SESSION;” command.




DGMGRL> show database PROD_SBY;
.
.
Warning: ORA-16826: apply service state is inconsistent with the DelayMins property

This shows that the redo apply command was issued without the “NODELAY” parameter. You need to connect to the PROD_SBY database, stop the redo apply and then re-enable it with the “NODELAY” parameter:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

Database altered.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY DISCONNECT FROM SESSION;

Database altered.


Messages may appear in the alert log on the primary:

PING[ARC1]: Error 16146 when pinging standby EAM_SBY
ARC3: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16040)
ARC3: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file /u01/app/oracle/admin/PRD/bdump/prd_arc3_10159.trc:
ORA-16040: standby destination archive log file is locked.

ORA-16011: Archivelog Remote File Server process in Error state
FAL[server, ARC3]: Error 16011 creating remote archivelog file 'EAM_SBY'
FAL[server, ARC3]: FAL archive failed, see trace file.
Wed Feb 18 09:10:33 2009
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Wed Feb 18 09:10:33 2009
ORACLE Instance PROD_PRIM - Archival Error. Archiver continuing.

And on the standby:

RFS[240]: Possible network disconnect with primary database
Aborting archivelog file creation:
/s01/oraarch/PROD_SBY/PROD_PRIM19019645361152.arc
If this a network disconnect, then this archivelog will be fetched again
by GAP resolution mechanism.

Not sure why this happens – maybe network activity causing time-outs, or some kind of timing conflict. It seems to resolve itself. Test it by “alter system switch logfile;” on the primary and making sure the new log gets applied on the standby. If it does, all is OK. If not, you may need to re-build the standby or do more investigation.

Using Oracle APEX in an 11g database with the embedded gateway feature

I've used APEX (Oracle Application Express) for a while, since when it was called HTML-DB, and I've written a couple of applications but nothing too involved. In db versions before 11g you had to set up OAS to act as the portal / web service (actually I think 10g had an internal method but it wasn't easy to get going). 11g now has an embedded PL/Sql gateway which doesn't need as much implementation. However, there are some 'gotchas'. Here's some information I've found that I put together on how to get the 11g Embedded Gateway working with APEX:

Application Express 3.2 in 11g Using the embedded gateway


Download the apex32.zip file from Oracle

Extract to a temporary location.

Backup and remove the existing $ORACLE_HOME/apex directory

Copy the new extracted files to the 11g $ORACLE_HOME/apex directory

There are some html help pages in the extracted "install.32/e13366" directory

Run the @apexins script from the $ORACLE_HOME/apex directory:

@apexins tablespace_name tablesspace_name TEMP /i/

tablespace_name = where the objects will be stored


Run the @apxconf script from the $ORACLE_HOME/apex directory as sys

It will prompt for you to create a password for the admin account - note you will
need to change this when you first login from the web page so you might want to make
it a temporary one.

Unlock the anonymous account: ALTER USER ANONYMOUS ACCOUNT UNLOCK;


Check the port : SELECT DBMS_XDB.GETHTTPPORT FROM DUAL;

If 0, no port is set. It needs to be changed using
   EXEC DBMS_XDB.SETHTTPPORT(nnnn);
   COMMIT;


Go to a web page and enter "http://servername:port/apex/apex_admin

It should show a page with a username/login field.

I had a problem with the login button not doing anything when clicked, this fixed it

@apex_epg_config /u01/app/oracle/product/11.1.0/db_1


Other things to try if things don't work, any or all of these may fix things:


--------
DECLARE
  ACL_PATH  VARCHAR2(4000);
  ACL_ID    RAW(16);
BEGIN
  -- Look for the ACL currently assigned to '*' and give APEX_030200
  -- the "connect" privilege if APEX_030200 does not have the privilege yet.

  SELECT ACL INTO ACL_PATH FROM DBA_NETWORK_ACLS
   WHERE HOST = '*' AND LOWER_PORT IS NULL AND UPPER_PORT IS NULL;

  -- Before checking the privilege, ensure that the ACL is valid
  -- (for example, does not contain stale references to dropped users).
  -- If it does, the following exception will be raised:
  --
  -- ORA-44416: Invalid ACL: Unresolved principal 'APEX_030200'
  -- ORA-06512: at "XDB.DBMS_XDBZ", line ...
  --
  SELECT SYS_OP_R2O(extractValue(P.RES, '/Resource/XMLRef')) INTO ACL_ID
    FROM XDB.XDB$ACL A, PATH_VIEW P
   WHERE extractValue(P.RES, '/Resource/XMLRef') = REF(A) AND
         EQUALS_PATH(P.RES, ACL_PATH) = 1;

  DBMS_XDBZ.ValidateACL(ACL_ID);
   IF DBMS_NETWORK_ACL_ADMIN.CHECK_PRIVILEGE(ACL_PATH, 'APEX_030200',
     'connect') IS NULL THEN
      DBMS_NETWORK_ACL_ADMIN.ADD_PRIVILEGE(ACL_PATH,
     'APEX_030200', TRUE, 'connect');
  END IF;

EXCEPTION
  -- When no ACL has been assigned to '*'.
  WHEN NO_DATA_FOUND THEN
  DBMS_NETWORK_ACL_ADMIN.CREATE_ACL('power_users.xml',
    'ACL that lets power users to connect to everywhere',
    'APEX_030200', TRUE, 'connect');
  DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL('power_users.xml','*');
END;
/
COMMIT;

---------

ALTER SYSTEM SET SHARED_SERVERS = 5 SCOPE=BOTH;

---------

Changing the admin password:

SQL> PASSWORD APEX_PUBLIC_USER

or

SQL> @apxchpwd.sql

---------

UPDATE resource_view r
SET r.res=UPDATEXML(res, '/a:Resource/a:Contents/b:acl','

xmlns="http://xmlns.oracle.com/xdb/acl.xsd"
xmlns:dav="DAV:"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.oracle.com/xdb/acl.xsd http://xmlns.oracle.com/xdb/
acl.xsd">

true
ANONYMOUS


',
'xmlns:a="http://xmlns.oracle.com/xdb/XDBResource.xsd" xmlns:b="http://xmlns.oracle.com/xdb/acl.xsd"') WHERE r.any_path = '/sys/acls/ro_an
onymous_acl.xml';

---------

lsnrctl reload

---------

alter user xdb account unlock;
alter user xdb identified by xdb;


Tuesday, May 4, 2010

Oracle Enterprise Manager Desktop Widgets

Oracle has recently made available some desktop widgets that connect to OEM and display the status of monitored instances: http://www.oracle.com/technology/products/oem/widgets/index.html

They are quite neat, but the drill-down only functions if you have the relevant Management Packs installed.

You need to install Adobe Air, but it's really a one-click installation, there's not a lot involved. I generally find desktop widgets are only really useful if you have dual monitors, or a widescreen one, otherwise they get in the way.




Monday, May 3, 2010

Recovery using RMAN through a RESETLOGS

I had to recover a database after a RESETLOGS a while ago and recently found the doco. This is the syntax. I may edit it to include comments later, but I think the commands should be self-explanatory.

rman catalog rman/xxxxx@rmancat

connect target

RMAN> run {
2> allocate channel ch1 type 'SBT_TAPE';
3> set until time "to_date('13-JUL-2009 21:00:00','dd-MON-YYYY hh24:mi:ss')";
4> restore database;
5> sql 'alter database mount';
6> recover database;
7> alter database open resetlogs;
8> }

allocated channel: ch1
channel ch1: sid=1095 devtype=SBT_TAPE
channel ch1: Veritas NetBackup for Oracle - Release 6.5 (2007072323)

executing command: SET until clause
released channel: ch1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of set command at 07/24/2009 11:08:38
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20207: UNTIL TIME or RECOVERY WINDOW is before RESETLOGS time

RMAN> list incarnation of database;


List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
3794137 3794138 DEVDB 1020948257       PARENT  1          19-JUN-09
3794137 3869825 DEVDB 1020948257       CURRENT 8615505    23-JUL-09

RMAN> set dbid 1020948257;

executing command: SET DBID
database name is "DEVDB" and DBID is 1020948257

RMAN> reset database to incarnation 3794138;

database reset to incarnation 3794138

RMAN> run {
2> allocate channel ch1 type 'SBT_TAPE';
3> set until time "to_date('13-JUL-2009 21:00:00','dd-MON-YYYY hh24:mi:ss')";
4> restore database;
5> sql 'alter database mount';
6> recover database;
7> alter database open resetlogs;
8> }

allocated channel: ch1
channel ch1: sid=1095 devtype=SBT_TAPE
channel ch1: Veritas NetBackup for Oracle - Release 6.5 (2007072323)

executing command: SET until clause

Starting restore at 24-JUL-09

channel ch1: starting datafile backupset restore
channel ch1: specifying datafile(s) to restore from backup set
restoring datafile 00001 to /u08/oradata/DEVDB/system.dbf
restoring datafile 00002 to /u08/oradata/DEVDB/undotbs_1.dbf
restoring datafile 00003 to /u08/oradata/DEVDB/sysaux_01.dbf
restoring datafile 00004 to /u09/oradata/DEVDB/data_l_01.dbf
restoring datafile 00005 to /u09/oradata/DEVDB/data_m_01.dbf
restoring datafile 00006 to /u09/oradata/DEVDB/data_s_01.dbf
restoring datafile 00007 to /u10/oradata/DEVDB/indx_l_01.dbf
restoring datafile 00008 to /u10/oradata/DEVDB/indx_m_01.dbf
restoring datafile 00009 to /u10/oradata/DEVDB/indx_s_01.dbf
channel ch1: reading from backup piece DEVDB_db_13_07_2009_18_51_t692131889_s4_p1
channel ch1: restored backup piece 1
piece handle=DEVDB_db_13_07_2009_18_51_t692131889_s4_p1 tag=TAG20090713T185129
channel ch1: restore complete, elapsed time: 00:04:06
Finished restore at 24-JUL-09

sql statement: alter database mount
released channel: ch1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of sql command on default channel at 07/24/2009 11:16:05
RMAN-11003: failure during parse/execution of SQL statement: alter database mount
ORA-00205: error in identifying control file, check alert log for more info

RMAN> run {
2> allocate channel ch1 type 'SBT_TAPE';
3> set until time "to_date('13-JUL-2009 21:00:00','dd-MON-YYYY hh24:mi:ss')";
4> restore controlfile;
5> restore database;
6> sql 'alter database mount';
7> recover database;
8> alter database open resetlogs;
9> }

allocated channel: ch1
channel ch1: sid=1095 devtype=SBT_TAPE
channel ch1: Veritas NetBackup for Oracle - Release 6.5 (2007072323)

executing command: SET until clause

Starting restore at 24-JUL-09

channel ch1: starting datafile backupset restore
channel ch1: restoring control file
channel ch1: reading from backup piece DEVDB_db_13_07_2009_18_51_t692132525_s5_p1
channel ch1: restored backup piece 1
piece handle=DEVDB_db_13_07_2009_18_51_t692132525_s5_p1 tag=TAG20090713T185129
channel ch1: restore complete, elapsed time: 00:00:37
output filename=/u08/oradata/DEVDB/control01.ctl
output filename=/u09/oradata/DEVDB/control02.ctl
output filename=/u10/oradata/DEVDB/control03.ctl
Finished restore at 24-JUL-09

Starting restore at 24-JUL-09

skipping datafile 1; already restored to file /u08/oradata/DEVDB/system.dbf
skipping datafile 2; already restored to file /u08/oradata/DEVDB/undotbs_1.dbf
skipping datafile 3; already restored to file /u08/oradata/DEVDB/sysaux_01.dbf
skipping datafile 4; already restored to file /u09/oradata/DEVDB/data_l_01.dbf
skipping datafile 5; already restored to file /u09/oradata/DEVDB/data_m_01.dbf
skipping datafile 6; already restored to file /u09/oradata/DEVDB/data_s_01.dbf
skipping datafile 7; already restored to file /u10/oradata/DEVDB/indx_l_01.dbf
skipping datafile 8; already restored to file /u10/oradata/DEVDB/indx_m_01.dbf
skipping datafile 9; already restored to file /u10/oradata/DEVDB/indx_s_01.dbf
Finished restore at 24-JUL-09

sql statement: alter database mount

Starting recover at 24-JUL-09

starting media recovery

channel ch1: starting archive log restore to default destination
channel ch1: restoring archive log
archive log thread=1 sequence=712
channel ch1: restoring archive log
archive log thread=1 sequence=713
channel ch1: reading from backup piece DEVDB_arch__t692132627_s6_p1
channel ch1: restored backup piece 1
piece handle=DEVDB_arch__t692132627_s6_p1 tag=TAG20090713T190347
channel ch1: restore complete, elapsed time: 00:00:35
archive log filename=/u01/app/oracle/product/10.2.0/db_1/dbs/archDEVDB_arch712_1_689964513.arc thread=1 sequence=712
archive log filename=/u01/app/oracle/product/10.2.0/db_1/dbs/archDEVDB_arch713_1_689964513.arc thread=1 sequence=713
channel ch1: starting archive log restore to default destination
channel ch1: restoring archive log
archive log thread=1 sequence=714
channel ch1: reading from backup piece DEVDB_arch__t692139952_s9_p1
channel ch1: restored backup piece 1
piece handle=DEVDB_arch__t692139952_s9_p1 tag=TAG20090713T210552
channel ch1: restore complete, elapsed time: 00:00:36
archive log filename=/u01/app/oracle/product/10.2.0/db_1/dbs/archDEVDB_arch714_1_689964513.arc thread=1 sequence=714
media recovery complete, elapsed time: 00:00:00
Finished recover at 24-JUL-09

database opened
new incarnation of database registered in recovery catalog
starting full resync of recovery catalog
full resync complete
released channel: ch1

Diagnosing a hung Oracle database

I'm not sure where I got this - maybe Metalink, maybe a forum post or possibly another blog.

Spool out relevant V$ VIEWS

WHILE the database is hanging, run the following queries:
SPOOL v_views.log;

SELECT *
FROM v$parameter;

SELECT class, value, name
FROM v$sysstat;

SELECT sid, id1, id2, type, lmode, request
FROM v$lock;

SELECT l.latch#, n.name, h.pid, l.gets, l.misses,
l.immediate_gets, l.immediate_misses, l.sleeps
FROM v$latchname n, v$latchholder h, v$latch l
WHERE l.latch# = n.latch#
AND l.addr = h.laddr(+);

SELECT *
FROM v$session_wait
ORDER BY sid;

/* repeat last query 3 times - we want to see who's repeatedly waiting*/

SPOOL OFF;

If a specific session hangs do these queries:

SPOOL session.log;

- Get the sid from the pid or vice versa:
SELECT s.sid, p.spid
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND ... < p.spid = or perhaps
s.sid = >

- What SQL statement is being executed?
SELECT s.sid, s.status, q.sql_text
FROM v$session s, v$sqltext q
WHERE s.sql_hash_value = q.hash_value
AND s.sql_address = q.address
AND s.sid =
order by q.piece;

-Get output from V$SESSION_WAIT
column sid format 990
column seq# format 99990
column wait_time heading 'WTime' format 99990
column event format a30
column p1 format 9999999990
column p2 format 9999999990
column p3 format 9990
select sid,event,seq#,p1,p2,p3,wait_time from V$session_wait
where sid=
order by sid;
*** It is needed to run this statement a few times to see if the wait events actually change

- What is LOCKWAIT column set to in V$SESSION?
SELECT lockwait
FROM v$session
WHERE sid = ;

- If not null, who's holding what lock on which object?
col Username format A15
col Sid format 9990 heading SID
col Type format A4
col Lmode format 990 heading 'HELD'
col Request format 990 heading 'REQ'
col Id1 format 9999990
col Id2 format 9999990

select SN.Username, M.Sid, M.Type,
DECODE(M.Lmode, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row
Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',
LTRIM(TO_CHAR(Lmode,'990'))) Lmode,
DECODE(M.Request, 0, 'None', 1, 'Null', 2, 'Row Share', 3, 'Row
Excl.', 4, 'Share', 5, 'S/Row Excl.', 6, 'Exclusive',
LTRIM(TO_CHAR(M.Request, '990'))) Request,
M.Id1, M.Id2 from V$SESSION SN, V$LOCK M
WHERE (SN.Sid = M.Sid and M.Request ! = 0)
or (SN.Sid = M.Sid and M.Request = 0 and Lmode != 4 and (id1, id2)
in (select S.Id1, S.Id2 from V$LOCK S where Request != 0 and S.Id1
= M.Id1 and S.Id2 = M.Id2) ) order by Id1, Id2, M.Request;

- What is LATCHWAIT column set to in V$PROCESS?
SELECT latchwait
FROM v$process
WHERE spid = ;SELECT latchwait
FROM v$process
WHERE spid = ;

- If not null, who's holding what latch?
column name format a32 heading 'LATCH NAME'
column pid heading 'HOLDER PID'
select c.name,a.addr,a.gets,a.misses,a.sleeps,
a.immediate_gets,a.immediate_misses,b.pid
from v$latch a, v$latchholder b, v$latchname c
where a.addr = b.laddr(+) and a.latch# = c.latch#
and c.name like '&latch_name%' order by a.latch#;
- Did the lock holder or latch holder shut off his/her terminal without logging off? This can leave a zombie shadow
process holding the resource.
Kill it with "alter system kill session ''".
If on sqlnet 2.1 or above, try setting sqlnet.expire_time in the sqlnet.ora on the server side. (For more info, see page A-16 of "Understanding SQL*NET" v2.1).

- If session is not frozen but slow, get session-specific statistics:
SELECT s.sid, s.value, t.name
FROM v$sesstat s, v$statname t
WHERE s.statistic# = t.statistic#
AND s.sid = ;

- If session is extremely slow or truly frozen, get session-specific wait info:
SELECT *
FROM v$session_wait
where sid = ;


- repeat above query 3 times.

SPOOL OFF;

- If this is distributed transaction run the following statements from EVERY NODE involved in the transaction:
SPOOL 2pc.log;
SELECT * FROM dba_2pc_pending;
SELECT * FROM pending_sessions$;
SELECT * FROM pending_sub_sessions$;
SELECT * FROM dba_2pc_neighbors;
SPOOL OFF;

The following spool will help determine the cause in MTS.
SPOOL mts.log
- How busy the dispatchers are:
select name,network,status,
(busy /(busy + idle)) * 100 "% of time busy" from v$dispatchers
- query the shared servers view V$SHARED_SERVERS:
select name,status,requests, (busy /(busy + idle)) * 100 "% of time busy"
from v$shared_servers
Spool off
4. Gather OS-Level Info

***RDA output gives the most comprehensive data on OS and database configuration

For information about how to download, install and run the RDA scripts refer to Note 314422.1 "Remote Diagnostic Agent (RDA) 4.0 - Overview"

a. Briefly describe your architecture. Include number of CPUs / disks. Are you using raw devices, NFS mounted
files, shared disks, RAM disks...Are you mirroring? etc.

b. Measure unusual OS level activity: excessive CPU or i/o, paging, swapping, etc. There are other monitor tools such
as TOP. Sun has utilities on their webpage that can be downloaded for monitoring the system.
If you're not comfortable with the utilities listed below, be sure to do step 3 and, if you have isolated a troubled process, the v$sesstat output in step.
- On unix: use SAR, VMSTAT and NETSTAT (man sar)
- You can use truss, trace (man truss) or oradbx (call support) for tracing specific processes.
- You can also unix debuggers such as dbx, xdb, adb or truss to dump process activity. Example:
dbx -a /*attach to hanging process*/
where /*dump Oracle stack */
detach /*detach from process*/

truss -o /tmp/mytruss -fae -p
- On vms: use MONITOR (help monitor)
- you can use analyze/process (help analyze) or orambx (call support) for tracing specific processes.

- On novl: use LOAD MONITOR (gui - no online documentation)

- On Windows NT: use Performance Monitor, Event Monitor, Dr. Watson (see Resource kit for documentation).
- use pview for monitoring specific processes (resource kit).

c. Check the system log for additional information. On many platforms it is at /var/adm the files are the message files.
5. Get SYSTEMSTATE and HANGANALYZE Dumps
This creates a large trace file in the user_dump_dest (30M or more is not unusual).
Note: the init.ora parameter MAX_DUMP_FILE_SIZE controls the maximum trace file size. Using Oradebug and setting unlimit will allow a complete dump which we will need.

Do this step for sure if the entire database is frozen or nearly frozen and if this condition came on suddenly and there are no archive errors in the alert log. Please note: As systemstate dumps are instance specific, they tend to be inconclusive with hanging problems involving Oracle Parallel Server (OPS) unless you get them from each node. You will need 3 system state dumps from each node for OPS.

NOTE: Do systemstate dump 3 times in a row, so we can determine whether
the processes are hung or active.

For Oracle 8.0.5.x to 8.1.7.x
$ svrmgrl
svrmgr> connect internal

svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';

EXIT ... then reconnect

svrmgr>ALTER SESSION SET MAX_DUMP_FILE_SIZE=UNLIMITED;
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';
wait 90 seconds
svrmgr>ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';