Thursday, April 29, 2010

How much faster is DataPump compared to Export?

I've done a test to find out. I took a full datapump export of a 9.5GB database followed by an old-style export.

The results are pretty conclusive, but obviously for a better test you would need to try more databases, different sizes and different options:

DataPump - started at 15:12:00, finished at 15:41:34, so elapsed time was 29 mins and 34 secs
Export - started at 09:32:56 and finished at 10:27:26, so elapsed time was 54 mins and 30 secs
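
For reference, the commands were along these lines (connection details, directory object and file names here are illustrative rather than the exact ones I used):

expdp system/password full=y directory=DATA_PUMP_DIR dumpfile=full.dmp logfile=full_expdp.log

exp system/password full=y file=/u01/exports/full.dmp log=/u01/exports/full_exp.log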

Later on I'll do an Import and post the results.

Monday, April 19, 2010

Tracing listener connections

We've hit an issue where clients are being disconnected from the database, but there are no error messages apart from ORA-03113 and ORA-03114 (which basically mean "you have lost your connection to the database" - duh).
There are several odd things about these disconnects:

Idle connections aren't affected, so it's not a time-out issue.
Multiple PCs get disconnected at the same time, so it's not as if a session hits a profile limit and that gets killed.
Only one application is affected; no other users have reported issues.
It's not limited to one geographical location.
It happens at (seemingly) random times.
It doesn't matter what process or screen the user is accessing.


I've set tracing on at the listener level to see if anything gets written there - there is nothing in the listener.log.
I originally set the level to ADMIN, but that generated 20Mb of trace in 20 minutes, so I dropped it to USER.

This line goes into the listener.ora file (you can find its location with lsnrctl stat):

trace_level_listener=USER

The listener then needs to be reloaded to put it into effect:

lsnrctl reload

lsnrctl stat

will show the trace file location:

Listener Trace File       /u01/app/oracle/product/9.2.0/network/trace/listener.trc

Since this is going to get large, I have put in a cron job to cycle it every hour. The drop outs tend to occur overnight.

This is the script that compresses and cycles the log:

#!/bin/ksh
export tstamp=`date +%d_%m_%Y_%H%M`
cd /u01/app/oracle/product/9.2.0/network/trace
if [[ -f listener.trc ]] ; then
    # copy current file to a time-stamped one
    cp listener.trc listener_${tstamp}.trc
    # zero the trace file
    cat /dev/null > listener.trc
    # compress any dated logs
    find . -name "listener_*.trc" -exec compress -f {} \;
fi
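
The hourly cron entry that calls it is just this sort of thing (the script path is whatever you saved it as - mine here is illustrative):

0 * * * * /u01/app/oracle/dba/scripts/cycle_listener_trace.ksh > /dev/null 2>&1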

Sunday, April 11, 2010

My favourite tablespace space script

I got this from the asktom.com site, and it's the best one I've found for showing tablespace space usage:

column dummy noprint
column pct_used format 999.9 heading "%|Used"
column name format a16 heading "Tablespace Name"
column Kbytes format 999,999,999 heading "KBytes"
column used format 999,999,999 heading "Used"
column free format 999,999,999 heading "Free"
column largest format 999,999,999 heading "Largest"
break on report
compute sum of kbytes on report
compute sum of free on report
compute sum of used on report
select nvl(b.tablespace_name,
           nvl(a.tablespace_name,'UNKNOWN')) name,
       kbytes_alloc kbytes,
       kbytes_alloc-nvl(kbytes_free,0) used,
       nvl(kbytes_free,0) free,
       ((kbytes_alloc-nvl(kbytes_free,0))/
         kbytes_alloc)*100 pct_used,
       nvl(largest,0) largest
from ( select sum(bytes)/1024 Kbytes_free,
              max(bytes)/1024 largest,
              tablespace_name
       from sys.dba_free_space
       group by tablespace_name ) a,
     ( select sum(bytes)/1024 Kbytes_alloc,
              tablespace_name
       from sys.dba_data_files
       group by tablespace_name ) b
where a.tablespace_name (+) = b.tablespace_name;


The only problem I've found is that it's very slow in V10, but fine in all other versions.

How to run an OS command from within Oracle

This is something that comes up occasionally, and I always end up searching Google for the answer, so the last time I wrote down the full instructions. Here they are.


The working directory is important:

cd /u01/app/oracle/dba/work/andy

Create the program in this directory - vi a file and call it shell.c:

#include <stdio.h>
#include <stdlib.h>    /* for system() */
#include <string.h>

void sh(char *command) {
    int num;
    /* hand the command string straight to the OS shell */
    num = system(command);
}


Compile the program:

/usr/local/bin/gcc -G -c shell.c
ld -r -o shell.so shell.o
chmod 775 shell.so


Create the directory and the library in the database:

SQL> create directory shell_lib as '/u01/app/oracle/dba/work/andy';
SQL> create library shell_lib as '/u01/app/oracle/dba/work/andy/shell.so';


Create the procedure:

SQL> CREATE OR REPLACE PROCEDURE shell(command IN char)
> AS EXTERNAL
> NAME "sh"
> LIBRARY shell_lib
> LANGUAGE C
> PARAMETERS (command string);
> /


Set up the listener.ora and tnsnames.ora:

listener.ora:
TSTLSNR =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.208.10.1)(PORT = 1525))
  )

SID_LIST_TSTLSNR =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = andy)
      (ORACLE_HOME = /u01/app/oracle/product/8.1.7)
    )
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = /u01/app/oracle/product/8.1.7)
      (PROGRAM = extproc)
    )
  )

tnsnames.ora:
EXTPROC_CONNECTION_DATA =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
    (CONNECT_DATA = (SID = PLSExtProc)(SERVER = DEDICATED))
  )

andy =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(Host = 10.208.10.1)(Port = 1525))
    (CONNECT_DATA = (SID = andy))
  )


Run the procedure:

SQL> exec shell('ls');

andyalterTablespace.sh cr_db.sh logs new_temp.sql shell.c
andyrun.sh cr_user.sql mga_performance.dmp performance shell.o
andyrun1.sh data mga_performance.log replace_temp.log shell.so
andyrun2.sh listener.andy1 new scripts
PL/SQL procedure successfully completed.

This should work as long as all the elements are in place. One thing to double-check is the names - a typo, or the wrong directory or library name, will stop it working and you could spend ages trying to figure out where it's gone wrong.

Quick Tip - log on to sqlplus as sysdba in Windows

Most of the sites I work with (99% of them) are Unix (mainly Solaris, some AIX, HP-UX, Red Hat Linux, Oracle Enterprise Linux), but there are a couple that are running Oracle on Windows (ugh!).

I always forget how to log on to sqlplus as sysdba, so this is a reminder for me:

c:\> sqlplus /nolog

sql> connect sys/password@SID as sysdba
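
If OS authentication is set up (the Windows account is in the ORA_DBA group), this also works - a sketch, with an illustrative SID:

c:\> set ORACLE_SID=PROD

c:\> sqlplus / as sysdba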




Flash Recovery Area full - ORA-19815

I got alerted over the weekend from a client site with the message "ORA-19815: WARNING: db_recovery_file_dest_size.."

It turns out that there was a new application release going in over the weekend so they had decided to suspend the database backups (without telling me).

This is a fairly easy fix, but it took a while due to the number of archived logs.

In case you're not familiar with the Flash Recovery Area (FRA), its location and size are defined in the init.ora file. Oracle will keep archived logs and disk backups there for as long as it can within that space. For example, if you give it 200Gb, Oracle will keep old archived logs in the FRA (in case a recovery is needed - it doesn't have to go to tape) unless new ones need the space, in which case it will delete them provided they have already been backed up. You will see messages in the alert log to the effect "Deleted Oracle managed file....".

What happened this weekend was that because the backups hadn't run, the FRA filled up and there were no already-backed-up logs for Oracle to delete, so the warnings were issued and the database hung.

To allow the database to continue, the first thing to do is to issue the command "ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=400G scope=both;" - making sure the 400G is larger than the current setting (show parameter recovery).
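
To see how full the FRA actually is before and after the change, a query along these lines against v$recovery_file_dest works (the divisions just convert bytes to Gb):

SQL> select name,
            space_limit/1024/1024/1024 limit_gb,
            space_used/1024/1024/1024 used_gb,
            space_reclaimable/1024/1024/1024 reclaimable_gb
     from v$recovery_file_dest;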

This will start the database writing new archived logs again, but the disk will fill up and you'll be back to where you started unless you backup the existing archived logs.

Connect to rman and run an archived log backup. If the site uses a media manager like NetBackup you'll need to find the parameters - they should be documented (yeah, right!) - or find an existing backup script, which should have them. If you can't find them you'll need to ask the NetBackup admin, but it's something you should know if you support a site.

rman catalog rman/xxxxx@rcat10.world

RMAN> connect target

connected to target database: PROD (DBID=99999999)


RMAN> run {
2> sql 'alter system archive log current';
3> ALLOCATE CHANNEL ch00 TYPE 'SBT_TAPE' ;
4> SEND 'NB_ORA_POLICY=Oracle_PROD, NB_ORA_SERV=nbu_server01';
5> BACKUP
6> filesperset 200
7> FORMAT 'al_%s_%p_%t'
8> ARCHIVELOG ALL;
9> RELEASE CHANNEL ch00;
10> }

sql statement: alter system archive log current

allocated channel: ch00
channel ch00: sid=352 devtype=SBT_TAPE
channel ch00: VERITAS NetBackup for Oracle - Release 6.0 (2008012304)

sent command to channel: ch00

Starting backup at 2010-04-11 17:48:12
current log archived
channel ch00: starting archive log backupset
channel ch00: specifying archive log(s) in backup set
input archive log thread=1 sequence=130197 recid=212695 stamp=715864266
input archive log thread=1 sequence=130198 recid=212696 stamp=715864428
input archive log thread=1 sequence=130199 recid=212697 stamp=715864599

etc.

Once the backup has completed, these logs are flagged as having been backed up by RMAN, so Oracle will consider them for deletion. However, this won't happen until you reset DB_RECOVERY_FILE_DEST_SIZE back to what it was; otherwise Oracle thinks there's still plenty of space.

SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=300G scope=both;

Once this is done, the delete messages will show up in the alert log:

Deleted Oracle managed file /FRA/PROD/archivelog/2010_04_09/o1_mf_1_130013_5vwfhln1_.arc
Deleted Oracle managed file /FRA/PROD/archivelog/2010_04_09/o1_mf_1_130014_5vwflfy1_.arc
Deleted Oracle managed file /FRA/PROD/archivelog/2010_04_09/o1_mf_1_130015_5vwfo4fy_.arc
Deleted Oracle managed file /FRA/PROD/archivelog/2010_04_09/o1_mf_1_130016_5vwfr3v9_.arc

It's a good idea to tell the site that it's better practice to reschedule rather than suspend backups, or at least let the archived log backups run because they won't impact the database files.

Thursday, April 8, 2010

Performing a compressed pipe export and import

I always forget how to do a compressed pipe export and import, so here are the commands, taken from the orafaq website.

Make a pipe process

mknod exp.pipe p

Start the compress and pipe it in the background

gzip < exp.pipe > scott.exp.gz &

Wait a second or so before kicking off the export

Run the export command

nohup exp userid=scott/tiger file=exp.pipe ... &
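
Put together as a single ksh script it would look something like this (the user, file names, paths and the sleep are illustrative - adjust for your own schema):

#!/bin/ksh
cd /u01/app/oracle/exports
# create the named pipe if it isn't already there
[[ -p exp.pipe ]] || mknod exp.pipe p
# start the compress reading from the pipe, in the background
gzip < exp.pipe > scott.exp.gz &
# give gzip a moment to open the pipe
sleep 2
# run the export, writing to the pipe
exp userid=scott/tiger file=exp.pipe log=scott.exp.log
# tidy up once the export and gzip have finished
wait
rm -f exp.pipe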


That does the export. To import the file, do this:

Make a new pipe process, run the uncompress in the background and then run the import command


mknod imp.pipe p

gzip -dc scott.exp.gz > imp.pipe &

imp system/password file=imp.pipe full=Y log=imp.log resumable=y resumable_name=impjob resumable_timeout=7200 buffer=5000000


This should be all you need.

Find the SQL associated with a Unix process

If the server is slow and a "top" or "prstat" shows an Oracle process chewing up CPU, this is how to see what SQL it's running.

Note that I've truncated some output. Also, it looks a mess, but I can't see any better formatting options.

prstat

PID USERNAME TIME CPU PROCESS/NLWP
25475 oracle 5:52:23 24% oracle/1

ps -ef | grep 25475

oracle 25475 14487 23 Apr 06 ? 352:27 oracleDEV (LOCAL=NO)

sqlplus "/ as sysdba"

select s.username,
       s.SID,
       s.serial#,
       q.SQL_TEXT
from   v$session s,
       v$process p,
       v$sql q
where  s.paddr = p.addr
and    p.spid = 25475
and    s.sql_hash_value = q.hash_value;


USERNAME   SID   SERIAL#   SQL_TEXT
MAXIMO      59     22149   select count(*) from trans where message is not null

Since you have the SID and SERIAL#, this session can be killed if required:

alter system kill session '59,22149';

Handy Unix command line script

I don't use this very often, but it's one I keep in my head that I find really useful.

One thing you can use it for is to quickly make backup copies of files in a directory:

for i in `ls *.dbf`
do
cp $i $i.bkp
done

Note the backticks (not single quotes) in the first line.

You can also use it to change some text in all the files (probably worth taking a backup first):

for i in `ls *.sql`
do
sed 's/PROD/DEV/g' $i > $i,
mv $i, $i
done

Note that the sed output goes to a file called $i, (with a trailing comma), which is then moved back over the original.

It can also be used if you copy files from Windows to Unix and they have "^M" at the end of each line:

for i in `ls *`
do
dos2unix $i > $i,
mv $i, $i
done

Ignore the "cannot get terminal type" messages.

RMAN duplicate - RMAN-06025 errors

We do RMAN duplicates of some PROD databases everyday, and for the most part they are pretty reliable.

Recently we started to get these errors:

RMAN-06025: no backup of log thread 1 seq 51996 lowscn 5982564789686 found to restore

Looking it up, we found that what the Duplicate process does is restore the database files from the backup, connect to the live PROD database to find the latest archived log sequence, then restore the intervening archived logs and apply them. The problem is that the backup finishes around 02:00 but the Duplicates start around 05:00. I don't know why it's starting to happen - maybe there is more activity on the database in the early hours than there used to be, so it's generating logs where it didn't before - but obviously those newer logs aren't backed up, so the error message is issued.

I looked into fixing it up and had a few thoughts. I could put a "SET UNTIL TIME" clause into the Duplicate script, but since the backups could finish at any time I couldn't guarantee that I wouldn't hit the same problem. I could also use a "SET UNTIL SCN" clause, but I would need to amend the script to connect to the PROD database, run a query to determine the SCN of the last backed up archive log, feed that to a variable and pass it to the Duplicate script.
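
For the record, the SET UNTIL SCN approach would have been a sketch along these lines - the query runs against PROD to find the highest SCN covered by archived logs that have already been backed up (backup_count > 0), and the SCN value and duplicate database name below are purely illustrative:

select max(next_change#)
from v$archived_log
where backup_count > 0;

RMAN> run {
2> set until scn 5982564780000;
3> duplicate target database to dup2;
4> }

The rest of the Duplicate script (channels, AUXNAME settings and so on) would stay the same.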

I was about to sit down and write it when I decided there was a better, and a lazier way: I changed the time of the Duplicate job to start soon after the backup completed. Although not an exact science, it's good enough that it hasn't thrown up the RMAN-06025 errors since I changed it 3 weeks ago.

The users don't need up-to-the-minute copies of the databases, so a few hours older makes no difference.

We were also hitting another error which we couldn't figure out. We do Duplicates of the one PROD database to a few different databases on different servers. In one of the logs we were seeing error messages with the names of different database files. For example, on server1 we duplicate PROD to dup1, and on server2 we duplicate PROD to dup2. On server2 we were seeing errors in the log to the effect

"'/u03/oradata/dup1/dup1_data01.dbf' directory doesn't exist"

Which meant that even though the Duplicate script on server2 specifically set AUXNAME to

"/u04/oradata/dup2/dup2_data01.dbf" etc

the job was trying to restore the dup1 file.

We could only surmise that there were 2 Duplicate jobs running around the same time, and RMAN was getting confused as to which files were supposed to be restored where. We rescheduled one of them an hour apart and it fixed it up.

RMAN obviously doesn't keep the script entirely in memory, and if there are 2 Duplicates going at the same time against the same source database it gets confused.

Problem with Database recovery - ORA-00314

We had a hardware failure on the archived log disk (on AIX), which resulted in massive alert logs containing repeating error messages:

ARC0: Error 19504 creating archivelog file '/d09/oracle/arch/PROD/arch_PROD_93739.arc'
ARC0: Archiving not possible: error count exceeded

and

ARC0: Archiving not possible: No primary destinations

There were 3 databases writing logs here and all crashed. All databases are V8.1.7.4.

The Unix team advised they would need to rebuild the file system and we would probably lose all the files. See, this is why you put archived logs and data files on separate disks!

Luckily we have cron jobs that scp the archive logs to a separate server every 30 minutes, so these were available.
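
The cron entry is nothing clever - something along these lines (the DR host name is illustrative, and a real version would keep track of which logs have already been copied rather than re-copying everything):

0,30 * * * * scp -q /d09/oracle/arch/PROD/arch_PROD_*.arc oracle@drserver:/u01/arch/PROD/ 2>/dev/null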

We tried restarting the databases (after changing the init.ora file to redirect the archive logs to a new disk) - 2 performed crash recovery and restarted straight away.

One of them prompted for a media recovery, so we checked the backup logs and grabbed the previous night's backup and restored it.

Note that since this is an old server the backups are not managed through RMAN, cold backups are done each night.

Tried a restart and recovery and that's where we hit problems - we got a

"ORA-00314: log 1 of thread 1, expected sequence# 77569 doesn't match 77573"

error message.

We looked this up and none of the explanations seemed to match what we knew. The redo logs and control files were from the same backup and roughly the same time, as were all the datafiles.
One suggestion was that the database was up when the files were backed up.

We re-checked the backup logs - the database was shut, the files copied by an OS command, the database restarted.

We restored the files again, tried recovery, same thing.

It was only then that the Unix guy told us he'd looked through his logs and the timings showed the database was open during the backup. I showed him our log and he said "What's that?". It turned out that this server had 2 sets of backup processes (I did say it was old). The one he restored from had backed up the files while the database was up, rendering them useless.

We managed to find the correct backup from 3 days before, re-ran the restore and recovery (after shipping the archived logs back to the new archive location from the DR server), and this time it worked.

Lessons learnt:

Talk to each other - we assumed the backup scripts and logs we always check are the 'proper' ones - not an unreasonable assumption. However, if other groups are involved in backups then it pays to double-check their processes and information.

Split data and log files onto different areas - should be standard practice, but you'd be surprised how many sites I see that don't do this. It's a no-brainer.

Protect your archived logs - because we scp these to a separate server every 30 mins we have an off-site copy. They also get backed up from the primary server, but it's always good to have multiple copies. Yes, this can be achieved with duplicate destinations, but this works too.

Don't panic - at one stage I thought all of our backups would be invalid and we wouldn't be able to recover. This was a Production database, so it would have been a bad situation to be in. However, our logs showed that the database was shut down before the files were backed up and restarted when complete, and the Unix team were able to restore it - the first backup they tried was just from a different (and useless) process. We check the logs every day and had confidence in the procedure, which we've now proved.

Keep the users informed - I sent email status updates all the time, including the stuff that went wrong. I find that users are very forgiving if they have all the information, and regular updates reassure them that I am working the problem. All they see is that their application/database is unavailable, and if they don't hear from the DBA every so often then, as far as they know, he/she has gone off to do something else. Sometimes users can be a pain and want constant updates, but if you get in first and send out a couple they soon feel confident that you're on top of it and will let them know when something changes. If you don't update them they will hassle you.

You can never have too many backups - as well as scp'ing the archived logs to the DR server, they are also backed up to tape. I also take a full export of the databases each night and scp that to the DR server as well (they are small databases). Although an export is not a backup, it's better to tell the users "I can have your database back as of 03:00 yesterday" rather than "Our backups haven't been working, we've lost your database".

There's nothing worse than typing "recover database until cancel;" on a PROD database and seeing an error come back.

But there's nothing better than telling the users "Your database is back up and running again."

Objective of this Blog

I've decided to create this blog to record my DBA activities so that I can keep a record of things I come across, various Oracle DBA issues and solutions, tips and tricks etc. Most of my information is spread across documents, emails, text files etc and I decided I wanted a place to centralise everything.

Whenever I encounter an issue and find a solution, I'll try and put it up here. This is mainly for my benefit, but if it helps other DBAs, all good.