We had a hardware failure on the archived log disk (on AIX), which resulted in massive alert logs containing repeating error messages:
"ARC0: Error 19504 creating archivelog file '/d09/oracle/arch/PROD/arch_PROD_93739.arc'
ARC0: Archiving not possible: error count exceeded
"
and
"ARC0: Archiving not possible: No primary destinations".
There were 3 databases writing logs here and all crashed. All databases are V8.1.7.4.
The Unix team advised they would need to rebuild the file system and we would probably lose all the files. See, this is why you put archived logs and data files on separate disks!
Luckily we have cron jobs that scp the archive logs to a separate server every 30 minutes, so these were available.
We tried restarting the databases (after changing the init.ora file to redirect the archive logs to a new disk) - 2 performed crash recovery and restarted straight away.
One of them prompted for a media recovery, so we checked the backup logs and grabbed the previous night's backup and restored it.
Note that since this is an old server the backups are not managed through RMAN, cold backups are done each night.
Tried a restart and recovery and that's where we hit problems - we got a
"ORA-00314: log 1 of thread 1, expected sequence# 77569 doesn't match 77573"
error message.
We looked this up and none of the explanations seemed to match what we knew. The redo logs and control files were from the same backup and roughly the same time, as were all the datafiles.
One suggestion was that the database was up when the files were backed up.
We re-checked the backup logs - the database was shut, the files copied by an OS command, the database restarted.
We restored the files again, tried recovery, same thing.
It was only then that the Unix guy told us he'd looked through his logs and the timings showed the database was open during the backup. I showed him our log and he said "What's that?". It appeared that this server has 2 sets of backup processes (I did say it was old). The one he restored from backed up the files while the database was up, rendering them useless.
We managed to find the correct backup from 3 days before, re-ran the restore and recovery (after shipping the archived logs back to the new archive location from the DR server), and this time it worked.
Lessons learnt:
Talk to each other - we assumed the backup scripts and logs we always check are the 'proper' ones - not an unreasonable assumption. However, if other groups are involved in backups then it pays to double-check their processes and information.
Split data and log files onto different areas - should be standard practice, but you'd be surprised how many sites I see that don't do this. It's a no-brainer.
Protect your archived logs - because we scp these to a separate server every 30 mins we have an off-site copy. They also get backed up from the primary server, but it's always good to have multiple copies. Yes, this can be achieved with duplicate destinations, but this works too.
Don't panic - at one stage I thought all of our backups would be invalid, and we wouldn't be able to recover. This was a Production database so it would have been a bad situation to be in. However, our logs showed that the database was shutdown before the files were backed up and restarted when complete, and the Unix team were able to restore it - the first one they tried was from a different (and useless) process. We check the logs every day and had confidence in the procedure, which we've now proved.
Keep the users informed - I sent email status updates all the time, including the stuff that went wrong. I find that users are very forgiving if they have all the information and regular updates re-assure them that I am working the problem. All they see is their application / database is unavailable, and if they don't hear from the DBA every so often as far as they know he/she's gone off to do something else. Sometimes users can be a pain and want constant updates, but if you get in first and send out a couple they soon feel confident that you're on top of it and will let them know when something changes. If you don't update them they will hassle you.
You can never have too many backups - as well as scp'ing the archived logs to the DR server, they are also backed up to tape. I also take a full export of the databases each night and scp that to the DR server as well (they are small databases). Although an export is not a backup, it's better to tell the users "I can have your database back as of 03:00 yesterday" rather than "Our backups haven't been working, we've lost your database".
There's nothing worse than typing "recover database until cancel;" on a PROD database and seeing an error come back.
But there's nothing better than telling the users "Your database is back up and running again."