2013-12-02 11:38:15 WARNING OGG-00947 Oracle GoldenGate Manager for Oracle, mgr.prm: Lag for EXTRACT PRODP is 06:45:46 (checkpoint updated 00:00:04 ago).
2013-12-02 11:38:15 WARNING OGG-00947 Oracle GoldenGate Manager for Oracle, mgr.prm: Lag for EXTRACT PRODP is 06:45:46 (checkpoint updated 00:00:04 ago).
I haven't used GoldenGate much, so I had to poke around in the documentation.
First off, find where the files are. At this site, they are here:
/apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP
Your mileage may vary. I don't know if there is a standard location, but going to the $ORACLE_HOME and navigating up to 'product' may help.
An 'ls -alrt' will show the latest log files to have a look at. The most obvious file is the error log:
ggserr.log
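If you want to poke at it from the shell, something like this does the job (a rough sketch - the path is this site's layout, so adjust it for your own install):
cd /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP   # the GoldenGate home found above
ls -alrt                                              # most recently changed files and directories appear last
tail -50 ggserr.log                                   # the last entries in the error log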
This should show what the problem is. In my case it had some interesting entries:
2013-12-02 13:41:29 WARNING OGG-01834 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Failed setting IPv6 socket to dual stack mode (error: 99, Option not supported by protocol).
2013-12-02 13:41:29 WARNING OGG-01834 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Failed setting IPv6 socket to dual stack mode (error: 99, Option not supported by protocol).
2013-12-02 13:41:29 WARNING OGG-01834 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Failed setting IPv6 socket to dual stack mode (error: 99, Option not supported by protocol).
2013-12-02 13:41:29 INFO OGG-03035 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Operating system character set identified as ISO-8859-1. Locale: en_AU, LC_ALL:.
2013-12-02 13:41:30 WARNING OGG-00254 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: CACHEBUFFERSIZE 64KB is a deprecated parameter.
2013-12-02 13:41:30 INFO OGG-01815 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Virtual Memory Facilities for: COM anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/oracle/003/admin/PRODGP/PRODP_2_swap.
2013-12-02 13:41:30 INFO OGG-00996 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: REPLICAT aprodp_2 started.
2013-12-02 13:41:30 INFO OGG-01020 Oracle GoldenGate Delivery for Oracle, aprodp_2.prm: Processed extract process RESTART_ABEND record at seq 47832, rba 1051 (aborted 0 records).
At first glance, you might think "Aha! It's a network error with IPv6". But it's not. That's only a warning and (in this case at least) can be ignored.
The "RESTART_ABEND" also looks dodgy, but again, this is not the case. This line is telling you (and me) that GoldenGate was restarted after an abend.
I had to go back further into the log, and found this:
2013-12-01 08:06:49 ERROR OGG-00664 Oracle GoldenGate Capture for Oracle, PRODP_2.prm: OCI Error during OCIServerAttach (status = 12547-ORA-12547: TNS:lost contact).
2013-12-01 08:06:49 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, PRODP_2.prm: PROCESS ABENDING.
This happened a few times; it is some kind of network issue, but it tends to fix itself.
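To see how often it has happened, a quick search of the error log is enough (a sketch; tune the pattern to taste):
egrep "ERROR|ABENDING" ggserr.log | tail -20   # the most recent errors and abends, with their timestamps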
However, we were still getting the "Lag for EXTRACT PRODP is 06:45:46" messages, and the lag time was increasing, so it looked as if it was still stuck.
I did some more digging around.
Looking at the "ls -alrt" results, there is a directory that has also been updated recently - dirrpt . Navigating to that showed some recent files:
-rw-rw-rw- 1 oracle dba 12497 Dec 3 09:05 PRODP.rpt
-rw-rw-rw- 1 oracle dba 14812 Dec 3 09:05 aprodp.rpt
-rw-rw-rw- 1 oracle dba 28256 Dec 3 09:08 PRODP_2.rpt
-rw-rw-rw- 1 oracle dba 42520 Dec 3 09:08 aprodp_2.rpt
These report files are really useful. They show exactly what the GG parameters are, the tables that are being replicated, the directories involved and other messages.
The PRODP.rpt showed this:
2013-12-02 13:41:00 INFO OGG-01639 BOUNDED RECOVERY: ACTIVE: for object pool 1: p13152_extr.
2013-12-02 13:41:00 INFO OGG-01640 BOUNDED RECOVERY: recovery start XID: 23.3.2889152.
2013-12-02 13:41:00 INFO OGG-01641 BOUNDED RECOVERY: recovery start position: SeqNo: 172754, RBA: 1182883856, SCN: 1773.3471757875 (7618448773683), Timestamp: 2013-12-02 02:42:55.000000.
2013-12-02 13:41:00 INFO OGG-01642 BOUNDED RECOVERY: recovery end position: SeqNo: 172780, RBA: 237458548, SCN: 1773.3480433166 (7618457448974), Timestamp: 2013-12-02 03:39:00.000000, Thread: 1.
2013-12-02 13:41:00 INFO OGG-01579 BOUNDED RECOVERY: VALID BCP: CP.PRODP.000002188.
2013-12-02 13:41:00 INFO OGG-01629 BOUNDED RECOVERY: PERSISTED OBJECTS RECOVERED: 1.
Hmmm. The contents of this report file usually look like this:
27562 records processed as of 2013-12-03 02:00:29 (rate 0,delta 0)
27577 records processed as of 2013-12-03 02:30:33 (rate 0,delta 0)
27585 records processed as of 2013-12-03 03:02:06 (rate 0,delta 0)
27595 records processed as of 2013-12-03 04:19:41 (rate 0,delta 0)
27603 records processed as of 2013-12-03 05:15:29 (rate 0,delta 0)
So it still looked like GG wasn't doing anything.
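A quick way to check whether those counters are moving at all is to grep them out of the report (a sketch, run from the GoldenGate home):
grep "records processed" dirrpt/PRODP.rpt | tail -5   # is the record count still climbing?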
There is also a log file in this directory called MGR.rpt which is worth having a look at; it shows the activity of the GG manager:
2013-12-02 13:35:53 INFO OGG-00983 Manager started (port 7809).
2013-12-02 13:35:53 INFO OGG-00967 Manager performing AUTOSTART of ER processes.
2013-12-02 13:35:53 INFO OGG-00975 EXTRACT PRODP starting.
2013-12-02 13:35:54 INFO OGG-00975 EXTRACT PRODP_2 starting.
2013-12-02 13:35:54 INFO OGG-00975 REPLICAT aprodp starting.
2013-12-02 13:35:54 INFO OGG-00975 REPLICAT aprodp_2 starting.
2013-12-02 13:35:54 INFO OGG-00979 REPLICAT aprodp is down (gracefully).
2013-12-02 13:35:54 INFO OGG-00979 REPLICAT aprodp_2 is down (gracefully).
OK, so the log and report files are informative but still haven't explained what's happening (unless you're clever and have spotted what's going on).
Time to run some commands.
Make sure you're in the directory where 'ggsci' lives and run it:
./ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Solaris, sparc, 64bit (optimized), Oracle 10g on Apr 24 2012 09:06:57
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
GGSCI (prodhost) 1>
These are some of the commands that you can use - I won't show the output because it's quite lengthy in some cases:
info *
info *,detail
info *,showch
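Incidentally, ggsci will also read commands from standard input, so these checks can be scripted - a rough sketch, assuming you are still in the GoldenGate home:
echo "info all" | ./ggsci
# or run several commands in one go
./ggsci << EOF
info *, detail
send PRODP, status
EOF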
The most useful ones -
send PRODP,status
Which showed this:
Sending STATUS request to EXTRACT PRODP ...
EXTRACT PRODP (PID 16605)
Current status: In recovery[1]: Reading from data source
Current read position:
Redo thread #: 1
Sequence #: 172790
RBA: 775155132
Timestamp: 2013-12-02 04:02:25.000000
SCN: 1773.3484235828
Current write position:
Sequence #: 164
RBA: 1110
Timestamp: 2013-12-02 15:00:12.415826
Extract Trail: /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP/dirdat/pt
Note the "In Recovery[1]"
and
send PRODP, showtrans
Sending SHOWTRANS request to EXTRACT PRODP ...
Extract is currently in recovery mode (reading transactions from trail file). Please try again in a few minutes.
OK, now we're getting somewhere. Time to read the manual (I know, I should have done this in the beginning).
Looking at this document - http://docs.oracle.com/cd/E15881_01/doc.104/gg_troubleshooting_v104.pdf
I found this:
Extract freezes during recovery
If Extract abends due to an error or system failure, it might appear to be stalled when it is restarted. The cause of this is probably because Extract had to search a long way back into the transaction logs to find the beginning of a long-running transaction that was open at the time of the failure.
To find out whether a prolonged recovery is the cause of the delay, use the SEND EXTRACT command with the STATUS option. If one of the following messages is displayed on the Current status line, this indicates that recovery is in progress (or being completed) and that Extract is operating normally:
❍ In recovery[1] – Extract is recovering to its input checkpoint.
❍ In recovery[2] – Extract is recovering to its output checkpoint.
❍ Recovery complete – The recovery is finished, and normal processing will resume.
Ah. So even though there is a lag, and even though the error messages are getting generated, GG is actually performing as expected.
If the extract was down for a long time, it will take a long time to recover. It needs to read from (this data comes from the info command with the showch option)
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 1
Sequence #: 172699
to
Current Checkpoint (position of last record read in the data source):
Thread #: 1
Sequence #: 172817
RBA: 725406320
before it starts processing anything.
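For reference, that checkpoint detail comes from the showch option mentioned earlier; from within ggsci it's just:
info extract PRODP, showch
which lists the checkpoints the group holds, including the recovery and current read checkpoints quoted above.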
The send status command showed this:
GGSCI (prodhost) 4> send extract PRODP,status
Sending STATUS request to EXTRACT PRODP ...
EXTRACT PRODP (PID 16605)
Current status: In recovery[1]: Reading from data source
Current read position:
Redo thread #: 1
Sequence #: 172803
RBA: 1024377380
So sequence 172803 is slowly moving towards 172817; when it gets there, it will complete recovery and start processing.
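Rather than re-running the command by hand, a small shell loop can keep an eye on it (a sketch, again assuming the PRODP group and the GoldenGate home as the working directory):
# poll the extract every five minutes and pick out the status and sequence numbers
while true
do
  echo "send extract PRODP, status" | ./ggsci | egrep "Current status|Sequence #"
  sleep 300
done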
Later, this was the result of the
send PRODP, showtrans
Sending SHOWTRANS request to EXTRACT PRODP ...
Oldest redo log file necessary to restart Extract is:
Redo Log Sequence Number 172935, RBA 654089232
------------------------------------------------------------
XID: 150.1.40097
Items: 0
Extract: PRODP
Redo Thread: 1
Start Time: 2013-12-02:17:14:56
SCN: 1773.3576958226 (7618553974034)
Redo Seq: 172935
Redo RBA: 654089232
Status: Running
And this was the status:
GGSCI (prodhost) 3> send PRODP,status
Sending STATUS request to EXTRACT PRODP ...
EXTRACT PRODP (PID 16605)
Current status: Recovery complete: At EOF
Current read position:
Redo thread #: 1
Sequence #: 173092
RBA: 1033393152
Timestamp: 2013-12-03 09:57:01.000000
SCN: 1773.3710127915
Current write position:
Sequence #: 164
RBA: 17461664
Timestamp: 2013-12-03 09:56:56.021460
Extract Trail: /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP/dirdat/pt
So, although it took over 10 hours to recover, it was doing what was expected.