Monday, December 2, 2013

GoldenGate Lag Issue

Someone recently implemented GoldenGate here, and it was happily chugging along when one day GG started issuing some nasty messages:

    2013-12-02 11:38:15  WARNING OGG-00947  Oracle GoldenGate Manager     for Oracle, mgr.prm:  Lag for EXTRACT PRODP is 06:45:46     (checkpoint updated 00:00:04 ago).
    2013-12-02 11:38:15  WARNING OGG-00947  Oracle GoldenGate Manager     for Oracle, mgr.prm:  Lag for EXTRACT PRODP is 06:45:46         (checkpoint updated 00:00:04 ago).


I haven't used GG much, so had to poke around in the documentation.

First off, finding where the files are. At this site, they are here :

    /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP

Your mileage may vary. I don't know if there is a standard location, but going to the $ORACLE_HOME and navigating up to 'product' may help.

An 'ls -alrt' will show the latest log files to have a look at. The most obvious file is the error log:

    ggserr.log

This should show what the problem is. In my case it had some interesting entries:

    2013-12-02 13:41:29  WARNING OGG-01834  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Failed setting IPv6 socket to     dual stack mode (error: 99, Option not supported by protocol).
    2013-12-02 13:41:29  WARNING OGG-01834  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Failed setting IPv6 socket to     dual stack mode (error: 99, Option not supported by protocol).
    2013-12-02 13:41:29  WARNING OGG-01834  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Failed setting IPv6 socket to     dual stack mode (error: 99, Option not supported by protocol).
    2013-12-02 13:41:29  INFO    OGG-03035  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Operating system character     set identified as ISO-8859-1. Locale: en_AU, LC_ALL:.
    2013-12-02 13:41:30  WARNING OGG-00254  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  CACHEBUFFERSIZE 64KB is a     deprecated parameter.
    2013-12-02 13:41:30  INFO    OGG-01815  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Virtual Memory Facilities     for: COM     anon alloc: mmap(MAP_ANON)  anon free: munmap
        file alloc: mmap(MAP_SHARED)  file free: munmap
        target directories:
        /oracle/003/admin/PRODGP/PRODP_2_swap.
    2013-12-02 13:41:30  INFO    OGG-00996  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  REPLICAT aprodp_2 started.
    2013-12-02 13:41:30  INFO    OGG-01020  Oracle GoldenGate     Delivery for Oracle, aprodp_2.prm:  Processed extract process     RESTART_ABEND record at seq 47832, rba 1051 (aborted 0 records).


At first glance, you might think "Aha! It's a network error with IPV6". But it's not. That's only a warning and (in this case at least) can be ignored.

The "RESTART_ABEND" also looks dodgy, but again, this is not the case. This line is telling you (and me) that GoldenGate was restarted after an abend.

I had to go back further into the log, and found this:

    2013-12-01 08:06:49  ERROR   OGG-00664  Oracle GoldenGate Capture for Oracle, PRODP_2.prm:  OCI Error during OCIServerAttach (status = 1
    2547-ORA-12547: TNS:lost contact).
    2013-12-01 08:06:49  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, PRODP_2.prm:  PROCESS ABENDING.


This happened a few times, so while it is a network issue of some kind it tends to fix itself.

However, we were still getting the "Lag for EXTRACT PRODP is 06:45:46" messages, and the lag time was increasing, so it looked as if it was still stuck.

I did some more digging around.

Looking at the "ls -alrt" results, there is a directory that has also been updated recently - dirrpt . Navigating to that showed some recent files:

    -rw-rw-rw-   1 oracle   dba        12497 Dec  3 09:05 PRODP.rpt
    -rw-rw-rw-   1 oracle   dba        14812 Dec  3 09:05 aprodp.rpt
    -rw-rw-rw-   1 oracle   dba        28256 Dec  3 09:08 PRODP_2.rpt
    -rw-rw-rw-   1 oracle   dba        42520 Dec  3 09:08 aprodp_2.rpt



These report files are really useful. They show exactly what the GG parameters are, the tables that are being replicated, the directories involved and
other messages.

The PRODP.rpt showed this:

    2013-12-02 13:41:00  INFO    OGG-01639  BOUNDED RECOVERY:  ACTIVE: for object pool 1: p13152_extr.

    2013-12-02 13:41:00  INFO    OGG-01640  BOUNDED RECOVERY: recovery start XID: 23.3.2889152.

    2013-12-02 13:41:00  INFO    OGG-01641  BOUNDED RECOVERY: recovery start position: SeqNo: 172754, RBA: 1182883856, SCN: 1773.3471757875     (7618448773683), Timestamp: 2013-12-02 02:42:55.000000.

    2013-12-02 13:41:00  INFO    OGG-01642  BOUNDED RECOVERY: recovery end position: SeqNo: 172780, RBA: 237458548, SCN: 1773.3480433166 (7618457448974),     Timestamp: 2013-12-02 03:39:00.000000, Thread: 1.

    2013-12-02 13:41:00  INFO    OGG-01579  BOUNDED RECOVERY: VALID BCP: CP.PRODP.000002188.

    2013-12-02 13:41:00  INFO    OGG-01629  BOUNDED RECOVERY: PERSISTED OBJECTS RECOVERED: 1.


Hmmm. The contents of this report file usually look like this:

               27562 records processed as of 2013-12-03 02:00:29 (rate 0,delta 0)
               27577 records processed as of 2013-12-03 02:30:33 (rate 0,delta 0)
               27585 records processed as of 2013-12-03 03:02:06 (rate 0,delta 0)
               27595 records processed as of 2013-12-03 04:19:41 (rate 0,delta 0)
               27603 records processed as of 2013-12-03 05:15:29 (rate 0,delta 0)


So it still looked like GG wasn't doing anything.

There is also a log file in this directory called MGR.rpt which is worth having a look at, it shows the activity of the GG manager:

    2013-12-02 13:35:53  INFO    OGG-00983  Manager started (port 7809).

    2013-12-02 13:35:53  INFO    OGG-00967  Manager performing AUTOSTART of ER processes.

    2013-12-02 13:35:53  INFO    OGG-00975  EXTRACT PRODP starting.

    2013-12-02 13:35:54  INFO    OGG-00975  EXTRACT PRODP_2 starting.

    2013-12-02 13:35:54  INFO    OGG-00975  REPLICAT aprodp starting.

    2013-12-02 13:35:54  INFO    OGG-00975  REPLICAT aprodp_2 starting.

    2013-12-02 13:35:54  INFO    OGG-00979  REPLICAT aprodp is down (gracefully).

    2013-12-02 13:35:54  INFO    OGG-00979  REPLICAT aprodp_2 is down (gracefully).



OK, so the log and report files are informative but still haven't explained what's happening (unless you're clever and have spotted what's going on).

Time to run some commands.

Make sure you're in the directory where 'ggsci' lives and run it:

    ./ggsci

    Oracle GoldenGate Command Interpreter for Oracle
    Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
    Solaris, sparc, 64bit (optimized), Oracle 10g on Apr 24 2012 09:06:57

    Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.

    GGSCI (prodhost) 1>


These are some of the commands that you can use - I won't show the output because it's quite lengthy in some cases:

   info *

  info *,detail

  info *,showch


The most useful ones -

    send PRODP,status

Which showed this:

    Sending STATUS request to EXTRACT PRODP ...

    EXTRACT PRODP (PID 16605)
      Current status: In recovery[1]: Reading from data source

      Current read position:
      Redo thread #: 1
      Sequence #: 172790
      RBA: 775155132
      Timestamp: 2013-12-02 04:02:25.000000
      SCN: 1773.3484235828
      Current write position:
      Sequence #: 164
      RBA: 1110
      Timestamp: 2013-12-02 15:00:12.415826
      Extract Trail: /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP/dirdat/pt


Note the "In Recovery[1]"

and

    send PRODP, showtrans

    Sending SHOWTRANS request to EXTRACT PRODP ...

    Extract is currently in recovery mode (reading transactions from trail file). Please try again in a few minutes.


OK, now we're getting somewhere. Time to read the manual (I know, I should have done this in the beginning).

Looking at this document - http://docs.oracle.com/cd/E15881_01/doc.104/gg_troubleshooting_v104.pdf

I found this:

    Extract freezes during recovery
    If Extract abends due to an error or system failure, it might appear to be stalled when it is restarted. The cause of this is probably because Extract had to search a long way back into the transaction logs to find the beginning of a long-running transaction that was open at the time of the failure.
    To find out whether a prolonged recovery is the cause of the delay, use the SEND EXTRACT command with the STATUS option. If one of the following messages is displayed on the Current status line, this indicates that recovery is in progress (or being completed) and that Extract is operating normally:

    ❍ In recovery[1] – Extract is recovering to its input checkpoint.
    ❍ In recovery[2] – Extract is recovering to its output checkpoint.
    ❍ Recovery complete – The recovery is finished, and normal processing will resume.


Ah. So even though there is a lag, and even though the error messages are getting generated, GG is actually performing as expected.

If the extract was for a long time, it will take time recovering. It needs to read from (data from info , showch):
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 1
Sequence #: 172699
to
Current Checkpoint (position of last record read in the data source):
Thread #: 1
Sequence #: 172817
RBA: 725406320

Before it starts processing anything.

The send status shows

    GGSCI (prodhost) 4> send extract PRODP,status
    Sending STATUS request to EXTRACT PRODP ...
    EXTRACT PRODP (PID 16605)
    Current status: In recovery[1]: Reading from data source
    Current read position:
    Redo thread #: 1
    Sequence #: 172803
    RBA: 1024377380


So 172803 is slowly moving towards 172817, when it gets there it will complete recovery and start processing.

Later, this was the result of the

    send PRODP, showtrans

    Sending SHOWTRANS request to EXTRACT PRODP ...

    Oldest redo log file necessary to restart Extract is:

    Redo Log Sequence Number 172935, RBA 654089232

    ------------------------------------------------------------
    XID:                  150.1.40097
    Items:                0
    Extract:              PRODP
    Redo Thread:          1
    Start Time:           2013-12-02:17:14:56
    SCN:                  1773.3576958226 (7618553974034)
    Redo Seq:             172935
    Redo RBA:             654089232
    Status:               Running



And this was the status:

    GGSCI (prodhost) 3> send PRODP,status

    Sending STATUS request to EXTRACT PRODP ...

    EXTRACT PRODP (PID 16605)
      Current status: Recovery complete: At EOF

      Current read position:
      Redo thread #: 1
      Sequence #: 173092
      RBA: 1033393152
      Timestamp: 2013-12-03 09:57:01.000000
      SCN: 1773.3710127915
      Current write position:
      Sequence #: 164
      RBA: 17461664
      Timestamp: 2013-12-03 09:56:56.021460
      Extract Trail: /apps/oracle/product/gg_11.2/ora10g_PRODP_PRODGP/dirdat/pt


So, although it took over 10 hours to recover, it was doing what was expected.