Correcting OGG-01733 in Goldengate – Trail file header file size mismatch

Warning – Oracle/Goldengate support will probably get mad at you if you try this. It worked great for me, but they recommended we reload the data from scratch, so that’s probably what they’d recommend for you too. Just know that this is 100% unofficial and unsupported 🙂

Quick summary:

We have Goldengate replication running from Oracle 11.2 on Linux to MSSQL 2012 on Windows, and we ran into an OGG-01733 error, “Trail file header file size value {X} for trail file {Y} differs from actual size of the file ({Z})”, which caused an ABEND and left us stuck. We opened a ticket with Oracle support and, after a week with very little response, they concluded that I should just perform a new initial load on the destination – since the trail files had already been pumped to the destination server and removed from the extract server, they were unable to troubleshoot further.

It turns out the work-around was to open the trail file in a hex editor and manually update the trail file header to make it think it was supposed to be the size it actually was. After saving the file again and resuming replication, it continued on its merry way and applied the transactions without another complaint.

Steps to resolve this error message:

    1. Make a backup of your trail file – you know, since you’re editing it and might want a second shot.
    2. Open the report file and make a note of the size the file is currently (“Z”) and the size it’s supposed to be (“X”). I’ll refer to those as X and Z further down.
    3. Use a decimal-to-hex converter to convert both of these values to their hex equivalents (I’ll call them “HX” and “HZ”).
    4. Load up the trail file in your favorite hex editing tool – I like using Notepad++ in combination with the HEX-editor plug-in (once the file is loaded, select “HEX-Editor” from the plug-ins menu, and then select “View in Hex”)
    5. Perform a search (if you’re using Notepad++, ensure the data type is set to “Hexadecimal”) for your “HX” value – the size the file thinks it should be. However, you need to search for an even number of digits – if your hex value is an odd number of digits, either drop the leftmost (largest) one or add a zero to the left (I dropped a digit).
    • Side note: You can see that my trail file size isn’t too far into the file – under 300 bytes from the beginning. However, since it’s stored as a binary value, it’s not something that’s easily readable in the file (unlike the file path and server version information you’ll see if you look to the right where the ASCII is displayed). Also, in my image, the file size is preceded by quite a few zeroes – my trail files are set to 100MB, but it appears Goldengate supports up to 4GB trail files using a 32-bit size value in the header. Back to fixing this…
    6. CAREFULLY edit the HX value you’ve found to be the new HZ value – the actual size of the file. In particular, don’t move any bytes around or add/remove anything; just change the values you need so that the file size is stored in the same location (there’s also a scripted version of this search-and-replace sketched just after this list).
    7. Save the file and close it.
    8. Resume replication right where you left off (assuming you made a backup and then edited the original trail file) – it should check the new file size, see the transaction that was previously beyond the size the header claimed, and then apply it and move on!
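
For anyone who’d rather script the edit than squint at a hex editor, here’s a minimal sketch of the same search-and-replace in Python. It assumes the header stores the size as a big-endian integer within the first few hundred bytes (which is where mine was) – verify that against your own file and report output, and keep the backup from step 1 handy.

# trail_fix.py -- unofficial sketch of the hex-editor fix described above.
# Assumes the header stores the trail size as a big-endian integer within
# the first few hundred bytes of the file (that's where mine was); verify
# against your own report file, and keep a backup before running this.
import shutil
import struct
import sys

def patch_trail_header(path, header_size, actual_size, search_window=512):
    """Replace the header's claimed file size (X) with the actual size (Z)."""
    shutil.copy2(path, path + ".bak")            # step 1: always keep a backup

    old_bytes = struct.pack(">I", header_size)   # X as a 4-byte big-endian value
    new_bytes = struct.pack(">I", actual_size)   # Z as a 4-byte big-endian value

    with open(path, "r+b") as f:
        head = f.read(search_window)
        offset = head.find(old_bytes)
        if offset < 0:
            sys.exit("Couldn't find the header size value -- different encoding or offset?")
        f.seek(offset)
        f.write(new_bytes)                       # same length, so nothing shifts
        print(f"Patched offset {offset}: {header_size} -> {actual_size}")

# Example call with made-up numbers -- plug in the X and Z from your report file:
# patch_trail_header("SV002356", header_size=75000000, actual_size=72017446)

I used a 4-byte big-endian encoding here because that matches what I saw in the hex editor for my ~100MB files; if the value isn’t found in the search window, fall back to the manual hex-editor approach rather than guessing at a different encoding.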

Conclusion?

What causes this behavior? I can’t find any clear documentation or explanation at all – when searching for this error, the only meaningful links I can find are either in a language I can’t read (with basic details and a dire warning to call Oracle support immediately) or a case where somebody hit it on an initial load and the forum’s advice was “your table is too small to mess with this – just export it to CSV and reload it that way”.

When we looked at the list of trail files, we noticed something particularly odd – the trail files near the offending file all had ascending “last modified” timestamps, as you’d expect, but this file was actually out of order:

05/01/2015  03:45 AM        99,999,462 SV002351
05/01/2015  04:38 AM        99,999,802 SV002352
05/01/2015  08:13 AM        99,999,367 SV002353
05/01/2015  10:09 AM        99,999,936 SV002354
05/01/2015  11:05 AM        99,999,630 SV002355
                                                 <-- File should be right here
05/01/2015  11:41 AM               891 SV002357
05/01/2015  11:47 AM        99,999,462 SV002358
05/01/2015  11:50 AM        99,999,280 SV002359
05/01/2015  11:58 AM        99,999,314 SV002360
05/01/2015  12:09 PM        99,999,910 SV002361
05/01/2015  12:40 PM        99,998,043 SV002362
05/01/2015  01:16 PM        99,999,754 SV002363
05/01/2015  01:34 PM        72,017,446 SV002356  <-- But it's down here
05/01/2015  02:05 PM        99,999,516 SV002364
05/01/2015  02:40 PM        99,999,966 SV002365
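
As an aside, if you’d rather not eyeball a directory listing for this kind of thing, a rough check like the sketch below will flag any trail file whose last-modified time is out of order relative to its sequence number – the dirdat/SV* pattern is just my trail prefix, so adjust it for your own setup.

# Rough check: sorted by name (i.e. by sequence number), trail files should
# also be in ascending last-modified order. Any inversion points at a file
# that was written, or re-written, later than the files that follow it.
import glob
import os

def find_mtime_inversions(pattern="dirdat/SV*"):
    files = sorted(glob.glob(pattern))           # name order == sequence order
    return [
        (earlier, later)
        for earlier, later in zip(files, files[1:])
        if os.path.getmtime(earlier) > os.path.getmtime(later)
    ]

for earlier, later in find_mtime_inversions():
    print(f"{earlier} was modified after {later} -- worth a closer look")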

The file contained two additional transactions between the size stated in the header and the actual end of the file, and both were time-stamped correctly to belong in that file (they were stamped 10:34 AM, the same as the transactions earlier in the file – the server clock is an hour off because of a time zone difference, so they were in the right file).

The fact that it’s smaller than the others, and that it’s followed by a file containing no transactions (just a header), led me to believe the file was cut short by a network interruption of some kind. We’re using a local extract and a separate pump, as we’re advised to do, but the connection still drops from time to time. My best guess is that the pump was interrupted mid-write, the remaining transactions were held back somehow, and they were appended to the file later – but when they were added, the file header wasn’t updated to match.

Hopefully this explanation and work-around have helped somebody else – we pulled our hair out for a week going back and forth with Oracle support and scouring the internet (unsuccessfully) for any relevant information – in the end, going rogue and editing the file was the only way (short of a complete reload) to get things moving again!

Oracle Goldengate REPLICAT frozen on “Starting”

We use Oracle Goldengate (expensive and probably overkill for Oracle->MSSQL, but good at what it does) to replicate data from an Oracle database into a SQL Server. However, I got an alert the other day that replication had stopped, and when I checked the status of replication, all the REPs we had set up were in status “Starting…”, but none were actually doing anything.

Attempting to stop them got the following error:

GGSCI (GGSERVER) 68> stop rep MYREP
Sending STOP request to REPLICAT MYREP ...
ERROR: opening port for REPLICAT MYREP (TCP/IP error: Connection refused).

Stopping/Starting the manager service or rebooting the PC didn’t help either – they still said “Starting” and were unresponsive. Even stranger, deleting and recreating the REP gave the same result – before I even attempted to start the REP for the first time, it said “Starting”, and an attempt to start it gave me “Process is starting up – try again later”.

The cause was the REP process status file, located in the DIRPCS folder under the Goldengate root – there should be a file there for each REP that’s currently running, giving details about its status, and when a REP stops, this file is deleted. Since none of the current REPs were actually doing anything (they were all sitting at the end of the previous trail file), they should have shown as stopped. I deleted the PCR files for the affected REP streams, and the manager then reported them as “STOPPED” – at that point, I was able to start up each REP without issue.
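
If you want to script that cleanup, here’s roughly what it boils down to – a hedged sketch, with the Goldengate root and REP names obviously being placeholders for your own, and with the caveat that you should confirm the replicat process really isn’t running before deleting its status file:

# Sketch: remove leftover process-status files for REPs stuck in "Starting".
# Mine were .pcr files under dirpcs; only delete one after confirming the
# matching replicat process is genuinely not running.
import glob
import os

GG_ROOT = r"C:\GoldenGate"        # adjust to your Goldengate root directory
STUCK_REPS = ["MYREP"]            # REP group names frozen in "Starting"

for rep in STUCK_REPS:
    for status_file in glob.glob(os.path.join(GG_ROOT, "dirpcs", rep + ".pc*")):
        print("Removing stale status file:", status_file)
        os.remove(status_file)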

I’m not sure how they got that way, but once started again, they all worked without issue. I hope this saves you the troubleshooting time of hunting down these files!

Changing SQL Server data file locations in clustered instance using service SIDs

Recently I added a second LUN to a clustered instance of SQL Server to isolate the log files (and another LUN for TempDB), and I mounted them to empty NTFS folders as I’ve done before, but when I started SQL Server, I received the following error:

CREATE FILE encountered operating system error 5 (Access is denied.)
while attempting to open or create the physical file
'X:\MSSQL10_50.INSTANCENAME\TempDB\tempdb.mdf'

The error means that SQL Server doesn’t have NTFS rights to the location of the TempDB, but when I went to add those rights, I found the permissions weren’t held by the domain proxy account as I’d expected – they were granted to the per-service SID account, MSSQL$InstanceName, instead. I attempted to grant permissions to this account at the new location, but couldn’t get it to resolve to an actual account. What finally worked was:

  1. In the permissions change dialog box, change the “Location” from the domain to the local machine (even though it’s a cluster and you’re using a domain account to run the service)
  2. In the text box, type “NT Service\MSSQL$INSTANCENAME” and click “Check Names”

Even though that appears to be a local account, it will resolve properly on all the cluster nodes involved. This step was necessary – as opposed to using the domain proxy account the service runs as – because during the initial SQL Server setup I’d chosen to use the per-service SID to hold permissions rather than a domain group. That’s the better choice permissions-wise, but it was the source of some confusion!
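
For what it’s worth, the same grant can be scripted instead of clicked through – here’s a hedged sketch that shells out to icacls from Python, with the path and instance name as placeholders for your own; run it on each node in the cluster.

# Sketch: grant the per-service SID full control on the new data folder.
# The path and instance name below are placeholders -- substitute your own.
import subprocess

data_path = r"X:\MSSQL10_50.INSTANCENAME\TempDB"    # new TempDB location
service_sid = r"NT Service\MSSQL$INSTANCENAME"      # per-service SID account

# (OI)(CI)F = object inherit + container inherit + full control
subprocess.run(
    ["icacls", data_path, "/grant", f"{service_sid}:(OI)(CI)F"],
    check=True,
)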