Warning – Oracle/GoldenGate support will probably get mad at you if you try this. It worked great for me, but they recommended we reload the data from scratch, so that’s probably what they’d recommend for you too. Just know that this is 100% unofficial and unsupported 🙂
Quick summary:
We have GoldenGate replication from Oracle 11.2 on Linux to MSSQL 2012 on Windows, and we ran into an OGG-01733 error, “Trail file header file size value {X} for trail file {Y} differs from actual size of the file ({Z})”, which caused an ABEND we couldn’t get past. We opened a ticket with Oracle support, and after a week with very little response, they concluded that I should just perform a new initial load on the destination – since the trail files had already been pumped to the destination server and removed from the extract server, they were unable to troubleshoot further.
It turns out the work-around was to open the trail file in a hex editor and manually update the size stored in the trail file header so that it matched the file’s actual size. After saving the file and resuming replication, it continued on its merry way and applied the transactions without another complaint.
Steps to resolve this error message:
- Make a backup of your trail file – you know, since you’re editing it and might want a second shot.
- Open the report file and make a note of the file’s actual size (“Z”) and the size the header says it should be (“X”). I’ll refer to those as X and Z further down.
- Use a decimal-to-hex converter (or any programmer’s calculator) to convert both of these values to their hex equivalents (now I’ll call them “HX” and “HZ”)
- Load up the trail file in your favorite hex editing tool – I like using Notepad++ in combination with the HEX-Editor plug-in (once the file is loaded, select “HEX-Editor” from the Plugins menu, and then select “View in Hex”)
- Perform a search (if you’re using Notepad++, ensure the data type is set to “Hexadecimal”) for your “HX” value – the size the file thinks it should be. Note that you need to search for an even number of digits – if your hex value has an odd number of digits, either drop the leftmost (largest) digit or pad with a zero on the left (I dropped a digit).
- Side note: the trail file size isn’t stored too far into the file – under 300 bytes from the beginning in my case. However, since it’s stored as a binary value, it’s not something you can easily spot by eye (though you will see some file path and server version information if you look to the right, where the ASCII is displayed). Also, the file size is preceded by quite a few zeroes – my trail files are set to 100MB, but it appears GoldenGate supports up to 4GB trail files using a 32-bit size field in the header. Back to fixing this…
- CAREFULLY edit the HX value you’ve found, changing it to the new HZ value – the actual size of the file. In particular, don’t move any bytes around or add/remove anything; just overwrite the digits that need to change so that the file size stays in exactly the same location. (If you’d rather script this step, see the sketch just after this list.)
- Save the file and close it.
- Resume replication right where you left off (assuming you made a backup and edited the original trail file) – it should check the new file size, see the transactions that were previously beyond the stated file size, and then apply them and move on!
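If you’d prefer to script the patch rather than clicking around in a hex editor, here’s a minimal Python sketch of the same edit. It’s every bit as unofficial as the rest of this post: the trail file name and sizes are placeholders for your own X and Z values from the report file, and it assumes the header stores the size as a 32-bit big-endian integer near the start of the file (consistent with the run of leading zeroes I saw, but not something Oracle documents).

```python
#!/usr/bin/env python3
# Unofficial sketch: patch the size stored in a GoldenGate trail file
# header so it matches the file's actual size. The header layout is
# undocumented, so -- like the manual hex edit -- this just searches the
# start of the file for the byte pattern of the claimed size.
import shutil
import struct
import sys

TRAIL = "SV002356"      # placeholder: your trail file
CLAIMED_X = 72016123    # placeholder: size the header claims (X, from the report)
ACTUAL_Z = 72017446     # placeholder: actual on-disk size (Z, from the report)

shutil.copy2(TRAIL, TRAIL + ".bak")  # step 1: always keep a backup

with open(TRAIL, "rb") as f:
    head = bytearray(f.read(1024))   # the size field sat well under 300 bytes in

old = struct.pack(">I", CLAIMED_X)   # assumption: 32-bit big-endian size field
new = struct.pack(">I", ACTUAL_Z)

pos = head.find(old)
if pos < 0:
    sys.exit("Claimed size not found in the first 1KB - inspect the file by hand.")
if head.find(old, pos + 1) >= 0:
    sys.exit("Claimed size appears more than once - too ambiguous to auto-patch.")

# Overwrite exactly four bytes in place; never shift anything around.
with open(TRAIL, "r+b") as f:
    f.seek(pos)
    f.write(new)

print(f"Patched offset {pos}: {CLAIMED_X} -> {ACTUAL_Z}")
```

If the pattern isn’t found, or is found more than once, the script bails out rather than guessing – at that point, fall back to the manual hex edit above.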
Conclusion?
What causes this behavior? I can’t find any clear documentation or explanation at all – when searching for this error, the only meaningful links I can find are either in a non-English language, offering only basic details along with a dire warning to call Oracle support immediately, or a case where somebody received it on an initial load and the forum’s advice was “your table is too small to mess with this – just export it to CSV and reload it that way”.
When we looked at the list of trail files, we noticed something particularly odd – the trail files near the offending file all had ascending “last modified” timestamps, as you’d expect, but this file was actually out of order:
05/01/2015 03:45 AM 99,999,462 SV002351
05/01/2015 04:38 AM 99,999,802 SV002352
05/01/2015 08:13 AM 99,999,367 SV002353
05/01/2015 10:09 AM 99,999,936 SV002354
05/01/2015 11:05 AM 99,999,630 SV002355 <-- File should be right here
05/01/2015 11:41 AM        891 SV002357
05/01/2015 11:47 AM 99,999,462 SV002358
05/01/2015 11:50 AM 99,999,280 SV002359
05/01/2015 11:58 AM 99,999,314 SV002360
05/01/2015 12:09 PM 99,999,910 SV002361
05/01/2015 12:40 PM 99,998,043 SV002362
05/01/2015 01:16 PM 99,999,754 SV002363
05/01/2015 01:34 PM 72,017,446 SV002356 <-- But it's down here
05/01/2015 02:05 PM 99,999,516 SV002364
05/01/2015 02:40 PM 99,999,966 SV002365

The file contained two additional transactions between the stated header size and the actual end of the file, and they were both time-stamped correctly to have been located in that file (they were both stamped 10:34 AM, along with the transactions earlier in the file, and since the server is an hour off because of its time zone, they were in the right file).
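If you want to scan your own trail directory for this out-of-order pattern, here’s a small sketch that lists trail files in sequence order and flags any whose “last modified” timestamp precedes an earlier file’s. The directory path and trail prefix are placeholders for your environment:

```python
#!/usr/bin/env python3
# Sketch: list trail files in sequence order and flag any whose
# last-modified time precedes an earlier file's, like SV002356 above.
# Directory and trail prefix are placeholders for your environment.
import datetime
import re
from pathlib import Path

TRAIL_DIR = Path("/ggs/dirdat")  # placeholder: your trail directory
PREFIX = "SV"                    # placeholder: your two-character trail prefix

files = sorted(
    (p for p in TRAIL_DIR.iterdir() if re.fullmatch(PREFIX + r"\d{6}", p.name)),
    key=lambda p: p.name,        # sequence order: trail names sort lexically
)

latest = 0.0
for p in files:
    st = p.stat()
    stamp = datetime.datetime.fromtimestamp(st.st_mtime).strftime("%m/%d/%Y %I:%M %p")
    flag = "  <-- modified out of sequence" if st.st_mtime < latest else ""
    print(f"{stamp} {st.st_size:>13,} {p.name}{flag}")
    latest = max(latest, st.st_mtime)
```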
The fact that it’s smaller than the others, and that it’s followed by a file containing no transactions (just a header), led me to believe the file was cut short by a network interruption of some kind. We’re using a local extract and a separate pump, as we’re advised to do, but the connection still drops from time to time. In this case, I can only imagine the extract was interrupted in the middle of writing, and these transactions were somehow held back and then appended to the file later. I can’t imagine why, but when they’re appended, the file header isn’t updated to match.
Hopefully this explanation and work-around have helped somebody else – we pulled our hair out for a week going back and forth with Oracle support and scouring the internet (unsuccessfully) for any relevant information – in the end, going rogue and editing the file was the only way (short of a complete reload) to get things moving again!