Correcting OGG-01733 in Goldengate – Trail file header file size mismatch

Warning – Oracle/Goldengate support will probably get mad at you if you try this. It worked great for me, but they recommended we reload the data from scratch, so that’s probably what they’d recommend for you too. Just know that this is 100% unofficial and unsupported 🙂

Quick summary:

We have Goldengate replication from Oracle 11.2 on Linux to MSSQL 2012 on Windows, and we ran into an OGG-01733 error “Trail file header file size value {X} for trail file {Y} differs from actual size of the file ({Z})”, which caused an ABEND that we couldn’t get past. We opened a ticket with Oracle support, and after a week with very little response, they concluded that I should just perform a new initial load on the destination – since the trail files had already been pumped to the destination server and removed from the extract server, they were unable to troubleshoot further.

It turns out the work-around was to open the trail file in a hex editor and manually update the file size value in the trail file header to match the size the file actually was. After saving the file and resuming replication, it continued on its merry way and applied the transactions without another complaint.

Steps to resolve this error message:

    1. Make a backup of your trail file – you know, since you’re editing it and might want a second shot.
    2. Open the report file and make a note of the size the file is currently (“Z”) and the size it’s supposed to be (“X”). I’ll refer to those as X and Z further down.
    3. Use a decimal-to-hex converter to convert both of these values to their hex equivalents (I’ll call them “HX” and “HZ”), or use the quick SQL sketch after this list
    4. Load up the trail file in your favorite hex editing tool – I like using Notepad++ in combination with the HEX-editor plug-in (once the file is loaded, select “HEX-Editor” from the plug-ins menu, and then select “View in Hex”)
    5. Perform a search (if you’re using Notepad++, ensure the data type is set to “Hexadecimal”) for your “HX” value – the size the file thinks it should be. However, you need to search for an even number of digits – if your hex value is an odd number of digits, either drop the leftmost (largest) one or add a zero to the left (I dropped a digit):
    • Side note: You can see that my trail file size isn’t too far into the file – under 300 bytes from the beginning. However, since it’s stored in hex, it’s not something that’s easily viewable in the file (though you will see some file path and server version information if you look to the right where the ASCII is displayed). Also, in my image, the file size is preceded by quite a few zeroes – my trail files are set to 100MB, but it appears Goldengate supports up to 4GB trail files using the 32 bits in the file header. Back to fixing this…
    6. CAREFULLY edit the HX value you’ve found to be the new HZ value – the actual size of the file. In particular, don’t move any of the bytes around or add/remove anything; just change the values in place so that the file size remains stored in the same location.
    7. Save the file and close it.
    8. Resume replication right where you left off (assuming you made a backup and edited the original trail file) – it should check the new file size, see the transaction that was previously beyond the stated file size, and then apply it and move on!
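
If you have SSMS handy (the destination in our setup is SQL Server anyway), T-SQL can do the decimal-to-hex conversion as well. This is only a convenience sketch; the sizes below are made up for illustration, so plug in the X and Z values from your own report file:

-- Hypothetical sizes for illustration only; use the X and Z from your report file
DECLARE @X INT = 99999462,   -- "X": the size the trail file header claims
        @Z INT = 72017446    -- "Z": the actual size of the file on disk

SELECT CONVERT(VARBINARY(4), @X) AS HX,  -- hex value to search for in the trail file
       CONVERT(VARBINARY(4), @Z) AS HZ   -- hex value to replace it with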

Conclusion?

What causes this behavior? I can’t find any clear documentation or explanation at all – when searching for this error, the only meaningful results I can find are either pages in other languages with only basic details and a dire warning to call Oracle support immediately, or a case where somebody received the error during an initial load and the forum’s advice was “your table is too small to mess with this – just export it to CSV and reload it that way”.

When we looked at the list of trail files, we noticed something particularly odd – the trail files near the offending file all had ascending “last modified” timestamps, as you’d expect, but this file was actually out of order:

05/01/2015  03:45 AM        99,999,462 SV002351
05/01/2015  04:38 AM        99,999,802 SV002352
05/01/2015  08:13 AM        99,999,367 SV002353
05/01/2015  10:09 AM        99,999,936 SV002354
05/01/2015  11:05 AM        99,999,630 SV002355
                                                 <-- File should be right here
05/01/2015  11:41 AM               891 SV002357
05/01/2015  11:47 AM        99,999,462 SV002358
05/01/2015  11:50 AM        99,999,280 SV002359
05/01/2015  11:58 AM        99,999,314 SV002360
05/01/2015  12:09 PM        99,999,910 SV002361
05/01/2015  12:40 PM        99,998,043 SV002362
05/01/2015  01:16 PM        99,999,754 SV002363
05/01/2015  01:34 PM        72,017,446 SV002356  <-- But it's down here
05/01/2015  02:05 PM        99,999,516 SV002364
05/01/2015  02:40 PM        99,999,966 SV002365

The file contained two additional transactions between the size stated in the header and the actual end of the file, and both were time-stamped correctly to have been located in that file (they were stamped 10:34AM, the same as the transactions earlier in the file, and since the server is an hour off because of time zone, that put them in the right file).

The fact that it’s smaller than the others, and that it’s followed by a file containing no transactions (just a header), leads me to believe the file was cut short by a network interruption of some kind. We’re using a local extract and a separate pump, as we’re advised to do, but the connection still drops from time to time. In this case, I can only imagine the process was in the middle of committing something, was interrupted, and somehow these transactions were held back and appended to the file later – I can’t explain why, but when they were added, the file header wasn’t updated.

Hopefully this explanation and work-around have helped somebody else – we pulled our hair out for a week going back and forth with Oracle support and scouring the internet (unsuccessfully) for any relevant information – in the end, going rogue and editing the file was the only way (short of a complete reload) to get things moving again!

Oracle Goldengate REPLICAT frozen on “Starting”

We use Oracle Goldengate (expensive and probably overkill for Oracle->MSSQL, but good at what it does) to replicate data from an Oracle database into a SQL Server. However, I got an alert the other day that replication had stopped, and when I checked the status of replication, all the REPs we had set up were in status “Starting…”, but none were actually doing anything.

Attempting to stop them got the following error:

GGSCI (GGSERVER) 68> stop rep MYREP
Sending STOP request to REPLICAT MYREP ...
ERROR: opening port for REPLICAT MYREP (TCP/IP error: Connection refused).

 

Stopping/Starting the manager service or rebooting the PC didn’t help either – they still said “Starting” and were unresponsive. Even stranger, deleting and recreating the REP gave the same result – before I even attempted to start the REP for the first time, it said “Starting”, and an attempt to start it gave me “Process is starting up – try again later”.

The cause was the REP process status file, located in the DIRPCS folder under the Goldengate root – there should be a file there for each REP that’s currently running, giving details about its status. When a REP stops, this file is deleted. Since none of the current REPs were doing anything (they were all sitting at the end of the previous trail file), they should have been stopped. I deleted the PCR files for the affected REP streams, and the manager then reported them as “STOPPED” – at that point, I was able to start up each REP without issue.

I’m not sure how they got that way, but once started again, they all worked without issue. I hope this saves you the troubleshooting time of hunting down these files!

“Initializing Reconciler has failed” when setting up SQL Compact replication

When initializing replication to a .NET Compact Framework client on a mobile device, I received an error message when I attempted to start the synchronization:

Initializing the SQL Server Reconciler has failed. Try again.

I had confirmed that SQL Compact web replication was set up correctly, and checking the URL came back as expected. Searching for the error online turns up a dozen different recommendations, but when I traced the replication sync attempt, I saw the following statement executed:

exec sp_helpdistpublisher N'SQLSERVERNAME'

Followed immediately by the error message:

The remote server “SQLSERVERNAME” does not exist, or has not been designated as a valid Publisher, or you may not have permission to see available Publishers.

Sure enough, executing that command in SSMS, logged in as my replication user, gave me the same error message. At some point, I’d changed the user I was using to set up the subscription, and that user didn’t have rights to view the publication list on my SQL Server. The fix was pretty easy:

  1. In SQL Management Studio, right-click the publication
  2. Select “Properties” and then open the “Publication Access List” tab
  3. Add the user you’re connecting your subscriber with to this list

Here’s a shot of the screen where I had to make this change, in case there’s any confusion:

Publication Security Settings
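
If you’d rather script the change than click through the dialog, the same access can be granted with sp_grant_publication_access, run at the Publisher in the publication database. The database, publication, and login names below are placeholders; substitute your own:

-- Run at the Publisher, in the publication database (all names below are placeholders)
USE MyPublishedDatabase
GO
EXEC sp_grant_publication_access
     @publication = N'MyPublication',
     @login       = N'DOMAIN\ReplUser'   -- the login your subscriber connects with
GO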

Calculating working hours between two dates

As a follow-up to an earlier post (Return a list of all dates between a start and end date), I needed to find the number of working hours between two timestamps – in this case, to see how long a support ticket had been open before it was initially assigned, but the user didn’t want non-work hours to count against them.

To do this, I used the previous script to generate a list of dates and hours, and then marked the rows as work time or not (based on day of week and hour of day, evaluated together). The result was a table that would effectively let me do a SUM to find the value I was looking for. Once I had that table, I could join to it for rows between the two datetimes in question and SUM up rows that had “WorkTime” marked:

SELECT tt.TicketNumber,
       tt.TicketCreateTime,
       tt.TicketAssignTime,
       SUM(  CONVERT(INT, wh.IsWorktime)) as WorkHoursBeforeAssigned,
       COUNT(CONVERT(INT, wh.IsWorktime)) as TotalHoursBeforeAssigned
  FROM TroubleTickets tt
  JOIN #WorkingHours wh
    ON wh.EvaluateTime BETWEEN tt.TicketCreateTime
                           AND tt.TicketAssignTime
GROUP BY tt.TicketNumber,
         tt.TicketCreateTime,
         tt.TicketAssignTime

In this case, tickets that were created and picked up after hours, without passing any work time, would show as zero hours old (as they should, since the user was only interested in working time) – however, I’ve also included COUNT here to show total hours alongside work hours.

Also, this script only counts for raw day-of-week and hour-of-day working time – it ignores holidays and other special circumstances. I have a script that tracks holidays (American ones, at least), and I’ll put that up shortly as well – if you want to take holidays into account, you could incorporate that into your evaluation.
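
If you do have a holiday table available, one way to fold it in (a rough sketch only, assuming a hypothetical dbo.Holidays table with a HolidayDate column) is an extra UPDATE right after the existing one in the script below:

-- Sketch: flag holiday dates as non-working time
-- Assumes a hypothetical dbo.Holidays(HolidayDate DATE) table
UPDATE #WorkingHours
   SET IsWorktime = 0
 WHERE CONVERT(DATE, EvaluateTime) IN (SELECT HolidayDate FROM dbo.Holidays)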

Here’s the script that builds the working time table (you can also download it here):

-- Set things up before we get started
--------------------------------------
DECLARE @WorkTimeStart		TINYINT,
		@WorkTimeEnd		TINYINT,
		@WorkDayOfWeekStart	TINYINT,
		@WorkDayOfWeekEnd	TINYINT

DECLARE @StartDate			DATETIME,
		@EndDate			DATETIME

CREATE TABLE #WorkingHours (
		EvaluateTime	DATETIME,
		IsWorktime		BIT DEFAULT(0)
)

--------------------------------------

	SET @WorkTimeStart = 7  --7AM
	SET @WorkTimeEnd   = 16 --4PM hour (4-5PM count as working)
	SET @WorkDayOfWeekStart = 2 --Monday
	SET @WorkDayOfWeekEnd   = 6 --Friday

	SET @StartDate	= '2000-01-01 00:00:00'
	SET @EndDate	= '2020-12-31 23:59:59'

--------------------------------------


-- Build the list of timestamps we're working with
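-- (cross joining sys.columns to itself is just a cheap way to generate enough
--  sequential numbers to cover one row per hour between @StartDate and @EndDate)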
;WITH numberlist(number)
   AS (SELECT RANK() over(order by c1.object_id,
                                   c1.column_id,
                                   c2.object_id,
                                   c2.column_id)
		 from sys.columns c1
        cross 
         join sys.columns c2)
INSERT INTO #WorkingHours (EvaluateTime)
SELECT DATEADD(hh, number-1, @StartDate)
  FROM numberlist
 WHERE DATEADD(hh, number-1, @StartDate) <= @EndDate


-- Set the times to worktime if they match criteria
UPDATE #WorkingHours
   SET IsWorktime = CASE WHEN (DATEPART(dw, EvaluateTime)
								BETWEEN @WorkDayOfWeekStart
								AND @WorkDayOfWeekEnd)
							  AND
							  (DATEPART(hh, EvaluateTime)
							   BETWEEN @WorkTimeStart
							   AND @WorkTimeEnd) THEN 1
						 ELSE 0
					END


-- Return the results
 SELECT * FROM #WorkingHours
 ORDER BY EvaluateTime

 DROP TABLE #WorkingHours

Moving a SQL Server database to another server on a schedule – without using replication

Recently, I had the need to copy a set of databases from a dozen remote servers to a central server, restore them, and have it happen automatically, with no intervention from me at all. Replication wouldn’t work for the following reasons:

  1. Many tables didn’t have primary keys, so merge replication was out (even though this was only one-way replication)
  2. The size of the databases (28GB in one instance) and the quality/speed of the WAN removed the log shipping option
  3. There’s too much activity to consider any kind of live replication

Given our restrictions, we decided to go the following route. On the remote server, we set up a batch file that did the following:

  1. Use OSQL to back up the databases in question to a folder (roughly the statement sketched just after this list)
  2. Run 7Zip from the command line to compress the backups into separate archives. To support the automatic restore later, each archive was given the name we wanted the database restored as on the central server (for example, Site1ProdDB was backed up to Site1ProdDB.BAK, then compressed to Site1ProdDB.7z)
  3. Delete the BAK files
  4. Archives were renamed from *.7z to *.7zz (this is important – I’ll explain why in the server part)
  5. Scripted FTP using Windows command line FTP tool to a folder on our central collection server
  6. Once the FTP was complete, rename the archives on the remote server back from *.7zz to *.7z
  7. Delete the local *.7zz files
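
The actual batch file is in the download at the end of this post, but the backup step boils down to OSQL running a statement along these lines (the backup path and options here are just an example, not necessarily what the downloadable script uses):

-- Roughly what step 1 runs via OSQL; the backup folder is a placeholder
BACKUP DATABASE Site1ProdDB
    TO DISK = 'D:\Backups\Site1ProdDB.BAK'
    WITH INIT, STATS = 10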

That’s it for the client – the BAT file was scheduled as a SQL Agent job so that we could kick it off remotely from any site we wanted, or so we could set them up on a schedule. Then, we put a BAT file on the server that did the following:

  1. Check folder for files that match *.7z
  2. For each one found, do the following:
    1. Extract it to a “Staging” folder
    2. Delete the 7z file for that archive
    3. Use OSQL to restore the file from the command line (roughly the statement sketched just after this list)
    4. Use OSQL to run a script that changes the DB owner, adds some user permissions, and generally does some housework on the database
    5. Use an SMTP tool to send an email notice that the backup has been restored
  3. Repeat step 2 for every .7z file in the folder
  4. As a second step in the SQL Agent job, run “MoveLog.bat” (included below) to finish rotating the logs – it ensures that only logs with meaningful information are kept
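
Likewise, the restore step (sub-step 3 above) amounts to OSQL running something like this; the logical file names and paths are placeholders and would need to match your own database:

-- Roughly what the server-side restore runs via OSQL; names and paths are placeholders
RESTORE DATABASE Site1ProdDB
   FROM DISK = 'E:\Staging\Site1ProdDB.BAK'
   WITH REPLACE,
        MOVE 'Site1ProdDB_Data' TO 'E:\SQLData\Site1ProdDB.mdf',
        MOVE 'Site1ProdDB_Log'  TO 'E:\SQLLogs\Site1ProdDB_log.ldf'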

The server BAT process can run as often as desired – in our case, we run it every 30 minutes, so a backup is picked up and restored as soon as it’s available. That’s where the rename on the client side comes into play: if the files were named Database.7z, the server process would attempt to pick them up while they were still being uploaded via FTP, and shenanigans would ensue. Because the archives are only renamed back to *.7z once the upload completes, the server side only ever sees complete files, and it picks them up on its next pass.

As I said before, I scheduled both the client (source) and the server (restore/destination) process as SQL Agent jobs – the Windows scheduler is too cumbersome to work with remotely, and kicking them off on demand was a pain. With the SQL Agent, they can be started on demand, and then I get an email notification as soon as they’ve been successfully restored.

I’ve attached the files below, and I welcome any feedback that you have or any improvements that can be made – I’m happy to give you credit and post a new version here. Specifically, I’m interested in any feedback about how to make this process more dynamic – I know BAT scripting supports FOR loops and wildcards, but I was unable to make them work properly with OSQL, so I’d appreciate any input there. Enjoy!

Download the ZIP archive containing the files for this post