Correcting OGG-01733 in Goldengate – Trail file header file size mismatch

Warning – Oracle/Goldengate support will probably get mad at you if you try this. It worked great for me, but they recommended we reload the data from scratch, so that’s probably what they’d recommend for you too. Just know that this is 100% unofficial and unsupported 🙂

Quick summary:

We have Goldengate replication from Oracle 11.2 on Lunux to MSSQL 2012 on Windows, and we ran into an OGG-01733 error “Trail file header file size value {X} for trail file {Y} differs from actual size of the file ({Z})”, which caused an ABEND where we were stuck. We opened a ticket with Oracle support and after a week with very little response, the concluded that I should just perform a new initial load on the destination – since the trail files had already been pumped to the destination server and removed from the extract server, they were unable to troubleshoot further.

It turns out the work-around was to open the trail file in a hex editor and manually update the trail file header to make it think it was supposed to be the size it actually was. After saving the file again and resuming replication, it continued on its merry way and applied the transactions without another complaint.

Steps to resolve this error message:

    1. Make a backup of your trail file – you know, since you’re editing it and might want a second shot.
    2. Open the report file and make a note of the size the file is currently (“Z”) and the size it’s supposed to be (“X”). I’ll refer to those as X and Z further down.
    3. Use a decimal-to-hex converter like this one to convert both of these values to their hex equivalent (now I’ll call them “HX” and “HZ”)
    4. Load up the trail file in your favorite hex editing tool – I like using Notepad++ in combination with the HEX-editor plug-in (once the file is loaded, select “HEX-Editor” from the plug-ins menu, and then select “View in Hex”)
    5. Perform a search (if you’re using Notepad++, ensure the data type is set to “Hexadecimal”) for your “HX” value – the size the file thinks it should be. However, you need to search for an even number of digits – if your hex value is an odd number of digits, either drop the leftmost (largest) one or add a zero to the left (I dropped a digit):
    • Side note: You can see that my trail file size isn’t too far into the file – under 300 bytes from the beginning. However, since it’s stored in hex, it’s not something that’s easily viewable in the file (where you will see some file path and server version information if you look to the right where the ASCII is displayed. Also, in my image, the file size is preceded by a quite a few zeroes – my trail files are set to 100MB, but it appears Goldengate supports up to 4GB trail files using the 32 bytes in the header file. Back to fixing this…
  1. CAREFULLY edit the HX value you’ve found to be the new HZ value – the actual size of the file. In particular, don’t move any of the bytes around or add/remove anything, just fix the values you need to change so that the file size is stored in the same location.
  2. Save the file and close it.
  3. Resume replication right where you left off (assuming you made a backup and the edited the original trail file) – it should check the new file size, see the transaction that was previously beyond the file size limit, and then apply it and move on!

Conclusion?

What causes this behavior? I can’t find any clear documentation or explanation at all – when searching for this error, the only meaningful links I can find at all are either in an Oriental language and have basic details as well as a dire warning to call Oracle support immediately or a case where somebody receives it on an initial load and the forum’s advice is “your table is too small to mess with this – just export it to CSV and reload it that way”.

When we looked at the list of trail files, we noticed something particularly odd – the trail files near the offending file all had ascending “last modified” timestamps, as you’d expect, but this file was actually out of order:

05/01/2015  03:45 AM        99,999,462 SV002351
05/01/2015  04:38 AM        99,999,802 SV002352
05/01/2015  08:13 AM        99,999,367 SV002353
05/01/2015  10:09 AM        99,999,936 SV002354
05/01/2015  11:05 AM        99,999,630 SV002355
                                                 <-- File should be right here
05/01/2015  11:41 AM               891 SV002357
05/01/2015  11:47 AM        99,999,462 SV002358
05/01/2015  11:50 AM        99,999,280 SV002359
05/01/2015  11:58 AM        99,999,314 SV002360
05/01/2015  12:09 PM        99,999,910 SV002361
05/01/2015  12:40 PM        99,998,043 SV002362
05/01/2015  01:16 PM        99,999,754 SV002363
05/01/2015  01:34 PM        72,017,446 SV002356  <-- But it's down here
05/01/2015  02:05 PM        99,999,516 SV002364
05/01/2015  02:40 PM        99,999,966 SV002365

The file contained two additional transactions beyond the stated header size and the actual end of the file, and they were both time-stamped correctly to have been located in that file (they were both stamped 10:34AM, along with the transactions that were earlier in the file, and since the server is an hour off because of time zone, they were in the right file).

The fact that it’s smaller than the others, and that it’s followed by a file containing no transactions (just a header) led me to believe the file was cut short by a network interruption of some kind. We’re using a local extract and a separate pump, as we’re advised to do, but the connection still drops from time to time. In this case, I can only imagine it was in the middle of committing something, was interrupted, and then somehow these transactions were suspended for some reason and then added to the file later. I can’t imagine why, but when they’re added, the file header isn’t updated.

Hopefully this explanation and work-around have helped somebody else – we pulled our hair out for a week going back and forth with Oracle support and scouring the internet (unsuccessfully) for any relevant information – in the end, going rogue and editing the file was the only way (short of a complete reload) to get things moving again!

View SQL Server table updates per second

When trying to guage the level of database activity, you can use SQL Profiler to view the type and volume of transactions in motion at any given time and to view the overall level of database IO, but you can’t use it to directly tell which database tables are being updated.

However, there’s a handy dynamic management view called sys.dm_db_index_usage_stats that tells you the number of rows that have been updated in each database index since the instance was last restarted (or since the table/index was created, if that happened more recently):

SELECT *
FROM sys.dm_db_index_usage_stats[plain]

The view also has some additional information on index usage, including the number of scans, seeks, and lookups performed on each index – super helpful information if you’re looking for unused indexes or which objects are heaviest-hit. If you look at indexes 0 and 1 (zero is the table heap, 1 is the clustered index), you’ll see activity on the underlying table data itself.

I needed to see the row updates per second for every table in the database, so rather than run that select over and over (and compare the results), I wrote a quick script to do the comparison repeatedly for me:

SET NOCOUNT ON

-- Remove the working table if it already exists
-- so it doesn't get in the way
IF OBJECT_ID('tempdb..#TableActivity_After') IS NOT NULL
DROP TABLE #TableActivity_After

-- Collect our working data
SELECT object_name(us.object_id) as TableName,
		user_updates as UpdatedRows,
		last_user_update as LastUpdateTime
INTO #TableActivity_After
from sys.dm_db_index_usage_stats us
join sys.indexes si
	on us.object_id = si.object_id
	and us.index_id = si.index_id
where database_id = db_id()
and user_seeks + user_scans + user_lookups + user_updates > 0
and si.index_id in (0,1)
order by object_name(us.object_id)

-- Figure out if we're running it the first time or again
-- Put the data into the correct tables
IF OBJECT_ID('tempdb..#TableActivity_Before') IS NULL
BEGIN
	-- First time it's being run - stage the existing data
	PRINT 'Initial table usage collected - execute again for changes'

END
ELSE
BEGIN
	-- Running script a subsequent time
	-- Compare this set of data to our last set

	-- See how long it's been since we ran this script last
	-- Or at least since last change in any table in the database
   DECLARE @SecondsSince DECIMAL(10,2)
	SELECT @SecondsSince = CONVERT(FLOAT, DATEDIFF(ms, MAX(LastUpdateTime ), GETDATE()))/1000
	  FROM #TableActivity_BEFORE

	SELECT @SecondsSince as 'Seconds since last execution'

	-- Do actual table comparison and give results
	SELECT a.TableName,
		   a.updatedrows - isnull(b.UpdatedRows,0) as RowsUpdated,
		  CONVERT(INT, (a.updatedrows - isnull(b.UpdatedRows,0)) / @SecondsSince) as RowsPerSecond
	 FROM #TableActivity_After a
	 LEFT
	 JOIN #TableActivity_Before b
	   ON b.TableName = a.TableName
    WHERE a.updatedrows - isnull(b.UpdatedRows,0) > 0
	ORDER BY RowsUpdated DESC

END

-- Swap the tables so the AFTER table becomes the new BEFORE
-- Then clean up AFTER table since we'll get a new one next time
IF OBJECT_ID('tempdb..#TableActivity_Before') IS NOT NULL
DROP TABLE #TableActivity_Before

SELECT *
  INTO #TableActivity_Before
  FROM #TableActivity_After

DROP TABLE #TableActivity_After

Running that script the first time will grab an snapshot of table activity. Running it again will tell you what has changed since you ran it the first time, and running it again will continue to tell you (updating the “before” image each time so you’re getting an update on only the most recent database activity).

If you wanted to see activity on all database indexes, you could update the query at the top to show index name and remove the “WHERE si.index_id in (0,1)” and you’d see all the index details.

I hope this is helpful – if you have any feedback or would like to see something added, please feel free to leave a comment below!

Download the full script here

Querying Active Directory from SQL Server

SQL Server provides some pretty flexible integration with Active Directory through the ADSI Linked Server provider, something that’s present by default when you install SQL Server. If you’ve never used it before, it allows you to connect to a domain controller and query AD the same way you’d query any other linked server. For example, it gives you the option to:

  • Identify when logins to SQL Servers or databases that support financial applications exist, but have no matching AD account (either direct integrated logins, or if SQL logins or rows in a “User” table have been set up to match the AD login)
  • Kick off alerts to provision the user in various systems based on their AD group membership
  • Automatically trigger an action when a new account appears in active directory (for example, we auto-provision security badges and send an email alert to our head of security to assign the appropriate rights)

While much of this could also be done from Powershell as well, we use the SQL Server Agent to manage many of our scheduled job (because it’s so handy to have the agent remotely accessible), as well as sometimes just needing data from AD in a query. To support a number of processes we have in place, we run a synchronization job every so often throughout the day that pulls about two dozen fields for all users and synchronizes them into a table if anything has changed.

Setting up the linked server itself is pretty straightforward (courtesy of http://community.spiceworks.com/how_to/show/27494-create-a-sql-linked-server-to-adsi):

  1. Create the linked server itself
  2. Set the security context (if you want to query AD as something other than the SQL Server Service account – by default, all domain users can do this and it’s only required if the domain is remote or if, for some reason, your SQL Service account’s AD rights have been restricted, like if you’re running as “LOCAL SERVICE”)
  3. Enable OPENQUERY (Ad Hoc Distributed Queries)

You’ll notice that setting up the linked server itself doesn’t actually specify where Active Directory is located or what domain/forest you’ll be querying – that’s actually done in the query itself. In each query, you’ll need to specify the FQDN (Fully-qualified domain name) of the domain (or OU) of the domain you’re querying. For example, we’d get all users from a domain by issuing the following query (in this example, “ADLinkedServerName” is the linked server we just created, and our domain is “corp.mycompany.local”):

SELECT EmployeeNumber, Name AS FullName, givenName as FirstName, sn as LastName,
L AS Location, samAccountName as ADAccount
FROM OPENQUERY(ADLinkedServerName,'SELECT Name, L, givenName, sn,
EmployeeNumber, EmployeeID,samAccountName,createtimestamp
FROM ''LDAP://OU=Users,DC=corp,DC=mycompany,DC=local''
WHERE objectClass =''user''') ad

This query will search that OU (“Users”, in this case) and everything below it, so changing the FROM to “LDAP://DC=corp,DC=mycompany,DC=local” would fetch the entire directory (for all the “user” objects), regardless of what folder they appeared it – if your directory puts users in another OU (like “Associates”, for example), you should adjust the query accordingly.

For column names, you can pull any AD properties at all that you’re looking for – even custom ones that aren’t part of a standard AD configuration. To get an easy list of AD properties to choose from, I like using ADSIEDIT (part of Microsoft’s Remote Server Administration Tools – download RSAT for Windows 7 or RSAT for Windows 8.1) – just drill down all the way down to an object, like a user, right click on them and select “Properties”, and you can see a list of all the properties on that account. If you’ve got Domain Admin rights, this tool can be used to modify these values too, but for querying, you only need to be a domain user or somebody who has rights to browse AD. Make a note of the names of particular properties that you’re interested in – also note that AD queries are case-sensitive, so you’ll need to note the casing of these properties as well.

One potential gotcha that I’ve run into is that maximum result size that AD will return in a single query can be set as part of domain policy – by default it’s 1000 records at once, and can be configured by setting or adjusting the “PageSize” property on your domain controllers (see https://support.microsoft.com/kb/315071/en-us). Also, there’s a “MaxResultSetSize” property as well that’s set to 256KB by default, but I’ve never hit it – unless you’re pulling every single property back, you’d likely hit the PageSize row limit before you hit the ResultSize byte limit, but remember that both are there. If you do hit the AD result count limit, it will return the rows up to the limit, but then execution stops with a kind of cryptic error:

Msg 7330, Level 16, State 2, Line 2
Cannot fetch a row from OLE DB provider "ADsDSOObject" for linked server "YOURDOMAIN".

If your domain is larger than the PageSize limit, you’ll need to cut your query into multiple return sets of data so you don’t exceed the limit on any single query. Since our domain contains about 2400 users, we were able to do it in two queries, broken up like this:

SELECT samAccountName
  FROM OPENQUERY(ADLinkedServerName,'SELECT samAccountName
                                       FROM ''LDAP://OU=Users,DC=corp,DC=mycompany,DC=local''
                                      WHERE objectClass =''user''
                                        AND givenName<''L''') as c
UNION ALL
SELECT samAccountName
  FROM OPENQUERY(ADLinkedServerName,'SELECT samAccountName
                                       FROM ''LDAP://OU=Users,DC=corp,DC=mycompany,DC=local''
                                      WHERE objectClass =''user''
                                        AND givenName>=''L''') as c

By dividing the names on L, this cut the directory roughly in half – if yours was larger, you could divide it by querying each OU separately, or by looping through letters of the alphabet, or whatever makes sense in your setting. You could even do something dynamic like pull as many records as you can, then grab the value from the last record you pulled and use it as the baseline to pull the next set as far as you can, and then repeat until you run out of records. Linked servers don’t allow you to dynamically assemble your query at run-time – it has to be hard-coded in the query – but there are some ways around that (like building your OPENQUERY as a string and then executing it via sp_executesql, for example).

Now that you have your AD records stored in a temp table, you can identify new/changed records and merge them into a SQL table you already have ready using an INSERT/UPDATE/DELETE or MERGE statement, or possibly trigger notifications or some other business process.

I hope this is helpful – if you’d like some more detail, please leave a comment and I’m happy to elaborate where it’s necessary!

Removing expired/unused SSRS subscriptions

SQL Reporting Services doesn’t do a very good job keeping the SQL Agent clean by removing expired or otherwise unusable subscriptions from the job list. To deal with this, we created a script that pulls some details about these old subscriptions, including the report responsible, the last run time and status, and the user who originally scheduled it. If you notice your SQL Agent job list getting excessively long, you can use this query to identify the culprit reports and owners, and then either notify them or remove the old subscriptions manually yourself (run this on the server with your SSRS databases):

  select c.Name as ReportName,
         s.EventType,
         s.Description as SubscriptionDescription,
         s.LastStatus as LastSubscriptionStatus,
         s.LastRunTime SubscriptionLastRunTime,
         case
            when recurrencetype = 1 then 'One Time'
            when recurrencetype = 2 then 'Hourly'
            when recurrencetype = 4 then 'Daily'
            when recurrencetype = 5 then 'Monthly'
            when recurrencetype = 6 then 'Month Week'
            else 'Other'
         end as RecurranceType,
         s.DeliveryExtension,
         u.UserName as SubscriptionSetUpBy,
         s.ModifiedDate as SubscriptionLastModifiedDate
    from [ReportServer].[dbo].[Subscriptions] s
    join [ReportServer].[dbo].[Catalog] c
      on c.ItemID = s.Report_OID
    join [ReportServer].[dbo].[Users] u
      on u.UserID = s.OwnerID
    join [ReportServer].[dbo].[reportschedule] rs
      on c.itemid = rs.reportid
     and s.subscriptionid = rs.subscriptionid
    join [ReportServer].[dbo].[schedule] sch
      on rs.scheduleid = sch.scheduleid
   where s.EventType <> 'RefreshCache'
     and s.LastRunTime < dateadd(m, -3, getdate())
order by c.name

There are a number of similar scripts out there that pull much of this information together, but there wasn’t one that collected all the details we were looking for in one place. From here, you can deal with the subscriptions as you see fit.

Note that you can just remove the old subscriptions by brute force if you’d prefer, and SSRS will clean up the orphaned SQL jobs, but I’ve preferred to review the list and notify users as we’ve never had too much volume to deal with. If you want to just delete them straight away, you can do so here:

DELETE ReportServer.dbo.Subscriptions
WHERE InactiveFlags != 0
	OR LastRunTime < dateadd(m, -3, getdate())

Exporting from SQL Server to CSV with column names

SQL Server can easily export to CSV file, but it exports just the data, without the column names included. In order to export the column names, you need to actually perform two exports – one with the column names, and one with the data – and then combine the two files into a single file. It populates

You could do this using any query you want – native SQL, a linked server, a stored procedure, or anything else – and the results will export the same way once they’re in the temp table. Since it builds the list of column name dynamically as well, you only need to change out the query being executed and set the export location – no other configuration is necessary.

-- Declare the variables
DECLARE @CMD VARCHAR(4000),
        @DelCMD VARCHAR(4000),
        @HEADERCMD VARCHAR(4000),
        @Combine VARCHAR(4000),
        @Path VARCHAR(4000),
        @COLUMNS VARCHAR(4000)

-- Set values as appropriate
    SET @COLUMNS = ''
    SET @Path = '\\servername\share\outputpath'

-- Set up the external commands and queries we'll use through xp_cmdshell
-- Note that they won't execute until we populate the temp tables they refer to
    SET @CMD = 'bcp "select * from ##OutputTable" queryout "' + @Path + '\Temp_RawData.csv" -S ' + @@SERVERNAME + ' -T -t , -c'
    SET @HEADERCMD = 'bcp "SELECT * from ##cols" queryout "' + @Path + '\Temp_Headers.csv" -S ' + @@SERVERNAME + ' -T -t , -c'
    SET @Combine = 'copy "' + @Path + '\Temp_Headers.csv" + "' + @Path + '\Temp_RawData.csv" "' + @Path + '\MyCombinedFile.csv"'
    SET @DelCMD = 'del "' + @Path + '\Temp_*.csv"'

-- Create and populate our temp table with the query results
SELECT *
  INTO ##OutputTable
  FROM YourSourceTable

-- Generate a list of columns	
 SELECT @COLUMNS = @COLUMNS + c.name + ','
   from tempdb..syscolumns c
   join tempdb..sysobjects t
     on c.id = t.id
  where t.name like '##OutputTable%'
  order by colid
  
  SELECT @COLUMNS as Cols INTO ##Cols
		
-- Run the two export queries - first for the header, then for the data
exec xp_cmdshell @HEADERCMD
exec xp_cmdshell @CMD

-- Combine the two files into a single file
exec xp_cmdshell @Combine

-- Clean up the two temp files we created
exec xp_cmdshell @DelCMD

-- Clean up our temp tables
drop table ##cols
drop table ##OutputTable

If you have any suggestions or run into any issues, please let me know!