Today the development team complained that a pair of QA data servers was having issues with its common availability groups (there are four HADR Always On availability groups shared between just these two machines), making the databases unusable. I said I'd have a look and immediately noticed the LUN for SQL Server binaries had 0 bytes free. This is a 100 GB capacity drive, and the only two folders are Program Files, which Windows told me was only using 1.08 GB, and Program Files (x86), which was only using 535 MB. A look at protected/hidden files didn't reveal anything new, but fortunately a developer told me that the SSISDB log file was maxing out yesterday at 84 GB, and because I couldn't get SQL Server Management Studio (SSMS) to start, I got to use some canned T-SQL scripts in an osql session (C:\>osql -E):
SELECT GPS.name, DBS.database_name FROM sys.availability_databases_cluster AS DBS
INNER JOIN sys.availability_groups_cluster AS GPS ON
DBS.group_id = GPS.group_id
INNER JOIN sys.dm_hadr_database_replica_states AS STS ON
GPS.group_id = STS.group_id AND
DBS.group_database_id = STS.group_database_id
WHERE STS.is_primary_replica = 1 AND
DBS.database_name = 'SSISDB'
ORDER BY GPS.name ASC
GO
This tells me the name of the availability group that the SSISDB has been placed in. To shrink the log file, I need the logical file name, which I can obtain with this:
SELECT DB_NAME(database_id) AS DatabaseName, name AS LogicalFileName,
physical_name AS PhysicalFileName FROM sys.master_files WHERE DB_NAME(database_id) LIKE 'SSISDB%'
GO
I then remove SSISDB from the corresponding availability group:
ALTER AVAILABILITY GROUP [AG_xyzpdq] REMOVE DATABASE [SSISDB]
GO
And I reset the recovery model to simple so that I can shrink the log:
ALTER DATABASE [SSISDB] SET RECOVERY SIMPLE WITH NO_WAIT
GO
USE [SSISDB]
GO
DBCC SHRINKFILE (N'SSISDB_log',0,TRUNCATEONLY)
GO
I can immediately see in Windows Explorer that the LUN's free space is now at 80%. All this is on the data server that has the primary-replica status for the availability group that was holding SSISDB. On the secondary data server, I restore SSISDB and then drop it:
RESTORE DATABASE SSISDB WITH RECOVERY
GO
USE master
GO
DROP DATABASE SSISDB
GO
Then back on the primary, I reset the recovery model back to FULL for SSISDB:
ALTER DATABASE SSISDB SET RECOVERY FULL WITH NO_WAIT
GO
And then I can back up the SSISDB database:
BACKUP DATABASE [SSISDB] TO DISK = N'Path:\Changed\ToHide\SensitiveStuff\SSISDB.bak'
WITH NOFORMAT, NOINIT, NAME = N'SSISDB-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
At this point, I can start SSMS, and I use that to get SSISDB back into the availability group. But then the SSISDB log file starts to bloat up again. Why? Well, this is a QA machine, and the test engineer was testing an SSIS package and enabled verbose logging. So I had to drop SSISDB out of the AG again, set its recovery model back to SIMPLE, and then find the job with verbose logging enabled. Rather than click through 60+ jobs, I wrote a handy little script to find exactly which jobs are currently enabled with verbose logging on an SSIS job:
USE msdb
SELECT STEPS.job_id, JOBS.name, STEPS.command FROM [dbo].[sysjobsteps] AS STEPS
INNER JOIN [dbo].[sysjobs] AS JOBS ON
STEPS.job_id = JOBS.job_id
WHERE STEPS.subsystem='SSIS'
AND STEPS.command LIKE '%LOGGING_LEVEL(Int16)\"";3%'
AND JOBS.enabled = 1
From Job Activity Monitor, I clicked through to the job step, selected Configuration on the General tab, and then clicked the Advanced tab under the configuration section. I reset logging to Basic and then sent the above code to the QA engineer so he can quickly find which job steps he might have left with verbose logging.
Note: Adding SSISDB back into an availability group requires a password. There's a bug in SSMS 2016 that requires you to hit the refresh after entering the password in the wizard and after selecting the SSISDB database to be added into the availability group. Also, the wizard seems to forget what the shared location is for the Always On replicas (i.e., it will list the server name but not the shared directory). So it's good to store this information in a safe place.
Update: Something bothered me about this whole scenario. Why would verbose logging in SSIS cause the SSISDB transaction log to bloat instead of the master database file? I suspect the real culprit was a runaway/uncommitted transaction. The verbose logging was still a problem, but I don't think it was the cause of the massive log file. How SQL Server dealt with any uncommitted transactions is still inconclusive, but both nodes were rebooted a few times during this process. Next time I encounter this issue, I will likely start by looking for uncommitted transactions (with DBCC opentran) and go from there.
Update: On second thought, verbose logging would indeed affect the transaction log, especially since we have our SSISDBs in AlwaysOn availability groups, which requires each member database to be in FULL recovery model. FULL recovery model keeps even committed transactions in the transaction log until the transaction log is backed up. So it's likely that the verbose logging was causing this problem, although the QA engineer seems to have remembered to disable verbose logging when he was done testing. It's concerning to me that we have SSISDB.ldf writing to the same LUN as the SQL Server binaries (which might explain why Management Studio wasn't working properly, since it's installed at LUN:\Program Files\Microsoft SQL Server\130\Tools\Binn\ManagementStudio), so I'm going to ask about moving these log files to the same LUN as the user database log files.
Comments