SQL and Beyond: December 2012

Sunday, December 23, 2012

Deadlock in highly concurrent situation

I recently experienced a deadlock case which happened in highly concurrent situation. So many concurrent access were requested at the same time against a single stored procedure and that caused a lot of deadlock situations. I think real case was more complex than what I describe here, but I just want to summarize some interesting deadlock case here.

The deadlock case occurred around a stored procedure called GetNextId. This SP returns a next job id from Job and JobQueue tables based on certain condition.

CASE 1: Here is the SP that caused deadlock. JobJobQueueView is a simple view that joins Job an JobQueue table based on JobId. So the SP selects a next job id from the view with certain where condition. It works fine in normal sitation but if highly concurrent calls are made, deadlock situation occurred.

WITH CTE1 (NextJobId) AS 
( 
   SELECT TOP 1 JobId 
   FROM JobJobQueueView WITH (HOLDLOCK) 
   WHERE (JobType = @JobType) 
    AND (SentToMachine IS NULL) 
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue 
SET @JobId = JobId, SentToMachine = @MachineName,StartTime = GETDATE()
FROM CTE1 
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobID is OUTPUT

The problem is : one SPID (55) holds Shared (S) lock on JobQueue table (and Job) when executing SELECT statement. At the same time another SPID (56) hold Shared lock on JobQueue while running SELECT. Since it has HOLDLOCK lock hint, the Shared locks are not released. Now SPID 55 tried to acquire exclusive X lock on JobQueue table to update data but cannot get it because SPID 56 holds Shared lock. At the same time, SPID 56 tried to get X lock on JobQueue but failed because SPID 55 still holds the S lock. Hence the deadlock. HOLDLOCK has the same effect of setting transaction isolation level to SERIALIZABLE.

CASE 2: To eliminate the side effect of HOLDLOCK (or SERIALIZABLE level), the HOLDLOCK lock hint was removed. Now no deadlock occurred but there is another side effect which only occurs in highly concurrent situation - the race condition. Some GetNextId requests got the same Id.

WITH CTE1 (NextJobId) AS 
(
    SELECT TOP 1 JQ.JobId
    FROM Job J, JobQueue JQ
    WHERE J.JobId = JQ.JobId
      AND (SentToMachine IS NULL)
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue
SET @JobId = JobId, SentToMachine = @MachineName, StartTime = GETDATE()
FROM CTE1
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobId is OUTPUT

The problem is : one SPID (55) holds Shared (S) lock on JobQueue and Job tables while executing SELECT, and at the same time another SPID (56) holds Shared (S) lock on JobQueue during SELECT, so they have the same next job id. But each Shared lock will be released as soon as the SELECT ends. One of them, say SPID 55, goes into UPDATE and hold exclusive lock on JobQueue table. SPID 56 will wait for X lock release. As soon as X lock is released from 55, 56 will update with the same id. There are other cases similar to this race condition. For example, SPID 55 release S lock after SELECT and if SPID 56 manages to read and write the same next job id before SPID 55 gets X lock on JobQueue, it is also possible for two SPIDs to have the same id.

CASE 3: To eliminate the side effect of case 2, I put UPDLOCK lock hint on JobQueue table. UPDLOCK allows other SPIDs to have shared lock but disallows others to have exclusive or update lock. This prevents the same id to be allowed at the same time.

WITH CTE1 (NextJobId) AS 
(
    SELECT TOP 1 JQ.JobId
    FROM Job J, JobQueue JQ WITH (UPDLOCK)
    WHERE J.JobId = JQ.JobId
      AND (JobType = @JobType)
      AND (SentToMachine IS NULL)
    ORDER BY Priority DESC, CreateTime ASC
)
UPDATE JobQueue
SET @JobId = JobId, SentToMachine = @MachineName, StartTime = GETDATE()
FROM CTE1
WHERE JobId = COALESCE(CTE1.NextJobId, -1);
-- @JobId is OUTPUT

By holding Update lock on JobQueue table during SELECT, another SPID cannot acquire UPDLOCK at the same time and should wait until the U lock is released. When one SPID (55) holds UPDLOCK in SELECT, it allows others to get S lock but disallows U or X lock. Another SPID (56) enters SELECT statement but cannot acquire U lock since it is already held by 55. SPID 56 will simply wait until 55 is done with update. Once 55 completes UPDATE statement, SPID 56 can proceed with SELECT statement where it will get next job id. When 55 executes UPDATE JobQueue statement, 55 will release U lock and acquire X lock (can be expressed as converting U to X lock) and so no other SPIDs or threads cannot access this resource.

Note: I wrote a C# test application to simulate deadlock. Click here for details.

Detect SQL script changes by using .SQL file checksum

I recently spent some time to find out the way of detecting the modification of SQL script file. Why I needed it is not that important, but quick summary is 'there is a application running a SQL script every 15 mins; if there is any change in script it runs the script; otherwsie it exits; to specify the script changes, the script has version number variable (in SQLCMD variable) and compare it with a SQL table (version table containing latest version data).' The downside of using version number in SQL script is that everytime someone changes the script, s/he also should manually increment version number. Sometime people forgot to do it, hence the bothersome problem.

So the solution I bring up is to detect SQL script file change by using the hash checksum of the file.
First step is to read the SQL script file in BLOB format by using OPENROWSET().

:setvar script1 C:\Script1.sql
DECLARE @val BIGINT
SELECT @val = master.sys.fn_repl_hash_binary(BulkColumn) 
FROM OPENROWSET(BULK '$(script1 )', SINGLE_BLOB) AS A

OPENROWSET(BULK ...) uses BULK rowset provider that accepts full file path and read mode such as BLOB, CLOB. The sentance above returns binary data of the file in BulkColumn.

Next step is to hash the binary data. SQL Server has a built-in function called fn_repl_hash_binary() that accepts binary data and returns a hash value.

So by comparing this @val with the version column value of version table, I was able to decide whether the SQL script is modified or not.

One of the problem I ran into though is OPENROWSET(BULK) only accepts FULL file path. I only can specify relative path in my case. So it was a headache... Fortunately the SQL script used SQLCMD mode,so I can get around this problem by using this approach.

(1) First at the begining of main script, I added the following statements. It calls VersionCheck.cmd batch file (see (2)) that will create _ChkSum.sql. Running _ChkSum.sql actually calculates the checksum of SQL script file(s) and save the result to #ChkSum temp table. Once _ChkSum.sql is run, I read the #ChkSum temp table and fetch the chksum value. Then I simply compare the variable with stored SQL table value.

-- Run version check. If there is no script change, exit.
:!! VersionCheck.cmd
:r _ChkSum_.sql
GO 

DECLARE @chk NUMERIC(38)
SELECT TOP 1 @chk = chk FROM #ChkSum

IF EXISTS (SELECT * FROM Master.Sys.Databases WHERE name = 'MyDB') 
   IF EXISTS (SELECT * FROM $(dbName).Sys.Tables WHERE name = 'VersionControl')
      IF EXISTS (SELECT * FROM $(dbName).dbo.VersionControl 
                 WHERE Id=1 AND ChkSum=@chk) 
      BEGIN   
         RAISERROR('No change. Exit now.', 2, 1);   
      END
GO

(2) Here is VersionCheck.cmd. This file contains OPENROWSET() statement that calculates SQL Script file checksum. The example below calculate the checksum of two SQL files and add them up.

@echo off
@echo :setvar script1 script1.sql > _ChkSum_.sql
@echo :setvar script2 script2.sql >> _ChkSum_.sql

@echo SET NOCOUNT ON >> _ChkSum_.sql
@echo DECLARE @val BIGINT >> _ChkSum_.sql
@echo DECLARE @chk NUMERIC(38) >> _ChkSum_.sql

@echo SELECT @val = master.sys.fn_repl_hash_binary(BulkColumn)  >> _ChkSum_.sql
@echo FROM OPENROWSET(BULK '%~dp0$(script1)', SINGLE_BLOB) AS A >> _ChkSum_.sql
@echo SET @chk = CAST(@val as NUMERIC(38)) >> _ChkSum_.sql

@echo SELECT @val = master.sys.fn_repl_hash_binary(BulkColumn)  >> _ChkSum_.sql
@echo FROM OPENROWSET(BULK '%~dp0$(script2)', SINGLE_BLOB) AS A >> _ChkSum_.sql
@echo SET @chk = @chk + CAST(@val as NUMERIC(38)) >> _ChkSum_.sql

@echo CREATE TABLE #ChkSum(chk NUMERIC(38)) >> _ChkSum_.sql
@echo INSERT #ChkSum VALUES (@chk) >> _ChkSum_.sql
@echo GO >> _ChkSum_.sql

To specify absolute full file path, I used %~dp0 in OPENROWSET(). This will be replaced by current working directory.
Please note that I used @chk variable as NUMERIC(38) because arithmetic overflow can occur if @chk is defined as BIGINT.
After sum of each checksum is added up, I saved the value to temp table (#ChkSum). Temp table can last in session so it can read in main script.

As a downside, using this approach might be slow if file is huge since it requires file reading.
But with relatively big file, this approach worked fine to me.

Saturday, December 22, 2012

How to capture DeadLock for a database

This article briefly explains how to capture DEADLOCK error that occurs at one database. In many cases, we want to know where the deadlock occurs for a specific target database.

(1) Start SQL Profiler
(2) Connect to SQL server
(3) In [Trace Properties] dialog, select [TSQL-Locks] template in [Use this template].

(4) In [Trace Properties] dialog, click [Events Selection] tab. Uncheck all unwanted events. I tend to uncheck all events except [Deadlock graph] when profiling against production server to reduce noise.
(5) In [Events Selection] grid, select [Database Name] column header.
(6) In [Edit Filter] dialog, expand [Like] and type your target database name. Click OK.

(7) Click [Run] to start profiling.
(8) Once captured all deadlock information you need, click [Stop] toolbar button.
(9) To save trace, save .trc file using File->Save.

(10) Now to review deadlock graph, click [Deadlock graph] row in the grid and the graph will be shown at the bottom. By rightclicking in the graph pane, you can select [Zoom to fit] to adjust size. A circle with blue X is the deadlock victim. If you put mouse over the circle, you will see TSQL statement in the process was executing. Owner mode means the process (spid) owned the specific lock and Request mode means the process tried to get the lock but failed to get it.

(11) If deadlock graph is pretty complex, you might want to save the trace as XML file since it will have a lot more details. XML trace file contains all TSQL statements, call frames and resource lock relations. To save trace as XML, goto File->Save As -> Trace XML file. Reading XML trace file needs more detective work but sometimes deadlock graph UI does not provide clear view due to complex lock relationship and in that case reading XML trace will help a lot.