Distributed Deadlock Detection

By JoAnne L. Holliday and Amr El Abbadi

University of California at Santa Barbara


Deadlock can occur whenever two or more processes compete for limited resources and are allowed to acquire and hold a resource (obtain a lock), thus preventing others from using that resource while the holding process waits for other resources.


Two common places where deadlocks may occur are with processes in an operating system (distributed or centralized) and with transactions in a database. The concepts discussed here are applicable to any system that allocates resources to processes.


Locking protocols such as the popular Two Phase Locking (see concurrency control) give rise to deadlock as follows: process A gets a lock on data item X while process B gets a lock on data item Y. Process A then tries to get a lock on Y. As Y is already locked, process A enters a blocked state. Process B now decides to get a lock on X, but is blocked. Both processes are now blocked, and, by the rules of Two Phase Locking, neither will relinquish their locks.


This is a deadlock: process A is waiting for a resource held by process B and process B is waiting for a resource held by process A. No progress will take place without outside intervention. Several processes can be involved in a deadlock when there exists a cycle of processes waiting for each other. Process A waits for B which waits for C which waits for A.
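The two-process scenario above can be reproduced with threads. This is only an illustrative sketch (the thread and lock names are invented): each process acquires its second lock with a short timeout, purely so the demonstration terminates instead of hanging forever the way a real deadlock would.

```python
import threading

lock_x, lock_y = threading.Lock(), threading.Lock()
both_held = threading.Barrier(2)   # ensure both processes hold their first lock
done = threading.Barrier(2)        # keep locks held until both attempts finish
results = {}

def process_a():
    with lock_x:                                   # A locks X
        both_held.wait()
        # A now wants Y; with an ordinary blocking wait this would hang forever
        results["A got Y"] = lock_y.acquire(timeout=0.2)
        done.wait()

def process_b():
    with lock_y:                                   # B locks Y
        both_held.wait()
        results["B got X"] = lock_x.acquire(timeout=0.2)
        done.wait()

a = threading.Thread(target=process_a)
b = threading.Thread(target=process_b)
a.start(); b.start(); a.join(); b.join()
print(results)   # both values are False: each process was blocked by the other
```

The barriers make the interleaving deterministic: neither process releases its first lock until both second requests have failed, which is exactly the circular wait described above.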


Four conditions must hold for deadlock to occur:

1.    Exclusive use: when a process accesses a resource, it is granted exclusive use of that resource.

2.    Hold and wait: a process is allowed to hold onto some resources while it is waiting for other resources.

3.    No preemption: a process cannot preempt or take away the resources held by another process.

4.    Cyclical wait: there is a circular chain of waiting processes, each waiting for a resource held by the next process in the chain.


The structure of the system may allow additional complexities in the deadlock problem. The simplest model, single-resource, requires that a process have no more than one unfulfilled request. Thus a blocked process is waiting for only one other process and can be involved in at most one deadlock cycle.


In the AND model (also called the multiple-resource model), a process is allowed to make several resource requests, and it is blocked until all of the requests are granted. Processes in this model can be involved in several deadlock cycles at once.


In the OR model (also called the communication model), a process makes several requests and is blocked until any one of them is granted. The AND-OR model allows a combination of request types, such as a request for resource X and either Y or Z. Unless specifically stated, this article assumes the AND model.


The problem of deadlocks can be handled in several ways: Prevention, Avoidance, and Detection. In prevention, some requirement of the system makes deadlocks impossible, so no runtime support is required. Avoidance schemes require decisions by the system while it is running to ensure that deadlocks will not occur. Detection requires the most sophisticated runtime support: the system must find deadlocks and break them by choosing a suitable victim that is terminated, or aborted and restarted if appropriate.


Deadlock Prevention


Prevention is the name given to schemes that guarantee that deadlocks can never happen because of the way the system is structured. One of the four conditions for deadlock is made impossible, thus preventing deadlocks. One way to do this is to require a process, when it is first started, to declare all of the resources it might eventually need. Only if all of those resources are available is the process allowed to continue. All of the resources are acquired together, the process proceeds, and it releases all the resources when it is finished. Thus, hold and wait cannot occur.


The major disadvantage of this scheme is that resources must be acquired because they might be used, not because they will be used. Also, the pre-allocation requirement reduces potential concurrency.


Another prevention scheme is to impose an order on the resources and require processes to request resources in increasing order. This prevents cyclic wait and thus makes deadlocks impossible.
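A sketch of ordered acquisition (the resource names and the global order itself are illustrative, not from the article): every process sorts its requests by the global order before acquiring, so a process can only ever wait for holders of higher-ordered resources, and no cycle of waiters can form.

```python
import threading

# hypothetical global order imposed on the resources
RESOURCE_ORDER = {"X": 0, "Y": 1, "Z": 2}
locks = {name: threading.Lock() for name in RESOURCE_ORDER}

def acquire_all(needed):
    # acquire in increasing global order; cyclic wait is impossible because
    # a blocked process only ever waits for a higher-ordered resource
    acquired = []
    for name in sorted(needed, key=RESOURCE_ORDER.__getitem__):
        locks[name].acquire()
        acquired.append(name)
    return acquired

def release_all(acquired):
    for name in reversed(acquired):
        locks[name].release()

held = acquire_all({"Y", "X"})
print(held)   # ['X', 'Y'] -- X is acquired first regardless of request order
release_all(held)
```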


One advantage of prevention is that process aborts are never required due to deadlocks. While most systems can deal with rollbacks, some systems may not be designed to handle them and thus must use deadlock prevention.

Deadlock Avoidance


In deadlock avoidance the system considers resource requests while the processes are running and takes action to ensure that those requests do not lead to deadlock. Avoidance based on the banker's algorithm, sometimes used in centralized systems, is considered not practical for a distributed system. Two popular avoidance algorithms based on timestamps or priorities are wound-wait and wait-die. They depend on the assignment of a unique global timestamp or priority to each process when it starts. Some authors refer to these as prevention.


In wound-wait, if process A requests a resource currently held by process B, their timestamps are compared. B is wounded and must restart if it has a larger timestamp (is younger) than A. Process A is allowed to wait if B has the smaller timestamp. Deadlock cycles cannot occur, since processes only wait for older processes. In wait-die, if a request from process A conflicts with a resource held by process B, A will wait if B has the larger timestamp (is younger). If B is the older process, A is not allowed to wait, so it dies and restarts.
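The two decision rules can be written as small functions. This is a sketch (the function names and return strings are invented); timestamps are assumed unique, with a smaller timestamp meaning an older process.

```python
def wound_wait(requester_ts, holder_ts):
    # smaller timestamp = older process; timestamps assumed unique
    if requester_ts < holder_ts:
        return "wound holder"    # older requester forces younger holder to restart
    return "requester waits"     # younger requester may wait for the older holder

def wait_die(requester_ts, holder_ts):
    if requester_ts > holder_ts:
        return "requester dies"  # younger requester aborts and restarts
    return "requester waits"     # older requester may wait for the younger holder
```

Note that in both rules a process only ever waits for a younger process (wound-wait) or only ever waits for an older process (wait-die), so a cycle of waiting processes cannot form.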


In timeout based avoidance, a process is blocked when it requests a resource that is not currently available. If it has been blocked longer than a timeout period, it is aborted and restarted. Given the uncertainty of message delays in distributed systems, it is difficult to determine good timeout values.


These avoidance strategies have the disadvantage that the aborted process may not have been actually involved in a deadlock.


Deadlock Detection


Deadlock detection attempts to find and resolve actual deadlocks. These strategies rely on a Wait-For-Graph (WFG) that in some schemes is explicitly built and analyzed for cycles. In the WFG, the nodes represent processes and the edges represent the blockages or dependencies. Thus, if process A is waiting for a resource held by process B, there is an edge in the WFG from the node for process A to the node for process B.


In the AND model (resource model), a cycle in the graph indicates a deadlock. In the OR model, a cycle may not mean a deadlock since any of a set of requested resources may unblock the process. A knot in the WFG is needed to declare a deadlock. A knot exists when all nodes that can be reached from some node in a directed graph can also reach that node.
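The knot condition can be checked directly from that definition. In this sketch (dict-based graph, hypothetical process names), a node is in a knot exactly when it has successors and every node reachable from it can reach it back:

```python
def reachable(graph, start):
    # all nodes reachable from start by following wait-for edges
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def in_knot(graph, node):
    # node is in a knot iff it has successors and every reachable node can
    # reach node back; only then is an OR-model process actually deadlocked
    succ = reachable(graph, node)
    return bool(succ) and all(node in reachable(graph, m) for m in succ)
```

For example, the pure cycle A -> B -> C -> A is a knot, but adding an edge B -> D to an unblocked process D breaks the knot: A's OR request could still be satisfied along that path, so A is not deadlocked even though it lies on a cycle.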


In a centralized system, a WFG can be constructed fairly easily. The WFG can be checked for cycles periodically or every time a process is blocked, thus potentially adding a new edge to the WFG. When a cycle is found, a victim is selected and aborted.
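The periodic cycle check can be a depth-first search over the WFG. A minimal sketch, assuming the WFG is kept as a dict mapping each process to the set of processes it waits for (AND model):

```python
def has_cycle(wfg):
    # wfg: dict mapping each process to the set of processes it waits for
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(p):
        color[p] = GRAY                 # p is on the current DFS path
        for q in wfg.get(p, ()):
            c = color.get(q, WHITE)
            if c == GRAY:               # back edge: a wait-for cycle exists
                return True
            if c == WHITE and visit(q):
                return True
        color[p] = BLACK                # p and its successors are cycle-free
        return False

    return any(color.get(p, WHITE) == WHITE and visit(p) for p in wfg)
```

In the AND model every process found on such a cycle is deadlocked, and any one of them can be selected as the victim.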


Distributed Deadlock


A distributed system consists of a number of sites connected by a network. Each site maintains some of the resources of the system. Processes with a globally unique identifier run on the distributed system. They make resource requests to a controller. There is one controller per site. If the resource is local, the process makes a request of the local controller. If the desired resource is at a remote site, the process sends a message. After a process makes a request, but before it is granted, it is blocked and said to be dependent on the process that holds the desired resource.


The controller at each site could maintain a WFG on the process requests that it knows about. This is the local WFG. However, each site's WFG could be cycle free and yet the distributed system could be deadlocked. This is called global deadlock. This would occur in the following situation:


1. Process A at site 1 holds a lock on resource X.

2. Process A has requested, but has not been granted, resource Y at site 2.

3. Process B at site 2 holds a lock on resource Y.

4. Process B has requested, but has not been granted, resource X at site 1.


Both processes are blocked by the other one. There is a global deadlock. However, the deadlock will not be discovered at either site unless they exchange information via some detection scheme.
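With the local WFGs of this example (edges point from the waiting process to the holder), each site's graph is acyclic but their union is not. A sketch:

```python
# local wait-for edges from the four steps above
site1 = {"B": {"A"}}   # at site 1, B waits for A (A holds X)
site2 = {"A": {"B"}}   # at site 2, A waits for B (B holds Y)

def has_cycle(wfg):
    # simple recursive cycle check on a dict of wait-for edges
    def visit(p, path):
        if p in path:
            return True
        return any(visit(q, path | {p}) for q in wfg.get(p, ()))
    return any(visit(p, frozenset()) for p in wfg)

# union of the local graphs, as a snapshot-based detector would build it
union = {}
for local in (site1, site2):
    for p, waits in local.items():
        union.setdefault(p, set()).update(waits)

print(has_cycle(site1), has_cycle(site2), has_cycle(union))  # False False True
```

Neither site alone sees the cycle A -> B -> A; it only appears once the local graphs are combined, which is why some exchange of information between sites is unavoidable.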


Distributed Detection Schemes


A detection scheme is evaluated by two criteria: 1) if an actual deadlock exists, it must be detected in a finite amount of time; and 2) the scheme must not find a deadlock that is not actually there.


One way to detect deadlocks in distributed systems is for each site to construct a local WFG on the part it knows about. A site that suspects deadlock initiates a global snapshot protocol that constructs a consistent global state of the distributed system (see Distributed Snapshots). The global state corresponds to a deadlock if the union of the local WFGs has a cycle. This algorithm is correct because deadlock is a stable property.


The Probe Scheme


The scheme proposed by Chandy, Misra and Haas [1] uses local WFGs to detect local deadlocks and probes to determine the existence of global deadlocks. If the controller at a site suspects that process A, for which it is responsible, is involved in a deadlock it will send a probe message to each process that A is dependent on. When A requested the remote resource, a process or agent was created at the remote site to act on behalf of process A to obtain the resource. This agent receives the probe message and determines the dependencies at the current site.


The probe message identifies process A and the path it has been sent along. Thus, the message probe(i,j,k) says that it was initiated on behalf of process i and was sent from the controller for agent j (which i is dependent on) to the controller for agent k.

When an agent whose process is not blocked receives a probe, it discards the probe. It is not blocked so there is no deadlock. An agent that is blocked sends a probe to each agent that it is blocked by. If process i ever receives probe(i,j,i), it knows it is deadlocked.
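The probe mechanism can be simulated in a single process. This is only a sketch: the real algorithm exchanges messages between the controllers at different sites, which are flattened here into one message queue, and the dependency map is assumed given.

```python
from collections import deque

def detects_deadlock(initiator, blocked_by):
    # blocked_by maps each blocked process to the set of processes it waits
    # for; unblocked processes are simply absent from the map.
    # Each queue entry is a probe(i, j, k): initiated for i, sent from j to k.
    queue = deque((initiator, initiator, k)
                  for k in blocked_by.get(initiator, ()))
    forwarded = set()   # edges (j, k) already probed, to stop re-forwarding
    while queue:
        i, j, k = queue.popleft()
        if k == initiator:
            return True                     # probe(i, j, i) returned: deadlock
        if k not in blocked_by or (j, k) in forwarded:
            continue                        # k not blocked: discard the probe
        forwarded.add((j, k))
        for nxt in blocked_by[k]:           # k is blocked: forward the probe
            queue.append((i, k, nxt))
    return False
```

For the cycle A -> B -> C -> A, the probe initiated for A travels along the dependency edges and comes back as probe(A, C, A), so A learns it is deadlocked; if C were not blocked, the probe would be discarded there and no deadlock declared.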


This popular approach, often called "edge-chasing", has been shown correct in that it detects all deadlocks and does not find deadlocks when none exist.


In the OR model, where only one out of several requested resources must be acquired, all of the blocking processes are sent a query message and all of the messages must return to the originator in order for a deadlock to be declared. The query message is similar to a probe, but has an additional identifier so the originator can determine whether or not all the paths were covered.

There are several variations to these algorithms that seek to optimize different properties such as: number of messages, length of messages, and frequency of detection. One method proposed by Mitchell and Merritt for the single-resource model, sends information in the opposite direction from the WFG edges [4]. These and other algorithms are discussed in more detail in [2] and [3].




[1] K.M. Chandy, J. Misra, and L.M. Haas, Distributed Deadlock Detection, ACM Transactions on Computer Systems, 1(2), 143-156, May 1983.


[2] A.K. Elmagarmid, A Survey of Distributed Deadlock Detection Algorithms, ACM SIGMOD Record, 15(3), September 1986.


[3] E. Knapp, Deadlock Detection in Distributed Databases, ACM Computing Surveys, 19(4), 303-328, December 1987.


[4] D.P. Mitchell and M.J. Merritt, A Distributed Algorithm for Deadlock Detection and Resolution, Proceedings, ACM Symposium on Principles of Distributed Computing, ACM, 282-284, 1984.


-----------------Cross reference

WaitForGraph see Distributed Deadlock Detection

Locking see Distributed Deadlock Detection

Deadlocks see Distributed Deadlock Detection

Concurrency Control see Distributed Deadlock Detection

Global Snapshot see Distributed Deadlock Detection

