Wednesday, December 4, 2019

DS: Termination Detection using Snapshots

Disclaimer: This article provides review and summary of  the given book & journal reference.



1. Introduction
One of fundamental problem in DS is to determine if a distributed computation has terminated. In DS, a problem is typically solved in a distributed manner with the cooperation of a number of processes. It is necessary to determine when the execution of a particular subproblem has ended so that the execution of the next subproblem may begin, or if the whole problem has been solved, and no longer additional computation is needed.

The detection of the termination in DS is non-trivial since no process has complete knowledge of the global state, and global time does not exist. A termination detection (TD) algorithm was developed to solved this issue. TD must ensure the following:

  1. Execution of a TD algorithm cannot indefinitely delay the underlying computation.
  2. TD algorithm must not require addition of new communication channels between processes.
This article provides a glance introduction of TD based on literature review.


2. System model of a distributed computation
Distributed computation consists of a fixed set of processes that communicate solely by message passing; usually this is done asynchronously. This means messages sent over the same communication channel may not obey the FIFO ordering.
A distributed computation has the following characteristics:

  1. During execution, a process can be in only one of the two states: active-busy or idle-passive.
  2. An active process can become idle at any time, once it finishes.
  3. An idle process can become active only on the receipt of a message from another process.
  4. Only active processes can send message.
  5. A message can be received by a process in idle or active.
  6. The sending of a message and the receipt of a message occur as atomic actions.
TD is obviously not necessary if the DS contains some processed that never been idle. The termination of distributed computation is model as:

pi(t) = the state (active or idle) of process pi at instant t
ci,j(t) = the number of messages in transit in the channel at instant t from process pi to process pj.



3. Termination detection using distributed snapshots
The algorithm was originally introduced by Huang. The algorithm uses the fact that a consistent snapshot of a distributed system captures stable properties. Termination of a distributed computation is a stable property. Thus, if a consistent snapshot of a distributed computation is taken after the distributed computation has terminated, the snapshot will capture the termination of the computation. The main idea behind the algorithm is as follows: when a computation terminates, there must exist a unique process which became idle last:

  1. When a process goes from active to idle, it issues a request to all other processes to take a local snapshot, and also requests itself to take a local snapshot.
  2. When a process receives the request, if it agrees that the requester became idle before itself, it grants the request by taking a local snapshot for the request.
  3. A request is said to be successful if all processes have taken a local snapshot for it.
  4. The requester or any external agent may collect all the local snapshots of a request.
  5. If a request is successful, a global snapshot of the request can thus be obtained and the recorded state will indicate termination of the computation: in the recorded snapshot, all the processes are idle and there is no message in transit to any of the processes.
The algorithm can be described in the following 7 steps.

  1. Each process i maintains a logical clock denoted by x, initialized to zero at the start of the computation.
  2. A process increments its x by one each time it becomes idle.
  3. A basic message sent by a process at its logical time x is of the form B(x).
  4. A control message that requests processes to take local snapshot issued by process i at its logical time x is of the form R(x, i).
  5. Each process synchronizes its logical clock x loosely with the logical clocks x’s on other processes in such a way that it is the maximum of clock values ever received or sent in messages.
  6. A process also maintains a variable k such that when the process is idle, (x,k) is the maximum of the values (x, k) on all messages R(x, k) ever received or sent by the process.
  7. Logical time is compared as follows: (x, k) > (x’, k’) if (x > x’) or ((x=x’) and (k>k’)), i.e., a tie between x and x’ is broken by the process identification numbers k and k’

The algorithm is defined by the following four rules:

The last process to terminate will have the largest clock value. Therefore, every process will take a snapshot for it, however, it will not take a snapshot for any other process.


4. Other TD Models
More TD models and algorithms were introduced to handle more DS implementations, such as:

  1. Weight throwing TD Algorithm
  2. A spanning-tree-based TD Algorithm
  3. Message-optimal termination detection
Discussion on these models are not covered in this article. The readers are encouraged to read about these models in the reference materials.


5. Conclusion
The TD is a basic activity of any DS model. This article provides introduction of the simplest form of TD algorithm known as Distributed Snapshot approach. A number of algorithms have been developed to detect the termination of a distributed computation. However, a comparative study between these algorithms was not seemed to be done yet. This article also encourages other researchers to continue the study in hope that the comparative result can be obtained in the future.


6. References
Distributed Computing: Principles, Algorithms, and Systems. 2008. Cambridge University Press.
Author: Ajay D.Kshemkalyani, Mukesh Singhai
Termination Detection by Using Distributed Snapshots. 1989. Information Processing Letters Vol. 32 No. 3. Elsevier Science Publishers.
Author: Shing-Tsaan HUANG