Reliability of Distributed Systems
Erick Redwine and JoAnne L. Holliday
Santa Clara University
Making a distributed system reliable is very important. The failure of a distributed system can result in anything from easily repairable errors to catastrophic meltdowns. A reliable distributed system is designed to be as fault tolerant as possible. Fault tolerance deals with making the system function in the presence of faults (see Fault-Tolerant Systems). Faults can occur in any of the components of a distributed system. This article gives a brief overview of the different types of faults in a system and some of their solutions.
There are three types of component faults: transient faults, intermittent faults and permanent faults. A transient fault occurs once and then disappears. If the operation is repeated then the system will behave normally. An intermittent fault arises, then goes away, rises again and so on. A common cause of an intermittent fault is a loose contact on a connector. These faults are very annoying and hard to fix because of their sporadic nature. Lastly there are permanent faults caused by faulty components. The system will not work until the component is replaced. Burnt out chips, software bugs and processor failure (explored in Processor Faults) are all examples of permanent faults.
A special type of component is the processor, and it can fail in three ways, fail-silent, Byzantine and slowdown. All lead to a different kind of failure. A fail-silent, also known as fail-stop, fault occurs when a processor stops functioning. It no longer accepts input and outputs nothing, except perhaps to say it is no longer functioning. Byzantine faults occur when a faulty processor continues to run, giving wrong answers and maybe working with other faulty processors to give the impression that they are working correctly. Compared with fail-silent faults, Byzantine faults are hard to diagnose and more difficult to deal with. A slowdown fault occurs when a certain processor executes very slowly. That processor might be labeled as failed by the other processors. However the “failed” processor may return to its normal speed and begin to issue orders, causing problems within the distributed systems.
Network failures keep processors from communicating with each other. We will look at the failures resulting in total loss of communication along parts of the network. Two problems arise from this: one-way links and network partitions. One-way links cause problems similar to processor slowdown (see Processor Faults). For example, processor A can send messages to processor B but cannot receive messages from B. Processor C can talk to both A and B. So each processor has a different idea of which processors have failed. Processor A might think that B has failed since it does not receive any messages from it. Processor C thinks both A and B are working properly since it can send and receive messages from both.
Network partitions occur when a line connecting two sections of a network fail. So processors A and B can communicate with each other but cannot communicate with processors C and D and visa versa. Let us say processor A updates a file and processor C updates the same file but in a different way. When the partition is fixed, the file will not be consistent among all processors. It is not clear to the processors how to make the data consistent again.
To be reliable, a distributed system must be able to reach agreement (see Consensus and Byzantine Agreement), even if the system is faulty. This means all processors in a distributed system must agree on some value. Two cases must be explored to understand the problem of agreement. The first involves perfect processors but faulty communication lines, and the second problem looks at Byzantine processors. For very similar analogies, refer to Tanenbaum .
Two processors cannot be in agreement if the communication lines between them are faulty. The perfect example of this is the well-known, two-army problem illustrating the trouble of getting two perfect processors with faulty communication between them to agree on one bit of information. Two allied armies, led by General Bonaparte and General Alexander, are encamped on opposite sides of the enemy army. If both armies attack together then they will emerge victorious, but if they attack at different times they will be defeated. Their only mode of communication is a messenger who is subject to capture by the enemy army. Bonaparte decides it is time to attack so he sends a message reading, “Let’s attack tomorrow at dawn.” Alexander receives the message and responds, “Sounds good. See you at dawn.” The messenger manages to make it through the enemy line and Bonaparte receives the message. Both armies ready themselves for attack.
Later that day, Bonaparte realizes that Alexander does not know if the messenger made it back safely. So Bonaparte sends the messenger back across enemy lines to deliver a message of confirmation to Alexander. Upon receiving this message of confirmation, Alexander sends a message back to tell Bonaparte that he received the message and that the attack is still on. No matter how many messages are sent, neither General will attack because both Generals are unsure if the messages were received.
The next problem is more complex. It mimics the case of perfect communication lines and faulty processors and is commonly called the Byzantine generals problem. In this problem, n allied generals surround the enemy, but m of the generals are traitors (faulty processors). The traitors try to prevent the loyal generals from reaching agreement by giving them contradictory and false information. Unlike the two-army problem, the loyal generals can reach agreement. Agreement in this situation is defined as the generals exchanging the troop strengths of their armies. Each army sends a message to all other armies, telling them of their troop strengths. Then the generals send their finding on to all the other generals. A Byzantine general will send faulty information to all generals but those faulty signals will be picked up and the loyal generals will know the traitor is Byzantine. For more information on this, refer to the article on Byzantine Agreement. For algorithms to handle Byzantine faults refer to chapter 11 of Chow .
Before we explore some of the common solutions to system failures, we must learn the difference between synchronous and asynchronous systems. In a synchronous system the amount of time required for a message to be sent from one system to another has a known upper bound. Therefore, processor A sends a message to processor B and waits a given time for a response. If A does not receive a response within that time, it knows an error has occurred and it will send the message again. After a set number of resends, B is labeled as failed. In an asynchronous system, none of this is true. A processor will wait an infinite time for a response from the other processor. Many solutions for fault tolerance cannot be implemented in an asynchronous system. A processor experiencing slowdown is impossible to differentiate from a dead processor (see Impossibility of Consensus).
The most common approach to handling faults is redundancy. There are three types of redundancy: information redundancy, time redundancy, and physical redundancy. Information redundancy involves adding extra bits to allow recovery from distorted bits. An example is adding a Hamming code to data in order to recover from noise in the transmission line. With time redundancy an action is performed and if need be it is performed again. An aborted transaction can be redone without harm to the system. For more information refer to Chapter 3 of Tanenbaum . Time redundancy is the most frequently used solution for intermittent and transient faults. Physical redundancy involves adding more components to a system in case a component fails.
There are two types of physical redundancy: active replication and primary backup. The advantages and disadvantages of each must be weighed when determining which type of physical redundancy will be implemented.
First we will try to understand active replication by looking at Triple Modular Redundancy (TMR) in a circuit. Consider a circuit with devices A, B, and C linked in sequence. If all devices are working properly then the final result will be correct. But if one of the devices is faulty then the final result will probably be incorrect.
Now we will look at a circuit utilizing TMR. First devices A, B and C are replicated three times and then three voters are added after each stage of the circuit. Why are three voters required in this system?
The answer is that the voters are components too and might fail. Each voter has three inputs but only one output. The majority of inputs become the output of the voter. If two or all inputs are the same then that becomes the output. If all three inputs are different then the output is undefined.
Now let us see if the system will be fault tolerant when different components fail. First, consider a simple case where A1 fails. The voters will pass on the value of A2 and A3 since that value is in the majority. The voters pass on this value to B1, B2 and B3, which receive the same value as they would have if no fault had occurred. The system is fault tolerant. Now we will see what happens if one voter, V1, fails. This would mean that B1 would get the wrong input but B2 and B3 would have the correct input. So in the next series of voters, the error will be ruled out and the circuit will act like there was no failure. One should note that an error in V1 behaves exactly the same as an error in B1.
Now let A1 and A2 fail. Assuming they fail with the same result, this result will be passed through the voter so B1, B2 and B3 contain the wrong value. If A1 and A2 fail with different results then there will be three different inputs to the voter. The output of V1, V2 and V3 will be undefined. So the TMR system is not perfect but what is?
When considering active replication, it is important to consider how much replication is needed. The answer is related to the amount of fault tolerance wanted. A system is said to be k fault tolerant if faults in k processors can produce the same outputs as a fully functioning system. With fail silent faults, k + 1 processors are needed to achieve k fault tolerance. But with Byzantine failures, 2k + 1 processors are required. Refer to Tanenbaum  for an application of active replication in a distributed system.
The other type of physical redundancy is called primary backup. This type of fault tolerance involves one server, which is the primary server, and an unused backup server. If the primary fails, then the backup server becomes the primary. The client operating system but not the application programs will notice the switch of control. When compared to active replication, primary backup has one main advantage: simplicity. Messages are sent to the primary server only, as opposed to a whole group of servers. Second, this type of physical redundancy only requires two machines, a primary and a backup. Of course when a backup server becomes a primary server, a new backup is needed instantly. A large disadvantage to primary backup fault tolerance is that it handles Byzantine failures poorly. There is no check routine to make sure the primary is functioning correctly. Another disadvantage is that a primary backup system must always be in agreement so that the backup can take over the functions of primary. Also recovery from a primary failure is time consuming and complex.
An important decision involving the primary backup approach is when and how to switch to the backup server. One solution is for the backup server to send messages asking if the primary server is still functioning. If the server does not respond in a certain time then the backup will become the primary. This is not an ideal solution for asynchronous systems because the server could be running slowly resulting in repeated actions. Another solution is a hardware mechanism that the backup can use to reboot the primary. An alternate solution is to use a dual-ported disk shared by the primary and backup. The primary server will write the request to the disk then do the work and then write the results to the disk. This way if the primary fails at any time, the backup can read the disk and find out where the primary crashed. It can finish the job for the primary and there will not be any repeated messages. For further information on redundancy see Replicated Objects.
R. Chow and T. Johnson,
Distributed Operating Systems & Algorithms,
Addison Wesley Longman, 1997.
Distributed Operating Systems,
Prentice-Hall Inc, 1995.
Byzantine Agreement see Reliability of Distributed Systems
Fault-Tolerant Systems see Reliability of Distributed Systems
Failures see Reliability of Distributed Systems
Replication see Reliability of Distributed Systems
Redundancy see Reliability of Distributed Systems