Reliability of Distributed Systems
Erick
Redwine and JoAnne L. Holliday
Santa
Clara University
Making
a distributed system reliable is very important. The failure of a distributed system can result in anything from
easily repairable errors to catastrophic meltdowns. A reliable distributed system is designed to be as fault tolerant
as possible. Fault tolerance deals with
making the system function in the presence of faults (see Fault-Tolerant
Systems). Faults can occur in any of
the components of a distributed system.
This article gives a brief overview of the different types of faults in
a system and some of their solutions.
There
are three types of component faults: transient faults, intermittent faults and
permanent faults. A transient fault
occurs once and then disappears. If the
operation is repeated then the system will behave normally. An intermittent fault arises, then goes
away, rises again and so on. A common
cause of an intermittent fault is a loose contact on a connector. These faults are very annoying and hard to
fix because of their sporadic nature.
Lastly there are permanent faults caused by faulty components. The system will not work until the component
is replaced. Burnt out chips, software
bugs and processor failure (explored in Processor Faults) are all examples of
permanent faults.
A
special type of component is the processor, and it can fail in three ways,
fail-silent, Byzantine and slowdown.
All lead to a different kind of failure. A fail-silent, also known as fail-stop, fault occurs when a
processor stops functioning. It no
longer accepts input and outputs nothing, except perhaps to say it is no longer
functioning. Byzantine faults occur
when a faulty processor continues to run, giving wrong answers and maybe
working with other faulty processors to give the impression that they are
working correctly. Compared with
fail-silent faults, Byzantine faults are hard to diagnose and more difficult to
deal with. A slowdown fault occurs when
a certain processor executes very slowly.
That processor might be labeled as failed by the other processors. However the “failed” processor may return to
its normal speed and begin to issue orders, causing problems within the
distributed systems.
Network
failures keep processors from communicating with each other. We will look at the failures resulting in
total loss of communication along parts of the network. Two problems arise from this: one-way links
and network partitions. One-way links
cause problems similar to processor slowdown (see Processor Faults). For example, processor A can send messages
to processor B but cannot receive messages from B. Processor C can talk to both A and B. So each processor has a different idea of which processors have
failed. Processor A might think that B
has failed since it does not receive any messages from it. Processor C thinks both A and B are working
properly since it can send and receive messages from both.
Network
partitions occur when a line connecting two sections of a network fail. So processors A and B can communicate with
each other but cannot communicate with processors C and D and visa versa. Let us say processor A updates a file and
processor C updates the same file but in a different way. When the partition is fixed, the file will
not be consistent among all processors.
It is not clear to the processors how to make the data consistent again.
Agreement
To
be reliable, a distributed system must be able to reach agreement (see
Consensus and Byzantine Agreement), even if the system is faulty. This means all processors in a distributed
system must agree on some value. Two
cases must be explored to understand the problem of agreement. The first involves perfect processors but
faulty communication lines, and the second problem looks at Byzantine
processors. For very similar analogies,
refer to Tanenbaum [2].
Two
processors cannot be in agreement if the communication lines between them are
faulty. The perfect example of this is
the well-known, two-army problem illustrating the trouble of getting two
perfect processors with faulty communication between them to agree on one bit
of information. Two allied armies, led
by General Bonaparte and General Alexander, are encamped on opposite sides of
the enemy army. If both armies attack
together then they will emerge victorious, but if they attack at different
times they will be defeated. Their only
mode of communication is a messenger who is subject to capture by the enemy
army. Bonaparte decides it is time to
attack so he sends a message reading, “Let’s attack tomorrow at dawn.” Alexander receives the message and responds,
“Sounds good. See you at dawn.” The messenger manages to make it through the
enemy line and Bonaparte receives the message.
Both armies ready themselves for attack.
Later
that day, Bonaparte realizes that Alexander does not know if the messenger made
it back safely. So Bonaparte sends the
messenger back across enemy lines to deliver a message of confirmation to
Alexander. Upon receiving this message
of confirmation, Alexander sends a message back to tell Bonaparte that he
received the message and that the attack is still on. No matter how many messages are sent, neither General will attack
because both Generals are unsure if the messages were received.
The
next problem is more complex. It mimics
the case of perfect communication lines and faulty processors and is commonly
called the Byzantine generals problem.
In this problem, n allied generals surround the enemy, but m
of the generals are traitors (faulty processors). The traitors try to prevent the loyal generals from reaching
agreement by giving them contradictory and false information. Unlike the two-army problem, the loyal
generals can reach agreement. Agreement
in this situation is defined as the generals exchanging the troop strengths of
their armies. Each army sends a message
to all other armies, telling them of their troop strengths. Then the generals send their finding on to
all the other generals. A Byzantine
general will send faulty information to all generals but those faulty signals
will be picked up and the loyal generals will know the traitor is
Byzantine. For more information on
this, refer to the article on Byzantine Agreement. For algorithms to handle Byzantine faults refer to chapter 11 of
Chow [1].
Before
we explore some of the common solutions to system failures, we must learn the
difference between synchronous and asynchronous systems. In a synchronous system the amount of time
required for a message to be sent from one system to another has a known upper
bound. Therefore, processor A
sends a message to processor B and waits a given time for a
response. If A does not receive
a response within that time, it knows an error has occurred and it will send
the message again. After a set number
of resends, B is labeled as failed.
In an asynchronous system, none of this is true. A processor will wait an infinite time for a
response from the other processor. Many
solutions for fault tolerance cannot be implemented in an asynchronous
system. A processor experiencing
slowdown is impossible to differentiate from a dead processor (see
Impossibility of Consensus).
The
most common approach to handling faults is redundancy. There are three types of redundancy:
information redundancy, time redundancy, and physical redundancy. Information redundancy involves adding extra
bits to allow recovery from distorted bits.
An example is adding a Hamming code to data in order to recover from
noise in the transmission line. With
time redundancy an action is performed and if need be it is performed
again. An aborted transaction can be
redone without harm to the system. For
more information refer to Chapter 3 of Tanenbaum [2]. Time redundancy is the most frequently used solution for
intermittent and transient faults.
Physical redundancy involves adding more components to a system in case
a component fails.
There
are two types of physical redundancy: active replication and primary
backup. The advantages and
disadvantages of each must be weighed when determining which type of physical
redundancy will be implemented.
First
we will try to understand active replication by looking at Triple Modular
Redundancy (TMR) in a circuit. Consider
a circuit with devices A, B, and C linked in
sequence. If all devices are working
properly then the final result will be correct. But if one of the devices is faulty then the final result will
probably be incorrect.
![]()
Now
we will look at a circuit utilizing TMR.
First devices A, B and C are replicated three times
and then three voters are added after each stage of the circuit. Why are three voters required in this
system?

The
answer is that the voters are components too and might fail. Each voter has three inputs but only one
output. The majority of inputs become
the output of the voter. If two or all
inputs are the same then that becomes the output. If all three inputs are different then the output is
undefined.
Now
let us see if the system will be fault tolerant when different components
fail. First, consider a simple case
where A1 fails. The voters will
pass on the value of A2 and A3 since that value is in the
majority. The voters pass on this value
to B1, B2 and B3, which receive the same value as they
would have if no fault had occurred.
The system is fault tolerant.
Now we will see what happens if one voter, V1, fails. This would mean that B1 would get the
wrong input but B2 and B3 would have the correct input. So in the next series of voters, the error
will be ruled out and the circuit will act like there was no failure. One should note that an error in V1
behaves exactly the same as an error in B1.
Now
let A1 and A2 fail. Assuming
they fail with the same result, this result will be passed through the voter so
B1, B2 and B3 contain the wrong value. If A1 and A2 fail with
different results then there will be three different inputs to the voter. The output of V1, V2 and V3
will be undefined. So the TMR system is
not perfect but what is?
When
considering active replication, it is important to consider how much
replication is needed. The answer is
related to the amount of fault tolerance wanted. A system is said to be k fault tolerant if faults in k processors
can produce the same outputs as a fully functioning system. With fail silent faults, k + 1 processors
are needed to achieve k fault tolerance.
But with Byzantine failures, 2k + 1 processors are required. Refer to Tanenbaum [2] for an application of
active replication in a distributed system.
The
other type of physical redundancy is called primary backup. This type of fault tolerance involves one
server, which is the primary server, and an unused backup server. If the primary fails, then the backup server
becomes the primary. The client
operating system but not the application programs will notice the switch of
control. When compared to active
replication, primary backup has one main advantage: simplicity. Messages are sent to the primary server
only, as opposed to a whole group of servers.
Second, this type of physical redundancy only requires two machines, a
primary and a backup. Of course when a
backup server becomes a primary server, a new backup is needed instantly. A large disadvantage to primary backup fault
tolerance is that it handles Byzantine failures poorly. There is no check routine to make sure the
primary is functioning correctly.
Another disadvantage is that a primary backup system must always be in
agreement so that the backup can take over the functions of primary. Also
recovery from a primary failure is time consuming and complex.
An
important decision involving the primary backup approach is when and how to
switch to the backup server. One
solution is for the backup server to send messages asking if the primary server
is still functioning. If the server
does not respond in a certain time then the backup will become the primary. This is not an ideal solution for
asynchronous systems because the server could be running slowly resulting in
repeated actions. Another solution is a
hardware mechanism that the backup can use to reboot the primary. An alternate solution is to use a
dual-ported disk shared by the primary and backup. The primary server will write the request to the disk then do the
work and then write the results to the disk.
This way if the primary fails at any time, the backup can read the disk
and find out where the primary crashed.
It can finish the job for the primary and there will not be any repeated
messages. For further information on
redundancy see Replicated Objects.
[1]R.
Chow and T. Johnson,
Distributed
Operating Systems & Algorithms,
Addison
Wesley Longman, 1997.
[2]A.
Tanenbaum,
Distributed
Operating Systems,
Prentice-Hall
Inc, 1995.
Byzantine
Agreement see Reliability of Distributed Systems
Fault-Tolerant
Systems see Reliability of Distributed Systems
Failures
see Reliability of Distributed Systems
Replication
see Reliability of Distributed Systems
Redundancy
see Reliability of Distributed Systems