COEN 317 Distributed Computing
Syllabus Lecture Notes Instructor Project Examinations

Project

The project can be done in teams with up to four members. It consists in an implementation of LH*, a scalable distributed data structure based on linear hashing and a predecessor to cloud data structures like BigTable.

Linear Hashing

Linear Hashing (invented by Witold Litwin) is one of a family of extensible hashing algorithm. It is well adapted. An LH file stores records (consisting of Record ID - RID - and client data - typically a byte string - in buckets. You can imagine that each bucket consists of a linear linked links. Buckets are usually small, with maybe three records per bucket a good compromise between storage overhead and access performance.

LH allows the key (RID) based operations of insert, delete, update, and read. It also has a scan operation that goes through all records and scans the client data for a certain user-provided pattern.

LH*

You can find a concise description of the LH* algorithms in the first pages of LH*RS.

Distributed computing with Python

Pyhon 3.5 has introduced a new module, socketserver, that in conjunction with asynio makes writing this type of code much easier.

from socketserver import BaseRequestHandler, TCPServer
class EchoHandler(BaseRequestHandler):
     def handle(self):
           print('Got connection from', self.client_address)
           while True:
                msg = self.request.recv(8192)
                if not msg:
                     break
               self.request.send(msg)

if __name-- == '__main__':
     serv = TCPServer(('',20000), EchoHandler)
     serv.serve_forever()

In this code, we define a special handler class that implements a handle() method for serving client connections. The request attribute is the underlying client socket and client_address has the client address. To test it, we need to run a separate Python process connecting to it:

from socket import socket, AF_INET, SOCK_STREAM
s = socket(AF_INT, SOCK_STREAM)
s.connect(('localhost', 20000)
s.send(b'Hello')
s.recv(8192)

Here is a simple example of a server, a bucket and of a client that allow the client to access a dictionary structure on a bucket. The coordinator on creation prints out the IP address and the port number on which it listens. The buckets are created with the port address of the coordinator and receive a bucket number. Clients are equally created with this port address. The client talks directly to a bucket using a simple language consisting of commands of a verb (INSERT / QUERY), a key (an integer) and possibly a value.

In order to send Image Adjustment Messages to clients, clients need to also implement a listener as buckets and coordinators do.

Project Implementation

The role of the coordinator

The coordinator is usually collocated with bucket 0. Any new client needs to know its address. We can equip the coordinator with a second dictionary that contains the bucket number as key and their address as value. When a new client accesses the coordinator, it can download the contents of the dictionary. Since the coordinator creates new buckets, it will have the IP addresses of all buckets in the system.

The role of the bucket

Buckets have a dictionary to allow the key based operations of insert, update, read, and delete. They also have an auxiliary dictionary with the addresses of all buckets that have been derived from them. This allows forwarding of requests to the bucket that actually holds the key.

Client addressing

Each client needs to address a bucket by the bucket number, without the user having to enter the IP address and port number. In order to allow this, we need an address look-up dictionary at the client that associates the number of all known buckets with their IP address. The interface in the example client is not acceptable, therefore.

Project Scope

You need to present a complete implementation of the basic LH* algorithm.

First, you will start the coordinator. You will start a number of initially empty buckets who will send their IP-addresses to the coordinator. (If you want, you can generate them automatically, but they need to run in their own terminal, which requires OS-dependent code.)

You will then start a client with only the IP address of the coordinator. The coordinator will also be Bucket 0. The client will perform a number of insert operations. For the purpose of this project, we assume that buckets will overflow whenever there are more than 3 records. (This is of course a ridiculously low number). When a bucket overflows, it sends a message to the coordinator. The coordinator then splits a bucket according to the current split pointer.

A bucket that receives a split command from the coordinator, also receives the address of a new bucket. It then rehashes its records, leading to a number of insert operations at the new bucket and an equal number of deletions from the local dictionary.

An incomplete implementation of file state

This is the file state proto-implementation done in class.
2016 Thomas Schwarz, S.J., COEN, SCU SCU COEN COEN317 T. Schwarz
These documents are not intended for dissemination beyond SCU.        CAVEAT LECTOR