Monday, 24 May 2010

This post is best displayed as a .pdf file which can be found HERE.



Distributed Systems


A complete revision summary by James Bedford.


This summary now contains everything.


Last updated:


11:38pm - 25/05/10.




Notice


This document can look tremendously dull, or even daunting. This isn't really a summary at all. It's intended to be used as a reference, and once complete should contain absolutely everything required. Use of Command-F (CTRL-F) is heavily recommended. ;)



Please feel free to discuss any of the content of this document with me or suggest any improvements. You can find me on the Manchester Computer Science IRC, on Skype as Starscent, and my email is bedforj8@cs.man.ac.uk.


Many thanks.



Progress Notice


I'd appreciate any feedback, suggestions, corrections etc. to improve this document. You can get in touch with me via the contact details provided above. I'm going to keep reading over this document and making alterations. Again, many thanks.



I believe that all the required content is now here.



Improvement notes (may or may not get round to):




  • Watch the Google video and add notes to this document.

  • Read 'Web Services are not Distributed Objects" and add notes to this document.

  • Have a look at the example code provided in Java for RMI.

  • Find answers to questions and add to notes.


Questions:




  • Does Java allow maybe and at-least-once invocation semantics?



Legal Notice


The notes in this document are heavily based on a lecture series by Chris Kirkham and Rizos Sakellariou at The University of Manchester. This document is intended to act as a revision aid to students of the course, and not as a copyright infringement. See also the 'References' section at the bottom of this document.









Introduction





  • System - "a complex whole; a set of connected parts; an organised assembly of resources and procedures united and regulated by interaction or interdependence to accomplish a set of specific functions."

  • A distributed system is a series of independent computers, connected together to appear to users as a single, coherent system.

    • A system in which hardware and software components of networked computers communicate and coordinate by passing messages.



  • "You know you have a distributed system when the crash of a computer you've never heard of stops you from getting any work done." - Leslie Lamport.

  • Distributed systems:

    • Operate concurrently.

      1. Non-determinism.

      2. Race conditions.

      3. Deadlocks.



    • Are physically distributed.

    • Are lined via a network.

      1. Which allow messages to be sent an received.



    • Independent clocks.

      1. There is no global clock!

      2. There is no global state!

      3. Hard to synchronise.

      4. Both systems need to reach the same point at the same time. The different clocks need to be taken into account.



    • Units may fail independently.

      1. Network faults.

      2. System failures.



    • The shortcomings of distributed systems have to be worked around in the applications that are written.




  • Why do we have distributed systems?




  • Data is distributed across various systems.

    • Systems are distributed (because the users are distributed) - there are many reasons why systems would want to work together, such as communication, file transfer, gaming, peer-to-peer networks, oh and of course… the web.



  • Increase performance.

    • Increase computing power using parallel, grid and cloud computing.





Evolution of distributed systems





  • Early distributed systems were involved with airline reservation systems and banking.

  • Major developments came with the improvements in network technology and the WWW in the early 90s.



The 8 fallacies of distributing computing





  • Developed by Peter Deutsch at SUN (1994).

  • The fallacies (the inverse is true):

    • The network is reliable.

      1. The hardware may fail.

      2. The best way to deal with errors is to use redundancy, an alternative method. The more alternative methods, the smaller the chance of a serious problem.



    • Latency is zero.

      1. At the speed of light, it would take 30 milliseconds to send a ping to the USA and back.

      2. Allow for delays.

      3. Allow for messages received out of order.



    • Bandwidth is infinite.

      1. Compress messages.

      2. Reduced the amount of sent messages.



    • The network is secure.

      1. Build in security measures.



    • Topology doesn't change.

      1. Don't rely on specific endpoints for communications.



    • There is one administrator.

      1. More difficult to locate problems, changes, with multiple administrators.

      2. Humans are error prone.



    • Transport cost is zero.

      1. Time costs to go down the network layers.

      2. Line rental costs, ISP costs.



    • The network is homogeneous.

      1. Technology standards are required and should be adopted by distributed applications.

      2. A fallacy added by James Gosling in 1994.







Challenges and Goals in Distributed Systems





  • Heterogeneity

    • Definition: the property of consisting of different parts.

      • Such as hardware, operating systems, programming languages.



    • Middleware

      • Definition: a software layer that provides a programming abstraction to mask the differences of the underlying software.





  • Openness

    • Open standards and open interfaces.

    • For example, web services are specified by the W3C standards.



  • Security

    • Confidentiality

    • Integration (protection against alteration or corruption of information).

    • Availability - protection of access to resources.

    • Denial of service attack - when a service is bombarded by a large number of requests in order to lower the performance of a system.



  • Scalability

    • The ability to expand the distributed system.

    • As the system grows in size, the performance should also increase.

      • Increased complexity could lead to a reduction in performance. This has to be worked around.





  • Failure Handling

    • We need to:

      • Detect failures.

        • Checksums.



      • Mask failures.

        • Resubmitting messages.



      • Tolerate failures.

        • e.g. Don't continuously try to connect to a server.



      • Recover from failures.

        • Roll back systems.

        • Data and state recovery.



      • Build redundancy.

        • If the chance of failure in a single system is p, then the chance of failure in two systems is p^2, three systems is p^3… and so on.







  • Concurrency

    • Access to the same source can be handled in a consistent manner.



  • Transparency

    • Hide the underlying platform.

    • Access transparency.

      • Resources are accessed in a single, uniform way.



    • Location transparency.

      • Users need not be aware of where a resource is physically located.



    • Concurrency transparency.

      • See the above concurrency section.



    • Replication transparency.

      • Even if a resource is replicated, it should appear to the user as a single resource.



    • Failure transparency.

      • The user shouldn't be are of faults happening within the system.







Architectures of Distributed Systems





  • Design Requirements:

    • Performance

      • Responsiveness

      • Throughput

      • Load balancing



    • Quality of Service

    • Caching and Replication

    • Dependability

      • Correctness

      • Security

      • Fault-tolerance





  • Tightly coupled

    • Highly integrated machines that are programmed and networked together to give the appearance of a single computer.

    • Key example - parallel computers.

    • Distributed shared memory (DSM) provides the illusion of one single, shared memory unit.

      • There can be delays on the data access using shared memory, as getting the data from the physical memory can be slow.

        • Data could be replicated in order to speed up the run times.

        • However, this means that the duplicated data has to be updated across all locations and more space will be required.







  • Loosely coupled

    • The applications and components don't share anything.

    • Either:

      • Client-Server

        • Asymmetrical architecture.

          • The client (slave) makes a request and then waits for the server (master) to respond.



        • Can be either stateless or stageful.

        • e.g. web browser (client) and web server.

        • A web server can send messages to another server, in which case, it can act as a client.

          • e.g. Google as a search engine.



        • A server can only handle a certain number of users/processes at once.

        • Applet code can be retrieved from the server and then run locally on the machine, rather than on the server.



      • Peer-to-Peer.

        • e.g. iPlayer, Napster, Torrent.

        • No central server.

        • All computers have the same rights, so they are symmetrical.

        • Information is distributed across many machines - there's no central server.





    • Based on the logical organisation of the components of distributed systems.

    • Architectural structures:

      • Layered Architecture

        • Components have a specified way of communicating.

        • Very common in networking (OSI 7-Layer Model).

        • Applications, services.

        • Middleware.

          • An abstraction to make the distributed system appear homogeneous.

          • Takes into account different programming languages, different hardware, different operating systems.

          • Examples of middleware include COBRA, Java RMI.

          • The end to end argument states that communication control protocols should be defined at the end points of the communication.

            • The implication of this, is that from the point of view of the application, it's not so easy to abstract everything.





        • Operating system.

        • Computer and network hardware.



      • Object-Based Architecture

        • Objects communicate via methods.

        • Objects can be remote and can call remote methods.



      • Event-Based Architectural Style

        • An event bus connects every component together.

        • The event, data bus is shared across every component.

        • A component publishes information to the event bus.

        • This style is independent to processes or objects.

        • Known as a publish-subscribe system.



      • Shared-Data Space Architecture

        • A shared, persistent data space (for example a database) stores all the data that is required by a component.

        • Components publish and retrieve data to the shared data space.

        • Components can access the data space whenever they like - they don't have to trigger events, as in the previous architectural style. Therefore, shared-data space architecture is decouples in space and time.










Remote Procedure Call (RPC)





  • Distributed systems are built on send and receive messages, which are distributed systems equivalent of low-level constructs.

  • RPC is where the client calls a process on the server to execute the code or procedure that provides the service.

  • Arguments are sent in the message to the server.

    • But there's no point sending pointers or references that point to local variables.

    • An alternative is to send a copy of the variable, and then update the local variable when the result comes back.



  • Results can be received in the reply from the server to the client.

  • First appeared in 1984 (Birrell & Nelson).

    • Looking back, RPC seems like an obvious development that took a while to develop - you have to try and put yourself in context.



  • Provides access transparency

    • Using an RPC should appear to be the same as using a local method.



  • RPC adopts the client-server architecture.

    • A client makes the remote procedure request, the procedure is run on the server, and the client waits for the result message with the return from the call.



  • RPC is classed as middleware, sitting on top of a request reply protocol and an external data representation, which sits on top of the operating system. RPC sits below the application.

  • Client stub (also known as a skeleton) is used as an abstraction to provide the service.

    • Stubs put the arguments in a message.

    • Send the message to the server.

    • Wait for a reply message.

    • Unpack the result.

    • Return to the application call.



  • Server stub (also known as a skeleton) is also used at the server end to provide an abstraction of the RPC service at the server-side.

    • Receives messages.

    • Unpacks the message.

    • Runs the procedure.

    • Packs the result.

    • Sends the response message.



  • The steps of a remote procedure call:

    • Client call to procedure.

    • Stub builds the message.

    • Message is sent across the network (see Networks course summary).

    • Server OS hands message to server stub.

    • Stub unpacks the message.

    • Stub makes local call to it's procedure.

    • The procedure runs and returns a result to the server stub.

    • The server stub composes the reply message containing this answer.

    • The reply message is sent across the network to the client.

    • The client OS passes the reply to the client stub.

    • The client stub unpacks and returns the result.



  • Packing parameters into a message is known as parameter marshalling.

  • The inverse operation is unmarshalling.

  • These operations need to take into account the different problems associated with the heteronomously of the network.

    • Different ways of representing data types.

      • Little endian vs. big endian.

      • Allows character sets.

      • Size issues (32bit vs. 64bit).





  • The stubs can be generated once the specification of the procedure is known, but this generation needs to take into account the fact that different systems are heteronomous.

  • Stubs are generated with an Interface Definition Language (IDL)

    • COBRA (Common Object Request Broker Architecture)

    • Sun (NFS)

    • DCE (Distributed Computing Environment)



  • An interface definition file provides the services that the server is going to provide, and the client is going to use.

  • The IDL Compiler takes into account the interface definition file and compiles a program for that system. Therefore, different systems require different versions of the IDL compiler.

    • The interface definition file holds the syntax of the required interface, with extra information about the reference parameters, such as 'readonly'.



  • A Uuidgen generates a unique identifier, which is used to protect against errors caused by updating one and not the other.

  • Java has its own version of RPC, known as Remote Method Invocation (RMI).

  • Marshalling is simpler in Java, as Java will always be communicating with another Java application. The Java Virtual Machine provides the necessary level of abstraction. Because objects are serialisable (you can produce a string representation of the object), you can send them across the networks.

  • The key difference between RMI and RPC is that you can use references to remote objects.

    • In order to have remote method invocation, you need remote objects!



  • What are remote object references?

    • Remote object references must be available across the distributed system.

    • They consist of:

      • 32 bit internet address.

        • If the objects must stay where they're created, otherwise they won't be accessible.



      • 32 bit port number.

        • To identity the instance of the application being run.



      • 32 bit time stamp.

        • So as a reference is not reused.



      • 32 bit object number.

        • The reference mustn't be reused for representing some other object.



      • The interface of the object.

        • i.e. Class information.

        • Classes can be dynamically loaded, if it hasn't been loaded yet (Java dynamically loads classes).







  • As with RPC, the client has a stub (known as a proxy) for each remote class instance.

    • There is a proxy for every remote object a process can reference.

    • The class of the proxy contains a method for each method in the remote interface which marshals arguments and unmarshalls results for that method.

    • The stub communicates with the dispatcher for the class, which forwards to the skeleton, which implements unmarshalling.



  • A servant is an instance of a class which is referenced remotely.

    • It exists within the server, which created it and the remote reference to it. It does the execution of the method.



  • In RMI, there's a communication module at the server and the client.

    • For communications between the two ends.



  • In RMI, there's a remote reference module at the server and the client.

    • Responsible for translating between remote references and local ones.

    • Holds an up-to-date object table:

      • An entry for each remote object local to that process.

      • An entry for each local proxy (the stub), in order to redirect method invocations.





  • The server has a dispatcher and a skeleton for each class of remote object.

    • The dispatcher receives the incoming message, parses it, and uses its method info to pass it to the right method in the skeleton.

    • The skeleton implements the methods of the remote interface to unmarshall and marshall the method calls and results.

      • A result may be an exception, that's returned to the caller - the proxy as to determine whether the response contains a result or an exception and deal with it accordingly.



    • The dispatcher and the stub (skeleton) both work from the remote interface to get the same method information.



  • Garbage collection of remote objects.

    • Because references to objects can be remote, garbage collection is now much more difficult.

      • An object might not be referred to locally, but remotely.

      • An object may cease at a given period of time to be referred to locally, but may still be referred to remotely.



    • Local garbage collection must must check references in the remote reference module.

      • The server must have a complete up-to-date set of clients that have non-local references to a remote object.

      • The client must notify the server when it removes a reference (a proxy).

      • This scheme has an impact on the performance.





  • Generating classes or proxies, dispatchers and skeletons is done (since JDK1.5) using reflection.

    • Reflection allows a Java program to inspect and manipulate itself.

      • Java can "ask questions" about it's own classes.



    • It can get an object representing a class with a given name, for example "String".

    • It can then find out about this class's constructors, properties, methods, and their arguments.

    • It can create objects of that class.

    • Java has a class known as "Class".

      • This class is used to find out about other classes.

      • "Class" has a class method known as forname, which allows you to specify a class for "Class" to look at.

      • "Class" has an instance method known as getDeclaredConstructors() which returns an array of type "Constructors", which represent the constructor methods of the class specified in the instance of Class.

      • You can use this 'Class' mechanism to find out about proxies.





  • Rmiregisrty

    • How the server makes a remote object available to clients.

      • "How you get the whole thing off the ground."



    • The rmiregistry stores information about the remote objects that can be accessed from the server.

    • The client still has to know the server machine name and the port the registry is on.



  • Invocation semantics

    • This applies to both RPC and RMI

    • In a distributed system, we have the possibility of errors, and these must always be allowed for.

    • Failures:

      • Requests can be lost.

      • Replies can be lost.

      • Server could fail.

      • Client could fail.



    • Request behaviour options to account for these problems…

      • Don't acknowledge at all?

        • Fail after a timeout expires waiting for an acknowledgement?

        • Try resending the request?

          • After several request attempts - fail?







    • Reply Behaviour

      • If the server doesn't know it's a duplicate, it just processes it.

      • It could filter duplicates if it remembers the requests, and not bother replying twice or more times.

        • Or maybe the reply never arrived? So send it again.





    • Three types of invocation semantics:

      • "Maybe"

        • Retransmit request message = no.

        • Duplicate filtering = not applicable.

        • Re-execute procedure or retransmit reply? = not applicable.

        • Less effort, but not always satisfactory for the application.



      • "At-least-once"

        • Retransmit request message = yes.

        • Duplicate filtering = no

        • Re-execute procedure or retransmit reply? = re-execute procedure.

        • This can occur more than once (and have no different effects or different effects).

        • Sun RPC does this.



      • "At-most-once"

        • Retransmit request message = yes.

        • Duplicate filtering = yes (duplicate filtering is the discarding of multiple copies of the same request, and distinguishes this type of invocation semantics).

        • Re-execute procedure or retransmit reply? = retransmit reply.

        • Requires extra effort.

        • The operation won't be executed more than once.

        • Useful for banking situations.

        • Java RMI provides this.







  • You can build distributed systems using primitive messages.

  • Distributed event-based systems show another possibility.

    • Similar to GUIs in the sense that if a user clicks a button then an event occurs. This event is then dealt with by an event handler.

    • Events are notified to those objects which subscribe to (i.e. register interest in) an object.

      • The system makes a note of all the objects interested in an event, and notifies all the objects that are interested when the event occurs.



    • Events are published.

    • Notifications are sent asynchronously (there's no standard pattern to the notification) to all subscribers.

    • At the client end, the message is interpreted as an event.






Name and Directory Servers



How do you bind a RPC client to a server?





  • One way is to hardwire the machine name and the port number used by the sever into a client. This is fine for something like a file server, but we want to allow for more freedom.

  • Instead, a directory service is used to locate the machine with an unknown name.

    • This works in a similar way to the yellow pages! If you want a plumber, you look up a plumber and get the telephone number.

    • On the web, if you want a particular service, you go and look up the server. However, you still have to know where the directory service is!



  • We have to decide which port to use. Different ports are connected to different processes.

    • A 'daemon' is used to tell you what port to use.

    • The Distributed Computing Environment (DCE) set up this solution.

      • The server machine communicates with its local daemon when it starts up to tell it what port number it's going to use, as it may get a different port each time it starts up.

      • The server also registers itself with the directory server.

      • The client has to ask the directory server for information about the service, which returns the information.

      • The client then ask's the server's daemon for the port information.

      • The client can then do PRC with the server.



    • This is quite a long process. This is how client-sever interactions are designed to work. However, once you know where the server is, you don't have to do this process again. However, the server address and port number may change and have to be re-looked-up again.



  • "A pure name contains no information about the item they are associated with." - Richard Needham

  • Other names may tell you information about the object they refer to. There's a spectrum of how much information can be given within a name.

  • If the name doesn't tell you where it is, then you need to resolve it (find out the information about the object with a given name, such as attributes).

  • Names have a namespace, a context within which a name has a meaning. If you look up a name in the wrong context, then you will get totally meaningless information.

  • Names can be composed to make larger names. For example, a URL is a composite name. A URL can have a protocol part, an IP number, a port number, and a path name. A piece of the name can be looked up.

  • Uniform Resource Identifier (URI)

    • Used for a resource on the Web. (Which always start with a scheme, for example http or ftp).



  • Uniform Resource Locators (URL)

    • A subset of URIs which give a location for a resource.



  • Uniform Resource Names

    • URIs which are not URLs, for example urn:ISBN:0-4829-47283-8 identifies a book.



  • Namepsaces can be:

    • Flat (e.g. number or string with no structure)

    • Structured hierarchically

      • e.g. a UNIX filename.

      • Each part of a name in a hierarchy is resolved in a different context.

      • The context is changed for each section of the name when being resolved.





  • Domain Name System (DNS)

    • The mechanism used for naming computers across the internet.

    • Replication and caching (keeping hold of information to save having to obtain it again) are required to implement this system.

    • Cache consistency is not very important, so long as it's updated within a reasonable amount of time.

    • Large amounts of data are partitioned (broken up) in DNS.



  • A hierarchy of name-servers are used:

    • At the global layer "com, co.uk, edu, gov, net".

      • Worldwide scope.

      • Few nodes.

      • Seconds for a lookup response.

      • Lazy update propagation.

      • Many replicas.

      • Client-side caching enabled.



    • At the administrational layer ".sun, .ieee".

      • Organisational scope.

      • Many nodes.

      • Milliseconds for a lookup response.

      • Immediate update propagation.

      • No (or few) replicas.

      • Client-side caching enabled.



    • At the managerial layer.

      • Departmental scope.

      • Vast numbers of nodes.

      • Immediate lookup response.

      • Immediate update propagation.

      • No replicas.

      • Sometimes has client-side caching enabled.





  • Each client has a local name resolver. In order to achieve this, it has to communicate with name servers. A client can look up a name either iteratively or recursively.

    • #<xx> means the address of the name server for handling names in the node <xx>.

    • Iterative name lookup:

      • The client does most of the work.

        • The client's name resolver decomposes the name into its component parts and passes the list of names to the root name server.

        • The root name server passes back the address of the first part of the name in the list.

        • The client's name resolver can now look up the next piece of the name from the server addressed by the first part of the name, which was returned from step 2.

        • The next name server returns the address of the server for the next part of the name.

        • The client's name resolver looked up the next part of the name from the next name server (address provided in step 4).

        • This occurs until the address for the name required is returned to the client's name resolver.





    • Recursive name lookup:

      • The client does much less work.

        • The client's name resolver sends the name we want resolving.

        • The root server resolves the first part of the name and sends a message to the next server (addressed by the first part of the name), requesting address of the rest of the name.

        • This then occurs recursively, with each name server resolving a section of the name and asking the next server for the address, until the full name has been resolved at the last name server.

        • The address is then passed back through each of the name servers to the root server.

        • The root server then returns the result address to the client.





    • Iterative vs. Recursive Name Resolution

      • Recursive resolution puts more burden on the name server. For this reason, global layers support only iterative resolution

      • However, recursive resolution makes caching more effective (as each server can find out things with each request), and communication costs may be lower (as the name servers physical distances could be closer, so sending a message back and forth from the client may not be as affective).



    • DNS data can be divided into zones.

      • Each zone contains attribute data for a domain, but not for a subdomain.

      • Each zone has to authoritative name servers for a zone, so each machine has two servers that it can ask.

      • Zones have management data, for example the lifetime of cached items (how long you should remember an item for). If the zone is high up in the DNS hierarchy, the timeout should be high, if the zone is lower down, then the timeout should be low.

      • Tables hold the DNS details.



    • Name Server vs. Directory Server

      • A name server takes a name, and returns one or more attributes of the named object.

        • If it were a telephone directory, you give a name, and you get a telephone number.



      • A directory server takes attribute values, and returns sets of attributes of objects with these attribute values.

        • If it were a yellow pages book, you would look up a plumber, but you wouldn't care which plumber you got.



      • X.500 invented the directory service.

        • You have a directory information base (DIB).

        • Clients are known as directory user agents (DSAs), and a directory information tree (DIT) has to be built up.

        • X.500 is no longer widely used.

        • Lightweight Directory Access Protocol (LDAP) was invented for use with X.500.

          • LDAP allows for a simpler directory lookup than X.500.

          • LDAP is lightweight (as the name suggests!), more agile, and easier to use. It is now more common, and is an interface for X.500.

          • Microsoft's Active Directory Service provides an LDAP interface.

          • It desired for LDAP to allow the use of wildcards, which allow you to make SQL-like queries "give me all the name servers that match a certain value."












Time & Clocks



We assume that there are benefits from having different systems in a network and being able to agree the answers to time-related questions, such as, "did this event happen before that event?"



Synchronisation



Two different parts:





  1. Synchronising all the clocks to the same time.

    • UTC (Coordinated Universal Time)

      • International Atomic Time is derived from clocks with "atomic oscillators" with a drift rate of about 1 part in 10^13.

      • Astronomical time is derived from the stars, sun etc, but the slowing down of the Earth leads to diversion.

      • UTC is based on atomic time, but with the occasional insertion of leap seconds to keep it in step with astronomical time.

      • UTC is the universal time standard that is transmitted via radio and GPS satellites.

        • GPS receivers are thus accurate to about 1 microsecond.

        • Receivers from terrestrial stations are accurate to a few milliseconds.

        • However, in most computers, a cheap crystal clock is used for keeping track of the time. There's a drift in these clocks of typically 1 part in 10^6.



      • Cristian's clock synchronisation

        • Note the spelling of 'Cristian'.

        • It's obvious that know that a message is received after it has been sent, but it's unknown how long after it has been sent that it is received.

        • Cristian's algorithm states that with a 'time server' clients can set their own clocks by measuring the round-trip time (RTT) to process a request to the time server, and then adding half that time to the time given in the time server's reply.

        • The assumption is that the transmission in both directions (sending and receiving) is always the same

          1. This is not necessarily the case.

          2. It is more likely to be the case if the round-trip time is short (a long round trip time would suggest a delay somewhere along the transmission, which could be on the sending side or the receiving side).





      • The Berkeley Algorithm

        • There is an allocated master computer system within the distributed system, and the rest of the computer systems are slaves.

        • The process:

          1. The master system (which does the work making the times correct) polls the slave systems.

          2. The slaves reply with the current time they believe it to be.

          3. The master measures the round trip times and averages the times reported by the slaves.

            • The master also takes into account its own time.

            • The master can reject outliers.

            • The master can reject times with an excessive round trip time.



          4. The master tells all the machines what time to set the clocks to.

            • Rather than just sending a time, a change is sent (e.g. "advance your clock by one second" - because it doesn't matter how long it takes that message to get back.





        • This is a good algorithm to use if there is no GPS machine.

        • The master is just any other machine, but it performs the task of sorting the time. If the master fails then you can have a election (using a distributed election algorithm) to assign a new master.



      • Cristian's and Berkeley's algorithm are both designed to work for intranets.

      • Network Time Protocol (NTP)

        • A result of considering the internet as a whole.

        • The issue is to try and synchronise time on a much larger scale.

        • Primary clocks with a UTC clock are identified, and secondary clocks are synchronised with the primary clocks.

        • A primary machine can rename itself as a secondary machine if it becomes unreliable.

        • Three ways of performing synchronisation:

          1. Multicast mode.

            • Simple.

            • Time is broadcast to all the machines.

            • On a LAN, the round trip time is so small that it doesn't even need to be taken into account.



          2. Procedure call mode.

            • Effectively Cristian's algorithm.

            • The server accepts requests and replies with the time.

            • It is used when multicast is not supported, or a higher degree of accuracy is required.



          3. Symmetric mode.

            • Used to get the highest degree of accuracy.

            • Time is attached to messages (both the time the message is created, and the time the last message was received is attached).

            • If not many messages are being sent, then empty messages with the time can be sent to maintain synchronisation anyway.

            • Both sides of the communication then know about these times.





        • Time of receiving a message = the time the message was sent + the offset of the clocks.

          1. The offset is between two bounds. If you know O and D, then you can work out an estimation of the offset, and an estimate of how accurately you know what the offset is.



        • NTP servers filter successive (o,d) values to identify the best (lowest d value), as well as measuring the reliability of the other servers. Each server will interact with several others, so that they notice which servers are the most reliable.

          1. Experiments around the world have proven that clocks can be synchronised to 10s of milliseconds over internet paths.









  2. Clock drift

    • Maintaining the synchronisation once synchronisation is achieved.

    • Logical Time

      • Invented by Lesley Lamport

      • Lamport pointed out that the best use of time is in ordering events.

      • Rather than have a very high precision number, why not just count? As only the order of events are the most important thing.

      • In a single processor, you can order events using the local clock, or even a simple counter.

      • If two events happen in the same process, they occur in the order given by that process.

        • If a message is sent from 1 process to another, the event of sending a message happens before the event of receiving a message.

        • The previous bullet point effectively defines a partial ordering of events.

          1. This partial ordering is known as the happens-before relationship.

          2. Lamport's logical clock is about reflecting this relationship.





      • A logical clock is a monotonically increasing software counter.

        • Each process keeps its own logical clock, and uses it to timestamp events.

        • Each message sent contains the current time stamp.

        • When a message is received, the logical clock of the receiver is set to the maximum of the local logical clock and the timestamp received, and then add one to it!

        • If event e1 happens before event e2, then the logical clock on e1 will be less than the logical clock of e2.

          1. This integer will reflect the happens-before relationship.

          2. However, the inverse is not true.

          3. You can't deduce ordering from the timestamps - they may just coincidently relate to each other, meaning that we only have a partial ordering.

          4. There is a method of creating a total ordering of logical clocks.

            • Logical clocks that are the same can be resolved using process identifiers.

            • This can be used to control entry to critical sections, where only one process should be allowed to enter the section at a given time.









    • Vector Clocks

      • A development of Lamport's system.

      • A vector clock is an array of n integers.

      • Each process maintains its own vector.

      • The clocks are advanced when a message is received.

      • Messages between processes contain the vector clock of the sender as a timestamp.

        • Each clock starts with all the integers being zero.

        • Events in process i increment the i'th element in its vector clock.

        • The other elements of the vector are ignored until a message relating to that process is received.



      • When a process i receives a timestamp t in a message, it resets each element in its clock.

        • V [j] = max (V[j], t[j]) for j = 1... n.

        • This operation is known as a merge.

        • Note: each process that we're tracking (1..n) is updated.



      • Comparing vector clocks:

        • V1 = V2 iff V1[j] = V2[j] for all j.

        • V1 <= V2 iff V1[j] <= V2[j] for all j.

        • V1 < V2 iff V1 <= V2 & V1 != V2.

          1. (Less than or equal but not equal).





      • Now if event e1 happened-before event e2, V(e1) < V(e2)

        • This is something that couldn't be said for certain with Lamport's logical clock - you could pretend that you knew, but you couldn't know for certain.



      • Advantages

        • It is no longer possible end up with an arbitrary order of events when an ordering of events is not required.

          1. When c and e are completely unrelated events, with times in separate indexes in the vector...

          2. Neither V(c) < V(e) nor V(e) < V(c).

          3. Therefore it avoids drawing conclusions that are arbitrary and wrong.





      • Disadvantages

        • There is a cost in the extra amount of data required to timestamp is more information to be transmitted.

        • It is also required to know how many processes there are. The more processes, the longer the timestamp message.










Ordered Multicasting





  • Ordered multicasting is an application of logical clocks.

  • One of the protocols used to synchronise time is to just broadcast the time.

  • Multicast

    • One process/machine sends a message to a whole group of other processes/machine.

    • It is required to implement this as a primitive in order to avoid looping through single message sends.

    • A broadcast mechanism is required.

      • This mechanism should be built on top of this primitive system in order to create an order multicast.

      • Ordered multicast

        • Adding ordering information to multicast messages.

        • For example, a banking system.

          • The databases can be duplicated in order to have local versions of the databases for quick access.

          • However, when a database needs to be updated, all the duplicates need to be updated.

          • This leads to an ordering constraint.

            • If two users want to make an update, and one user is closer to one database, and the other user is closer to the other database, then when the two users multicast to both databases, the messages will be received at each of the databases in different orders.

            • If update one adds £100 to the balance, and update two adds a 1% interest, then different answers will occur depending on the order in which the messages will be received by the different databases.



          • There are many other examples of where ordering is important (debit and credit could lead to an overdrawn charge or not).

          • The problem is that the two databases could hold different information.



        • The messages could be ordered.

          • This can work for most of the time, but you can't solve the problem this way because you can't predict the arrival of messages that aren't yet known about! How can you order a message when it could come after a message that isn't yet known?

          • The message with the earliest timestamp, and thus positioned at the front of the messages queue may be sent after a message that hasn't arrived yet.

          • Just because the message has the earliest time known so far, it doesn't mean that it is in fact the earliest.



        • Totally ordered multicast

          • Used in order to get the same answers across all components of the distributed system.

            • i.e. All the messages are delivered to the servers in the same order.

            • This can be achieved with Lamport's logical clocks.

              1. Groups of processes multicast to each other.

              2. Each message is timestamped with the logical time of the sender.

              3. It is assumed that the multicast also goes to the sender (a copy is taken for the local system to keep when it performs a multicast operation).

              4. It is assumed that messages from the same sender are received in the same order that they were sent.

              5. It is also assumed that no messages are lost.





          • When a message is received, it is put into a local queue and ordered by its timestamp.

          • Acknowledgements (ACKs) are sent to other processes, to acknowledge the fact that this message has been received.

          • A process can only "act on" a queued message when it is at the head of the queue, and has been ACK'ed by every other process - only then can the local system know that there are no earlier messages on the way.

            • The contents of the queue and all the processes end up being the same across all systems within the distributed system.

            • Using the total ordering, we have all the messages stuck in queues at each receiver, being ordered, and no message is being acted on until each receiver has ACK'ed them.

            • This mechanism means that it stops local computer systems from acting on a message when there is an earlier message on its way that it's not yet aware of, because it won't have received a proper set of acknowledgements yet.



          • This method greatly reduces any advantage of replication because if one of the servers crashes, then all the acknowledgement messages are lost, and everything will stop.

            • One of the advantages of replicating the servers in the first place was that you had a backup if a server does go down!

            • However, this system is only required for updates - not requests.

            • Keeping replicas consistent by executing the same operations in the same order is a general technique known as state machine replication.





        • Causally-ordered multicasting

          • Instead of totally ordering messages, so that we we have identical effects on the two servers, we allow the processing of causally unrelated messages to be processed in different orders on different machines.

          • This is useful on a bulletin board.

            • Messages should be delivered in the correct order if they are related (in the same message thread, but it doesn't matter about the ordering of messages between thread topics).



          • Vector clocks can be used.

            • Sends are only recognised as events.

            • The messages sent will have time stamps.

            • A merge operation will occur when a message is received and accepted.

            • There's a layer between the communication layer and the application layer (middleware) that holds up these messages.

            • With a totally ordered multicast, the queue that held up the messages is not visible to the application, as the middleware is responsible for this.

            • In casually-ordered multicasting, the middleware keeps track of the vector clocks, and doesn't pass the information to the application until the properties are right.



          • Suppose Pj receives message m from Po with timestamp(m)...

            • Two conditions:

              1. timestamp(sender)[i] = VC[j][i] + 1

                • This means that this is the next message Pj was expecting from Pi.



              2. timestamp(sender)[k] <= VC[j][k] for all k != i

                • This means that Pj has seen all the messages seen by Pi when it sent message m.

                • All the messages that could have effected the contents of this message have also been seen by the receiver.

                • The middleware re-orders messages so that you don't see anything that you shouldn't see because you haven't seen things that cause it.









        • There are other possible orderings.

          • Some systems (such as ISIS) have built totally ordered and causally ordered into their system as properties that you can just switch on and start using.

          • There is some debate whether this is a good thing to do or not.

          • Criticism:

            • The end-to-end argument:

              1. The application is the thing that knows what it wants.

              2. If there are intermediaries (the middleware, or intermediate machines) imposing constraints, then the application may not get what it wants.



            • On the other hand, as a programmer, you don't want to have to implement this in the application yourself!





        • Two main causality problems:

          • Not all causality is real

            • For example, in the vector clock systems, if two messages are sent, it's hard to specify if they're not related.

            • The system will believe that the order of these two messages matters.



          • The computer may not know about causality.












Coordination and Agreement





  • Machines should notice when a master machine dies and to replace it.

  • Mutual exclusion is implemented using semaphores.

    • Semaphore consist of two operations P() and V().

    • P() is used at the entry to a critical section, and V() is used at the end of a critical section.

    • Semaphores are used in operating systems, as well as in distributed systems.



  • How do you implement a semaphore?

    • The following implementations would work for a single machine:

      1. Inhibit interrupts.

        • This method is no good for multiprocessors, as the other processes can still access the critical sections.



      2. Special, clever instructions.

        • Invented for multiprocessor setups.





    • Distributed mutual exclusion.

      1. In a distributed system, the individual systems don't share memory, which is the key difference. Neither of the two implementations (shown above for a single system will work).

      2. The simplest solution is to have a central server.

      3. The mutual server holds tokens that can be requested by clients.

      4. Tokens are given out to a client if it's available.

      5. If it's not available, then the client will have to wait in a queue.

      6. An example:

        • Client A requests a token from the server, but the token is unavailable, so client A is put into a queue on the server.

        • Client B releases the token to the server.The server looks at the queue for the first waiting client, and grants the token to that client.



      7. However...

        • The server is now a single point of failure.

          • If the server dies then nothing will work.



        • The server is also possibly a bottleneck in the runtime performance.

        • If a client dies with the token, it'll never be returned and the system may lock up.

        • More clever algorithms have been developed, but generally they're worse for the number of messages that are being sent.

        • It is more desirable for the master to be relocatable, and decided dynamically.







  • Elections in distributed systems.

    • To solve the problems noted in the mutual exclusion implementation, and to dynamically assign the master system, an election is used.

    • Elections are a good introduction to what is involved in getting processes in a distributed system to cooperate.

    • The whole point of an election is to choose a process to become the master, and to get all the processes to agree on who is the elected master (each processes is needs to learn the result of the election).

    • Any process can initiate an election.

    • If a process believes that the process in charge is no longer there, then an election is started.

    • The electoral system needs to account for the fact that there may be more than one election going on at any time!

    • The scenario is that you have a fixed set of processes within the distributed system, and you want to elect the process with the largest identifier, where identifiers are totally ordered (ties must not occur). An identifier could be "speed of processor" and "unique identifier".

    • Two key algorithms:

      1. Ring-based election algorithm

        • The method:

          • Arrange the processes in a logical ring (not necessarily a physical ring). Each process knows which the next process is. This is like a linked list of systems within the distributed system, where the last process points to the first.

          • We will assume that the communications will always succeed and never time out.

          • Initially every process starts in a non-participating state.

          • The initiating process changes itself to be a participant, and sends its identifier in an election message to its neighbour. The identifier is a unique identifier for a component of the distributed system, and will be used as the key for determining the new master.

          • The receiver marks itself as a participant, looks at the identifier in the message, and compares its own identifier with that in the message.

          • There are now three alternatives for the next stage:

            • If the identifier in the message is larger than the identifier of the receiver, then that means that there's a better leader already found. It just forwards the message.

            • If the receiver is already a participant, it simply discards the message (this is to get around multiple elections occurring at once).

            • Otherwise, it forwards the message with its own identifier (because it is the best candidate so far).



          • Algorithm stopping condition:

            • If the identifier in a received message is a processes's own identifier, then that process now knows it's been elected - the message has gone round the ring and come back to this machine with its own identifier. This process now has to let the other processes know that it is the newly elected master process.

            • An elected message is sent around the ring.

              • When the elected message is received by a process, it sets itself to non-participating and passes the message on.

              • When the elected message is received by the new master process, it throws the message away and the algorithm has completed successfully.







        • The worst case number of messages sent is 3n-1 (the case where the process elected is just before the process that initiated the election).

        • Weaknesses:

          • This system is not very tolerant to failures.

          • It's also very hard to add new processes to the ring.





      2. The Bully Algorithm

        • Designed for a slightly different situation.

        • We still assume that message communication is reliable but we allow for the fact that processes may crash.

        • A process already knows the identifiers of all the other processes and communicate with them all (this is a different method of communication between processes that in the ring-based algorithm).

        • Timeouts are used to detect process failure - the timeout is used to determine whether or not a process is dead.

        • Three types of message:

          • Election message - sent to initiate an election.

          • Answer message - sent in response to an election message.

          • Coordinator/elected message - sent to announce the identity of the election winner.



        • Two types of timeouts:

          • If no reply is received after a certain period of time, then it is assumed that the initiator of the election has crashed.

          • If the initiator of an election doesn't learn who won after a certain period of time, then it should begin another election (it is assumed that a process must have failed and messed up the election).



        • The method:

          • A process sends an election message to all processes with higher identifiers than itself (it doesn't bother sending the election message to processes with a lower identifier).

            • The thing initiator doesn't know is whether or not each of the processes are still running.



          • The processes that have not crashed will respond within the timeout a) period with an answer message.

            • When the election initiator receives the reply, it will now know that that the process that responded is still alive.



          • On receiving an election message, a process also initiates its own election.

          • Eventually, a process will have an election and get no answers - this is now the elected process. This process sends a coordinator message to all the lower processes, letting them know that it is the new master process.



        • Weaknesses:

          • Timeouts are required, which may take up time.

          • If a process restarts and elects itself, there may end up being two coordinators (masters) if the algorithm isn't coded properly.

          • The number of messages depends on which process initiates. The best case is O(n), where the election winner initiates the election. The worst case is O(n^2) - where the lowest process initiates.












Fault Tolerance





  • "A characteristic feature of distributed systems that distinguishes them from single-machine (centralised) systems is the notion of partial failure". - Tanenbaum.

  • The goal is of course to tolerate faults that may occur within a distributed system.

  • A system is dependable when it is trustworthy and reliable to provide a particular service.

  • Dependability requirements:

    • Availability

      • The probability that the system operates correctly at any given moment.



    • Reliability

      • The length of time that a system can run continuously without failure.



    • Safety

      • If a failure was to occur, the consequences should not be catastrophic for the system. For example, in airline software!



    • Maintainability

      • If the system does fail, the cost to restore the system should not be high.





  • What kinds of failures can occur?

    • Crash

      • A server, or a client halts during its process. This can leave a distributed state in an incomplete or unknown state.

      • We need to ensure that vulnerable sections of the system should either completely complete or not at all. This makes the vulnerable section of code an atomic operation.

      • Similarly, concurrent operations can be managed in this way.



    • Isolated execution

      • Start one operation, and then the next operation.

      • The problem with this approach is that a bottleneck can occur. We need to maximise concurrency, but prevent data corruption.





  • Operations should be ACID.

    • Atomic.

    • Consistant.

    • Isolation.

    • Durable.

      • Updates are persistent once the application successfully completes, otherwise the update should not go ahead.

      • Transaction algorithms use this durable recovery algorithms.





  • A mechanism can be used that would work as a high-level semaphores.

    • Two-phase locking:

      • Acquire read or write locks.

      • Release locks.



    • Omission failures

      • Filure to respond to incoming requests.

      • Failure to receive incoming messages

      • Failure to send messages.



    • Response failures

      • A server's response is incorrect.



    • Timing failures

      • A server fails to respond within a certain time.





  • Arbitrary (byzantine) failures - a component may produce output it should never have produced, which may not be detected as incorrect. Arbitrary responses at arbitrary times. These are by far the most common, and can be a combination of the above.

  • Two general's problem/paradox.

    • This is not summarised here.

    • We have unreliable communications, and we need to ensure that a message gets through.



  • Redundancy is the key solution to all these failures.

    • Physical redundancy

      • e.g. Boeing 747s have four engines but it can fly on three.



    • Time redundancy

      • An action is performed again and again until it is successful.



    • Information redundancy

      • Extra pieces of information are sent to allow for information recovery. This means larger messages being sent.



    • Redundancy won't always provide the solution to these failures.

      • For example, if faulty software running on the distributed system can fail, then it doesn't matter how many different copies of the software is running, they'll all have the same bug or problem. One solution would be to use multiple software teams and use multiple pieces of software.



    • Redundancy creates several problems

      • Consistency of replicas

        • Multiple replica's data has to be updated (e.g. databases).

        • Expense (cost to setup and maintain).










Distributed Transactions and Further Fault Tolerance



We need to ensure that transactions are meaningful using:





  • Concurrency control (dealing with actions that may interfere).



  • Locks can be used.

    • Such as semaphores.

    • "Acquire locks".

    • "Release locks".

    • For example, in a database.

      • We want to retrieve something and make some analysis.

      • We lock before the transfer data, and we unlock once the transfer is complete.

      • This is known as a read lock.

      • While we are reading the data - no other user can modify the data.

      • But users should be able to read the data, as the data isn't being changed.



    • We use a write lock if we want to write to data.

      • This allows us to perform a read on something that doesn't get changed half way through!

      • This is a stronger form of lock.

      • A write lock blocks out both readers and writers (other users).





  • Trade off

    • Locks are great for managing concurrency - but they're less efficient.

    • We now need to prevent deadlocks.

    • A method of resolving any issues using a more intelligent commit transaction is required.

    • A timestamp could be used.

      • The server checks that the timestamp for each read operation.

      • If the timestamp is well before the other transaction, then clearly there is no interference.

      • On the other hand, if there's an overlap, then the server will have to check everything at the commit phase.







Recovery mechanisms (recovering the system where something goes wrong):





  • The system could roll back to the checkpoint that represents the state of the system before the start of the transaction.

  • Forward recovery is where we know what the end state should be, and we know all the possible end state combinations are for all the possible errors.

  • If two accounts are stored into different databases, then we have the following problems:





    • Only one database crashes/fails (or both could crash/fail).

    • There may be a problem at the client side.

    • Distributed transactions come to the rescue!

      • Based on the notion of a coordinator, which monitors the transaction.

      • The coordinator ensures that all the other servers and clients stay "on the right page".

      • One-phase commit.

        • The client tells the coordinator to commit or abort a transaction.

        • The coordinator then communicates this request to all the participants.



      • Two-phase commit.

        • Gets around the problem of the one-phase commit, in that one of the participants cannot commit a transaction.

        • The coordinator asks the participant if it is ready to commit using a "can commit message" (this is the first phase).

          • There is a specified timeout.

          • If the timeout expires, "no" is assumed.



        • If all the participants reply "yes", then the coordinator sends a "do commit" message (this is the second phase).

        • If at any point, not all participants have replied "yes", then the transaction is aborted.

        • This is an all or nothing model.

        • Weaknesses:

          • If the coordinator fails, coordination is lost.

            • There is a three-phrase commit protocol.

            • Multicast could be used.

              • If the participants broadcast the information to all other participants, then this problem can be overcome.



            • (These options are non-examinable.)



          • The coordinator may be a faulty component.

          • Transactions (in general) should be short in duration, meaning that if there is no quick response, the timeout should expire.

          • When we have distributed transactions, we may have distributed deadlocks.

            • Deadlocks can be resolved by aborting one of the processes.

              • The earliest (or most involved) process can have the highest priority.



            • However, it can be hard to get a picture of which systems are waiting on which resources/events.

              • A coordinator could hold this information.

              • However, within a distributed system, the coordinator could receive delayed operations and make incorrect assumptions.

                • The coordinator could detect phantom deadlocks (which are deadlocks that don't actually exist.



              • An alternative would be to use edge chasing.

                • This is where a list of the processes that are waiting on other processes are kept on the server as a linked list.

                • If this linked list then references the same process twice, a deadlock is detected.

                • This can be done without a coordinating server by simply adding this information into every message that is sent.



              • When a deadlock is detected, an edge has to be aborted.









      • The Byzantine Problem













        • Reliable communications are assumed, but one of the generals is a traitor. We don't know whether to trust him or not.

        • Dealing with this problem is known as byzantine fault tolerance.

        • This is a faulty component acting in an intelligent way.

        • It can be proved that there is a solution to this problem if there are 3 times the number of faulty components working.

          • For further details, check the 'Journal of ACM, April 1980'.



        • If there are four components in the distributed system, and one is faulty, then each component needs to know the messages that were sent to each component from each other component in order to deduce a faulty component.

        • The idea is that each component sends the instruction it received from the component that originated the instruction to all the other components. Each component can then see the messages that each component had sent to it.

        • If the coordinator is faulty, then each component can see that it has been giving out random messages.

        • If a single component is faulty then the other components will be also able to see that.

        • Majority voting can be used to decide between multiple components what to do.

        • Other issues:

          • Signed messages could be used.

          • A faulty communication channel could be in place as well!

          • Messages may not be delivered.

          • The generals may not be able to communicate with each other directly! Ooooh!









  • Redundancy

    • Redundancy is a key technique to increase the reliability of a distributed system. It can be used to account for faults, failures, and accessibility.

    • The maths behind redundancy:

      • If system has a probability of failure of x.

      • Then the probability of two systems failing is x ^ 2.

      • The probability of three systems failing is x ^ 3.

      • And the probability of n systems failing is x ^ n.



    • Failure

      • Failure can be masked using redundancy.

      • In an odd number of redundancy systems, a majority vote is taken.

        • The final, unified result for all systems is the conclusion that is made and set for all components.

        • The system can continue, regardless of whether or not a component fails.

        • Critical software is written using this functionality.

        • Each redundancy system can be written independently and by different software development teams.





    • Performance

      • The performance of a distributed system can be increased by placing remote data, such as databases, at different physical locations.

      • This can mean that clients can connect to the replica database that is closes too, saving communication times.

      • For example a web browser will connect to the most local version of the webpage to the client's location.

      • Different copies of the data on the different servers have to be maintained so as the data is correct.



    • Weaknesses:

      • The price of replication.

      • There is a trade off with the cost of building multiple systems.

      • In queue systems, there is also a trade off between the number of clients that can be served at any given time and (again) the cost of the servers.



    • What is the price of replication?

      • Money.

        • System maintenance.

        • If the client modifies a value, the system has to ensure that the new value propagates to all the replicas of the system.



      • The same value is required to exist in all the replicas.

        • The system therefore needs to have a system to communicate new values to the replicas.

        • If the same value is being accessed from another client, control of this value has to be enforced.

        • A lock could be used on everything.

















          • But this could be inefficient.



        • Global synchronisation takes a long time, and could loose the benefit of using replicas for performance.

        • A consistency model is a contract between client processes, and the data on the server.

          • We want to say that processes will obey particular rules in order to ensure concurrency across replicas.

          • The read operation should display the results of the last write operation.

          • There is no global clock, so it can be hard to determine which operation occurred first.

          • Strict (tight) consistency model.

            • Once something is updated, then all the other replicas have exactly this value.

            • The same value should always be returned.

            • This isn't a very realistic approach in a distributed system; we need instant communication so as there's no chance of getting an out-of-date read.



          • Sequential consistency.

            • Ensures that variables are read and written in the correct, sequential, order.



          • Causal consistency.

            • Writes that are causally related must be seen by all processes in the same order.





        • Replica Management

          • When new replica server is placed into a graph of replica servers, it should be placed so as the cost to access this server is minimised.

            • To take an extreme worst case, there's no point placing a new replica server well away from all the other servers!



          • Every client needs to be able to access these servers at a minimal cost.

          • All possible combinations could be tried, an exhaustive search. But can be very slow.

          • A heuristic can be used to narrow down the number of nodes to try.

          • One method

            • Write down all the lengths of the paths between nodes, total these paths and remove the node with the smallest total.

            • Iterate the previous bullet point.














Distributed File Systems



Basic file service architecture:





  • Takes a name and decodes it, in order to get a file identifier.

  • On UNIX this may be an inode number.

  • On another system it may be the address of the first block, such as in FAT systems.

  • The flat file service provides operations on these blocks that are identified.



Idempotent versions of Operations:





  • Idempotent means that an operation can performed many times, and it's the same as doing it once.

  • For example, in a bank account you require money transfers to be idempotent.



  • Operations done via remote procedure call (RPC).

    • A read operation takes parameters in order to identify how to read the file.

    • The system keeps track of where you are in the file.

      • This is not a safe way of working using RPC.

      • You may have to repeat a read request if the transfer didn't complete successfully.

      • If the file system is keeping track of the position in the file, then you'd miss a bit!

      • This is not an idempotent operation!



    • Operations can be made idempotent by making the position in the file part of the read and write calls. The client keeps track of where it is in the file, not the server.



  • Stateless Servers

    • Making each interaction between the client and the server independent of previous interactions. Stateless.

    • If the server or the client reboots, there's no state to have to restore.

    • Consequences:









      • On a simple UNIX system, access permissions can be checked.

      • The process cannot tamper with the file descriptor (where this information is stored).

      • With stateless interactions, the accesses to the server needs to be checked for every access. This means extra overhead for each read/write operation.



    • Caching issues:

      • The norm is to read and write sections of a file in blocks.

      • The assumption is that the process is likely to want the next series of blocks.

      • In a distributed system, server caching doesn't cause any problems. But for a client, this can lead to noticeable differences.



    • Sun Network System (NFS)

      • The system Manchester University uses… so it must be awesome! ;)

      • Developed for a network of UNIX machines in the 1980s.

      • It can now run on Windows machines and other machines.

      • The remote filestore is mounted in the client machine's hierarchy.

      • There is a virtual file system (VFS).

      • NFS Architecture:













        • Client and server.

        • Applications run on the client.

        • These applications make system calls to the UNIX kernel.

        • The UNIX kernel has a virtual file system, which encapsulates various physical locations of files; local and remote.

        • The remote filesystem is controlled by an NFS client, which communicates with an NFS server on a server filestore.

        • This NFS server lies underneath a virtual file system within the server's UNIX kernel. The NFS server uses the virtual file system to access the physical locations of the files.

        • The NFS protocol is a protocol that allows both sides of the NFS connection to communicate between each other.

        • NFS uses RPC.



      • To a client this is all abstracted, and it appears like the file structure is a single entity.

      • Directories can be remote mounts.

        • I.e. remote file directories are mounted as directories on the local machine.

        • This complicates the way you look up a filename.

        • Different parts of the file path have to be looked up at different sections.

        • The local system can look up the the point of finding the mount point, at which point the server has to lookup the next part.



      • Issue: when is the directory tree built?

        • When you boot the client. The client the talks to all the servers and gets a layout of the entire system.

          • This is a very slow and inefficient option.



        • Sun NFS uses an automounter.

          • The automounter talks to the server only at the point where the client wishes to find out more information about a mount point.

          • One of the side effects is that calling a list directory command can result in displaying a directory where there's nothing there, if the directory hasn't yet been mounted.

          • However, this side effect is worth living with.

          • The positive side effect is that for structures that don't change, they stay.





      • Issue: how does caching work over Sun NFS?

        • Server side caching works as normal.

        • The problem is at the client side.

          • If you have bits of a file in local memory, and a different process writes to the same file on the server, the first client now has an out of date version.

          • The server doesn't know about this, nor does it want to know.

          • A timestamp mechanism is used to validate cached blocks.

          • If you have a block from a file, you also have a time for which this was valid. This is the timestamp.

            • For files, a block is valid for 3-30 seconds.

            • For directories, it is valid for around 30-60 seconds (these change less regularly).



          • The client also holds a time since last modified at the server by the server's clock. When this block is validated, the modification time should be checked to see if it has changed. The server can inform the client whether or not this is out of date, rather than throwing away caches and ask for a new one.





      • NFS is pretending that the file structure is just a single structure.



    • Andrew File System (AFS)

      • Developed at Carnegie-Mellon University (CMU) in the late 1980s.

      • The aim of AFS is to maximise scalability by minimising client-server interactions.

      • AFS is very concerned about the scalability of the distributed system.

      • Coda (a more advanced file system which is more relevant today) incorporated AFS, and can be used in wireless setups to avoid disconnection problems between accesses to the file server.

      • The basic idea is that when a client wishes to access a file, AFS gives it a copy of the file.

        • Files up to 64kb are transmitted.

        • Rather than trying to cache individual blocks, the whole file is sent.

        • The version on the disk is used as opposed to the remote one.

        • Large local caches are used, this idea is known as 'whole file caching'.



      • The other design principe is not to have servers pretending to be clients.

        • On the clients you have an application running known as Venus, and on the server, you have a server application known as Vice.

        • It is these application-level pieces of software that handle the retrieval of remote files.

        • One of the side effects of NFS is that each client has a different picture of the directory structure.



      • AFS states that there are two filestores - shared and local.

        • Files on each system are characterised into these two filestores. The category (and directory) that files fit into is relative to the system that the file is on.

        • All the names in the shared filestore will be common to all workstations.

        • You can produce links between the local file store and the shared file store.

        • Issue: how does caching work over AFS?

          • Use a callback promise, which is a token.

          • This token is either valid or cancelled.

          • This provides a way for updating cached copies.

          • The act of closing a cached file sends the modified file back to the server. Timestamps are also used, in the case that the client goes down or the RPC goes missing.

          • It renews callbacks before an open if the time has elapsed (see above for times).

          • Maybe clarify this using t as on the slides.

          • AFS has to use a server state in order to implement this.

            • This state naturally has to be maintained.





        • AFS has changed the semantics of sharing.

          • If we have multiple writers to a file, the last close will overwrite the previous close - it will not use an append.

          • Therefore, AFS is simply not recommended for systems where we wish to have multiple client access to files.

          • At CMU, they spent time analysing how the files are used.

            • What the observed was that files were usually small (less than 10kb).

            • There were about 6 times as many reads from files as there are from writes to files.

            • Most files get read rather than written.

            • Sequential access is more common that random access.

            • Most files only have a single user. These facts are not surprising, but what CMU have done is look at these results, and create a whole file caching system (AFS), as it makes the most sense for it to work like that.









    • In our modern reality…

      • NFS-4 introduced callbacks (it doesn't have stateless servers any more).

      • NFS looked at AFS and decided that what it was doing was a good idea.

      • NFS-3 is what was referred to earlier.

      • AFS and NFS provide a useful, real service, supporting thousands of users.

      • There are other file management systems.








The Quest For Performance





  • In distributed system, performance is an important objective. We can build a simulation of the performance before we build the actual system, but how do we build a model?

  • Using an analytical solution

    • Derive formulas and use mathematics to analysis and predicted systems.

    • Sometimes advanced maths is required to do this.

    • Queuing theory can be used to analyse systems involving queues, such as incoming/outgoing message queues and process queues.

      • Queueing theory is based on probability theory.

      • A useful result of queuing theory is Little's Law:

        • average number of customers in the queue = arrival rate * service time





    • Amdahl's law

      • If we have a program running in time n, and x% of this program that can be parallelised, then we can reduce the runtime of the program by distributing the program across p machines.

      • parallelised runtime = overhead + single process run time * (1-x + x/p)



    • The minimal number of HTTP requests required.

      • The amount of time for a client to send a request to the server and get a request back.

      • latency (round trip time) + size of message + time to send the number of bytes that express the request (request size / bandwidth) + server process time + size of reply/bandwidth.

      • This isn't 100% accurate, but it is a good estimate.





  • Building a simulation

  • Two main simulations:

    • Monte-Carlo simulations (random numbers are used).

      • Results are calculated and based on repeated random sampling. A random input is used, and the result for this input is calculated.

      • This is useful for simulating a sample space of a large data set.

      • Example: simulating the solution to the Monty Hall problem.



    • Discrete-Event simulations (events over time are considered).

      • Events are randomly generated.

      • The simulation proceeds as a chronological sequence of events.

      • This simulation implies some notion of time, and is used in most computer games, and the final graphics exercise, and the third lab exercise of this course.














Security





  • Within a single machine, the operating system is responsible for security. This involves tasks such as checking access rights and user identities.

  • A network poses new problems to security.

  • Different machines have different user identities (the same identity on local machines may represent different users across multiple machines).

  • In addition, the network causes problems. Messages can be:





    • Read.









      • e.g. Intercepting across a network graph, or listening in to messages on different channels in wireless communications.



    • Altered.

    • Created containing forged information.

    • Copied and replayed.

    • Denial of service attacks can also occur

      • A network can be set up so as it has too much traffic.

      • This can be done very simply by making computers send requests as quickly as they can.





  • Cryptography





    • The solution.

    • The information is hidden in such a way that it cannot be understand. Therefore it cannot be read or manipulated.

    • Extra information is included, such as sequence numbers or time information, so if the information is intercepted it can be detected.

    • Keys are used:

      • Shared secret keys

        • The sender and the receiver both know the shared secret key.

        • The problem is how to share the key. This is the weak point.

        • Public/private keys pairs













          • Every local system has a pair of keys; a public and a private.

          • Everyone system knows all the public keys of everyone other system.

            • The public keys published and made public, but a system only knows its own private key.

            • It needs the other key in order to decrypt a message.



          • This method of cryptography requires around 100-1000 times as much processing power to use public key encryption. This can really slow things down for large documents.

          • It also relies on being able to derive the private key from the public key.

          • RSA (the company that set up this system) run a competition where they offer prizes to people that can decrypt the private keys from the public keys.

            • If a method was discovered, then private keys could be worked out from public keys, and the whole mechanism breaks down because public keys are published.

            • The encryption from public to private keys is done using factors.

            • These keys get longer and longer as computers get faster.

            • Encryption algorithms tend to be leaked (despite it not being a good idea for this information to be available), so the system relies upon computers not being fast enough to perform the computations necessary of decrypting the keys.



          • How it works:

            • A sends a message to B with B's public key.

              • Only B can decrypt this message.



            • More practically, A sends a message to B with a one-off session key, encrypted with B.

              • This session key is used to get around the problem that the public key encryption is very slow, and that you don't want to have to encrypt and decrypt the messages every time.

              • The session key is used for all messages with a burst of the communication.



            • Signatures (see later section) work with public keys.

              • A sends to B its identity, and the information encrypted with A's private key.

              • Once B knows who A is, it looks up A's public key and decrypts the message.

              • B now knows its got a message from A because it decrypts it using A's key.

              • To keep things confidential, the whole message can be encrypted with B's public key.





          • There needs to be a secure way to get the public keys.

            • This needs to avoid machines giving out fake keys.

            • So, in the wider world, we use certificates, which are keys used to ensure that systems are who they say they are.

            • There is a chain of trustworthiness, with a defined point where they public key is so well known that nobody could possibly claim to have that key.

              • This is the system of certificates.

              • Once this system has been built, its very difficult to revoke a certificate because there are copies of it all over the place.

                • Trustworthy information has an expiry dates to get around this, so after a certain timeout the information has to be renewed.









        • Needham-Schroeder Authentication Protocol

          • How to get two processes communicating securely with each other using an authentications server, which has secret keys for all the principles (the components of the distributed system).

            • When you create a username, the secret key is your password.

            • You don't want processes to have to learn this secret key - the authentication server should know this.

            • Kerberos uses this within Intranets.

            • How it works:

              1. A wants to communicate with B.

              2. A sends a request message to S to supply a communication key in order to communicate with B.

              3. S returns a message encrypted in A's secret key, containing a newly generated key K<AB>, and a 'ticket' encrypted in B's secret key. This 'ticket' is used to contain the identity of A. A nonce Na demonstrates that the message was sent in response to the preceding one. A believes that S sent the message that comes back to it, because only S knows A's secret key.

              4. A then sends the 'ticket' to B. A can't understand the ticket, because it's encrypted in B's secret key, but it does send it on to B. B will be able to understand it.

              5. B decrypts the 'ticket' and uses the new key K<AB> to encrypt another nonce to A (Nb). The nonce again signals to A that the message has indeed come from B. A can understand this encrypted message, as its using the key K<AB> that s provided.

              6. A demonstrates to B that it was the sender of the previous message by returning an agreed transformation of Ng.







        • Secure channels

          • An idealisation built on existing communication services such as TCP, IP etc.

          • There's a lot of effort involved, but this does ensure privacy and integrity of data.

          • Examples:

            • Virtual Private Networks (VPNs).

            • Secure Socket Layers (SSLs).



          • Properties:

            • Two processes know the identity that they're communicating with.

            • This has to be secure.





        • Signatures

          • Very early on, people realised they needed an electronic version of a signature.

          • The idea of a signature has been transferred into an electronic medium.

          • A real signature has to be part of the document (not cut and pasted), it's obvious to tell when this has happened!

          • With computers, its easy to copy and paste signatures, so we need a secure digest (a.k.a. secure hash function), which is derived from the message.

          • The hash function is a way of deriving a short value from a much larger ranged value (such as a message).

          • The idea is to write a message, sign it, and then provide a secure digest, which shows the receiving process that the message hasn't been tampered with, and that it is indeed a message with an authentic signature.

          • Uses of signatures:

            • For authentication; to show the recipient that the message came from a particular sender.

            • For non-repudiation; the sender cannot deny that they sent the message.

              • For example, if you send money using a banking system, then you can't retract this claiming it wasn't you that sent it because you've signed it.
















Integration Architecture



Architecture matters. Simples.





  • The service is the idea that links the requestor and the provider together.

  • What happens is the two parties become known to each other.

    • Via a service such a Bonjour on Mac OS X.



  • They agree on the semantics and WSD (find out what this is).

    • The semantics ensure that both sides of the communication are communicating in an allowable and acceptable manner in order to get the correct service.

    • Each side of the communication then uses inputs and the semantics agreed on to formulate the way they want to operate.

    • On the requester's side, a requester agent communicates with the provider's provider agent in order to perform the actual interactions, given these inputs and WSD.



  • States that distributed systems architectures that are built must meet the following requirements;

    • Reliability

    • Fault-tolerance

    • Responsiveness

    • Scalability

      • Allow room for improvement and enhancement.



    • Security.



  • They must also ensure that they will work across all platforms, achieving heterogeneity. The service is a section of code and data that stands alone, and the only way in and out of the service is via messages. The services are durable and survive crashes.

  • Service Oriented Architecture (SOA)

    • SOA is one architectural design.

    • The four tenets of service oriented architecture:

      1. Service boundaries are explicit.

        • This is so as there is no ambiguity as to where a piece of code and date resides.

        • It is explicit if a piece of code is inside or outside a service, for example within internet banking, the code and the data live inside the banking service.



      2. Services are autonomous.

        • Services are developed and managed independently.

        • A service can be re-written without impacting other services with some constraints.



      3. Services share schema and contracts.

        • Schemas define the messages a service can send an receive.

        • A contract defines permissible message sequences.

        • The services do not share implementations, and are independent of computing platforms or programming languages.



      4. Service compatibility is based on policy.

        • The policy defines the rules for using a service.

        • It introduces the notion of roles to distributed computing.

        • For example, a security policy specifies who can use the service.

        • Should services be organised into an overall virtual organisation, or accessed over an open network?







  • More complex services

    • Purchasing good or a resource.









      • Requires the following sequential operations:

        1. Get a quote.

        2. Make a reservation.

        3. Make changes.

        4. Make payment

        5. Cancel ticket.



      • The service schema defines the message schema for each stage of the process.





        • Services can share schemas.

        • Incoming message: the service transforms the message from the "shared schema" to its internal schema.

        • Outgoing message: the service transforms the message from the internal schema to the "shared schema".

        • There can be a different internal schema for each system.

          • The "shared schema" will have to be translated by every system.

          • This totals n x (n - 1) number of transformations, where n is the number of servers.

          • With 12 services, this is 132 transformations!



        • The alternative is to have a central standard schema

          • This means that there are now only 2n number of transformations.

          • Each message goes via this central standard schema.

          • For 12 services, this is only 24 transformations - this is a scalable solution.



        • Difficulties mapping a service's internal schema to another schema.

          • Sometimes this can even be impossible to achieve.







          • The following problems make this mapping difficult:

            • Names can conflict (the same name can be used for different meanings).

            • The structure of the message can be different.

            • The values can be represented differently.



          • If we can identify what the information means semantically, then we can find a different syntax for the representing the same thing.

            • Ontologies are tools used to capture the semantics of the data or services.

              • They are representations of a set of concepts within a domain, and they allow reasoning about the concepts they represent.





          • Each service wants to push forward its most convenient method.

            • Individual services want to reduce the costs of writing their own code to transform their data to and from the standard, which is sometimes extremely hard or impossible to transform the data.

            • Standards evolve slowly, and are driven by large organisations, mandated by the government, or by a community.





        • The service contract defines a causal relationship between the messages and whether particular messages in the process are mandatory.



      • There is a co-relation identifier to link the messages in a conversation.

        • Once the transactions in a service has been identified, it's important to identify the consequences for if particular parts of the service fail or crash.



      • Message-Oriented Middleware (MOM)

        • The infrastructure for shipping messages between the services.

        • The Service Oriented Architecture is about using messages to connect services.

          • Asynchronous and non-blocking message passing are used.









            • Asynchronous means the server applications sends a message and doesn't wait for a response.

            • The message will enter a queue and come back some time later.







            • The recipient may or may not be actively processing incoming messages when the message is sent.



          • Point-to-point messaging specifies a single recipient.

            • There is an enqueue and a dequeue for the messages.



          • Three types of delivery modes:

            1. At most once.

              • Best effort.

              • The message gets sent and forgotten.

              • Good for continuous updates.

              • The cheapest mode with no stable storage.

              • The most throughput.



            2. At least once.

              • Requires the system to be rebootable, and the message queues unchanged.

              • The application receiving the message can deal with duplicates.

              • The message is repeatably sent until there's an acknowledgement of the message being received.



            3. Exactly once.

              • Requires the system to be rebootable, and the message queues unchanged.

              • It's easy to write the application.

              • Requires a form of commit.

              • An order of delivery can be specified.

              • The least throughput.





          • Publish Subscribe.

            • The sender publishes a message on a coptic, rather than a destination.

            • The subscribers then subscribe to topics that they're interested to.

            • SImilar to a mailing list.

            • Topics are organised in a hierarchy.



          • Different companies have produced different tools for performing messaging.

            • IBM have Websphere MQ.

            • SQLServer.







            • It is necessary for these services to inter-operate.

              • Emerging standards.

                • Web Services Reliable Messaging (WS-Reliable Messaging).

                • Advanced Message Queuing Protocol (AMQP).









        • SOAP

          • Before SOAP was XM-RPC (both are protocols).

          • Simple Object Access Protocol

          • Service Oriented Access Protocol

          • Some people believe that SOAP no longer stands for anything because people have forgotten what it stands for.

          • "SOAP provides a simple lightweight mechanism for exchanging structured and typed information between peers in a decentralised, distributed environment using XML" - Microsoft.

          • The structure is as follows:

            • Envelope

              • Contains everything.



            • Header









              • Optional.



            • Body

              • Contains the message pay load (see next).



            • Message Pay Load

              • The part of the message that is intended for the application.





          • There are many different styles of XML.

            • You can do RPC using XML.



          • Documents are an alternative, which are the exchanging of messages with a service.



        • Transactions across queuing systems and databases have a two-phase commit between the two sections.

          • There is a short duration, and the transaction takes place within the same trust domain.







    • Service-Oriented architectures have a set of design principles, the use of messages to connect services. A service is a collection of code and data whereby the only way in about out of a service is through "messages".



  • The Web

    • The web has a lot of characteristics, but what is its architectural style?

    • Representation State Transfer (REST)

      • Defines a set of constraints for building internet-scale distributed applications in a simple and elegant manor.

      • The architecture style that is back-derived from the web. It is the style that makes the web work so well.

      • The term REST was coined by Roy Fielding in his PhD thesis to describe the style of the web.

      • The key abstraction in REST is a resource. A resource can be anything that's named.

      • Each representation provides information about the state of the resource.

      • Each client navigates through the resources (via links), and each link transfers the client to the representation (state) of another resource.

      • REST is independent of any middleware.

      • HTTP follows the REST principles.

        • It uses URLs for links.

        • Resources have multiple representations such as TML, XML, MOV etc.

        • They have the same interface (GET, POST, PUT methods that can be called on the resources).

        • REST is reliable, easy to manage server failures, is scalable and easy to add a new server.

        • The state of resources are storied in databases shared by the servers, because databases prove the best means of storing data, they're reliable, fast, fault-tolerant, etc.



      • Issue: how does caching work within REST?

        • Messages should signify in their headers whether or not its contents should be cacheable using a respective "public" or "private" flag.

        • This allows for data reuse and minimising the latency.

        • The price for this is that what's in the cache is out-of-date.

        • There is a cache on both the server and the client.

        • The key to interoperability is standardisation.

        • HTTP provides this.

          • You can use HTTP across many programming languages.












Parallel Computing and The Grid





  • Using more than one computer (multiple CPUs) in parallel, in order to break down a large computation.

  • Multi-core is an example of parallel computing.

    • Distributed systems (different computers) running in parallel are typically referred to as "distributed computing" - but the idea is the same.



  • Used in weather prediction, running long simulations, rendering.

  • History

    • In the late 80s long integer factorisation (remember pubic-key cryptography relies upon the difficult of finding the prime factors of long integers) was experimented using distributed computing.

      • The algorithm was split into independent problems, and then these problems were solved in parallel by different machines.

      • Communication is necessary between these machines.

        • At the time this was very simple; it was just email, but the importance of distributing the calculations made the slow communications worthwhile.



      • In the 1990s, the internet consists mainly of UNIX machines.





        • A C program was developed that would run on a machine when it was idle, and use email to communicate with a server, to email results and request data.



      • Computations over large sets of data can be hugely sped up using these techniques.

      • In the late 1960s, grid computing was first suggested as a futuristic idea!





        • The term 'Grid' first came about in 1996 to describe hardware and software needed to speed up computations.

        • This was just a vision at the time.

        • It drew an analogy with the electricity grid.

          • The idea was that you'd be able to plug in your program to a computing grid in order to perform huge calculations using the huge grid.



        • The term "gird" was catchy and soon many researchers were talking about grids for all sorts of things - data, knowledge, sensor etc!







  • High Performance Computing (HPC)

    • Refers to the use of parallel supercomputers and computer clusters linked together to solve a single large problem.



  • Requirements of grid computing

    • Coordination of resources that are not subject to centralised control.

    • An interface to make use of the machines found in the grid.









      • An open standard.



    • Quality of service.

      • Fast response time.

      • Throughput.

      • Availability.

      • Security.





  • Defining the grid

    • There are different definitions:

      • An implementation of distributed computing.

      • A common set of interfaces, tools, APIs, ...

      • The ability to coordinate resources across different administrative domains.

      • A means of providing an abstraction (virtualisation) of the resources, services etc.

      • Resource sharing and coordinated problem solving in dynamic virtual organisations.



    • The last definition is the key one.

    • Grid computing must provide:

      • Resource discovery and information collection publishing.

        • Apple's Bonjour service.



      • Data management on and between resources.

        • A protocol must be built that allows data transfer.



      • Process management on and between resources.

        • A method of sending processes to different machines.



      • Common security mechanisms.

        • Machines must only be used for legitimate users.



      • Process and session recording/accounting.

        • There must be a log mechanism.



      • Information about what is running etc.



    • Over the last fifteen years, most of the work has been about developing standards.

      • One of the key mistakes in the past was to start from scratch inventing protocols for everything, which was a waste of time.





  • Middleware available for use in the grid.

    • Globus

      • Widely used.

      • GridFTP was developed from Globus but has extra functionality to copy requirements of high performance computing, as well as additional security.

      • OGSA-DAI allows different data services to be exposed using a unified protocol over the grid. For example from different databases.

      • WS-Resource Framework uses web services to implement many of the required functionality of grid computing.



    • Condor and Condor-G

      • Focuses on the different jobs, which are to be submitted to be processed on different machines.

      • Condor provides load balancing and submissions facilities for these jobs.

      • It acts like a front-end queue.





  • A lot of investment for getting machines to communicate with each other, such as faster infrastructures and more space.

  • In parallel computing, we need to deal with communication and synchronisation between the various parallel systems.

    • The data communication time has to be balanced against the time it will take to run the "sub-process".



  • The sections of the program that can be run in parallel are the sections that don't matter which order they are run. These sections when parallelised will speed up the run time of the application.

  • Parallelising an application is not always the best way to increase the performance.

    • Maximum parallelisation speed = 1/(1 - proportion of the application that can be parallelised).

    • Amdahl's law.

      • parallelised program run time = parallelisation overhead + single machine program run time * (1 - proportion of the application that can be parallelised + proportion of the application that can be parallelised / number of CPUs).

      • See section on analytical analysis of distributed systems.



    • The time to run the application with parallelisation should be less than without in order for the parallelisation to be any use!



  • Parallel computing can be represented schematically as a graph.

    • Nodes are the processes.

      • Process times are written within the nodes.



    • Edges represent messages being sent by processes, and required by processes.

      • Message send times are written on the edges.



    • To calculate the running time of a non-parallelised application, add all the node processes times.

    • To calculate the parallelised process time, calculate the longest path to the end node.



  • Teragrid is one of the largest computing grids in existence in the US.

  • Datagrid is one of the largest computing grids in Europe.

  • Applications on the grid

    • The application needs to be able to split the computation down into many smaller tasks, which can then be given to different computers.

    • Tools can be used to automatically break down the computation.

    • This is known as a divide-and-conquer technique.

    • Most common applications are scientific workflows.

    • Case study: IBM's Deep Blue

      • 30 node parallel computing.

      • The machine can evaluate 200 million positions per second, enabling the machine to evaluate moves 40 turns in advance and beating the world champion in May 1997.





  • The key idea is to get into a position to create a virtualisation of different machines connected through the internet and make them available to applications.




Cloud Computing





  • The idea:

    • The client becomes very thin, and the heavy computation is run on a remote server.

    • Making computing power a service.

      • Users of the Cloud are consumers who don't have to own the resources they use - the resources are provided by the providers as a service.

      • Computing power is provided as a service.

      • Software as a service.

      • Platform as a service.

      • Infrastructure as a service.



    • Software is created and made available using standard interfaces so that other applications can use this software within it.

    • A key example is Amazon's EC2 service.

    • The cloud vs. the grid.

      • Similarities:

      • Scalability

      • Divide and conquer

      • Lots of data

      • SLAs



    • Differences:

      • The Grid has a tighter relation between users and administrators. The Grid is more restrictive.

      • The Cloud has privacy concerns, because it is more accessible.
















        • Trade-Off Summary











  • Distributed systems is all about trading off one thing for another. In a world where there are no time, space or money constraints, most of the problems in distributed systems would be overcome.

  • List of trade-offs within distributed systems development:

    • Cost.

    • Amount of redundancies vs. cost of redundancies.

      • Would you rather use an expensive system with redundancy, or a cheap system without. Well... in an airline flight system, the extra cost is probably well worth it.

      • In queue-based systems.

        • If the server cannot deal with a client request, then the request will have to wait in a queue.

        • If the queues become long, then large delays could occur.

        • If there is a cost for waiting in the queue, then this should be traded off with the cost of running more servers, given the number of expected requests.

        • Queuing theory is the mathematical study of queues to find optimal solutions to this problem.





    • Communication costs vs. increase in run time.

      • This trade off is found in parallelisation, and in grid computing.



    • Size of messages sent vs. readability.

      • For example, XML is easy to interpret, but may not necessarily be the most compact message to send.



    • Inaccurate system model vs. benefits of having the model.

      • We are prepared to pay the cost of having an inaccurate model in order to get an answer questions about the system without having to actually build the system.








References





  • Distributed Systems Second Year Course by Chris Kirkham and Rizos Sakellariou.

  • Distributed Systems - Concepts and Design by George Coulouris, Jean Dollimore and Tim Kindberg.

  • Distributed Systems - Principles and Paradigms by Andrew S. Tanenbaum and Maarten Van Steen

No comments:

Post a Comment