Thursday, 10 September 2009

Distributed Systems - Massive Distribution for Performance

Definition of a Distributed System


A distributed System is one in which independent, self-sufficient – often autonomous or heterogeneous – spatially-separated components must use a common interconnect to exchange information and coordinate actions, and allow the whole to appear to the user as one single coherent system.




Knowledge is POWER!


Think for a moment, what could you do with a computer? Yeah, your right - sod all really, specially if its Windows :P . Now, what about 100 computers? That’s right, now were getting somewhere! Now, let’s look at something better. What about an entire data centre comprising thousands of computers! Well, put simply – holy monkeys that’s a lot of computers, its... its like being in heaven :D !




Parallel vs. Distributed


Do you remember the parallelization schemes that a distributed system can be placed into? If it helps, its:

  • SIMD

  • MISD

  • MIMD


Where a distributed system is MIMD-parallel computing when the interconnect is on a network-scale.




A Parallelization Example


Say you were given a list ‘a’, which contained 60 items, along with some integer ‘b’.

Now, you and 2 other people were asked to produce another different list, ‘x’, which is identical to list ‘a’, however, every item has been incremented by 1, and the first occurrence of ‘b+3’ has been removed. How would you do it?

Let’s say that it is possible to partition the work, so all 3 people can work in parallel. This would finish the job much quicker. This idea is commonly known as the divide-and-conquer strategy. And we also can say that a parallelization pattern called master-and-workers applies here.




Master-and-Worker


Master-and-worker is when you split the work load evenly between a number of processes that will work in parallel to get the correct answer. Say for example we use the example above. Here we have 3 people and a list of 60 elements. In this case you would split up the work load so each person has a list containing 20 elements from ‘a’:

  • a1 = [0:20]

  • a2 = [21:40]

  • a3 = [41:60]


Here, we will have 3 new lists: x1, x2 and x3, where each element has been incremented by 1.

Now what we do it get all 3 new ‘x’ lists, and combine them together to get the one list: ‘x’, which is every element in ‘a’, but incremented by 1.

Now all we need to do it remove the first item in the list that is equal to ‘b+3’.

This process is called Split, Spawn and Merge. First we split the list, then we spawn methods to return a new list ‘x’, then we merge all the new lists together to get the one new list ‘x’.




Map-Reduce


The map-reduce model – would you believe it – comprises of the map and reduce second-order functions.



MAP


In its simplest form, a map takes a unary function ‘f’ and a collection [c1, ···, cn], and then returns a collection:

  • [f(c1), ….., f(cn)]


as you can see, the map will apply the required function onto each element of a given collection. Here’s an example:

  • Say we have a collection ‘A’
    >>> A = [1, 2, 3, 4, 5]


  • Then we have a function, say: incr(x), which will return the value ‘x+1’
    >>>def incr(x):
    return x+1

  • Now we declare a new collection: ‘B’ = map(incr, A)
    >>>B = map(incr, A)



  • Note how the first argument is the function incr, and the second is the collection ‘A’.



  • Once this is done, the map will apply each element of ‘A’ to the function. And when we print out the new collection ‘B’ we get:
    [2, 3, 4, 5, 6]




REDUCE


In its simplest form, reduce takes a binary, associative function, Θ (with an optional, initial value ‘i’) and a collection [c1, …, cn], and then returns a new value:

i Θ c1 Θ … Θ cn

So reduce will apply the function to the elements of the given collection. Lets have an example:

  • Lets use the collection ‘B’ from the map above:
    >>> B = map(incr, A)


  • Then we define a binary associative function, that takes to numbers for arguments:
    >>>def prodDup(x,y):
    return (x*2)*(y*2)


  • Now we apply the function to the collection ‘B’ using reduce:
    >>>reduce(prodDup, B)


  • When printed, we obtain the value 184,320.





Map-Reduce Programming Model


Users implement the following interfaces, which are 2 functions known as the mapper and the reducer.

  • Map (in_key, in_value) ->
    (out_key, intermediate_value) list

  • Reduce (out_key, intermediate_value list) ->
    out_value list


This infrastructure takes care of splitting, spawning and merging, as well as fault-tolerance and load balancing.

No comments:

Post a Comment