Disjoint-Set Operations

The following problem arises in some graph algorithms:

I think the easiest way to understand this definition is to look at an example and what it means. The following example is taken from Cormen et al, Introduction to Algorithms:

If you look at these, then you can see for example that the edges (b, d) and (a, b) share vertices, whereas (b, d) and (h, i) do not. Finding the disjoint sets means finding all those sets that are distinct for all members (aka disjoint). In our example, the result of processing the given input data would be:

    (a, b, c, d)  (e, f, g)  (h, i)  (j)

Note that j is a vertex that is not part of any edge, hence an edgeless graph: so by definition it is disjoint to any other subgraph.

Ask yourself: Do I understand the problem?

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

Therefor, our test code looks like this:

#! -*- Encoding: Latin-1 -*-

def selftest():
    import time

    # see above: test data parameters
    SIZE_OF_DOMAIN = 10000
    SIZE_OF_EDGES = 2000

    edges = [(random.randrange(SIZE_OF_DOMAIN), random.randrange(SIZE_OF_DOMAIN)) for k in range(SIZE_OF_EDGES)]
    vertices = range(SIZE_OF_DOMAIN)

    functions = {
        'using builtin set datatype' : ds_using_builtin_set_type,
        'using builtin list datatype' : ds_using_lists,
    }

    expected_result = None
    for function_name in functions:
        function = functions[function_name]
        t0 = time.time()
        result = len(function(edges, vertices))
        t0 = time.time() - t0
        print "%32s: %12.2f for result %6d" % (function_name, t0, result, ) 

        if expected_result is None:
            expected_result = result
        else:
            assert result == expected_result

if __name__ == "__main__":
    selftest()

Implementation using Python Sets

Python has builtin set datatypes which makes implementing this ADT really simple, and the code really beautiful. Remember our abstract definition:

  1. MAKE-SET(x): create a set out of a single vertex x.
        def MAKE_SET(x):
            return frozenset([x])
    
  2. FIND-SET(x): find the set that has the member x.
        def FIND_SET(x):
            for subset in sets:
                if x in subset:
                    return subset
    
  3. UNION(u,v): given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).
        def UNION(set_u, set_v):
            sets.add(frozenset.union(set_u, set_v))
            sets.remove(set_u)
            sets.remove(set_v)
    

It doesn't get more simple than that. Subsequently, the whole Python code looks like this:

def ds_using_builtin_set_type(edges, vertices):

    # given an element, create a set with only this element as member
    def MAKE_SET(x):
        return frozenset([v])

    # create big set of sets
    sets = set([MAKE_SET(v) for v in vertices])

    # find a set containing element x in a list of all sets
    def FIND_SET(x):
        for subset in sets:
            if x in subset:
                return subset

    # create a combined set containing all elements of both sets, destroy original sets
    def UNION(set_u, set_v):
        sets.add(frozenset.union(set_u, set_v))
        sets.remove(set_u)
        sets.remove(set_v)

    # main algorithm: find all connected components
    for (u, v) in edges:
        set_u = FIND_SET(u)
        set_v = FIND_SET(v)

        if set_u != set_v:
            UNION(set_u, set_v)

    return sets

To give you a feel how this algorithm works, let's look at some sample data.

Edge Processed | Collection of disjoint sets
  Initial Sets | {0}       {1}    {2}            {3}    {4}    {5} {6} {7} {8}
        (5, 4) | {0}       {1}    {2}            {3}    {4, 5}     {6} {7} {8}
        (6, 3) | {0}       {1}    {2}            {3, 6} {4, 5}         {7} {8}
        (6, 2) | {0}       {1}    {2, 3, 6}             {4, 5}         {7} {8}
        (1, 8) | {0}       {1, 8} {2, 3, 6}             {4, 5}         {7}    
        (3, 4) | {0}       {1, 8} {2, 3, 6, 4, 5}                      {7}    
        (1, 0) | {0, 1, 8}        {2, 3, 6, 4, 5}                      {7}    

Implementation using Python lists

In much of the literature, the operations are implemented in more basic datastructures - linked lists, arrays and such. For example, we're going to use lists rather than sets to represent sets. UNION will be simple: it will extend one list with the items of the other. But FIND would be slow: we'd have to enumerate all items in all sets, giving us a O(N2) runtime in the worst case. So we'll use an additional hash table to speed up the process of finding elements in the lists. Let's see how this works:

  1. MAKE-SET(x): create a set out of a single vertex x.
        def MAKE_SET(x):
            return [x]
    
    This is actually even more simple than the version using set datatypes.
  2. FIND-SET(x): find the set that has the member x.
        def FIND_SET(x):
            return set_member_lookup[x]
    
    Given that set_member_lookup is our lookup dictionary, this implementation is bound to be fast. But it also kind of cheats: because we haven't specified how we keep items of the dictionary up to date.
  3. UNION(u,v): given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).
        def UNION(set_u, set_v):
            if sets[set_u] is not None:
                if sets[set_v] is not None:
                    sets[set_u].extend(sets[set_v])
    
                    for k in array_of_sets[set_v]:
                        set_member_lookup[k] = set_u
                    sets[set_v] = None
    
    OK, so this is where the cost of the complexity is hidden: when we join two sets, we need to update all relevant items in the lookup dictionary, and we need to blank out the input set (set it to None)

The complete implementation in Python looks like this:

def ds_using_lists(edges, vertices):

    # given an element, create a set with only this element as member
    def MAKE_SET(x):
        return [x]

    # create big set of sets
    sets = [MAKE_SET(v) for v in vertices]

    # create lookup in sets
    set_member_lookup = {}
    for index, v in enumerate(vertices):
        set_member_lookup[v] = index

    # find a set containing element x in a list of all sets
    def FIND_SET(x):
        return set_member_lookup[x]

    # create a combined set containing all elements of both sets, destroy original sets
    def UNION(set_u, set_v):
        if sets[set_u] is not None:
            if sets[set_v] is not None:
                sets[set_u].extend(sets[set_v])

                for k in sets[set_v]:
                    set_member_lookup[k] = set_u
                sets[set_v] = None

    # main algorithm: find all connected components
    for (u, v) in edges:        
        set_u = FIND_SET(u)
        set_v = FIND_SET(v)

        if set_u != set_v:
            UNION(set_u, set_v)

    return filter(None, sets)

You should be able to see that the main code structure is almost identical, except for a bit of overhead for keeping the dictionary on track.

But how do they actually perform?

For a universe of 5.000 vertices with 5.000 edges, we get these results:

Method Time
using builtin set datatype 1.36
using builtin list datatype 0.10

The results are much more dramatic if the universe of vertices is a lot bigger than the edges. For example, for 20.000 vertices and 5.000 edges, we get

Method Time
using builtin set datatype 20.40
using builtin list datatype 0.03

Essentially, this is so because the set operation x in set basically uses linear search, while our find operation using dictionaries is quite unimpressed.

Moral of the story: while the set datatype is nice, sometimes a tuned list-based operation can be a lot faster.