# Disjoint-Set Operations

The following problem arises in some graph algorithms:

- Given a list of edges and vertices (aka "a graph"),
find the
*disjoint*sets of edges and vertices.

I think the easiest way to understand this definition is to look at an example and what it means. The following example is taken from Cormen et al, Introduction to Algorithms:

- Vertices:
`a, b, c, d, e, f, g, h, i, j`

- Edges:
`(b, d), (e, g), (a, c), (h, i), (a, b), (e, f), (b, c)`

If you look at these, then you can see for example that the edges `(b, d)`

and `(a, b)`

share vertices, whereas `(b, d)`

and `(h, i)`

do not. Finding the disjoint
sets means finding all those sets that are distinct for all members (aka disjoint).
In our example, the result of processing the given input data would be:

```
(a, b, c, d) (e, f, g) (h, i) (j)
```

Note that `j`

is a vertex that is not part of any edge, hence
an edgeless graph: so by definition it is disjoint to any other subgraph.

## Ask yourself: Do I understand the problem?

- The problem looks like a task for an algorithm. But why is it mentioned as a data type?
- The reason for this is that the literature
*defines*an abstract data type (ADT) for solving this problem:*data and methods*, and then proceeds to look at various*implementations*of this problem - very much like a heap, a stack, or a linked list is really some abstract set of data + methods that can be implemented in various ways. - In the literature, the ADT is defined as a) a set datatype and b) three methods:
`MAKE-SET(x)`

: create a set out of a single vertex`x`

.`FIND-SET(x)`

: find the set that has the member`x`

.`UNION(u,v)`

: given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).

- Note that
`MAKE-SET(x)`

doesn't need to support operations such as "create a set out of two or more vertices"). - Note that this specification does not tell you
*how sets are represented*, it tells you the*operations*defined on sets, and you choose a datatype best suited to implement those operations, or methods. - An implicit requirement that is important if you think about relaxing the implementation: you must be able to
*compare two sets to find out if they are equal*. Most definitions in the literature brush over this requirement, but it is there.

- For us, input vertices will be integer numbers, and edges will be pairs of integer numbers.

## Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

- Our set members are simply going to be integers, in a range
`SIZE_OF_DOMAIN`

. - We are going to define
`SIZE_OF_EDGES`

random edges by combining two random elements out of the universe of vertices - We will implement a couple of functions for the same task. As usual, we are going to store them in a dictionary along with their name, so that we can call them, time their performance and print out results
- We will verify that all algorithms return the same output for the same input.

Therefor, our test code looks like this:

```
#! -*- Encoding: Latin-1 -*-
def selftest():
import time
# see above: test data parameters
SIZE_OF_DOMAIN = 10000
SIZE_OF_EDGES = 2000
edges = [(random.randrange(SIZE_OF_DOMAIN), random.randrange(SIZE_OF_DOMAIN)) for k in range(SIZE_OF_EDGES)]
vertices = range(SIZE_OF_DOMAIN)
functions = {
'using builtin set datatype' : ds_using_builtin_set_type,
'using builtin list datatype' : ds_using_lists,
}
expected_result = None
for function_name in functions:
function = functions[function_name]
t0 = time.time()
result = len(function(edges, vertices))
t0 = time.time() - t0
print "%32s: %12.2f for result %6d" % (function_name, t0, result, )
if expected_result is None:
expected_result = result
else:
assert result == expected_result
if __name__ == "__main__":
selftest()
```

## Implementation using Python Sets

Python has builtin set datatypes which makes implementing this ADT really simple, and the code really beautiful. Remember our abstract definition:

`MAKE-SET(x)`

: create a set out of a single vertex`x`

.def MAKE_SET(x): return frozenset([x])

`FIND-SET(x)`

: find the set that has the member`x`

.def FIND_SET(x): for subset in sets: if x in subset: return subset

`UNION(u,v)`

: given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).def UNION(set_u, set_v): sets.add(frozenset.union(set_u, set_v)) sets.remove(set_u) sets.remove(set_v)

It doesn't get more simple than that. Subsequently, the whole Python code looks like this:

```
def ds_using_builtin_set_type(edges, vertices):
# given an element, create a set with only this element as member
def MAKE_SET(x):
return frozenset([v])
# create big set of sets
sets = set([MAKE_SET(v) for v in vertices])
# find a set containing element x in a list of all sets
def FIND_SET(x):
for subset in sets:
if x in subset:
return subset
# create a combined set containing all elements of both sets, destroy original sets
def UNION(set_u, set_v):
sets.add(frozenset.union(set_u, set_v))
sets.remove(set_u)
sets.remove(set_v)
# main algorithm: find all connected components
for (u, v) in edges:
set_u = FIND_SET(u)
set_v = FIND_SET(v)
if set_u != set_v:
UNION(set_u, set_v)
return sets
```

To give you a feel how this algorithm works, let's look at some sample data.

```
Edge Processed | Collection of disjoint sets
Initial Sets | {0} {1} {2} {3} {4} {5} {6} {7} {8}
(5, 4) | {0} {1} {2} {3} {4, 5} {6} {7} {8}
(6, 3) | {0} {1} {2} {3, 6} {4, 5} {7} {8}
(6, 2) | {0} {1} {2, 3, 6} {4, 5} {7} {8}
(1, 8) | {0} {1, 8} {2, 3, 6} {4, 5} {7}
(3, 4) | {0} {1, 8} {2, 3, 6, 4, 5} {7}
(1, 0) | {0, 1, 8} {2, 3, 6, 4, 5} {7}
```

## Implementation using Python lists

In much of the literature, the operations are implemented in more basic datastructures - linked lists, arrays and such.
For example, we're going to use lists rather than sets to represent sets. UNION will be simple: it will extend one list with the items of the other.
But FIND would be slow: we'd have to enumerate all items in all sets, giving us a O(N^{2}) runtime in the worst case. So we'll use an additional hash table
to speed up the process of finding elements in the lists. Let's see how this works:

`MAKE-SET(x)`

: create a set out of a single vertex`x`

.This is actually even more simple than the version usingdef MAKE_SET(x): return [x]

`set`

datatypes.`FIND-SET(x)`

: find the set that has the member`x`

.Given thatdef FIND_SET(x): return set_member_lookup[x]

`set_member_lookup`

is our lookup dictionary, this implementation is bound to be fast. But it also kind of cheats: because we haven't specified how we keep items of the dictionary up to date.`UNION(u,v)`

: given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).OK, so this is where the cost of the complexity is hidden: when we join two sets, we need to update all relevant items in the lookup dictionary, and we need to blank out the input set (set it to None)def UNION(set_u, set_v): if sets[set_u] is not None: if sets[set_v] is not None: sets[set_u].extend(sets[set_v]) for k in array_of_sets[set_v]: set_member_lookup[k] = set_u sets[set_v] = None

The complete implementation in Python looks like this:

```
def ds_using_lists(edges, vertices):
# given an element, create a set with only this element as member
def MAKE_SET(x):
return [x]
# create big set of sets
sets = [MAKE_SET(v) for v in vertices]
# create lookup in sets
set_member_lookup = {}
for index, v in enumerate(vertices):
set_member_lookup[v] = index
# find a set containing element x in a list of all sets
def FIND_SET(x):
return set_member_lookup[x]
# create a combined set containing all elements of both sets, destroy original sets
def UNION(set_u, set_v):
if sets[set_u] is not None:
if sets[set_v] is not None:
sets[set_u].extend(sets[set_v])
for k in sets[set_v]:
set_member_lookup[k] = set_u
sets[set_v] = None
# main algorithm: find all connected components
for (u, v) in edges:
set_u = FIND_SET(u)
set_v = FIND_SET(v)
if set_u != set_v:
UNION(set_u, set_v)
return filter(None, sets)
```

You should be able to see that the main code structure is almost identical, except for a bit of overhead for keeping the dictionary on track.

## But how do they actually perform?

For a universe of 5.000 vertices with 5.000 edges, we get these results:

Method | Time |
---|---|

using builtin set datatype | 1.36 |

using builtin list datatype | 0.10 |

The results are much more dramatic if the universe of vertices is a lot bigger than the edges. For example, for 20.000 vertices and 5.000 edges, we get

Method | Time |
---|---|

using builtin set datatype | 20.40 |

using builtin list datatype | 0.03 |

Essentially, this is so because the set operation `x in set`

basically uses linear search, while our find operation using dictionaries is quite unimpressed.

Moral of the story: while the `set`

datatype is nice, sometimes a tuned list-based operation can be a lot faster.