Heaps

A heap is a common data structure that can be used to maintain a priority queue, implement a sorting mechanism, and generally keep track of extremal (minimal or maximal) values in an efficient manner. Note that the Python Standard Library contains a heap implementation in heapq, but we're going to write our own first and then take a peak inside the reference implementation.

The Wikipedia article on Heaps contains pointers to a couple of specialized heaps with additional features, but we're going to start off easy with a very simple one: a min heap.

A simple Min Heap implementation

We're going to start with a min heap, and then check how we can generalize it to a MaxHeap. We need a capacity for the heap, and since each level is going to have 2**n items maximum, it makes sense to start off with a capacity that is a power of 2.

class MinHeap(object):
    def __init__(self, capacity = 2 ** 20):
        # the capacity should be a power of two - since all levels in a heap
        # are densely populated, and why would you want to waste space?
        # the default is 1MB of arbitrary data
        self.data = [None] * capacity
        self.used = 0

Inserting an item is a two-step process:

class MinHeap(object):
    ...
    def Insert(self, value):
        assert self.used < len(self.data)

        self.data[self.used] = value
        self.used += 1
        self.PushUp(self.used-1)

That was the easy part. Now for the fun part: the item is at the end of the heap. we need to push it up to a position, where the heap property is satisfied.

class MinHeap(object):
    ...
    def PushUp(self, item_position):
        assert item_position >= 0
        assert item_position < self.used

        # the value to be pushed up the tree
        item_value = self.data[item_position]

        # when does the loop end? when the item has reached 
        # the top position - there is no place higher than the top!
        while item_position > 0:

            # get index and value of parent
            parent_position = GetIndexOfParent(item_position)
            parent_value = self.data[parent_position]

            # if the parent is less than this value, we need to push ourselves up
            if parent_value > item_value:

                self.data[parent_position] = item_value
                self.data[item_position] = parent_value
                item_position = parent_position

            else:
                # the parent is larger, so we are in a good position
                break

        return item_position

This code is missing an important function: GetIndexOfParent. Now, in the literature (e.g. Cormen et al) you'll find the following definitions:

This won't work with Python, because Python arrays are zero-based. So i is really (i+1), and from the result we need to subtract 1 to get to the actual position. Sounds complicated? Let's try to insert it:

Therefor we have:

def GetIndexOfParent(n):
    return (n-1)/2

def GetIndexOfLeftChild(n):
    return (2*n)+1

def GetIndexOfRightChild(n):
    return (2*n)+2

Getting the minimum is trivial:

class MinHeap(object):
    ...
    def GetMin(self):
        assert self.used > 0
        return self.data[0]

Extracting the minimum is another important operation: we special case for a heap that has only one element (so there is no need to invoke Heapify()).

class MinHeap(object):
    ...
    def ExtractMin(self):
        assert self.used > 0
        result = self.data[0]

        if self.used > 1:
        
            # move item from the end of the heap to the top, and heapify it (move it down
            # until the heap property is satisified again)
            self.data[0] = self.data[self.used-1]
            self.used -= 1
            self.Heapify()

        else:
            self.used = 0

        return result

    def Heapify(self):

        # start at the top
        index_of_this_node = 0
        value_of_this_node = self.data[index_of_this_node]

        # loop end criteria: we've reached the last most item in the heap
        while index_of_this_node < self.used:

            # get left child (or None)
            index_of_left_node = GetIndexOfLeftChild(index_of_this_node)
            value_of_left_node = value_of_this_node
            if index_of_left_node < self.used:
                value_of_left_node = self.data[index_of_left_node]

            # get right child (or None)
            index_of_right_node = GetIndexOfRightChild(index_of_this_node)
            value_of_right_node = value_of_this_node
            if index_of_right_node < self.used:
                value_of_right_node = self.data[index_of_right_node]

            # find index of the node that is bigger than ours
            index_of_max_node = index_of_left_node
            value_of_max_node = value_of_left_node
            if value_of_right_node < value_of_max_node:
                value_of_max_node = value_of_right_node
                index_of_max_node = index_of_right_node

            # if there is a match, swap the nodes
            if value_of_max_node < value_of_this_node:
                self.data[index_of_this_node] = value_of_max_node
                self.data[index_of_max_node] = value_of_this_node
                index_of_this_node = index_of_max_node

            else:
                break;

Let's add some more helper functions:

class MinHeap(object):
    ...
    def IsEmpty(self):
        return self.used == 0

    def GetCount(self):
        return self.used
        
    def GetDepth(self):
        if self.used < 2:
            return self.used

        level = 1
        pos = self.used -1

        while pos > 0:
            pos = GetIndexOfParent(pos)
            level += 1

        return level

Pretty-Printing a Heap

One of the more interesting functions to write for a heap class is a repr() method to print the heap as a string. We're going to go for a simple plaintext function. For example, a heap with 15 items could look like this:

           3            
     15          9      
  17    15    12    34  
44 22 40 32 46 23 47 

The code for this has several steps:

class MinHeap(object):
    ...
    def __repr__(self):
        # special case: empty heap
        if self.used == 0:
            return "<MinHeap: None>"

        max_cell_width = 1

        # We expect 2<sup>i</sup> items on level i
        expected_size_of_this_level = 1
        levels = []
        reprs_on_this_level = None

        # We enumerate all items
        for i in range(self.used):
            if reprs_on_this_level is None:
                reprs_on_this_level = []
                levels.append(reprs_on_this_level)

            # Convert the items on this level to strings and store them in a list
            k = repr(self.data[i])
            reprs_on_this_level.append(k)

            # As we go along, keep track of the maximum string size needed to represent a single element.
            if len(k) > max_cell_width:
                max_cell_width = len(k)

            if len(reprs_on_this_level) == expected_size_of_this_level:

                # We expect 2**i items on level i
                expected_size_of_this_level *= 2
                reprs_on_this_level = None

        max_cell_width += 1

        # Center a single word around the expected cell size for this level, adding spaces if necessary
        def center(word, max_cell_width):
            result = [' '] * max_cell_width

            word = list(word)
            insert_pos = (max_cell_width - len(word))/2
            for c in word:
                result[insert_pos] = c
                insert_pos += 1
            return "".join(result)

        # We enumerate the list of string lists in reverse order
        for i in range(len(levels)-1,-1,-1):

            level = levels[i]
            for k in range(len(level)):
                level[k] = center(level[k], max_cell_width)

            # Convert the list of cells to a single string
            levels[i] = "".join(levels[i])

            # Adjust the expected cell size: cell size grows by 2**n
            max_cell_width *= 2

        # Finally, enumerate all strings again from the top down: adding newlines to combine all the lines
        return "\n".join(levels)

Comparing our code with the reference implementation

As mentioned in the beginning, the python standard library contains a reference implementation of the heap data structure. You can read the source on the web at http://svn.python.org/projects/python/trunk/Lib/heapq.py. A couple of comments on that code: