Understanding Counting Sort

There exist a few specialized sorting algorithms that make use of extra knowledge about input data. This means: they exist, and you should know about them, but they work only in certain specialized circumstances. The input data for any sorting algorithm is normally defined like this:

Input: L_x is a list of n objects for which some sort of ordering relationship is defined

Using this you can sort, for example, strings, integers and stereotypes. Counting Sort has a more restricted definition:

Input: L_x is a list of n objects for which some sort of ordering relationship is defined, and
You know in advance the size m of the object universe U_O input items are taken from, and
There exists a relationship r(x) that maps an object to be sorted to an index, and
m is reasonably small.

Hm. Conditions 1-3 are fairly straightforward and can be defined for most input data. For example, when sorting strings:

Input: L_x is a list of n strings
m is the size of the object universe U_O of all strings. Assume your input uses only the 26 letters of the alphabet, and you have a string length of 10 would be 26¹⁰ = 141.167.095.653.376
There exists a relationship r(x) that maps an object to be sorted to an index: lexographically order U_O, and then use the index in that list

However, for many interpretations of reasonably, 141.167.095.653.376 is not a number that is reasonably small. Right? On the other hand, let's assume you have lots of integers in the range 0..100:

Input: L_x is a list of n small integers
m is the size of the object universe U_O of all integers, which is 100
There exists a relationship r(x) that maps an object to be sorted to an index: the integer value itself
m is indeed reasonably small.

Given that "reasonably small" is not an exact definition, you'll have to try and come up with numbers that work on your system, with your data.

The basic idea behind Counting Sort

Assume you have m integers in the range 0..n-1.
Create n integer counters, starting at 0
For each item in the input array: increase the counter associated with the integer value
At the end, recreate the sorted array from information sorted in the counters

For example, assume you have input numbers in the range 0..4, like this:

[3, 2, 0, 2, 2, 0, 2, 2, 1, 1]

If you count the numbers of 0s, 1s, 2s and 3s, you get this:

Number of 0s: 2
Number of 1s: 2
Number of 2s: 5
Number of 3s: 1

A sorted array can now be created using this information:

[0, 0, 1, 1, 2, 2, 2, 2, 2, 3]
 ^^^^
       ^^^^
             ^^^^^^^^^^^^^
                            ^

This should give you an important type of insight: you can use this to recreate sorted arrays, but only if the type of array can be recreated. For example, you couldn't really use this to sort objects in memory, unless you have integer based keys for these objects.

The algorithm

The above lends itself to a straight-forward implementation:

def counting_sort(A):

    # create counters initialized to zero
    frequencies = [0] * (max(A)+1)

    # count each item
    for value in A:
        frequencies[value] += 1

    # recreate sorted array
    writepos = 0
    for value, count in enumerate(frequencies):
        for i in range(count):
            A[writepos] = value
            writepos += 1

    return A

But how does it perform

Here are the results of comparing "sort 1 million randomized integers in the range 0..9999":

Method	Python	Javascript	C#	C/C++
Builtin array-sort	847 ms	443 ms	84 ms	110 ms
Counting sort	553 ms	353 ms	5 ms	0 ms

It is somewhat depressing to see how bad Python performance actually is... Also, I was surprised by C#, I had thought it to be much worse...