Finding a duplicate value in an array

The following is a classic programming problem: finding a duplicate value in an array. The problem can be explained very quickly:

You have a large, potentially huge array. You don't know much about the values except that one of them occurs twice. Find this duplicate value, and return it.

Ask yourself: Do I understand the problem?

Before we proceed, let's stop and think about why this problem comes up in so many programming books.

One reason is that it seems deceptively simple: surely you can find something in an array? The catch is that you can find the duplicate only by somehow - explicitly or implicitly - comparing values against all the other values. Since we are looking for a duplicate, we are potentially comparing a lot of value pairs here, so it would definitely pay off if we could find some way to drastically reduce the number of comparisons.

Setting up a test harness

As usual, our first step is to prepare a testbed. Our strategy for the test will be this:

  1. Create an array of unique random values.
  2. Pick one of the values and append it to the array a second time - that is our duplicate.
  3. Shuffle the array, so the duplicate can end up anywhere.
  4. Assert that our function returns exactly that duplicate.

So let's act upon our plan. Much of it is pretty straight-forward by now:

#! -*- Encoding: Latin-1 -*-

import random

def selftest():
    # let's start small: 20 random unique values
    A = create_array_of_unique_values(20)

    # duplicate a single value
    duplicate_element = A[0]
    A.append( duplicate_element )
    
    # shuffle the array randomly again
    random.shuffle(A)

    print "INPUT: %r" % (A, )
    print " DUPE: %r" % (duplicate_element, )
    
    # we expect our function to find the duplicate exactly
    assert find_duplicate(A) == duplicate_element
    
if __name__ == "__main__":
    selftest()

There is a small function missing here: create_array_of_unique_values. We can use a hash-table to look up values we've added already, thus ensuring that we never add a value twice. Example code:

def create_array_of_unique_values(N):
    # collect random values in a hash-table until we have exactly N unique ones
    temp = {}
    n = N
    while n:
        # draw as many values as we are still missing
        for i in range(n):
            temp[random.randrange(60*N)] = True
        # collisions may have left us short of N unique values; if so, go around again
        m = len(temp)
        assert m <= N
        n = N - m
    return list(temp.keys())

I want to focus your attention on randrange(60*N). This small trick ensures that there are enough distinct random values available to fill the array. Think of a counterexample: if you need an array of 20 unique values, but every value has to be < 10, then by the pigeonhole principle you cannot avoid duplicates. By making the random range much larger than N we make collisions rare, so the loop above terminates after very few rounds.

Brute-Force approach

Let's try brute-force first: for each element in the array, compare it with every other element in the array. If the two elements are equal, we've found our duplicate.

Now, before you skip the next paragraph, keep in mind that several books contain small, but real, bugs in code as simple as the following. Take the time to at least run through the code in your head:

def find_duplicate_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1,n):
            if A[i] == A[j]:
                return A[i]

Using a hash table

We can use a hash-table that marks items we have already visited, similar to what we did when we created the test data. The basic logic is this: for each item in the list, if it is already in the hash-table we've found a duplicate; otherwise we record the item in the hash-table.

def find_duplicate_hash(A):
    d = {}
    for item in A:
        if item in d:
            return item
        d[item] = True

The time complexity of this solution is O(N) (amortized - remember that hash-tables can have bad worst-case behavior for unlucky combinations of hash function and data!), space complexity is O(N). If you look at the code, you see that in the worst case we essentially duplicate the array (for example, if the duplicate is near the end of A).

Comparing this function with brute-force, it is obviously much faster - O(N) instead of O(N^2) comparisons - but it needs (potentially a lot of) additional memory.

Before we proceed, let's look at a minor variant of this approach, using a bit-vector rather than a hash-table. The idea is to use the bit-vector as a hash-table: either the bit for a number is already set, in which case that number is a duplicate, or it is not set, in which case the number hasn't been seen before and we set its bit now. The idea sounds a bit strange, but it is pretty straightforward:

def find_duplicate_bitvec(A):
    d = 0
    for item in A:
        bitmask = 1 << item
        if d & bitmask:
            return item
        d |= bitmask

This solution works, and in theory it has the same complexity as the hash-table approach. (In fact, since the bit-vector is used as a hash-table, the Big-Oh complexity is identical.) But in practice it will most likely be slower: in Python the bit-vector is a single arbitrary-precision integer that can grow to 60*N bits, and every shift, test and update has to work on that huge number, so the individual operations are no longer cheap. Also, this approach carries over to C/C++ or Java only if you manage an explicit bit-array yourself - a single machine word limits you to values in the range 0..31 or 0..63, which kind of limits the fun this function provides. And finally, this approach works only for integers. Try bit-shifting by 3.5 bits!
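
Staying in Python for a moment, here is a minimal sketch of that explicit bit-array idea, assuming the values are known to lie in the range 0..max_value (our test data satisfies this with max_value = 60*N); the name find_duplicate_bitarray is made up for this illustration:

def find_duplicate_bitarray(A, max_value):
    # one bit per possible value, packed eight to a byte
    bits = bytearray(max_value // 8 + 1)
    for item in A:
        byte_index, mask = item // 8, 1 << (item % 8)
        if bits[byte_index] & mask:
            return item
        bits[byte_index] |= mask

This keeps every operation on small, fixed-size integers, at the price of having to know the value range in advance.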

Using a sort function

When thinking about possible algorithms, always ask yourself: but can I sort it? Sorting has simplified numerous problems, and indeed it can simplify ours: we walk over the sorted array, and the duplicate will show up as two equal values right next to each other - which is most easily checked by keeping a "previous" value handy:

def find_duplicate_sorted(A):
    for index, item in enumerate(sorted(A)):
        if index == 0:
            prev = item
            
        elif prev == item:
            return item
            
        else:
            prev = item

The time complexity of this solution is O(N log N), space complexity is O(1) if we are allowed to rearrange the input array, or O(N) if we need to create a second, sorted array (like we do in our example code).
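
If rearranging the input is acceptable, a minimal sketch of the in-place variant could look like this (the name find_duplicate_sorted_inplace is made up for this illustration):

def find_duplicate_sorted_inplace(A):
    A.sort()                     # sorts the caller's list in place
    for i in range(1, len(A)):
        if A[i] == A[i-1]:       # duplicate values are now adjacent
            return A[i]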

Comparing this function with the hash-table approach: it is asymptotically slower (O(N log N) rather than O(N)), but it can get by with very little extra memory if we may sort the array in place, and it does not depend on the quality of a hash function.

But how does it actually perform?

Let's write a small test suite that times all four suggested solutions (it uses the time module, so add an "import time" at the top of the file), like this:

    functions = { 
        'brute-force' : find_duplicate_bf,
        'using sort' : find_duplicate_sorted,
        'using hash' : find_duplicate_hash,
        'using bit-vector' : find_duplicate_bitvec,
    }

    for name in functions.keys():
        t0 = time.time()
        result = functions[name](A)
        print "%20s took %.4f" % (name, time.time() - t0, )
        if result != duplicate_element:
            print "**** WARNING; %r GAVE WRONG RESULT" % (name, )

The results on my machine were this:

         brute-force took 1.6407
          using hash took 0.0012
    using bit-vector took 0.1066
          using sort took 0.0040

Abstraction time: from int to arbitrary objects

At the beginning of this section we restricted our search to integers. Let's consider the impact of using our algorithms on other types of data. I honestly think we ought to sit down calmly, take a stress pill, and think things over.
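
As a quick illustration of where the differences show up (the string values here are made up for this example): the hash-table and sort-based solutions happily accept anything hashable and comparable, while the bit-vector variant only makes sense for non-negative integers:

words = ["apple", "pear", "plum", "apple"]
print find_duplicate_hash(words)     # works, prints 'apple'
print find_duplicate_sorted(words)   # works, strings sort just fine
# find_duplicate_bitvec(words)       # raises TypeError - you cannot bit-shift by a string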

Let's revisit the brute force algorithm from above.

def find_duplicate_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1,n):
            if A[i] == A[j]:
                return A[i]

What features of A are we actually using here?

  1. The fact that items in A can be accessed sequentially
  2. The fact that pairs of items can be tested for equality
  3. The fact that an item can be a return value

Condition (3) seems like overkill: think of a huge object in memory - do we really need to pass it around? Would the index of the duplicate not suffice as a return value for our function? Well yes, it would, and all the algorithms above can be trivially modified to return the index of the duplicate rather than the duplicate itself. Doing so makes the algorithm more general, which is generally desirable.
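
As a sketch of what that modification might look like for the brute-force version (the name find_duplicate_index_bf is made up here; returning j, the position of the later occurrence, is an arbitrary choice - returning i would work just as well):

def find_duplicate_index_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1, n):
            if A[i] == A[j]:
                return j    # hand back the position, not the (possibly huge) value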

How do our conditions fare with the other solutions shown above?