Finding a duplicate value in an array

The following is a classic programming problem: finding a duplicate value in an array. The problem can be explained very quickly:

You have a large, potentially huge array. You don't know much about the values except that one of them occurs twice. Find this duplicate value, and return it.

Ask yourself: Do I understand the problem?

Before we proceed, let's stop and think about why this problem comes up in so many programming books.

One reason is that it seems deceptively simple: surely you can find something in an array? The catch is that you can find the duplicate only by somehow - explicitly or implicitly - comparing values against all the other values. Since we are looking for a duplicate, we are potentially comparing a lot of value pairs here, so it would definitely pay off if we could find some way to drastically reduce the number of comparisons.

Setting up a test harness

As usual, our first step is to prepare a testbed. Our strategy for the test will be this:

  1. Create an array of unique random values.
  2. Pick one of the values and append it to the array a second time - that is our duplicate.
  3. Shuffle the array, so the duplicate can end up anywhere.
  4. Assert that our function returns exactly that duplicate.

So let's act upon our plan. Much of it is pretty straight-forward by now:

#! -*- Encoding: Latin-1 -*-

import random

def selftest():
    # let's start small: 20 random unique values
    A = create_array_of_unique_values(20)

    # duplicate a single value
    duplicate_element = A[0]
    A.append( duplicate_element )
    
    # shuffle the array randomly again
    random.shuffle(A)

    print "INPUT: %r" % (A, )
    print " DUPE: %r" % (duplicate_element, )
    
    # we expect our function to find the duplicate exactly
    assert find_duplicate(A) == duplicate_element
    
if __name__ == "__main__":
    selftest()

There is a small function missing here: create_array_of_unique_values. We can use a hash-table to look up values we've added already, thus ensuring that we never add a value twice. Example code:

def create_array_of_unique_values(N):
    # collect random values in a hash-table until we have exactly N unique ones
    temp = {}
    n = N
    while n:
        # draw as many values as we are still missing
        for i in range(n):
            temp[random.randrange(60*N)] = True
        # collisions may have left us short of N unique values; if so, go around again
        m = len(temp)
        assert m <= N
        n = N - m
    return list(temp.keys())

I want to focus your attention on randrange(60*N). This small trick ensures that there are enough distinct random values available to fill the array. Think of a counterexample: if you need an array of 20 unique values, but every value has to be < 10, then by the pigeonhole principle you cannot avoid duplicates. By making the random range much larger than N we make collisions rare, so the loop above terminates after very few rounds.

Brute-Force approach

Let's try brute-force first: for each element in the array, compare it with every other element in the array. If the two elements are equal, we've found our duplicate.

Now, before you skip the next paragraph, keep in mind that several books contain small, but real, bugs in code as simple as the following. Take the time to at least run through the code in your head:

def find_duplicate_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1,n):
            if A[i] == A[j]:
                return A[i]

Using a hash table

We can use a hash-table that marks items we have already visited, similar to what we did when we created the test data. The basic logic is this: for each item in the list, if it is already in the hash-table we've found a duplicate; otherwise we record the item in the hash-table.

def find_duplicate_hash(A):
    d = {}
    for item in A:
        if item in d:
            return item
        d[item] = True

The time complexity of this solution is O(N) (amortized - remember that hash-tables can have bad worst-case behavior for unlucky combinations of hash function and data!), space complexity is O(N). If you look at the code, you see that in the worst case we essentially duplicate the array (for example, if the duplicate is near the end of A).

Comparing this function with brute-force, it is obviously much faster - O(N) instead of O(N^2) comparisons - but it needs (potentially a lot of) additional memory.

Before we proceed, let's look at a minor variant of this approach, using a bit-vector rather than a hash-table. The idea is to use the bit-vector as a hash-table: either the bit for a number is already set, in which case that number is a duplicate, or it is not set, in which case the number hasn't been seen before and we set its bit now. The idea sounds a bit strange, but it is pretty straightforward:

def find_duplicate_bitvec(A):
    d = 0
    for item in A:
        bitmask = 1 << item
        if d & bitmask:
            return item
        d |= bitmask

This solution works, and in theory it has the same complexity as the hash-table approach. (In fact, since the bit-vector is used as a hash-table, the Big-Oh complexity is identical.) But in practice it will most likely be slower: in Python the bit-vector is a single arbitrary-precision integer that can grow to 60*N bits, and every shift, test and update has to work on that huge number, so the individual operations are no longer cheap. Also, this approach carries over to C/C++ or Java only if you manage an explicit bit-array yourself - a single machine word limits you to values in the range 0..31 or 0..63, which kind of limits the fun this function provides. And finally, this approach works only for integers. Try bit-shifting by 3.5 bits!
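
Staying in Python for a moment, here is a minimal sketch of that explicit bit-array idea, assuming the values are known to lie in the range 0..max_value (our test data satisfies this with max_value = 60*N); the name find_duplicate_bitarray is made up for this illustration:

def find_duplicate_bitarray(A, max_value):
    # one bit per possible value, packed eight to a byte
    bits = bytearray(max_value // 8 + 1)
    for item in A:
        byte_index, mask = item // 8, 1 << (item % 8)
        if bits[byte_index] & mask:
            return item
        bits[byte_index] |= mask

This keeps every operation on small, fixed-size integers, at the price of having to know the value range in advance.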

Using a sort function

When thinking about possible algorithms, always ask yourself: but can I sort it? Sorting has simplified numerous problems, and indeed it can simplify ours: we walk over the sorted array, and the duplicate will show up as two equal values right next to each other - which is most easily checked by keeping a "previous" value handy:

def find_duplicate_sorted(A):
    for index, item in enumerate(sorted(A)):
        if index == 0:
            prev = item
            
        elif prev == item:
            return item
            
        else:
            prev = item

The time complexity of this solution is O(N log N), space complexity is O(1) if we are allowed to rearrange the input array, or O(N) if we need to create a second, sorted array (like we do in our example code).
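
If rearranging the input is acceptable, a minimal sketch of the in-place variant could look like this (the name find_duplicate_sorted_inplace is made up for this illustration):

def find_duplicate_sorted_inplace(A):
    A.sort()                     # sorts the caller's list in place
    for i in range(1, len(A)):
        if A[i] == A[i-1]:       # duplicate values are now adjacent
            return A[i]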

Comparing this function with the hash-table approach: it is asymptotically slower (O(N log N) rather than O(N)), but it can get by with very little extra memory if we may sort the array in place, and it does not depend on the quality of a hash function.

But how does it actually perform?

Let's write a small test suite that times all four suggested solutions (it uses the time module, so add an "import time" at the top of the file), like this:

    functions = { 
        'brute-force' : find_duplicate_bf,
        'using sort' : find_duplicate_sorted,
        'using hash' : find_duplicate_hash,
        'using bit-vector' : find_duplicate_bitvec,
    }

    for name in functions.keys():
        t0 = time.time()
        result = functions[name](A)
        print "%20s took %.4f" % (name, time.time() - t0, )
        if result != duplicate_element:
            print "**** WARNING; %r GAVE WRONG RESULT" % (name, )

The results on my machine were this:

         brute-force took 1.6407
          using hash took 0.0012
    using bit-vector took 0.1066
          using sort took 0.0040

Abstraction time: from int to arbitrary objects

At the beginning of this section we restricted our search to integers. Let's consider the impact of using our algorithms on other types of data. I honestly think we ought to sit down calmly, take a stress pill, and think things over.
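
As a quick illustration of where the differences show up (the string values here are made up for this example): the hash-table and sort-based solutions happily accept anything hashable and comparable, while the bit-vector variant only makes sense for non-negative integers:

words = ["apple", "pear", "plum", "apple"]
print find_duplicate_hash(words)     # works, prints 'apple'
print find_duplicate_sorted(words)   # works, strings sort just fine
# find_duplicate_bitvec(words)       # raises TypeError - you cannot bit-shift by a string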

Let's revisit the brute force algorithm from above.

def find_duplicate_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1,n):
            if A[i] == A[j]:
                return A[i]

What features of A are we actually using here?

  1. The fact that items in A can be accessed sequentially
  2. The fact that pairs of items can be tested for equality
  3. The fact that an item can be a return value

Condition (3) seems like overkill: think of a huge object in memory - do we really need to pass it around? Would the index of the duplicate not suffice as a return value for our function? Well yes, it would, and all the algorithms above can be trivially modified to return the index of the duplicate rather than the duplicate itself. Doing so makes the algorithm more general, which is generally desirable.
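
As a sketch of what that modification might look like for the brute-force version (the name find_duplicate_index_bf is made up here; returning j, the position of the later occurrence, is an arbitrary choice - returning i would work just as well):

def find_duplicate_index_bf(A):
    n = len(A)
    for i in range(n):
        for j in range(i+1, n):
            if A[i] == A[j]:
                return j    # hand back the position, not the (possibly huge) value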

How do our conditions fare with the other solutions shown above?