Finding the longest common subsequence

You have two large, potentially huge array of objects, in a random order. You want to find the longest subsequence common to both arrays. The term subsequence means that the items of the sequence must retain the original order, but are not necessarily next to each other.

For example, take these arrays:

A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36]
B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 17]

There are many valid subsequences; for example, [14, 8, 18, 1, 36] is a valid subsequence of A, and [99, 14] is one of B. But what is the longest common subsequence? It turns out to be [14, 8, 27].

Ask yourself: Do I understand the problem?

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

The code for this function is:

#! -*- Encoding: Latin-1 -*-

def selftest():
    A = [random.randrange(100) for i in range(20)]
    B = [random.randrange(100) for i in range(20)]

    methods = { 'lookup dictionary' : lcs_lookup_dict, 
                'naive recursive backwards' : lcs_recursive_backwards,
                'native recursive forwards' : lcs_recursive_forwards,
                'recursive forwards indices' : lcs_recursive_forwards_indices,
                'iterative' : lcs_iterative,
                'iterative with sequence' : lcs_iterative_with_sequence,
                'iterative space optimal' : lcs_iterative_space_optim }

    expected_result = None
    for method_name in methods.keys():
        t0 = time.time()
        result = methods[method_name](A, B, 0, 0)
        print "%30s: %s in %.4f sec" % (method_name, result, time.time() - t0, )
        if expected_result is None:
            expected_result = result
        else:
            assert expected_result == result

if __name__ == "__main__":
    selftest()

Brute-Force approach

There is no brute-force approach. Really, there is one, but it is not feasible: there are n! possible subsequences for A and m! for B. If you want to compare all n! to all m! subsequences, you are in for a really long haul. We need something better.

Recursive approach using Optimal Substructure

What does the term Optimal Substructure mean? Informally, it means this: We can construct an optimal solution for n elements out of the optimal solutions for (n-1) elements. This is straight forward recursive: solution(x) = f(solution(x-1)).

Let's see how this relates to our problem.

  1. Take a look at the following two sequences:
    A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36]
    B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 36]
    
    Both A and B end in 36. So any possible substructure in A[1..n-1] and B[1..n-1] can be extended by adding A[n]=36. In other words: if both A and B end in the same item: the result is 1 + the longest common subsequence of A[:-1] and B[:-1]. Now, what happen's if they DO NOT match?
  2. A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36]
    B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 22]
    
    If they do not match, then either A[-1] or B[-1] must go. So the result will be the maximum of the longest common subsequence of A[-1], and the longest common subsequence of B[-1].

Once you understand this definition, it is trivial to write a recursive implementation:

def lcs_recursive_backwards(A, B):

    # if one of either A or B is empty, this means that there is no possible common subsequence left
    if not A or not B:
        return 0

    # case 1 of the recursion: A and B end in the same value
    if A[-1] == B[-1]:
        return 1 + lcs_recursive_backwards(A[:-1], B[:-1])

    # case 2 of the recursion: they don't match, so we need to test deleting either one
    return max(
        lcs_recursive_backwards(A, B[:-1]),
        lcs_recursive_backwards(A[:-1], B))

It is also straightforward to see that the direction of this recursion doesn't matter.

  1. Take a look at the following two sequences:
    A = [99, 57, 32, 8, 17, 27, 20, 18, 1, 36]
    B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 84]
    
    Both A and B begin with 99. So any possible substructure in A[1..n] and B[1..n] can be extended by adding A[0]=99 at the beginning. In other words: if both A and B begin in the same item: the result is 1 + the longest common subsequence of A[1:] and B[1:]. And if they do not match:
  2. A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36]
    B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 22]
    
    If they do not match, then either A[0] or B[0] must go. So the result will be the maximum of the longest common subsequence of A[1:], and the longest common subsequence of B[1:].

Which gives us this code:

def lcs_recursive_forwards(A, B):
    if not A or not B:
        return 0

    if A[0] == B[0]:
        return 1 + lcs_recursive_forwards(A[1:], B[1:])

    return max(
        lcs_recursive_forwards(A, B[1:]),
        lcs_recursive_forwards(A[1:], B))

This code still leaves a lot to be desired: all that harmless list slice statements like A[1:] and B[1:] all allocate new temporary arrays. We can do better by utilizing indices instead: rather than actually cutting the array, we just pass the original array and the new end position, like this:

def lcs_recursive_backwards_indices(A, B, i = 0, j = 0):
    if i >= len(A) or j >= len(B):
        return 0

    elif A[i] == B[j]:
        return 1 + lcs_recursive_backwards_indices(A, B, i+1, j+1)

    return max(
        lcs_recursive_backwards_indices(A, B, i, j+1),
        lcs_recursive_backwards_indices(A, B, i+1, j))

Recursive Approach with Backtracking

If you run the code, you'll see pretty soon that it is awfully slow. The main reason is that it calculates the same subsequence over and over and over again. For example, I did a test run with 10 (ten) items: one of the (i, j) combinations was called in total 48.620 times. And if you add a single item, so you now test with 11 (eleven) items, that number goes up to a mind-boggling 152.416 times. Clearly, the lcs_recursive definition isn't very useful in its naive implementation. What we need is memoization: keeping track of combinations already seen, and storing them in memory. In Python this is trivial (since dictionaries can have tuples as indices):

# lookup dictionary
k = {}

def lcs_lookup_dict(A, B, i, j):

    # try dictionary first
    try:
        return k[(i, j)]
    except KeyError:
        pass

    # quite similar to lcs_recursive_backwards_indices, with one change:
    # - the result is stored in a temp variable, so that it can be kept in memory
    if i >= len(A) or j >= len(B):
        result = 0

    elif A[i] == B[j]:
        result = 1 + lcs_lookup_dict(A, B, i+1, j+1)

    else:
        result = max(
            lcs_lookup_dict(A, B, i, j+1),
            lcs_lookup_dict(A, B, i+1, j))

    k[(i, j)] = result
    return result

This gives a much much better performance. The following is an example comparing two sequences with a whopping 12 (twelve) items each:

lcs_recursive_backwards_indices: 1 in 1.4740 sec
      naive recursive backwards: 1 in 1.4490 sec
      native recursive forwards: 1 in 1.3290 sec
              lookup dictionary: 1 in 0.0010 sec

So, some surprises here.

So, the algorithm with memoization is OK fast. Here are some results:

So we need something better still.

Iterative Solution with Memoization

It turns out that the recursive solution checks all combinations anyway. So we might as well do those explicitly, by enumerating all possible positions, storing temporary results in a cache. We only need to do this in a way that ensures we never read any cache value before we have written it. It turns out that this is pretty simple:

def lcs_iterative(A, B, i, j):

    def empty_list():
        return [0] * (len(B)+1)

    K = [empty_list() for temp in range(len(A)+1)]
    for i in range(len(A)-1,-1,-1):
        for j in range(len(B)-1,-1,-1):
            if A[i] == B[j]:
                K[i][j] = 1+K[i+1][j+1]

            else:
                K[i][j] = max(K[i][j+1], K[i+1][j])

    return K[0][0]

Compare the code logic to the one from the last recursive solution:

    if i >= len(A) or j >= len(B):
        result = 0

    elif A[i] == B[j]:
        result = 1 + lcs_lookup_dict(A, B, i+1, j+1)

    else:
        result = max(
            lcs_lookup_dict(A, B, i, j+1),
            lcs_lookup_dict(A, B, i+1, j))

You can see that they are basically identical:

It is pretty easy to extend the solution to actually give us the sequence members (not just the sequence length), if we do this:

The following code does this:

def lcs_iterative_with_sequence(A, B, i, j):

    def empty_list():
        return [0] * (len(B)+1)

    # Step 1: build lookup array to determine LCS
    K = [empty_list() for temp in range(len(A)+1)]
    for i in range(len(A)-1,-1,-1):
        for j in range(len(B)-1,-1,-1):
            if A[i] == B[j]:
                K[i][j] = 1+K[i+1][j+1]

            else:
                K[i][j] = max(K[i][j+1], K[i+1][j])

    # Step 2: restore sequence from information in lookup array
    S = []
    i = 0
    j = 0
    while (i < len(A)) and (j < len(B)):
        if A[i] == B[j]:
            S.append(A[i])
            i += 1
            j += 1
        elif K[i+1][j] >= K[i][j+1]:
            i += 1

        else:
            j += 1
    return S

One final optimization: in the code above, we have a two-dimensional array. But it turns out that if we don't need to reconstruct the array, then we don't need all that variables either. Two arrays with lookups for A and B suffice. The following code implements this idea:

def lcs_iterative_space_optim(A, B, i, j):

    X = [0] * (len(A)+1)
    Y = [0] * (len(B)+1)

    m = min(len(Y), len(X))
    for i in range(len(A)-1,-1,-1):
        for j in range(len(B)-1,-1,-1):
            if A[i] == B[j]:
                X[j] = 1+Y[j+1]

            else:
                X[j] = max(Y[j], X[j+1])

        for n in range(m):
            Y[n] = X[n]

    return X[0]

OK, given all that: how do these variants perform? On an input of 1.000 items (that we failed to analyse using the recursive solution), we get this:

       iterative with sequence: 176 in 0.4360 sec
                     iterative: 176 in 0.4360 sec
       iterative space optimal: 176 in 0.4220 sec

Much better.