So in an attempt to keep my coding skills sharp, I’ve been making my way through some of Facebook’s Engineering Puzzles. In particular, one puzzle that I found especially interesting was the ‘Gattaca’ problem, and so I thought I’d share my solution to the problem here.

The first thing to notice about the problem is that the talk about the DNA string and gene prediction stuff is pretty much a smokescreen. For this problem, the only thing that really concerns us is the gene predictions, which we’ll just call ‘intervals’. Each interval has three pieces: a start and end index to identify its boundaries, and a value. Our job then is to find the group of non-overlapping intervals that gives us the largest sum.

First, we start with a little boilerplate code. The Interval class is just a small, struct-like class to represent an interval. This is mostly for the sake of improving  the readability of the code; one could use a tuple or a dict instead.

# Small container class for intervals.
class Interval:
    def __init__(self, start, end, val):
        self.start = start
        self.end = end
        self.val = val

Next, we parse out the intervals from the input file (given as the first command-line argument to the script). Since we can ignore all the DNA string information, we scan through the file until we find an interval, which will be the only lines that have 3 whitespace-delimited fields.

# Get the list of the intervals from the input file.
def get_intervals():
    intervals = []
    f = open(sys.argv[1], 'r')
    for line in f:
        fields = line.strip().split()
        if len(fields) == 3:
            # Interval has 3 fields (start, end, val)
            intervals.append(Interval(*[int(x) for x in fields]))
    return intervals

Now that we have a list of intervals, the next task is to determine if two intervals overlap or not. An interval A is a range of integers denoted by a start index, a_i, and an end index, a_j, where a_i < a_j. Say we have a second interval, B denoted by (b_i, b_j). Then, if A and B do not overlap, one of two things must be true: either A must start after B ends, or A must end before B starts. In other words, either a_i > b_j or a_j < b_i . If neither of these are true, then the intervals overlap.

# Returns True if intervals a and b do not overlap, false otherwise.
def comp_intv(a, b):
    if a.end < b.start:
        return True
    elif b.end < a.start:
        return True
    else:
        return False

We now have all the pieces we need to do a search for the largest sum. For this, I use a technique similar to the one used to find paths through the graph in the Boggle Bash Trainer. We start with an arbitrary interval, call it I_1, and an ‘allowed’ list of intervals that do not overlap with any intervals already in the group. We take this ‘allowed’ list, and pare it down by removing any intervals that overlap with I_1. If this new ‘allowed’ list is empty, then the maximum total is just the value of I_1. If the list isn’t empty, then the maximum total is the value of this interval plus the maximum total out of all possible additions from the new ‘allowed’ list.

# Find the max total from a given starting interval and list of compatible
# intervals.
def find_max(start, allowed):
    # Find all compatible intervals.
    my_allowed = [x for x in allowed if comp_intv(x, start)]
    if not my_allowed:
        # No more possible additions, so max is just this interval's val
        return start.val
    else:
        # Otherwise, max is the addition of start's val plus the max val
        # out of start's possible additions.
        return start.val + max([find_max(x, my_allowed) for x in my_allowed])

Some things to note are that the comp_intv() function from above returns false if called with the same interval for a and b. This ensures that an interval is only used once in a group. Also, note that on subsequent calls to find_max(), each new 'allowed' list is the list of intervals that does not overlap with any of the intervals added to the group thus far. So if say the group has intervals I_i, I_2, then the 'allowed' list contains all intervals that do not overlap with I_1 or I_2. So when we try to add I_3 and pare down the allowed list to remove any intervals that overlap with I_3, we are left with a list of intervals that does not overlap with any of I_1, I_2, I_3.

From here, there is just one last step to find the largest sum. We simply call find_max() for every interval in the input file, then find the largest sum in that list. This gives us the largest possible sum out of all the intervals. Note that in this initial call, the 'allowed' list is simply the list of all intervals (since at this point, no intervals have been added to the group).

total = max([find_max(intv, intervals) for intv in intervals])

The full script can be downloaded here.

This entry was posted on Tuesday, August 11th, 2009 at 1:58 pm and is filed under General. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

3 Responses

  1. Taylor says

    Such brute force solution is inefficient for large N=1000000

  2. Simmoril says

    @Taylor

    Yeah, I think the runtime is exponential in the size of the interval list, so that’s obviously not great. But at first glance this problem seems almost identical to the knapsack problem, so I’m guessing there’s no real easy optimizations (but as always, I could be wrong).

  3. Mircea says

    The problem can be solved in O(m + n) time complexity using dynamic programming. (n = number of intervals, m = length of DNA)

Leave a Reply