A Solution to Facebook’s ‘Gattaca’ Puzzle
So in an attempt to keep my coding skills sharp, I’ve been making my way through some of Facebook’s Engineering Puzzles. In particular, one puzzle that I found especially interesting was the ‘Gattaca’ problem, and so I thought I’d share my solution to the problem here.
The first thing to notice about the problem is that the talk about the DNA string and gene prediction stuff is pretty much a smokescreen. For this problem, the only thing that really concerns us is the gene predictions, which we’ll just call ‘intervals’. Each interval has three pieces: a start and end index to identify its boundaries, and a value. Our job then is to find the group of non-overlapping intervals that gives us the largest sum.
First, we start with a little boilerplate code. The Interval class is just a small, struct-like class to represent an interval. This is mostly for the sake of improving the readability of the code; one could use a tuple or a dict instead.
# Small container class for intervals. class Interval: def __init__(self, start, end, val): self.start = start self.end = end self.val = val
Next, we parse out the intervals from the input file (given as the first command-line argument to the script). Since we can ignore all the DNA string information, we scan through the file until we find an interval, which will be the only lines that have 3 whitespace-delimited fields.
# Get the list of the intervals from the input file. def get_intervals(): intervals = [] f = open(sys.argv[1], 'r') for line in f: fields = line.strip().split() if len(fields) == 3: # Interval has 3 fields (start, end, val) intervals.append(Interval(*[int(x) for x in fields])) return intervals
Now that we have a list of intervals, the next task is to determine if two intervals overlap or not. An interval is a range of integers denoted by a start index,
, and an end index,
, where
. Say we have a second interval,
denoted by
. Then, if
and
do not overlap, one of two things must be true: either
must start after
ends, or
must end before
starts. In other words, either
or
. If neither of these are true, then the intervals overlap.
# Returns True if intervals a and b do not overlap, false otherwise. def comp_intv(a, b): if a.end < b.start: return True elif b.end < a.start: return True else: return False
We now have all the pieces we need to do a search for the largest sum. For this, I use a technique similar to the one used to find paths through the graph in the Boggle Bash Trainer. We start with an arbitrary interval, call it , and an ‘allowed’ list of intervals that do not overlap with any intervals already in the group. We take this ‘allowed’ list, and pare it down by removing any intervals that overlap with
. If this new ‘allowed’ list is empty, then the maximum total is just the value of
. If the list isn’t empty, then the maximum total is the value of this interval plus the maximum total out of all possible additions from the new ‘allowed’ list.
# Find the max total from a given starting interval and list of compatible # intervals. def find_max(start, allowed): # Find all compatible intervals. my_allowed = [x for x in allowed if comp_intv(x, start)] if not my_allowed: # No more possible additions, so max is just this interval's val return start.val else: # Otherwise, max is the addition of start's val plus the max val # out of start's possible additions. return start.val + max([find_max(x, my_allowed) for x in my_allowed])
Some things to note are that the comp_intv() function from above returns false if called with the same interval for a and b. This ensures that an interval is only used once in a group. Also, note that on subsequent calls to find_max(), each new 'allowed' list is the list of intervals that does not overlap with any of the intervals added to the group thus far. So if say the group has intervals , then the 'allowed' list contains all intervals that do not overlap with
or
. So when we try to add
and pare down the allowed list to remove any intervals that overlap with
, we are left with a list of intervals that does not overlap with any of
.
From here, there is just one last step to find the largest sum. We simply call find_max() for every interval in the input file, then find the largest sum in that list. This gives us the largest possible sum out of all the intervals. Note that in this initial call, the 'allowed' list is simply the list of all intervals (since at this point, no intervals have been added to the group).
total = max([find_max(intv, intervals) for intv in intervals])
The full script can be downloaded here.



