A Solution to Facebook’s ‘Gattaca’ Puzzle

Author: Simmoril | Date: 11.8.2009 | Category: General

So in an attempt to keep my coding skills sharp, I’ve been making my way through some of Facebook’s Engineering Puzzles. In particular, one puzzle that I found especially interesting was the ‘Gattaca’ problem, and so I thought I’d share my solution to the problem here.

The first thing to notice about the problem is that the talk about the DNA string and gene prediction stuff is pretty much a smokescreen. For this problem, the only thing that really concerns us is the gene predictions, which we’ll just call ‘intervals’. Each interval has three pieces: a start and end index to identify its boundaries, and a value. Our job then is to find the group of non-overlapping intervals that gives us the largest sum.

First, we start with a little boilerplate code. The Interval class is just a small, struct-like class to represent an interval. This is mostly for the sake of improving  the readability of the code; one could use a tuple or a dict instead.

# Small container class for intervals.
class Interval:
    def __init__(self, start, end, val):
        self.start = start
        self.end = end
        self.val = val

Next, we parse out the intervals from the input file (given as the first command-line argument to the script). Since we can ignore all the DNA string information, we scan through the file until we find an interval, which will be the only lines that have 3 whitespace-delimited fields.

# Get the list of the intervals from the input file.
def get_intervals():
    intervals = []
    f = open(sys.argv[1], 'r')
    for line in f:
        fields = line.strip().split()
        if len(fields) == 3:
            # Interval has 3 fields (start, end, val)
            intervals.append(Interval(*[int(x) for x in fields]))
    return intervals

Now that we have a list of intervals, the next task is to determine if two intervals overlap or not. An interval A is a range of integers denoted by a start index, a_i, and an end index, a_j, where a_i < a_j. Say we have a second interval, B denoted by (b_i, b_j). Then, if A and B do not overlap, one of two things must be true: either A must start after B ends, or A must end before B starts. In other words, either a_i > b_j or a_j < b_i . If neither of these are true, then the intervals overlap.

# Returns True if intervals a and b do not overlap, false otherwise.
def comp_intv(a, b):
    if a.end < b.start:
        return True
    elif b.end < a.start:
        return True
    else:
        return False

We now have all the pieces we need to do a search for the largest sum. For this, I use a technique similar to the one used to find paths through the graph in the Boggle Bash Trainer. We start with an arbitrary interval, call it I_1, and an ‘allowed’ list of intervals that do not overlap with any intervals already in the group. We take this ‘allowed’ list, and pare it down by removing any intervals that overlap with I_1. If this new ‘allowed’ list is empty, then the maximum total is just the value of I_1. If the list isn’t empty, then the maximum total is the value of this interval plus the maximum total out of all possible additions from the new ‘allowed’ list.

# Find the max total from a given starting interval and list of compatible
# intervals.
def find_max(start, allowed):
    # Find all compatible intervals.
    my_allowed = [x for x in allowed if comp_intv(x, start)]
    if not my_allowed:
        # No more possible additions, so max is just this interval's val
        return start.val
    else:
        # Otherwise, max is the addition of start's val plus the max val
        # out of start's possible additions.
        return start.val + max([find_max(x, my_allowed) for x in my_allowed])

Some things to note are that the comp_intv() function from above returns false if called with the same interval for a and b. This ensures that an interval is only used once in a group. Also, note that on subsequent calls to find_max(), each new 'allowed' list is the list of intervals that does not overlap with any of the intervals added to the group thus far. So if say the group has intervals I_i, I_2, then the 'allowed' list contains all intervals that do not overlap with I_1 or I_2. So when we try to add I_3 and pare down the allowed list to remove any intervals that overlap with I_3, we are left with a list of intervals that does not overlap with any of I_1, I_2, I_3.

From here, there is just one last step to find the largest sum. We simply call find_max() for every interval in the input file, then find the largest sum in that list. This gives us the largest possible sum out of all the intervals. Note that in this initial call, the 'allowed' list is simply the list of all intervals (since at this point, no intervals have been added to the group).

total = max([find_max(intv, intervals) for intv in intervals])

The full script can be downloaded here.

A Perfect Example of What *Not* to Do

Author: Simmoril | Date: 8.8.2009 | Category: General

So… much… fail…

Boggle Bash Through a Programmer’s Eyes

Author: Simmoril | Date: 21.7.2009 | Category: General

Introduction

So, I recently became addicted to an online game called Boggle Bash, which is pretty much the original Boggle with a couple of twists thrown in. Although I’d been playing it for a couple weeks, at the very outset I had an overwhelming urge to build a program to ‘assist’ me in playing Boggle Bash. This is the story of me finally giving into that urge.

The Rules

For those of you who haven’t played Boggle, it’s a word game where 16 die are randomly ‘thrown’ into the slots of a 4×4 grid. The goal of the game is to form as many words as you can before the timer runs out. How do you form a word? Well, you start at any slot and move to any adjacent slot (up,down, left, right, and diagonally) that you haven’t already visited. Here’s a screenshot from the Boggle Bash website to illustrate:

Screenshot of Boggle Bash from Pogo.com

In this example, you can form the word TEN by starting with the T in row 2, column 4, and moving down to the E, then diagonally down to the N. But, you can’t form the word TENSE, since you can’t use the one E twice. There’s some other weird nuances, like no words less than 3 letters long, no proper nouns, no acronyms, etc., but you get the basic idea.

Bending the Rules

Some of you out there might already begin to see how a computer program could play Boggle. One method (and, coincidentally, the one I chose) is brute force, which is a two step process:

  1. Find every single possible legal combination of letters.
  2. Check each combination of letters. If it’s a word, keep it. If not, discard it.

Simple, yet highly effective. Let’s take a look at each step.

Walking a Boggle Board

One way to look at the Boggle board is not as a 4×4 grid, but as a graph with 16 nodes (sorry, my graphviz skills suck, tilt your head to see it oriented properly):

Graph representing a Boggle board

The problem of finding Boggle-legal words now becomes the problem of finding paths on the graph that have 3 or more vertices (without visiting any vertex more than once). Fair warning: the following explanation might not make total sense, and I take full responsibility.

Building the Graph

The first thing we need to do is build a representation of the above graph by creating an adjacency list, which is just a list that tells us which vertices are connected to which.

We start with a 4×4 matrix representing the Boggle board. This matrix is then used to build the adjacency list via the following algorithm: for every vertex in the matrix:

  1. Try to move in all 8 directions (up-left, up, up-right, left, right, down-left, down, down-right).
  2. For each direction, if you haven’t walked off the matrix, add the new vertex to this vertex’s adjacency list.

So, as a quick example, starting with node 0, we see we can walk right, down-right, and down, so the adjacency list for 0 is (1, 4, 5).

Finding all valid paths

Given the adjacency list, we now need to find a list of all valid paths through the graph. We start by trying to find all valid paths of an arbitrary length, call it n. To do this, for every vertex in the graph:

  1. Use the adjacency list to find all vertices adjacent to the current one that you haven’t already visited.
  2. For each adjacent vertex, find all paths of length n – 1 from that vertex. Tack on the current vertex to each path you find and voila! You have now found all paths of length n from the given vertex.

Sound confusing? Yeah, it was to me at first too. Here’s a quick example to illustrate: say you wanted to find all paths of length three. The method above takes us through the following steps:

  1. We start at vertex 0, and the adjacency list tells us we can go to vertices 1, 4, and 5. We now look for all paths of length 2 from these vertices.
  2. We start with vertex 1. The adjacency list tells us we can go to vertices 0, 2, 4, 5, and 6. We discard vertex 0, since we’ve already visited it, leaving 2, 4, 5, and 6. We now look for paths of length 1 from each of these vertices.
  3. We start with vertex 2. Since our path length is now 1, we stop here, and record our first found path of length 3, namely (0 -> 1 -> 2).
  4. We now start with vertex 4 from step 2. Again, the path length is now 1, so we stop here, and record the second found path of length 3, namely (0 -> 1 -> 4).

Building lists of paths for every length from 3 to 16 gives us a list of every valid path through the Boggle board (yes, there are a lot of them!).

Finding Real Words

So now we’ve got this big list of paths like (0 -> 1 -> 2), (0 -> 1 -> 4), (0 -> 1 -> 5), etc. The next task is to turn these paths into strings, and check if the strings are valid words. This step is actually quite a bit easier than building the list of paths. Just substitute letters for each of numbers, put all the letters together, and then use a spell-checker to see if the string you have it a valid word. For example, using the above graph and Boggle screenshot, we can translate the path (0 -> 1 -> 2) into the string JED. Our spell-checker (hopefully) tells us that’s not a word, and so we discard it.

From Theory to Practice

And now for the punchline you’ve all been waiting for: my implementation of the above method is written in python, and consists of two classes: BoggleBoard.py and BoggleTrainer.py. BoggleBoard uses the above method to generate a list of all possible paths. BoggleBoard is actually quite a bit more flexible than is needed, as it can generate valid path lists for Boggle boards of arbitrary dimensions.

BoggleTrainer then uses this path list and a given board layout to generate a list of all possible strings. It then checks those string using the Enchant spell-checking library. Valid words are kept, invalid words are discarded.

Reality is a Harsh Mistress

I field-tested the code for about a week or so, and the results were pretty positive. Running the code with python 2.6.2 on a 3.00 GHz Pentium 4 with 2 GB of RAM, the script was able to find all valid words up to 8 letters long in a reasonable amount of time. However, although the script works as advertised, it turns out there’s a few shortcomings:

  • The script doesn’t run fast enough to be able to check every single possible string in time (a single round of Boggle Bash is 3 minutes). The bottleneck is mostly with the Enchant library, although since the script attempts to cache all possible paths in RAM, that could be a bottleneck as well. However, at longer string lengths, the odds of finding a valid word decrease rapidly. I therefore decided to make a trade-off of only scanning strings up to 8 letters long.
  • The dictionary included with Enchant doesn’t exactly match Boggle Bash’s dictionary. Most notably, names and acronyms show up as false positives, and there are words that Enchant does not recognize as valid (how is ‘online’ not a word?).
  • The script itself doesn’t interface with the actual Boggle client at all. The board layout has to be entered manually as a string of 16 characters, and words from the word list have to be entered back into the Boggle client by hand. However, interfacing with the Boggle Bash client wasn’t really the main point of this exercise, so this is more of a side-note than an actual problem.

What Now?

I’m sure there’s like a kerjillion different optimizations/tweaks/enhancements that I completely missed so please feel free to leave them in the comments. One I thought of right off the top of my head was to spread out the spell checking over multiple cores by using python’s XML-RPC for message passing. Another might be trying to speed up the spell-checker, or doing something with tries. Actually interfacing with the Boggle Bash client might be cool too.

That’s it for my end, thanks to Pogo.com for creating such an enjoyable game, and I hope you’ve had as much fun reading this article as I’ve had writing it!

A Solution to Google Treasure Hunt 2008 Problem 4

Author: Simmoril | Date: 15.4.2009 | Category: General

From Google Treasure Hunt 2008:

Find the smallest number that can be expressed as
the sum of 3 consecutive prime numbers,
the sum of 5 consecutive prime numbers,
the sum of 517 consecutive prime numbers,
the sum of 551 consecutive prime numbers,
and is itself a prime number.

For example, 41 is the smallest prime number that can be expressed as
the sum of 3 consecutive primes (11 + 13 + 17 = 41) and
the sum of 6 consecutive primes (2 + 3 + 5 + 7 + 11 + 13 = 41).

I wrote my solution to this problem in ~60 lines of python. In a nutshell:

1. Get a precomputed list of (1,000,000) primes.

2. Find all possible summations of 551 consecutive prime numbers. If the sum is prime, store this in a list of ‘candidates’.

3. Find all possible summations of 517 consecutive prime numbers. If the sum is in the candidate list from step 2, store it in a new candidate list.

4. Repeat step 3 using this new candidate list for summations of 5 and 3 consecutive prime numbers.

5. The smallest number in the final candidates list is the answer.

The script ran pretty quickly on my machine (3.00 GHz P4), taking about 1-2 minutes to find the answer.

Book Review: The Productive Programmer

Author: Simmoril | Date: 9.9.2008 | Category: General

The Productive ProgrammerHonestly, I have to say I was surprised to see that (at the time of this writing) this book was averaging 4.5 stars on Amazon. I suppose my expectations of this book were heavily swayed by the title. I was expecting something a little more generic and high-level than what I got.

I think the book makes a decent attempt to provide the reader with ‘recipes’ that can be used to improve their efficiency and productivity. But the recipes themselves seem to be a little too free-flowing: by the time I got around to the 3rd or 4th recipe, I had already forgotten the overarching theme of the chapter. I feel like the material might’ve been better presented in the style of O’Reilly’s “Cookbook” book. I doubt a reader, after one pass through the book, will be able to implement every recipe they read about. I think it would be better to structure the book as a reference that they can flip through later.

Another issue I had with the book was that in the later half of the book, many of the examples given contain chunks (sometimes large chunks) of Java, Ruby, and Groovy code. While I’m sure many readers of this book are familiar with those languages, for those who aren’t, the chunks become a serious roadblock to understanding the point of the example. Many of the examples (the Groovy ones, especially) cannot simply be interpreted ‘by sight’. Perhaps a future edition of the book could have some type of fair warning at the beginning.

If you are a java developer, I think you can probably get a lot out of this book. If not, you’ll probably still be able to get a good amount of information out of the first few chapters, but I’d suggest just skimming or skipping the latter chapters altogether.

My rating: 3/5 stars.

Book Review: Programmer’s Guide to nCurses

Author: Simmoril | Date: 28.8.2008 | Category: General

Programmer\'s Guide to NCursesA couple weeks ago, I wanted to start playing around with the python curses module, but it’s been almost forever since I’ve done anything with curses. So, for a bit of a refresher course, I picked up Dan Gookin’s Programmer’s Guide to nCurses (note this book covers the C API, not the python API).

I was actually very surprised at how good of an introduction Dan’s book is. His laid back style of writing helps keep the reader entertained, and his short code snippets make for succinct examples that get right to the point without losing the reader in pages of boilerplate. The book itself is a fairly quick read too; although it officially weighs in at 556 pages, the last 300 pages or so is a reference manual for the nCurses API.

I honestly didn’t have too many problems with the book, other than the occasional typo (including an off-by-one error in the code on p. 177). However, I would warn the reader that I had some issues with a few of Dan’s examples. First, in many of the examples, the error handlers are missing a getch() call, which causes the program to immediately exit when an error occurs without pausing to let you see the errors. Also, in Chapter 8, some of the examples involving wrefresh() wouldn’t work for me unless a call was made to refresh() beforehand (but after that call was made once, all the wrefresh() calls worked after that). Not sure if this is an issue with nCurses or with my system, so YMMV.

I would definitely recommend this book to anyone who is looking to become familiar with the nCurses API. Although the book doesn’t cover more advanced topics, like best practices for designing and building full-fledged nCurses applications, it still makes for a very good starting point.

My Rating: 4/5 stars.

A Solution to the WordNumbers Puzzle

Author: Simmoril | Date: 27.7.2008 | Category: General

Well, it’s been a long time coming, but I finally finished my write-up of the solution to the WordNumbers puzzle.

The WordNumbers puzzle is one of a series of programming challenges posed by ITA Software as part of their hiring process. When I saw the WordNumbers puzzle, the description was so simple, yet when I first started to work on it, I honestly had no idea where to start. But, a couple days, and a couple of key insights later, and I was able to code up a program that yielded the correct answer. However, it turned out that writing a paper describing the solution took a lot longer than writing the actual solution itself!

I’ve omitted my implementation of the solution as well as the actual answer because I don’t want to spoil the puzzle for anyone else. But, for those of you who are dying to know the answer, I hope that my paper points you in the right direction towards implementing the solution on your own.

As always, I welcome any questions, comments or criticisms you may have. If you see any glaring errors and/or omissions, please let me know. It’s been quite a while since I’ve had to write a paper, and this was a tough subject to write on, so please, be kind!