Intervals

I was recently at an interview, and was asked the following question:

Suppose you have a cafeteria with people coming and going in groups. Your goal is to determine, given a log, whether the cafeteria ever exceeded its capacity. The log is on a per-group basis - each group has a number of people, the time they entered the cafeteria, and the time they left.

The goal, stated formally, is for each time t, does the cafeteria have more people than its maximum capacity? Obviously it is not practical to iterate over t, as the times (and interval size) may be quite large, or of arbitrary granularity. The solution must be to get the log in such a format that this query is obvious. Let's first start with defining a class for the interval:

class Interval(object):
    def __init__(self, start, end, val):
        self.start = start
        self.end = end
        self.val = val

A good way to start is to see where the intervals intersect. Let's write a function to see whether a given interval intersects with another interval.

def intersects_with(self, other):
    return self.end > other.start

It's pretty simple, but gives us a good idea: it may be natural to consider sorting the intervals in some way, for example by start time, or by end time. There is probably a good solution with sorting by end time, but I chose to sort by start time instead.

# given intervals is a list of intervals
intervals.sort(key=lambda interval: interval.start)

Now a solution presents itself: create non-overlapping intervals out of overlapping ones iteratively, until all intervals are non-overlapping. Then we can check the value of each non-overlapping interval to see if the capacity is ever exceeded.

def get_intersection(self, other):
    """Modify the original intervals (self and other) to be non-overlapping.
    If a new interval needs to be created, create and return it. If there is
    no new interval, return None."""
    if self.end > other.start:
        # there is an intersection
        # TODO
        pass
    else:
        # there is no intersection
        return None

What is the worst case? When the intervals all intersect each other. This is easier to see when drawn.

In this case, there are 5 intervals, and a resultant 9 non-overlapping intervals. They are shown below.

Notice that for each pair of intervals, we get back at most 1 new interval and modify the other 2 to make them non-overlapping Also, we only need to compare adjacent intervals, as long as the list stays in sorted order. So for each pair of intervals, we generate at most 1 new interval, while the two others are just replaced. Therefore the maximum number of non-overlapping intervals that can be generated is 2n - 1, where n is the number of intervals we started with.

Since we have sorted the intervals by start time, the best we can now hope for is O(nlogn). The key to the O(nlog(n)) solution is maintaining the list in sorted order. The intervals to compare will be adjacent in the list. We can remove them from the list, modify them, and add them back in the correct place. If an additional interval is generated, it is also be added to the list in the correct place. Each insertion should take O(logn), and a deletion will take either O(1) or O(logn) depending on the data structure used. In a linked list or heap, for example, deletes take O(1).

The unoptimized Python code is available on Github. The optimized Python code will be posted shortly. The slightly less error-prone version of this algorithm is to use a priority queue which keeps the elements in sorted order in order of start time. The performance is highly improved as well. When tested with 100 random elements 100 times, the queue-version performed about 34 times faster. This version of the code is available as a Github gist here. I think this is a good argument for introducing a less painful sorted queue into Python than heapq, which does not support a custom sorting function directly, but requires a work-around using tuples.

There is a second, more efficient approach which I did not come up with during the interview, but a friend of mine suggested. For each interval, split the interval into entering and leaving events. The value of a leaving event would subtract that number of people from the capacity, while the entering event would add that number of people to the capacity. Put the events into a priority queue sorted by time, then simulate the events. The code is available as a Github gist here, and is approximately an order of magnitude faster than my PQ solution, though both run in asymptotic time O(nlogn).