ICPC Regionals - Generalized Roman Numerals

This year's ICPC Regionals are over, and all that's left to do is talk about how it went. It went badly. While I could complain about how hacky and error-prone most of the problems at regionals were this year, and how we did not prepare for that style of problem, I prefer to focus on a very interesting problem that was both difficult and elegant.

## Problem Statement

This is problem D from the 2014 ICPC East Central North America Regionals. The text is copied from the ACM website and is in their words.

The ancient Romans developed a terrible numbering system in which I, V, X, L and C stood for 1, 5, 10, 50 and 100, respectively. So XXXVII represents 37 (10+10+10+5+1+1). They typically wrote the numerals in non-increasing order. However, when a single Roman numeral is written before one that is larger, we subtract the smaller from the larger. So we can write IV and IX to represent 4 and 9 (subtracting 1), or XL and XC to represent 40 and 90 (subtracting 10). To represent 94, we would write XCIV.

VIC is generally not considered a traditional Roman numeral, but we can interpret this as another representation of 94: VI is 6, so VIC is 100-6. In general, if we have two expressions a and b representing values v(a) and v(b), then we say that v(ab) is v(a) + v(b) if v(a) ≥ v(b), and v(b) − v(a) otherwise. Unfortunately, this generalization introduces some ambiguity, since different orders of evaluation may result in different values. For example, consider IVX: IV is 4 and X is 10, so by that reasoning IVX is 6. However, I is 1 and VX is 5, so this suggests that IVX is actually 4. To remedy this ambiguity, we allow the addition of parentheses. The question arises: for a given string of Roman numeral characters, how many different values can be obtained using different placements of parentheses?

Time Limit: 3 secs, No. of Test Cases: 33. Each test case consists of a single string containing only the characters I, V, X, L and C. The length of this string will be ≤ 50. A line containing of a single 0 will terminate input.

For each test case, output all possible distinct values that can be represented by the string via the addition of parentheses. Display these values in increasing order.

## Analysis and Solution

The naive solution is to do all possible interpretations of a given string, using the recursive definition. Since n = 50, we end up with exponential blowup, which becomes intractible given the extremely limited time we are given (about 0.1s per test case). The solution to this is to memoize the subsequence results in a hashmap of arrays, which maps sequences to their possible interpretations. Then on each recursion, we can compute the array for each subsequence, and then take all possible combinations of the arrays.

The next thing you might notice is that the string has a maximum size (50), and no element can be larger than 100. This means that the maximum possible value is 5000. In order to avoid duplicate work, we use an array of bools of size 5001 to impose a uniqueness on each generated array.

Here is where it gets a bit tricky. When you run this solution in C++ on some worst-case test case, the runtime is about 15 seconds. That's way too slow. To eliminate some of the slowness, eliminate the recursion overhead by building the subsequences from the ground up, in a DP approach rather than recursion. i.e. First get all possible values for subsequences of size 2, then 3, all the way up to the length of the string.

This approach runs in about 10 seconds per worst-case testcase, which is still not enough.

Then consider that instead of using a hashmap of strings, we could just use a 2D array to store the intermediate values, where the indices are the start of the subsequence, and the length of the subsequence. This solution, though it may surprise you, now runs about 100x faster than the previous one.

As always, you can view my solution on Github.

## Visualization

I implemented the approach outlined above in Javascript. Generate a random roman numeral of length at most 20 (for computational reasons), and see the algorithm in action. For brevity, it will only output the number of combinations, not the combinations themselves.

segSize ?
Number of Solutions ?

Since this code is more optimized for C++ than Javascript, I think that this algorithm might run slower in JS than a some of the earlier proposed solutions. I wonder what sort of optimizations would make it faster...

## Lessons Learned

One of the advantages of using strings over indices was the idea of being able to memoize repeating substrings. However, it turns out that the overhead of doing substring calls repeatedly and using an `unordered_map` to store the result is slower than just recomputing the values for the subsequence.

What I learned from solving this question, is that C++ string methods are very slow, as are C++ `map`s and even `unordered_map`s.