About a year ago, Jules Jacobs wrote a series (part 1 and part 2, with part 3 still forthcoming) on the best collections library design.

Much of what follows is a repeat of ideas from Swift-style map and filter views and other posts like Iterator terminology, but hopefully seeing it from a different viewpoint will make it clearer. (At least writing this response to his post made some of it clearer to me.)

Of course Jacobs acknowledges that there are tradeoffs, and his final best design isn't perfect. It's basically an extension of Clojure's reducer/transducer design (by splitting transducers into refolders and folders, aka transformers and consumers, you can design everything around composing transformers without running into all of the same problems). He's candid about what's missing.

But I think he's missed something huge.

The starting idea is that you want a single, fully-general sequence type, with functions to convert to and from all of your different collection types. This way, you only need to write all of your transformers (map, filter, zip, etc.) once, and, with minimal compiler/runtime support, fromSet.map(f).toSet can be just as efficient as a custom setMap f would be (and likewise for list, array, sorted set, or custom third-party types).
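
To make that concrete, here's a minimal Python sketch of the idea (the Seq, from_set, and to names are hypothetical, purely for illustration):

    # A hypothetical fully-general sequence type: every transformer is
    # written once, against Seq; each collection type only needs a
    # to-conversion and a from-conversion.
    class Seq:
        def __init__(self, iterable):
            self._iterable = iterable
        def __iter__(self):
            return iter(self._iterable)
        def map(self, f):
            return Seq(f(x) for x in self)
        def filter(self, pred):
            return Seq(x for x in self if pred(x))
        def to(self, collection):
            return collection(self)

    def from_set(s):
        return Seq(s)

    # fromSet.map(f).toSet, spelled in Python:
    squares = from_set({1, 2, 3}).map(lambda x: x * x).to(set)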

Of course this isn't a revelation. The fundamental idea behind the STL (which eventually became the nucleus of the C++ standard library) decades ago was that you can split up iteration and algorithms, so if you have N collection types and M algorithms you only need to write N+M functions instead of N*M. And this is built into Python at a fundamental level: map, filter, zip, etc. consume any kind of iterable (which means collections or iterators) and produce an iterator. And most collection types can construct themselves from any iterable. So, the Python equivalent is just set(map(f, x)).
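
You can see the N+M payoff directly in Python; this is all just standard library behavior:

    from collections import deque

    data = [3, 1, 4, 1, 5]

    # map and filter are written once and consume any iterable,
    # producing an iterator...
    doubled = map(lambda x: x * 2, data)
    big = filter(lambda x: x > 1, {3, 1, 4})

    # ...and almost any collection can construct itself from any iterable.
    as_set = set(map(lambda x: x * 2, data))
    as_deque = deque(filter(lambda x: x > 1, data))
    as_sorted_list = sorted(map(str, data))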

But newer collection libraries since 1979's original STL design have run into the same N*M issues, tried to solve them in new ways, and run into new problems doing so. For example, in Scala, map and friends try to return the same collection type as their argument, which means things like zip or concat have to pick one argument's type, work out how to convert the other, and be able to programmatically figure out how to build a list[(T, U)] when specialized on list[T] and set[U]. Just giving the user back a general sequence type, or an iterator, or whatever, that can be lazily and cheaply converted to list[(T, U)] (or to any other type they prefer, or just used as-is), avoids all of those problems.

Back to the future (1979)


But STL had something that Python doesn't, and neither do any of Jacobs's steps toward a solution: instead of just one kind of iteration, there are four kinds. (I'm ignoring output iterators, and mutating algorithms in general, in what follows, to keep things simple.)

Python iteration is one-pass, forward-only (which STL calls "input").

But sometimes this isn't sufficient. For example, if you want to build an average function with one-pass iteration, you need to keep the running total and count in some kind of persistent state. But with (multi-pass) forward iteration, you can just compose the sum and len functions in the obvious way. Jacobs's design takes this into account.
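
In Python terms (using sum and len as the composition, as above):

    # One-pass ("input") iteration: the running total and count have to
    # live in explicit persistent state.
    def average_one_pass(iterable):
        total = count = 0
        for x in iterable:
            total += x
            count += 1
        return total / count

    # Multi-pass forward iteration: just compose two existing algorithms,
    # traversing the data twice. (For a forward-only sequence without len,
    # sum(1 for _ in seq) would serve for the second pass.)
    def average_multi_pass(seq):
        return sum(seq) / len(seq)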

But sometimes, this isn't sufficient either. For example, if you want to build unique_justseen with forward iteration, you need to keep the previously-seen value in some kind of persistent state. But with bidirectional iteration, you can just look at the previous value directly. Or, more simply, imagine how you'd build a reverse function with only forward iteration. (In Python, of course, there's a special method for producing reversed iterables, as part of the Sequence protocol.) Jacobs's design doesn't take this into account; the only way to reverse his sequences is to stack up all of the values.
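
Here's what the forward-iteration version of unique_justseen looks like in Python, with the previously-seen value carried as explicit state (with bidirectional indexing, you'd compare against the previous element directly instead):

    def unique_justseen(iterable):
        """Yield elements, skipping runs of repeated values."""
        sentinel = object()   # a value that can't appear in the input
        prev = sentinel
        for x in iterable:
            if x != prev:
                yield x
            prev = x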

And sometimes, even this isn't sufficient, because you want to randomly access values. For example, there are a variety of sorting, permuting, etc. algorithms that are obvious with random access and either more difficult or impossible without. Jacobs's design doesn't take this into account either.
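
Binary search is the classic case: trivial and O(log N) with random access, and not meaningfully expressible with weaker iteration:

    def binary_search(seq, target):
        """Return the insertion point for target in sorted seq
        (the same contract as the stdlib's bisect.bisect_left)."""
        lo, hi = 0, len(seq)
        while lo < hi:
            mid = (lo + hi) // 2
            if seq[mid] < target:   # random access: O(1) jump to the middle
                lo = mid + 1
            else:
                hi = mid
        return lo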

The way STL handles this is simple, if a bit clumsy in practice: instead of dealing with sequences, you deal with iterator ranges, and there are four different iterator types, with the obvious chain of subtype relationships. Some algorithms require random-access iterators; some only require forward iterators, and will therefore also work with bidirectional or random-access iterators; etc. In languages with the right kind of overloading, you can even have an algorithm that requires bidirectional, but has an optimized specialization for random-access (e.g., it can run in linear instead of log-linear time, or constant rather than linear space, with random-access iterators).
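
Python doesn't have iterator categories, but you can fake the same dispatch-on-strength by testing capabilities at runtime; a rough sketch:

    from collections.abc import Sequence
    from itertools import islice

    def nth(xs, n):
        """Return the nth element: O(1) given random access, O(n) otherwise."""
        if isinstance(xs, Sequence):             # random-access specialization
            return xs[n]
        return next(islice(iter(xs), n, None))   # generic forward fallback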

Swift makes things simpler, even if the idea is more confusing at first. Swift collections use indexing for everything. Random-access iteration is indexing with integers, as you're used to. Bidirectional iteration is indexing with special index types, which you can only get by asking a collection for its start or end or by incrementing or decrementing another index. Forward iteration is the same, except there's no decrement. And so on.

So, in Swift, you define map as a function that takes a sequence, and returns some sequence type (a MapView) that's indexable by whatever indexes worked on the original sequence. When you write m = map(f, x) and then ask for m[idx], you get f(x[idx]). And likewise for zip, and plenty of other functions.
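
A minimal MapView in Python (a sketch of the Swift idea, not Swift's actual implementation):

    class MapView:
        def __init__(self, f, seq):
            self.f, self.seq = f, seq
        def __len__(self):
            return len(self.seq)
        def __getitem__(self, idx):
            return self.f(self.seq[idx])     # computed lazily, on demand
        def __iter__(self):
            return (self.f(x) for x in self.seq)

    m = MapView(lambda x: x * x, [1, 2, 3, 4])
    assert m[2] == 9    # nothing was precomputed or stored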

Of course filter is a bit of a problem; there's no way to get the 7th element of a filtered list without looking at the first 6 predicate-passing elements and all predicate-failing elements up to the next passing one. So, filter can only give you bidirectional indexing, even if you give it a random-indexable sequence. But that's not all that complicated.
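
You can see why in code; this sketch fakes nth-element access on a filtered view, and the walk it has to do is exactly the problem:

    class FilterView:
        def __init__(self, pred, seq):
            self.pred, self.seq = pred, seq
        def __iter__(self):
            return (x for x in self.seq if self.pred(x))
        def __getitem__(self, n):
            # Not random access: a linear walk past every failing element
            # and the first n passing ones.
            for i, x in enumerate(self):
                if i == n:
                    return x
            raise IndexError(n)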

It should be possible to design a language that lets you write the code for map, filter, etc. only once, and have it automatically give you back the appropriate type (which would be inferable at compile time from the sequence type of the argument).

How does this fit in with Jacobs's pros and cons?


The first thing to notice is that as long as you only use the results in one-shot style (even if the input supports more powerful indexing), the semantics are identical to Python's iterators. So you get all the same benefits and limitations.

But if you, e.g., have a random-access sequence, and you pass it through map to get another random-access sequence, the result is still data-parallelizable in exactly the same way as the input, which is an advantage that Python iterators (and lazy lists, etc.) don't have.
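
For example, using the MapView sketch from above, disjoint index ranges can be evaluated independently. (In CPython, threads won't actually speed up a pure-Python f because of the GIL; the point is just that the access pattern parallelizes, which a one-shot iterator's can't.)

    from concurrent.futures import ThreadPoolExecutor

    def parallel_eval(view, chunks=4):
        """Evaluate a random-access view in independent index chunks."""
        n = len(view)
        bounds = [(i * n // chunks, (i + 1) * n // chunks)
                  for i in range(chunks)]
        def eval_range(bound):
            lo, hi = bound
            return [view[i] for i in range(lo, hi)]
        with ThreadPoolExecutor() as pool:
            parts = pool.map(eval_range, bounds)
        return [x for part in parts for x in part]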

So, it's the best of all worlds, right?

Well, not quite; there's a new con that Jacobs's taxonomy didn't anticipate. If you write m = map(f, x) and then ask for m[7] twice, you're going to call f(x[7]) twice. For some algorithms, this could be a big performance hit. (And, if you stop ignoring mutation and side effects, it could even be incorrect or dangerous.)

Of course you can manually slap memoization onto a lazy sequence, but then you're risking the memory benefits. For example, one-shot iterating over all adjacent pairs of values should take only O(2) space, but a naively-memoized map view will use O(N). Sure, you can use an LRU cache with depth 1 for memoization, but then you need to convince yourself that you're never re-calculating anything. And if you can do that, you could probably just as easily have written code that uses explicit persistent state to store the last value, and just stuck with a one-shot iterator.
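
For reference, that explicit-persistent-state version is short (Python 3.10+ ships it as itertools.pairwise):

    def pairwise(iterable):
        """Yield adjacent pairs, keeping only the last value as state."""
        it = iter(iterable)
        prev = next(it, None)    # default: an empty input just yields nothing
        for x in it:
            yield (prev, x)
            prev = x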

I don't know whether this is a real problem, and, if so, how often, and how easy it generally is to work around. It may be a complication worth ignoring, or it may make the entire idea into an attractive nuisance.

Are there really only four kinds of iteration?


The original goal was to get N+M functions instead of N*M. But if there are K kinds of iteration, we've really got N+M*K, which is just as bad as what we started with, right?

Well, no. N, the number of possible collection types, is huge and open-ended; K, the number of possible iteration types, is small and fixed.

Maybe you can imagine ways to extend random-access indexing to things like NumPy multi-dimensional slices or Mathematica tree level selections, but if you think through what map should do with those, it's pretty clear that there shouldn't be any new code to write. (There might be some new algorithms that require these stronger kinds of indexes, but that's fine.)

Also, how many of these can you imagine? People have come up with half a dozen ways to index or iterate over the past half-century, as opposed to dozens of collection types with hundreds of different variations and implementations, with new ones appearing every year.

The hard part


So, I said above that it should be possible to build a language that lets you just write map once, and it'll work appropriately with any of the four iterator types (or any new subtypes added to the iterator tower). Of course if you want to write an algorithm with a special optimized implementation for stronger iterators, you'd need to write it twice, but that's pretty rare, and never mandatory.

And ideally, the static type system should be able to infer that map[a->b, Set[a]] returns the same kind of iterator as Set, but with element type b instead of a. (If filter has to manually specify that it can't handle anything better than bidirectional, instead of the type system inferring it, that's acceptable.)

So, is that doable?

Swift doesn't succeed. You have to write a lot of boilerplate, and to repeat yourself a lot more than should be necessary. But is that a limitation of Swift, or of the idea? I'm working this through in my spare time, but I don't have as much of that as I'd like.

Fold, refold, unfold, spindle, manipulate


Going back to Jacobs's posts, he gets into a number of things I haven't covered here. Ideally, you want to be able to support push producers as well as pull producers, and take advantage of parallelism at the production side as well as the processing side, and provide some way to keep state persistently within a refold--and doing all of those at the same time is impossible.

He takes a step toward a solution, by proposing to redefine map, filter, etc. in terms of what chains of these transformers do. I believe the idea is that you can capture all kinds of transformers by writing a single producer-adapter for each kind of input and a single consumer-adapter for each kind of output. Since he never finished part 3, I'm not positive this is what he was going to write, nor am I positive that it would work, but at least it seems like this is where he was going.

If so, it has a lot in common with replacing the one kind of sequence with four kinds with different indexing. It means that not all functions can work with all producers or all consumers--filter can't work with a random-access consumer, and, likewise, zip can't work with a push producer--but hopefully the system can take care of this automatically (or, at worst, require you to specify it with a simple tag), leaving you to just write the basic implementation of filter in the obvious way.
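
Here's a minimal transducer-style sketch in Python of what I take that to mean (my guess at the shape, not Jacobs's actual code): a transformer is a function from consumer to consumer, so filter is written once and never mentions producers at all:

    def filtering(pred):
        def transform(consumer):
            def step(acc, x):
                return consumer(acc, x) if pred(x) else acc
            return step
        return transform

    def mapping(f):
        def transform(consumer):
            return lambda acc, x: consumer(acc, f(x))
        return transform

    # One pull-producer adapter folds any iterable with any
    # (transformed) consumer.
    def fold(iterable, consumer, init):
        acc = init
        for x in iterable:
            acc = consumer(acc, x)
        return acc

    add = lambda acc, x: acc + x
    pipeline = mapping(lambda x: x * x)(filtering(lambda x: x % 2)(add))
    assert fold(range(5), pipeline, 0) == 10   # squares 0..4, keeps odd: 1 + 9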

The big question is, what happens when you put these two together?

I think his proposal (or at least the one I'm imagining for him) effectively needs an adapter for each combination of the binary questions (push vs. pull, parallelizable, etc.), and, ignoring the nonsensical ones, this means something like 8 adapters on each side. And I think the adapters may need to interact differently with different kinds of iteration, rather than being purely orthogonal, which means we've now got 32 adapters on each side. Put another way, N+M*K+K*J isn't too bad if K and J are small and constant, but writing 64 functions, almost all of which are semantically empty, still seems pretty ugly. Hopefully, if I'm right about all of this, most or all of the adapters could be trivially auto-generated (which means a clever enough language wouldn't require you to implement them at all, of course).

Anyway, that's a lot of ifs toward the end. But unless Jacobs posts part 3, or I get the time to look into this in more detail myself, that's as good as I can do.