map-style functions that are lazy (like Python 3), but still as full-featured as possible. On further reflection, I realized that you can get the same kinds of views without needing Swift's complicated idea of generalized indexes. So, as soon as I had some spare time, I wrote up an implementation.

Views
The basic idea is simple:

>>> m = views.map(lambda x: x*x, range(10))
>>> len(m)
10
>>> m[3]
9
>>> list(m)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> list(m)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In other words, map acts like a sequence, not an iterator. But it's still just as lazy as an iterator. If I just write for sq in map(lambda x: x*x, range(10)): break, only the first square ever gets computed. Much like existing views (or "virtual sequences") in the builtins and stdlib, like range, memoryview, and dict_keys.

But try that with filter and there's an obvious problem: you can iterate a filtered list lazily (and from either end), but you can't index it. For example:

>>> f = views.filter(lambda x: x%2, range(10))
>>> list(f)
[1, 3, 5, 7, 9]
>>> f[3]
TypeError: 'filter' object is not subscriptable
How could f[3] work? Well, in this case, with a trivial bit of arithmetic, you can figure out that the fourth odd positive number is 7, but that obviously isn't something that works in general, or that can be automated. The only way the filter object could do it is if it cached each value as you asked for it.

That's actually not an unreasonable design, but it's a different design than what I was going for (or Swift's designers). Caching definitely fits in nicely with lazy singly-linked lists a la Haskell, but in Python, we'd be getting the disadvantages (e.g., iterating a list in reverse takes linear time to get started, and linear space) without the advantages (the zero-space lazy iteration that you get automatically because iterating or tail-recursing on lst.tail leaves the head node unreferenced for the garbage collector). If you want caching, you can always build that as a wrapper view (which I'll get to later), but you can't remove caching if you don't want it.

Anyway, filter can't be a sequence, but it can still be more than an iterator. It's a collection (Python doesn't have an official term for this; I'm following Terry Reedy's suggestion) that can be repeatedly iterated, to generate the same elements each time. It's also reversible.
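For concreteness, here's roughly what such a view could look like. This is just a sketch; the class name and attributes are mine, not necessarily what the actual implementation uses:

from collections.abc import Reversible

class FilterView(Reversible):
    """A sketch of a filter view: re-iterable and reversible, but not a sequence."""
    def __init__(self, func, iterable):
        self.func, self.iterable = func, iterable
    def __iter__(self):
        # a fresh generator every call, so the view can be iterated repeatedly
        return (x for x in self.iterable if self.func(x))
    def __reversed__(self):
        # works as long as the underlying iterable is itself reversible
        return (x for x in reversed(self.iterable) if self.func(x))

As long as the underlying iterable is re-iterable (not a one-shot iterator), every pass yields the same elements.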
So, once you realize that filter can't produce a sequence, what happens if you call map on a filter? You can't get back a sequence, but you can get back a reversible collection. (But notice that map can take multiple iterables for input, and if you pass it two filters, the result can't be reversible--you have to know which one is longer, and filters can't give you their lengths in constant time.)

Implementation
The obvious way to implement map is by building a sequence that forwards each of its operations. For example:

def __getitem__(self, index):
    # deal with slice stuff here
    # deal with negative indices here
    return self.func(*(it[index] for it in self.iterables))
Obviously, if any of the iterables aren't sequences, they'll raise a TypeError, and the map will just pass that through. The magic of duck typing.

But this gets a bit complicated when you consider how to handle negative indices. For example, m = map(add, [1, 2, 3], [10, 20, 30, 40]) has values 11, 22, 33. So, m[-2] is 22, which you can't get from [1, 2, 3][-2] + [10, 20, 30, 40][-2]--you need -3 on the longer input.

Well, at least that's doable. The easiest way is to just map negative indices the way you would in a non-view sequence: if index < 0: index += len(self) (and then if it's still negative, it's an IndexError). And __len__ can just forward to its inputs: min(len(it) for it in self.iterables), and that will duck-type you an error if any of them aren't sized. Since all sequences are sized, this isn't a problem.
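Putting those pieces together, the sequence case might look something like this (a sketch, using the same self.func and self.iterables as the fragment above):

def __len__(self):
    # the view is only as long as its shortest input
    return min(len(it) for it in self.iterables)

def __getitem__(self, index):
    # (slice handling elided)
    if index < 0:
        index += len(self)
    if not 0 <= index < len(self):
        raise IndexError(index)
    # once normalized to a nonnegative index, the same index lines up
    # across all the inputs, which sidesteps the -2-versus--3 problem
    return self.func(*(it[index] for it in self.iterables))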
But now look at __reversed__. You need to do the same kind of work there, but there are reversible iterables that aren't sized. So, how do you handle that? Well, if all the iterables are sized and reversible, you use one algorithm; if not, you can still handle the special case of a single iterable, which doesn't have to be sized, just reversible. This is all doable, but it starts to get into a whole lot of if and try statements all over the place.
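For example, __reversed__ might come out something like this sketch, with a try/except doing double duty as the sized-and-reversible test:

from itertools import islice

def __reversed__(self):
    try:
        # all inputs sized and reversible: skip the extra tail of any
        # longer input, then walk them all backward in lockstep
        n = min(len(it) for it in self.iterables)
        rev = [islice(reversed(it), len(it) - n, None)
               for it in self.iterables]
        return (self.func(*values) for values in zip(*rev))
    except TypeError:
        if len(self.iterables) == 1:
            # a single input doesn't have to be sized, just reversible
            return (self.func(value) for value in reversed(self.iterables[0]))
        raise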
Meanwhile, although everything works out fine for EAFP duck typing, it doesn't work so nicely for LBYL testing. If you map over a filter, or an iterator, you get something that claims to be a sequence (isinstance(m, Sequence) returns true) but raises a TypeError if you try to use it as one.

And it's even worse for static type checking. How would you represent to MyPy that a map over a list is a Sequence, but the exact same type when used for a map over an iterator is not?

One way to solve all of these problems is to have separate classes for map over iterators, map over iterables, map over a single reversible iterable, map over sized reversible iterables, and map over sequences. Each is a subclass of the one before, and each adds or overrides some functionality and also adds at least one ABC. And then, map isn't a constructor for a type, it's a function that figures out which type to construct and does so. Which it obviously does by LBYL against the ABCs.

This also lets you distinguish iterator inputs (so that what you get back doesn't pretend to be a re-iterable view with unstable contents, but is openly just a one-shot iterator). You can't really duck-type that, because it's the less-powerful type (Iterator) that has the extra method (__next__), but you can LBYL it easily. (In my initial implementation, I actually went the other way with this one, but I think that was a mistake.)
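So the map function itself might look roughly like this (a sketch; the view class names stand in for the assumed hierarchy described above):

from collections.abc import Iterator, Reversible, Sequence, Sized

def map(func, *iterables):
    # LBYL against the ABCs, from strongest to weakest
    if all(isinstance(it, Sequence) for it in iterables):
        return MapSequenceView(func, *iterables)
    if all(isinstance(it, Sized) and isinstance(it, Reversible)
           for it in iterables):
        return MapSizedReversibleView(func, *iterables)
    if len(iterables) == 1 and isinstance(iterables[0], Reversible):
        return MapReversibleView(func, *iterables)
    if any(isinstance(it, Iterator) for it in iterables):
        return MapIterator(func, *iterables)  # openly a one-shot iterator
    return MapIterableView(func, *iterables)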
And you can even statically type-check all of this by stubbing map as a bunch of overloads. Since there's no vararg type unpacking in PEP 484 types, you either have to be sloppy:

def map(func: Callable[..., B], *iterables: Sequence) -> MapSequenceView[B]

... or repetitive:

def map(func: Callable[[A0], B], iterable0: Sequence[A0]) -> MapSequenceView[B]
def map(func: Callable[[A0, A1], B], iterable0: Sequence[A0], iterable1: Sequence[A1]) -> MapSequenceView[B]
# ... and so on for up to N arguments, where N is the most any sane person would want

(One nice thing about "gradual static typing" is that the sloppy version makes sense, and is more useful than nothing, despite being sloppy. Maybe one day Python will gain some way to express the complete set of overloads via iteration or recursion instead of repetition, but until then, we don't have to do the N overloads unless we want to.)
But anyway, do one of those, and repeat for inputs that aren't Sequences but are Sized Reversible Iterables, and for a single Reversible, and so on, and the type checker can tell that map(lambda x: x*x, range(10))[3] is valid, but map(lambda x: x*x, filter(lambda x: x%2, range(10)))[3] is a type error.

And that's pretty much exactly what you get from Swift's views, in Python, without any need for generalized indexes.
More views
The code for map is a bit repetitive. Worse, the code for filter shares a lot of the same repetition. But of course they're not identical.

First, how do you express the difference in how they use their function argument? Well, that's pretty easy in terms of iteration: for one element (ignoring the multiple-input case for the moment), you yield from a special method. In the case of map, that special method just does yield self.func(value), while for filter, it's if self.func(value): yield value.
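In code, that might look like this (a simplified single-input sketch; _do_one is my name for the hook, and it's the same method referenced below):

class ViewBase:
    def __init__(self, func, iterable):
        self.func, self.iterable = func, iterable
    def __iter__(self):
        for value in self.iterable:
            yield from self._do_one(value)

class MapView(ViewBase):
    def _do_one(self, value):
        yield self.func(value)

class FilterView(ViewBase):
    def _do_one(self, value):
        if self.func(value):
            yield value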
Meanwhile, there are a lot of small differences in the API: filter can take None for a function (which effectively just means bool), but map can't (it used to, in Python 2, where it effectively just meant lambda *args: tuple(args)). map can take multiple iterables, but filter can't.
And, of course, there's the fact that map is "stronger": given appropriate inputs, it can be anything up to a Sequence, while filter can only be a Reversible.

Will the yield from self._do_one(value) trick work for all reasonable views? Does it work for reverse iteration as well as forward? Can the API differences be generalized in a simple wrapper? Is the notion of "stronger" view functions coherent and linear (and, if so, is a Sized Reversible stronger than a plain Reversible)? There are clearly multiple ways to handle multiple inputs of different lengths (hence the need for zip and zip_longest); can that be wrapped up? It's hard to guess in the abstract case. So, I built a few more views: zip and zip_longest, enumerate, islice...

And then, of course, I got distracted by the fact that slicing an islice view should give you a smaller view. And...

Slicing
All of the views can return sub-views for slices. Should they?
Well, that runs into the usual memory question. On the one hand, copying a largish subslice instead of just referencing it wastes memory for the copy. On the other hand, keeping a large collection alive when it's only referenced by a smallish subslice wastes memory for the rest of the collection. Which one is more of a problem in practice? For some kinds of work, clearly the copying is more of a problem--e.g., people using NumPy often have multiple slices over arrays that take up more than half their memory; if they were copies instead of slices, they wouldn't be able to fit. But in general?
Well, there's obviously a reason Python chose copying semantics. But look at what the copies are: slicing a list gives you a new list; slicing a tuple gives you a new tuple; slicing a string gives you a new string. So, slicing a map view should give you... a new map view? Then that is just a reference, right?
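Concretely, the slice branch that the earlier __getitem__ sketch elided might look something like this, building a sub-view rather than a copy:

def __getitem__(self, index):
    if isinstance(index, slice):
        # normalize against this view's own length, so negative slice
        # bounds line up across inputs of different lengths
        start, stop, step = index.indices(len(self))
        # `map` here is the view-building function from above, not the
        # builtin; if the inputs are themselves views, slicing them
        # returns sub-views in turn, so nothing gets copied
        return map(self.func, *(it[start:stop:step] for it in self.iterables))
    # ... integer indexing as before ...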
Meanwhile, even non-sequences (like filter) can be "isliced" into views. (I've used itertools.islice and the builtin filter together in real-life code.) At first glance, it seems like it would be great if you could give filter views the native slicing syntax. But that might be more than a little odd--remember that iterating a sliced filter requires iterating everything before the start of the slice. Is that the same problem as indexing a filter requiring iterating everything before that index? Not really, because it's O(N) wasted time once per iteration, rather than O(N) wasted time N times for walking the indices, but it still seems bad.

Anyway, I think it won't be clear how far to push slicing until I've played with it some more.
Caching
You can build a caching sequence on top of any iterable:

from collections.abc import Sequence

class CacheView(Sequence):
    def __init__(self, iterable):
        self._cache = []
        self._iterator = iter(iterable)

    def __len__(self):
        # the only way to know the length is to exhaust the input
        self._cache.extend(self._iterator)
        return len(self._cache)

    def __getitem__(self, index):
        # deal with slices here
        if index < 0:
            index += len(self)
        try:
            # note >=: the cache needs index+1 elements before
            # self._cache[index] is valid
            while index >= len(self._cache):
                self._cache.append(next(self._iterator))
        except StopIteration:
            raise IndexError(index) from None
        return self._cache[index]
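With that, the earlier filter example becomes indexable (here wrapping the builtin filter), at the cost of caching everything up to the requested index:

>>> f = CacheView(filter(lambda x: x%2, range(10)))
>>> f[3]
7
>>> list(f)
[1, 3, 5, 7, 9]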
Of course this needs to process the first n elements to do [n], which isn't necessary if you know you have a sequence. And if you know you have something sized and reversible, __len__, [-1], or __reversed__ doesn't need to process the entire input. And so on. In fact, we could build views that wrap each level of the hierarchy, and provide more laziness the more powerful the input is.

We could also build views that present each level of the hierarchy. For example, I may want to wrap an iterator in a repeatable iterable without adding the sequence API. At first glance, that doesn't seem necessary. If I just want to iterate the same iterator twice, tee is already a great way to do that, and with maximal laziness (e.g., if I have one branch at position #200 and another at #201, I've only got two elements stored, not 201). If I want anything more than tee, I probably just want a full sequence, right? But once you consider infinite iterables, you clearly don't want to be forced to wrap those in a sequence which will consume all of your memory if you ever call len instead of raising an exception, right?
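Here's one way such a wrapper might look--a sketch built on tee (the name Reiterable is mine). Each iteration pass tees off a fresh branch, so it stays maximally lazy and is safe on infinite inputs:

from itertools import tee

class Reiterable:
    """Wrap a one-shot iterator as a repeatable iterable--no sequence API."""
    def __init__(self, iterable):
        self._iterator = iter(iterable)
    def __iter__(self):
        # split off a fresh branch for this pass, keeping one for the next
        self._iterator, branch = tee(self._iterator)
        return branch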
Anyway, I'm experimenting with different ideas for the caching views. I'm not sure we actually need to generalize here. Or, if we do, I'm not sure it's the same as in the other cases. For most caches, you either want tee, a "half-sequence" (something with __getitem__ but not __len__, and no negative indices), a full lazy sequence, or just an eager list. Do you need cached reversible iteration?

Generalized indexes
So, we don't need generalized indexes to build views. But they still give you some nice features.
In Swift, find works on any iterable, and gives you an index (or a range) that you can use to reference the found element (or sub-iterable).

For example, imagine that you had to implement str.find and str.replace yourself. That's not too hard:

def find(haystack, needle, start=0):
    while start <= len(haystack) - len(needle):
        if haystack[start:start+len(needle)] == needle:
            return start
        start += 1
    return -1

def replace(haystack, needle, replacement):
    start = 0
    while True:
        start = find(haystack, needle, start)
        if start == -1:
            return haystack
        haystack = haystack[:start] + replacement + haystack[start + len(needle):]
        start += len(replacement)
This works with any sequence type, not just str. But what if you wanted to work on any forward-iterable object, not just sequences? Writing find in terms of iterables would be painful, and then writing replace on top of that would be very difficult. But generalized indexes solve this problem. The idea is that indexes aren't just numbers; they're something like C++ iterators. The only thing you can do with them is compare them, advance them, and use them to index their sequence. Under the covers, they might just be a number (for a list), or they might be a list of numbers (for a tree) or even a reference to a node (for a linked list). But the key is that you can remember multiple iteration positions, advance them independently, and do things like construct slices between them, and it all works for any iterable collection (although not for an iterator).

def find(haystack, needle, start=None):
    if start is None:
        start = haystack.start()
    while start != haystack.end():
        pos = start
        for element in needle:
            if haystack[pos] != element:
                break
            pos = pos.advance()
        else:
            return start, pos
        start = start.advance()
    return start, start

def replace(haystack, needle, replacement):
    start = haystack.start()
    while True:
        start, end = find(haystack, needle, start)
        if start == haystack.end():
            return haystack
        haystack = haystack[:start] + replacement + haystack[end:]
        start = end
The big question is, how often do you need this? Swift has linked lists in its stdlib. More importantly, its strings are stored as UTF-8 but iterated as grapheme clusters, meaning you can't randomly access characters. But they also intentionally designed their string API so that you use slicing and startswith and endswith, so you still rarely need to actually use indices as indices this way.

The one place where they're indispensable in Swift is for mutation. If you want to insert into the middle of a linked list, or a string, you need some way to indicate where you want to insert. With an iterator-based API, you can't mutate like this without a tightly-coupled knowledge of the internals of the thing you're iterating over; instead, you'd have to return a new object (maybe a chain of iterators). But doing complex things immutably instead of mutably is already the norm in Python, and has a number of advantages. In fact, notice that, even with generalized indexes, what came naturally to me in replace above was still an immutable copy, not an in-place mutation. Sometimes, working in-place is a valuable optimization. More often, it's a pessimization--e.g., C++ copies strings all over the place where Python (or manually-managed C) doesn't, and then tries to use copy-on-write, inline-small-string, and other optimizations to reduce that.

So, assuming we have views, including slice views, and the handful of tricky building-block functions (like startswith) are already written, and we don't want to mutate in-place, what do we get out of generalized indexes? Not enough to be worth the complexity, I think.

At any rate, we definitely don't need them for the views themselves to be useful.
Conclusions
Building Swift-like views actually seems easier in Python than in Swift. The lack of generalized indexes is not a problem. Dynamic typing makes things a little easier, not harder (although it would have been more of a problem back in the days before ABCs), and gradual static typing allows us to express (and maybe type-check) the easier and more useful part of the interface without having to work out the whole thing.
Of course the problem is that Python already has a language and a stdlib designed around iteration, not views. So, it's not clear how often your code would be able to take advantage of the expanded interfaces in the first place.