Swift-style map and filter views

Along the way to an attempt to port itertools from Python to Apple's new language Swift (see my Stupid Swift Ideas blog), I discovered something interesting: Swift's map and filter functions are in some ways better than Python's. Would the same ideas be worth having in Python? What about for other functions, like the ones in itertools?

Terminology

Swift has the same key types and methods as Python, but uses different names for all of them: Generator for Iterator, Sequence for Iterable, Collection for Sequence, generate() for __iter__(), next() for __next__(), generatorOf() for iter(), etc.

For the most part, everything you know about Python translates over directly. A Python iterator is an iterable that returns itself on __iter__ and is one-pass (looping over it consumes it); a Python sequence is an iterable that can be indexed, does not return itself, and is multi-pass; the exact same things are true for the Swift equivalents.

From here on out, I'll use Python terminology for everything.

map

In Python, map(function, iterable) returns a map object, which is a kind of iterator.

Because it's an iterator, it doesn't generate new values until you ask for them, which means you can map over an large sequence without needing to waste memory storing all the mapped values, or wait until all of the values have been mapped before using them. You can even map over infinite iterators:

squares = map(lambda x: x*x, itertools.count())

If you don't get why iterators are cool, David Beazley explains it better than I ever could.

On the other hand, a map object is single-pass, it's not so nice for debugging, and it can't be indexed. These are all things people who migrate from Python 2.x (where map returns a list) complain about.

Of course you can get the Python 2.x behavior trivially:

def map2(*args, **kwargs):

return list(map(*args, **kwargs))

But Swift actually has something better.

When you call map on an iterator, you get back an iterator. But when you call it on a sequence, you get back a sequence. And not a list, but a special map view sequence. If m = map(func, seq), m[3] gives you func(seq[3]). There may or may not be a permanent or LRU cache on it, I'm not sure.

Anyway, this is pretty easy to get as well:

class MapSequenceView(collections.abc.Sequence):

def __init__(self, func, seq): self.func, self.seq = func, seq

def __getitem__(self, idx): return self.func(self.seq[idx])
def __len__(self): return len(self.seq)

def swiftmap(func, seq):

if isinstance(func, collections.ast.Sequence):

return MapView(func, seq)

else:

return map(func, seq)

This isn't complete code. For example, when __getitem__ is called with a slice, you'll want to call func on each element, not on the whole sliced list. (And if you want it cached, you'll want to use lru_cache or memoize on the individual values only, not on slices, meaning you'll want __getitem__ to delegate to a _getsingle that can be decorated. But I'm assuming we don't want it cached.) But it's not hard to flesh out. (See the documentation on collections.abc.Sequence for exactly what you need to implement to be a proper sequence.)

If you just use the MapSequenceView as an iterable, it works exactly the same as a map, with no extra cost. But if you then want to iterate it again, you can. Or if you want to iterate it in reverse, or jump to index #17, you can, again with no extra cost.

Of course the fact that it's not an iterator means it's not a drop-in replacement. You can't dropwhile a MapSequenceView to get a map starting at the first positive element, or islice it or even next it to skip a value—or, rather, you can, but you'll get a new iterator, leaving the original MapSequenceView unchanged. But that's OK; you wouldn't use this as a drop-in replacement, you'd use this when you want the new behavior. And ideally, dropwhile, islice, etc. should all be built the same way, with their own DropWhileSequenceView, etc. types.

Interestingly, some Swift collections, when sliced, return special FooSlice types that provide a view onto the sliced part of the original collection. But I don't think that's at all necessary, given that an ISliceSequenceView over a MapSequenceView would provide the exact same benefits as a special MapSequenceViewSlice type.

Also, the repr of a Swift map sequence view isn't the mapped sequence, it's essentially the constructor call that shows how to rebuild the same view. I'm not sure whether that's a good thing or a bad thing. (What might be cool is to show something like [f(0), f(1), f(2), f(3)] for map(f, [0, 1, 2, 3])…)

filter

The filter function is even more interesting. When given an iterator, it returns an iterator, but when given a sequence, it returns a sequence. How does that work? How can you know in advance that index #3 in the filtered sequence is index #8 in the original?

You can't. This is where another interesting feature of Swift comes in.

In Swift, sequences aren't necessarily indexed by integers; they're indexed by any kind of object you want, and there's a standard hierarchy of index types: ForwardIndex can't do anything but increment, BidirectionalIndex can also decrement; RandomAccessIndex can also do arithmetic; and Integer is what it sounds like. This idea is borrowed from C++, where the standard library has a bunch of collections that have begin and end methods (which Swift uses a different name for, but I'll ignore that for consistency) that return indexes (which C++ calls iterators, but I'll ignore that again) of the appropriate type, and all of its functions work on those indexes while knowing nothing of the collection itself.

And Swift makes things even more interesting by having the equivalent of range(start, end) for all kinds of indexes rather than just integers, and slice(start, end) as well (in fact, range and slice are the same thing in Swift, because sadly it doesn't support negative indexing). And enumerate returns indexes of the appropriate type too.

So, what good is all that? Well, a FilterSequenceView can return a BidirectionalIndex from its begin and end methods, rather than 0 and len(self), and therefore enumerate returns BidirectionalIndex values. Which means you can look ahead or back from the current position while looping:

for i, value in enumerate(f):
if value == "spam":
if f[i.pred()] == "eggs":
return "complete breakfast"

It also means the reversed function works on any bidirectional sequence, like a FilterSequenceView, because it's implemented as a ReversedSequenceView that wraps the other sequence and returns indexes that work backward.

Of course this requires ints to have succ and pred methods, so the same code will work whether f is random-access or just bidirectional, but that's a pretty minor thing to add.

(I'm cheating a bit here; while Swift has this capability, at least as of beta 2, its standard library doesn't actually use it. The filter function returns a forward-only sequence, which can't be reversed. Also, both bidirectional and forward sequences are basically useless for anything but iteration because the protocol provides no begin and end methods. Each sequence in the stdlib happens to provide startIndex and endIndex properties, but because Swift has statically-typed generics instead of duck typing, that doesn't help if all you know is that you have a sequence…)

Other methods

Sadly, Swift has almost no other functions that create or transform iterables—beyond map and filter, there's enumerate and partial implementations of zip and repeat. But the same idea can be carried forward to all of the functions that Python has in itertools and elsewhere. For the functions in itertools and builtin:

map, zip, count, cycle, repeat, islice, starmap, zip_longest, and product can create return randomly-accessible sequences, bidirectional sequences, forward sequences, or iterators, depending on the weakest of their arguments.
filter, chain, compress, filterfalse, dropwhile, takewhile, and permutation and friends can return bidirectional sequences, forward sequences, or iterators, depending on the weakest argument.
accumulate and groupby can return a forward sequence or iterator, depending on the weakest argument. (If the function had an inverse, accumulate could return a bidirectional sequence, but you can't even guarantee that for the default + operator for all types…)

The recipes are mostly pretty obvious, since they're just wrappers around other functions. However, some of them have their own complex implementations—e.g., grouper, roundrobin, and partition could return a random-access sequence, bidirectional sequence, forward sequence, or iterator depending on the arguments.

Implementation

It's worth noting that any itertools-like function that wants to return a sequence has to be implemented as a custom class, not a generator. In some cases, a single class can provide the best-index-possible behavior, but in other cases, you'd need as many as four separate classes (or three classes and one generator).

If we added a tower of index types that a single sequence class could use without having to care what kind of index is has, then those three classes could, I think, always be reduced to one.

However, in some cases, knowing that you have a random-access sequence allows for a more efficient implementation. (This is actually why the C++ standard originally came up with its iterator tower.)

It should be pretty easy to factor out the details for a delegating sequence, so the only ones that will be difficult to write are the ones that rely on complex implicit state. (It's worth noting that Swift apparently does not have any such factored-out helpers, and implements the same thing repeatedly, with a whole lot of boilerplate. But I think that's down to limitations in the language, and to the fact that fleshing out and polishing the stdlib hasn't been the highest priority, and the fact that they only have a handful of functions so refactoring would only provide so much benefit…)

See seqview on GitHub for a proof of concept.

An even crazier idea

It might be nice to have syntactic support for a "sequence comprehension" that returns the best kind of sequence it can for a given input (the same as the input type if there are no if clauses, no better than bidirectional if there are),

Conclusion

I have no idea which parts of this, if any, are good ideas, or at least good enough to be worth implementing, but I can break them down into separate ideas worth considering:

map and similar functions returning sequence views if given a sequence

all such functions in itertools being updated
helper to make it easier to define such sequence-or-iterator functions in Python

similar helper for C?

bidirectional, and maybe forward, sequences

single sequence type with standard index types, instead of new sequence types
making reversed work on bidirectional sequences
filter and similar functions returning sequence views if given a sequence

all such functions in itertools being updated

helper function to make it easier to define random-or-bidirectional-or-iterator functions in Python

similar helper for C?

sequence comprehensions

However, it's worth remembering that the main use for all of these functions is in iterator-pipeline-style programming, where you're going to be mixing generator expressions and custom generator functions in with your itertools calls. As soon as you have one generator anywhere in the pipeline, the whole thing is an iterator. Plus, except for debugging purposes, such code usually has no reason to restart, index, or reverse an iterator anyway. So, this may all just be idle speculation.

It's been more than a decade since Typical Programmer Greg Jorgensen taught the word about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed.

I haven't posted anything new in a couple years (partly because I attempted to move to a different blogging platform where I could write everything in markdown instead of HTML but got frustrated—which I may attempt again), but I've had a few private comments and emails on some of the old posts, so I

Looking before you leap

Python is a duck-typed language, and one where you usually trust EAFP ("Easier to Ask Forgiveness than Permission") over LBYL ("Look Before You Leap").

Background

Currently, CPython’s internal bytecode format stores instructions with no args as 1 byte, instructions with small args as 3 bytes, and instructions with large args as 6 bytes (actually, a 3-byte EXTENDED_ARG followed by a 3-byte real instruction).

If you want to skip all the tl;dr and cut to the chase, jump to Concrete Proposal.

Many people, when they first discover the heapq module, have two questions:

Why does it define a bunch of functions instead of a container type? Why don't those functions take a key or reverse parameter, like all the other sorting-related stuff in Python? Why not a type?

At the abstract level, it'

Currently, in CPython, if you want to process bytecode, either in C or in Python, it’s pretty complicated.

The built-in peephole optimizer has to do extra work fixing up jump targets and the line-number table, and just punts on many cases because they’re too hard to deal with.

One common "advanced question" on places like StackOverflow and python-list is "how do I dynamically create a function/method/class/whatever"? The standard answer is: first, some caveats about why you probably don't want to do that, and then an explanation of the various ways to do it when you reall

A few years ago, Cesare di Mauro created a project called WPython, a fork of CPython 2.6.4 that “brings many optimizations and refactorings”. The starting point of the project was replacing the bytecode with “wordcode”. However, there were a number of other changes on top of it.

Many languages have a for-each loop.

When the first betas for Swift came out, I was impressed by their collection design. In particular, the way it allows them to write map-style functions that are lazy (like Python 3), but still as full-featured as possible.

In a previous post, I explained in detail how lookup works in Python.

The documentation does a great job explaining how things normally get looked up, and how you can hook them.

But to understand how the hooking works, you need to go under the covers to see how that normal lookup actually happens.

When I say "Python" below, I'm mostly talking about CPython 3.5.

In Python (I'm mostly talking about CPython here, but other implementations do similar things), when you write the following:

def spam(x): return x+1 spam(3) What happens?

Really, it's not that complicated, but there's no documentation anywhere that puts it all together.

I've seen a number of people ask why, if you can have arbitrary-sized integers that do everything exactly, you can't do the same thing with floats, avoiding all the rounding problems that they keep running into.

In a recent thread on python-ideas, Stephan Sahm suggested, in effect, changing the method resolution order (MRO) from C3-linearization to a simple depth-first search a la old-school Python or C++.

Note: This post doesn't talk about Python that much, except as a point of comparison for JavaScript.

Most object-oriented languages out there, including Python, are class-based. But JavaScript is instead prototype-based.

About a year and a half ago, I wrote a blog post on the idea of adding pattern matching to Python.

I finally got around to playing with Scala semi-seriously, and I realized that they pretty much solved the same problem, in a pretty similar way to my straw man proposal, and it works great.

About a year ago, Jules Jacobs wrote a series (part 1 and part 2, with part 3 still forthcoming) on the best collections library design.

In three separate discussions on the Python mailing lists this month, people have objected to some design because it leaks something into the enclosing scope. But "leaks into the enclosing scope" isn't a real problem.

There's a lot of confusion about what the various kinds of things you can iterate over in Python. I'll attempt to collect definitions for all of the relevant terms, and provide examples, here, so I don't have to go over the same discussions in the same circles every time.

Python has a whole hierarchy of collection-related abstract types, described in the collections.abc module in the standard library. But there are two key, prototypical kinds. Iterators are one-shot, used for a single forward traversal, and usually lazy, generating each value on the fly as requested.

There are a lot of novice questions on optimizing NumPy code on StackOverflow, that make a lot of the same mistakes. I'll try to cover them all here.

What does NumPy speed up?

Let's look at some Python code that does some computation element-wise on two lists of lists.

When asyncio was first proposed, many people (not so much on python-ideas, where Guido first suggested it, but on external blogs) had the same reaction: Doing the core reactor loop in Python is going to be way too slow. Something based on libev, like gevent, is inherently going to be much faster.

Let's say you have a good idea for a change to Python.

There are hundreds of questions on StackOverflow that all ask variations of the same thing. Paraphrasing:

lst is a list of strings and numbers. I want to convert the numbers to int but leave the strings alone.

In Haskell, you can section infix operators. This is a simple form of partial evaluation. Using Python syntax, the following are equivalent:

(2*) lambda x: 2*x (*2) lambda x: x*2 (*) lambda x, y: x*y So, can we do the same in Python?

Grammar

The first form, (2*), is unambiguous.

Many people—especially people coming from Java—think that using try/except is "inelegant", or "inefficient". Or, slightly less meaninglessly, they think that "exceptions should only be for errors, not for normal flow control".

These people are not going to be happy with Python.

If you look at Python tutorials and sample code, proposals for new language features, blogs like this one, talks at PyCon, etc., you'll see spam, eggs, gouda, etc. all over the place.

Most control structures in most most programming languages, including Python, are subordinating conjunctions, like "if", "while", and "except", although "with" is a preposition, and "for" is a preposition used strangely (although not as strangely as in C…).

There are two ways that some Python programmers overuse lambda. Doing this almost always mkes your code less readable, and for no corresponding benefit.

Some languages have a very strong idiomatic style—in Python, Haskell, or Swift, the same code by two different programmers is likely to look a lot more similar than in Perl, Lisp, or C++.

There's an advantage to this—and, in particular, an advantage to you sticking to those idioms.

Python doesn't have a way to clone generators.

At least for a lot of simple cases, however, it's pretty obvious what cloning them should do, and being able to do so would be handy. But for a lot of other cases, it's not at all obvious.

Every time someone has a good idea, they believe it should be in the stdlib. After all, it's useful to many people, and what's the harm? But of course there is a harm.

This confuses every Python developer the first time they see it—even if they're pretty experienced by the time they see it:

>>> t = ([], []) >>> t[0] += [1] --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <stdin> in <module>()

On the Python-ideas list, in yet another thread on a way to embed statements in expressions, I raised the issue that the statement-expression distinction, and not having a way to escape it, is important to why Python is so readable. But I couldn't explain exactly why.

In IEEE Floats and Python, I chose to use a tiny "binary6" type because it's easy to show a table of values that helps clarify things.

Someone suggested that you could just as easily write a table of binary64 values, ellipsizing large chunks of it, and it would also be useful for clarifying things.

Everybody knows that "floating point numbers cause problems." Many people have misconceptions about what those problems are, and how to deal with them.

Many tutorials on object-oriented programming conflate inheritance and subtyping. In fact, it's often considered part of the OO dogma that they should be conflated.

This is wrong for Python in a wide variety of ways.

It's very common in a program to want to do two things at once: repaginate a document while still responding to user input, or handle requests from two (or 10000) web browsers at the same time. In fact, pretty much any GUI application, network server, game, or simulator needs to do this.

If you've migrated to Python from C++ or one of its descendants (Java, C#, D, etc.), or to a lesser extent from other OO languages (Objective C, Ruby, etc.), the first time you asked for help on StackOverflow or CodeReview or python-list or anywhere else, the first response you got was probably: "Ge

Back in 2008, Christopher Lenz wrote a great article called The Truth About Unicode in Python, which covers a lot of useful information that you wouldn't get just by reading the Unicode HOWTO, and that you might not even know to ask about unless you were already… well, not a Unicode expert, but more

You've probably been told that you can convert any recursive function into an iterative loop just by using an explicit stack.

If you want to write a forking server, one obvious way to do it is to use the multiprocessing module to create a pool of processes, then hand off requests as you get them.

The problem is that handling a request usually involves reading from and writing to a client socket.

I was recently trying to optimize some C code using cachegrind, and discovered that branch misprediction in an inner loop was the culprit. I began wondering how much anything similar could affect Python code.