This post is meant as background to the following post on lazy Python lists. If you already know all about cons lists, triggers, lazy evaluation, tail sharing, etc., you don't need to read it.

Linked lists

What Python calls "list" is actually a dynamic array. In most other languages, especially functional languages, "list" refers to some sort of linked list. Here's a basic linked list in Python:
import collections.abc

class Hlist(collections.abc.Iterable):
    class Node:
        def __init__(self, value, next):
            self.value, self.next = value, next
        def __repr__(self):
            return '{}({}, {})'.format(type(self).__name__,
                                       self.value, self.next)
    def __init__(self, iterable):
        self.head = last = None
        for value in iterable:
            node = Hlist.Node(value, None)
            if last is None:
                self.head = node
            else:
                last.next = node
            last = node
    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next
    def __repr__(self):
        return '{}({})'.format(type(self).__name__, list(self))
Of course you can implement all of the methods of the MutableSequence ABC on top of this—although many of them will take linear time, instead of the constant time of arrays (like Python's list). For example:
    def __len__(self):
        return sum(1 for _ in self)
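Likewise, indexing has to walk the list from the head. A minimal sketch (ignoring negative indices and slices):
    def __getitem__(self, index):
        # walk from the head: O(index) instead of the O(1) of an array
        for i, value in enumerate(self):
            if i == index:
                return value
        raise IndexError('Hlist index out of range')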
There are all kinds of variations on linked lists that I won't get into here—double-linked lists, lists that hold a tail pointer as well as a head, lists with sentinel nodes, etc. If you want to know more, the Wikipedia article linked above should cover everything.

Removing the list handle

If you look at the Hlist class, really all it's holding is that head attribute, which is the first node. What if you just passed around the nodes themselves?

That means, among other things, that if you already have the node at which you want to perform an operation, it takes constant time to do so, because you no longer have to iterate from the start of the list.
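For example, splicing a new value in after a node you already have is just a couple of assignments. Here's a hypothetical helper, using the Node type from above:
def insert_after(node, value):
    # constant time, no matter how long the list is or where node sits in it
    node.next = Hlist.Node(value, node.next)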

This also means that an empty list is no longer the same type as a non-empty list—it's None, rather than a Node. But that's not a problem in a duck-typed language like Python:
class Slist(collections.abc.Iterable):
    def __init__(self, head, tail):
        self.head, self.tail = head, tail
    @classmethod
    def fromiter(cls, iterable):
        first, last = None, None
        for value in iterable:
            node = cls(value, None)
            if last is not None:
                last.tail = node
            if first is None:
                first = node
            last = node
        return first
    def __iter__(self):
        node = self
        while node is not None:
            yield node.head
            node = node.tail
    def __repr__(self):
        return '{}({})'.format(type(self).__name__, list(self))
(This wouldn't be a problem in a statically-typed language with a sufficiently powerful type system, either; you'd use a union type or some other higher-level type equivalent to Node|NoneType, or Maybe Node, etc.)

However, this design does run into problems with some of the methods of the MutableSequence ABC. Most notably, insert and __delitem__ can't operate on position 0: the caller's only reference is to the first node itself, and you can't rebind the caller's reference from inside a method.
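For any position past the head, though, destructive insertion works fine. A sketch (a hypothetical method, not quite the MutableSequence signature):
    def insert(self, index, value):
        if not index:
            raise ValueError("can't insert at the head without a list handle")
        if index == 1:
            self.tail = Slist(value, self.tail)
        elif self.tail is None:
            raise IndexError('Slist index out of range')
        else:
            self.tail.insert(index - 1, value)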

You can instead write immutable methods that return a new Slist:
    def inserted(self, index, value):
        # requires: from itertools import chain, islice
        i = iter(self)
        return Slist.fromiter(chain(islice(i, index), [value], i))
… but that obviously has performance implications that may not always be acceptable. A language designed around immutable linked lists (like ML or Haskell) will have ways to optimize away most of the copying costs whenever possible, but that's not going to happen automatically in Python.

Side note: C++ std::list

I won't follow up on this in depth, but C++ (along with a few other languages, like Swift) has a clever way of getting the best of both worlds. The std::list type is itself a list handle, with head and tail pointers, which means you can destructively insert or delete at the head. But it also provides a special kind of opaque pointer (called an iterator, but of course C++ iterators aren't the same kind of thing as Python iterators) to positions within the list, which means that after you (linearly) find some position, you can then perform a bunch of (constant-time, destructive) operations at that position. On top of that, the C++ stdlib is mostly made up of generic functions that work on any such pair of begin and end iterators—whether they come from a linked list, an array, an input stream, or some abstract function that generates values on the fly—as long as the iterators know how to dereference (get the value pointed at), increment (point at the next value), and compare for equality. (In this sense, C++ iterators are related to Python iterators, of course—just as map takes any iterable, no matter what its type or where it came from, std::transform takes any pair of iterators, no matter what their type or where they came from.)

Recursion

Before we go any further, notice that Slist is defined recursively—it's a value plus another Slist (or None). And that means that in many cases, a recursive definition will be simpler than an iterative one. For example, if we didn't use the iterator protocol (and itertools.islice) above, inserted would look like this:
    def inserted(self, index, value):
        snode, dnode, result = self, None, None
        while True:
            if not index:
                node = Slist(value, None)
            else:
                if snode is None:
                    if index > 0:
                        raise IndexError('Slist index out of range')
                    return result
                node = Slist(snode.head, None)
                snode = snode.tail
            if dnode is not None:
                dnode.tail = node
            else:
                result = node
            dnode = node
            index -= 1
Ouch. But a recursive definition looks like this:
    def copy(self):
        if self.tail is None:
            return Slist(self.head, None)
        return Slist(self.head, self.tail.copy())
    def inserted(self, index, value):
        if not index:
            return Slist(value, self.copy())
        if self.tail is None:
            raise IndexError('Slist index out of range')
        return Slist(self.head, self.tail.inserted(index-1, value))

Tail sharing

What if, instead of copying the list to insert a new element, we just shared the tail after the insertion point? Not only does everything get a lot simpler, it can also save a lot of memory.
    def inserted(self, index, value):
        if not index:
            return Slist(value, self)
        if self.tail is None:
            raise IndexError('Slist index out of range')
        return Slist(self.head, self.tail.inserted(index-1, value))
    def extended(self, slist):
        if self.tail is None:
            return Slist(self.head, slist)
        else:
            return Slist(self.head, self.tail.extended(slist))
    # etc.
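A quick check (assuming the fromiter constructor from above) shows the sharing: everything after the insertion point is the same nodes, not copies:
>>> l1 = Slist.fromiter([0, 1, 2, 3])
>>> l2 = l1.inserted(1, 'x')
>>> list(l2)
[0, 'x', 1, 2, 3]
>>> l2.tail.tail is l1.tail
True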

Mutation

This tail sharing means that if we use any mutating methods on lists, like calling __setitem__ or assigning to head or tail, things can quickly get confusing. In fact, even without explicit tail-sharing methods, we can see this easily:
>>> l1 = Slist.fromiter([0, 1, 2, 3])
>>> l2 = l1.tail
>>> l1
Slist([0, 1, 2, 3])
>>> l2.head = 100
>>> l1
Slist([0, 100, 2, 3])
This is effectively the same problem that Python has with binding the same list (or other mutable data type) to two different variables and then mutating one, but it can be harder to see when you're actually only sharing part of the list instead of the whole thing, and when you're using higher-level abstractions like splicing rather than just copying the tail around manually.

Persistence

In the ML/Haskell world, this doesn't come up because all data structures are immutable—if you can't mutate the shared tail, you can't tell whether tails are shared, so it's always safe to do so. But in the Scheme/Racket/Clojure world, where everything is mutable, they solve the problem by making a distinction beyond mutable vs. immutable, to persistent vs. ephemeral. I won't get into the details here (follow the link if you're interested), but basically, any code that uses only persistent operations on its persistent values is effectively referentially transparent, even if it's not purely immutable. Of course this means the stdlib needs naming conventions and/or some other way to make it clear to the reader which functions are non-mutating, which are mutating but safe for ephemeral objects, and which are always destructive, much like Python's reversed vs. list.reverse.

Cons lists

In traditional Lisp, the only compound data structure is the cons (short for "construct"), which is just a pair of values (traditionally called car and cdr, short for "contents of the address part of register" and "contents of the decrement part of register", after the IBM 704 machine-word fields originally used to implement them). Obviously, modern Lisps and derivatives have plenty of other data structures—array-based vectors, hash-based dictionaries, etc.—but the cons is still fundamental to programming in the whole family of languages.

So if Lisp stands for LISt Processing language, where are the lists?

Simple: Just define an empty list as nil (Lisp's name for None), and a non-empty list as a cons of the first element and the rest of the list, and implement functions like insert in the obvious way, and you have a tail-sharing, handle-less, singly-linked list. Of course you can still have conses that aren't lists—a cons is a list iff its tail is a list—but otherwise, this is the exact same thing we've just defined as Slist.

Lisp adds a bit of syntactic sugar to make this nicer: nil can be written as (); cons(1, cons(2, cons(3, nil))) as (1 2 3); cons(1, 2) as (1 . 2); cons(1, cons(2, 3)) as (1 2 . 3). But under the covers (and not very deep under the covers), a list is always a cons, or nil.
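If you want to play with the idea in Python, a cons can be as simple as a pair. A minimal sketch (the names follow Lisp; the helpers are mine):
nil = None

def cons(car, cdr): return (car, cdr)
def car(l): return l[0]
def cdr(l): return l[1]

lst = cons(1, cons(2, cons(3, nil)))   # the list (1 2 3)
assert car(cdr(lst)) == 2              # the second element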

Triggers

Iterators

One of the great things about Python is the iterator protocol, a way to create iterables that only generate new items on demand, and don't store them anywhere. As usual, I'll defer to David Beazley's Generator Tricks for Systems Programmers for a detailed illustration of why this is so cool, but the basic appeal is simple: You can represent a very large iterable without actually storing (or computing up-front) anything but the current value, and whatever state is needed to calculate the next one. For a trivial example:
def count(start=0):
    i = start
    while True:
        yield i
        i += 1

def take(n, it):
    return [next(it) for _ in range(n)]

numbers = count()
odds = filter(lambda x: x%2, numbers)
first_10_odds = take(10, odds)
Here we've created an infinite iterable (all the nonnegative integers), filtered it to remove the even numbers, and taken the first 10. Obviously, if we tried to do this by creating a list of all of the nonnegative integers, it would take infinite time. We could create a finite list of just the nonnegative integers that we're going to need—in this case, range(20) would do, because 19 is the 10th odd number—but in most non-toy examples, it's not as easy (often not even possible) to work that out. For example, a program to compute the first million primes can't know how many integers to filter, unless you already know the first million primes (in which case you don't need the program); a program to pull out the first 20 HTTP packets from a pcap stream can't possibly know how many total packets it will need to look at to find 20 HTTP packets; etc.
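For instance, here's a minimal sketch of that primes example, reusing count and take from above (plain trial division, nothing clever):
def is_prime(n):
    # check divisibility up to the square root
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

first_20_primes = take(20, filter(is_prime, count()))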

Of course I've been both using and implementing the iterator protocol above, just because it makes everything so much simpler and more compact that it's hard not to… But many languages don't have any support for such a thing.

Doing it manually

In such languages, we can get the same benefit, without the syntactic sugar of the iterator protocol (and, sadly, without the semantic abstraction of generators) by using triggers.

In fact, if you look under the covers of the iterator protocol, an iterator is just an object with a __next__ method that returns the next value (and updates its state to be ready to return the one after that).
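For example, here's count rewritten as a plain class with explicit state—roughly what the generator version amounts to (the explicit class is my sketch, not how CPython actually implements generators):
class CountIterator:
    def __init__(self, start=0):
        self.i = start
    def __iter__(self):
        return self             # an iterator is its own iterable
    def __next__(self):
        result = self.i
        self.i += 1             # update state for the next call
        return result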

Triggered lists

You can do something similar with a linked list: instead of storing a value and another linked list, we can store a value and a function that returns another linked list:
class count:
    def __init__(self, start=0):
        self.head = start
    @property
    def tail(self):
        return count(self.head+1)
Or, more generally:
class Tlist:
    def __init__(self, head, trigger):
        self.head, self._trigger = head, trigger
    @property
    def tail(self):
        return Tlist(self._trigger(self.head), self._trigger)

def count(start=0):
    return Tlist(start, lambda value: value+1)
Of course many languages don't have the equivalent of @property either, so you need an explicit function or method call to get the next value—everywhere your code used to say t.tail, it has to say t.tail()—but that's just minor syntactic sugar.

In many cases, you need more state than just the current value (or no state at all), so it makes more sense for the trigger to be a function of no arguments—which can be a closure over whatever state it needs. Also, to allow finite lazy lists, you need some way for the trigger to signal that it's done—by raising StopIteration, by returning a new Tlist or None rather than just a value, etc. I'll use different options below, for illustration.
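For example, here's a hypothetical variant (the names are mine) whose trigger is a zero-argument closure that returns the next node directly, or None to signal the end:
class Ztlist:
    def __init__(self, head, trigger):
        self.head, self._trigger = head, trigger
    @property
    def tail(self):
        return self._trigger()   # the next node, or None when done

def from_range(start, stop):
    if start >= stop:
        return None
    # the lambda closes over start and stop; no current-value argument needed
    return Ztlist(start, lambda: from_range(start + 1, stop))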

Lazy lists

A lazy list is just a triggered list that hides the triggers under the covers—providing the same API as an Slist on top of a Tlist. Thanks to @property, we've almost got that. However, you don't want to have to recalculate the list over and over again; once a node is calculated, the trigger should be replaced by the actual node. Not all languages make it possible to build such a thing, and some have it built in, but for those in the middle, like Python, it's pretty simple:
class Llist:
    def __init__(self, head, trigger):
        self.head, self._trigger, self._tail = head, trigger, None
    @classmethod
    def fromiter(cls, iterable):
        i = iter(iterable)
        return cls(None, i.__next__).tail
    @property
    def tail(self):
        if self._trigger is not None:
            try:
                self._tail = Llist(self._trigger(), self._trigger)
            except StopIteration:
                self._tail = None
            self._trigger = None
        return self._tail
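A quick sanity check—each tail access triggers exactly one step of the underlying iterator:
>>> l = Llist.fromiter(range(3))
>>> l.head
0
>>> l.tail.head
1
>>> Llist.fromiter([]) is None
True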
Just as with Slist above, we can expand this into a full Python Sequence by adding the appropriate methods:
    def __iter__(self):
        yield self.head
        if self.tail is not None:
            yield from self.tail
    def __getitem__(self, index):
        if isinstance(index, slice):
            # None start/step need defaults; stop is required, since the
            # list may be infinite
            return [self[i] for i in range(index.start or 0, index.stop,
                                           index.step or 1)]
        elif not index:
            return self.head
        elif index < 0:
            raise IndexError('Llist cannot take negative indexes')
        elif self.tail is None:
            raise IndexError('Llist index out of range')
        else:
            return self.tail[index-1]
    # etc.
Of course I cheated by implementing __iter__ as a generator, but in a language without generators, you generally don't have to implement anything like __iter__ (Swift being the notable exception). It's not hard to turn that into an explicit class or closure if you need to, just annoyingly verbose:
    def __iter__(self):
        class Iterator:
            def __init__(self, node):
                self.node = node
            def __iter__(self):
                return self
            def __next__(self):
                if self.node is None:
                    raise StopIteration
                result = self.node.head
                self.node = self.node.tail
                return result
        return Iterator(self)

You may want to implement methods like __repr__ specially, to not iterate the entire list, because that would defeat the purpose of being lazy. For example:
    def _lazyiter(self):
        yield self.head
        if self._trigger is not None:
            yield ...
        elif self._tail is not None:
            # recurse via _lazyiter, not plain iteration, or we'd force
            # the whole list just to print it
            yield from self._tail._lazyiter()
    def __repr__(self):
        return 'Llist({})'.format(list(self._lazyiter()))
This is basically what languages like Haskell have built in: an automatically lazy, singly-linked, tail-sharing cons list.

Laziness

There are two important things to observe here.

First, calling __getitem__ (or any other function that takes an index) automatically instantiates all nodes up to the index, while leaving the rest unevaluated. So, it's evaluating nodes exactly as fast as necessary, and no faster.
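A quick check, using the generator version of count from earlier (or itertools.count), and peeking at the private _trigger attribute:
>>> nums = Llist.fromiter(count())
>>> nums[3]
3
>>> nums.tail.tail.tail._trigger is not None   # node 3 is forced, node 4 isn't
True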

Second, if you just iterate over the Llist, without keeping a reference to the Llist itself around, every time you generate the next node, the previous one becomes garbage and can be cleaned up. So, just like an iterator, only a single value is alive at a time, automatically. (But, just like a list, if you want to keep all the values alive, and iterate over them multiple times, you can do that, too, just by keeping the Llist around.) And if you use it in a typical recursive linked-list algorithm instead of a Python iterator loop, it's again going to free up each node as soon as you get the next one. So, a lazy linked list gives you all the advantages of an iterator, while still being a (linked) list.

Linked lists in Python

There's nothing stopping you from using lazy cons lists, other linked-list types, or related types in Python. But they don't fit into the language and stdlib (and third-party libraries) nearly as well as sequences and iterators. The entire Python ecosystem is designed around the iterator, iterable, and sequence protocols, and in particular around looping over iterables, rather than recursing over recursive data structures.

As Guido explains in his post on tail recursion elimination, this is intentional.

So, is this a limitation in Python? Not really. The iterator protocol and other Pythonic idioms bring most of the benefits of recursive programming to iterative programming. An iterator is the same kind of quasi-mutable data structure as a tail-sharing linked list: using it mutates an ephemeral object you don't care about, but if you write everything else immutably, you still get referential transparency. Stack up a bunch of iterator-transforming expressions, and at the end you get an iterator over some new virtual sequence, without affecting the original sequence, without wasting memory, without computing everything up-front, and with the ability to handle infinite sequences (as long as you truncate the iterator in some way at some point, of course). It's a different way of getting the same benefits, but the fact that Python is a bad language for writing idiomatic Haskell is no more of a problem than the fact that it's a bad language for writing idiomatic Java; it's a great language for writing idiomatic Python.
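For example, here's the kind of stack of iterator-transforming expressions I mean—a lazy virtual sequence of the squares of the odd numbers:
from itertools import count, islice

squares = (x*x for x in count() if x % 2)   # infinite, but nothing computed yet
print(list(islice(squares, 5)))             # [1, 9, 25, 49, 81]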

So, if you want to use lazy cons lists, go ahead… but don't expect to get much support from the stdlib, third-party libraries, etc.