Linked lists
What Python calls "list" is actually a dynamic array. In most other languages, especially functional languages, "list" refers to some sort of linked list. Here's a basic linked list in Python:

```python
import collections.abc

class Hlist(collections.abc.Iterable):
    class Node:
        def __init__(self, value, next):
            self.value, self.next = value, next
        def __repr__(self):
            return '{}({}, {})'.format(type(self).__name__, self.value, self.next)

    def __init__(self, iterable):
        self.head = last = None
        for value in iterable:
            node = Hlist.Node(value, None)
            if last is None:
                self.head = node
            else:
                last.next = node
            last = node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def __repr__(self):
        return '{}({})'.format(type(self).__name__, list(self))
```

Of course you can implement all of the methods of the MutableSequence ABC on top of this—although many of them will take linear time, instead of the constant time of arrays (like Python's list). For example:
```python
def __len__(self):
    return sum(1 for _ in self)
```

There are all kinds of variations on linked lists that I won't get into here—doubly-linked lists, lists that hold a tail pointer as well as a head, lists with sentinel nodes, etc. If you want to know more, the Wikipedia article linked above should cover everything.
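To make the linear-time cost concrete, here's a hedged sketch of a couple of Sequence-style methods on top of the class above (restated minimally so the snippet stands alone; the `__getitem__` here is my own illustrative addition, not a full MutableSequence):

```python
import collections.abc

# Minimal restatement of Hlist, plus two linear-time sequence methods.
class Hlist(collections.abc.Iterable):
    class Node:
        def __init__(self, value, next):
            self.value, self.next = value, next

    def __init__(self, iterable):
        self.head = last = None
        for value in iterable:
            node = Hlist.Node(value, None)
            if last is None:
                self.head = node
            else:
                last.next = node
            last = node

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def __len__(self):
        # O(n): has to walk the whole list and count.
        return sum(1 for _ in self)

    def __getitem__(self, index):
        # O(n): has to walk index nodes from the head.
        for i, value in enumerate(self):
            if i == index:
                return value
        raise IndexError('Hlist index out of range')

hl = Hlist([10, 20, 30])
print(len(hl))  # 3
print(hl[2])    # 30
```

Compare an array-backed list, where both operations are constant time; here every index lookup pays for a walk from the head.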
Removing the list handle
If you look at the Hlist class, really all it's holding is that head attribute, which is the first node. What if you just passed around the nodes themselves? That means, among other things, that if you already have the node at which you want to perform an operation, it takes constant time to do so, because you no longer have to iterate from the start of the list.
This also means that an empty list is no longer the same type as a non-empty list—it's None, rather than a Node. But that's not a problem in a duck-typed language like Python:
```python
class Slist(collections.abc.Iterable):
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

    @classmethod
    def fromiter(cls, iterable):
        first, last = None, None
        for value in iterable:
            node = cls(value, None)
            if last is not None:
                last.tail = node
            if first is None:
                first = node
            last = node
        return first

    def __iter__(self):
        node = self
        while node is not None:
            yield node.head
            node = node.tail

    def __repr__(self):
        return '{}({})'.format(type(self).__name__, list(self))
```

(This also wouldn't be a problem in a statically-typed language with a sufficiently powerful type system; you'd use a union type or some other higher-level type equivalent to Node|NoneType, or Maybe Node, etc.)
However, this design does run into problems with some of the methods of the MutableSequence ABC. Most notably, insert and __delitem__ can't operate on position 0, because you can't change the head of the list when the only reference you have is to the head of the list.
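To see why position 0 is special, here's a minimal sketch (`Node`, `insert_after`, and `to_list` are illustrative names of mine, not from the classes above): splicing in after a node you already hold is a constant-time mutation, but there is no way to splice in before the first node, because nothing you hold points at it; all you can do is build a new head node, which existing references to the old head won't see.

```python
class Node:
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def insert_after(node, value):
    # O(1): splice a new node in right after `node`.
    node.tail = Node(value, node.tail)

def to_list(node):
    out = []
    while node is not None:
        out.append(node.head)
        node = node.tail
    return out

lst = Node(1, Node(2, Node(3)))
insert_after(lst, 99)   # fine: the list is now 1, 99, 2, 3
# Inserting *before* the first node can only be done by rebinding:
lst = Node(0, lst)      # anyone else holding the old head still sees 1, ...
print(to_list(lst))     # [0, 1, 99, 2, 3]
```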
You can instead write immutable methods that return a new Slist:
```python
from itertools import chain, islice

def inserted(self, index, value):
    i = iter(self)
    return Slist.fromiter(chain(islice(i, 0, index), [value], i))
```

… but that obviously has performance implications that may not always be acceptable. A language designed around immutable linked lists (like ML or Haskell) will have ways to optimize away most of the copying costs whenever possible, but that's not going to happen automatically in Python.
Side note: C++ std::list
I won't follow up on this in depth, but C++ (along with a few other languages, like Swift) has a clever way of getting the best of both worlds. The std::list type is itself a list handle, with head and tail pointers, which means you can destructively insert or delete at the head. But it also provides a special kind of opaque pointer (called an iterator, but of course C++ iterators aren't the same kind of thing as Python iterators) to positions within the list, which means that after you (linearly) find some position, you can then perform a bunch of (constant-time, destructive) operations at that position. On top of that, the C++ stdlib is mostly made up of generic functions that work on any such pair of head and tail pointers—whether they come from a linked list, an array, an input stream iterator, or some abstract function that generates values on the fly—as long as they know how to dereference (get the value pointed at), increment (point at the next value), and compare for equality. (In this sense, C++ iterators are related to Python iterators, of course—just as map takes any iterable, no matter what its type or where it came from, std::transform takes any pair of iterators, no matter what their type or where they came from.)

Recursion
Before we go any further, notice that Slist is defined recursively—it's a value and another Slist. And that means in many cases, a recursive definition will be simpler than an iterative one. For example, if we didn't use the iterator protocol (and itertools.islice) above, inserted would look like this:

```python
def inserted(self, index, value):
    snode, dnode, result = self, None, None
    while True:
        if not index:
            node = Slist(value, None)
        elif snode is None:
            if index > 0:
                raise IndexError('Slist index out of range')
            break
        else:
            node = Slist(snode.head, None)
            snode = snode.tail
        if dnode is not None:
            dnode.tail = node
        else:
            result = node
        dnode = node
        index -= 1
    return result
```

Ouch. But a recursive definition looks like this:
```python
def copy(self):
    if self.tail is None:
        return Slist(self.head, None)
    return Slist(self.head, self.tail.copy())

def inserted(self, index, value):
    if not index:
        return Slist(value, self.copy())
    if self.tail is None:
        raise IndexError('Slist index out of range')
    return Slist(self.head, self.tail.inserted(index-1, value))
```
Tail sharing
What if, instead of copying the list to insert a new element, we just shared the tail after the insertion point? Not only does everything get a lot simpler, this can also potentially save a lot of memory.

```python
def inserted(self, index, value):
    if not index:
        return Slist(value, self)
    if self.tail is None:
        raise IndexError('Slist index out of range')
    return Slist(self.head, self.tail.inserted(index-1, value))

def extended(self, slist):
    if self.tail is None:
        return Slist(self.head, slist)
    else:
        return Slist(self.head, self.tail.extended(slist))

# etc.
```
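We can check directly that the tail really is shared rather than copied, using a minimal self-contained Slist like the one above: only the nodes before the insertion point are new objects; everything after it is the very same nodes.

```python
class Slist:
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

    @classmethod
    def fromiter(cls, iterable):
        first, last = None, None
        for value in iterable:
            node = cls(value, None)
            if last is not None:
                last.tail = node
            if first is None:
                first = node
            last = node
        return first

    def inserted(self, index, value):
        if not index:
            return Slist(value, self)
        if self.tail is None:
            raise IndexError('Slist index out of range')
        return Slist(self.head, self.tail.inserted(index-1, value))

l1 = Slist.fromiter([1, 2, 3, 4])
l2 = l1.inserted(1, 99)          # the list 1, 99, 2, 3, 4
# node 2 onward is shared, not copied:
print(l2.tail.tail is l1.tail)   # True
```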
Mutation
This tail sharing means that if we use any mutating methods on lists, like calling __setitem__ or assigning to head or tail, things can quickly get confusing. In fact, even without explicit tail-sharing methods, we can see this easily:

```python
>>> l1 = Slist.fromiter([0, 1, 2, 3])
>>> l2 = l1.tail
>>> l1
Slist([0, 1, 2, 3])
>>> l2.head = 100
>>> l1
Slist([0, 100, 2, 3])
```

This is effectively the same problem that Python has with binding the same list (or other mutable data type) to two different variables and then mutating one, but it can be harder to see when you're actually only sharing part of the list instead of the whole thing, and when you're using higher-level abstractions like splicing rather than just copying the tail around manually.
Persistence
In the ML/Haskell world, this doesn't come up because all data structures are immutable—if you can't mutate the shared tail, you can't tell whether tails are shared, so it's always safe to do so. But in the Scheme/Racket/Clojure world, where everything is mutable, they solve the problem by making a distinction beyond mutable vs. immutable, to persistent vs. ephemeral. I won't get into the details here (follow the link if you're interested), but basically, any code that uses only persistent operations on its persistent values is effectively referentially transparent, even if it's not purely immutable. Of course this means the stdlib needs naming conventions and/or some other way to make it clear to the reader which functions are non-mutating, which are mutating but safe for ephemeral objects, and which are always destructive, much like Python's reversed vs. list.reverse.

Cons lists
In traditional Lisp, the only compound data structure is the cons (named for the assembly instruction originally used to implement it, which is short for "construct"), which is just a pair of values (traditionally called car and cdr, for the assembly instructions used to implement them, short for "contents of address register" and "contents of data register"). Obviously, modern Lisps and derivatives have plenty of other data structures—array-based vectors, hash-based dictionaries, etc.—but the cons is still fundamental to programming in the whole family of languages.

So if Lisp stands for LISt Processing language, where are the lists?
Simple: Just define an empty list as nil (Lisp's name for None), and a non-empty list as a cons of the first element and the rest of the list, and implement functions like insert in the obvious way, and you have a tail-sharing, handle-less, singly-linked list. Of course you can still have conses that aren't lists—a cons is a list iff its tail is a list—but otherwise, this is the exact same thing we've just defined as Slist.
Lisp adds a bit of syntactic sugar to make this nicer: nil can be written as (); cons(1, cons(2, cons(3, nil))) as (1 2 3); cons(1, 2) as (1 . 2); cons(1, cons(2, 3)) as (1 2 . 3). But under the covers (and not very deep under the covers), a list is always a cons, or nil.
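As a sketch, here's the same construction in Python, with cons, car, and cdr as plain functions over pairs (the names follow the Lisp tradition; this is purely illustrative, not how any real Lisp implements them):

```python
nil = None  # standing in for Lisp's nil, the empty list

def cons(car_, cdr_):
    return (car_, cdr_)

def car(c):
    return c[0]

def cdr(c):
    return c[1]

lst = cons(1, cons(2, cons(3, nil)))   # the list (1 2 3)
pair = cons(1, 2)                      # the dotted pair (1 . 2)

def is_list(c):
    # a cons is a list iff its tail is a list; nil is a list
    while c is not nil:
        if not isinstance(c, tuple):
            return False
        c = cdr(c)
    return True

print(car(lst), car(cdr(lst)))       # 1 2
print(is_list(lst), is_list(pair))   # True False
```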
Triggers
Iterators
One of the great things about Python is the iterator protocol, a way to create iterables that only generate new items on demand, and don't store them anywhere. As usual, I'll defer to David Beazley's Generator Tricks for Systems Programmers for a detailed illustration of why this is so cool, but the basic appeal is simple: You can represent a very large iterable without actually storing (or computing up-front) anything but the current value, and whatever state is needed to calculate the next one. For a trivial example:

```python
def count(start=0):
    i = start
    while True:
        yield i
        i += 1

def take(n, it):
    return [next(it) for _ in range(n)]

numbers = count()
odds = filter(lambda x: x % 2, numbers)
first_10_odds = take(10, odds)
```

Here we've created an infinite iterable (all the nonnegative integers), filtered it to remove the even numbers, and taken the first 10. Obviously, if we tried to do this by creating a list of all of the nonnegative integers, it would take infinite time. We could create a finite list of just the nonnegative integers that we're going to need—in this case, range(20) would do, because 19 is the 10th odd number—but in most non-toy examples, it's not as easy (often not even possible) to work that out. For example, a program to compute the first million primes can't know how many integers to filter, unless you already know the first million primes (in which case you don't need the program); a program to pull out the first 20 HTTP packets from a pcap stream can't possibly know how many total packets it will need to look at to find 20 HTTP packets; etc.
Of course I've been both using and implementing the iterator protocol above, just because it makes everything so much simpler and more compact that it's hard not to… But many languages don't have any support for such a thing.
Doing it manually
In such languages, we can get the same benefit, without the syntactic sugar of the iterator protocol (and, sadly, without the semantic abstraction of generators) by using triggers. In fact, if you look under the covers of the iterator protocol, an iterator is just an object with a __next__ method that returns the next value (and updates its state to be ready to return the one after that).
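For example, that state-updating part of __next__ can be hand-rolled as a closure: a zero-argument function that returns the next value and advances its captured state each time it's called (`make_counter` is an illustrative name of mine):

```python
def make_counter(start=0):
    # a "trigger": a zero-argument function that returns the next
    # value and updates captured state, just like __next__ does
    state = [start]
    def next_value():
        value = state[0]
        state[0] += 1
        return value
    return next_value

nxt = make_counter(10)
print(nxt(), nxt(), nxt())  # 10 11 12
```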
Triggered lists
You can do something similar with a linked list: instead of storing a value and another linked list, we can store a value and a function that returns another linked list:

```python
class count:
    def __init__(self, start=0):
        self.head = start

    @property
    def tail(self):
        return count(self.head + 1)
```

Or, more generally:
```python
class Tlist:
    def __init__(self, head, trigger):
        self.head, self._trigger = head, trigger

    @property
    def tail(self):
        return Tlist(self._trigger(self.head), self._trigger)

def count(start=0):
    return Tlist(start, lambda value: value + 1)
```

Of course many languages don't have the equivalent of @property either, so you need an explicit function or method call to get the next value—everywhere your code used to say t.tail, it has to say t.tail()—but that's just minor syntactic sugar.
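Walking a Tlist then looks just like walking an Slist, except that each tail is freshly computed on demand (a minimal self-contained restatement of the class above):

```python
class Tlist:
    def __init__(self, head, trigger):
        self.head, self._trigger = head, trigger

    @property
    def tail(self):
        # computed on demand: build a fresh node by calling the trigger
        return Tlist(self._trigger(self.head), self._trigger)

def count(start=0):
    return Tlist(start, lambda value: value + 1)

c = count(5)
print(c.head, c.tail.head, c.tail.tail.head)  # 5 6 7
```

Nothing past the node you're looking at ever exists; each .tail access conjures the next node out of the trigger function.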
In many cases, you need more state than the current value (or no state at all), so it makes more sense to take a trigger as a function of no arguments—which can be a closure over the needed state. Also, to allow finite lazy lists, you need some way for the trigger to signal that it's done—by raising StopIteration, by returning a new Tlist or None rather than just a value, etc. I'll do all of the options differently in the next section, for illustration.
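Here's a sketch of the zero-argument-trigger-plus-StopIteration option (`from_iterable` is my name for it): conveniently, any Python iterator's bound __next__ method is already exactly such a trigger.

```python
class Tlist:
    def __init__(self, head, trigger):
        self.head, self._trigger = head, trigger

    @property
    def tail(self):
        try:
            return Tlist(self._trigger(), self._trigger)
        except StopIteration:
            return None  # the trigger says the list is finished

def from_iterable(iterable):
    # an iterator's bound __next__ is a zero-argument trigger that
    # closes over the iterator's state and raises StopIteration at the end
    i = iter(iterable)
    try:
        return Tlist(next(i), i.__next__)
    except StopIteration:
        return None

t = from_iterable([1, 2])
t2 = t.tail                      # save it: each access recomputes
print(t.head, t2.head, t2.tail)  # 1 2 None
```

Note that because the trigger consumes the underlying iterator, every access to .tail advances it; that recomputation problem is exactly what the memoizing lazy list in the next section fixes.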
Lazy lists
A lazy list is just a triggered list that hides the triggers under the covers—providing the same API as an Slist on top of a Tlist. Thanks to @property, we've almost got that. However, you don't want to have to recalculate the list over and over again; once a node is calculated, the trigger should be replaced by the actual node. Not all languages make it possible to build such a thing, and some have it built in, but for those in the middle, like Python, it's pretty simple:

```python
class Llist:
    def __init__(self, head, trigger):
        self.head, self._trigger, self._tail = head, trigger, None

    @classmethod
    def fromiter(cls, iterable):
        i = iter(iterable)
        return cls(None, i.__next__).tail

    @property
    def tail(self):
        if self._trigger is not None:
            try:
                self._tail = Llist(self._trigger(), self._trigger)
            except StopIteration:
                self._tail = None
            self._trigger = None
        return self._tail
```

Just as with Slist above, we can expand this into a full Python Sequence by adding the appropriate methods:
```python
def __iter__(self):
    yield self.head
    if self.tail is not None:
        yield from self.tail

def __getitem__(self, index):
    if isinstance(index, slice):
        return [self[i] for i in range(index.start or 0, index.stop, index.step or 1)]
    elif not index:
        return self.head
    elif index < 0:
        raise IndexError('Llist cannot take negative indexes')
    elif self.tail is None:
        raise IndexError('Llist index out of range')
    else:
        return self.tail[index-1]

# etc.
```

Of course I cheated by implementing __iter__ as a generator, but in a language without generators, you generally don't have to implement anything like __iter__ (Swift being the notable exception). It's not hard to turn that into an explicit class or closure if you need to, just annoyingly verbose:
```python
def __iter__(self):
    class Iterator:
        def __init__(self, node):
            self.node = node
        def __iter__(self):
            return self
        def __next__(self):
            if self.node is None:
                raise StopIteration
            result = self.node.head
            self.node = self.node.tail
            return result
    return Iterator(self)
```
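Putting the pieces together, here's a runnable sketch (a minimal restatement of Llist, with a simplified __getitem__ that only handles non-negative integer indexes):

```python
class Llist:
    def __init__(self, head, trigger):
        self.head, self._trigger, self._tail = head, trigger, None

    @classmethod
    def fromiter(cls, iterable):
        i = iter(iterable)
        return cls(None, i.__next__).tail

    @property
    def tail(self):
        if self._trigger is not None:
            try:
                self._tail = Llist(self._trigger(), self._trigger)
            except StopIteration:
                self._tail = None
            self._trigger = None
        return self._tail

    def __iter__(self):
        yield self.head
        if self.tail is not None:
            yield from self.tail

    def __getitem__(self, index):
        if not index:
            return self.head
        if self.tail is None:
            raise IndexError('Llist index out of range')
        return self.tail[index - 1]

def count(start=0):
    while True:
        yield start
        start += 1

numbers = Llist.fromiter(count())   # an infinite lazy list
print(numbers[5])                   # 5: forces only nodes 0 through 5
print(list(Llist.fromiter('abc')))  # ['a', 'b', 'c']
```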
You may want to implement methods like __repr__ specially, to not iterate the entire list, because that would defeat the purpose of being lazy. For example:
```python
def _lazyiter(self):
    yield self.head
    if self._trigger is not None:
        yield ...
    elif self._tail is not None:
        yield from self._tail._lazyiter()

def __repr__(self):
    return 'Llist({})'.format(list(self._lazyiter()))
```

This is basically what languages like Haskell have built in: an automatically lazy, singly-linked, tail-sharing cons list.
Laziness
There are two important things to observe here.

First, calling __getitem__ (or any other function that takes an index) automatically instantiates all nodes up to that index, while leaving the rest unevaluated. So, it's evaluating nodes exactly as fast as necessary, and no faster.
Second, if you just iterate over the Llist, without keeping a reference to the Llist itself around, every time you generate the next node, the previous one becomes garbage and can be cleaned up. So, just like an iterator, only a single value is alive at a time, automatically. (But, just like a list, if you _want_ to keep all the values alive, and iterate over them multiple times, you can do that, too, just by keeping the Llist around.) And if you use it in a typical recursive linked-list algorithm instead of a Python iterator loop, it's again going to free up each node as soon as you get the next one. So, a lazy linked list gives you all the advantages of an iterator, while still being a (linked) list.
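Both observations can be checked directly with a counting trigger (a minimal restatement of Llist, without the end-of-list handling since this trigger never finishes; the trigger is deliberately impure so we can count how many nodes ever get computed):

```python
class Llist:
    def __init__(self, head, trigger):
        self.head, self._trigger, self._tail = head, trigger, None

    @property
    def tail(self):
        if self._trigger is not None:
            # compute the next node once, then cache it
            self._tail = Llist(self._trigger(), self._trigger)
            self._trigger = None
        return self._tail

    def __getitem__(self, index):
        node = self
        for _ in range(index):
            node = node.tail
        return node.head

calls = []
def trigger():
    # impure on purpose: record every time a node's value is computed
    calls.append(None)
    return len(calls)

l = Llist(0, trigger)
print(l[3], len(calls))  # 3 3: nodes 1..3 instantiated on demand
print(l[3], len(calls))  # 3 3: cached nodes reused, trigger never re-fires
```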
Linked lists in Python
There's nothing stopping you from using lazy cons lists, other linked list types, or related types in Python. But they don't fit into the language and stdlib (and third-party libraries) nearly as well as sequences and iterators. The entire Python ecosystem is designed around the iterator, iterable, and sequence protocols, and in particular around looping over iterables, rather than around recursing over recursive data structures. As Guido explains in his post on tail recursion elimination, this is intentional.
So, is this a limitation in Python? Not really. The iterator protocol and other Pythonic idioms bring most of the benefits of recursive programming to iterative programming. An iterator is the same kind of quasi-mutable data structure as a tail-sharing linked list: using it mutates an ephemeral object you don't care about, but if you write everything else immutably, you still get referential transparency. Stack up a bunch of iterator-transforming expressions, and at the end you get an iterator over some new virtual sequence, without affecting the original sequence, without wasting memory, without computing everything up-front, and with the ability to handle infinite sequences (as long as you truncate the iterator in some way at some point, of course). It's a different way of getting the same benefits, but the fact that Python is a bad language for writing idiomatic Haskell is no more of a problem than the fact that it's a bad language for writing idiomatic Java; it's a great language for writing idiomatic Python.
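For example, stacking iterator transformations in plain, idiomatic Python gets you the lazy-list experience with nothing but the stdlib:

```python
from itertools import count, islice

# stacked iterator transformations: no copies, nothing computed
# up-front, and the source can be infinite
squares = (n * n for n in count(1))
odd_squares = (s for s in squares if s % 2)
first_five = list(islice(odd_squares, 5))
print(first_five)  # [1, 9, 25, 49, 81]
```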
So, if you want to use lazy cons lists, go ahead… but don't expect to get much support from the stdlib, third-party libraries, etc.