Code for this post can be found at https://github.com/abarnert/lazylist.

The same discussion that brought up lazy tuple unpacking also raised the idea of implementing a full lazy list class in Python.

This is easy to do, but isn't as useful as it sounds at first glance.

Background

If you don't know about cons lists, triggers, lazy evaluation, etc., first go read about lazy cons lists.

The short version is that a lazy cons list automatically provides all of the advantages of an iterator, while still being a (linked) list. If you iterate over it (or walk it recursively) without keeping a reference to the original list, each node becomes garbage as soon as the next node is created, so there's no space cost or up-front time cost to build the list. But, on the other hand, you can use the list as a list if you want to, keeping a reference to the original value, so you can iterate over it multiple times. So, in a language like Haskell, where all lists are lazy, there's no need for separate list comprehension and generator expression syntax; there's just a comprehension that gives you a lazy list, and you can use it either way.
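If you just want the flavor without reading that post, here's a minimal sketch of a lazy cons cell in Python (my own illustration, not the code from that post):

class LazyCons:
    # Minimal lazy cons cell: head is a value, tail is a thunk that's
    # forced (and cached) on first access.
    def __init__(self, head, tail_thunk=None):
        self.head = head
        self._tail = None
        self._thunk = tail_thunk   # callable returning the next cell, or None
    @property
    def tail(self):
        if self._thunk is not None:
            self._tail, self._thunk = self._thunk(), None
        return self._tail

def count_from(n):
    # An infinite list of integers; only the head exists until you walk it.
    return LazyCons(n, lambda: count_from(n + 1))

node = count_from(0)
node = node.tail.tail   # cells we've walked past become garbage
print(node.head)        # 2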

Lazy Python lists

In Python, having the best of an iterator and a cons list in one isn't too useful, because you rarely want a cons list; you want a Python list—that is, a dynamic array that you can, e.g., randomly access in constant time.

Can we wrap things up the same way to get lazy Python lists? Well, not quite the same way, but it's about as easy.

Let's take a little shortcut here. When discussing lazy cons lists, the goal was to make it easy to wrap a function up in something that looked like a linked list. But in Python, the obvious use for lazy lists is going to be wrapping up iterators, not functions. (And you can always turn any function into an iterator trivially if you come up with some other use.) So, I'll change the API to just take an iterable in the initializer. But see below for alternate constructors.

The trick is to keep a list of already-evaluated values, together with the iterator that provides additional values on demand. At first, you'll have an empty list; each time you need to evaluate another value, you just pull it off the iterator and append it to the list:
import collections.abc
from itertools import chain, islice

class LazyList(collections.abc.MutableSequence):
    def __init__(self, iterable):
        self._nonlazy, self._lazy = [], iter(iterable)
    def __len__(self):
        self._delazify()
        return len(self._nonlazy)
    def _delazify(self, index=None):
        if index is None:
            # No index given: evaluate the whole list.
            return self._nonlazy.extend(self._lazy)
        if isinstance(index, slice):
            # Evaluate up to the highest index the slice can touch. (Slices
            # with a None stop or a negative step are handled in the
            # improved version below.)
            index = range(index.start or 0, index.stop, index.step or 1)[-1]
        if index < 0:
            raise IndexError('LazyList cannot take negative indexes')
        # islice's stop can't be negative, so clamp at 0 if we've already
        # evaluated this far.
        self._nonlazy.extend(islice(self._lazy,
                                    max(0, index - len(self._nonlazy) + 1)))
    def __getitem__(self, index):
        self._delazify(index)
        return self._nonlazy[index]
    def __delitem__(self, index):
        self._delazify(index)
        del self._nonlazy[index]
    def __setitem__(self, index, value):
        self._delazify(index)
        self._nonlazy[index] = value
    def insert(self, index, value):
        if index:
            self._delazify(index-1)
        self._nonlazy.insert(index, value)
    def __iter__(self):
        yield from self._nonlazy
        for value in self._lazy:
            self._nonlazy.append(value)
            yield value
    def append(self, value):
        self._lazy = chain(self._lazy, (value,))
    def extend(self, values):
        self._lazy = chain(self._lazy, values)
    def clear(self):
        self._nonlazy, self._lazy = [], iter(())
    def __str__(self):
        return '[{}]'.format(', '.join(list(map(repr, self._nonlazy)) + ['...']))
    def __repr__(self):
        return 'LazyList({})'.format(self)
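Here's a quick demo of how indexing pulls evaluation along (the squares generator is just for illustration):

from itertools import count

lst = LazyList(i * i for i in count())   # infinite, but cheap to create
print(lst[3])    # 9; evaluates only the first four values
print(lst)       # [0, 1, 4, 9, ...]
print(lst[1])    # 1; already computed, nothing new evaluated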

Bikeshedding

We have a few choices to make in a real implementation.

How long is a lazy list?

A lazy list obviously doesn't have to be infinite, but it may be, and given that we're initializing with an arbitrary iterator, there's no way to know in advance whether the list is infinite. So, what should __len__ do?

The implementation above will eagerly evaluate the entire list—which could loop forever if the iterable is infinite. This means it's pretty easy to write innocent code that isn't expecting an infinite sequence, accidentally pass it one, and hang. On the other hand, any other alternative, like raising an exception (in which case you have to choose carefully among TypeError, ValueError, and maybe NotImplementedError), means we're cheating as a MutableSequence, and not fully usable as a drop-in replacement for a list. On top of that, methods like __contains__ or count are going to iterate the whole list anyway, and there's no real way around that, short of making them raise as well. Or, of course, you could decide not to make it a MutableSequence at all, but just an Iterable that happens to support many but not all MutableSequence methods. That may be a useful tradeoff in some use cases, but I think a list that eagerly evaluates on __len__ is the most general design.
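For example, this innocent-looking call will never return:

from itertools import count

lst = LazyList(count())
len(lst)   # tries to extend _nonlazy with an infinite iterator; hangs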

If you were redesigning the entire stdlib (and maybe the language) around lazy lists, you could keep track of which iterables are definitely finite, definitely infinite, or unknown, so you could eagerly evaluate finite lists but raise on infinite lists—although you'd still have to make a choice for unknown lists. (A statically-typed language could even do that at compile time, and refuse to compile code that tried to take the length of an infinite sequence.) But that's obviously too big a change to drop into Python. (You can't even assume that a normal list is finite, since it can contain itself…)

Notice that many of the methods that Sequence and MutableSequence automatically define for us call __len__: __iter__, __reversed__, append, and reverse all call it directly; __contains__, count, and index call __iter__; extend calls append; __iadd__ calls extend. Also, while clear only depends on __delitem__, it will still eagerly evaluate the list just to destroy it. It's easy to implement __iter__ manually; append and friends can be done by chaining the new values to the iterator. For count, __reversed__, and reverse, there's no good way around evaluating everything; for __contains__ and index, you'll automatically get the best behavior (only evaluate until the element is found) just by implementing __iter__. So, that's what I've done above. But be aware that the library reference doesn't actually define which methods depend on which other methods, so at least in theory, that's an implementation artifact of CPython, and it might be safer to define _all_ of these methods manually.

Negative indices

Having implemented __len__, there's no reason we couldn't also implement negative indexing and slicing. In fact, we don't even need to handle it manually (including the whole mess with slice.indices); we know we have to eagerly evaluate the whole list to use any negative index, so just do that, then delegate the index or slice to the list:
    def _delazify(self, index=None):
        if index is None:
            return self._nonlazy.extend(self._lazy)
        if isinstance(index, slice):
            # A missing or negative bound means we can't tell how far to
            # evaluate without knowing the length, so evaluate everything.
            if (index.start or 0) < 0 or index.stop is None or index.stop < 0:
                return self._delazify()
            indices = range(index.start or 0, index.stop, index.step or 1)
            if not indices:
                return
            index = max(indices)
        if index < 0:
            return self._delazify()
        self._nonlazy.extend(islice(self._lazy,
                                    max(0, index - len(self._nonlazy) + 1)))
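Now negative indexing works, at the cost of full evaluation:

lst = LazyList(range(10))
print(lst[2])    # 2; only the first three values are evaluated
print(lst[-1])   # 9; forces evaluation of the whole list first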

What is and isn't public?

There are some use cases where the user could take advantage of knowing how much of the list has been evaluated so far. We could do that by adding a nonlazy_length method (or even making len return that, instead of evaluating the whole list) and letting the caller index accordingly, but it seems even simpler for the caller (and obviously for the class) to just turn the private _nonlazy attribute into a public one. Then, if you want to iterate over all so-far-computed values, it's just for i in lst.nonlazy, not for i in lst[:lst.nonlazy_length()]. But "nonlazy" isn't a very good name from an outside consumer's point of view; maybe "computed" is better. Anyway, I can't see any good argument for keeping this hidden, and at least in some use cases there's a reason not to, so exposing it would probably be a better design.
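A minimal sketch of that, using computed as the public name (the name is just my reading of the suggestion above):

    @property
    def computed(self):
        # The values evaluated so far; reading this never forces evaluation.
        return self._nonlazy

Returning the underlying list rather than a copy means a caller can mutate it behind our back, but that's consistent with the consenting-adults approach the rest of the class takes.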

We could also expose the _delazify method, but that doesn't seem as necessary. Wanting to explicitly force evaluation (either up to a certain point, or of the whole thing) without actually wanting to see those values doesn't seem like it would be very common—and if you really want to, lst[30] or lst[-1] works just fine.

Representation

The str and repr give us another place to make a choice. In most cases, the repr is either a string that can be evaluated to produce the same value, or something that's obviously not evaluatable, like the generic <LazyList at 0x123456789>. But for list, repr includes [...] if the list recursively includes itself, and of course eval('[1, 2, 3, [...]]') evaluates to [1, 2, 3, [Ellipsis]], which is not the same thing you started with. I think that's enough precedent to use ... for lazy lists as well, as I've done above.

We could keep track of the fact that we've exhausted our iterator (with a flag, or just replacing self._lazy with None), which would allow us to drop the ... when the list is no longer lazy, but that seems more work than it's worth for the small benefit.

Optimizing Sequences

If the input iterable is a Sequence, we could just copy the sequence to _nonlazy, instead of putting it in _lazy. For some use cases, that might have a significant performance benefit, but in others it won't make any difference, and it makes it a little harder to play with lazy lists, so I haven't done it. If you want to, it's trivial:
    def __init__(self, iterable):
        if isinstance(iterable, collections.abc.Sequence):
            self._nonlazy, self._lazy = list(iterable), iter(())
        else:
            self._nonlazy, self._lazy = [], iter(iterable)

Slicing

We could make slicing return a new LazyList instead of a list. But that's not as easy as it sounds.

The simplest solution is to just return LazyList(islice(self, slice.start, slice.stop, slice.step)), which works great—except that it won't handle negative slice indices. That may be an acceptable tradeoff, but it may not. Of course we could fully evaluate, then return LazyList(self), to handle negative slice indices. (In fact, together with the sequence optimization above, this would be effectively equivalent to what we do now for negative slices, but lazy for positive slices.)
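Here's a hedged sketch of that first option as a __getitem__ override (negative bounds would need the eager fallback discussed above, since islice rejects them):

    def __getitem__(self, index):
        if isinstance(index, slice):
            # islice accepts None bounds but raises ValueError on negatives.
            return LazyList(islice(self, index.start, index.stop, index.step))
        self._delazify(index)
        return self._nonlazy[index]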

Alternatively, we could implement a view class (as dict does for its keys, values, and items, and NumPy does for its arrays, and as I touched on in my posts on map and filter views and lazy tuple unpacking). But that's making a pretty major change to the sequence API, especially once we start dealing with mutating operations. (And it brings up new questions, e.g., should mutating a view mutate the parent, or should it be copy-on-write?)

Finally, as an alternative to a generic copy-on-write view, we could explicitly slice _nonlazy, share _lazy, keep track of the slicing information, and keep track of everyone who needs to be updated whenever _lazy is iterated (similar to what itertools.tee does). This sounds like a lot of extra work for no benefit, but it's a foundation on which we could add further benefits. For example, if we kept weak references to all owners of the shared _lazy, we could stop updating them when they go away.

All of these are potentially good ideas for specific use cases, but probably not appropriate for a general-purpose sequence class.

Benefits of LazyList

So, now we have a complete MutableSequence, which can be used as a drop-in replacement for list, but which doesn't evaluate values until you need them. So, does that give us all the advantages of Haskell's lazy lists?

Unfortunately, it doesn't. While we get the time advantages of an iterator (values aren't generated until you need them), we don't get the space advantages (only one value live at a time).

So, what's the difference? Well, as mentioned above, when you iterate over a cons list (or use it recursively) without keeping a reference to the list, each time you evaluate the next node, the previous one automatically becomes garbage and can be cleaned up. But a lazy Python list can't do that. Even if you drop your reference to the original LazyList object, the iterator still has a reference to it. And you can't magic that away; it needs a reference to at least self._nonlazy, or the list can't be reused after iteration. And of course it needs to keep appending to that self._nonlazy, too. So, by the time you're done iterating, you have the whole list alive in memory.

The last alternative discussed above—a manual copy-on-write slice view with weak references—would allow you to use the LazyList cheaply, but only in recursive slicing algorithms (whether using a[0] and recursing on a[1:], or something else like recursing on a[::2] and a[1::2]). I think in most such use cases, you'd be better off just using an iterator instead of trying to write your code around sequences that may or may not be lazy.

In short, if you need an iterable to be a sequence, you generally don't need it to be lazy; if you need it to be lazy, you generally don't need it to be a sequence; if you need it to be both, there's no convenient way to write your code.

So, lazy lists aren't useless in Python, but they're not as exciting as they first appear.

Let's look at the specific use case that brought up this suggestion: lazy tuple unpacking. The idea was that a, b, *c, d, e = foo could be implemented something like this:
lfoo = LazyList(foo)
a, b, c, d, e = lfoo[0], lfoo[1], lfoo[2:-2], lfoo[-2], lfoo[-1]
But clearly, that's going to evaluate the entire list anyway. If foo is some object that already knows how to slice itself lazily (like range, or np.ndarray), the LazyList is getting in the way here; we would have been better off just using foo[2:-2]. If it's not, the LazyList can't help us, no matter how we implement it, because there's no way it can pop off the last two values without iterating to the end. An iterator, on the other hand, could help us, at least in the case that foo is a Sequence:
a, b, c, d, e = foo[0], foo[1], islice(foo, 2, len(foo)-2), foo[-2], foo[-1]
That will make c a lazy iterator over the slice, rather than a copy of the whole thing. Of course it still doesn't help if foo is an iterator, but there's nothing that can possibly help there anyway.

And I think most of the cases where lazy lists at first seem attractive will turn out the same way: at best the same as iterators (possibly with itertools.tee), and often even less useful.

But of course I'd love to see examples that prove me wrong.

Construction

We want to make it easy to write LazyList-generating functions. That's easy; just write a decorator that you can put on any generator function, and now it returns a LazyList instead of an iterator:
import functools

def lazylist(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return LazyList(func(*args, **kwargs))
    return wrapper
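Now any generator function returns a LazyList instead:

import itertools

@lazylist
def squares():
    for i in itertools.count():
        yield i * i

s = squares()
print(s[5])   # 25; nothing past the sixth value is evaluated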
But you can go farther with this. What about a decorator that passes the LazyList itself to the function? Then you can write this:
@recursive_lazylist
def primes(lst):
    yield 2
    for candidate in itertools.count(3): #start at next number after 2
        if all(candidate % p for p in lst.computed):
            yield candidate
That's not quite as easy to write; we can't pass the iterator returned by calling the generator function into the LazyList constructor and also pass the constructed LazyList into the generator function as an argument. But there are a couple different ways around this. We could construct a LazyList from an empty iterable, call the function, then patch the real iterator in after the fact:
def recursive_lazylist(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        lst = LazyList(())
        lst._lazy = func(lst, *args, **kwargs)
        return lst
    return wrapper
A really clever idea (not mine) is to realize that that first argument is exactly equivalent to the self argument for a method call. So, we can create a new subclass on the fly, like this:
import abc

class RecursiveLazyList(LazyList):
    @abc.abstractmethod
    def _producer(self):
        # An abstract generator function; subclasses override this.
        while False:
            yield
    def __init__(self, *args, **kwds):
        super().__init__(self._producer(*args, **kwds))

def recursive_lazylist(func):
    # type(RecursiveLazyList) is the ABC metaclass (see below), so the
    # subclass gets built with the same metaclass as its parent.
    return type(RecursiveLazyList)(func.__name__,
                                   (RecursiveLazyList,),
                                   {'_producer': func})
This has the advantage that a reader can use their existing knowledge and intuitions about classes to realize that, at the time the generator function is called, the subclass's __init__ has been entered but the base class's __init__ hasn't run yet. Although the need to understand the basics of metaclasses (and dynamic class construction) may reduce the number of people who have useful intuitions here…
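You also don't strictly need the decorator: writing the subclass by hand works the same way. A quick sketch (this assumes the computed property from the bikeshedding section above):

import itertools

class Primes(RecursiveLazyList):
    def _producer(self):
        yield 2
        for candidate in itertools.count(3):
            # computed holds every prime yielded so far, so this checks
            # candidate against all smaller primes.
            if all(candidate % p for p in self.computed):
                yield candidate

primes = Primes()
print(primes[10])   # 31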

By the way, notice that I didn't use an explicit metaclass for RecursiveLazyList. As far as I can tell from the documentation, anything inheriting collections.abc.Sequence is guaranteed to already have some subclass of ABCMeta as its metaclass, so explicitly specifying ABCMeta here would just cause problems if some future version of collections.abc used a proper subclass rather than ABCMeta itself (as CPython 3.4 does), and can't be necessary in any possible implementation. But I may be wrong here; it certainly looks less clear to use abstractmethod when there are no visible metaclasses.

Sidebar: going crazy with LazyList

It may be that to really see the benefits of LazyList, you'd need to cram more support into the language. I don't think that's worth doing—I think if there are benefits to LazyList, you can find them without doing any more than building the class. But in case I'm wrong, you can go a lot farther without too much hacking.

There are many functions in the stdlib that return iterators. Just as many of these functions could instead return a view (as discussed in Swift-style map and filter views), they could also return a lazy list. We can easily create wrappers for them:
import builtins, itertools

def _wrap(module, name):
    globals()[name] = lazylist(getattr(module, name))
for name in 'map', 'filter', 'zip', 'enumerate':
    _wrap(builtins, name)
for name in ('count', 'cycle', 'repeat', 'accumulate', 'chain',
             'compress', 'dropwhile', 'filterfalse', 'groupby',
             'islice', 'starmap', 'takewhile', 'zip_longest',
             'product', 'permutations', 'combinations',
             'combinations_with_replacement'):
    _wrap(itertools, name)
del _wrap
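With that in place, the familiar iterator-returning calls hand back LazyLists:

squares = map(lambda x: x * x, count())   # both names are the wrapped versions
print(squares[3])   # 9; it's a LazyList, so indexing works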
Or, if you really want to, you can monkeypatch the stdlib itself (just replace the globals()[name] = … assignment with setattr(module, name, …)). This is starting to get more dangerous—there could easily be code in the stdlib or a third-party library that expects map to return an iterator, not a LazyList—but if you want to try it, go for it.

While we're going crazy, what if you wanted to automatically wrap every generator function so it became a lazylist function? That should be doable with an import hook. See inline (bytecode) assembly for the basics—or, better, you could probably do this with MacroPy. You could try to determine which FunctionDef nodes are generator functions by looking for any Yield or YieldFrom descendants, and, if found, just add a node to the decorator_list. Or, maybe even hackier, write a safe decorator and just decorate every function. For example:
import functools, inspect

def lazylist(func):
    if not inspect.isgeneratorfunction(func):
        return func
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return LazyList(func(*args, **kwargs))
    return wrapper

builtins.lazylist = lazylist

import ast

class LazifyGenerators(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        node.decorator_list.insert(0,
            ast.copy_location(ast.Name(id='lazylist', ctx=ast.Load()), node))
        return node
I've left out the code to install a SourceFileLoader import hook and cram the LazifyGenerators transformer in at the right point, but emptyify has some working (but probably not very good) code for Python 3.4+.

And, while you're hacking up the AST, you could even change list displays and comprehensions (or, if you prefer, generator expressions) to instead build LazyLists:
class LazifyLists(ast.NodeTransformer):
    def visit_List(self, node):
        # (A real version would skip List nodes in Store or Del context,
        # like the targets of [a, b] = ... assignments.)
        node = self.generic_visit(node)
        func = ast.copy_location(
            ast.Name(id='lazylist', ctx=ast.Load()),
            node)
        call = ast.copy_location(
            ast.Call(func=func, args=[node], keywords=[],
                     starargs=None, kwargs=None),
            func)
        return call
    visit_ListComp = visit_List
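To see it work without the import-hook machinery, you can run the transformer by hand. Note that the lazylist being named here has to accept an iterable and return a LazyList, so for this sketch I just bind the class itself:

import builtins
builtins.lazylist = LazyList   # any iterable-to-LazyList callable works here

tree = ast.parse('squares = [i * i for i in range(10)]')
tree = ast.fix_missing_locations(LazifyLists().visit(tree))
ns = {}
exec(compile(tree, '<demo>', 'exec'), ns)
print(type(ns['squares']).__name__)   # LazyList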
All this without needing to hack up the compiler or interpreter. Of course you can do that too. For CPython, first you need to port LazyList itself to C, and make it available as a builtin, and export C-API-style methods like PyLazyList_New. Then, if you want to, e.g., change list comprehensions to always return LazyLists instead, just change the interpreter code for BUILD_LIST to call PyLazyList_New and PyLazyList_SetItem. If you'd rather add new syntax, that's less radical, but more work—starting with picking the syntax (would angle brackets work? if not, what symbols are left?), then editing the grammar, then changing the compiler to be able to generate a new BUILD_LAZYLIST opcode, and then finally modifying the interpreter the same way as above, except adding a BUILD_LAZYLIST handler instead of modifying the BUILD_LIST handler.
After all that, if you still can't find any benefits for LazyList, then it's probably time to admit that Python doesn't need it. :)