Code for this post can be found at https://github.com/abarnert/lazylist.

The same discussion that brought up lazy tuple unpacking also raised the idea of implementing a full lazy list class in Python.

This is easy to do, but isn't as useful as it sounds at first glance.

Background

If you don't know about cons lists, triggers, lazy evaluation, etc., first go read about lazy cons lists.

The short version is that a lazy cons list automatically provides all of the advantages of an iterator, while still being a (linked) list. If you iterate over it (or walk it recursively) without keeping a reference to the original list, each node becomes garbage as soon as the next node is created, so there's no space cost or up-front time cost to build the list. But, on the other hand, you can use the list as a list if you want to, keeping a reference to the original value, so you can iterate over it multiple times. So, in a language like Haskell, where all lists are lazy, there's no need for separate list comprehension and generator expression syntax; there's just a comprehension that gives you a lazy list, and you can use it either way.
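
If you just want the flavor of that without leaving this page, here's a minimal sketch of a lazy cons cell in Python (the names Cons and count_from are mine, not from that post):
class Cons:
    def __init__(self, head, get_tail):
        self.head, self._get_tail, self._tail = head, get_tail, None
    @property
    def tail(self):
        # Evaluate the rest of the list only the first time it's asked for.
        if self._get_tail is not None:
            self._tail, self._get_tail = self._get_tail(), None
        return self._tail

def count_from(n):
    return Cons(n, lambda: count_from(n + 1))   # a conceptually infinite list

node = count_from(0)
print(node.head, node.tail.head, node.tail.tail.head)   # 0 1 2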

Lazy Python lists

In Python, having the best of an iterator and a cons list in one isn't too useful, because you rarely want a cons list, you want a Python list—that is, a dynamic array that you can, e.g., randomly access in constant time.

Can we wrap things up the same way to get lazy Python lists? Well, not quite the same way, but it's about as easy.

Let's take a little shortcut here. When discussing lazy cons lists, the goal was to make it easy to wrap a function up in something that looked like a linked list. But in Python, the obvious use for lazy lists is going to be wrapping up iterators, not functions. (And you can always turn any function into an iterator trivially if you come up with some other use.) So, I'll change the API to just take an iterable in the initializer. But see below for alternate constructors.

The trick is to keep a list of already-evaluated values, together with the iterator that provides additional values on demand. At first, you'll have an empty list; each time you need to evaluate another value, you just pull it off the iterator and append it to the list:
import collections.abc
from itertools import chain, islice

class LazyList(collections.abc.MutableSequence):
    def __init__(self, iterable):
        self._nonlazy, self._lazy = [], iter(iterable)
    def __len__(self):
        self._delazify()
        return len(self._nonlazy)
    def _delazify(self, index=None):
        if index is None:
            return self._nonlazy.extend(self._lazy)
        if isinstance(index, slice):
            # Evaluate up to the highest index the slice can touch; a None
            # start/step means 0/1, and a None stop means the whole list.
            if index.stop is None:
                return self._nonlazy.extend(self._lazy)
            if (index.start or 0) < 0 or index.stop < 0:
                raise IndexError('LazyList cannot take negative indexes')
            index = max(range(index.start or 0, index.stop, index.step or 1),
                        default=0)
        if index < 0:
            raise IndexError('LazyList cannot take negative indexes')
        self._nonlazy.extend(islice(self._lazy, index - len(self._nonlazy) + 1))
    def __getitem__(self, index):
        self._delazify(index)
        return self._nonlazy[index]
    def __delitem__(self, index):
        self._delazify(index)
        del self._nonlazy[index]
    def __setitem__(self, index, value):
        self._delazify(index)
        self._nonlazy[index] = value
    def insert(self, index, value):
        if index:
            self._delazify(index-1)
        self._nonlazy.insert(index, value)
    def __iter__(self):
        yield from self._nonlazy
        for value in self._lazy:
            self._nonlazy.append(value)
            yield value
    def append(self, value):
        self._lazy = chain(self._lazy, (value,))
    def extend(self, values):
        self._lazy = chain(self._lazy, values)
    def clear(self):
        self._nonlazy, self._lazy = [], iter(())
    def __str__(self):
        return '[{}]'.format(', '.join(chain(map(repr, self._nonlazy), ['...'])))
    def __repr__(self):
        return 'LazyList({})'.format(self)

Bikeshedding

We have a few choices to make in a real implementation.

How long is a lazy list?

A lazy list obviously doesn't have to be infinite, but it may be, and given that we're initializing with an iterator, there's no way to know whether we're infinite or not. So, what should __len__ do?

The implementation above will eagerly evaluate the entire list—which could loop forever if the iterable is infinite. This means that it's pretty easy to write innocent code that isn't expecting an infinite sequence, pass it one, and hang. On the other hand, any other alternative, like raising an exception (in which case you have to choose carefully between TypeError, ValueError, and maybe NotImplementedError), means we're cheating as a MutableSequence, and not fully usable as a drop-in replacement for a list. On top of that, methods like __contains__ or count are going to iterate the whole list anyway, and there's no real way around that, short of making them raise as well. Or, of course, you could decide not to make it a MutableSequence at all, but just an Iterable that happens to support many but not all MutableSequence methods. That may be a useful tradeoff in some use cases, but I think a list that eagerly evaluates on __len__ is the most general design.
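
For example, with the class above (evens here is just an illustration):
from itertools import count

evens = LazyList(2 * n for n in count())
print(evens[3])    # 6 -- only the first four values have been computed
# len(evens)       # would hang: __len__ tries to drain an endless generator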

If you were redesigning the entire stdlib (and maybe the language) around lazy lists, you could keep track of which iterables are definitely finite, definitely infinite, or unknown, so you could eagerly evaluate finite lists but raise on infinite lists—although you'd still have to make a choice for unknown lists. (A statically-typed language could even do that at compile time, and refuse to compile code that tried to take the length of an infinite sequence.) But that's obviously too big a change to drop into Python. (You can't even assume that a normal list is finite, since it can contain itself…)

Notice that many of the methods that Sequence and MutableSequence automatically define for us call __len__: __iter__, __reversed__, append, and reverse all call it directly; __contains__, count, and index call __iter__; extend calls append; __iadd__ calls extend. Also, while clear only depends on __delitem__, it will still eagerly evaluate the list just to destroy it. It's easy to implement __iter__ manually; append and friends can be done by chaining the new values to the iterator. For count, __reversed__, and reverse, there's no good way around evaluating everything; for __contains__ and index, you'll automatically get the best behavior (only evaluate until the element is found) just by implementing __iter__. So, that's what I've done above. But be aware that the library reference doesn't actually define which methods depend on which other methods, so at least in theory, that's an implementation artifact of CPython, and it might be safer to define _all_ of these methods manually.
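
You can see the payoff of the manual __iter__ with a quick test (again assuming an infinite input):
from itertools import count

lst = LazyList(count())
print(3 in lst)            # True -- iteration stops as soon as 3 is found
print(len(lst._nonlazy))   # 4 -- only 0, 1, 2, and 3 were ever computed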

Negative indices

Having implemented __len__, there's no reason we couldn't also implement negative indexing and slicing. In fact, we don't even need to handle it manually (including the whole mess with slice.indices); we know we have to eagerly evaluate the whole list to use any negative index, so just do that, then delegate the index or slice to the list:
    def _delazify(self, index=None):
        if index is None:
            return self._nonlazy.extend(self._lazy)
        if isinstance(index, slice):
            # A missing or negative bound means we may need the whole list.
            if index.stop is None or index.stop < 0 or (index.start or 0) < 0:
                return self._delazify()
            index = max(range(index.start or 0, index.stop, index.step or 1),
                        default=0)
        if index < 0:
            return self._delazify()
        self._nonlazy.extend(islice(self._lazy, index - len(self._nonlazy) + 1))
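
A quick illustration of the tradeoff:
lst = LazyList(range(10))
print(lst[2])              # 2 -- only the first three values are computed
print(lst[-1])             # 9 -- any negative index evaluates everything
print(len(lst._nonlazy))   # 10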

What is and isn't public?

There are some use cases where the user could take advantage of knowing how much of the list has been evaluated so far. We could do that by having a nonlazy_length method (or even making len return that, instead of evaluating the whole list) and letting the caller index accordingly, but it seems even simpler for the caller (and obviously for the class) to just turn the private _nonlazy attribute into a public one. Then, if you want to iterate over all so-far-computed values, it's just for i in lst.nonlazy, not for i in lst[:lst.nonlazy_length()]. But "nonlazy" isn't a very good name from an outside consumer's point of view; maybe "computed" is better. Anyway, I can't see any good argument for keeping this hidden, and at least in some use cases there's a reason not to hide it, so making it public would probably be a better design.
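
Here's a minimal sketch of that idea (the name computed is just the suggestion above, not something the class already has):
    @property
    def computed(self):
        return self._nonlazy
The recursive primes example later in this post assumes something like this exists.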

We could also expose the _delazify method, but that doesn't seem as necessary. Wanting to explicitly force evaluation (either up to a certain point, or of the whole thing) without actually wanting to see those values doesn't seem like it would be very common—and if you really want to, lst[30] or lst[-1] works just fine.

Representation

The str and repr give us another place to make a choice. In most cases, the repr is either a string that can be evaluated to produce the same value, or something that's obviously not evaluatable, like the generic <LazyList at 0x123456789>. But for list, repr includes [...] if the list recursively includes itself, and of course eval('[1, 2, 3, [...]]') evaluates to [1, 2, 3, [Ellipsis]], which is not the same thing you started with. I think that's enough precedent to use ... for lazy lists as well, as I've done above.

We could keep track of the fact that we've exhausted our iterator (with a flag, or just replacing self._lazy with None), which would allow us to drop the ... when the list is no longer lazy, but that seems more work than it's worth for the small benefit.
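
If you did want it, a partial sketch might look like this (it only notices exhaustion on a full delazify, not when __iter__ or an islice happens to hit the end, which is part of why it doesn't seem worth it):
    def _delazify(self, index=None):
        if index is None:
            self._nonlazy.extend(self._lazy)
            self._exhausted = True      # nothing left in the iterator
            return
        # ... rest of _delazify unchanged ...

    def __str__(self):
        items = list(map(repr, self._nonlazy))
        if not getattr(self, '_exhausted', False):
            items.append('...')
        return '[{}]'.format(', '.join(items))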

Optimizing Sequences

If the input iterable is a Sequence, we could just copy the sequence to _nonlazy, instead of putting it in _lazy. For some use cases, that might have a significant performance benefit, but in others it won't make any difference, and it makes it a little harder to play with lazy lists, so I haven't done it. If you want to, it's trivial:
    def __init__(self, iterable):
        if isinstance(iterable, collections.abc.Sequence):
            self._nonlazy, self._lazy = list(iterable), iter(())
        else:
            self._nonlazy, self._lazy = [], iter(iterable)

Slicing

We could make slicing return a new LazyList instead of a list. But that's not as easy as it sounds.

The simplest solution is to just return LazyList(islice(self, index.start, index.stop, index.step)) (where index is the slice argument to __getitem__), which works great—except that it won't handle negative slice indices. That may be an acceptable tradeoff, but it may not. Of course we could fully evaluate, then return LazyList(self), to handle negative slice indices. (In fact, together with the sequence optimization above, this would be effectively equivalent to what we do now for negative slices, but lazy for positive slices.)
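
A minimal sketch of that islice-based version (lazy for non-negative slices, eager otherwise):
    def __getitem__(self, index):
        if isinstance(index, slice):
            if all(i is None or i >= 0
                   for i in (index.start, index.stop, index.step)):
                # Lazy path: nothing is evaluated until the result is used.
                return LazyList(islice(self, index.start, index.stop, index.step))
            # Negative slice: evaluate everything, then wrap the plain slice.
            self._delazify()
            return LazyList(self._nonlazy[index])
        self._delazify(index)
        return self._nonlazy[index]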

Alternatively, we could implement a view class (as dict does for its keys, values, and items, as NumPy does for its arrays, and as I touched on in my posts on map and filter views and lazy tuple unpacking). But that's making a pretty major change to the sequence API, especially once we start dealing with mutating operations. (And it brings up new questions, e.g., should mutating a view mutate the parent, or should it be copy-on-write?)

Finally, as an alternative to a generic copy-on-write view, we could explicitly slice _nonlazy, share _lazy, keep track of the slicing information, and keep track of everyone who needs to be updated whenever _lazy is iterated (similar to what itertools.tee does). This sounds like a lot of extra work for no benefit, but it's a foundation on which we could add further benefits. For example, if we kept weak references to all owners of the shared _lazy, we could stop updating them when they go away.

All of these are potentially good ideas for specific use cases, but probably not appropriate for a general-purpose sequence class.

Benefits of LazyList

So, now we have a complete MutableSequence, which can be used as a drop-in replacement for list, but which doesn't evaluate values until you need them. So, does that give us all the advantages of Haskell's lazy lists?

Unfortunately, it doesn't. While we get the time advantages of an iterator (values aren't generated until you need them), we don't get the space advantages (only one value live at a time).

So, what's the difference? Well, as mentioned above, when you iterate over a cons list (or use it recursively) without keeping a reference to the list, each time you evaluate the next node, the previous one automatically becomes garbage and can be cleaned up. But a lazy Python list can't do that. Even if you drop your reference to the original LazyList object, the iterator still has a reference to it. And you can't magic that away; it needs a reference to at least self._nonlazy, or the list can't be reused after iteration. And of course it needs to keep appending to that self._nonlazy, too. So, by the time you're done iterating, you have the whole list alive in memory.

The last alternative discussed above—a manual copy-on-write slice view with weak references—would allow you to use the LazyList cheaply, but only in recursive slicing algorithms (whether using a[0] and recursing on a[1:], or something else like recursing on a[::2] and a[1::2]). I think in most such use cases, you'd be better off just using an iterator instead of trying to write your code around sequences that may or may not be lazy.

In short, if you need an iterable to be a sequence, you generally don't need it to be lazy; if you need it to be lazy, you generally don't need it to be a sequence; if you need it to be both, there's no convenient way to write your code.

So, lazy lists aren't useless in Python, but they're not as exciting as they first appear.

Let's look at the specific use case that brought up this suggestion: lazy tuple unpacking. The idea was that a, b, *c, d, e = foo could be implemented something like this:
lfoo = LazyList(foo)
a, b, c, d, e = lfoo[0], lfoo[1], lfoo[2:-2], lfoo[-2], lfoo[-1]
But clearly, that's going to evaluate the entire list anyway. If foo is some object that already knows how to slice itself lazily (like range, or np.ndarray), the LazyList is getting in the way here; we would have been better off just using foo[2:-2]. If it's not, the LazyList can't help us, no matter how we implement it, because there's no way it can pop off the last two values without iterating to the end. An iterator, on the other hand, could help us, at least in the case that foo is a Sequence:
a, b, c, d, e = foo[0], foo[1], islice(foo, 2, len(foo)-2), foo[-2], foo[-1]
That will make c a lazy iterator over the slice, rather than a copy of the whole thing. Of course it still doesn't help if foo is an iterator, but there's nothing that can possibly help there anyway.

And I think most of the cases where lazy lists at first seem attractive will turn out the same way: at best the same as iterators (possibly with itertools.tee), and often even less useful.

But of course I'd love to see examples that prove me wrong.

Construction

We want to make it easy to write LazyList-generating functions. That's easy; just write a decorator that you can put on any generator function, and now it returns a LazyList instead of an iterator:
import functools

def lazylist(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return LazyList(func(*args, **kwargs))
    return wrapper
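For example (squares is just a made-up illustration):
import itertools

@lazylist
def squares():
    for i in itertools.count():
        yield i * i

s = squares()
print(s[3])    # 9 -- only the first four squares have been computed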
But you can go farther with this. What about a decorator that passes the LazyList itself to the function? Then you can write this:
@recursive_lazylist
def primes(lst):
    yield 2
    for candidate in itertools.count(3):  # start at the next number after 2
        if all(candidate % p for p in lst.computed):
            yield candidate
That's not quite as easy to write; we can't pass the iterator returned by calling the generator function into the LazyList constructor and also pass the constructed LazyList into the generator function as an argument. But there are a couple different ways around this. We could just construct a LazyList without an iterator, call the function, then assign the iterator after the fact:
def recursive_lazylist(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        lst = LazyList(())
        lst._lazy = func(lst, *args, **kwargs)
        return lst
    return wrapper
A really clever idea (not mine) is to realize that that first argument is exactly equivalent to the self argument for a method call. So, we can create a new subclass on the fly, like this:
import abc

class RecursiveLazyList(LazyList):
    @abc.abstractmethod
    def _producer(self):
        while False:
            yield
    def __init__(self, *args, **kwds):
        super().__init__(self._producer(*args, **kwds))

def recursive_lazylist(func):
    return type(RecursiveLazyList)(func.__name__,
                                   (RecursiveLazyList,),
                                   {'_producer': func})
This has the advantage that a reader can use their existing knowledge and intuitions about classes to realize that, at the time the generator function gets called, the new subclass's __init__ has started running but the base LazyList __init__ hasn't yet. Although the need to understand the basics of metaclasses (and dynamic class construction) may reduce the number of people who have useful intuitions here…
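
Either way, the primes example above gets used the same way from the caller's side:
primes_list = primes()   # with the subclass version, primes is now a class
print(primes_list[10])   # 31 -- only the first eleven primes get computed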

By the way, notice that I didn't use an explicit metaclass for RecursiveLazyList. As far as I can tell from the documentation, anything inheriting from collections.abc.Sequence is guaranteed to already have ABCMeta (or some subclass of it) as its metaclass, so explicitly specifying ABCMeta here can't be necessary in any conforming implementation, and it would just cause problems if some future version of collections.abc used a proper subclass rather than ABCMeta itself (as CPython 3.4 does). But I may be wrong here; it certainly looks less clear to use abstractmethod when there are no visible metaclasses.

Sidebar: going crazy with LazyList

It may be that to really see the benefits of LazyList, you'd need to cram more support into the language. I don't think that's worth doing—I think if there are benefits to LazyList, you can find them without doing any more than building the class. But in case I'm wrong, you can go a lot farther without too much hacking.

There are many functions in the stdlib that return iterators. Just as many of these functions could instead return a view (as discussed in Swift-style map and filter views), they could also return a lazy list. We can easily create wrappers for them:
import builtins, itertools

def _wrap(module, name):
    globals()[name] = lazylist(getattr(module, name))
for name in 'map', 'filter', 'zip', 'enumerate':
    _wrap(builtins, name)
for name in ('count', 'cycle', 'repeat', 'accumulate', 'chain',
             'compress', 'dropwhile', 'filterfalse', 'groupby',
             'islice', 'starmap', 'takewhile', 'zip_longest',
             'product', 'permutations', 'combinations',
             'combinations_with_replacement'):
    _wrap(itertools, name)
del _wrap
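With those wrappers in place (in this module's namespace), the familiar names now hand back LazyLists:
squares = map(lambda n: n * n, count())   # both map and count are the wrapped versions
print(squares[5])                         # 25 -- indexable, but still computed on demand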
Or, if you really want to, you can monkeypatch the stdlib itself (just replace that globals()[name] = func with setattr(module, name, func)). This is starting to get more dangerous—there could easily be code in the stdlib or a third-party library that expects map to return an iterator, not a LazyList—but if you want to try it, go for it.

While we're going crazy, what if you wanted to automatically wrap every generator function so it became a lazylist function? That should be doable with an import hook. See inline (bytecode) assembly for the basics—or, better, you could probably do this with MacroPy. You could try to determine which FunctionDef nodes are generator functions by looking for any Yield or YieldFrom descendants, and, if found, just add a node to the decorator_list. Or, maybe even hackier, write a safe decorator and just decorate every function. For example:
import ast, builtins, functools, inspect

def lazylist(func):
    if not inspect.isgeneratorfunction(func):
        return func
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return LazyList(func(*args, **kwargs))
    return wrapper

builtins.lazylist = lazylist

class LazifyGenerators(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        node.decorator_list.insert(0,
            ast.copy_location(ast.Name(id='lazylist', ctx=ast.Load()), node))
        return node
I've left out the code to install a SourceFileLoader import hook and cram the LazifyGenerators transformer in at the right point, but emptyify has some working (but probably not very good) code for Python 3.4+.
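
To make that concrete, here's a rough sketch of the kind of loader you'd need (my guess at the shape of it, not the emptyify code):
import ast, importlib.machinery, sys

class LazifyingLoader(importlib.machinery.SourceFileLoader):
    def source_to_code(self, data, path, *, _optimize=-1):
        tree = LazifyGenerators().visit(ast.parse(data, path))
        ast.fix_missing_locations(tree)
        return compile(tree, path, 'exec', dont_inherit=True, optimize=_optimize)

# Route ordinary .py files found on sys.path through the new loader.
sys.path_hooks.insert(0, importlib.machinery.FileFinder.path_hook(
    (LazifyingLoader, importlib.machinery.SOURCE_SUFFIXES)))
sys.path_importer_cache.clear()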

And, while you're hacking up the AST, you could even change list displays and comprehensions (or, if you prefer, generator expressions) to instead build LazyLists:
class LazifyLists(ast.NodeTransformer):
    def visit_List(self, node):
        node = self.generic_visit(node)
        func = ast.copy_location(
            ast.Name(id='lazylist', ctx=ast.Load()),
            node)
        call = ast.copy_location(
            ast.Call(func=func, args=[node], keywords=[],
                     starargs=None, kwargs=None),   # starargs/kwargs exist only before 3.5
            func)
        return call
    visit_ListComp = visit_List
All this without needing to hack up the compiler or interpreter. Of course you can do that too. For CPython, first you need to port LazyList itself to C, and make it available as a builtin, and export C-API-style methods like PyLazyList_New. Then, if you want to, e.g., change list comprehensions to always return LazyLists instead, just change the interpreter code for BUILD_LIST to call PyLazyList_New and PyLazyList_SetItem. If you'd rather add new syntax, that's less radical, but more work—starting with picking the syntax (would angle brackets work? if not, what symbols are left?), then editing the grammar, then changing the compiler to be able to generate a new BUILD_LAZYLIST opcode, and then finally modifying the interpreter the same way as above, except adding a BUILD_LAZYLIST handler instead of modifying the BUILD_LIST handler.
After all that, if you still can't find any benefits for LazyList, then it's probably time to admit that Python doesn't need it. :)

It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.