1. The fts interface from BSD is simpler to use than Python's os.walk for most cases. It's also more flexible, provides more information, and requires far fewer calls to stat, which makes it inherently faster (around 3x-7x in tests on different platforms) for many use cases.

    Interface

    The fts module's interface is a Pythonified version of the BSD (and probably future POSIX) fts API.

    Changes from the BSD interface are:
    • fts.open in place of fts_open.
    • Constants, functions, and struct members lose the FTS_ and fts_ prefixes.
    • Failures raise OSError. (Errors reading directories, etc., are generally not failures, and are returned in the error field of an FTSENT.)
    • All other functions become methods of the FTS object (e.g., f.read() instead of fts_read(f)).
    • Each options bitfield becomes a separate bool parameter.
    • Option physical is dropped because it's the opposite of logical (the C API requires exactly one of the two). This means physical is the default (instead of the default being an illegal case that returns an EINVAL error).
    • Options in negative form remove the double negative (e.g., instead of nochdir=False, it's chdir=True).
    • fts.open takes optional key and reverse parameters, instead of a compar parameter. The params are interpreted as with sorted, except that the default is directory order. You can of course use functools.cmp_to_key if you prefer to write a cmp-style function (see the sketch just after this list).
    • Because you can easily bind values into the key function, there is no need for the fts_set_clientptr, fts_get_clientptr, and fts_get_stream functions, or for the fts_number and fts_pointer members of FTSENT.
    • FTS objects are context managers.
    • FTS objects are iterators, calling read() to get the next value.
    • children takes a nameonly bool parameter, instead of an option that must be NAMEONLY or 0.
    • children returns a Python list of FTSENT objects, instead of a single FTSENT object that requires you to manually follow the link values until you reach NULL. (However, if you want to, you can still follow the links.)
    • In place of set with three different options, there are three separate methods: again, follow, and skip.
    • Passing chdir=True does not necessarily guarantee that fts will change directories. (This is already true in the C API, but it's not stated clearly, and it is guaranteed with most implementations, so people have written C code that relies on chdir.)
    • The info value NSOK will never be returned for any directory, or any readable symlink to a directory. (This is already implicit in the C API, but should be explicitly stated.) 
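
    For example, here's a sketch against the proposed interface (fts.open and its key parameter exist only in this proposal, and the path is made up), showing both a plain key function and a cmp-style function wrapped with functools.cmp_to_key:

    import functools
    import fts

    # Traverse each directory's entries in case-insensitive name order,
    # instead of directory order.
    f = fts.open(['/some/tree'], key=lambda ent: ent.name.lower())

    # A cmp-style comparison function still works via functools.cmp_to_key.
    def cmp_names(a, b):
        return (a.name > b.name) - (a.name < b.name)
    f = fts.open(['/some/tree'], key=functools.cmp_to_key(cmp_names))
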
    So, a quick summary of the interface is:

    • fts.open(paths, comfollow=False, logical=False, chdir=True, stat=True, seedot=False, xdev=False, key=None, reverse=False): returns an FTS object.
    • FTS.read() returns the next FTSENT, or None if there are no more.
    • FTS.children(nameonly=False) returns a list of FTSENTs.
    • FTS.close() closes the FTS and cleans up any resources.
    • Deleting the FTS, or using it as a context manager and exiting it, automatically closes it.
    • Iterating the FTS repeatedly calls read() until None.
    • An FTSENT has at least the following fields (as defined in the BSD documentation): info, accpath, path, name, level, errno, parent, link, cycle, statp, plus an additional error field that holds an OSError object for the errno.
      • parent, link, and cycle may be weak references.
    • fts.walk(path, topdown=True, onerror=None, followlinks=False): a drop-in replacement for os.walk that's usually a few times faster.

    Simple examples:

    For very simple use cases, an FTS is just an iterator over FTSENT objects:

    # skip fts.DP entries so directories aren't counted twice
    total = sum(entry.statp.st_size for entry in fts.open([path])
                if entry.info != fts.DP)

    However, you really should clean up FTS objects—just as with os.walk, you're leaking O(N) file handles for an N-level tree until the GC finds everything:

    with fts.open([path]) as f:
        total = sum(entry.statp.st_size for entry in f
                    if entry.info != fts.DP)

    As usual, if you're trying to do anything non-trivial, you probably want to explicitly loop over the iterator, or call a filtering generator function, etc., rather than using a generator expression:

    with fts.open([path]) as f:
        totals = collections.Counter()
        for entry in f:
            if entry.info != fts.DP:  # don't count directories twice
                totals[entry.statp.st_dev] += entry.statp.st_size
        return totals

    with fts.open([path]) as f:
        for entry in f:
            if entry.info == fts.DP:
                os.rmdir(entry.accpath)
            elif entry.info != fts.D:
                os.remove(entry.accpath)

    Comparison to os.walk

    • fts iterates a flat sequence of entries, rather than a sequence of sequences. This makes many use cases much simpler, although it does make a few use cases more complicated (e.g., if you really need those sub-sequences, you need an extra children call).
    • fts yields entry objects instead of just names. These include stat information (which os.walk actually fetches but doesn't expose to you), depth in the tree, and both of the obvious paths that you might try to create out of each name.
    • fts iterates a sequence of all entries of all kinds, rather than giving you separate sequences of directory and file entries. Again, this makes many use cases much simpler, but makes a few more complicated, as you have to check entry.info to switch on the kind of entry.
    • fts lets you prune the walk with an explicit skip method, instead of requiring you to modify a list in place.
    • fts also lets you explicitly follow symlinks that would otherwise be skipped, and retry entries that failed (e.g., after asking the user to plug the external drive back in), as well as skipping.
    • fts yields directories both in pre-order and in post-order, instead of taking a parameter that specifies which order you want. You can ignore entries where info == fts.DP (for topdown) or where info == fts.D (for not topdown) to get the same effect.
    • fts has some options that are hard to reproduce simply or efficiently with walk.
    • fts can be many times faster, especially if you don't need the stat information.
    • fts is a context manager, so to clean up, you just do with fts.open([path]) as f, instead of with contextlib.closing(os.walk(path)) as w. Of course this isn't a big deal, but hopefully the change will encourage people not to leak their iterators as often.
    Below is some code illustrating the differences:

    total = sum(entry.statp.st_size for entry in fts.open([path])
                if entry.info != fts.DP)

    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        total += sum(os.stat(os.path.join(dirpath, name)).st_size
                     for name in dirnames+filenames)


    with fts.open([path]) as f:
        for entry in f:
            if entry.info == fts.DP:
                os.rmdir(entry.accpath)
            elif entry.info != fts.D:
                os.remove(entry.accpath)

    with contextlib.closing(os.walk(path, topdown=False)) as w:
        for dirpath, dirnames, filenames in w:
            for name in filenames:
                os.remove(os.path.join(dirpath, name))
            for name in dirnames:
                os.rmdir(os.path.join(dirpath, name))        




    with fts.open([path]) as f:
        for entry in f:
            if entry.info == fts.DP:
                continue  # directories were already handled pre-order
            if entry.name.startswith('.'):
                f.skip(entry)
            else:
                doStuff(entry.accpath)



    with contextlib.closing(os.walk(path)) as w:
        for dirpath, dirnames, filenames in w:
            for i, name in reversed(list(enumerate(dirnames))):
                if name.startswith('.'):
                    del dirnames[i]
                else:
                    doStuff(os.path.join(dirpath, name))
            for name in filenames:
                if not name.startswith('.'):
                    doStuff(os.path.join(dirpath, name))




    with fts.open([path]) as f:
        for entry in f:
            if entry.info == fts.DP:
                continue  # directories were already handled pre-order
            if entry.info == fts.D and entry.level > n:
                f.skip(entry)
                continue
            doStuff(entry.accpath)

    introlen = len(os.path.abspath(path)) + 1
    def getlevel(dirpath, dirname):
        dpath = os.path.abspath(os.path.join(dirpath, dirname))
        reldpath = dpath[introlen:]
        return len(reldpath.split(os.sep))
    with contextlib.closing(os.walk(path)) as w:
        for dirpath, dirnames, filenames in w:
            for i, name in reversed(list(enumerate(dirnames))):
                if getlevel(dirpath, name) > n:
                    del dirnames[i]
            for name in dirnames + filenames:
                doStuff(os.path.join(dirpath, name))



    with fts.open([path]) as f:
        for entry in f:
            if entry.info == fts.DP:
                continue  # directories were already handled pre-order
            if entry.info == fts.SL and 'foo' in entry.name:
                f.follow(entry)
            doStuff(entry.accpath)

    with contextlib.closing(os.walk(path, followlinks=True)) as w:
        for dirpath, dirnames, filenames in w:
            for i, name in reversed(list(enumerate(dirnames))):
                if 'foo' not in name and os.path.islink(os.path.join(dirpath, name)):
                    del dirnames[i]
            for name in dirnames + filenames:
                doStuff(os.path.join(dirpath, name))

    with fts.open([path]) as f:
        def handleerror(entry):
            print(entry.error)
            rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
            if rsa == 'r':
                f.again(entry)
            elif rsa == 'a':
                raise entry.error
        for entry in f:
            if entry.info in (fts.DNR, fts.ERR, fts.NS):
                handleerror(entry)
            elif entry.info != fts.DP:
                doStuff(entry.accpath)

    def handleerror(error):
        print(error)
        rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
        if rsa == 'r':
            # do another os.walk recursively from error.filename;
            # implementation left as an exercise for the reader
            pass
        elif rsa == 'a':
            raise error
    with contextlib.closing(os.walk(path, onerror=handleerror)) as w:
        for dirpath, dirnames, filenames in w:
            for name in dirnames + filenames:
                doStuff(os.path.join(dirpath, name))


    with fts.open([path], xdev=True) as f:
        for entry in f:
            if entry.info != fts.DP:
                doStuff(entry.accpath)

    dev = os.stat(path).st_dev
    with contextlib.closing(os.walk(path)) as w:
        for dirpath, dirnames, filenames in w:
            for name in dirnames + filenames:
                doStuff(os.path.join(dirpath, name))
            for i, name in reversed(list(enumerate(dirnames))):
                st = os.stat(os.path.join(dirpath, name))
                if st.st_dev != dev:
                    del dirnames[i]

    Implementation

    On platforms with a native fts API that complies with 4.4BSD, Python can just wrap the native API. I don't know of any platform that doesn't comply—in fact, I don't know of any that doesn't use a minor fork of the same implementation—but very few of them document this compliance, which means we have to check them. FreeBSD, OpenBSD, OS X, Linux/glibc, and Cygwin are all compliant in their current versions, but we will have to check back to the oldest version of each platform compatible with Python x.y.

    On POSIX-like platforms without a native fts API, the FreeBSD implementation is almost completely portable to anything near POSIX, and compatibly licensed, so we can simply include a copy of that fts.c to use on such platforms. There are a few optimizations that may need to be disabled for some POSIX-like systems:
    • If fchdir does not exist, fts_safe_changedir should be stubbed with a return 0, and FTS_NOCHDIR should always be set. This means the chdir=False option will not optimize anything.
    • If statfs does not exist, or does not return a usable f_fstypename, fts_ufslinks should be stubbed with a return 0. This means that the UFS-style-links optimization will not be available for UFS-like filesystems.
    On Windows, fts.c actually doesn't look too hard to port. The main trick is that you get stat information immediately out of FindNextFile, but don't get d_type and friends, so the readdir loop in fts_build needs to use the stat info, and fts_stat needs to skip the actual stat calls because the info is already there, but I think that's it.

    However, I think it might be better to port fts to pure Python. This would require an iterdir_stat-type API that does something like os.listdir, except that it also returns each entry's d_type on POSIX and its stat on Windows (if we have neither, we can just call os.stat for everything and pretend we're on Windows). This implementation would then be portable to non-CPython implementations, and to less common platforms.
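
    A minimal sketch of that helper, under the assumption that we only have the portable fallback (the name iterdir_stat and its exact shape are taken from the description above; a real version would pull d_type out of readdir on POSIX, or the FindNextFile data on Windows, instead of calling lstat):

    import os

    def iterdir_stat(path='.'):
        # Like os.listdir, but pair each name with whatever type/stat
        # information the platform can give us cheaply. This portable
        # fallback just lstats everything, which is the "pretend we're
        # on Windows" case described above.
        for name in os.listdir(path):
            try:
                st = os.lstat(os.path.join(path, name))
            except OSError:
                st = None
            yield name, st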

    For the fts.walk implementation, it should be something like this:


    def walk(path, topdown=True, onerror=None, followlinks=False):
        level = None
        dirents, filenames = [], []
        dirpath = path
        with fts.open([path], logical=followlinks, stat=False,
                      chdir=False, comfollow=True) as f:
            for ent in f:
                if ent.level != level:
                    if level is not None:
                        dirnames = [dirent.name for dirent in dirents]
                        yield dirpath, dirnames, filenames
                        # prune anything the caller removed from dirnames
                        for dirent in dirents:
                            if dirent.name not in dirnames:
                                f.skip(dirent)
                    level = ent.level
                    dirents, filenames = [], []
                    dirpath = os.path.dirname(ent.path)
                if ent.info in (D, DC):
                    if topdown:
                        dirents.append(ent)
                elif ent.info == DP:
                    if not topdown:
                        dirents.append(ent)
                elif ent.info in (DNR, ERR):
                    if onerror:
                        onerror(ent.error)
                elif ent.info == DOT:
                    pass
                else:
                    filenames.append(ent.name)
        dirnames = [dirent.name for dirent in dirents]
        yield dirpath, dirnames, filenames

    This hasn't been tested or analyzed yet. If it can't be made to work right in all cases (in particular, handling users modifying dirnames), it may still be useful as a drop-in replacement for most uses of os.walk—those who need features that aren't emulated can continue to use os.walk, or they can rewrite their code to use fts directly rather than calling fts.walk.
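
    Assuming the sketch can be made to work, usage is exactly like os.walk, including in-place pruning of dirnames:

    for dirpath, dirnames, filenames in fts.walk(path):
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        for name in filenames:
            doStuff(os.path.join(dirpath, name))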


  2. Never mind…

    As pointed out by Nick Coghlan, a generator expression translates the iterable of the outermost for clause into an argument to the generator function, not a local within the generator function. For example, this:

        lines = (line for line in file if line)

    …is not equivalent to this:

        def gen():
            for line in file:
                if line:
                    yield line
        lines = gen()

    …but to this:

        def gen(file):
            for line in file:
                if line:
                    yield line
        lines = gen(file)

    So, this:

        lines = (line with open(path) as file for line in file if line)

    …would be equivalent to this:
        def gen(file):
            for line in file:
                if line:
                    yield line
        with open(path) as file:
            lines = gen(file)

    As 6.2.8 Generator expressions (5.2.6 in 2.x) says:
    Variables used in the generator expression are evaluated lazily when the __next__() method is called for the generator object (in the same fashion as normal generators). However, the leftmost for clause is immediately evaluated, so that an error produced by it can be seen before any other possible error in the code that handles the generator expression. Subsequent for clauses cannot be evaluated immediately since they may depend on the previous for loop. For example: (x*y for x in range(10) for y in bar(x)).
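
    A quick way to see that eager evaluation of the leftmost iterable (nonexistent is deliberately undefined):

        gen = (x for x in nonexistent)        # NameError raised here, immediately
        gen = (x for x in [] if nonexistent)  # fine: only the leftmost iterable is eager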

    Original proposal

    This idea is not the same as the oft-suggested "with expression". I'll explain why with expressions are both different and useless later.

    The idea is to add an optional "with clause" to the existing generator expression syntax.

    Here's a basic example:

        upperlines = (line.upper() with open('foo', 'r') as file for line in file)

    And here's the translation to a generator function:

        def foo():
            with open('foo', 'r') as file:
                for line in file:
                    yield line.upper()
        upperlines = foo()
    

    The key here is that the file stays open until the expression finishes iterating over it. This is not the same as wrapping a with statement around a generator expression, where the file only stays open until the generator is built.

    Rationale

    Most people don't notice this feature is missing—but they work around it anyway. If you look over a few tutorials for modules on PyPI, questions and answers at StackOverflow, etc., you find people doing this:

        upperlines = (line.upper() for line in open('foo', 'r'))
    

    Everyone knows this is wrong, because it relies on refcounting, and therefore only works in CPython. In fact, it's even wrong in CPython, because the file isn't closed until the last reference to the generator goes away, instead of when it finishes iterating over the file. And yet, people do it all the time.

    This is why with statements were added to Python. Except here, they don't work. Compare:

        with open('foo', 'r') as file:
            contents = file.read()
        doStuff(contents) # works
        with open('foo', 'r') as file:
            contents = (line.upper() for line in file)
        doStuff(contents) # raises a ValueError
    
    
    Of course in this trivial case, you can just indent the doStuff call, but in general, there is no inherent static scope for generator objects—they may get stored and used later (especially when they're passed into other generators). And even in the trivial case, the with statement isn't quite right—doStuff might iterate over contents, then do a bunch of other stuff before returning, and the file will be left open unnecessarily the whole time.

    Here's a slightly larger example. Imagine a mail server built around a traditional select reactor, where send_async pushes a file onto a queue that will be sent as the socket is ready, and closed when it's done:

        class Connection:
    
            def handle_get(self, message_id):
                path = os.path.join(mailbox_path, message_id)
                self.send_async(open(path, 'r'))
    

    Now imagine that you need to process each line of that message—e.g., to de-activate HTTP URLs, censor bad words, recognize phone numbers and turn them into links, whatever:

        class Connection:
            def handle_get(self, message_id):
                path = os.path.join(mailbox_path, message_id)
                self.send_async(self.process(line) for line in open(path, 'r'))
    

    Oops, now we're hosed—the reactor may close the generator, but that won't cause the file to get closed.

    The only way to do this correctly today is to write the explicit generator function instead of using a generator expression.


        class Connection:
            def handle_get(self, message_id):
                path = os.path.join(mailbox_path, message_id)
                def processed_file():
                    with open(path, 'r') as f:
                        for line in f:
                            yield self.process(line)
                self.send_async(processed_file())
    


    That's more verbose, less transparent even for experts, and impenetrable for new users—the same reason generator expressions were added to the language in the first place.

    But it's even worse here. In my experience, even people who know better rarely think to write the generator function. When you point out the leaky-open problem in their code, after trying the with statement and finding it doesn't work, they either figure out the right place to put in an explicit close, or they write a class to manage the file and the iteration together. In other words, they treat Python like C++, because Python doesn't offer any obvious right way to do it.
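
    For example, here's the kind of hand-rolled wrapper class people end up writing (a sketch; the class and its names are illustrative, not from any real codebase):

        class ProcessedFile:
            # Tie the file's lifetime to the iteration by hand.
            def __init__(self, path, process):
                self.file = open(path, 'r')
                self.process = process
            def __iter__(self):
                return self
            def __next__(self):
                try:
                    return self.process(next(self.file))
                except StopIteration:
                    self.file.close()
                    raise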

    But it could be this simple:


        class Connection:
            def handle_get(self, message_id):
                path = os.path.join(mailbox_path, message_id)
                self.send_async(self.process(line) 
                                with open(path, 'r') as f for line in f)

    Syntax

    The name "with" is pretty obvious; the question is, where to put it.

    Adding with to the expression the same way as "if" means the right place for it is to the left of the "for" clause, so that the top-down order of the nested statements in the equivalent generator function is reflected in the left-to-right order of the nested clauses in the expression.

    A number of people have suggested that it would look more readable with the with clause at the end:

        upperlines = (line.upper() for line in file with open('foo', 'r') as file)
        upperlines = (line.upper() with open('foo', 'r') as file for line in file)

    However, the latter is more consistent, and easier to add to the parser or explain as a rule to new users, so that's what I'm proposing, even if it doesn't look as nice in the simplest cases.

    Either way, looking over the syntax, it's obvious that this change would not affect the meaning of any valid generator expression; it would only make certain lines (that no one would ever have written) that are currently SyntaxErrors into valid expressions. And there shouldn't be anything tricky about adding it into the parser.

    Comprehensions

    This feature isn't actually necessary in list comprehensions and dictionary comprehensions, because the iteration is finished as soon as the expression completes. But it doesn't hurt, and it might make the language more consistent, easier to learn, and even easier to parse.

    If so, I suggest adding it to comprehensions as well, but I don't really care strongly either way.
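
    For example, under this proposal a comprehension like this:

        contents = [line.upper() with open('foo', 'r') as file for line in file]

    …would just be shorthand for this:

        with open('foo', 'r') as file:
            contents = [line.upper() for line in file]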

    With clauses vs. with expressions

    Nearly every person I've suggested this to has asked, "Why not just a general with expression instead?"

    At first glance, that sounds nice, because then you could also write this:

        contents = file.read() with open('foo', 'r') as file

    That's the motivating example for every suggestion for with expressions ever.

    But anything that can usefully be written using a with expression can be rewritten using a with statement:

        with open('foo', 'r') as file: contents = file.read()

    Anything that won't work using a with statement—because the context needs to match the dynamic lifetime of some other object defined in the expression—will not be helped by a with expression. This is obvious for most cases:

        contents = iter(file) with open('foo', 'r') as file

    This is clearly going to raise a ValueError as soon as you try to iterate contents. But it's just as true for generator expressions as for anything else. Consider what the following would mean if Python had a with expression:

        contents = (line for line in file with open('foo', 'r') as file)
        contents = (line for line in file) with open('foo', 'r') as file
        contents = (line with open('foo', 'r') as file for line in file)
    
    

    In the first one, the with expression modifies file, which means the file is closed before you even start iterating. In the second, it modifies the entire expression, which means the file is closed as soon as the generator is done being created. In the third, it modifies line, so there is no file to iterate over. There's no way to write what we need using a general with expression.

    Of course it would be possible to add both with clauses and with expressions, and write some parser magic to figure out that with above is a generator-expression with clause, not a with expression. (You'd also have to write the magic for comprehensions.) After all, ternary if expressions don't help define if clauses in generator expressions, but we have them both. But, even if this were a useful idea, it would be a completely unconnected idea, except for the fact that implementing both of them is slightly more difficult than just one. And for the reasons given above, I don't think general with expressions are a good idea in the first place.

    In short, with expressions wouldn't remove the need for with clauses in generator expressions; they'd just make it harder to add them to the language.

    Doesn't this violate DRY?

    A few people I've mentioned this idea to have objected on the basis that (line with open('foo', 'r') as file for line in file) makes you repeat both line and file.

    The case for the repetition of file here is exactly the same as for the repetition of line. Yes, it's mildly annoying in completely trivial cases, but there's no way around it if you ever want to write non-trivial cases. Compare:

        (line[2:] with open('foo', 'r') as file for line in file)
        (line with open('foo', 'r') as file for line in decompress(file))

    Besides, most of the people who make this argument are the kind of people who already avoid generator expressions because they believe (line[2:] for line in file) can be better written as:

        map(operator.itemgetter(slice(2, None)), file)

    I don't think there's any point catering to these people, whose real objection is usually "Lisp is a perfectly good language without the need for these new-fangled comprehensions and generator expressions"; the only reasonable answer is that Lisp was not actually outlawed when Haskell was invented.
