The fts interface from BSD is simpler to use than Python's os.walk for most cases. It's also more flexible, it provides more information, and it requires far fewer calls to stat, making it inherently faster (around 3x-7x on tests on different platforms) for many use cases.

Interface

The fts module's interface is a Pythonified version of the BSD (and probably future POSIX) fts API.

Changes from the BSD interface are:
  • fts.open in place of fts_open.
  • Constants, functions, and struct members lose the FTS_ and fts_ prefixes.
  • Failures raise OSError. (Error reading directories, etc., are generally not failures, and are returned in the error field of an FTSENT.)
  • All other functions become methods of the FTS object (e.g., f.read() instead of fts_read(f)).
  • Each options bitfield becomes a separate bool parameter.
  • Option physical is dropped because it's the opposite of logical (the C API requires exactly one of the two). This means physical is the default (instead of the default being an illegal case that returns an EINVAL error).
  • Options in negative form remove the double negative (e.g., instead of nochdir=False, it's chdir=True).
  • fts.open takes optional key and reverse parameters, instead of a compar parameter. The params are interpreted as with sorted, except that the default is directory order. You can of course use functools.cmp_to_key if you prefer to write a cmp-style function.
  • Because you can easily bind values into the key function, there is no need for the fts_set_clientptrfts_get_clientptr, and fts_get_stream functions, or for the fts_number and fts_pointer members of FTSENT.
  • FTS objects are context managers.
  • FTS objects are iterators, calling read() to get the next value.
  • children takes a nameonly bool parameter, instead of an option that must be NAMEONLY or 0.
  • chlldren returns a Python list of FTSENT objects, instead of a single FTSENT object that requires you to manually follow the link values until you reach NULL.  (However, if you want to, you can still follow the links.)
  • In place of set with three different options, there are three separate methods again, follow, and skip.
  • Passing chdir=True does not necessarily guarantee that fts will change directories. (This is already true in the C API, but it's not stated clearly, and it is guaranteed with most implementations, so people have written C code that relies on chdir.)
  • The info value NSOK will never be returned for any directory, or any readable symlink to a directory. (This is already implicit in the C API, but should be explicitly stated.) 
So, a quick summary of the interface is:

  • fts.open(paths, comfollow=False, logical=False, chdir=True, stat=True, seedot=False, xdev=True, key=None, reverse=None): returns an FTS object.
  • FTS.read() returns the next FTSENT, or None if there are no more.
  • FTS.children(nameonly=False) returns a list of FTSENTs.
  • FTS.close() closes the FTS and cleans up any resources.
  • Deleting the FTS, or using it as a context manager and exiting it, automatically closes.
  • Iterating the FTS repeatedly calls read() until None.
  • An FTSENT has at least the following fields (as defined in the BSD documentation): info, accpath, path, name, level, errno, parent, link, cycle, statp, plus an additional error field that holds an OSError object for the errno.
    • parentlink, and cycle may be weak references.
  • fts.walk(path, topdown=True, onerror=None, followlinks=False): a drop-in replacement for os.walk that's usually a few times faster.

Simple examples:

For very simple use cases, an FTS is just an iterator over FTSENT objects:

total = sum(entry.statp.st_size for entry in fts.open([path]))

However, you really should clean up FTS objects—just as with os.walk, you're leaking O(N) file handles for an N-level tree until the GC finds everything:

with fts.open([path]) as f:
    total = sum(entry.statp.st_size for entry in f)

As usual, if you're trying to do anything non-trivial, you probably want to explicitly loop over the iterator, or call a filtering generator function, etc., rather than using a generator expression:

with fts.open([path]) as f:
    totals = collections.Counter()
    for entry in f:
        totals[entry.statp.st_dev] += entry.statp.st_size
    return totals

with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.DP:
            os.rmdir(entry.accpath)
        elif entry.info != fts.D:
            os.remove(entry.accpath)

Comparison to os.walk

  • fts iterates a flat sequence of entries, rather than a sequence of sequences. This makes many use cases much simpler, although it does make a few use cases more complicated (e.g., if you really need those sub-sequences, you need an extra children call).
  • fts yields entry objects instead of just names. These include stat information (which os.walk actually fetches but doesn't expose to you), depth in the tree, and both of the obvious paths that you might try to create out of each name.
  • fts iterates a sequence of all entries of all kinds, rather than giving you separate sequences of directory and file entries. Again, this makes many uses cases much simpler, but makes a few more complicated, as you have to check entry.info to switch on the kind of entry.
  • fts lets you prune the walk with an explicit skip method, instead of requiring you to modify a list in place.
  • fts also lets you explicitly follow symlinks that would otherwise be skipped, and retry entries that failed (e.g., after asking the user to plug the external drive back in), as well as skipping.
  • fts yields directories both in pre-order and in post-order, instead of taking a parameter that specifies which order you want. You can ignore entries where info == fts.DP or info == fts.D to get the same effect as topdown or not topdown
  • fts has some options that are hard to reproduce simply or efficiently with walk.
  • fts can be many times faster, especially if you don't need the stat information.
  • fts is a context manager, so to clean up, you just do with fts.open([path]) as f, instead of with contextlib.closing(os.walk(path)) as w. Of course this isn't a big deal, but hopefully the change will encourage people not to leak their iterators as often.
Below is some code illustrating the differences:

total = sum(entry.statp.st_size for entry in fts.open([path]))

total = 0
for dirpath, dirnames, filenames in os.walk(path):
    total += sum(os.stat(os.path.join(dirpath, name)).st_size
                 for name in dirnames+filenames)


with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.DP:
            os.rmdir(entry.accpath)
        elif entry.info != fts.D:
            os.remove(entry.accpath)

with contextlib.closing(os.walk(path), topdown=False) as w:

    for dirpath, dirnames, filenames in w:
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
        for name in dirnames:
            os.rmdir(os.path.join(dirpath, name))        




with fts.open(path) as f:

    for entry in f:
        if entry.name.startswith('.'):
            f.skip(entry)
        else:
            doStuff(entry.accpath)



with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if name.startswith('.'):
                del dirnames[i]
            else:
                doStuff(os.path.join(dirpath, name))
        for name in filenames:
            if not name.startswith('.'):
                doStuff(os.path.join(dirpath, name))




with fts.open(path) as f:

    for entry in f:
        if entry.level > n:
            f.skip(entry)
        doStuff(entry.accpath)

introlen = len(os.path.abspath(path)) + 1
def getlevel(dirpath, dirname):
    dpath = os.path.abspath(os.path.join(dirpath, dirname))
    reldpath = dpath[introlen:]
    return len(os.path.split(reldpath)
with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if getlevel(dirpath, dirname) > n:
                del dirnames[i]
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))



with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.SL and 'foo' in entry.name:
            f.follow(entry)
        doStuff(entry.accpath)

with contextlib.closing(os.walk(path, followlinks=True)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if 'foo' not in name and os.path.islink(os.path.join(dirpath, name)):
                del dirnames[i]
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))

with fts.open([path]) as f:
    def handleerror(error):
        print(f.error)
        rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
        if rsa == 'r':
            f.again(entry)
        elif rsa == 'a':
            raise f.error
    for entry in f:
        if entry.info in (fts.DNR):
            handleerror(entry.error)
        doStuff(entry.accpath)

def handleerror(error):
    print(error)
    rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
    if rsa == 'r':
        # do another os.walk recursively from error.filename;
        # implementation left as an exercise for the reader 
    elif rsa == 'a':
        raise error
with contextlib.closing(os.walk(path, onerror=handleerror)) as w:
    for dirpath, dirnames, filenames in w:
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))


with fts.open([path], fts.XDEV) as f:
    for entry in f:
        doStuff(entry.accpath)

dev = os.stat(path).st_dev
with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))
        for i, name in reversed(enumerate(dirnames)):
            st = os.stat(os.path.join(dirpath, name))
            if st.dev != dev:
                del dirnames[i]

Implementation

On platforms with a native fts API that complies to BSD4.4, Python can just wrap the native API. I don't know of any platform that doesn't comply—in fact, I don't know of any that don't use minor forks of the same implementation—but very few of them document this compliance, which means we have to check them. FreeBSD, OpenBSD, OS X, Linux/glibc, and Cygwin are all compliant in the current versions, but we will have to check back to the oldest version of each platform compatible with Python x.y.

On POSIX-like platforms without a native fts API, the FreeBSD implementation is almost completely portable to anything near POSIX, and compatibly licensed, so we can simply include a copy of that fts.c to use on such platforms. There are a few optimizations that may need to be disabled for some POSIX-like systems:
  • If fchdir does not exist, fts_safe_changedir should be stubbed with a return 0, and FTS_NOCHDIR should always be set. This means the chdir=False option will not optimize anything.
  • If statfs does not exist, or does not return a usable f_fstypename, fts_ufslinks should be stubbed with a return 0. This means that the UFS-style-links optimization will not be available for UFS-like filesystems.
On Windows, fts.c actually doesn't look too hard to port. The main trick is that you get stat information immediately out of FindNextFile, but don't get d_type and friends, so the readdir loop in fts_build needs to use the stat info, and fts_stat needs to skip the actual stat calls because the info is already there, but I think that's it.

However, I think it might be better to port fts to pure Python. This would require an iterdir_stat type API that does something like os.listdir, except that it also returns each entry's d_type on POSIX and its stat on Windows (if neither, we can just call os.stat for everything and pretend we're on Windows). This implementation would then be portable to non-CPython, and to less common platforms.

For the fts.walk implementation, it should be something like this:


def walk(path, topdown=True, onerror=None, followlinks=False):
    level = None
    dirents, filenames = [], []
    dirpath = path
    with fts.open([path], logical=followlinks, stat=False,
                  chdir=False, comfollow=True) as f:
        for ent in f:
            if ent.level != level:
                dirnames = [dirent.name for dirent in dirents]
                yield dirpath, dirnames, filenames
                for dirent in dirents:
                    if dirent.name not in dirnames:
                        f.skip(dirent)
                level = ent.level
                dirents, filenames = [], []
                path = os.path.join(path, ent.path)
            else:
                if ent.info in (D, DC):
                    if topdown:
                        dirents.append(ent)
                elif ent.info == DP:
                    if not topdown:
                        dirents.append(ent)
                elif ent.info in (DNR, ERR):
                    if onerror:
                        onerror(ent.error) 
                elif ent.info == DOT:
                    pass
                else:
                    filenames.append(ent.name)
    dirnames = [dirent.name for dirent in dirents]
    yield dirpath, dirnames, filenames

This hasn't been tested or analyzed yet. If it can't be made to work right in all cases (in particular, handling users modifying dirnames), it may still be useful as a drop-in replacement for most uses of os.walk—those who need features that aren't emulated can continue to use os.walk, or they can rewrite their code to use fts directly rather than calling fts.walk.

0

Add a comment

Blog Archive
About Me
About Me
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.