The fts interface from BSD is simpler to use than Python's os.walk for most cases. It's also more flexible, it provides more information, and it requires far fewer calls to stat, making it inherently faster (around 3x-7x on tests on different platforms) for many use cases.

Interface

The fts module's interface is a Pythonified version of the BSD (and probably future POSIX) fts API.

Changes from the BSD interface are:
  • fts.open in place of fts_open.
  • Constants, functions, and struct members lose the FTS_ and fts_ prefixes.
  • Failures raise OSError. (Error reading directories, etc., are generally not failures, and are returned in the error field of an FTSENT.)
  • All other functions become methods of the FTS object (e.g., f.read() instead of fts_read(f)).
  • Each options bitfield becomes a separate bool parameter.
  • Option physical is dropped because it's the opposite of logical (the C API requires exactly one of the two). This means physical is the default (instead of the default being an illegal case that returns an EINVAL error).
  • Options in negative form remove the double negative (e.g., instead of nochdir=False, it's chdir=True).
  • fts.open takes optional key and reverse parameters, instead of a compar parameter. The params are interpreted as with sorted, except that the default is directory order. You can of course use functools.cmp_to_key if you prefer to write a cmp-style function.
  • Because you can easily bind values into the key function, there is no need for the fts_set_clientptrfts_get_clientptr, and fts_get_stream functions, or for the fts_number and fts_pointer members of FTSENT.
  • FTS objects are context managers.
  • FTS objects are iterators, calling read() to get the next value.
  • children takes a nameonly bool parameter, instead of an option that must be NAMEONLY or 0.
  • chlldren returns a Python list of FTSENT objects, instead of a single FTSENT object that requires you to manually follow the link values until you reach NULL.  (However, if you want to, you can still follow the links.)
  • In place of set with three different options, there are three separate methods again, follow, and skip.
  • Passing chdir=True does not necessarily guarantee that fts will change directories. (This is already true in the C API, but it's not stated clearly, and it is guaranteed with most implementations, so people have written C code that relies on chdir.)
  • The info value NSOK will never be returned for any directory, or any readable symlink to a directory. (This is already implicit in the C API, but should be explicitly stated.) 
So, a quick summary of the interface is:

  • fts.open(paths, comfollow=False, logical=False, chdir=True, stat=True, seedot=False, xdev=True, key=None, reverse=None): returns an FTS object.
  • FTS.read() returns the next FTSENT, or None if there are no more.
  • FTS.children(nameonly=False) returns a list of FTSENTs.
  • FTS.close() closes the FTS and cleans up any resources.
  • Deleting the FTS, or using it as a context manager and exiting it, automatically closes.
  • Iterating the FTS repeatedly calls read() until None.
  • An FTSENT has at least the following fields (as defined in the BSD documentation): info, accpath, path, name, level, errno, parent, link, cycle, statp, plus an additional error field that holds an OSError object for the errno.
    • parentlink, and cycle may be weak references.
  • fts.walk(path, topdown=True, onerror=None, followlinks=False): a drop-in replacement for os.walk that's usually a few times faster.

Simple examples:

For very simple use cases, an FTS is just an iterator over FTSENT objects:

total = sum(entry.statp.st_size for entry in fts.open([path]))

However, you really should clean up FTS objects—just as with os.walk, you're leaking O(N) file handles for an N-level tree until the GC finds everything:

with fts.open([path]) as f:
    total = sum(entry.statp.st_size for entry in f)

As usual, if you're trying to do anything non-trivial, you probably want to explicitly loop over the iterator, or call a filtering generator function, etc., rather than using a generator expression:

with fts.open([path]) as f:
    totals = collections.Counter()
    for entry in f:
        totals[entry.statp.st_dev] += entry.statp.st_size
    return totals

with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.DP:
            os.rmdir(entry.accpath)
        elif entry.info != fts.D:
            os.remove(entry.accpath)

Comparison to os.walk

  • fts iterates a flat sequence of entries, rather than a sequence of sequences. This makes many use cases much simpler, although it does make a few use cases more complicated (e.g., if you really need those sub-sequences, you need an extra children call).
  • fts yields entry objects instead of just names. These include stat information (which os.walk actually fetches but doesn't expose to you), depth in the tree, and both of the obvious paths that you might try to create out of each name.
  • fts iterates a sequence of all entries of all kinds, rather than giving you separate sequences of directory and file entries. Again, this makes many uses cases much simpler, but makes a few more complicated, as you have to check entry.info to switch on the kind of entry.
  • fts lets you prune the walk with an explicit skip method, instead of requiring you to modify a list in place.
  • fts also lets you explicitly follow symlinks that would otherwise be skipped, and retry entries that failed (e.g., after asking the user to plug the external drive back in), as well as skipping.
  • fts yields directories both in pre-order and in post-order, instead of taking a parameter that specifies which order you want. You can ignore entries where info == fts.DP or info == fts.D to get the same effect as topdown or not topdown
  • fts has some options that are hard to reproduce simply or efficiently with walk.
  • fts can be many times faster, especially if you don't need the stat information.
  • fts is a context manager, so to clean up, you just do with fts.open([path]) as f, instead of with contextlib.closing(os.walk(path)) as w. Of course this isn't a big deal, but hopefully the change will encourage people not to leak their iterators as often.
Below is some code illustrating the differences:

total = sum(entry.statp.st_size for entry in fts.open([path]))

total = 0
for dirpath, dirnames, filenames in os.walk(path):
    total += sum(os.stat(os.path.join(dirpath, name)).st_size
                 for name in dirnames+filenames)


with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.DP:
            os.rmdir(entry.accpath)
        elif entry.info != fts.D:
            os.remove(entry.accpath)

with contextlib.closing(os.walk(path), topdown=False) as w:

    for dirpath, dirnames, filenames in w:
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
        for name in dirnames:
            os.rmdir(os.path.join(dirpath, name))        




with fts.open(path) as f:

    for entry in f:
        if entry.name.startswith('.'):
            f.skip(entry)
        else:
            doStuff(entry.accpath)



with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if name.startswith('.'):
                del dirnames[i]
            else:
                doStuff(os.path.join(dirpath, name))
        for name in filenames:
            if not name.startswith('.'):
                doStuff(os.path.join(dirpath, name))




with fts.open(path) as f:

    for entry in f:
        if entry.level > n:
            f.skip(entry)
        doStuff(entry.accpath)

introlen = len(os.path.abspath(path)) + 1
def getlevel(dirpath, dirname):
    dpath = os.path.abspath(os.path.join(dirpath, dirname))
    reldpath = dpath[introlen:]
    return len(os.path.split(reldpath)
with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if getlevel(dirpath, dirname) > n:
                del dirnames[i]
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))



with fts.open([path]) as f:
    for entry in f:
        if entry.info == fts.SL and 'foo' in entry.name:
            f.follow(entry)
        doStuff(entry.accpath)

with contextlib.closing(os.walk(path, followlinks=True)) as w:
    for dirpath, dirnames, filenames in w:
        for i, name in reversed(enumerate(dirnames)):
            if 'foo' not in name and os.path.islink(os.path.join(dirpath, name)):
                del dirnames[i]
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))

with fts.open([path]) as f:
    def handleerror(error):
        print(f.error)
        rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
        if rsa == 'r':
            f.again(entry)
        elif rsa == 'a':
            raise f.error
    for entry in f:
        if entry.info in (fts.DNR):
            handleerror(entry.error)
        doStuff(entry.accpath)

def handleerror(error):
    print(error)
    rsa = input('(R)etry, (S)kip, or (A)bort?').lower()
    if rsa == 'r':
        # do another os.walk recursively from error.filename;
        # implementation left as an exercise for the reader 
    elif rsa == 'a':
        raise error
with contextlib.closing(os.walk(path, onerror=handleerror)) as w:
    for dirpath, dirnames, filenames in w:
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))


with fts.open([path], fts.XDEV) as f:
    for entry in f:
        doStuff(entry.accpath)

dev = os.stat(path).st_dev
with contextlib.closing(os.walk(path)) as w:
    for dirpath, dirnames, filenames in w:
        for name in dirnames + filenames:
            doStuff(os.path.join(dirpath, name))
        for i, name in reversed(enumerate(dirnames)):
            st = os.stat(os.path.join(dirpath, name))
            if st.dev != dev:
                del dirnames[i]

Implementation

On platforms with a native fts API that complies to BSD4.4, Python can just wrap the native API. I don't know of any platform that doesn't comply—in fact, I don't know of any that don't use minor forks of the same implementation—but very few of them document this compliance, which means we have to check them. FreeBSD, OpenBSD, OS X, Linux/glibc, and Cygwin are all compliant in the current versions, but we will have to check back to the oldest version of each platform compatible with Python x.y.

On POSIX-like platforms without a native fts API, the FreeBSD implementation is almost completely portable to anything near POSIX, and compatibly licensed, so we can simply include a copy of that fts.c to use on such platforms. There are a few optimizations that may need to be disabled for some POSIX-like systems:
  • If fchdir does not exist, fts_safe_changedir should be stubbed with a return 0, and FTS_NOCHDIR should always be set. This means the chdir=False option will not optimize anything.
  • If statfs does not exist, or does not return a usable f_fstypename, fts_ufslinks should be stubbed with a return 0. This means that the UFS-style-links optimization will not be available for UFS-like filesystems.
On Windows, fts.c actually doesn't look too hard to port. The main trick is that you get stat information immediately out of FindNextFile, but don't get d_type and friends, so the readdir loop in fts_build needs to use the stat info, and fts_stat needs to skip the actual stat calls because the info is already there, but I think that's it.

However, I think it might be better to port fts to pure Python. This would require an iterdir_stat type API that does something like os.listdir, except that it also returns each entry's d_type on POSIX and its stat on Windows (if neither, we can just call os.stat for everything and pretend we're on Windows). This implementation would then be portable to non-CPython, and to less common platforms.

For the fts.walk implementation, it should be something like this:


def walk(path, topdown=True, onerror=None, followlinks=False):
    level = None
    dirents, filenames = [], []
    dirpath = path
    with fts.open([path], logical=followlinks, stat=False,
                  chdir=False, comfollow=True) as f:
        for ent in f:
            if ent.level != level:
                dirnames = [dirent.name for dirent in dirents]
                yield dirpath, dirnames, filenames
                for dirent in dirents:
                    if dirent.name not in dirnames:
                        f.skip(dirent)
                level = ent.level
                dirents, filenames = [], []
                path = os.path.join(path, ent.path)
            else:
                if ent.info in (D, DC):
                    if topdown:
                        dirents.append(ent)
                elif ent.info == DP:
                    if not topdown:
                        dirents.append(ent)
                elif ent.info in (DNR, ERR):
                    if onerror:
                        onerror(ent.error) 
                elif ent.info == DOT:
                    pass
                else:
                    filenames.append(ent.name)
    dirnames = [dirent.name for dirent in dirents]
    yield dirpath, dirnames, filenames

This hasn't been tested or analyzed yet. If it can't be made to work right in all cases (in particular, handling users modifying dirnames), it may still be useful as a drop-in replacement for most uses of os.walk—those who need features that aren't emulated can continue to use os.walk, or they can rewrite their code to use fts directly rather than calling fts.walk.

0

Add a comment

It's been more than a decade since Typical Programmer Greg Jorgensen taught the word about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep once copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to this functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTextRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.
5

View comments

Blog Archive
About Me
About Me
Loading