There are hundreds of questions on StackOverflow from people who want to know something like, "How do I edit a file in-place, without creating a new file?"

In general, the answer is, "You can't. You really do want to create a new file." (There is an exception, which I'll get to later.)

Why? Well, files are just streams of bytes—or, for text files, streams of characters. If you try to edit a file by replacing a 60-character line with a 70-character line, you've just overwritten the first 10 characters of the next line.

Of course you can get around that by shifting the entire rest of the file up by 10 characters (or, equivalently and more simply, by reading the whole rest of the file after your line, then truncating before your line, then writing your new line, then writing the rest of the file), but that's not only complicated, it's very slow, and can require huge amounts of memory if you're dealing with huge files.

Using temporary files

But you don't want a new file, you want to change the original file. Your text editor doesn't make you give the file a new name every time you save, after all. So how does it let you insert text in the middle of a file?

How your text editor works

When you tell your text editor to save, it creates a new temporary file, then moves that temporary over the original. To the end-user, that looks like the file has been changed in-place. But it does it without all the complexity and slowness of shifting most of the file around for every change.

If you think about it, the text editor obviously isn't editing the file in place for every change anyway; nothing gets changed on disk until you tell it to save.

A simple text editor will read the whole file into a big array (like a Python list or string) in memory and edit that in place. But that's still shifting a bunch of memory around for every insertion or deletion, which can be a problem for big files. A smarter text editor will use a data structure like a rope—a linked list of arrays. When you want to edit in the middle of one of those arrays, it just splits that array in half around the part you're replacing and inserts a new one for your new text. An even smarter text editor will replace most of the arrays in that linked list with references to disk positions, only reading them into memory as needed. And there are other tricks a text editor can do to be even cleverer.
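
Here's a toy sketch of that linked-list-of-arrays idea (the class and names are mine; a real rope would be a balanced tree, and would handle deletion too):
    class PieceList:
        """A toy rope: a list of string pieces instead of one big string."""
        def __init__(self, text):
            self.pieces = [text]
        def _locate(self, pos):
            # Find the piece containing offset pos, and the offset within it.
            for i, piece in enumerate(self.pieces):
                if pos <= len(piece):
                    return i, pos
                pos -= len(piece)
            raise IndexError(pos)
        def insert(self, pos, text):
            # Split one piece around the insertion point; nothing outside
            # that one piece gets shifted.
            i, off = self._locate(pos)
            piece = self.pieces[i]
            self.pieces[i:i+1] = [piece[:off], text, piece[off:]]
        def __str__(self):
            return ''.join(self.pieces)
So, after p = PieceList('hello world'); p.insert(5, ','), there are three pieces—'hello', ',', ' world'—and none of the surrounding text was copied or shifted.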

But regardless of how it's storing things, when you save, it just writes that array to a temporary file, or walks the linked list and for each chunk writes the array or copies from the specified positions in the original file, etc.

Besides being simpler and a whole lot faster, this has other advantages. The biggest one is that saving becomes atomic. If you pull the plug in the middle of a save, obviously you've lost any changes that hadn't been saved yet—but you still have the previous version of the file around, which you wouldn't have if you tried to edit the file in place. This also makes it a lot easier to add features like autosaving, backup files, etc.

A concrete example

The most common example people who ask this question have is modifying a CSV file. You have a CSV file with columns for Spam, Eggs, Beans, and Bill, but now your prices have changed, so you want to replace the Bill value in each row with a different one, computed from the new prices. So:
    import csv
    import os
    import tempfile

    # Create the temporary in the same directory as the original, so the
    # final replace stays on one filesystem (and is therefore atomic).
    with tempfile.NamedTemporaryFile('w', dir='.', delete=False, newline='') as tmp, \
         open('breakfast.csv', newline='') as f:
        r = csv.reader(f)
        w = csv.writer(tmp)
        header = next(r)
        w.writerow(header)
        for row in r:
            bill = 2.39 * int(row[0]) + 1.98 * int(row[1]) + 0.33 * int(row[2])
            row[3] = format(bill, '.2f')
            w.writerow(row)
    os.replace(tmp.name, 'breakfast.csv')
If you're using Python 3.2 or earlier (including 2.x), you don't have os.replace. In that case, see Getting atomic writes right.

Exception: fixed-length records

Some people are still going to say, "But it must be possible to edit a file in place. If nothing else, filesystems have to be able to do exactly that, right?"

And of course that's true. And in very simple cases, it can be worth doing. But as things get more complex, the tradeoff gets worse. So let's start with the simplest case.

Let's say you have a file made up of records, each one of which is exactly 100 bytes long. In that case, it's pretty easy to edit record #789: just seek to 789*100 bytes in, then write the new 100-byte record over the old one.

(Note that a text file made up of 100-character lines does not have fixed-length records. 100 characters is only 100 bytes if your characters all happen to fit in 1 byte each (with UTF-8, that means they all happen to be ASCII), and if your platform's newlines are "\n". If you're sure that's the case, you should open the file as a binary file and encode/decode your strings with the 'ascii' codec.)
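
As a minimal sketch (the file name and the 100-byte record size are just placeholders):
    RECORD_SIZE = 100

    def read_record(f, n):
        # f must be opened in binary mode ('rb' or 'r+b').
        f.seek(n * RECORD_SIZE)
        return f.read(RECORD_SIZE)

    def write_record(f, n, record):
        # Overwrites record #n in place; f must be opened 'r+b'.
        assert len(record) == RECORD_SIZE
        f.seek(n * RECORD_SIZE)
        f.write(record)

    with open('records.dat', 'r+b') as f:
        rec = bytearray(read_record(f, 789))
        rec[:4] = b'SPAM'
        write_record(f, 789, bytes(rec))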

But what if you want to insert a record, or delete one? Well, if the records don't have any particular order, this isn't too much of a problem: you can always insert at the end, and delete by copying the last record over the deleted record and then truncating the last 100 bytes off the file. (Or, if you don't want to invalidate indexes or iterators you might have lying around elsewhere in your program, you can just mark the records deleted in some way, then compact at the end.)
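
Building on the helpers above, that swap-and-truncate delete is just a few lines:
    import os

    def delete_record(f, n):
        # Move the last record into slot n, then chop one record's worth
        # of bytes off the end of the file.
        f.seek(0, os.SEEK_END)
        count = f.tell() // RECORD_SIZE
        if n != count - 1:
            write_record(f, n, read_record(f, count - 1))
        f.truncate((count - 1) * RECORD_SIZE)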

But if the records have some meaningful order (e.g., they're sorted), then you're back to shifting the whole rest of the file up or down by 100 bytes, in which case you might as well do the atomic write-temporary-and-replace thing.

So, that's where things stop being simple. If you want to know how to get more complex, read on; if not, just skip the whole next section.

Fun with complex file formats

Chunks

What if you make your file format more complex? For example, let's say you leave large gaps in your file (possibly even huge gaps, if your platform supports sparse files), and you use a header every N records that says which records within that chunk are in use and which aren't. (You'll want the chunk size to be a multiple of the system page size—if you don't know what that is, 4096 is a good number.)
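
For example, the header could be a bitmap saying which record slots in the chunk are in use (this layout is just an illustration, not any standard format):
    import struct

    CHUNK_SIZE = 4096       # one system page
    RECORD_SIZE = 100
    RECORDS_PER_CHUNK = 32  # 4 + 32*100 = 3204 bytes used; the rest is slack
    HEADER = struct.Struct('>I')  # 32-bit in-use bitmap, one bit per slot

    def parse_chunk(chunk):
        bitmap, = HEADER.unpack_from(chunk)
        used = [bool(bitmap & (1 << i)) for i in range(RECORDS_PER_CHUNK)]
        records = [chunk[HEADER.size + i*RECORD_SIZE :
                         HEADER.size + (i+1)*RECORD_SIZE]
                   for i in range(RECORDS_PER_CHUNK)]
        return used, records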

Then you can insert and delete, and only on the rare occasions where you've inserted so much that you fill an entire chunk do you need to shift anything outside the current chunk. (And even then, if there's room in the next or previous chunk, you can get away with just shifting within two chunks.) You will occasionally have to rewrite most of the file, but for many use cases that will be very rare.

If you think about it, this is just a bigger version of the simple fixed-record format: each chunk is a record, and you almost never have to insert or delete a chunk in-place.

And in fact, there's really nothing forcing you to use fixed records within the chunk; as long as the chunks are fixed-length, you can store blocks of text in the chunks just as easily. In fact, many text editors do something similar for autosave files.

Indexes

Even if you "almost never" have to insert a chunk in-place, sometimes you'll have to. And since that's very slow, it could affect the responsiveness of your program. Or, even worse, if it happens while the user is trying to quit or shut down, they might just kill your program before you can finish.

Isn't there some way you can ask the filesystem to just add another block into the middle of your file? Well, technically, yes, just about any filesystem has to have a way to do that. But it's not part of any portable API (and definitely not part of Python). So, unless you want to write some complicated native code for every platform you need to run on, this isn't an option.

But think about how filesystems must do that: To a filesystem, your file is some kind of list of blocks, ideally but not necessarily contiguous.

And you can do the same thing yourself. Keep some kind of index—a mapping of chunks to seek positions. So instead of just multiplying 65109 by the chunk size, you look up 65109 in the index. Where do you keep this mapping? Well, you could keep it in a separate file, like "spam.idx" alongside "spam.dat". But you can also just store it at the beginning of the file. What happens when your index gets full? You just move the first data chunk to the end of the file (and update the index), so you have room for another index chunk.
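
A minimal sketch, assuming a single index chunk at the front of the file holding one 64-bit offset per chunk (with 0 meaning "no such chunk," since offset 0 is the index itself):
    import struct

    CHUNK_SIZE = 4096
    INDEX = struct.Struct('>512Q')  # 512 offsets * 8 bytes = one 4096-byte chunk

    def load_index(f):
        f.seek(0)
        return list(INDEX.unpack(f.read(INDEX.size)))

    def read_chunk(f, index, chunk_no):
        # Look the chunk up instead of multiplying by the chunk size.
        f.seek(index[chunk_no])
        return f.read(CHUNK_SIZE)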

Notice that if you do this, adding new chunks is no longer so expensive, so you don't need large gaps anymore.

If your files are gigantic, you can do two or even more levels of indexing.

In fact, even if your files aren't gigantic, two levels of indexing may be worth doing. On a hard drive, seeking is much slower than reading sequentially, so if you can keep your reads and writes as local as possible, you can speed things up. (Of course this assumes that your filesystem generally keeps your files defragmented, so that adjacent chunks of a file are on adjacent disk blocks, but that's usually a decent assumption with modern filesystems.)

If your files are big enough, you also may want to take advantage of the fact that most of your chunks are in order and use run-length encoding or similar alternatives.

Free lists

If you do a lot of inserts and deletes, eventually you'll end up with multiple empty chunks, sometimes even runs of empty chunks, just lying around wasting space.

But you can easily add a free list to your index. When a chunk becomes empty, append it to the free list. Next time you need a new chunk, grab the last chunk off the free list and use that, only creating a new chunk if the free list is empty.
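
In code, that's about as simple as it sounds (here the free list is just a Python list of file offsets; in the file itself, it would live alongside the index):
    import os

    def alloc_chunk(f, free_list):
        # Reuse a free chunk if there is one; otherwise grow the file.
        if free_list:
            return free_list.pop()
        f.seek(0, os.SEEK_END)
        return f.tell()

    def free_chunk(free_list, offset):
        free_list.append(offset)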

Defragmentation

The more inserts and deletes you do, the more your chunk list will be out of order. This means you spend more time seeking on the hard disk (which, again, is very slow), and it also means any run-length encoding or similar index compression you do becomes less useful.

The solution is the same one filesystems use. You can do an "offline defragmentation" by just reading the chunks in sequence order and writing them out to a new file. This also frees up all the chunks in your free list, shrinking the file. Or you can do "online defragmentation"—any time you notice a badly out-of-order run of chunks, make a note to defrag that part of the index as soon as you have idle time.
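
Offline defragmentation is just the write-temporary-and-replace trick again, reading chunks in logical order (a sketch, reusing the index helpers from above):
    import os
    import tempfile

    def defragment(path):
        with open(path, 'rb') as f, \
             tempfile.NamedTemporaryFile(dir='.', delete=False) as tmp:
            index = load_index(f)
            new_index = []
            tmp.seek(INDEX.size)  # leave room for the rewritten index
            for offset in index:
                if not offset:
                    continue      # skip unused index slots
                f.seek(offset)
                new_index.append(tmp.tell())
                tmp.write(f.read(CHUNK_SIZE))
            new_index += [0] * (512 - len(new_index))
            tmp.seek(0)
            tmp.write(INDEX.pack(*new_index))
        os.replace(tmp.name, path)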

Atomicity

Remember the big advantage of write-temporary-and-replace was atomicity: at any given time, you've either got the original file, or the new file, not half the new file, or a corrupted file. Can you get the same advantage editing in-place?

Sure, if you're willing to write more code.

If you're not using an index, and your chunk size is a filesystem block size, you can generally count on the fact that writing a block is an atomic operation. So, you read in the 4K chunk, modify it, write it out, and either it was written or the old version is there. When you need to insert or delete a chunk, you can use the write-temporary-and-replace trick to make that atomic.

What if you're using an index? Well, again, when you're just changing a single block, you can do that atomically. When you're inserting blocks, you insert them at the end, then you rewrite the index. If you fail before updating the index, those new blocks aren't referenced by anything. At next startup, you could present them to the user as "auto-recovery" data, or just toss them. As long as your "transactions" are just single values, atomicity and consistency are just a matter of doing things in the right order.

If the index has grown to more than one block, you might have to write more than one index block. The only way around that is to add multiple levels of indexing, so the top level is never larger than a single block. So you append the new chunk, then append a new version of the lower-level index chunk that refers to that chunk (if you've grown to more than 2 levels, there will be more such appends), then rewrite the top-level index.
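
The ordering is the whole trick: the new data has to be safely on disk before the index points at it. A sketch, using the single-level index from above (os.fsync is what forces the data out before the index write):
    import os

    def append_chunk(f, index, chunk_no, data):
        # 1. Append the new chunk. Nothing references it yet, so a crash
        #    here just leaves some unreferenced garbage at the end.
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
        # 2. Only now point the index at it.
        index[chunk_no] = offset
        f.seek(0)
        f.write(INDEX.pack(*index))
        f.flush()
        os.fsync(f.fileno())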

A different option is to write to a journal, which is just a file where you record operations in sequential order, so at startup you can roll back (or, occasionally, complete) partial changes that happened before you crashed.

Finally, there are all kinds of useful ways you can add redundancy, to make it easier to detect and recover from corruption. This obviously isn't as good as avoiding it in the first place, but if you can't be 100% sure you're 100% atomic, a fallback never hurts.

Of course often you want atomicity (and the rest of ACID) for transactions that affect more than a single record. For that, you need to build a transaction system, which I won't get into here.

Concurrency

Writing a block may be atomic, but reading a block, making changes, and writing it back obviously isn't. If you're just worried about surviving crashes, this isn't a problem—either your program did 1 write, or it did 0. But if you have two programs (or two threads in your program, or even two interleaved operations on the same thread) both trying to edit the file, you've got a classic race condition: program 1 reads the index, program 2 reads the index, program 1 writes back its modified index, program 2 writes back its modified index, and now program 1's changes are lost.

The solution here is the same as anywhere else. The details are different on different platforms, but almost everywhere but Windows, the simplest version uses pessimistic flock:
    import contextlib
    import fcntl
    import struct

    @contextlib.contextmanager
    def flocking(f, op):
        fcntl.flock(f, op)
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

    # atomic read
    with flocking(f, fcntl.LOCK_SH):
        f.seek(pos)
        buf = f.read(4096)

    # atomic edit
    with flocking(f, fcntl.LOCK_EX):
        f.seek(pos)
        buf = bytearray(f.read(4096))
        buf[off:off+4] = struct.pack('>I', 23)
        f.seek(pos)
        f.write(buf)
This can cause a lot of contention if the changes you need to make may take a while. In that case, you use optimistic locking, or drop to the lower-level lockf (which lets you lock just part of a file instead of the whole thing), or both.
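
A sketch of the byte-range version (lockf takes a length and a start offset, so here we lock just the one 4096-byte chunk at pos instead of the whole file):
    import fcntl
    import struct

    fcntl.lockf(f, fcntl.LOCK_EX, 4096, pos)
    try:
        f.seek(pos)
        buf = bytearray(f.read(4096))
        buf[off:off+4] = struct.pack('>I', 23)
        f.seek(pos)
        f.write(buf)
    finally:
        fcntl.lockf(f, fcntl.LOCK_UN, 4096, pos)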

Do you really want to reinvent the wheel?

At this point, you've probably realized that you're doing most of the same work a database or a filesystem would do.

If you really want to continue on, your next step should be to take a look at the data structures and algorithms that databases and filesystems use. The key is to store your index, or possibly even your data itself, in a structure like a B-tree or B+ tree, or even a simple probing hash table. There are well-known algorithms for dealing with these data structures, and nice pre-built implementations (and the trees also give you sorting for free).

But there's an even simpler idea: Just use a database or a filesystem to do the work for you.

Using a database should be pretty obvious. You have all kinds of options, depending on what you want to do. Python comes with a simple key-value store (actually, a few variations) in the dbm package, and a complete relational database in the sqlite3 module, but there are plenty of more powerful key-value stores, relational databases, document-oriented databases, persistent graphs, and so on. Whatever you want probably already exists, and probably has a nice Python interface. (If the native interface to whatever you want isn't nice enough, there may be an adapter—e.g., SQLAlchemy can give you a functional wrapper or a complete ORM on top of most relational databases.) See relational database and NoSQL on Wikipedia for some ideas. All kinds of things that are hard to get right, much less optimize for each platform, will be automatically taken care of for you.

Using a filesystem is also very simple: Just store each record (or each chunk) in a separate file, and use a directory as an index. You can create multiple levels of indices by just creating subdirectories. (Depending on your filesystem, you may need to do this every time you get to a few thousand files, or even less.) An atomic insert is just creating a temporary and moving it in-place; an atomic delete is just os.remove; an atomic edit is a write-temporary-and-replace. When you're working with the filesystem (and the OS cache, etc.), a lot of things get a lot easier.
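
A sketch of that, as a tiny key-value store (the directory name and helper names are just placeholders):
    import os
    import pathlib
    import tempfile

    STORE = pathlib.Path('store')

    def put(key, data):
        # Atomic insert or edit: write a temporary, then move it into place.
        fd, tmpname = tempfile.mkstemp(dir=STORE)
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
        os.replace(tmpname, STORE / key)

    def get(key):
        return (STORE / key).read_bytes()

    def delete(key):
        # Atomic delete.
        os.remove(STORE / key)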

It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.