In general, the answer is, "You can't. You really do want to create a new file." (There is an exception, which I'll get to later.)
Why? Well, files are just streams of bytes—or, for text files, streams of characters. If you try to edit a file by replacing a 60-character line with a 70-character line, you've just overwritten the first 10 characters of the next line.
Of course you can get around that by shifting the entire rest of the file up by 10 characters (or, equivalently and more simply, by reading the whole rest of the file after your line, then truncating before your line, then writing your new line, then writing the rest of the file), but that's not only complicated, it's very slow, and can require huge amounts of memory if you're dealing with huge files.
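In case it's not obvious why that's a bad idea, here's a minimal sketch of that shift-everything approach, assuming a text file small enough to hold the whole tail in memory (the function name and arguments are made up for illustration):

def replace_line_in_place(path, lineno, new_line):
    with open(path, 'r+') as f:
        for _ in range(lineno):
            f.readline()          # skip the lines before the one we're replacing
        start = f.tell()          # remember where the old line begins
        f.readline()              # skip past the old line
        rest = f.read()          # everything after it, all in memory
        f.seek(start)
        f.write(new_line + '\n' + rest)
        f.truncate()              # in case the file got shorter

Every call rewrites everything after the edited line, so editing near the top of a multi-gigabyte file means rewriting nearly the whole file.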
Using temporary files
But you don't want a new file, you want to change the original file. Your text editor doesn't make you give the file a new name every time you save, after all. So how does it let you insert text in the middle of a file?
How your text editor works
When you tell your text editor to save, it creates a new temporary file, then moves that temporary file over the original. To the end user, that looks like the file has been changed in place. But it does it without all the complexity and slowness of shifting most of the file around for every change.
If you think about it, the text editor obviously isn't editing the file in place for every change anyway; nothing gets changed on disk until you tell it to save.
A simple text editor will read the whole file into a big array (like a Python list or string) in memory and edit that in place. But that's still shifting a bunch of memory around for every insertion or deletion, which can be a problem for big files. A smarter text editor will use a data structure like a rope—a linked list of arrays. When you want to edit in the middle of one of those arrays, it just splits that array in half around the part you're replacing, and inserts a new one for your new text. An even smarter text editor will replace most of the arrays in that linked list with references to disk positions, only reading them into memory as needed. And there are other tricks a text editor can do to be even cleverer.
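As a rough illustration of the rope idea, here's a toy version using a plain Python list of string chunks; a real editor would use a linked list or tree with cached offsets, but the splitting trick is the same:

class Rope:
    def __init__(self, text, chunk_size=4096):
        self.chunks = [text[i:i+chunk_size]
                       for i in range(0, len(text), chunk_size)] or ['']

    def insert(self, pos, text):
        # Find the chunk containing pos and split it there; no other
        # chunk has to move or be copied.
        for i, chunk in enumerate(self.chunks):
            if pos <= len(chunk):
                self.chunks[i:i+1] = [chunk[:pos], text, chunk[pos:]]
                return
            pos -= len(chunk)
        self.chunks.append(text)    # pos was past the end

    def __str__(self):
        return ''.join(self.chunks)

So Rope('hello world').insert(5, ',') just splits one chunk into ['hello', ',', ' world'] instead of copying anything.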
But regardless of how it's storing things, when you save, it just writes that array to a temporary file, or walks the linked list and for each chunk writes the array or copies from the specified positions in the original file, etc.
Besides being simpler and a whole lot faster, this has other advantages. The biggest one is that saving becomes atomic. If you pull the plug in the middle of a save, obviously you've lost any changes that weren't saved yet—but you still have the previous version of the file around, which you wouldn't have if you tried to edit the file in place. This also makes it a lot easier to add features like autosaving, backup files, etc.
A concrete example
The most common example people who ask this question have is modifying a CSV file. You have a CSV file with columns for Spam, Eggs, Beans, and Bill, but now your prices have changed, so you want to replace the Bill value in each row with a different one, computed from the new prices. So:

import csv
import os
import tempfile

with tempfile.NamedTemporaryFile('w', dir='.', newline='', delete=False) as tmp, \
     open('breakfast.csv', newline='') as f:
    r = csv.reader(f)
    w = csv.writer(tmp)
    header = next(r)
    w.writerow(header)
    for row in r:
        bill = 2.39 * int(row[0]) + 1.98 * int(row[1]) + 0.33 * int(row[2])
        row[3] = format(bill, '.2f')
        w.writerow(row)
os.replace(tmp.name, 'breakfast.csv')

If you're using Python 3.2 or earlier (including 2.x), you don't have os.replace. In that case, see Getting atomic writes right.
Exception: fixed-length records
Some people are still going to say, "But it must be possible to edit a file in place. If nothing else, filesystems have to be able to do exactly that, right?"
And of course that's true. And in very simple cases, it can be worth doing. But as things get more complex, the tradeoff gets worse. So let's start with the simplest case.
Let's say you have a file made up of records, each one of which is exactly 100 bytes long. In that case, it's pretty easy to edit record #789: just seek to 789*100 bytes in, then write the new 100-byte record over the old one.
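In code, that's about as simple as file editing gets. A minimal sketch, with the record size and function names as assumptions for illustration, working on a binary file:

RECORD_SIZE = 100

def write_record(path, index, record):
    assert len(record) == RECORD_SIZE
    with open(path, 'r+b') as f:
        f.seek(index * RECORD_SIZE)    # jump straight to the record
        f.write(record)                # overwrite it in place

def read_record(path, index):
    with open(path, 'rb') as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE)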
(Note that a text file made up of 100-character lines does not have fixed-length records. 100 characters is only 100 bytes if every character happens to fit in 1 byte (for UTF-8, that means they all happen to be ASCII) and if your platform's newlines are "\n". If you're sure that's the case, then you should open the file as a binary file and encode/decode your strings with the 'ascii' codec.)
But what if you want to insert a record, or delete one? Well, if the records don't have any particular order, this isn't too much of a problem: you can always insert at the end, and delete by copying the last record over the deleted record and then truncating the last 100 bytes off the file. (Or, if you don't want to invalidate indexes or iterators you might have lying around elsewhere in your program, you can just mark the records deleted in some way, then compact at the end.)
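Here's a sketch of that unordered insert and swap-with-last delete, reusing RECORD_SIZE from the sketch above:

import os

def append_record(path, record):
    with open(path, 'ab') as f:
        f.write(record)

def delete_record(path, index):
    with open(path, 'r+b') as f:
        f.seek(0, os.SEEK_END)
        last = f.tell() - RECORD_SIZE       # offset of the last record
        if index * RECORD_SIZE != last:
            f.seek(last)
            moved = f.read(RECORD_SIZE)     # copy the last record...
            f.seek(index * RECORD_SIZE)
            f.write(moved)                  # ...over the deleted one
        f.truncate(last)                    # and chop off the tail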
But if the records have some meaningful order (e.g., they're sorted), then you're back to shifting the whole rest of the file up or down by 100 bytes, in which case you might as well do the atomic write-temporary-and-replace thing.
So, that's where things stop being simple. If you want to know how to get more complex, read on; if not, just skip the whole next section.
Fun with complex file formats
Chunks
What if you make your file format more complex? For example, let's say you leave large gaps in your file (possibly even huge gaps, if your platform supports sparse files), and you use a header every N records that says which records within that chunk are in use and which aren't. (You'll want the chunk size to be an even multiple of the system page size—if you don't know what that is, 4096 is a good number.)
Then you can insert and delete, and only on the rare occasions where you've inserted so much that you fill an entire chunk do you need to shift anything outside the current chunk. (And even then, if there's room in the next or previous chunk, you can get away with just shifting within two chunks.) You will occasionally have to rewrite most of the file, but for many use cases that will be very rare.
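To make that concrete, here's one possible chunk layout (all the sizes, and the bitmap header, are made up for this example): a 4096-byte chunk holding up to 32 fixed-size records, with a 4-byte bitmap saying which slots are in use:

import struct

CHUNK_SIZE = 4096
SLOT_SIZE = 126
SLOTS_PER_CHUNK = 32    # 4-byte bitmap + 32 * 126-byte slots fits in 4096

def insert_into_chunk(chunk, record):
    """Return a copy of chunk with record in the first free slot,
    or None if the chunk is full (the rare case that forces a shift)."""
    assert len(record) == SLOT_SIZE
    bitmap, = struct.unpack_from('>I', chunk, 0)
    for i in range(SLOTS_PER_CHUNK):
        if not bitmap & (1 << i):
            chunk = bytearray(chunk)
            struct.pack_into('>I', chunk, 0, bitmap | (1 << i))
            chunk[4 + i*SLOT_SIZE : 4 + (i+1)*SLOT_SIZE] = record
            return bytes(chunk)
    return None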
If you think about it, this is just a bigger version of the simple fixed-record format: each chunk is a record, and you almost never have to insert or delete a chunk in-place.
And in fact, there's really nothing forcing you to use fixed records within the chunk; as long as the chunks are fixed-length, you can store blocks of text in the chunks just as easily. In fact, many text editors do something similar for autosave files.
Indexes
Even if you "almost never" have to insert a chunk in-place, sometimes you'll have to. And since that's very slow, it could affect the responsiveness of your program. Or, even worse, if it happens while the user is trying to quit or shut down, they might just kill your program before you can finish.
Isn't there some way you can ask the filesystem to just add another block into the middle of your file? Well, technically, yes, just about any filesystem has to have a way to do that. But it's not part of any portable API (and definitely not part of Python). So, unless you want to write some complicated native code for every platform you need to run on, this isn't an option.
But think about how filesystems must do that: To a filesystem, your file is some kind of list of blocks, ideally but not necessarily contiguous.
And you can do the same thing yourself. Keep some kind of index—a mapping of chunks to seek positions. So instead of just multiplying 65109 by the chunk size, you look up 65109 in the index. Where do you keep this mapping? Well, you could keep it in a separate file, like "spam.idx" alongside "spam.dat". But you can also just store it at the beginning of the file. What happens when your index gets full? You just move the first data chunk to the end of the file (and update the index), so you have room for another index chunk.
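A sketch of what a lookup through such an index might look like, assuming a hypothetical layout where the first chunk of the file holds an array of 8-byte big-endian seek positions, one per logical chunk:

import struct

CHUNK_SIZE = 4096

def read_chunk(f, logical_index):
    f.seek(8 * logical_index)               # the entry in the index chunk
    pos, = struct.unpack('>Q', f.read(8))   # physical position of that chunk
    f.seek(pos)
    return f.read(CHUNK_SIZE)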
Notice that if you do this, adding new chunks is no longer so expensive, so you don't need large gaps anymore.
If your files are gigantic, you can do two or even more levels of indexing.
In fact, even if your files aren't gigantic, two levels of indexing may be worth doing. On a hard drive, seeking is much slower than reading sequentially, so if you can keep your reads and writes as local as possible, you can speed things up. (Of course this assumes that your filesystem generally keeps your files defragmented, so that adjacent chunks of a file are on adjacent disk blocks, but that's usually a decent assumption with modern filesystems.)
If your files are big enough, you also may want to take advantage of the fact that most of your chunks are in order and use run-length encoding or similar alternatives.
Free lists
If you do a lot of inserts and deletes, eventually you'll end up with multiple empty chunks, sometimes even runs of empty chunks, just lying around wasting space.
But you can easily add a free list to your index. When a chunk becomes empty, append it to the free list. Next time you need a new chunk, grab the last chunk off the free list and use that, only creating a new chunk if the free list is empty.
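In code, the allocation policy is just a few lines. This sketch reuses CHUNK_SIZE from above; the index object and its free_list field are hypothetical:

import os

def alloc_chunk(f, index):
    if index.free_list:
        return index.free_list.pop()    # reuse an empty chunk's position
    f.seek(0, os.SEEK_END)              # otherwise, grow the file
    pos = f.tell()
    f.write(b'\0' * CHUNK_SIZE)
    return pos

def free_chunk(index, pos):
    index.free_list.append(pos)         # just remember it for next time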
Defragmentation
The more inserts and deletes you do, the more your chunk list will be out of order. This means you spend more time seeking on the hard disk (which, again, is very slow), and it also means any run-length encoding or similar index compression you do becomes less useful.
The solution is the same one filesystems use. You can do an "offline defragmentation" by just reading the chunks in sequence order and writing them out to a new file. This also frees up all the chunks in your free list, shrinking the file. Or you can do "online defragmentation"—any time you notice a badly out-of-order run of chunks, make a note to defrag that part of the index as soon as you have idle time.
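Offline defragmentation is just the write-temporary-and-replace trick again. Here's a sketch in terms of the hypothetical read_chunk and index helpers from above (build_index_chunk is another made-up helper that serializes a fresh index):

import os
import tempfile

def defragment(path, index):
    with open(path, 'rb') as f, \
         tempfile.NamedTemporaryFile(dir='.', delete=False) as tmp:
        tmp.write(build_index_chunk(index))    # fresh, in-order index
        for i in range(index.num_chunks):
            tmp.write(read_chunk(f, i))        # chunks land sequentially
    os.replace(tmp.name, path)                 # free-list chunks just disappear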
Atomicity
Remember, the big advantage of write-temporary-and-replace was atomicity: at any given time, you've either got the original file or the new file, never half the new file or a corrupted file. Can you get the same advantage editing in place?
Sure, if you're willing to write more code.
If you're not using an index, and your chunk size is a filesystem block size, you can generally count on the fact that writing a block is an atomic operation. So, you read in the 4K chunk, modify it, write it out, and either it was written or the old version is there. When you need to insert or delete a chunk, you can use the write-temporary-and-replace trick to make that atomic.
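That read-modify-write of a single block might look like this (assuming, as above, that single-block, block-aligned writes are atomic on your filesystem, which you should verify for your platform):

BLOCK_SIZE = 4096

def edit_block(f, block_index, offset, new_bytes):
    assert offset + len(new_bytes) <= BLOCK_SIZE
    f.seek(block_index * BLOCK_SIZE)
    block = bytearray(f.read(BLOCK_SIZE))              # read...
    block[offset:offset+len(new_bytes)] = new_bytes    # ...modify...
    f.seek(block_index * BLOCK_SIZE)
    f.write(block)                                     # ...write back in one shot
    f.flush()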
What if you're using an index? Well, again, when you're just changing a single block, you can do that atomically. When you're inserting blocks, you insert them at the end, then you rewrite the index. If you fail before updating the index, those new blocks aren't referenced by anything. At next startup, you could present them to the user as "auto-recovery" data, or just toss them. As long as your "transactions" are just single values, atomicity and consistency are just a matter of doing things in the right order.
If the index has grown to more than one block, you might have to write more than one index block. The only way around that is to add multiple levels of indexing, so the top level is never larger than a single block. So you append the new chunk, then append a new version of the lower-level index chunk that refers to that chunk (if you've grown to more than 2 levels, there will be more such appends), then rewrite the top-level index.
A different option is to write to a journal, which is just a file where you record operations in sequential order, so at startup you can roll back (or, occasionally, complete) partial changes that happened before you crashed.
Finally, there are all kinds of useful ways you can add redundancy, to make it easier to detect and recover from corruption. This obviously isn't as good as avoiding it in the first place, but if you can't be 100% sure you're 100% atomic, a fallback never hurts.
Of course often you want atomicity (and the rest of ACID) for transactions that affect more than a single record. For that, you need to build a transaction system, which I won't get into here.
Concurrency
Writing a block may be atomic, but reading a block, making changes, and writing it back obviously isn't. If you're just worried about surviving crashes, this isn't a problem—either your program did 1 write, or it did 0. But if you have two programs (or two threads in your program, or even two interleaved operations on the same thread) both trying to edit the file, you've got a classic race condition: program 1 reads the index, program 2 reads the index, program 1 writes back its modified index, program 2 writes back its modified index, and now program 1's changes are lost.
The solution here is the same as anywhere else. The details are different on different platforms, but almost everywhere but Windows, the simplest version is pessimistic locking with flock:
import contextlib
import fcntl
import struct

@contextlib.contextmanager
def flocking(f, op):
    fcntl.flock(f, op)
    try:
        yield
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)

# atomic read
with flocking(f, fcntl.LOCK_SH):
    f.seek(pos)
    buf = f.read(4096)

# atomic edit
with flocking(f, fcntl.LOCK_EX):
    f.seek(pos)
    buf = bytearray(f.read(4096))
    buf[off:off+4] = struct.pack('>I', 23)
    f.seek(pos)
    f.write(buf)

This can cause a lot of contention if the changes you need to make may take a while. In that case, you use optimistic locking, or drop to the lower-level lockf (which lets you lock just part of a file instead of the whole thing), or both.
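For example, here's a sketch of using lockf to lock only the block you're editing, so writers working on different parts of the file don't contend with each other (the edit callback is a made-up convention for this example):

import fcntl

def edit_block_locked(f, pos, size, edit):
    fcntl.lockf(f, fcntl.LOCK_EX, size, pos)    # lock only [pos, pos+size)
    try:
        f.seek(pos)
        buf = bytearray(f.read(size))
        edit(buf)                               # caller mutates the buffer
        f.seek(pos)
        f.write(buf)
        f.flush()
    finally:
        fcntl.lockf(f, fcntl.LOCK_UN, size, pos)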
Do you really want to reinvent the wheel?
At this point, you've probably realized that you're doing most of the same work a database or a filesystem would do.
If you really want to continue on, your next step should be to take a look at the data structures and algorithms that databases and filesystems use. The key is to store your index, or possibly even your data itself, in a structure like a B-tree or B+ tree, or even a simple probing hash table. There are well-known algorithms for dealing with these data structures, and nice pre-built implementations (and the trees also give you sorting for free).
But there's an even simpler idea: Just use a database or a filesystem to do the work for you.
Using a database should be pretty obvious. You have all kinds of options, depending on what you want to do. Python comes with a simple key-value store (actually, a few variations) in the dbm package, and a complete relational database in the sqlite3 module, but there are plenty of more powerful key-value stores, relational databases, document-oriented databases, persistent graphs, and so on. Whatever you want probably already exists, and probably has a nice Python interface. (If the native interface to whatever you want isn't nice enough, there may be an adapter—e.g., SQLAlchemy can give you a functional wrapper or a complete ORM on top of most relational databases.) See relational database and NoSQL on Wikipedia for some ideas. All kinds of things that are hard to get right, much less optimize for each platform, will be automatically taken care of for you.
Using a filesystem is also very simple: Just store each record (or each chunk) in a separate file, and use a directory as an index. You can create multiple levels of indices by just creating subdirectories. (Depending on your filesystem, you may need to do this every time you get to a few thousand files, or even less.) An atomic insert is just creating a temporary and moving it in-place; an atomic delete is just os.remove; an atomic edit is a write-temporary-and-replace. When you're working with the filesystem (and the OS cache, etc.), a lot of things get a lot easier.
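A minimal sketch of that records-as-files idea (the flat one-directory layout is an assumption, and real code would need to sanitize keys before using them as file names):

import os
import tempfile

def put(dirpath, key, data):
    fd, tmppath = tempfile.mkstemp(dir=dirpath)
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(data)                             # write the temporary...
    os.replace(tmppath, os.path.join(dirpath, key))  # ...then move it in place

def get(dirpath, key):
    with open(os.path.join(dirpath, key), 'rb') as f:
        return f.read()

def delete(dirpath, key):
    os.remove(os.path.join(dirpath, key))           # atomic delete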