1. If you want to write a forking server, one obvious way to do it is to use the multiprocessing module to create a pool of processes, then hand off requests as you get them.

    The problem is that handling a request usually involves reading from and writing to a client socket. So, you need to send the socket object as part of the request. And you can't pickle sockets.

    There are easy ways around this for some cases:

    • For protocols without much data per request and response, there's an easy solution: the main process does the reads, dispatches a buffer instead of a socket, receives back a buffer, and does the writes. But that won't work for, e.g., streaming large files.
    • If your requests and responses are so big that the cost of process startup/teardown is irrelevant, you can just fork a new process for each job instead of using a pool. Then, as long as you haven't made your sockets non-inheritable, everything just works.
    • If you're using a custom protocol, it's pretty easy to give each child its own listener socket; the main process then just becomes a load balancer, listening on a well-known port, then telling the clients to reconnect to some other port. Then you don't need any socket migration.

    But really, the simplest solution is just sending the socket objects over the queue, right?

    Wrong. You can't pickle sockets.

    Why can't you pickle sockets?

    If you try, the first error you're going to get, on Python 2.x, is something about objects with __slots__ not being picklable. That one's easy to work around—use protocol 2 or higher (which already happens by default in 3.x), or write your own custom pickler for the socket type.

    The next problem is that the socket type is a wrapper around various other objects, some of which are C extension types, which aren't documented. You have to dig them out by introspection or by reading the code and writing picklers for them.

    But the biggest problem is that the main thing a socket does is wrap up a file descriptor (on POSIX) or a WinSock handle (on Windows). While there are some minor differences between the two, the basic idea is the same, so I'll just talk about file descriptors until we get to Windows details.

    A file descriptor is just a number. It's an index into a table of open files (including sockets, pipes, etc.) for your process that the kernel maintains. See Wikipedia for more details, but that should be enough to explain the problem. If your socket has file descriptor 23, and you send the number 23 to some other process, that's not going to mean anything. If you're lucky, the other process's file table doesn't have a #23, so you just get EBADF errors. If you're unlucky, #23 refers to some completely different file, and you end up with errors that are harder to track down, like sending one client's sensitive data to some random other client, or writing garbage into your config file.

    Can you send file descriptors?

    Yes! But it's not quite as easy as you'd like. And it's different on *nix and Windows. And it's different on different *nix platforms.

    Unix sockets

    On *nix platforms, the first thing you need is a Unix socket.

    Normally you'll use socketpair to create a socket for each child in the pool, and just inherit it across the fork. This is a bit annoying with multiprocessing.Pool, because it doesn't provide a way to hook the process creation, only to specify an initializer that gets run after creation; you basically have to subclass Process and override the start method. But that's not too hard.
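
    For example, here's a minimal sketch of that subclass (the class and attribute names are mine, and it assumes the default fork start method on *nix, so the child inherits the pair):
        import socket
        from multiprocessing import Process

        class SocketpairProcess(Process):
            """A Process that gets one end of a socketpair for fd passing."""
            def start(self):
                # Create the pair in the parent, right before forking, so the
                # child inherits both ends; self.child_end is what run() uses
                # (and run() should close the unused parent_end on its side).
                self.parent_end, self.child_end = socket.socketpair()
                super().start()
                self.child_end.close()   # the parent only keeps parent_end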

    Alternatively, you can just use a Unix socket with a non-anonymous name: create a filename using the tempfile module, then you can pickle and send that filename, then each side can create a socket(AF_UNIX) and call connect. But be warned that this may not work on all platforms; IIRC, at least one system (AIX?) required some special permission to send file descriptors over a non-anonymous Unix socket.
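
    A rough sketch of that variant (the filename handling is simplified, and the rendezvous details are mine):
        import os
        import socket
        import tempfile

        # Parent: pick a name and listen on it. The name is just a string, so
        # it can be pickled and sent through the pool's normal queue.
        path = os.path.join(tempfile.mkdtemp(), 'fdpass.sock')
        listener = socket.socket(socket.AF_UNIX)
        listener.bind(path)
        listener.listen(5)

        # Child: connect to the name it was given; the accepted connection on
        # the parent side is then the channel you pass fds over.
        chan = socket.socket(socket.AF_UNIX)
        chan.connect(path)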

    sendmsg

    The POSIX sendmsg function allows you to send message data plus ancillary data. The message data is just a list of buffers, but the ancillary data is a list of buffers tagged with a socket level and a message type (just like the socket levels and options in setsockopt, which you might be more familiar with). One of the message types, SCM_RIGHTS, is defined by POSIX as "Indicates that the data array contains the access rights to be sent or received."

    So, what are "access rights"? Well, it doesn't say anywhere in the standard. But the way almost every *nix system interprets this, it means that if you send an array of fd's with SCM_RIGHTS via sendmsg over a Unix-domain socket, the kernel will make the same files available, with the same access rights, to the receiver. (The kernel may also renumber the fd's on the way, so don't rely on the fact that file #23 on the sender comes out as file #23 on the receiver.)

    The code for this is pretty simple:
        import array
        import socket

        def sendfds(sock, *fds):
            # pack the fds into a single SCM_RIGHTS ancillary data item
            fda = array.array('i', fds).tobytes()
            sock.sendmsg([b'F'], # we have to send _something_
                         [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fda)])

        def recvfds(sock):
            # 4096 bytes of ancillary buffer is plenty for any sane number of fds
            msg, anc, flags, addr = sock.recvmsg(1, 4096)
            fds = []
            for level, type, data in anc:
                fda = array.array('i')
                # truncate any partial trailing int, per the stdlib docs
                fda.frombytes(data[:len(data) - (len(data) % fda.itemsize)])
                fds.extend(fda)
            return fds
    
    Notice that I went out of my way to send only one array of fds, but to receive multiple arrays on the other side. There are a lot of fiddly details that differ between *nix platforms; the usual rule about "be conservative in what you send, be liberal in what you accept" is extra-important here if you want your code to be portable.
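
    Once the fds arrive, the child still has to wrap each one back up in a socket object. In Python 3, the easy way is the fileno argument to the socket constructor; the family and type here are an assumption about what the parent sent (socket.fromfd also works, but it dups the descriptor):
        def rebuild_socket(fd):
            # adopt the received descriptor as a regular socket object
            return socket.socket(socket.AF_INET, socket.SOCK_STREAM, fileno=fd)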

    Some platforms have additional message types that (usually together with custom socket options) let you do more than just send file descriptors with sendmsg—you can pass credentials (user IDs, like letting a child sudo to you without needing a password), or verify credentials, or pass capabilities or quota privileges or all kinds of other things. But none of this is cross-platform beyond passing file descriptors with SCM_RIGHTS (and even that is not 100% portable, as mentioned above).

    Windows

    Windows doesn't have Unix sockets. Instead, it has a function WSADuplicateSocket, which can be used to create a shareable socket, and some opaque data that describes that socket. Unlike Unix, the magic isn't in how you pass the socket handle, it's in the key information embedded in that opaque data. Any process that gets hold of that opaque data can open the same shared socket.

    In Python 3.3+, this is dead simple: you call share on a socket, you get back some bytes, you pass them along in some way (e.g., pickling them and posting them on the queue), the child calls socket.fromshare, and that's it:

        import socket

        def sendsock(channel, pid, sock):
            # share() returns opaque bytes that only the target pid can use
            channel.put(sock.share(pid))

        def recvsock(channel):
            # fromshare is a module-level function, not a socket method
            return socket.fromshare(channel.get())

    If you need this to work on 3.2 or earlier, you can look at the 3.3 source, but the basic idea is pretty simple from the MSDN docs; it's just a matter of using win32api or ctypes to call the functions.

    Wrapping it up

    So, how are you going to wrap this up so you can just say "put this socket on the queue"?

    Well, you can't quite make it that simple. The problem is that you have to know which child is going to pick up the socket before you can pickle it (to get the appropriate pid or Unix socket). Once you know that, it's pretty easy—but of course with a normal pool, you don't know that until someone picks it up.

    One way to do this is to not pass the socket itself, but some kind of key that the child can use to request the socket. At that point, it writes back to you (on a pipe, or a separate queue, or whatever) and says, "I'm PID #69105, and I need socket #23", and you respond by doing the appropriate thing. This might be more readable wrapped up in a future-based API, but at that point you're writing your own SocketMigratingProcessPoolExecutor almost from scratch, so it may not be worth it.
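
    Here's a very rough sketch of that protocol on *nix, reusing the sendfds helper from above (all of the names, and the idea of keying pending sockets with a counter, are mine):
        import itertools

        pending = {}                    # key -> socket waiting to be migrated
        counter = itertools.count()

        def submit(task_queue, sock):
            # Main process: park the socket and put just the key on the queue.
            key = next(counter)
            pending[key] = sock
            task_queue.put(key)

        def handle_fd_request(request_queue, child_channels):
            # Main process: a child said "I'm pid X and I need socket #key".
            pid, key = request_queue.get()
            sock = pending.pop(key)
            sendfds(child_channels[pid], sock.fileno())
            sock.close()                # the child owns it now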

    With a lot less rewriting, you can probably modify either ProcessPoolExecutor or multiprocessing.Pool to add a short (depth-1?) queue per process and a queue manager thread in the main process that keeps these queues as full as possible. (Whenever a new task comes in, first look for an idle process, then fall back to a process that's not idle but has an empty per-process queue; if you find either, migrate the socket and add the task to the process's queue.)

    As you can see, this isn't going to be trivial, but there's no real conceptual difficulty.

  2. I was recently trying to optimize some C code using cachegrind, and discovered that branch misprediction in an inner loop was the culprit. I began wondering how much anything similar could affect Python code. After all, especially in CPython, every opcode is going through hundreds of lines of ceval code, with a giant switch and multiple branches, and so on, so, how much difference could a branch at the Python level make?

    To find out, I turned to the famous Stack Overflow question Why is processing a sorted array faster than an unsorted array? That question has a simple C++ test case where the code runs ~6x as fast on x86 and x86_64 platforms with most compilers, and it's been verified to be similar in other compiled languages like Java, C#, and Go.

    So, what about Python?

    The code

    The original question takes a 32K array of bytes, and sums the bytes that are >= 128:
        long long sum = 0;
        /* ... */
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
     
    In Python terms:
        total = 0
        for i in data:
            if i >= 128:
                total += i
    
    When run with a random array of bytes, this usually takes about 6x as long as when run with the same array pre-sorted. (Your mileage may vary, because there's a way the compiler can optimize away the branch for you (with a conditional move), at least on x86 and x86_64, and some compilers will, depending on your optimization settings, figure that out for you. If you want to know more, read the linked question.)

    So, here's a simple test driver:
        import functools
        import random
        import timeit
    
        def sumbig(data):
            total = 0
            for i in data:
                if i >= 128:
                    total += i
            return total
    
        def test(asize, reps):
            data = bytearray(random.randrange(256) for _ in range(asize))
            t0 = timeit.timeit(functools.partial(sumbig, data), number=reps)
            t1 = timeit.timeit(functools.partial(sumbig, bytearray(sorted(data))), number=reps)
            print(t0, t1)
    
        if __name__ == '__main__':
            import sys
            asize = int(sys.argv[1]) if len(sys.argv) > 1 else 32768
            reps = int(sys.argv[2]) if len(sys.argv) > 2 else 1000
            test(asize, reps)
    
    I tested this on a few different computers, using Apple pre-installed CPython 2.7.6, Python.org CPython 2.7.8 and 3.4.1, Homebrew CPython 3.4.0, and a custom build off trunk, and it pretty consistently saves about 12% to pre-sort the array. Nothing like the 84% savings in C++, but still, a lot more than you'd expect. After all, we're doing roughly 65x as much work in the CPython ceval loop as we were doing in the C++ loop, so you'd think the difference would be lost in the noise, and we also have much bigger loop bodies, so you'd think branch prediction wouldn't be helping as much in the first place. But if you watch the BR_MISP counters in Instruments, or do the equivalent with cachegrind, you'll see that it's mispredicting a lot more branches for the unsorted case than the sorted case—as in 1.9% of the total conditional branches in the ceval loop instead of 0.1%. Presumably, even though this is still pretty small, and nothing like what you see in C, the cost of each misprediction is also higher? It's hard to be sure…

    You'd expect a much bigger benefit in the other Python implementations. Jython and IronPython compile to JVM bytecode and .NET IL, which then get the same kind of JIT recompilation as Java and C#, so if those JITs aren't smart enough to optimize away the branch with a conditional move instruction for Java and C#, they won't be for Python code either. PyPy is JIT-compiling directly from Python—plus, its JIT is itself driven by tracing, which can be hurt by the equivalent of branch misprediction at a higher level. And in fact I found a difference of 54% in Jython 2.5.1, 68% in PyPy 2.3.1 (both the 2.7 and 3.2 versions), and 72% in PyPy 2.4.0 (2.7 only).

    C++ optimizations


    There are a number of ways to optimize this in C, but they all amount to the same thing: find some way to do a bit of extra work to avoid the conditional branch—bit-twiddling arithmetic, a lookup table, etc. The lookup table seems like the most likely version to help in Python, so here it is:
        table = bytearray(i if i >= 128 else 0 for i in range(256))
        total = 0
        for i in data:
            total += table[i]
    
    To put this inside a function, you'd want to build the table globally instead of once per call, then copy it to a local inside the function to avoid the slow name lookup inside the inner loop:
        _table = bytearray(i if i >= 128 else 0 for i in range(256))
        def sumbig_t(data):
            table = _table
            total = 0
            for i in data:
                total += table[i]
            return total
    
    The sorted and unsorted arrays are now about the same speed. And for PyPy, that's about 13% faster than with the sorted data in the original version. For CPython, on the other hand, it's 33% slower. I also tried the various bit-twiddling optimizations; they're slightly slower than the lookup table in PyPy, and at least 250% slower in CPython.
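
    For the curious, here's the kind of bit-twiddling I mean, translated into Python. It leans on the fact that bool is an int subclass, so -(i >= 128) is either 0 or -1 (all-zero or all-one bits), and there's no if in the loop at all:
        def sumbig_b(data):
            total = 0
            for i in data:
                total += -(i >= 128) & i    # mask instead of branch
            return total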

    So, what have we learned? Python, even CPython, can be affected by the same kinds of low-level performance problems as compiled languages. The alternative implementations can also handle those problems the same way. CPython can't… but then if this code were a bottleneck in your program, you'd almost certainly be switching to PyPy or writing a C extension anyway, right?

    Python optimizations

    There's an obvious Python-specific optimization here—which I wouldn't even really call an optimization, since it's also a more Pythonic way of writing the code:
        total = sum(i for i in data if i >= 128)
    
    This does in fact speed things up by about 13% in CPython, although it slows things down by 217% in PyPy. It leaves us with the same original difference between random and sorted arrays, and the same basic effects for applying the table or bit twiddling.

    You'd think that taking two passes, applying the table to the data and then summing the result, would obviously be slower, right? And normally you'd be right; a quick test shows a 31% slowdown. But if you think about it, when we're dealing with bytes, applying the table can be done with bytes.translate. And of course the sum is now just summing a builtin sequence, not a generator. So we effectively have two C loops instead of one Python loop, which may be a win:
        def sumbig_s(data):
            return sum(data.translate(bytearray(_table)))
    
    For CPython, this saves about 83% of the time vs. the original sumbig on unsorted data. And the difference between sorted and unsorted data is very small, so it's still an 81% savings on sorted data. However, for PyPy, it's 4% slower for unsorted data, and you lose the gain for sorted data.

    If you think about it, while we're doing that pass, we might as well just remove the small values instead of mapping them to 0:
        def sumbig_s2(data):
            return sum(data.translate(bytearray(_table), bytearray(range(128))))
    
    That ought to make the first loop a little slower, and affected by branch misprediction, while making the second loop twice as fast, right? And that's exactly what we see, in CPython. For unsorted data, it cuts 90% off the original code, and now sorting it gives us another 36% improvement. That's near PyPy speeds. On the other hand, in PyPy, it's just as slow for sorted data as the previous fix, and twice as slow for unsorted data, so it's even more of a pessimization.

    What about using numpy? The obvious implementation is:
        import numpy as np

        def sumbig_n(data):
            data = np.array(data)            # a bytearray becomes a uint8 array
            return data[data >= 128].sum()
    
    In CPython, this cuts 90% of the time off the original code for unsorted data, and for sorted data it cuts off another 48%. Either way, it's the fastest solution so far, even faster than using PyPy. But we can do the equivalent of translating the if to arithmetic or bit-twiddling too. I tried tricks like ~((data-128)>>31)&data, but the fastest turned out to be the simplest:
        def sumbig_n2(data):
            data = np.array(data)
            # multiplying by the boolean mask zeroes out the small values
            return (data * (data >= 128)).sum()
    
    Now just as fast for unsorted data as for sorted, cutting 94% of the time off the original.

    Conclusions

    So, what does this mean for you?

    Well, most of the time, nothing. If your code is too slow, and it's doing arithmetic inside a loop, and the algorithm itself can't be improved, the first step should almost always be converting to numpy, PyPy, or C, and often that'll be the last step as well.

    But it's good to know that the same issues that apply to low-level code still affect Python, and in some cases (especially if you're already using numpy or PyPy) the same solutions may help.

  3. A lot of people have questions like this:
    I've got a 100MB CSV file. I read it in to a list, populated with a csv.DictReader, and my computer starts swapping. Why?
    Let's look at what it takes to store a 100MB file as a list of dicts of strings.

    But first, let's look at how to solve this problem, instead of how to understand it.

    Very often, you don't actually need the whole thing in memory; you're just, e.g., reading the rows from one CSV file, processing each one independently, and writing out the result to a different CSV file. You can do that by using the csv.DictReader as an iterator—for each row in the reader, write a row to the writer, without storing it anywhere. Then you only need enough memory for one row at a time.
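
    For example, something like this processes one row at a time in constant memory (the filenames are placeholders, and the per-row work is whatever your real processing is):
        import csv

        with open('in.csv', newline='') as fin, \
             open('out.csv', 'w', newline='') as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # ... transform the values in row here ...
                writer.writerow(row)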

    Even if you do more global work (where one row's processing may depend on another row far away), do you really need this all in memory, or just in some accessible form? If you read the CSV and use it to iteratively build a database (whether a SQL database, a dbm, or your favorite kind of NoSQL store), you can get reasonably fast random access with bounded memory usage.
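
    A sketch of the database version, using the stdlib sqlite3 module (the table layout and column names are just an example):
        import csv
        import sqlite3

        db = sqlite3.connect('rows.db')
        db.execute('CREATE TABLE IF NOT EXISTS rows (spam TEXT, eggs TEXT, beans TEXT)')
        with open('in.csv', newline='') as f:
            reader = csv.DictReader(f)
            db.executemany('INSERT INTO rows VALUES (?, ?, ?)',
                           ((row['spam'], row['eggs'], row['beans']) for row in reader))
        db.commit()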

    Where does the memory go?

    First, let's make it specific. We've got 200K rows of 25 columns of an average of 20 characters. All of the characters are ASCII. We're using 64-bit CPython 3.4. (I'll explain how changing each of these assumptions can change things as I go along.)

    Reading the file into memory

    If you just open the file and write contents = f.read(), you get a 100MB string. That takes 100MB of memory for the string storage, plus 48 bytes for a str header, plus possibly 1 byte for a NUL terminator. So, still effectively just 100MB.

    How do I know this? Well, you can read the source, but it's easier to just call sys.getsizeof on an empty string.
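
    For example (the numbers here are from 64-bit CPython 3.4; other builds will differ a little):
        >>> import sys
        >>> sys.getsizeof('')          # 48-byte header plus the NUL
        49
        >>> sys.getsizeof('x' * 100)   # header + 100 bytes of ASCII data + NUL
        149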

    This only works because we're using CPython 3.3+ with all-ASCII data. Python 3.x strings are Unicode. In CPython 3.0-3.2, that means either 2 or 4 bytes per character, depending on your build, which would mean 200MB or 400MB. In 3.3 and later, strings are one byte per character if they're pure ASCII, two if they're pure BMP, and four otherwise. Of course you can stay with un-decoded bytes objects instead of strings—and in Python 2.7, that's often what you'd do.

    Splitting into lines

    If you read in a list of lines, with contents = list(f), you get 200K strings, each of which has that 48-byte header on top of its 500-byte string data. So, that's an extra 9.2MB.

    Plus, the list itself has a 48-byte header, and it also needs to store references to all of those strings. In CPython, each of those references is a pointer, so in 64-bit-land, that's 8 bytes. Also, lists have extra slack on the end. (Why? Expanding a list requires allocating a new chunk of memory, copying all the old values over, and freeing the old chunk. That's expensive. If you did that for every append, lists would be too slow to use. But if you leave room for extra values on each expansion—with the extra room proportional to the current size—the reallocations and copies amortize out to essentially free. The cost is a bit of wasted space.) Let's assume it's 220K pointers; that's an extra 1.7MB.
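
    You can see that slack directly (again, exact numbers are from 64-bit CPython 3.4):
        >>> import sys
        >>> lst = []
        >>> sys.getsizeof(lst)
        64
        >>> lst.append('a')
        >>> sys.getsizeof(lst)     # room for 4 pointers was allocated, not 1
        96
        >>> lst.append('b'); lst.append('c'); lst.append('d')
        >>> sys.getsizeof(lst)     # still the same allocation, no resize needed
        96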

    Splitting into columns

    If you split each line into 25 columns, whether by using contents = [line.split(',') for line in f] or by using contents = list(csv.reader(f)), you get 200K * 25 strings, each of which has that 48-byte header, plus 200K lists, each with a 48-byte header plus a bit over 25 pointers, plus of course the original list of 220K pointers. Now we're talking about 281MB.

    Storing in dicts

    If you convert each line into a dict instead of a list, e.g., by using contents = list(csv.DictReader(f)), you've got the same strings. (You don't get a bunch of separate copies of the key strings; each dict just references the same strings.)

    The only difference is that instead of 200K lists, you have 200K dicts. A dict is bigger than a list for three reasons. First, each slot in a dict has to hold a hash value, a key pointer, and a value pointer, instead of just a value pointer. Second, dicts are more complicated, so they have bigger headers. Finally, dicts need more slack than lists (because they're more expensive to copy when they expand, because you have to reprobe every hash instead of just copying a big buffer of memory, and because hash tables work best with particular sizes, like powers of 2, for their capacity).

    So, the 200K * (48 + 25 * 8) part is unchanged, but the 200K * (48 + 27 * 8) turns into 200K * (96 + 32 * 24). Now we're talking 624MB.

    Couldn't Python do better?

    One obvious thing Python could do is to store each dict as two pieces: a hash table full of keys and their hashes, which would be shared by all those 200K rows, plus a hash table full of just values, which would be separate for each row. CPython 3.3+ actually has the ability to do this, but it only gets triggered in special cases, like the attribute dictionaries of objects of the same type. (And yes, this means that if you replace that list of dicts with a list of instances of some class, you will actually save some memory.)
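
    The class trick is as simple as it sounds. Here's a sketch (assuming the header names are valid identifiers); because every instance assigns the same attributes in the same order, CPython's key-sharing dicts should let all 200K instance __dict__s share one key table:
        import csv

        class Row:
            def __init__(self, names, values):
                # one attribute per column; names come from the CSV header
                for name, value in zip(names, values):
                    setattr(self, name, value)

        with open('data.csv', newline='') as f:
            r = csv.reader(f)
            header = next(r)
            contents = [Row(header, row) for row in r]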

    You could take this even farther by using a shared hash table full of keys, hashes, and indices, and then each separate dict would just have an array of values, without the need for that extra slack. PyPy 2.0+ can do this, although I'm not sure exactly where it gets triggered (except that I am sure that the class trick would work here, and work even better than in CPython).

    Beyond that, is there a way we could get rid of the "boxing", the need to store both the strings and the array full of pointers to those strings?

    Well, if all of your column values are about the same length, then you could store the strings directly in the hash table (or array). See fixedhash for a quick&dirty implementation of a hash table with fixed-size keys. But typically, a CSV file has some columns that are only 2 or 3 characters long, and others that are 50 characters or longer, so you'd be wasting more space than you'd be saving.

    Failing that, you could replace the 8-byte strings with small integer indices into a giant string. How small could this be? Well, your string is 100MB long, so that takes 27 bits. If your longest column value is 64 characters long, you need another 7 bits for the length. So that's 34 bits. Accessing arrays by bits is slow and complicated, but by bytes can be reasonable, which means you can use 5 bytes instead of 8 for each string. This cuts 38% out of the cost of one of the smaller but still significant parts of your storage.
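
    A sketch of that packing (the 27+7-bit split matches the numbers above, rounded up to 5 whole bytes so the index stays a simple array):
        def pack_ref(offset, length):
            # 27 bits of offset and 7 bits of length, little-endian, in 5 bytes
            return ((offset << 7) | length).to_bytes(5, 'little')

        def unpack_ref(buf):
            v = int.from_bytes(buf, 'little')
            return v >> 7, v & 0x7f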

    But ultimately, you're talking about having 5 million objects alive. That's going to take a good chunk of memory to store. The only way to avoid that is to not store all those objects. 

  4. There are hundreds of questions on StackOverflow from people who want to know something like, "How do I edit a file in-place, without creating a new file."

    In general, the answer is, "You can't. You really do want to create a new file." (There is an exception, which I'll get to later.)

    Why? Well, files are just streams of bytes—or, for text files, streams of characters. If you try to edit a file by replacing a 60-character line with a 70-character line, you've just overwritten the first 10 characters of the next line.

    Of course you can get around that by shifting the entire rest of the file up by 10 characters (or, equivalently and more simply, by reading the whole rest of the file after your line, then truncating before your line, then writing your new line, then writing the rest of the file), but that's not only complicated, it's very slow, and can require huge amounts of memory if you're dealing with huge files.

    Using temporary files

    But you don't want a new file, you want to change the original file. Your text editor doesn't make you give the file a new name every time you save, after all. So how does it let you insert text in the middle of a file?

    How your text editor works

    When you tell your text editor to save, it creates a new temporary file, then moves that temporary over the original. To the end-user, that looks like the file has been changed in-place. But it does it without all the complexity and slowness of shifting most of the file around for every change.

    If you think about it, the text editor obviously isn't editing the file in place for every change anyway; nothing gets changed on disk until you tell it to save.

    A simple text editor will read the whole file into a big array (like a Python list or string) in memory, edit that in place. But that's still shifting a bunch of memory around for every insertion or deletion, which can be a problem for big files. A smarter text editor will use a data structure like a rope—a linked list of arrays. When you want to edit in the middle of one of those arrays, it just splits that array in half around the part you're replacing, and inserts a new one for your new text. An even smarter text editor will replace most of the arrays in that linked list with references to disk positions, only reading them into memory as-needed. And there are other tricks a text editor can do to be even cleverer.

    But regardless of how it's storing things, when you save, it just writes that array to a temporary file, or walks the linked list and for each chunk writes the array or copies from the specified positions in the original file, etc.

    Besides being simpler and a whole lot faster, this has other advantages. The biggest one is that saving becomes atomic. If you pull the plug in the middle of a save, obviously you've lost any changes that hadn't been saved yet—but you still have the previous version of the file around, which you wouldn't have if you tried to edit a file in place. This also makes it a lot easier to add features like autosaving, backup files, etc.

    A concrete example

    The most common example people who ask this question have is modifying a CSV file. You have a CSV file with columns for Spam, Eggs, Beans, and Bill, but now your prices have changed, so you want to replace the Bill value in each row with a different one, computed from the new prices. So:
        import csv
        import os
        import tempfile
        # the csv module wants text-mode files opened with newline='' in Python 3
        with tempfile.NamedTemporaryFile('w', dir='.', delete=False, newline='') as tmp, \
             open('breakfast.csv', newline='') as f:
            r = csv.reader(f)
            w = csv.writer(tmp)
            header = next(r)
            w.writerow(header)
            for row in r:
                bill = 2.39 * int(row[0]) + 1.98 * int(row[1]) + 0.33 * int(row[2])
                row[3] = format(bill, '.2f')
                w.writerow(row)
        os.replace(tmp.name, 'breakfast.csv')
    
    If you're using Python 3.2 or earlier (including 2.x), you don't have os.replace. In that case, see Getting atomic writes right.

    Exception: fixed-length records

    Some people are still going to say, "But it must be possible to edit a file in place. If nothing else, filesystems have to be able to do exactly that, right?"

    And of course that's true. And in very simple cases, it can be worth doing. But as things get more complex, the tradeoff gets worse. So let's start with the simplest case.

    Let's say you have a file made up of records, each one of which is exactly 100 bytes long. In that case, it's pretty easy to edit record #789: just seek to 789*100 bytes in, then write the new 100-byte record over the old one.

    (Note that a text file made up of 100-character lines does not have fixed-length records. 100 characters is only 100 bytes if your characters all happen to fit in 1 byte each (so, with UTF-8, only if they all happen to be ASCII), and if your platform's newlines are "\n". If you're sure that's the case, then you should open the file as a binary file, and encode/decode your strings with the 'ascii' codec.)
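
    In code, overwriting record #789 in a file of 100-byte records looks something like this (binary mode, per the note above; the filename is a placeholder):
        RECSIZE = 100

        def write_record(f, index, record):
            assert len(record) == RECSIZE
            f.seek(index * RECSIZE)
            f.write(record)              # overwrites exactly one record in place

        with open('records.dat', 'r+b') as f:
            write_record(f, 789, b'x' * RECSIZE)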

    But what if you want to insert a record, or delete one? Well, if the records don't have any particular order, this isn't too much of a problem: you can always insert at the end, and delete by copying the last record over the deleted record and then truncating the last 100 bytes off the file. (Or, if you don't want to invalidate indexes or iterators you might have lying around elsewhere in your program, you can just mark the records deleted in some way, then compact at the end.)

    But if the records have some meaningful order (e.g., they're sorted), then you're back to shifting the whole rest of the file up or down by 100 bytes, in which case you might as well do the atomic write-temporary-and-replace thing.

    So, that's where things stop being simple. If you want to know how to get more complex, read on; if not, just skip the whole next section.

    Fun with complex file formats

    Chunks

    What if you make your file format more complex? For example, let's say you leave large gaps in your file (possibly even huge gaps, if your platform supports sparse files), and you use a header every N records that says which records within that chunk are in use and which aren't. (You'll want the chunk size to be an even multiple of the system page size—if you don't know what that is, 4096 is a good number.)

    Then you can insert and delete, and only on the rare occasions where you've inserted so much that you fill an entire chunk do you need to shift anything outside the current chunk. (And even then, if there's room in the next or previous chunk, you can get away with just shifting within two chunks.) You will occasionally have to rewrite most of the file, but for many use cases that will be very rare.

    If you think about it, this is just a bigger version of the simple fixed-record format: each chunk is a record, and you almost never have to insert or delete a chunk in-place.

    And in fact, there's really nothing forcing you to use fixed records within the chunk; as long as the chunks are fixed-length, you can store blocks of text in the chunks just as easily. In fact, many text editors do something similar for autosave files.

    Indexes

    Even if you "almost never" have to insert a chunk in-place, sometimes you'll have to. And since that's very slow, it could affect the responsiveness of your program. Or, even worse, if it happens while the user is trying to quit or shut down, they might just kill your program before you can finish.

    Isn't there some way you can ask the filesystem to just add another block into the middle of your file? Well, technically, yes, just about any filesystem has to have a way to do that. But it's not part of any portable API (and definitely not part of Python). So, unless you want to write some complicated native code for every platform you need to run on, this isn't an option.

    But think about how filesystems must do that: To a filesystem, your file is some kind of list of blocks, ideally but not necessarily contiguous.

    And you can do the same thing yourself. Keep some kind of index—a mapping of chunks to seek positions. So instead of just multiplying 65109 by the chunk size, you look up 65109 in the index. Where do you keep this mapping? Well, you could keep it in a separate file, like "spam.idx" alongside "spam.dat". But you can also just store it at the beginning of the file. What happens when your index gets full? You just move the first data chunk to the end of the file (and update the index), so you have room for another index chunk.

    Notice that if you do this, adding new chunks is no longer so expensive, so you don't need large gaps anymore.

    If your files are gigantic, you can do two or even more levels of indexing.

    In fact, even if your files aren't gigantic, two levels of indexing may be worth doing. On a hard drive, seeking is much slower than reading sequentially, so if you can keep your reads and writes as local as possible, you can speed things up. (Of course this assumes that your filesystem generally keeps your files defragmented, so that adjacent chunks of a file are on adjacent disk blocks, but that's usually a decent assumption with modern filesystems.)

    If your files are big enough, you also may want to take advantage of the fact that most of your chunks are in order and use run-length encoding or similar alternatives.

    Free lists

    If you do a lot of inserts and deletes, eventually you'll end up with multiple empty chunks, sometimes even runs of empty chunks, just lying around wasting space.

    But you can easily add a free list to your index. When a chunk becomes empty, append it to the free list. Next time you need a new chunk, grab the last chunk off the free list and use that, only creating a new chunk if the free list is empty.

    Defragmentation

    The more inserts and deletes you do, the more your chunk list will be out of order. This means you spend more time seeking on the hard disk (which, again, is very slow), and it also means any run-length encoding or similar index compression you do becomes less useful.

    The solution is the same one filesystems use. You can do an "offline defragmentation" by just reading the chunks in sequence order and writing them out to a new file. This also frees up all the chunks in your free list, shrinking the file. Or you can do "online defragmentation"—any time you notice a badly out-of-order run of chunks, make a note to defrag that part of the index as soon as you have idle time.

    Atomicity

    Remember the big advantage of write-temporary-and-replace was atomicity: at any given time, you've either got the original file, or the new file, not half the new file, or a corrupted file. Can you get the same advantage editing in-place?

    Sure, if you're willing to write more code.

    If you're not using an index, and your chunk size is a filesystem block size, you can generally count on the fact that writing a block is an atomic operation. So, you read in the 4K chunk, modify it, write it out, and either it was written or the old version is there. When you need to insert or delete a chunk, you can use the write-temporary-and-replace trick to make that atomic.

    What if you're using an index? Well, again, when you're just changing a single block, you can do that atomically. When you're inserting blocks, you insert them at the end, then you rewrite the index. If you fail before updating the index, those new blocks aren't referenced by anything. At next startup, you could present them to the user as "auto-recovery" data, or just toss them. As long as your "transactions" are just single values, atomicity and consistency are just a matter of doing things in the right order.

    If the index has grown to more than one block, you might have to write more than one index block. The only way around that is to add multiple levels of indexing, so the top level is never larger than a single block. So you append the new chunk, then append a new version of the lower-level index chunk that refers to that chunk (if you've grown to more than 2 levels, there will be more such appends) then rewrite the top-level index.

    A different option is to write to a journal, which is just a file where you record operations in sequential order, so at startup you can roll back (or, occasionally, complete) partial changes that happened before you crashed.

    Finally, there are all kinds of useful ways you can add redundancy, to make it easier to detect and recover from corruption. This obviously isn't as good as avoiding it in the first place, but if you can't be 100% sure you're 100% atomic, a fallback never hurts.

    Of course often you want atomicity (and the rest of ACID) for transactions that affect more than a single record. For that, you need to build a transaction system, which I won't get into here.

    Concurrency

    Writing a block may be atomic, but reading a block, making changes, and writing it back obviously isn't. If you're just worried about surviving crashes, this isn't a problem—either your program did 1 write, or it did 0. But if you have two programs (or two threads in your program, or even two interleaved operations on the same thread) both trying to edit the file, you've got a classic race condition: program 1 reads the index, program 2 reads the index, program 1 writes back its modified index, program 2 writes back its modified index, and now program 1's changes are lost.

    The solution here is the same as anywhere else. The details are different on different platforms, but almost everywhere but Windows, the simplest version uses pessimistic flock:
        import contextlib
        import fcntl
        import struct

        @contextlib.contextmanager
        def flocking(f, op):
            fcntl.flock(f, op)    # blocks until the lock is granted
            try:
                yield
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)
    
        # atomic read
        with flocking(f, fcntl.LOCK_SH):
            f.seek(pos)
            buf = f.read(4096)
    
        # atomic edit
        with flocking(f, fcntl.LOCK_EX):
            f.seek(pos)
            buf = bytearray(f.read(4096))
            buf[off:off+4] = struct.pack('>I', 23)
            f.seek(pos)
            f.write(buf)
    
    This can cause a lot of contention if the changes you need to make may take a while. In that case, you use optimistic locking, or drop to the lower-level lockf (which lets you lock just part of a file instead of the whole thing), or both.

    Do you really want to reinvent the wheel?

    At this point, you've probably realized that you're doing most of the same work a database or a filesystem would do.

    If you really want to continue on, your next step should be to take a look at the data structures and algorithms that databases and filesystems use. The key is to store your index, or possibly even your data itself, in a structure like a B-tree or B+ tree, or even a simple probing hash table. There are well-known algorithms for dealing with these data structures, and nice pre-built implementations (and the trees also give you sorting for free).

    But there's an even simpler idea: Just use a database or a filesystem to do the work for you.

    Using a database should be pretty obvious. You have all kinds of options, depending on what you want to do. Python comes with a simple key-value store (actually, a few variations) in the dbm package, and a complete relational database in the sqlite3 module, but there are plenty of more powerful key-value stores, relational databases, document-oriented databases, persistent graphs, and so on. Whatever you want probably already exists, and probably has a nice Python interface. (If the native interface to whatever you want isn't nice enough, there may be an adapter—e.g., SQLAlchemy can give you a functional wrapper or a complete ORM on top of most relational databases.) See relational database and NoSQL on Wikipedia for some ideas. All kinds of things that are hard to get right, much less optimize for each platform, will be automatically taken care of for you.

    Using a filesystem is also very simple: Just store each record (or each chunk) in a separate file, and use a directory as an index. You can create multiple levels of indices by just creating subdirectories. (Depending on your filesystem, you may need to do this every time you get to a few thousand files, or even less.) An atomic insert is just creating a temporary and moving it in-place; an atomic delete is just os.remove; an atomic edit is a write-temporary-and-replace. When you're working with the filesystem (and the OS cache, etc.), a lot of things get a lot easier.
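
    For example, here's a trivial record-per-file store where every operation is atomic (the function names are mine, and error handling is left out):
        import os
        import tempfile

        def save_record(dirname, key, data):
            # the same write-temporary-and-replace trick as before
            fd, tmppath = tempfile.mkstemp(dir=dirname)
            with os.fdopen(fd, 'wb') as f:
                f.write(data)
            os.replace(tmppath, os.path.join(dirname, key))

        def load_record(dirname, key):
            with open(os.path.join(dirname, key), 'rb') as f:
                return f.read()

        def delete_record(dirname, key):
            os.remove(os.path.join(dirname, key))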
