1. Recently, there have been a few proposals to change Python's syntax to make it easier to avoid break and continue statements.

    The reasoning seems to be that many people are taught never to use break and continue, or to only have a single break in any loop. Some of these people are in fact forced to follow these rules for class assignments.

    These rules are ludicrous. The stdlib has hundreds of break and continue statements. There are dozens of them in the official tutorial, the library reference docs, Guido's blogs, etc. The reasons for avoiding break and continue in other languages don't apply to Python, and many of the ways people use to avoid them don't even exist in Python. Using break and continue appropriately is clearly Pythonic.
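    For example (my own toy loop, not one lifted from the stdlib), a search with a break and an early continue is about as clear as Python gets:

        def first_error(lines):
            error = None
            for line in lines:
                if not line.strip():
                    continue      # skip blank lines; keeps the real logic un-nested
                if line.startswith('ERROR'):
                    error = line
                    break         # found what we wanted; no flag variable needed
            return error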

    The python-ideas community tracked these rules back to some misguided attempts to apply the rules from MISRA-C to Python.

    It should be obvious that C and Python are very different languages, with different appropriate idioms, so applying MISRA-C to Python is insanely stupid. But, in case anyone doesn't get it, I've gone through the requirements and recommendations in MISRA-C 1998. By my count, more than half of them don't even mean anything in Python (e.g., rules about switch statements or #define macros), and more than half of the rest are bad rules that would encourage non-Pythonic code.

    I've paraphrased them, both to avoid violating the copyright on the document, and to express them in Python terms:
    • 8. req: No unicode, only bytes, especially in literals. (In other words, '\u5050' is bad; b'\xe5\x81\x90' is good.)
    • 12. rcm: Don't use the same identifier in multiple namespaces. (For example, io.open and gzip.open should not have the same name, nor should two different classes ever have members or methods with the same name.)
    • 13. rcm: Never use int when you can use long. (Only applies to Python 2.x.)
    • 18. rcm: Suffix numeric literals. (Only applies to Python 2.x.)
    • 20. req: Declare all variables and functions at the top of a module or function.
    • 31. req: Use curly braces in all initializers. (Sorry, no lists allowed, just dicts and sets.)
    • 33. req: Never use a user-defined function call on the right side of logical and/or.
    • 34. req: Never use anything but a primary expression on either side of logical and/or.
    • 37. req: Never use bitwise operators on signed types (like int).
    • 47. rcm: Never rely on operator precedence rules.
    • 48. rcm: Use explicit conversions when performing arithmetic between multiple types.
    • 49. rcm: Always test non-bool values against False instead of relying on truthiness.
    • 53. req: All statements should have a side-effect.
    • 57. req: No continue.
    • 58. req: No break.
    • 59. req: No one-liner if and loop bodies.
    • 60. rcm: No if/elif without else.
    • 67. rcm: Don't rebind the iterator in a for loop.
    • 68. req: All functions must be at file scope.
    • 69. req: Never use *args in functions.
    • 70. req: No recursion.
    • 82. rcm: Functions should have a single point of exit.
    • 83. req: No falling off the end of a function to return None.
    • 86. rcm: Always test function returns for errors.
    • 104. req: No passing functions around.
    • 118. req: Never allocate memory on the heap.
    • 119. req: Never rely on the errno in an OSError (or handle FileNotFoundError separately, etc.).
    • 121. req: Don't use locale functions.
    • 123. req: Don't use signals.
    • 124. req: Don't use stdin, stdout, etc., including print.
    • 126. req: Don't use sys.exit, os.environ, os.system, or subprocess.*.
    • 127. req: Never use Unix-style timestamps (e.g., time.time()).



  2. Sometimes, you need to split a program into two parts.

    For many cases--say, one step in your process briefly uses a ton of memory and you want it released back to the system, or you want to parallelize part of the process to take advantage of multiple cores--you just use multiprocessing or concurrent.futures.
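    For example, here's a minimal concurrent.futures sketch (crunch and its inputs are made up) that farms a CPU-heavy step out to worker processes, whose memory goes back to the system when the pool shuts down:

        from concurrent.futures import ProcessPoolExecutor

        def crunch(item):          # stand-in for the expensive step
            return item * item

        if __name__ == '__main__':
            with ProcessPoolExecutor() as pool:
                results = list(pool.map(crunch, range(1000)))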

    But sometimes that doesn't work. Maybe the reason you want to split your code is that you need Python 3.3 features for one part, but a 2.7-only module for another part. Or part of your program needs Java or .NET, but another part needs a C extension module. And so on.

    Example

    To make this concrete, let's take one example: You've written a cool GUI in Jython, but now you discover that you need to call a function out of a C library. The library is named mylibrary; it's at /usr/local/lib/libmylibrary.so.2, and it defines two functions:

        long unhex(const char *hexstr) {
            return strtol(hexstr, NULL, 16);
        }
    
        long parse_int(const char *intstr, int base) {
            return strtol(intstr, NULL, base);
        }
    

    You could do this by using JNI to bridge C to Java and then Jython's Java API to bridge that through to your Jython, but let's say you already know how to use ctypes, and you want to use that.

    If you were just using CPython or PyPy, you could call unhex like this:

        import ctypes

        mylibrary = ctypes.CDLL('/usr/local/lib/libmylibrary.so.2')
        mylibrary.unhex.argtypes = [ctypes.c_char_p]
        mylibrary.unhex.restype = ctypes.c_long

        if __name__ == '__main__':
            import sys
            for arg in sys.argv[1:]:
                # c_char_p wants bytes, so encode the command-line argument
                print(mylibrary.unhex(arg.encode()))
    

    But in Jython, you can't do that, because there's no ctypes.

    Fun with subprocess

    In this simple case, all you want to do is run one function in a different Python interpreter. As long as the input is just a few strings, and the output is just a string, all you need is subprocess.check_output:
        import subprocess
    
        def unhex(hexstr):
            return subprocess.check_output(['python', 'mylibrary_wrapper.py', hexstr])
    

    Obviously you can use 'python3' or 'jython' or '/usr/local/bin/pypy' or '/opt/local/custompython/bin/python' or r'C:\CustomPython\Python.exe' or whatever in place of 'python' there.

    If you need to get back more than one string as output, as long as you can easily encode it into a string, that's pretty easy. For example, let's say you wanted to unhex multiple strings:

        def unhex(*hexstrs):
            return subprocess.check_output(
                ['python', 'mylibrary_wrapper.py'] + list(hexstrs)).splitlines()
    

    You can also encode input this way. There are limitations on what you can pass in through command-line arguments, but you can always pass things through stdin. For example, change the above program to:
            for line in sys.stdin:
                print(mylibrary.unhex(line.strip().encode()))
    

    And now you can pass it a whole mess of strings without worrying about the command-line argument limits:

        def unhex(*hexstrs):
            with subprocess.Popen(['python', 'mylibrary_wrapper.py'],
                                  stdin=subprocess.PIPE, stdout=subprocess.PIPE) as p:
                out, _ = p.communicate('\n'.join(hexstrs))
                return out.splitlines()
    

    But what if you need to call the function thousands of times, and not all at once? The cost of starting up and shutting down thousands of Python interpreters may be prohibitive.

    In that case, the answer is some form of RPC. You kick off a background program that stays running in the background, listening on a socket (or pipe, or whatever). Then, whenever you need to call on it, you send it a message over that socket, and it replies.

    Running a service

    For really trivial cases, you can build a trivial protocol that runs directly over sockets. For really complicated cases, you may want to build a custom protocol around something like Twisted. But for everything in the middle, it may be simpler to just piggyback on a protocol that already exists and has ready-to-go implementations.

    For example, let's use JSON-RPC directly over sockets, through the bjsonrpc library.

    First, we need to build the server. Take the wrapper script above, leave the ctypes stuff alone, and replace the sys.argv or sys.stdin stuff with:
        import bjsonrpc
        from bjsonrpc.handlers import BaseHandler
    
        class MyLibraryHandler(BaseHandler):
            def unhex(self, hexstr):
                return mylibrary.unhex(hexstr)
    
        s = bjsonrpc.createserver(port=12345, handler_factory=MyLibraryHandler)
        s.serve()
    

    Now, in your Jython code, you can do this:

        import subprocess
        import bjsonrpc
    
        class MyLibraryClient(object):
            def __init__(self):
                self.proc = subprocess.Popen(['python', 'mylibrary_wrapper.py'])
                self.conn = bjsonrpc.connect(port=12345)
            def close(self):
                self.conn.close()
                self.proc.kill()
            def unhex(self, hexstr):
                return self.conn.call.unhex(hexstr)
    

    And that's it.

    If you want to extend this to expose parse_int as well as unhex, you just need to wrap the ctypes function, add another method to the MyLibraryHandler and MyLibraryClient, and you can call it.
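    For instance, a sketch of those three additions (same names as above; remember the handler runs in the CPython process and the client in Jython):

        # In mylibrary_wrapper.py, declare the second ctypes function:
        mylibrary.parse_int.argtypes = [ctypes.c_char_p, ctypes.c_int]
        mylibrary.parse_int.restype = ctypes.c_long

        class MyLibraryHandler(BaseHandler):
            # ... unhex as before ...
            def parse_int(self, intstr, base):
                return mylibrary.parse_int(intstr, base)

        # And on the Jython side:
        class MyLibraryClient(object):
            # ... __init__, close, and unhex as before ...
            def parse_int(self, intstr, base):
                return self.conn.call.parse_int(intstr, base)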

    Automating the process

    If you're wrapping up 78 functions in 5 different libraries that are under heavy development and keep changing, it will get very tedious (and error-prone and brittle) to add the same information in 3 places. You can make the ctypes stuff a lot easier by replacing it with a custom C extension module using, say, Cython, SWIG, SIP, or Boost.Python, or make it less brittle by using cffi. But what do you do about the server and client code?

    Well, first, notice that you don't really need the wrappers in the client. self.conn.call is already a dynamic wrapper around whatever the server happens to export.

    And on the server side, you're just delegating calls from self to mylibrary. You can build those delegating methods up at start time, or use your favorite other technique for delegation.
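    Here's one hedged sketch of the build-at-start-time approach (reusing mylibrary and BaseHandler from the server script, and assuming every exported function takes only positional arguments):

        def _make_delegate(name):
            func = getattr(mylibrary, name)
            def delegate(self, *args):
                return func(*args)
            delegate.__name__ = name
            return delegate

        class MyLibraryHandler(BaseHandler):
            pass

        for _name in ('unhex', 'parse_int'):
            setattr(MyLibraryHandler, _name, _make_delegate(_name))

    Adding a function to the library then only means adding its name to that tuple (plus the ctypes declarations).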

    If you want to get really crazy, you can write the interface in an IDL dialect and generate the C headers, C implementation stubs, ctypes/cffi/SIP/whatever wrappers, server wrappers, and client wrappers all out of the same source.

    Of course you probably don't want to get really crazy, but the point is that you can. You've built an RPC server, and all of the powerful features of RPC and network servers are available if you need them.

  3. Look at this familiar code:

        class Foo(object):
            def __init__(self, a):
                self.a = a
            def bar(self, b):
                return self.a + b
        foo = Foo(1)
    

    How do __init__ and bar get that self parameter?

    Unbound methods

    Well, bar is just a plain old function. (I'll just talk about bar for simplicity, but everything is also true for __init__, except for one minor detail I'll get to at the end.)

    Foo.bar is a method, but it's an "unbound method"—that is, it's not bound to any particular Foo instance yet. As it turns out, these are exactly the same objects as plain old functions. (This wasn't true in Python 2.x, but I'll get to that later.)

    You can call unbound methods just like any other function—but to do so, you have to pass an extra argument explicitly as self:
        >>> Foo.bar
        <function __main__.bar>
        >>> Foo.bar(foo, 2)
        3
    
    You can even save them as plain old variables outside a class.
        >>> bar = Foo.bar
        >>> bar(foo, 2)
        3
    
    This also means you can monkeypatch a class to add new methods very easily:
        >>> def baz(self):
        ...     return self.a * 2
        >>> Foo.baz = baz
        >>> Foo.baz(foo)
        2
        >>> foo.baz()
        2
    
    This even affects existing instances of the class (as long as they haven't shadowed the method with an instance variable by assigning to self.baz somewhere).

    Bound methods

    While Foo.bar is the same thing as bar, foo.bar is _not_ the same thing. It's a method, not a function:
        >>> foo.bar
        <bound method Foo.bar of <__main__.Foo object at 0x1066463d0>>
    
    As you may be able to guess from the repr, a bound method wraps up a function and an object. And you can even pull them out:
        >>> foo.bar.__func__ is Foo.bar
        True
        >>> foo.bar.__self__ is foo
        True
    
    This means that monkeypatching an instance is a bit more complicated, because you have to build a method. How do you do that?
    Well, you could create a whole new class with the method, create an instance of that class, and copy the method from the new instance to the one you want to patch. But that's pretty ugly.

    Think about how you construct other types. To make a Foo, you just call Foo(1). To make an int, you just call int('1'). The same goes for list, str, bytearray, and so on.

    But what's the type of a method? Well, it's "method", but there's no built-in name bound to that.

    For types that aren't normally useful, but occasionally are, Python hides the names, but gives us the types module to access them. So:
        >>> import types
        >>> help(types.MethodType)
        Help on class method in module builtins:
    
        class method(object)
         | method(function, instance)
        …
    
        >>> foo.baz = types.MethodType(baz, foo)
        >>> foo.baz()
        2
    

    How do bound methods work?

    Any type can be callable, not just functions. You can define your own callable types just by defining a __call__ method. So, you can simulate a bound method pretty easily:
        class BoundMethod(object):
            def __init__(self, function, instance):
                self.__func__, self.__self__ = function, instance
            def __call__(self, *args, **kwargs):
                return self.__func__(self.__self__, *args, **kwargs)
    
    Now you can use this exactly like the above:
        >>> foo.baz = BoundMethod(baz, foo)
        >>> foo.baz()
        2
    

    How do bound methods get built?

    Everything above, you could simulate yourself, without knowing anything deep about Python.

    But there's one piece you can't. How is it that Foo.bar is an unbound method, but foo.bar is a bound method?

    The obvious (and wrong) answer would be: When constructing a class instance, Python could create a bound method out of each unbound method and copy them in. That would be easy. But that wouldn't explain why adding Foo.baz made foo.baz work, even though foo had already been created.

    In fact, you can look at the __dict__ for the objects and see that Python hasn't done this; the only thing that exists is the unbound method on Foo:
        >>> Foo.__dict__
        {'bar': <function Foo.bar at 0x1086259e0>, '__dict__': <attribute '__dict__' of 'Foo' objects>, …}
        >>> foo.__dict__
        {'a': 1, 'baz': <__main__.BoundMethod at 0x106646c10>}
    
    The foo.baz that we created and added explicitly is there, but foo.bar isn't there. It's inherited from Foo.bar, just like any class attribute.

    Except that normally, a class attribute doesn't magically change value or type when accessed from an instance:
        >>> class Spam(object):
        ...     eggs = 2
        >>> spam = Spam()
        >>> spam.eggs
        2
    
    So, why is this different if the attribute is a function?

    Descriptors

    The secret is descriptors.

    Descriptors have a reputation for being scary, deep magic. But once you understand what they're good for, it's not too hard to understand how they work.

    Every value in Python can have __get__, __set__, and __delete__ methods.

    When you access a class attribute through an instance, if that attribute has a __get__ method, it gets called with the instance and the class, and whatever __get__ returns is what you see.

    So, a function's __get__ method works like this:
        def __get__(self, instance, owner):
            return types.MethodType(self, instance)
    
    And that's nearly all there is to it. I've cheated a little bit (e.g., the same __get__ has to return a bound method when accessed on an instance, but the plain function when accessed directly on the class), but it's all pretty simple.

    Putting it all together:

    When you ask for foo.bar, Python looks in foo.__dict__, and doesn't find anything. So then it goes to foo's class and looks in Foo.__dict__, and finds something named "bar". Because "bar" was accessed through the class dictionary, Python calls Foo.bar.__get__(foo, Foo), which returns a bound method.
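    You can do the same lookup by hand and watch the bound method come out (continuing the session above):

        >>> Foo.__dict__['bar'].__get__(foo, Foo)
        <bound method Foo.bar of <__main__.Foo object at 0x1066463d0>>
        >>> Foo.__dict__['bar'].__get__(foo, Foo)(2)
        3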

    A classmethod is just a wrapper around a function whose __get__ returns types.MethodType(self.__func__, cls), which means you end up with a method bound to the class rather than to an instance. A staticmethod is a wrapper whose __get__ just returns the wrapped function unchanged. A property is an object whose __get__, __set__, and __delete__ call the functions you defined in your @property. And so on. When you look behind the curtain, the wizard isn't that scary at all.
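    If you want to see that concretely, here's a rough sketch of classmethod and staticmethod written as plain descriptors (ignoring details the real builtins handle, like introspection support):

        import types

        class my_classmethod(object):
            def __init__(self, func):
                self.func = func
            def __get__(self, instance, owner):
                # Bind to the class, whether accessed on the class or on an instance.
                return types.MethodType(self.func, owner)

        class my_staticmethod(object):
            def __init__(self, func):
                self.func = func
            def __get__(self, instance, owner):
                # No binding at all; just hand back the plain function.
                return self.func

    Decorate a def with either of these and, for simple cases, it behaves like the builtin.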

    You can read more about descriptors in the Descriptor HowTo Guide.

    History

    You may have noticed that we have actual types for functions and bound methods, but unbound methods are just the same thing as functions. Why even have a name for something if it's just a confusing synonym of something everyone understands?

    You may have also noticed that __get__ takes an owner parameter that nobody uses.

    In Python 2.x, unbound methods were a different type from functions, and very closely related to bound methods.

    A Python 3 bound method has attributes __func__ and __self__. In Python 2, these were called im_func and im_self, and there was a third called im_class. A method with None for im_self was an unbound method; otherwise, it was a bound method. You should be able to imagine how functions and @classmethods and @staticmethods implemented __get__.

    As it turns out, unbound methods don't add much. They make calls a little slower, make it as hard to monkeypatch classes (which is relatively common) as instances (which is uncommon), and add a whole new concept that you need to understand the complexities of. What do you get in exchange? Basically, just the fact that they can tell you which class they're part of. But really, the only thing you want to do with that is display it—which __qualname__ does a much better job at—or try to use it for pickling—which doesn't work anyway.

    Python 2.x also has some wrinkles dealing with classic classes, but these only come up in three cases:
    • Ancient code written to be compatible with Python 1.5-2.1.
    • Simple bugs by novices who don't know how to write a new-style class.
    • Troll code by people who insist that they prefer classic classes just because the core devs and everyone else in the Python community disagrees.
    If you want to know the details, read New-style and classic classes.

    What was that about __init__?

    I was hoping you'd forget…

    Python reserves the right to treat special methods specially.

    What's a special method? This isn't actually quite defined anywhere. But basically, it's any method whose name begins and ends with double underscores, and whose purpose is to be called by the language itself or by some builtin code. (Note that most implementations provide a way for extension modules to add new builtin code—that's what the C API, the Jython bridge, etc. are all about.)

    How are they treated specially? Basically, for some special methods, in some cases, Python ignores __getattr__, __getattribute__ and any other attribute mechanism besides __dict__, and skips the instance __dict__ to go straight to the class and its base classes.
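    Here's a quick way to see that specialness in CPython: an attribute stuck onto the instance is visible to ordinary lookup, but len() skips the instance dict and goes straight to the class.

        >>> class Sized(object):
        ...     def __len__(self):
        ...         return 3
        >>> s = Sized()
        >>> s.__len__ = lambda: 99    # lives only in the instance dict
        >>> s.__len__()               # ordinary lookup finds the instance attribute
        99
        >>> len(s)                    # special method lookup goes straight to the class
        3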

    For different implementations, and even different versions of the same implementation, the set of methods and cases and the exact details of the specialness differ. If you want to know what CPython does, look at the source to _PyObject_LookupSpecial, and grep for calls to it in the Python, Objects, and Modules directories.

    Anyway, in CPython 3.3, when __init__ is called as part of object construction (and you haven't overridden __new__), the lookup goes through _PyObject_LookupSpecial; in other cases, it's a normal lookup.

    Of course it kind of makes sense for __init__. Normally, an instance's dict is empty until __init__ runs, and methods like __getattr__ often need setup that's done in __init__ before they can work.
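    For example (a toy class of mine, not anything from CPython), a __getattr__ that reads from a dict set up in __init__ can't do anything useful before __init__ has run; worse, the self._fields lookup would itself fall back into __getattr__ and recurse:

        class Record(object):
            def __init__(self, **fields):
                self._fields = fields
            def __getattr__(self, name):
                # Only safe once __init__ has put _fields into the instance dict.
                try:
                    return self._fields[name]
                except KeyError:
                    raise AttributeError(name)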

  4. Never call readlines() on a file

    Calling readlines() makes your code slower, less explicit, and less concise, for absolutely no benefit.

    There are hundreds of questions on places like StackOverflow about the readlines method, and in every case, the answer is the same.
    "My code is takes forever before it even gets started, but it's pretty fast once it gets going."
    That's because you're calling readlines.
    "My code seems to be worse than linear on the size of the input, even though it's just a simple loop."
    That's because you're calling readlines.
    "My code can't handle giant files because it runs out of memory."
    That's because you're calling readlines.
    "My second call to readlines returns nothing."
    That's not directly because you're calling readlines; it's because you're trying to read from a file whose read pointer is at the end. But the reason it's not obvious why this can't work is that you're using readlines.

    In fact, even if you don't have any of these problems, you should not use readlines, because it never gives you any advantage.

    What's wrong with readlines()?

    The whole point of readlines() is that it reads the entire file into memory at once and parses it into a list.

    So, you can't do anything else until it's read and parsed the whole file. This is why your program takes a while to start: reading files is slow. If you let Python and your OS interleave the "waiting for the disk" part with the "running your code" part, it will get started almost immediately, and often go a lot faster overall.

    And meanwhile, you're using up memory to store the whole list at once. In fact, you need enough memory to hold the original data, the strings built out of it, the list built out of those strings, and various bits of temporary storage. (Although the temporary storage goes away once readlines is done, the giant list of strings doesn't.) That's why you run out of memory.

    Also, all that memory allocation takes time. If you only use a bit of memory at a time, Python can keep reusing it over and over; if you use a bunch of memory at the same time, Python has to find room for all of it, causing it to call malloc more often. You're also making Python fight with your OS's disk cache. And if you allocate too much, you can cause your system to start swapping. That's why your time seems superlinear—it's actually linear, except for a few cliffs that you fall off along the way: needing to malloc, needing to swap, etc. And those transitions completely swamp everything else and make it hard (and pointless) to measure the linear part.

    It can get even worse if you're calling readlines() on a file-like object that has to do some processing. For example, if you call it on the result of a gzip.open, it has to read and decompress the entire file, which means even more startup delay, even more temporary memory wasted, and even more opportunity for interleaving lost.

    So what should I use?

    99% of the time, the answer is to just use the file itself. As the documentation says:
    Note that it’s already possible to iterate on file objects using for line in file:... without calling file.readlines().
    The reason you're calling readlines is to get an iterable full of lines, right? A file is already an iterable full of lines. And it's a smart iterable, reading lines as you need them, with some clever buffering under the covers.

    The following two blocks of code do almost the same thing:

        with open('foo.txt') as f:
            for line in f.readlines():
                dostuff(line)
    
        with open('foo.txt') as f:
            for line in f:
                dostuff(line)
    

    Both of them call dostuff on each line in foo.txt. The only difference is that the first one reads all of foo.txt into memory before starting to loop, while the second one just reads a buffer at a time, automatically, while looping.

    What if I actually need a list rather than just some arbitrary iterable?

    Make a list out of it:

        with open('foo.txt') as f:
            lines = list(f)
    

    This has exactly the same effect as calling f.readlines(), but it makes it explicit that you wanted a list, in exactly the same way you make that explicit anywhere else (e.g., calling an itertools function, or Python 3.x's map or filter).

    What about calling readlines with a sizehint?

    There's nothing wrong with that. It's often a useful optimization or simplification.

    For example, consider this code using a multiprocessing.Pool:

        with open('foo.txt') as f:
            pool.map(func, f, chunksize=104)
    

    It's a bit silly to break the file down into lines just to chunk them back up together. Also, this won't give you chunks of about-equal size unless your lines are of about-equal length. So, this may turn out to be a lot better:

        from functools import partial

        with open('foo.txt') as f:
            pool.map(func, iter(partial(f.readlines, 8192), []), chunksize=1)
    

    Of course I'd probably wrap up that iterable to make it more readable—many Python programmers have to stop and think to follow partial or two-argument iter on their own, much less the two combined. But the idea is that, instead of reading line by line and building chunks of 104 lines in hopes that will often be around 8K, we just read 8K worth of lines at a time.
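    That wrapper might look something like this (a sketch with a made-up name, chunked_lines, still using the pool and func from above):

        def chunked_lines(f, sizehint=8192):
            # Yield lists of lines, reading roughly sizehint bytes at a time.
            while True:
                lines = f.readlines(sizehint)
                if not lines:
                    return
                yield lines

        with open('foo.txt') as f:
            pool.map(func, chunked_lines(f), chunksize=1)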

    What if I need to be compatible with older Python?

    Files have been iterable since 2.3. That's over a decade old. That's the version that RHEL 4 came with.

    If you really have to work with all of Python 2.1-2.7 (and don't mind breaking 3.x), you can use f.xreadlines instead. (Note that in 2.3+, f.xreadlines() just returns f, so there's no real harm in calling it—it's just silly to do so if you don't need to.) If you have to work with 2.0 or 1.x, you'll need to write your own custom buffer-and-splitlines code.

    But you really don't. Nobody's going to be running your new script on a system from the last century.

    What about that other 1% of the time?

    There are various other possibilities that come up sometimes (besides the ones described above), where using the file as an iterator is not the answer:

    Often you're trying to parse a CSV file or XML or something, and the right fix is to use csv or ElementTree or whatever in the first place instead of trying to do it yourself.

    Sometimes, you need to call readline() in a loop (or, equivalently, use something like iter(f.readline, '')). However, this most often comes up with sys.stdin, in which case you're probably doing it wrong in the first place—maybe input is what you wanted here?

    Sometimes, you really need to mmap the whole file (or, worse, use a sliding mmap window because you've got a 32-bit Python) and find the line breaks explicitly.

    But in none of these cases is f.readlines() going to be any better than f. It's going to have the exact same problems, plus another problem on top.

  5. You don't actually have to be a dummy to not get list comprehensions. Only a few languages (after Python, the next most popular is probably Haskell) support them. And they're only "easy" once you learn to think a different way.

    You do eventually want to learn to think that way. (In fact, you want to go beyond that and think about generator expressions as pipelines for transforming iterators, instead of thinking about sequences at all.) But you usually can't jump up a level of abstraction without some practice. When you first learn a foreign language, you learn how to translate the vocabulary and syntax into your native language; it's only after you spend enough time trying to speak and understand it that you can actually think directly in the new language. And the same goes for each new abstraction you learn in math class.
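    As a small taste of that pipeline style (my example, not part of the walkthrough below), each generator expression here lazily feeds the next, and no intermediate list ever gets built:

    with open('access.log') as f:
        stripped = (line.rstrip('\n') for line in f)
        errors = (line for line in stripped if 'ERROR' in line)
        codes = (line.split()[-1] for line in errors)
        for code in codes:
            print(code)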

    So, the easiest way to learn list comprehensions is to convert them to explicit loops in your head—or, if necessary, on paper/in a text editor. Once you get more comfortable with them, you'll be able to read them directly, but until then, just convert each one you see.

    Starting with the simplest case, a list comprehension like this:

    a = [func(element) for element in sequence]
    

    Is equivalent to:

    a = []
    for element in sequence:
        a.append(func(element))
    

    Just as you can add additional for loops and if clauses under the top-level for loop, you can add them to the comprehension.

    The key thing to understand is that the left-to-right order in the comprehension maps in the same order to explicit loops:

    a = [func(element) for subseq in seq2d for element in subseq if pred(element)]
    
    a = []
    for subseq in seq2d:
        for element in subseq:
            if pred(element):
                a.append(func(element))
    

    Dictionary and set comprehensions work pretty much the same way:

    a = {element: func(element) for element in sequence}
    
    a = {}
    for element in sequence:
        a[element] = func(element)
    
    b = {func(element) for element in sequence}
    
    b = set()
    for element in sequence:
        b.add(func(element))
    

    Generator expressions are similar, except that they build generators (iterators over a virtual sequence that never gets built) instead of actual sequences. It's probably easiest to first treat them as lazy list comprehensions, then convert those list comprehensions to loops.

    But if you want to convert them directly to loops, you can; you just need to add an extra function call:

    a = (func(element) for element in sequence)
    
    def _genexp():
        for element in sequence:
            yield func(element)
    
    a = _genexp()
    
