Everybody knows that "floating point numbers cause problems." Many people have misconceptions about what those problems are, and how to deal with them. There are some great references for people who want to understand things, the (justly) most famous being What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg.

But many Python programmers like to learn by experimenting. That's why you've chosen a language with an interactive REPL, powerful introspection, and, when it comes down to it, a readable open-source implementation, right?

So, let's look at how floating point numbers are represented in Python; understanding what all the bits mean may make it easier to understand how rounding errors and so forth work.

On nearly any platform you care about, floating point numbers—both binary (float) and decimal (Decimal)—are going to be based on IEEE 754-2008 formats. So, let's look at how those work.

IEEE binary floats

Scientific notation

IEEE floats are based on scientific notation, but in binary.

Scientific notation looks like:
-1.042e3
The "-" sign is obvious. The "1.042" is almost obvious, but note that in "normalized form" there always must be exactly one nonzero digit left of the "." (that's what makes it scientific notation), and you're allowed to have trailing zeros (if, e.g., you had 11 to 4 significant digits, you'd write 1.100e1, not 1.1e1). The "e3" means that you multiple by 10**3.

So:
- (1 + 0/10 + 4/100 + 2/1000) * 10**3 == -1042.0
For binary scientific notation, everything works the same way, except with 2 in place of 10. So:
-1.01e2
- (1 + 0/2 + 1/4) * 2**2 = -101 (binary) or -5 (decimal) 
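If you want to check that arithmetic at the REPL, here's a quick sketch (I'm using the fractions module so the base-10 example stays exact; the binary example is already exact in floats):

    >>> from fractions import Fraction
    >>> -(1 + Fraction(0, 10) + Fraction(4, 100) + Fraction(2, 1000)) * 10**3
    Fraction(-1042, 1)
    >>> -(1 + 0/2 + 1/4) * 2**2   # the binary example, -1.01e2, evaluated numerically
    -5.0
    >>> int('101', 2)             # and 101 (binary) really is 5 (decimal)
    5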

Representation in bits

Obviously, any float based on scientific notation is going to have three things to store: the sign, the mantissa (the decimal 1.042 or binary 1.01), and the exponent (the 3 in 10**3 or 2 in 2**2).

The standard doesn't really specify how the bits have to be stored inside the computer, but they have to be represented for interchange as if they were stored as:
  • 1 sign bit
  • w exponent bits
  • p-1 significand bits
Let's pick a specific, very small, size to make our lives easy: just 3 exponent bits and 2 significand bits. Of course in real life you'd never deal with 6-bit floats, but having only 64 possible values means we can look at an exhaustive table of them (which I've included below), which can be very helpful.

We'll call this 6-bit format "binary6", even though technically it isn't a legal interchange format. (For one thing, interchange formats must have either 16 bits, or a multiple of 32 other than 96. For another, a k-bit interchange format for any k but 16 and 32 must have w = round(4 * math.log2(k)) − 13, which in our case would give -3 bits.)

For the binary64 format, everything is exactly the same as our binary6 format, except that w is 11 instead of 3, and p-1 is 52 instead of 2.
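You can confirm the binary64 parameters from Python itself:

    >>> import sys
    >>> sys.float_info.mant_dig   # p: 53 bits of precision, 52 of them stored
    53
    >>> sys.float_info.max_exp
    1024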

Sign

The sign is easy. Unlike integers, there's no 2s complement or anything fancy going on here. You compute the number from the exponent and significand, and then if the sign bit is 1, you negate it. So, -5.0 looks exactly like 5.0 except for the first bit. This does have some strange consequences—it means -0.0 and +0.0 are different numbers, and it means the bits for the smallest negative number and the smallest positive number are very far apart—but it also makes a lot of things easier. So, in our example of -1.01e2, the sign bit is 1.

Mantissa and significand

Next, we've got the mantissa, the 1.01 part. But notice that we don't have 3 mantissa bits, we have 2 significand bits. How does this work? Well, in scientific notation, there's always exactly one non-zero digit to the left of the decimal point. In binary, the only non-zero digit is 1, so the format saves a bit by not storing that 1.

So, for 2 bits, the possible values are 1.00 (stored as 00), 1.01 (as 01), 1.10 (as 10), and 1.11 (as 11).

Exponent

Finally, we've got the exponent of 2. To handle the possibility of negative exponents, instead of storing the exponent itself, we store the biased exponent—that is, we add 2**(w-1)-1. Also, the smallest and largest values are reserved for special purposes, which we'll get to in a moment. For 3 bits, the possible values are special (stored as 000), -2 (as 001), -1 (as 010), 0 (as 011), 1 (as 100), 2 (as 101), 3 (as 110), and special (as 111).
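Here's a quick way to tabulate that mapping at the REPL (using the bias 2**(w-1)-1 = 3 described above):

    >>> w = 3
    >>> bias = 2**(w - 1) - 1
    >>> for stored in range(1, 2**w - 1):   # 000 and 111 are reserved
    ...     print(format(stored, '03b'), stored - bias)
    ...
    001 -2
    010 -1
    011 0
    100 1
    101 2
    110 3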

So, our number, -1.01e2, is stored as 1 (sign) 101 (exponent) 01 (significand), or 110101.

Infinity, NaN, and subnormals

The smallest exponent value is used to store 0. After all, if there were no special rule, 0 000 00 would mean +1.00 * 2**-3, which is obviously not 0. But just saying exponent 0 means the value is 0 is a waste of bits—that would mean we have 8 different zero values (0 000 00, 0 000 01, 0 000 10, 0 000 11, 1 000 00, 1 000 01, 1 000 10, and 1 000 11). Having positive and negative zero can sometimes be useful, but having 4 different positive zeros is pointless.

So, instead, a 000 exponent actually means the same thing as a 001 exponent, except that a 0 instead of a 1 is prepended to the significand to get the mantissa. A number like 0.01e-2 isn't properly normalized scientific notation—that would be 1.00e-4. So, these numbers are called "subnormal" or "denormal".

Meanwhile, the largest exponent value is used to store infinity. Again, though, that would give us 8 infinities when we only need 2. So, if the significand is nonzero, instead of infinity, it's NaN. The first bit of the significand is the quiet bit, and the remaining bits (in this case, just 1) are the payload, which is reserved for anything a user-level library wants.
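To make all of this concrete, here's a sketch of a decoder for our toy format (decode_binary6 is just a name I made up for illustration); it applies the rules above: a biased exponent, an implicit leading 1 for normal numbers, exponent 000 for zero and subnormals, and exponent 111 for infinities and NaNs:

    def decode_binary6(bits):
        """Decode a 6-bit string like '110101' into a Python float."""
        sign = -1.0 if bits[0] == '1' else 1.0
        exp_bits, sig_bits = bits[1:4], bits[4:]
        bias = 2**(3 - 1) - 1                 # 3 exponent bits -> bias of 3
        fraction = int(sig_bits, 2) / 4       # the 2 stored significand bits
        if exp_bits == '111':                 # all ones: infinity or NaN
            return sign * float('inf') if fraction == 0 else float('nan')
        if exp_bits == '000':                 # all zeros: zero or subnormal
            return sign * fraction * 2.0**(1 - bias)
        return sign * (1 + fraction) * 2.0**(int(exp_bits, 2) - bias)

    print(decode_binary6('110101'))   # -5.0, our -1.01e2 example
    print(decode_binary6('000001'))   # 0.0625, the smallest positive subnormal

Feeding it all 64 bit patterns reproduces the table below (except that Python's float can't show the quiet/signaling distinction or the NaN payloads).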

Table

So, in our 6-bit format, here's what all the possible numbers mean:

    sign exponent significand    binary scientific  decimal
       0      000          00    0.00e-2 = 0        0.0
       0      000          01    0.01e-2 = 1e-4     0.0625
       0      000          10    0.10e-2 = 1e-3     0.125
       0      000          11    0.11e-2 = 1.1e-3   0.1875
       0      001          00    1.00e-2            0.25
       0      001          01    1.01e-2            0.3125
       0      001          10    1.10e-2            0.375
       0      001          11    1.11e-2            0.4375
       0      010          00    1.00e-1            0.5
       0      010          01    1.01e-1            0.625
       0      010          10    1.10e-1            0.75
       0      010          11    1.11e-1            0.875
       0      011          00    1.00e0             1.
       0      011          01    1.01e0             1.25
       0      011          10    1.10e0             1.5
       0      011          11    1.11e0             1.75
       0      100          00    1.00e1             2.
       0      100          01    1.01e1             2.5
       0      100          10    1.10e1             3.
       0      100          11    1.11e1             3.5
       0      101          00    1.00e2             4.
       0      101          01    1.01e2             5.
       0      101          10    1.10e2             6.
       0      101          11    1.11e2             7.
       0      110          00    1.00e3             8.
       0      110          01    1.01e3             10.
       0      110          10    1.10e3             12.
       0      110          11    1.11e3             14.
       0      111          00    inf                inf
       0      111          01    snan1              snan1
       0      111          10    qnan0              qnan0
       0      111          11    qnan1              qnan1
       1      000          00    -0.00e-2 = -0      -0.0
       1      000          01    -0.01e-2 = -1e-4   -0.0625
       1      000          10    -0.10e-2 = -1e-3   -0.125
       1      000          11    -0.11e-2 = -1.1e-3 -0.1875
       1      001          00    -1.00e-2           -0.25
       1      001          01    -1.01e-2           -0.3125
       1      001          10    -1.10e-2           -0.375
       1      001          11    -1.11e-2           -0.4375
       1      010          00    -1.00e-1           -0.5
     ...
       1      110          00    -1.00e3            -8.
       1      110          01    -1.01e3            -10.
       1      110          10    -1.10e3            -12.
       1      110          11    -1.11e3            -14.
       1      111          00    -inf               -inf
       1      111          01    -snan1             -snan1
       1      111          10    -qnan0             -qnan0
       1      111          11    -qnan1             -qnan1

Rounding

Given the table above, rounding errors should be pretty easy to understand.

If you want to store 1.1 as a binary6 float, the closest value is 1.0 (0 011 00), which is off by 0.1; the next best would be 1.25 (0 011 01), which is off by even more, 0.15.

Now, let's say you want to test 3.3 / 3 == 1.1. The closest binary6 value to 3.3 is 3.5, and 3 is exactly representable. Divide those, and the true quotient is 1.1666…, so the closest representable result is 1.25. But the closest value to 1.1 is 1.0, and 1.25 != 1.0.

You can work out the same thing with more familiar examples with binary64, like 0.3 / 3 == 0.1.
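On any binary64 platform (which is what CPython's float gives you), you can watch this happen:

    >>> 0.3 / 3 == 0.1
    False
    >>> 0.3 / 3
    0.09999999999999999
    >>> (0.3 / 3).hex()
    '0x1.9999999999999p-4'
    >>> (0.1).hex()
    '0x1.999999999999ap-4'

The two results differ by exactly one unit in the last place.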

Reading floats as ints

Notice that, as long as you stick to finite nonnegative numbers, you can treat the exact same bits as if they were integers, and two adjacent floats will always be adjacent integers.

For example, Python (before 3.9, which added math.nextafter) doesn't provide a function like C99/POSIX-2001 nextafter, but it's easy to implement yourself. For a quick&dirty case, if you know you're dealing with finite positive numbers, just add or subtract 1 as an integer, and you've got the next or previous float. Dealing with negative numbers and nonfinite numbers gets a bit trickier, but it should be obvious how to do it (once you choose exactly what you want to do for, e.g., nextafter(-0) and nextbefore(inf)).
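Here's a minimal sketch of that quick&dirty case (finite positive floats only), using the struct trick shown later in this post; next_float and prev_float are names I made up:

    import struct

    def _to_bits(f):
        return struct.unpack('>Q', struct.pack('>d', f))[0]

    def _from_bits(n):
        return struct.unpack('>d', struct.pack('>Q', n))[0]

    def next_float(f):
        """Next representable binary64 float above f (finite positive f only)."""
        return _from_bits(_to_bits(f) + 1)

    def prev_float(f):
        """Next representable binary64 float below f (finite positive f only)."""
        return _from_bits(_to_bits(f) - 1)

    print(next_float(1.1))   # 1.1000000000000003
    print(prev_float(1.1))   # 1.0999999999999999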

Similarly, to get the "float difference" in ulps (Units of Least Precision, or Units in the Last Place), just subtract the two floats as integers (again, with the obvious extensions for negative and nonfinite numbers). Note that this is often very different from the delta, the difference in real numbers. For example, 1.25 is 2 ulp away from both 0.875 and 1.75, but the deltas are not the same (0.375 and 0.5, or relative 30% and 40%).
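And the ulp difference is just integer subtraction on the same reinterpreted bits (again, finite nonnegative floats only; float_difference is my own name for it):

    import struct

    def float_difference(f, g):
        """Difference between f and g in ulps (finite nonnegative floats only)."""
        as_int = lambda x: struct.unpack('>Q', struct.pack('>d', x))[0]
        return as_int(f) - as_int(g)

    print(float_difference(1.1000000000000003, 1.0999999999999999))   # 2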

You can even begin to understand tricks like the fast inverse sqrt hack.

Manipulating float bits in Python

OK, now we know how a float is stored. But how can we play with this in Python?

While it's possible (at least in CPython) to get at the values under the covers via ctypes, it's generally simpler (and a lot safer) to just use the struct module to pack a float in big-endian C double format. Then we've got the bits as a sequence of bytes—or, if you prefer, you can turn it into a 64-bit unsigned integer, or use a third-party library like bitarray or bitstring to make your life a whole lot easier.

So, let's look at how -123.456 is stored as a float:
    >>> import bitstring
    >>> f = -123.456
    >>> b = bitstring.pack('>d', f)
    >>> sbit, wbits, pbits = b[:1], b[1:12], b[12:]
    >>> sbit.bin
    '1'
That's easy: sign bit 1 means negative.
    >>> wbits.bin
    '10000000101'
    >>> wbits.uint - (1<<10) + 1
    6
So, the exponent is (decimal) 6.
    >>> pbits.bin
    '1110110111010010111100011010100111111011111001110111'
    >>> 1 + pbits.uint / (1<<52)
    1.929
So, the mantissa is (decimal) 1.929.

So, the number is -1.929 * 2**6. Which checks out:
    >>> -1.929 * 2**6
    -123.456
Or, if you prefer:
    >>> pbits.hex
    'edd2f1a9fbe77'
So, in Python's float hex format, the number is -1.edd2f1a9fbe77p+6, which we can check:
    >>> f.hex()
    '-0x1.edd2f1a9fbe77p+6'
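If you'd rather not install a third-party package, the same decoding works with nothing but the standard library (a quick sketch; float_fields isn't part of any existing API):

    import struct

    def float_fields(f):
        """Return (sign, biased exponent, significand bits) of a binary64 float."""
        bits = struct.unpack('>Q', struct.pack('>d', f))[0]
        sign = bits >> 63
        exponent = (bits >> 52) & 0x7FF
        significand = bits & ((1 << 52) - 1)
        return sign, exponent, significand

    sign, exponent, significand = float_fields(-123.456)
    print(sign)                          # 1
    print(exponent - 1023)               # 6
    print(1 + significand / (1 << 52))   # 1.929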
You can also use this to create and parse NaN payloads, which Python doesn't do natively:
    >>> import struct
    >>> quiet, payload = 1, 3
    >>> bits = (0b11111111111 << 52) | (quiet << 51) | payload
    >>> qnan3 = struct.unpack('>d', struct.pack('>Q', bits))[0]
Of course Python is going to print qnan3 as just plain "nan", but if your platform has some way of printing out non-default NaN values, you may be able to see it by calling the C library's printf via ctypes:
    >>> import ctypes
    >>> libc = ctypes.CDLL(None)  # loads the C library on most Unix-like platforms
    >>> libc.printf(b'%f', ctypes.c_double(qnan3))
    QNaN3
And if you want to do the nextafter or float_difference hacks described in the last section:
    >>> f = 1.1
    >>> d = struct.unpack('>Q', struct.pack('>d', f))[0]
    >>> struct.unpack('>d', struct.pack('>Q', d-1))[0]
    1.0999999999999999
    >>> struct.unpack('>d', struct.pack('>Q', d+1))[0]
    1.1000000000000003

The floatextras package

Now that you know how to do this all manually, the floatextras package on PyPI wraps it all up for you:
    >>> f = 1.1
    >>> floatextras.next_plus(f)
    1.1000000000000003
    >>> floatextras.float_difference(1.0999999999999999, 1.1000000000000003)
    -2
    >>> floatextras.as_tuple(-123.456)
    FloatTuple(sign=1, digits=(1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1), exponent=6)
    >>> floatextras.as_tuple(qnan3)
    FloatTuple(sign=0, digits=(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1), exponent=1024)

Caveat for some platforms

What I described above is all true for the current IEEE 754-2008 standard, but most platforms are actually based on the older IEEE 754-1985. There are a number of differences and expansions in areas I didn't get to above, and some of the terminology is different (e.g., "double" instead of "binary64").

Also, there are a few things that the older standard left open to the implementation, which I've assumed will always be interpreted the same way. If you're on almost any platform on x86, x86_64, or ARM hardware, everything will be as above; on different hardware, there may be minor differences (e.g., IRIX on MIPS used the NaN quiet/signaling bit backward, so 0 means quiet; another platform which I've mercifully forgotten used the last bit of the significand rather than the first; POWER stored floats big-endian even in little-endian mode, so you had to swap the bytes around to do the "treat them as ints" trick).

Decimal floats

There's no reason floats have to be stored in binary. In fact, IEEE 754-2008 specifies how you can use the same basic ideas with decimal fractions instead of binary. And Python includes an implementation of this (well, of its predecessor, IEEE 854-1987, but again, it's close enough) in the decimal module.

Most of the differences between binary and decimal formats are obvious, except for one. Decimals can't use the "free mantissa bit" trick, because the digit to the left of the decimal point can be anything from 1 through 9 (or even 0), so it has to be stored. This also means decimal floats can have non-canonical representations (after all, 0.1e2 means the same thing as 1.0e1). Which also means you can store subnormal values, and 0, with different levels of precision, and some operations do so.
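You can see those non-canonical representations directly with the decimal module: equal values can be stored with different exponents and different numbers of digits:

    >>> from decimal import Decimal
    >>> Decimal('1E+1') == Decimal('10')
    True
    >>> Decimal('1E+1').as_tuple()
    DecimalTuple(sign=0, digits=(1,), exponent=1)
    >>> Decimal('10').as_tuple()
    DecimalTuple(sign=0, digits=(1, 0), exponent=0)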

The decimal module docs, and the PEP linked from there, cover all of the details very nicely.

The obvious advantage of decimal floats is that there's no rounding error converting to and from decimal strings. For example, the closest binary64 float to 1.1 is 1.10000000000000009e+0; the closest decimal float (as long as you have at least 2 digits of precision) is exactly 1.1e+0. Magic, right? Well, no; plenty of things that are exact with binary floats are inexact with decimal floats. Still, decimals are a good solution to many floating-point problems, as long as you understand how they fall short of magic. Which, hopefully, you now do.
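For example, with binary64 floats, (1/3) * 3 happens to round back to exactly 1.0, but the decimal version (at the default 28-digit precision) does not:

    >>> from decimal import Decimal
    >>> (1 / 3) * 3 == 1
    True
    >>> Decimal(1) / Decimal(3) * 3
    Decimal('0.9999999999999999999999999999')
    >>> Decimal(1) / Decimal(3) * 3 == 1
    False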

Also, because the decimal module is implemented from scratch, rather than relying on a native C type, it provides a lot more flexibility. Instead of being stuck with a decimal64 format, you can define and use any format. You can specify rounding modes and signaling modes. All of the optional APIs from the standard are available, instead of just the ones C/POSIX/your platform chose to implement, plus some extras to do things like picking a float apart into sign, exponent, and mantissa, or building NaN values with custom payloads.
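A quick taste of that flexibility, using only the stdlib decimal module (the 3-digit context here is arbitrary):

    >>> import decimal
    >>> ctx = decimal.Context(prec=3, rounding=decimal.ROUND_HALF_EVEN)
    >>> ctx.divide(decimal.Decimal(1), decimal.Decimal(7))
    Decimal('0.143')
    >>> decimal.Decimal('-123.456').as_tuple()
    DecimalTuple(sign=1, digits=(1, 2, 3, 4, 5, 6), exponent=-3)
    >>> decimal.Decimal('NaN123')   # a NaN carrying a payload
    Decimal('NaN123')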

It's been more than a decade since Typical Programmer Greg Jorgensen taught the world about Abject-Oriented Programming.

Much of what he said still applies, but other things have changed. Languages in the Abject-Oriented space have been borrowing ideas from another paradigm entirely—and then everyone realized that languages like Python, Ruby, and JavaScript had been doing it for years and just hadn't noticed (because these languages do not require you to declare what you're doing, or even to know what you're doing). Meanwhile, new hybrid languages borrow freely from both paradigms.

This other paradigm—which is actually older, but was largely constrained to university basements until recent years—is called Functional Addiction.

A Functional Addict is someone who regularly gets higher-order—sometimes they may even exhibit dependent types—but still manages to retain a job.

Retaining a job is of course the goal of all programming. This is why some of these new hybrid languages, like Rust, check all borrowing, from both paradigms, so extensively that you can make regular progress for months without ever successfully compiling your code, and your managers will appreciate that progress. After all, once it does compile, it will definitely work.

Closures

It's long been known that Closures are dual to Encapsulation.

As Abject-Oriented Programming explained, Encapsulation involves making all of your variables public, and ideally global, to let the rest of the code decide what should and shouldn't be private.

Closures, by contrast, are a way of referring to variables from outer scopes. And there is no scope more outer than global.

Immutability

One of the reasons Functional Addiction has become popular in recent years is that to truly take advantage of multi-core systems, you need immutable data, sometimes also called persistent data.

Instead of mutating a function to fix a bug, you should always make a new copy of that function. For example:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

When you discover that you actually wanted fields 2 and 3 rather than 1 and 2, it might be tempting to mutate the state of this function. But doing so is dangerous. The right answer is to make a copy, and then try to remember to use the copy instead of the original:

function getCustName(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

function getCustName2(custID)
{
    custRec = readFromDB("customer", custID);
    fullname = custRec[2] + ' ' + custRec[3];
    return fullname;
}

This means anyone still using the original function can continue to reference the old code, but as soon as it's no longer needed, it will be automatically garbage collected. (Automatic garbage collection isn't free, but it can be outsourced cheaply.)

Higher-Order Functions

In traditional Abject-Oriented Programming, you are required to give each function a name. But over time, the name of the function may drift away from what it actually does, making it as misleading as comments. Experience has shown that people will only keep one copy of their information up to date, and the CHANGES.TXT file is the right place for that.

Higher-Order Functions can solve this problem:

function []Functions = [
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[1] + ' ' + custRec[2];
        return fullname;
    },
    lambda(custID) {
        custRec = readFromDB("customer", custID);
        fullname = custRec[2] + ' ' + custRec[3];
        return fullname;
    },
]

Now you can refer to these functions by order, so there's no need for names.

Parametric Polymorphism

Traditional languages offer Abject-Oriented Polymorphism and Ad-Hoc Polymorphism (also known as Overloading), but better languages also offer Parametric Polymorphism.

The key to Parametric Polymorphism is that the type of the output can be determined from the type of the inputs via Algebra. For example:

function getCustData(custId, x)
{
    if (x == int(x)) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return int(fullname);
    } else if (x.real == 0) {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return double(fullname);
    } else {
        custRec = readFromDB("customer", custId);
        fullname = custRec[1] + ' ' + custRec[2];
        return complex(fullname);
    }
}

Notice that we've called the variable x. This is how you know you're using Algebraic Data Types. The names y, z, and sometimes w are also Algebraic.

Type Inference

Languages that enable Functional Addiction often feature Type Inference. This means that the compiler can infer your typing without you having to be explicit:


function getCustName(custID)
{
    // WARNING: Make sure the DB is locked here or
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

We didn't specify what will happen if the DB is not locked. And that's fine, because the compiler will figure it out and insert code that corrupts the data, without us needing to tell it to!

By contrast, most Abject-Oriented languages are either nominally typed—meaning that you give names to all of your types instead of meanings—or dynamically typed—meaning that your variables are all unique individuals that can accomplish anything if they try.

Memoization

Memoization means caching the results of a function call:

function getCustName(custID)
{
    if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Non-Strictness

Non-Strictness is often confused with Laziness, but in fact Laziness is just one kind of Non-Strictness. Here's an example that compares two different forms of Non-Strictness:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRate(custId) {}

function callByNameTaxRate(custId)
{
    /****************************************
    *
    * TO DO:
    *
    * get tax rate for the customer state
    * eventually from some table
    *
    ****************************************/
}

Both are Non-Strict, but the second one forces the compiler to actually compile the function just so we can Call it By Name. This causes code bloat. The Lazy version will be smaller and faster. Plus, Lazy programming allows us to create infinite recursion without making the program hang:

/****************************************
*
* TO DO:
*
* get tax rate for the customer state
* eventually from some table
*
****************************************/
// function lazyTaxRateRecursive(custId) { lazyTaxRateRecursive(custId); }

Laziness is often combined with Memoization:

function getCustName(custID)
{
    // if (custID == 3) { return "John Smith"; }
    custRec = readFromDB("customer", custID);
    fullname = custRec[1] + ' ' + custRec[2];
    return fullname;
}

Outside the world of Functional Addicts, this same technique is often called Test-Driven Development. If enough tests can be embedded in the code to achieve 100% coverage, or at least a decent amount, your code is guaranteed to be safe. But because the tests are not compiled and executed in the normal run, or indeed ever, they don't affect performance or correctness.

Conclusion

Many people claim that the days of Abject-Oriented Programming are over. But this is pure hype. Functional Addiction and Abject Orientation are not actually at odds with each other, but instead complement each other.