But many Python programmers like to learn by experimenting. That's why you've chosen a language with an interactive REPL, powerful introspection, and, when it comes down to it, a readable open-source implementation, right?
So, let's look at how floating point numbers are represented in Python; understanding what all the bits mean may make it easier to understand how rounding errors and so forth work.
On nearly any platform you care about, binary floating point numbers (float) are going to be stored in an IEEE 754-2008 format, and decimal floating point (Decimal) follows the closely related General Decimal Arithmetic specification. So, let's look at how the binary formats work.
IEEE binary floats
Scientific notation
IEEE floats are based on scientific notation, but in binary. Scientific notation looks like:

-1.042e3

The "-" sign is obvious. The "1.042" is almost obvious, but note that in "normalized form" there must always be exactly one nonzero digit left of the "." (that's what makes it scientific notation), and you're allowed to have trailing zeros (if, e.g., you had 11 to 4 significant digits, you'd write 1.100e1, not 1.1e1). The "e3" means that you multiply by 10**3.
So:
- (1 + 0/10 + 4/100 + 2/1000) * 10**3 == -1042.0

For binary scientific notation, everything works the same way, except with 2 in place of 10. So:
-1.01e2
- (1 + 0/2 + 1/4) * 2**2 = -101 (binary) or -5 (decimal)
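The arithmetic above is easy to check at the Python REPL; this is just the formula written out, nothing IEEE-specific yet:

```python
# Evaluate -1.01e2 (binary) the same way as the decimal example:
# sign * (leading 1 + fraction bits) * 2**exponent.
mantissa = 1 + 0/2 + 1/4      # binary 1.01
value = -mantissa * 2**2      # apply the sign and the exponent
print(value)  # -5.0
```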
Representation in bits
Obviously, any float based on scientific notation is going to have three things to store: the sign, the mantissa (the decimal 1.042 or binary 1.01), and the exponent (the 3 in 10**3 or the 2 in 2**2). An IEEE binary format stores these as:

- 1 sign bit
- w exponent bits
- p-1 significand bits
For a toy example, let's pick w = 3 and p-1 = 2, for 6 bits total. We'll call this 6-bit format "binary6", even though technically it isn't a legal interchange format. (For one thing, interchange formats must have either 16 bits, or a multiple of 32 other than 96. For another, a k-bit interchange format for any k other than 16 and 32 must have w = round(4 * math.log2(k)) − 13, which in our case would give -3 exponent bits.)
For the binary64 format, everything is exactly the same as our binary6 format, except that w is 11 instead of 3, and p-1 is 52 instead of 2.
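You can confirm the binary64 parameters from inside Python via sys.float_info, which reports p (counting the implicit leading bit we'll meet in a moment) and the exponent range:

```python
import sys

# binary64 has a 53-bit significand: 52 stored bits plus an implicit
# leading 1. sys.float_info reports p and the exponent limits.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.max_exp)    # 1024
print(sys.float_info.min_exp)    # -1021
```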
Sign
The sign is easy. Unlike integers, there's no 2s complement or anything fancy going on here. You compute the number from the exponent and significand, and then if the sign bit is 1, you negate it. So, -5.0 looks exactly like 5.0 except for the first bit. This does have some strange consequences—it means -0.0 and +0.0 are different numbers, and it means the bits for the smallest negative number and the smallest positive number are very far apart—but it also makes a lot of things easier. So, in our example of -1.01e2, the sign bit is 1.

Mantissa and significand
Next, we've got the mantissa, the 1.01 part. But notice that we don't have 3 mantissa bits, we have 2 significand bits. How does this work? Well, in scientific notation, there's always exactly one non-zero digit to the left of the decimal point. In binary, the only non-zero digit is 1, so the format saves a bit by not storing that 1.

So, for 2 bits, the possible values are 1.00 (stored as 00), 1.01 (as 01), 1.10 (as 10), and 1.11 (as 11).
Exponent
Finally, we've got the exponent. To handle the possibility of negative exponents, instead of storing the exponent itself, we store the biased exponent—that is, we add 2**(w-1)-1, which for w=3 is 3. Also, the smallest and largest values are reserved for special purposes, which we'll get to in a moment. For 3 bits, the possible values are special (stored as 000), -2 (as 001), -1 (as 010), 0 (as 011), 1 (as 100), 2 (as 101), 3 (as 110), and special (as 111).

So, our number, -1.01e2, is stored as 1 (sign) 101 (exponent) 01 (significand), or 110101.
Infinity, NaN, and subnormals
The smallest exponent value is used to store 0. After all, if there were no special rule, 0 000 00 would mean +1.00 * 2**-3, which is obviously not 0. But just saying exponent 0 means the value is 0 is a waste of bits—that would mean we have 8 different zero values (0 000 00, 0 000 01, 0 000 10, 0 000 11, 1 000 00, 1 000 01, 1 000 10, and 1 000 11). Having positive and negative zero can sometimes be useful, but having 4 different positive zeros is pointless.

So, instead, a 000 exponent actually means the same thing as a 001 exponent, except that a 0 instead of a 1 is prepended to the significand to get the mantissa. A number like 0.01e-2 isn't properly normalized scientific notation—that would be 1.00e-4. So, these numbers are called "subnormal" or "denormal".
Meanwhile, the largest exponent value is used to store infinity. Again, though, that would give us 8 infinities when we only need 2. So, if the significand is nonzero, the value is a NaN instead of an infinity. The first bit of the significand is the quiet bit (1 for a quiet NaN, 0 for a signaling NaN), and the remaining bits (in this case, just 1) are the payload, which is reserved for anything a user-level library wants.
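With the special exponent values included, a full binary6 decoder is only a few lines longer. Again, this is an illustrative sketch, not anything from the standard library:

```python
import math

def decode_binary6(bits):
    """Decode any 6-bit float: normals, subnormals, infinities, and NaNs."""
    sign = -1.0 if (bits >> 5) & 1 else 1.0
    biased = (bits >> 2) & 0b111
    frac = bits & 0b11
    if biased == 0b111:                        # all-ones: inf or NaN
        return sign * math.inf if frac == 0 else math.nan
    if biased == 0:                            # all-zeros: zero or subnormal
        return sign * (0 + frac / 4) * 2 ** -2  # leading 0, minimum exponent
    return sign * (1 + frac / 4) * 2 ** (biased - 3)

print(decode_binary6(0b000001))  # 0.0625 (smallest positive subnormal)
print(decode_binary6(0b011100))  # inf
```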
Table
So, in our 6-bit format, here's what all the possible numbers mean:

sign exponent significand | binary scientific | decimal
0 000 00 | 0.00e-2 = 0 | 0.0
0 000 01 | 0.01e-2 = 1e-4 | 0.0625
0 000 10 | 0.10e-2 = 1e-3 | 0.125
0 000 11 | 0.11e-2 = 1.1e-3 | 0.1875
0 001 00 | 1.00e-2 | 0.25
0 001 01 | 1.01e-2 | 0.3125
0 001 10 | 1.10e-2 | 0.375
0 001 11 | 1.11e-2 | 0.4375
0 010 00 | 1.00e-1 | 0.5
0 010 01 | 1.01e-1 | 0.625
0 010 10 | 1.10e-1 | 0.75
0 010 11 | 1.11e-1 | 0.875
0 011 00 | 1.00e0 | 1.0
0 011 01 | 1.01e0 | 1.25
0 011 10 | 1.10e0 | 1.5
0 011 11 | 1.11e0 | 1.75
0 100 00 | 1.00e1 | 2.0
0 100 01 | 1.01e1 | 2.5
0 100 10 | 1.10e1 | 3.0
0 100 11 | 1.11e1 | 3.5
0 101 00 | 1.00e2 | 4.0
0 101 01 | 1.01e2 | 5.0
0 101 10 | 1.10e2 | 6.0
0 101 11 | 1.11e2 | 7.0
0 110 00 | 1.00e3 | 8.0
0 110 01 | 1.01e3 | 10.0
0 110 10 | 1.10e3 | 12.0
0 110 11 | 1.11e3 | 14.0
0 111 00 | inf | inf
0 111 01 | snan1 | snan1
0 111 10 | qnan0 | qnan0
0 111 11 | qnan1 | qnan1
1 000 00 | -0.00e-2 = -0 | -0.0
1 000 01 | -0.01e-2 = -1e-4 | -0.0625
1 000 10 | -0.10e-2 = -1e-3 | -0.125
1 000 11 | -0.11e-2 = -1.1e-3 | -0.1875
1 001 00 | -1.00e-2 | -0.25
1 001 01 | -1.01e-2 | -0.3125
1 001 10 | -1.10e-2 | -0.375
1 001 11 | -1.11e-2 | -0.4375
1 010 00 | -1.00e-1 | -0.5
... | ... | ...
1 110 00 | -1.00e3 | -8.0
1 110 01 | -1.01e3 | -10.0
1 110 10 | -1.10e3 | -12.0
1 110 11 | -1.11e3 | -14.0
1 111 00 | -inf | -inf
1 111 01 | -snan1 | -snan1
1 111 10 | -qnan0 | -qnan0
1 111 11 | -qnan1 | -qnan1
Rounding
Given the table above, rounding errors should be pretty easy to understand.

If you want to store 1.1 as a binary6 float, the closest value is 1.0 (0 011 00), which is off by 0.1; the next best would be 1.25 (0 011 01), which is off by even more, 0.15.
Now, let's say you want to test 1.1 * 3 == 3.3. The closest binary6 value to 1.1 is 1.0, and 3 is exact. Multiply those, and the result is exactly 3.0. But the closest value to 3.3 is 3.5, and 3.0 != 3.5.
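You can see this mechanically by enumerating every finite binary6 value and rounding to the nearest one. This is a quick sketch that ignores the standard's round-half-to-even tiebreak, since these examples don't hit a tie:

```python
# Enumerate every finite binary6 value: subnormals (biased exponent 0)
# plus normals (biased exponents 1 through 6), with both signs.
values = set()
for biased in range(7):
    for frac in range(4):
        lead = 1 if biased else 0            # subnormals have a leading 0
        exp = (biased - 3) if biased else -2  # subnormals use the minimum exponent
        v = (lead + frac / 4) * 2 ** exp
        values.update({v, -v})

def round_binary6(x):
    """Round x to the nearest binary6 value (no tie-to-even handling)."""
    return min(values, key=lambda v: abs(v - x))

print(round_binary6(1.1))   # 1.0
print(round_binary6(3.3))   # 3.5
```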
You can work out the same thing with more familiar binary64 examples, like 0.3 / 3 == 0.1.
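In real binary64 floats you can watch exactly the same thing happen, using Decimal to display the exact values the literals actually store:

```python
from decimal import Decimal

# 0.3 / 3 rounds to the double just below 1/10, while the literal 0.1
# rounds to the double just above it, so the comparison fails.
print(0.3 / 3 == 0.1)    # False
print(Decimal(0.3 / 3))  # the exact stored value of the quotient
print(Decimal(0.1))      # the exact stored value of the literal
```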
Reading floats as ints
Manipulating float bits in Python
OK, now we know how a float is stored. But how can we play with this in Python?

While it's possible (at least in CPython) to get at the values under the covers via ctypes, it's generally simpler (and a lot safer) to just use the struct module to pack a float in big-endian C double format. Then we've got the bits as a sequence of bytes—or, if you prefer, you can turn it into a 64-bit unsigned integer, or use a third-party library like bitarray or bitstring to make your life a whole lot easier.
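For instance, a stdlib-only version of the field extraction looks like this (float_fields is just a name made up for illustration):

```python
import struct

def float_fields(f):
    """Split a binary64 float into its sign, biased exponent, and
    significand fields using only the struct module."""
    bits = struct.unpack('>Q', struct.pack('>d', f))[0]
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    significand = bits & ((1 << 52) - 1)
    return sign, biased_exponent, significand

print(float_fields(-123.456))
```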
So, let's look at how -123.456 is stored as a float:
>>> import bitstring
>>> f = -123.456
>>> b = bitstring.pack('>d', f)
>>> sbit, wbits, pbits = b[:1], b[1:12], b[12:]
>>> sbit.bin
'1'

That's easy: sign bit 1 means negative.
>>> wbits.bin
'10000000101'
>>> wbits.uint - (1<<10) + 1
6

So, the exponent is (decimal) 6.
>>> pbits.bin
'1110110111010010111100011010100111111011111001110111'
>>> 1 + pbits.uint / (1<<52)
1.929

So, the mantissa is (decimal) 1.929.
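As a cross-check, math.frexp decomposes a float the same way, except that it normalizes the mantissa to 0.5 <= |m| < 1 rather than 1 <= |m| < 2, so its exponent comes out one higher (and the sign stays folded into the mantissa):

```python
import math

# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1;
# doubling m and decrementing e recovers the IEEE-style decomposition.
m, e = math.frexp(-123.456)
print(m * 2, e - 1)   # -1.929 6
```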
So, the number is -1.929 * 2**6. Which checks out:
>>> -1.929 * 2**6
-123.456

Or, if you prefer:
>>> pbits.hex
'edd2f1a9fbe77'

So, in Python's float hex format, the number is -1.edd2f1a9fbe77p+6, which we can check:
>>> f.hex()
'-0x1.edd2f1a9fbe77p+6'

You can also use this to create and parse NaN payloads, which Python doesn't do natively:
>>> import struct
>>> quiet, payload = 1, 3
>>> bits = (0b11111111111 << 52) | (quiet << 51) | payload
>>> qnan3 = struct.unpack('>d', struct.pack('>Q', bits))[0]

Of course Python is going to print qnan3 as just plain "nan", but if your platform has some way of printing out non-default NaN values, you may be able to see it like this:
>>> import ctypes
>>> libc = ctypes.CDLL(None)   # or ctypes.cdll.msvcrt on Windows
>>> libc.printf(b'%f\n', ctypes.c_double(qnan3))
QNaN3

And if you want to do the nextafter or float_difference hacks described in the last section:
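A more portable way to parse the payload back out is to round-trip the bits through struct, with no ctypes involved:

```python
import struct

# Build the same quiet NaN with payload 3 as above, then recover the
# quiet bit and payload from its bit pattern.
bits = (0b11111111111 << 52) | (1 << 51) | 3
qnan3 = struct.unpack('>d', struct.pack('>Q', bits))[0]
out = struct.unpack('>Q', struct.pack('>d', qnan3))[0]
print((out >> 51) & 1, out & ((1 << 51) - 1))  # 1 3
```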
>>> f = 1.1
>>> d = struct.unpack('>Q', struct.pack('>d', f))[0]
>>> struct.unpack('>d', struct.pack('>Q', d-1))[0]
1.0999999999999999
>>> struct.unpack('>d', struct.pack('>Q', d+1))[0]
1.1000000000000003
The floatextras package
Now that you know how to do this all manually, the floatextras package on PyPI wraps it all up for you:

>>> import floatextras
>>> f = 1.1
>>> floatextras.next_plus(f)
1.1000000000000003
>>> floatextras.float_difference(1.0999999999999999, 1.1000000000000003)
-2
>>> floatextras.as_tuple(-123.456)
FloatTuple(sign=1, digits=(1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1), exponent=6)
>>> floatextras.as_tuple(qnan3)
FloatTuple(sign=0, digits=(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1), exponent=1024)