Floating point
Encyclopedia
In computing
, floating point describes a method of representing real number
s in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits
and scaled using an exponent
. The base for the scaling is normally 2, 10 or 16. The typical number that can be represented exactly is of the form:
The term floating point refers to the fact that the radix point
(decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation
. Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is that defined by the IEEE 754 Standard.
The advantage of floating-point representation over fixed-point
and integer
representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits with two decimal places can represent the numbers 12345.67, 123.45, 1.23 and so on, whereas a floating-point representation (such as the IEEE 754 decimal32
format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision
.
The speed of floating-point operations, commonly referred to in performance measurements as FLOPS
, is an important machine characteristic, especially in software that performs large-scale mathematical calculations.
in mathematics) specifies some way of storing a number that may be encoded as a string of digits. The arithmetic is defined as a set of actions on the representation that simulate classical arithmetic operations.
There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point
is indicated by placing an explicit "point" character
(dot or comma) there. If the radix point is omitted then it is implicitly assumed to lie at the right (least significant) end of the string (that is, the number is an integer
). In fixed-point
systems, some specific assumption is made about where the radix point is located in the string. For example, the convention could be that the string consists of 8 decimal digits with the decimal point in the middle, so that "00012345" has a value of 1.2345.
In scientific notation
, the given number is scaled by a power of 10
so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter
's moon Io
is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 seconds.
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
To derive the value of the floating point number, one must multiply the significand by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.
Using base-10 (the familiar decimal notation) as an example, the number 152853.5047, which has ten decimal digits of precision, is represented as the significand 1528535047 together with an exponent of 5 (if the implied position of the radix point is after the first most significant digit, here 1). To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 105 to give 1.528535047 × 105, or 152853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Symbolically, this final value is
where s is the value of the significand (after taking into account the implied radix point), b is the base, and e is the exponent.
Equivalently:
where s here means the integer value of the entire significand, ignoring any implied decimal point, and p is the precision—the number of digits in the significand.
Historically, several number bases have been used for representing floating-point numbers, with base 2 (binary
) being the most common, followed by base 10 (decimal), and other less common varieties, such as base 16 (hexadecimal notation
), as well as some exotic ones like 3 (see Setun
). Floating point numbers are rational number
s because they can be represented as one integer divided by another. The base however determines the fractions that can be represented. For instance, 1/5 cannot be represented exactly as a floating point number using a binary base but can be represented exactly using a decimal base.
The way in which the significand, exponent and sign bits are internally stored on a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation p=24 and so the significand is a string of 24 bit
s. For instance, the number π
's first 33 bits are 11001001 00001111 11011010 10100010 0. Rounding to 24 bits in binary mode means attributing the 24th bit the value of the 25th which yields 11001001 00001111 11011011. When this is stored using the IEEE 754 encoding, this becomes the significand s with e = 1 (where s is assumed to have a binary point to the right of the first bit) after a left-adjustment (or normalization) during which leading or trailing zeros are truncated should there be any. Note that they do not matter anyway. Then since the first bit of a non-zero binary significand is always 1 it need not be stored, giving an extra bit of precision. To calculate π the formula is
where n is the normalized significand's n-th bit from the left. Normalization, which is reversed when 1 is being added above, can be thought of as a form of compression; it allows a binary significand to be compressed into a field one bit shorter than the maximum precision, at the expense of extra processing.
The word "mantissa" is often used as a synonym for significand. Use of mantissa in place of significand or coefficient is discouraged, as the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the manner in which logarithm
tables were used before computers became commonplace. Log tables were actually tables of mantissas.
to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as
would be expressed as × (1.2) = (1.44).
In a fixed-point system with the decimal point at the left, it would be
A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.
The range of floating-point numbers depends on the number of bits or digits used for representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10−308 to 10308, because the range of the exponent is [−1022,1023] and 308 is approximately log10(21023). The complete range of the format is from about −10308 through +10308 (see IEEE 754).
The number of normalized floating point numbers in a system F(B, P, L, U) (where B is the base of the system, P is the precision of the system to P numbers, L is the smallest exponent representable in the system, and U is the largest exponent used in the system) is:
.
There is a smallest positive normalized floating-point number,
Underflow level = UFL =
which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.
There is a largest floating point number,
Overflow level = OFL = which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.
In addition there are representable values strictly between −UFL and UFL. Namely, zero and negative zero, as well as subnormal numbers.
of Berlin completed the "Z1
", the first mechanical binary programmable computer. It worked with 22-bit binary floating-point numbers having a 7-bit signed exponent, a 15-bit significand (including one implicit bit), and a sign bit. The memory used sliding metal parts to store 64 words of such numbers. The relay
-based Z3, completed in 1941, implemented floating point arithmetic exceptions with representations for plus and minus infinity and undefined.
The first commercial computer with floating point hardware was Zuse's Z4
computer designed in 1942–1945. The Bell Laboratories Mark V computer implemented decimal floating point in 1946. The mass-produced vacuum tube
-based IBM 704
followed a decade later in 1954; it introduced the use of a biased exponent
. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. It was not until 1989 that general-purpose computers had floating point capability in hardware as standard.
The UNIVAC 1100/2200 series
, introduced in 1962, supported two floating-point formats. Single precision used 36 bits, organized into a 1-bit sign, an 8-bit exponent, and a 27-bit significand. Double precision used 72 bits organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.
Prior to the IEEE-754 standard, computers used many different forms of floating-point. These differed in the word sizes, the format of the representations, and the rounding behavior of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.
The IEEE-754 standard was created in the early 1980s after word sizes of 32 bits (or 16 or 64) had been generally settled upon. This was based on a proposal from Intel who were designing the i8087
numerical coprocessor. Among the innovations are these:
(in addition to the IEEE 754 binary and decimal formats), and Cray
vector machines, where the T90
series had an IEEE version, but the SV1
still uses Cray floating-point format.
The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats, and two of these are especially widely used in computer hardware and languages:
The other basic formats are quadruple precision (128-bit) binary, as well as decimal floating point
(64-bit) and "double" (128-bit) decimal floating point.
Less common formats include:
Any integer with absolute value less than or equal to 224 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 253 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
The standard specifies some special values, and their representation: positive infinity
(+∞), negative infinity (−∞), a negative zero (−0) distinct from ordinary ("positive") zero, and "not a number" values (NaN
s).
Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. Apart from these special cases, more significant bits are stored before less significant bits. All values except NaN are strictly smaller than +∞ and strictly greater than −∞.
To a rough approximation, the bit representation of an IEEE binary floating-point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
Although the 32 bit ("single") and 64 bit ("double") formats are by far the most common, the standard actually allows for many different precision levels. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision
format, with a 15 bit exponent, a 64 bit significand, and no hidden bit.
There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C
and related programming languages usually provide these formats via the long double
type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.
A project for revising the IEEE 754 standard was started in 2000 (see IEEE 754 revision
); it was completed and approved in June 2008. It includes decimal floating-point formats and a 16 bit floating point format ("binary16"). binary16 has the same structure and rules as the older formats, with 1 sign bit, 5 exponent bits and 10 trailing significand bits. It is being used in the NVIDIA Cg graphics language, and in the openEXR standard.
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and subnormal numbers, values of all 1s are reserved for the infinities and NaNs. The exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normalised numbers exclude subnormal values, zeros, infinities, and NaNs.
In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has a significand with 24 bits of precision, double precision format has 53, and quad has 113.
For example, it was shown above that π, rounded to 24 bits of precision, has:
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as
gap with values
where the absolute distance between them are the same as for
adjacent values just outside of the underflow gap.
This is an improvement over the older practice to just have zero in the underflow gap,
and where underflowing results were replaced by zero (flush to zero).
Modern floating point hardware usually handles subnormal values (as well as normal values),
and does not require software emulation for subnormals.
can be represented in IEEE floating point datatypes,
just like ordinary floating point values like 1, 1.5 etc.
They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as
replacement values when there is an overflow
. Upon a divide by zero exception,
a positive or negative infinity is returned as an exact result. An infinity can also be introduced as
a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).
IEEE 754 requires infinities to be handled in a reasonable way, such as
The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.
s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π
or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the number 123456789 cannot be exactly represented if only eight decimal digits of precision are available.
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value to the original, and the value thus adjusted is called the rounded value.
Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
where, as previously, s is the significand and e is the exponent.
When rounded to 24 bits this becomes
which is actually 0.100000001490116119384765625 in decimal.
As a further example, the real number π
, represented in binary as an infinite series of bits is
but is
when approximated by rounding
to a precision of 24 bits.
In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1.
This has a decimal value of
whereas a more accurate approximation of the true value of π is
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error
and is limited by the machine epsilon
.
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place
(ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22hex and 1.45a70c24hex, the ULP is 2×16−8, or 2−31. For numbers with an exponent of 0, a ULP is exactly 2−23 or about 10−7 in single precision, and about 10−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.
(or rounding modes). Historically, truncation
was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even
, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. It means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)
Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:
Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and interval arithmetic
.
A further use of rounding is when a number is explicitly rounded to a certain number of decimal (or binary) places, as when rounding a result to euros and cents (two decimal places).
with 7 digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix
or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.
123456.7 = 1.234567 × 10^5
101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5
Hence:
123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
= (1.234567 × 10^5) + (0.001017654 × 10^5)
= (1.234567 + 0.001017654) × 10^5
= 1.235584654 × 10^5
In detail:
e=5; s=1.234567 (123456.7)
+ e=2; s=1.017654 (101.7654)
e=5; s=1.234567
+ e=5; s=0.001017654 (after shifting)
--------------------
e=5; s=1.235584654 (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
e=5; s=1.235585 (final sum: 123558.5)
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error
. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
e=5; s=1.234567
+ e=−3; s=9.876543
e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
e=5; s=1.23456709876543 (true sum)
e=5; s=1.234567 (after rounding/normalization)
Another problem of loss of significance occurs when two close numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are representations of the rationals 123457.1467 and 123456.659.
e=5; s=1.234571
− e=5; s=1.234567
----------------
e=5; s=0.000004
e=−1; s=4.000000 (after rounding/normalization)
The best representation of this difference is e = −1; s = 4.877000, which differs more than 20% from e = −1; s = 4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation
illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis
; see also Accuracy problems.
e=3; s=4.734612
× e=5; s=5.417242
-----------------------
e=8; s=25.648538980104 (true product)
e=8; s=25.64854 (after rounding)
e=9; s=2.564854 (after normalization)
Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm
and digital division
).
For a fast, simple method, see the Horner method.
Floating-point computation in a computer can run into three kinds of problems:
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind
of trap
that the programmer might be able to catch. How this worked was system-dependent,
meaning that floating-point programs were not portable
.
The original IEEE 754 standard (from 1984) took a first step towards a standard way for the IEEE 754 based operations to record that an error occurred. Here, trapping was ignored (optional in the 1984 version) and "alternate exception handling modes" (replacing trapping in the 2008 version, but still optional), and just looking at the required default method of handling exceptions according to IEEE 754. Arithmetic exceptions are (by default) required to be recorded in "sticky" error indicator bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero error bit.
The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic error bits. So while these were implemented in hardware, initially programming language implementations did not automatically provide a means to access them (apart from assembler). Over time some programming language standards (e.g., C and Fortran) have been updated to specify methods to access and change status and error bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic error bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a means
outside of the standard.
IEEE 754 specifies five arithmetic errors that are to be recorded in "sticky bits":
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision
with which computers generally represent numbers.
For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is
Squaring this number gives
Squaring it with single-precision floating-point hardware (with rounding) gives
But the representable number closest to 0.01 is
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.
By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225 in double precision, or −0.8742 in single precision.
While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit decimal arithmetic:
a = 1234.567, b = 45.67834, c = 0.0004
(a + b) + c:
1234.567 (a)
+ 45.67834 (b)
____________
1280.24534 rounds to 1280.245
1280.245 (a + b)
+ 0.0004 (c)
____________
1280.2454 rounds to 1280.245 <--- (a + b) + c
a + (b + c):
45.67834 (b)
+ 0.0004 (c)
____________
45.67874
45.67874 (b + c)
+ 1234.567 (a)
____________
1280.24574 rounds to 1280.246 <--- a + (b + c)
They are also not necessarily distributive. That is, (a + b) ×c may not be the same as a×c + b×c:
1234.567 × 3.333333 = 4115.223
1.234567 × 3.333333 = 4.115223
4115.223 + 4.115223 = 4119.338
but
1234.567 + 1.234567 = 1235.802
1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
. Usually denoted Εmach, its value depends on the particular rounding being used.
With rounding to zero,
whereas rounding to nearest,
This is important since it bounds the relative error in representing any non-zero real number x within the normalised range of a floating point system:
is essential.
In addition to careful design of programs, careful handling by the compiler
is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.
Binary floating-point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton
), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. The "decimal" data type of the C# and Python
programming languages, and the IEEE 754-2008 decimal floating-point standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.
Expectations from mathematics may not be realised in the field of floating-point computation. For example, it is known that , and that . These facts cannot be counted on when the quantities involved are the result of floating-point computation.
A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Descriptions of a few simple techniques follow.
The use of the equality test (
Perl
, for example,
is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like
3253.671
+ 3.141276
--------
3256.812
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm
may be used to reduce the errors.
Computations may be rearranged in a way that is mathematically equivalent but less prone to error. As an example, Archimedes
approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:
Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:
i 6 × 2i × ti, first form 6 × 2i × ti, second form
0 .4641016151377543863 .4641016151377543863
1 .2153903091734710173 .2153903091734723496
2 596599420974940120 596599420975006733
3 60862151314012979 60862151314352708
4 27145996453136334 27145996453689225
5 8730499801259536 8730499798241950
6 6627470548084133 6627470568494473
7 6101765997805905 6101766046906629
8 70343230776862 70343215275928
9 37488171150615 37487713536668
10 9278733740748 9273850979885
11 7256228504127 7220386148377
12 717412858693 707019992125
13 189011456060 78678454728
14 717412858693 46593073709
15 19358822321783 8571730119
16 717412858693 6566394222
17 810075796233302 6065061913
18 717412858693 939728836
19 4061547378810956 908393901
20 05434924008406305 900560168
21 00068646912273617 8608396
22 349453756585929919 8122118
23 00068646912273617 95552
24 .2245152435345525443 68907
25 62246
26 62246
27 62246
28 62246
The true value is
While the two forms of the recurrence formula are clearly mathematically equivalent, the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
Computing
Computing is usually defined as the activity of using and improving computer hardware and software. It is the computer-specific part of information technology...
, floating point describes a method of representing real number
Real number
In mathematics, a real number is a value that represents a quantity along a continuum, such as -5 , 4/3 , 8.6 , √2 and π...
s in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits
Significant figures
The significant figures of a number are those digits that carry meaning contributing to its precision. This includes all digits except:...
and scaled using an exponent
Exponentiation
Exponentiation is a mathematical operation, written as an, involving two numbers, the base a and the exponent n...
. The base for the scaling is normally 2, 10 or 16. The typical number that can be represented exactly is of the form:
- Significant digits × baseexponent
The term floating point refers to the fact that the radix point
Radix point
In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fractional part . "Radix point" is a general term that applies to all number bases...
(decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation
Scientific notation
Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. Scientific notation has a number of useful properties and is commonly used in calculators and by scientists, mathematicians, doctors, and engineers.In scientific...
. Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is that defined by the IEEE 754 Standard.
The advantage of floating-point representation over fixed-point
Fixed-point arithmetic
In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point...
and integer
Integer (computer science)
In computer science, an integer is a datum of integral data type, a data type which represents some finite subset of the mathematical integers. Integral data types may be of different sizes and may or may not be allowed to contain negative values....
representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits with two decimal places can represent the numbers 12345.67, 123.45, 1.23 and so on, whereas a floating-point representation (such as the IEEE 754 decimal32
Decimal32 floating-point format
In computing, decimal32 is a decimal floating-point computer numbering format that occupies 4 bytes in computer memory.It is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations....
format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision
Accuracy and precision
In the fields of science, engineering, industry and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual value. The precision of a measurement system, also called reproducibility or repeatability, is the degree to which...
.
The speed of floating-point operations, commonly referred to in performance measurements as FLOPS
FLOPS
In computing, FLOPS is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating-point calculations, similar to the older, simpler, instructions per second...
, is an important machine characteristic, especially in software that performs large-scale mathematical calculations.
Overview
A number representation (called a numeral systemNumeral system
A numeral system is a writing system for expressing numbers, that is a mathematical notation for representing numbers of a given set, using graphemes or symbols in a consistent manner....
in mathematics) specifies some way of storing a number that may be encoded as a string of digits. The arithmetic is defined as a set of actions on the representation that simulate classical arithmetic operations.
There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point
Radix point
In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fractional part . "Radix point" is a general term that applies to all number bases...
is indicated by placing an explicit "point" character
Decimal separator
Different symbols have been and are used for the decimal mark. The choice of symbol for the decimal mark affects the choice of symbol for the thousands separator used in digit grouping. Consequently the latter is treated in this article as well....
(dot or comma) there. If the radix point is omitted then it is implicitly assumed to lie at the right (least significant) end of the string (that is, the number is an integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...
). In fixed-point
Fixed-point arithmetic
In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point...
systems, some specific assumption is made about where the radix point is located in the string. For example, the convention could be that the string consists of 8 decimal digits with the decimal point in the middle, so that "00012345" has a value of 1.2345.
In scientific notation
Scientific notation
Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. Scientific notation has a number of useful properties and is commonly used in calculators and by scientists, mathematicians, doctors, and engineers.In scientific...
, the given number is scaled by a power of 10
Exponentiation
Exponentiation is a mathematical operation, written as an, involving two numbers, the base a and the exponent n...
so that it lies within a certain range—typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the revolution period of Jupiter
Jupiter
Jupiter is the fifth planet from the Sun and the largest planet within the Solar System. It is a gas giant with mass one-thousandth that of the Sun but is two and a half times the mass of all the other planets in our Solar System combined. Jupiter is classified as a gas giant along with Saturn,...
's moon Io
Io (moon)
Io ) is the innermost of the four Galilean moons of the planet Jupiter and, with a diameter of , the fourth-largest moon in the Solar System. It was named after the mythological character of Io, a priestess of Hera who became one of the lovers of Zeus....
is 152853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047 seconds.
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
- A signed digit string of a given length in a given base (or radixRadixIn mathematical numeral systems, the base or radix for the simplest case is the number of unique digits, including zero, that a positional numeral system uses to represent numbers. For example, for the decimal system the radix is ten, because it uses the ten digits from 0 through 9.In any numeral...
). This digit string is referred to as the significandSignificandThe significand is part of a floating-point number, consisting of its significant digits. Depending on the interpretation of the exponent, the significand may represent an integer or a fraction.-Examples:...
, coefficientCoefficientIn mathematics, a coefficient is a multiplicative factor in some term of an expression ; it is usually a number, but in any case does not involve any variables of the expression...
or, less often, the mantissa (see below). The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed to always be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article will generally follow the convention that the radix point is just after the most significant (leftmost) digit. - A signed integer exponent, also referred to as the characteristic or scale, which modifies the magnitude of the number.
To derive the value of the floating point number, one must multiply the significand by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.
Using base-10 (the familiar decimal notation) as an example, the number 152853.5047, which has ten decimal digits of precision, is represented as the significand 1528535047 together with an exponent of 5 (if the implied position of the radix point is after the first most significant digit, here 1). To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 105 to give 1.528535047 × 105, or 152853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Symbolically, this final value is
where s is the value of the significand (after taking into account the implied radix point), b is the base, and e is the exponent.
Equivalently:
where s here means the integer value of the entire significand, ignoring any implied decimal point, and p is the precision—the number of digits in the significand.
Historically, several number bases have been used for representing floating-point numbers, with base 2 (binary
Binary numeral system
The binary numeral system, or base-2 number system, represents numeric values using two symbols, 0 and 1. More specifically, the usual base-2 system is a positional notation with a radix of 2...
) being the most common, followed by base 10 (decimal), and other less common varieties, such as base 16 (hexadecimal notation
Hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...
), as well as some exotic ones like 3 (see Setun
Setun
Setun was a balanced ternary computer developed in 1958 at Moscow State University. The device was built under the lead of Sergei Sobolev and Nikolay Brusentsov. It was the only modern ternary computer, using three-valued ternary logic instead of two-valued binary logic prevalent in computers...
). Floating point numbers are rational number
Rational number
In mathematics, a rational number is any number that can be expressed as the quotient or fraction a/b of two integers, with the denominator b not equal to zero. Since b may be equal to 1, every integer is a rational number...
s because they can be represented as one integer divided by another. The base however determines the fractions that can be represented. For instance, 1/5 cannot be represented exactly as a floating point number using a binary base but can be represented exactly using a decimal base.
The way in which the significand, exponent and sign bits are internally stored on a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation p=24 and so the significand is a string of 24 bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...
s. For instance, the number π
Pi
' is a mathematical constant that is the ratio of any circle's circumference to its diameter. is approximately equal to 3.14. Many formulae in mathematics, science, and engineering involve , which makes it one of the most important mathematical constants...
's first 33 bits are 11001001 00001111 11011010 10100010 0. Rounding to 24 bits in binary mode means attributing the 24th bit the value of the 25th which yields 11001001 00001111 11011011. When this is stored using the IEEE 754 encoding, this becomes the significand s with e = 1 (where s is assumed to have a binary point to the right of the first bit) after a left-adjustment (or normalization) during which leading or trailing zeros are truncated should there be any. Note that they do not matter anyway. Then since the first bit of a non-zero binary significand is always 1 it need not be stored, giving an extra bit of precision. To calculate π the formula is
where n is the normalized significand's n-th bit from the left. Normalization, which is reversed when 1 is being added above, can be thought of as a form of compression; it allows a binary significand to be compressed into a field one bit shorter than the maximum precision, at the expense of extra processing.
The word "mantissa" is often used as a synonym for significand. Use of mantissa in place of significand or coefficient is discouraged, as the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the manner in which logarithm
Common logarithm
The common logarithm is the logarithm with base 10. It is also known as the decadic logarithm, named after its base. It is indicated by log10, or sometimes Log with a capital L...
tables were used before computers became commonplace. Log tables were actually tables of mantissas.
Some other computer representations for non-integral numbers
Floating-point representation, in particular the standard IEEE format, is by far the most common way of representing an approximation to real numbers in computers because it is efficiently handled in most large computer processors. However, there are alternatives:- Fixed-pointFixed-point arithmeticIn computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after the radix point...
representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating-point and is also commonly used to perform integer operations. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications. - Binary-coded decimalBinary-coded decimalIn computing and electronic systems, binary-coded decimal is a digital encoding method for numbers using decimal notation, with each decimal digit represented by its own binary sequence. In BCD, a numeral is usually represented by four bits which, in general, represent the decimal range 0 through 9...
(BCD) is an encoding for decimal numbers in which each digit is represented by its own binary sequence. It is possible to implement a floating point system with BCD encoding. - Logarithmic number systemLogarithmic Number SystemA logarithmic number system is an arithmetic system used for representing real numbers in computer and digital hardware, especially for digital signal processing.-Theory:...
s represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating-point, but the value-to-representation curve, i. e. the graph of the logarithm function, is smooth (except at 0). Contrary to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are easy to implement but addition and subtraction are difficult. - Where greater precision is desired, floating-point arithmetic can be implemented (typically in software) with variable-length significands (and sometimes exponents) that are sized depending on actual need and depending on how the calculation proceeds. This is called arbitrary-precisionArbitrary-precision arithmeticIn computer science, arbitrary-precision arithmetic indicates that calculations are performed on numbers whose digits of precision are limited only by the available memory of the host system. This contrasts with the faster fixed-precision arithmetic found in most ALU hardware, which typically...
floating point arithmetic. - Some numbers (e.g., 1/3 and 0.1) cannot be represented exactly in binary floating-point no matter what the precision. Software packages that perform rational arithmeticFraction (mathematics)A fraction represents a part of a whole or, more generally, any number of equal parts. When spoken in everyday English, we specify how many parts of a certain size there are, for example, one-half, five-eighths and three-quarters.A common or "vulgar" fraction, such as 1/2, 5/8, 3/4, etc., consists...
represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers. - Computer algebra systemComputer algebra systemA computer algebra system is a software program that facilitates symbolic mathematics. The core functionality of a CAS is manipulation of mathematical expressions in symbolic form.-Symbolic manipulations:...
s such as MathematicaMathematicaMathematica is a computational software program used in scientific, engineering, and mathematical fields and other areas of technical computing...
and Maxima can often handle irrational numbers like or in a completely "formal" way, without dealing with a specific encoding of the significand. Such programs can evaluate expressions like "" exactly, because they "know" the underlying mathematics.
Range of floating-point numbers
By allowing the radix pointRadix point
In mathematics and computing, a radix point is the symbol used in numerical representations to separate the integer part of a number from its fractional part . "Radix point" is a general term that applies to all number bases...
to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as
- 0.12 × 0.12 = 0.0144
would be expressed as × (1.2) = (1.44).
In a fixed-point system with the decimal point at the left, it would be
- 0.120 × 0.120 = 0.014.
A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.
The range of floating-point numbers depends on the number of bits or digits used for representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10−308 to 10308, because the range of the exponent is [−1022,1023] and 308 is approximately log10(21023). The complete range of the format is from about −10308 through +10308 (see IEEE 754).
The number of normalized floating point numbers in a system F(B, P, L, U) (where B is the base of the system, P is the precision of the system to P numbers, L is the smallest exponent representable in the system, and U is the largest exponent used in the system) is:
.
There is a smallest positive normalized floating-point number,
Underflow level = UFL =
which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.
There is a largest floating point number,
Overflow level = OFL = which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.
In addition there are representable values strictly between −UFL and UFL. Namely, zero and negative zero, as well as subnormal numbers.
History
In 1938, Konrad ZuseKonrad Zuse
Konrad Zuse was a German civil engineer and computer pioneer. His greatest achievement was the world's first functional program-controlled Turing-complete computer, the Z3, which became operational in May 1941....
of Berlin completed the "Z1
Z1 (computer)
The Z1 was a mechanical computer designed by Konrad Zuse from 1935 to 1936 and built by him from 1936 to 1938. It was a binary electrically driven mechanical calculator with limited programmability, reading instructions from punched tape....
", the first mechanical binary programmable computer. It worked with 22-bit binary floating-point numbers having a 7-bit signed exponent, a 15-bit significand (including one implicit bit), and a sign bit. The memory used sliding metal parts to store 64 words of such numbers. The relay
Relay
A relay is an electrically operated switch. Many relays use an electromagnet to operate a switching mechanism mechanically, but other operating principles are also used. Relays are used where it is necessary to control a circuit by a low-power signal , or where several circuits must be controlled...
-based Z3, completed in 1941, implemented floating point arithmetic exceptions with representations for plus and minus infinity and undefined.
The first commercial computer with floating point hardware was Zuse's Z4
Z4 (computer)
The Z4 was the world's first commercial digital computer, designed by German engineer Konrad Zuse and built by his company Zuse Apparatebau between 1942 and 1945....
computer designed in 1942–1945. The Bell Laboratories Mark V computer implemented decimal floating point in 1946. The mass-produced vacuum tube
Vacuum tube
In electronics, a vacuum tube, electron tube , or thermionic valve , reduced to simply "tube" or "valve" in everyday parlance, is a device that relies on the flow of electric current through a vacuum...
-based IBM 704
IBM 704
The IBM 704, the first mass-produced computer with floating point arithmetic hardware, was introduced by IBM in 1954. The 704 was significantly improved over the IBM 701 in terms of architecture as well as implementations which were not compatible with its predecessor.Changes from the 701 included...
followed a decade later in 1954; it introduced the use of a biased exponent
Exponent bias
In IEEE 754 floating point numbers, the exponent is biased in the engineering sense of the word – the value stored is offset from the actual value by the exponent bias....
. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computing" capability. It was not until 1989 that general-purpose computers had floating point capability in hardware as standard.
The UNIVAC 1100/2200 series
UNIVAC 1100/2200 series
The UNIVAC 1100/2200 series is a series of compatible 36-bit computer systems, beginning with the UNIVAC 1107 in 1962, initially made by Sperry Rand...
, introduced in 1962, supported two floating-point formats. Single precision used 36 bits, organized into a 1-bit sign, an 8-bit exponent, and a 27-bit significand. Double precision used 72 bits organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand. The IBM 7094, introduced the same year, also supported single and double precision, with slightly different formats.
Prior to the IEEE-754 standard, computers used many different forms of floating-point. These differed in the word sizes, the format of the representations, and the rounding behavior of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.
The IEEE-754 standard was created in the early 1980s after word sizes of 32 bits (or 16 or 64) had been generally settled upon. This was based on a proposal from Intel who were designing the i8087
Intel 8087
The Intel 8087, announced in 1980, was the first floating-point coprocessor for the 8086 line of microprocessors. It had 45,000 transistors and was manufactured as a 3 μm depletion load HMOS circuit. The 8087 was built to be paired with the Intel 8088 or 8086 microprocessors...
numerical coprocessor. Among the innovations are these:
- A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.
- A precisely specified behavior of the arithmetic operations. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.
- The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and be handled by the software in a controlled way.
IEEE 754: floating point in modern computers
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own formatIBM Floating Point Architecture
IBM System/360 computers, and subsequent machines based on that architecture , support a hexadecimal floating-point format.In comparison to IEEE 754 floating-point, the IBM floating point format has a longer significand, and a shorter exponent....
(in addition to the IEEE 754 binary and decimal formats), and Cray
Cray
Cray Inc. is an American supercomputer manufacturer based in Seattle, Washington. The company's predecessor, Cray Research, Inc. , was founded in 1972 by computer designer Seymour Cray. Seymour Cray went on to form the spin-off Cray Computer Corporation , in 1989, which went bankrupt in 1995,...
vector machines, where the T90
Cray T90
The Cray T90 series was the last of a line of vector processing supercomputers manufactured by Cray Research, Inc, superseding the Cray C90 series...
series had an IEEE version, but the SV1
Cray SV1
The Cray SV1 is a vector processor supercomputer from the Cray Research division of Silicon Graphics introduced in 1998. The SV1 has since been succeeded by the Cray X1 and X1E vector supercomputers. Like its predecessor, the Cray J90, the SV1 used CMOS processors, which lowered the cost of the...
still uses Cray floating-point format.
The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats, and two of these are especially widely used in computer hardware and languages:
- Single precision, called "float" in the CC (programming language)C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
language family, and "real" or "real*4" in FortranFortranFortran is a general-purpose, procedural, imperative programming language that is especially suited to numeric computation and scientific computing...
. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits). - Double precisionDouble precisionIn computing, double precision is a computer number format that occupies two adjacent storage locations in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point .Modern computers with 32-bit storage locations...
, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
The other basic formats are quadruple precision (128-bit) binary, as well as decimal floating point
Decimal floating point
Decimal floating point arithmetic refers to both a representation and operations on decimal floating point numbers. Working directly with decimal fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions and binary fractions.The...
(64-bit) and "double" (128-bit) decimal floating point.
Less common formats include:
- Extended precisionExtended precisionThe term extended precision refers to storage formats for floating point numbers not falling into the regular sequence of single, double, and quadruple precision formats...
format, 80-bit floating point value. Sometimes "long doubleLong doubleIn C and related programming languages, long double refers to a floating point data type that is often more precise than double precision. As with C's other floating point types, it may not necessarily map to an IEEE format.-History:...
" is used for this in the C language family, though "long double" may be a synonym for "double" or may stand for quadruple precision. - Half, also called float16, a 16-bit floating point value.
Any integer with absolute value less than or equal to 224 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 253 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
The standard specifies some special values, and their representation: positive infinity
Infinity
Infinity is a concept in many fields, most predominantly mathematics and physics, that refers to a quantity without bound or end. People have developed various ideas throughout history about the nature of infinity...
(+∞), negative infinity (−∞), a negative zero (−0) distinct from ordinary ("positive") zero, and "not a number" values (NaN
NaN
In computing, NaN is a value of the numeric data type representing an undefined or unrepresentable value, especially in floating-point calculations...
s).
Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. Apart from these special cases, more significant bits are stored before less significant bits. All values except NaN are strictly smaller than +∞ and strictly greater than −∞.
To a rough approximation, the bit representation of an IEEE binary floating-point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
Although the 32 bit ("single") and 64 bit ("double") formats are by far the most common, the standard actually allows for many different precision levels. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit extended precision
Extended precision
The term extended precision refers to storage formats for floating point numbers not falling into the regular sequence of single, double, and quadruple precision formats...
format, with a 15 bit exponent, a 64 bit significand, and no hidden bit.
There is controversy about the failure of most programming languages to make these extended precision formats available to programmers (although C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
and related programming languages usually provide these formats via the long double
Long double
In C and related programming languages, long double refers to a floating point data type that is often more precise than double precision. As with C's other floating point types, it may not necessarily map to an IEEE format.-History:...
type on such hardware). System vendors may also provide additional extended formats (e.g. 128 bits) emulated in software.
A project for revising the IEEE 754 standard was started in 2000 (see IEEE 754 revision
IEEE 754 revision
IEEE 754-2008 was published in August 2008 and is a significant revision to, and replaces, the IEEE 754-1985 floating point standard...
); it was completed and approved in June 2008. It includes decimal floating-point formats and a 16 bit floating point format ("binary16"). binary16 has the same structure and rules as the older formats, with 1 sign bit, 5 exponent bits and 10 trailing significand bits. It is being used in the NVIDIA Cg graphics language, and in the openEXR standard.
Internal representation
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:Type | Sign | Exponent | Significand | Total bits | Exponent bias | Bits precision | |
---|---|---|---|---|---|---|---|
Half (IEEE 754-2008) | 1 | 5 | 10 | 16 | 15 | 11 | |
Single | 1 | 8 | 23 | 32 | 127 | 24 | |
Double Double precision In computing, double precision is a computer number format that occupies two adjacent storage locations in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point .Modern computers with 32-bit storage locations... |
1 | 11 | 52 | 64 | 1023 | 53 | |
Quad | 1 | 15 | 112 | 128 | 16383 | 113 |
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and subnormal numbers, values of all 1s are reserved for the infinities and NaNs. The exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normalised numbers exclude subnormal values, zeros, infinities, and NaNs.
In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has a significand with 24 bits of precision, double precision format has 53, and quad has 113.
For example, it was shown above that π, rounded to 24 bits of precision, has:
- sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as
- 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB as a hexadecimalHexadecimalIn mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...
number.
Signed zero
In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most run-time environments, positive zero is usually printed as "0", while negative zero may be printed as "-0". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity (exactly), while 1/+0 returns positive infinity (exactly). A sign symmetric arccot operation will give different results for +0 and −0 without any exception. The difference between +0 and −0 is mostly noticeable for complex operations at so-called branch cuts.Subnormal numbers
Subnormal values fill the underflowArithmetic underflow
The term arithmetic underflow is a condition in a computer program that can occur when the true result of afloating point operation is smaller in magnitude...
gap with values
where the absolute distance between them are the same as for
adjacent values just outside of the underflow gap.
This is an improvement over the older practice to just have zero in the underflow gap,
and where underflowing results were replaced by zero (flush to zero).
Modern floating point hardware usually handles subnormal values (as well as normal values),
and does not require software emulation for subnormals.
Infinities
The infinities of the extended real number lineExtended real number line
In mathematics, the affinely extended real number system is obtained from the real number system R by adding two elements: +∞ and −∞ . The projective extended real number system adds a single object, ∞ and makes no distinction between "positive" or "negative" infinity...
can be represented in IEEE floating point datatypes,
just like ordinary floating point values like 1, 1.5 etc.
They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as
replacement values when there is an overflow
Arithmetic overflow
The term arithmetic overflow or simply overflow has the following meanings.# In a computer, the condition that occurs when a calculation produces a result that is greater in magnitude than that which a given register or storage location can store or represent.# In a computer, the amount by which a...
. Upon a divide by zero exception,
a positive or negative infinity is returned as an exact result. An infinity can also be introduced as
a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).
IEEE 754 requires infinities to be handled in a reasonable way, such as
- (+∞) + (+7) = (+∞)
- (+∞) × (−2) = (−∞)
- (+∞) × 0 = NaN – there is no meaningful thing to do
NaNs
IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). There are two kinds of NaNs, signaling and quiet. Using a signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid" exception. Using a quiet NaN merely causes the result to be NaN too.The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.
Representable numbers, conversion and rounding
By their nature, all numbers expressed in floating-point format are rational numberRational number
In mathematics, a rational number is any number that can be expressed as the quotient or fraction a/b of two integers, with the denominator b not equal to zero. Since b may be equal to 1, every integer is a rational number...
s with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π
Pi
' is a mathematical constant that is the ratio of any circle's circumference to its diameter. is approximately equal to 3.14. Many formulae in mathematics, science, and engineering involve , which makes it one of the most important mathematical constants...
or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the number 123456789 cannot be exactly represented if only eight decimal digits of precision are available.
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value to the original, and the value thus adjusted is called the rounded value.
Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
- e = −4; s = 1100110011001100110011001100110011...,
where, as previously, s is the significand and e is the exponent.
When rounded to 24 bits this becomes
- e = −4; s = 110011001100110011001101,
which is actually 0.100000001490116119384765625 in decimal.
As a further example, the real number π
Pi
' is a mathematical constant that is the ratio of any circle's circumference to its diameter. is approximately equal to 3.14. Many formulae in mathematics, science, and engineering involve , which makes it one of the most important mathematical constants...
, represented in binary as an infinite series of bits is
- 11.0010010000111111011010101000100010000101101000110000100011010011...
but is
- 11.0010010000111111011011
when approximated by rounding
Rounding
Rounding a numerical value means replacing it by another value that is approximately equal but has a shorter, simpler, or more explicit representation; for example, replacing $23.4476 with $23.45, or the fraction 312/937 with 1/3, or the expression √2 with 1.414.Rounding is often done on purpose to...
to a precision of 24 bits.
In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1.
This has a decimal value of
- 3.1415927410125732421875,
whereas a more accurate approximation of the true value of π is
- 3.14159265358979323846264338327950...
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error
Discretization error
In numerical analysis, computational physics, and simulation, discretization error is error resulting from the fact that a function of a continuous variable is represented in the computer by a finite number of evaluations, for example, on a lattice...
and is limited by the machine epsilon
Machine epsilon
Machine epsilon gives an upper bound on the relative error due to rounding in floating point arithmetic. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science...
.
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place
Unit in the Last Place
In computer science and numerical analysis, unit in the last place or unit of least precision is the spacing between floating-point numbers, i.e., the value the least significant bit represents if it is 1...
(ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22hex and 1.45a70c24hex, the ULP is 2×16−8, or 2−31. For numbers with an exponent of 0, a ULP is exactly 2−23 or about 10−7 in single precision, and about 10−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.
Rounding modes
Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. There are several different rounding schemesRounding
Rounding a numerical value means replacing it by another value that is approximately equal but has a shorter, simpler, or more explicit representation; for example, replacing $23.4476 with $23.45, or the fraction 312/937 with 1/3, or the expression √2 with 1.414.Rounding is often done on purpose to...
(or rounding modes). Historically, truncation
Truncation
In mathematics and computer science, truncation is the term for limiting the number of digits right of the decimal point, by discarding the least significant ones.For example, consider the real numbersThe result would be:- Truncation and floor function :...
was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even
Rounding
Rounding a numerical value means replacing it by another value that is approximately equal but has a shorter, simpler, or more explicit representation; for example, replacing $23.4476 with $23.45, or the fraction 312/937 with 1/3, or the expression √2 with 1.414.Rounding is often done on purpose to...
, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. It means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)
Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:
- round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
- round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
- round up (toward +∞; negative results thus round toward zero)
- round down (toward −∞; negative results thus round away from zero)
- round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)
Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error are multi-precision floating-point, and interval arithmetic
Interval arithmetic
Interval arithmetic, interval mathematics, interval analysis, or interval computation, is a method developed by mathematicians since the 1950s and 1960s as an approach to putting bounds on rounding errors and measurement errors in mathematical computation and thus developing numerical methods that...
.
A further use of rounding is when a number is explicitly rounded to a certain number of decimal (or binary) places, as when rounding a result to euros and cents (two decimal places).
Floating-point arithmetic operations
For ease of presentation and understanding, decimal radixRadix
In mathematical numeral systems, the base or radix for the simplest case is the number of unique digits, including zero, that a positional numeral system uses to represent numbers. For example, for the decimal system the radix is ten, because it uses the ten digits from 0 through 9.In any numeral...
with 7 digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix
Radix
In mathematical numeral systems, the base or radix for the simplest case is the number of unique digits, including zero, that a positional numeral system uses to represent numbers. For example, for the decimal system the radix is ten, because it uses the ten digits from 0 through 9.In any numeral...
or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.
Addition and subtraction
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by three digits, and we then proceed with the usual addition method:123456.7 = 1.234567 × 10^5
101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5
Hence:
123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
= (1.234567 × 10^5) + (0.001017654 × 10^5)
= (1.234567 + 0.001017654) × 10^5
= 1.235584654 × 10^5
In detail:
e=5; s=1.234567 (123456.7)
+ e=2; s=1.017654 (101.7654)
e=5; s=1.234567
+ e=5; s=0.001017654 (after shifting)
--------------------
e=5; s=1.235584654 (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is
e=5; s=1.235585 (final sum: 123558.5)
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error
Round-off error
A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate this error when using approximation equations and/or algorithms, especially when using finitely many...
. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
e=5; s=1.234567
+ e=−3; s=9.876543
e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
e=5; s=1.23456709876543 (true sum)
e=5; s=1.234567 (after rounding/normalization)
Another problem of loss of significance occurs when two close numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are representations of the rationals 123457.1467 and 123456.659.
e=5; s=1.234571
− e=5; s=1.234567
----------------
e=5; s=0.000004
e=−1; s=4.000000 (after rounding/normalization)
The best representation of this difference is e = −1; s = 4.877000, which differs more than 20% from e = −1; s = 4.000000. In extreme cases, the final result may be zero even though an exact calculation may be several million. This cancellation
Loss of significance
Loss of significance is an undesirable effect in calculations using floating-point arithmetic. It occurs when an operation on two numbers increases relative error substantially more than it increases absolute error, for example in subtracting two large and nearly equal numbers. The effect is that...
illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis
Numerical analysis
Numerical analysis is the study of algorithms that use numerical approximation for the problems of mathematical analysis ....
; see also Accuracy problems.
Multiplication and division
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.e=3; s=4.734612
× e=5; s=5.417242
-----------------------
e=8; s=25.648538980104 (true product)
e=8; s=25.64854 (after rounding)
e=9; s=2.564854 (after normalization)
Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm
Booth's multiplication algorithm
Booth's multiplication algorithm is a multiplication algorithm that multiplies two signed binary numbers in two's complement notation. The algorithm was invented by Andrew Donald Booth in 1950 while doing research on crystallography at Birkbeck College in Bloomsbury, London...
and digital division
Division (digital)
Several algorithms exist to perform division in digital designs. These algorithms fall into two main categories: slow division and fast division. Slow division algorithms produce one digit of the final quotient per iteration. Examples of slow division include restoring, non-performing restoring,...
).
For a fast, simple method, see the Horner method.
Dealing with exceptional cases
Floating-point computation in a computer can run into three kinds of problems:
- An operation can be mathematically illegal, such as division by zero.
- An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numberComplex numberA complex number is a number consisting of a real part and an imaginary part. Complex numbers extend the idea of the one-dimensional number line to the two-dimensional complex plane by using the number line for the real part and adding a vertical axis to plot the imaginary part...
s). - An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflowArithmetic overflowThe term arithmetic overflow or simply overflow has the following meanings.# In a computer, the condition that occurs when a calculation produces a result that is greater in magnitude than that which a given register or storage location can store or represent.# In a computer, the amount by which a...
(exponent too large), underflowArithmetic underflowThe term arithmetic underflow is a condition in a computer program that can occur when the true result of afloating point operation is smaller in magnitude...
(exponent too small) or denormalizationDenormal numberIn computer science, denormal numbers or denormalized numbers fill the underflow gap around zero in floating point arithmetic: any non-zero number which is smaller than the smallest normal number is 'sub-normal'.For example, if the smallest positive 'normal' number is 1×β−n In computer...
(precision loss).
Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind
of trap
Trap (computing)
In computing and operating systems, a trap, also known as an exception or a fault, is typicallyThere is a wide variation in the nomenclature...
that the programmer might be able to catch. How this worked was system-dependent,
meaning that floating-point programs were not portable
Porting
In computer science, porting is the process of adapting software so that an executable program can be created for a computing environment that is different from the one for which it was originally designed...
.
The original IEEE 754 standard (from 1984) took a first step towards a standard way for the IEEE 754 based operations to record that an error occurred. Here, trapping was ignored (optional in the 1984 version) and "alternate exception handling modes" (replacing trapping in the 2008 version, but still optional), and just looking at the required default method of handling exceptions according to IEEE 754. Arithmetic exceptions are (by default) required to be recorded in "sticky" error indicator bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero error bit.
The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic error bits. So while these were implemented in hardware, initially programming language implementations did not automatically provide a means to access them (apart from assembler). Over time some programming language standards (e.g., C and Fortran) have been updated to specify methods to access and change status and error bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic error bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a means
Concurrency (computer science)
In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other...
outside of the standard.
IEEE 754 specifies five arithmetic errors that are to be recorded in "sticky bits":
- inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
- underflow, set if the rounded value is tiny (as specified in IEEE 754) and inexact (or maybe limited to if it has denormalisation loss, as per the 1984 version of IEEE 754), returning a subnormal value including the zeros.
- overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
- divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
- invalid, set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.
Accuracy problems
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision
Precision (computer science)
In computer science, precision of a numerical quantity is a measure of the detail in which the quantity is expressed. This is usually measured in bits, but sometimes in decimal digits. It is related to precision in mathematics, which describes the number of digits that are used to express a...
with which computers generally represent numbers.
For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is
- 0.100000001490116119384765625 exactly.
Squaring this number gives
- 0.010000000298023226097399174250313080847263336181640625 exactly.
Squaring it with single-precision floating-point hardware (with rounding) gives
- 0.010000000707805156707763671875 exactly.
But the representable number closest to 0.01 is
- 0.009999999776482582092285156250 exactly.
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:
will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.
By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225 in double precision, or −0.8742 in single precision.
While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit decimal arithmetic:
a = 1234.567, b = 45.67834, c = 0.0004
(a + b) + c:
1234.567 (a)
+ 45.67834 (b)
____________
1280.24534 rounds to 1280.245
1280.245 (a + b)
+ 0.0004 (c)
____________
1280.2454 rounds to 1280.245 <--- (a + b) + c
a + (b + c):
45.67834 (b)
+ 0.0004 (c)
____________
45.67874
45.67874 (b + c)
+ 1234.567 (a)
____________
1280.24574 rounds to 1280.246 <--- a + (b + c)
They are also not necessarily distributive. That is, (a + b) ×c may not be the same as a×c + b×c:
1234.567 × 3.333333 = 4115.223
1.234567 × 3.333333 = 4.115223
4115.223 + 4.115223 = 4119.338
but
1234.567 + 1.234567 = 1235.802
1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
- Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits. For example, when determining a derivativeDerivativeIn calculus, a branch of mathematics, the derivative is a measure of how a function changes as its input changes. Loosely speaking, a derivative can be thought of as how much one quantity is changing in response to changes in some other quantity; for example, the derivative of the position of a...
of a function the following formula is used:
- Intuatively one would want an h very close to zero, however when using floating point operations, the smallest number won't give the best approximation of a derivative. As h grows smaller the difference between f(a + h) and f(a) grows smaller, cancelling out the most significant and least erroneous digits and making the most erroneous digits more important. As a result the smallest number of h possible will give a more erroneous approximation of a derivative than a somewhat larger number. This is perhaps the most common and serious accuracy problem.
- Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
- Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal number or zero. In these cases precision will be lost.
- Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow.
- Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that does not necessarily make the problem go away.
Machine precision
"Machine precision" is a quantity that characterizes the accuracy of a floating point system. It is also known as unit roundoff or machine epsilonMachine epsilon
Machine epsilon gives an upper bound on the relative error due to rounding in floating point arithmetic. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science...
. Usually denoted Εmach, its value depends on the particular rounding being used.
With rounding to zero,
whereas rounding to nearest,
This is important since it bounds the relative error in representing any non-zero real number x within the normalised range of a floating point system:
Minimizing the effect of accuracy problems
Because of the issues noted above, naive use of floating-point arithmetic can lead to many problems. The creation of thoroughly robust floating-point software is a complicated undertaking, and a good understanding of numerical analysisNumerical analysis
Numerical analysis is the study of algorithms that use numerical approximation for the problems of mathematical analysis ....
is essential.
In addition to careful design of programs, careful handling by the compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.
Binary floating-point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton
Proton
The proton is a subatomic particle with the symbol or and a positive electric charge of 1 elementary charge. One or more protons are present in the nucleus of each atom, along with neutrons. The number of protons in each atom is its atomic number....
), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. The "decimal" data type of the C# and Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
programming languages, and the IEEE 754-2008 decimal floating-point standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.
Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.
Expectations from mathematics may not be realised in the field of floating-point computation. For example, it is known that , and that . These facts cannot be counted on when the quantities involved are the result of floating-point computation.
A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Descriptions of a few simple techniques follow.
The use of the equality test (
if (xy) ...
) is usually not recommended when dealing with floating point numbers. Even simple expressions like 0.6/0.2-30
will, on most computers, fail to be true (in 64-bit64-bit
64-bit is a word size that defines certain classes of computer architecture, buses, memory and CPUs, and by extension the software that runs on them. 64-bit CPUs have existed in supercomputers since the 1970s and in RISC-based workstations and servers since the early 1990s...
Perl
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
, for example,
0.6/0.2-3
is approximately equal to -4.44089209850063e-16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ...
, where epsilonMachine epsilon
Machine epsilon gives an upper bound on the relative error due to rounding in floating point arithmetic. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science...
is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like
3253.671
+ 3.141276
--------
3256.812
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm
Kahan summation algorithm
In numerical analysis, the Kahan summation algorithm significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach...
may be used to reduce the errors.
Computations may be rearranged in a way that is mathematically equivalent but less prone to error. As an example, Archimedes
Archimedes
Archimedes of Syracuse was a Greek mathematician, physicist, engineer, inventor, and astronomer. Although few details of his life are known, he is regarded as one of the leading scientists in classical antiquity. Among his advances in physics are the foundations of hydrostatics, statics and an...
approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:
Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:
i 6 × 2i × ti, first form 6 × 2i × ti, second form
0 .4641016151377543863 .4641016151377543863
1 .2153903091734710173 .2153903091734723496
2 596599420974940120 596599420975006733
3 60862151314012979 60862151314352708
4 27145996453136334 27145996453689225
5 8730499801259536 8730499798241950
6 6627470548084133 6627470568494473
7 6101765997805905 6101766046906629
8 70343230776862 70343215275928
9 37488171150615 37487713536668
10 9278733740748 9273850979885
11 7256228504127 7220386148377
12 717412858693 707019992125
13 189011456060 78678454728
14 717412858693 46593073709
15 19358822321783 8571730119
16 717412858693 6566394222
17 810075796233302 6065061913
18 717412858693 939728836
19 4061547378810956 908393901
20 05434924008406305 900560168
21 00068646912273617 8608396
22 349453756585929919 8122118
23 00068646912273617 95552
24 .2245152435345525443 68907
25 62246
26 62246
27 62246
28 62246
The true value is
While the two forms of the recurrence formula are clearly mathematically equivalent, the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
Further reading
- What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys.
- Donald KnuthDonald KnuthDonald Ervin Knuth is a computer scientist and Professor Emeritus at Stanford University.He is the author of the seminal multi-volume work The Art of Computer Programming. Knuth has been called the "father" of the analysis of algorithms...
. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89684-2. Section 4.2: Floating Point Arithmetic, pp. 214–264. - Press et al. Numerical RecipesNumerical RecipesNumerical Recipes is the generic title of a series of books on algorithms and numerical analysis by William H. Press, Saul Teukolsky, William Vetterling and Brian Flannery. In various editions, the books have been in print since 1986...
in C++C++C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...
. The Art of Scientific Computing, ISBN 0-521-75033-4.
External links
- Kahan, William and Darcy, Joseph (2001). How Java's floating-point hurts everyone everywhere. Retrieved September 5, 2003.
- Survey of Floating-Point Formats This page gives a very brief summary of floating-point formats that have been used over the years.
- The pitfalls of verifying floating-point computations, by David Monniaux, also printed in ACMAssociation for Computing MachineryThe Association for Computing Machinery is a learned society for computing. It was founded in 1947 as the world's first scientific and educational computing society. Its membership is more than 92,000 as of 2009...
Transactions on programming languages and systems (TOPLAS), May 2008: a compendium of non-intuitive behaviours of floating-point on popular architectures, with implications for program verification and testing - http://www.opencores.org The OpenCores website contains open source floating point IP cores for the implementation of floating point operators in FPGA or ASIC devices. The project, double_fpu, contains verilog source code of a double precision floating point unit. The project, fpuvhdl, contains vhdl source code of a single precision floating point unit.