Floating-point Representation

The Fortran numeric environment is flexible, which helps make Fortran a strong language for intensive numerical calculations. The Fortran standard purposely leaves the precision of numeric quantities and the method of rounding numeric results unspecified. This allows Fortran to operate efficiently for diverse applications on diverse systems.

Computations on real numbers may not yield what you expect. This happens because the hardware must represent numbers in a finite number of bits.

There are several effects of using finite floating-point numbers. The hardware cannot represent every real number exactly; it must approximate most values by rounding or truncating them to a finite length. In addition, some numbers lie outside the range that the minimum and maximum exponents can represent, so calculations can underflow or overflow. As one consequence of finite precision, many non-zero numbers behave as zero in addition: adding such a number to a much larger one leaves the larger number unchanged.
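The effects described above can be observed with a few standard intrinsics. The following is a minimal sketch, assuming default REAL is an IEEE single-precision type evaluated with round-to-nearest; the module name fp_limits and the function is_absorbed are hypothetical names chosen for this illustration.

```fortran
module fp_limits
  implicit none
contains
  ! True when adding b to a leaves a unchanged, that is, b behaves as zero.
  ! (Hypothetical helper written for this illustration.)
  pure function is_absorbed(a, b) result(absorbed)
    real, intent(in) :: a, b
    logical :: absorbed
    real :: total
    total = a + b
    absorbed = (total == a)
  end function is_absorbed
end module fp_limits

program finite_effects
  use fp_limits
  implicit none

  ! Range and spacing of default REAL as reported by the processor.
  print *, 'largest finite value  :', huge(1.0)
  print *, 'smallest normal value :', tiny(1.0)
  print *, 'spacing above 1.0     :', epsilon(1.0)

  ! epsilon(1.0)/4 is non-zero but smaller than half the spacing at 1.0.
  print *, 'eps/4 absorbed by 1.0 :', is_absorbed(1.0, epsilon(1.0) / 4.0)
  ! 1.0 is smaller than half the spacing between neighboring values near 1.0e8.
  print *, '1.0 absorbed by 1.0e8 :', is_absorbed(1.0e8, 1.0)
  ! Values of similar magnitude combine normally.
  print *, '0.5 absorbed by 1.0   :', is_absorbed(1.0, 0.5)
end program finite_effects
```

Under the stated assumptions the first two tests should report absorption and the third should not; a processor that evaluates expressions in a wider format may behave differently.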

You can minimize the effects of finite representation with programming techniques: avoid comparing floating-point numbers for exact equality in logical expressions, or compare them within a tolerance (for example, IF (ABS(x-10.0) <= 0.001)), and avoid combining or comparing numbers whose magnitudes differ by more than the number of significant bits can accommodate.
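The tolerance test mentioned above can be wrapped in a small function. This is a sketch only: the module fp_compare and the function nearly_equal are hypothetical names, and the absolute tolerance of 0.001 is the one used in the text, not a general recommendation.

```fortran
module fp_compare
  implicit none
contains
  ! True when a and b differ by no more than the absolute tolerance tol.
  ! (Hypothetical helper written for this illustration.)
  pure function nearly_equal(a, b, tol) result(same)
    real, intent(in) :: a, b, tol
    logical :: same
    same = abs(a - b) <= tol
  end function nearly_equal
end module fp_compare

program compare_demo
  use fp_compare
  implicit none
  real :: x
  integer :: i

  ! Accumulate 0.1 one hundred times. Because 0.1 has no exact binary
  ! representation, the sum is close to 10.0 but need not equal it.
  x = 0.0
  do i = 1, 100
    x = x + 0.1
  end do

  print *, 'x                          :', x
  print *, 'x == 10.0                  :', x == 10.0
  print *, 'ABS(x-10.0) <= 0.001       :', nearly_equal(x, 10.0, 0.001)
end program compare_demo
```

An absolute tolerance suits values of known magnitude, as here; for values that span many orders of magnitude, a tolerance scaled by the operands is usually a better fit.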

Floating-point numbers approximate real numbers with a finite number of bits. The value that the bits represent is given by the following formula. The representation is binary, so the base is 2. Each bₙ is a binary digit (0 or 1). The precision P is the number of bits in the nonexponential part of the number (the significand), and E is the exponent. With these parameters, binary floating-point numbers approximate real numbers with the values:

$$(-1)^s \; b_0 . b_1 b_2 \ldots b_{P-1} \times 2^E$$

where $s$ is 0 or 1 (positive or negative, respectively) and $E_{min} \le E \le E_{max}$.
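To make the formula concrete, the following sketch evaluates it directly from a sign bit, a list of significand bits, and an exponent. The module fp_value and the function fp_from_bits are hypothetical names introduced for this illustration; the function models the formula only and does not reflect how hardware stores numbers.

```fortran
module fp_value
  implicit none
contains
  ! Evaluate (-1)**s * (b0.b1b2...b(P-1)) * 2**e for the given bits.
  ! (Hypothetical helper written for this illustration.)
  pure function fp_from_bits(s, b, e) result(v)
    integer, intent(in) :: s       ! sign bit: 0 or 1
    integer, intent(in) :: b(0:)   ! significand bits b(0) .. b(P-1)
    integer, intent(in) :: e       ! unbiased exponent
    double precision :: v
    integer :: n

    ! Bit b(n) carries the weight 2**(-n).
    v = 0.0d0
    do n = 0, ubound(b, 1)
      v = v + b(n) * 2.0d0**(-n)
    end do
    v = (-1)**s * v * 2.0d0**e
  end function fp_from_bits
end module fp_value

program formula_demo
  use fp_value
  implicit none

  ! Binary 1.01 is decimal 1.25; scaled by 2**(-3) this gives 0.15625.
  print *, fp_from_bits(0, [1, 0, 1], -3)
  ! Binary 1.1 is decimal 1.5; negated and scaled by 2**1 this gives -3.0.
  print *, fp_from_bits(1, [1, 1], 1)
end program formula_demo
```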

The following table gives the standard values for these parameters for single, double, and quad (extended precision) formats and the resulting bit widths for the sign, the exponent, and the full number.

Parameters for IEEE Floating-Point Formats

Parameter	Single	Double	Quad or Extended Precision (IEEE_X)*
Sign width in bits	1	1	1
P	24	53	113
E_max	+127	+1023	+16383
E_min	-126	-1022	-16382
Exponent bias	+127	+1023	+16383
Exponent width in bits	8	11	15
Format width in bits	32	64	128

* This type is emulated in software.

Because b₀ is implicitly 1 for normalized numbers, it does not need to be stored. The number of bits actually needed to hold the significand for the precisions 24, 53, and 113 is therefore 23, 52, and 112, respectively.
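Most of the table entries can be queried from a running program with standard inquiry intrinsics. The sketch below assumes that the kinds real32 and real64 from ISO_FORTRAN_ENV map to the IEEE single and double formats, which is common but not guaranteed by the Fortran standard. Note that the Fortran numeric model writes the significand as 0.f₁f₂... rather than b₀.b₁b₂..., so MAXEXPONENT and MINEXPONENT are each one greater than the E_max and E_min used in the table, and DIGITS counts the implicit leading bit.

```fortran
program ieee_parameters
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  real(real32) :: s   ! used only as an argument to inquiry intrinsics
  real(real64) :: d

  print '(a)',     'Parameter                Single  Double'
  print '(a,2i8)', 'P                     ', digits(s), digits(d)
  ! Subtract 1 to convert from the Fortran model exponent to E.
  print '(a,2i8)', 'E_max                 ', maxexponent(s) - 1, maxexponent(d) - 1
  print '(a,2i8)', 'E_min                 ', minexponent(s) - 1, minexponent(d) - 1
  ! Width = 1 sign bit + exponent bits + (P - 1) stored significand bits,
  ! so the exponent width is the format width minus P.
  print '(a,2i8)', 'Exponent width in bits', storage_size(s) - digits(s), &
                                             storage_size(d) - digits(d)
  print '(a,2i8)', 'Format width in bits  ', storage_size(s), storage_size(d)
end program ieee_parameters
```

The quad format is left out because a 128-bit real kind is not available on every processor.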

A bias is added to all exponents so that every stored exponent is a non-negative integer. This expedites comparisons of exponent values. The stored exponent e is:

e = E + bias
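The relation between the stored exponent and E can be checked by inspecting the bit pattern of a single-precision value. The following is a sketch under the assumption that real32 is the IEEE single format from the table (1 sign bit, 8 exponent bits, 23 stored significand bits, bias 127); the module fp_fields and its function names are hypothetical.

```fortran
module fp_fields
  use, intrinsic :: iso_fortran_env, only: real32, int32
  implicit none
  integer, parameter :: bias = 127   ! single-precision bias from the table
contains
  ! Stored (biased) exponent: bits 23 through 30 of the 32-bit pattern.
  ! (Hypothetical helper written for this illustration.)
  pure function stored_exponent(x) result(e_stored)
    real(real32), intent(in) :: x
    integer :: e_stored
    e_stored = int(ibits(transfer(x, 0_int32), 23, 8))
  end function stored_exponent

  ! Unbiased exponent E recovered from e = E + bias (normalized x only).
  pure function true_exponent(x) result(e_true)
    real(real32), intent(in) :: x
    integer :: e_true
    e_true = stored_exponent(x) - bias
  end function true_exponent
end module fp_fields

program show_exponent
  use fp_fields
  implicit none
  real(real32) :: x

  x = 8.0_real32   ! 8.0 = 1.0 x 2**3, so E = 3 and e = 3 + 127 = 130
  print *, 'stored e :', stored_exponent(x)
  print *, 'E        :', true_exponent(x)
  ! The EXPONENT intrinsic uses the 0.f1f2... model, so it returns E + 1.
  print *, 'EXPONENT :', exponent(x)
end program show_exponent
```

The subtraction in true_exponent applies to normalized numbers only; zero, subnormal numbers, infinities, and NaNs use reserved stored-exponent values.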