Floating-Point in Java: Representation, Comparison, Equality & A Few Shockers

Floating-point in Java, as in any other modern programming language, is a complex subject. The versatility, power and complexity of floating-point numbers are cloaked in innocent-looking, simple digits. To harness the real power of floating-point numbers and to use them correctly in Java, it is essential to understand how they are represented internally in the JVM and how they differ from integers.

What are Floating-Point Numbers?

A floating-point number is an IEEE 754 compliant data type.

If you need to represent a non-integer number in your program, you will most likely use some kind of floating-point number. In Java, as on most platforms, floating-point numbers follow the computation standard known as IEEE 754. First published in 1985, this standard brought homogeneity to the previously diverse floating-point implementations. It has become so common that virtually all modern general-purpose processors have a special hardware component called the FPU, or floating-point unit.

Among other things the IEEE 754 standard defines –

  1. Format of a number
  2. Operations allowed on the number
  3. Behavior in response to exceptions
  4. Special values such as Not-a-Number (NaN), positive infinity and negative infinity

Format of Floating-Point Number

Understanding the format of a floating-point number is critical to understanding the behavior of these numbers.

The format of a floating-point number specifies two main things –

  1. How to represent finite floating-point numbers
  2. Special values for positive and negative infinities and NaN

Representing Finite Floating-Point Numbers Under IEEE 754

A floating-point number consists of

  1. Sign bit
  2. Fixed number of bits representing the exponent
  3. Fixed number of bits representing the fraction, a.k.a. mantissa, a.k.a. significand, a.k.a. coefficient

[Figure: Floating-point representation (image from Wikipedia)]

Under this scheme a finite floating-point number has the value

$(-1)^{s} \times m \times b^{e-c}$

where

  • s is the sign bit, and therefore can be 0 or 1
  • m is the significand: for normal numbers, an implicit leading 1 followed by the fraction stored in the mantissa bits, so 1 ≤ m < 2
  • b is the base, which is 2 in our case because we are dealing with binary numbers
  • e is the unsigned integer represented by the exponent bits
  • c is the exponent bias: one less than half the number of values the exponent bits can encode. For example, the 8 exponent bits of a single-precision number can encode 2^8 = 256 values, so c for single precision is 256/2 – 1 = 127. Similarly, c is 1023 for doubles (double-precision floating-point numbers), which have 11 exponent bits.
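The formula can be checked directly against the bits of a Java float. The following is a minimal sketch, assuming a normal (not subnormal, infinite or NaN) single-precision input; the class name FloatBits and the method decode are hypothetical names chosen for illustration.

```java
class FloatBits {
    // Rebuilds the value of a normal float from its sign, exponent and
    // mantissa bits using (-1)^s * m * 2^(e - c).
    // Assumption: f is a normal number (not subnormal, infinite or NaN).
    static double decode(float f) {
        int bits = Float.floatToIntBits(f);
        int s = bits >>> 31;              // sign bit
        int e = (bits >>> 23) & 0xFF;     // 8 exponent bits, read as unsigned
        int fraction = bits & 0x7FFFFF;   // 23 mantissa bits
        double m = 1.0 + fraction / (double) (1 << 23); // implicit leading 1
        int c = 127;                      // bias for single precision
        return Math.pow(-1, s) * m * Math.pow(2, e - c);
    }

    public static void main(String[] args) {
        System.out.println(decode(0.15625f)); // 0.15625
        System.out.println(decode(-6.5f));    // -6.5
    }
}
```

Both sample values are exactly representable in binary, which is why the decoded result prints without any rounding noise.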

Representing Special Values Under IEEE 754

As indicated earlier, IEEE 754 makes provisions for three special values – NaN, +∞ and −∞.

These special values have all their exponent bits set to 1. The mantissa bits then distinguish them: an all-zero mantissa denotes infinity, with the sign bit selecting +∞ or −∞, while any non-zero mantissa denotes NaN.
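These bit patterns can be inspected in Java with Float.floatToIntBits. A small sketch follows; the class name SpecialValues and the helper hex are hypothetical.

```java
class SpecialValues {
    // Returns the 32 raw bits of a float as an 8-digit hexadecimal string.
    static String hex(float f) {
        return String.format("%08x", Float.floatToIntBits(f));
    }

    public static void main(String[] args) {
        // Exponent bits all 1, mantissa all 0: infinity; the sign bit picks + or -
        System.out.println(hex(Float.POSITIVE_INFINITY)); // 7f800000
        System.out.println(hex(Float.NEGATIVE_INFINITY)); // ff800000
        // Exponent bits all 1, mantissa non-zero: NaN (Java's canonical NaN)
        System.out.println(hex(Float.NaN));               // 7fc00000

        // The special values also arise from ordinary arithmetic
        System.out.println(1.0f / 0.0f);  // Infinity
        System.out.println(0.0f / 0.0f);  // NaN
    }
}
```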

Floating-Point in Java

Java has a total of eight primitive data types. Two of these eight, float and double, are floating-point data types. The very fact that 2 out of 8 primitive data types are floating-point numbers indicates their importance in computer science.

Note that for the purposes of this discussion we will ignore the reference data type equivalents Double and Float as they are just object based counterparts to primitive data types.

Float

Floats are also referred to as single precision floating-point numbers. Floats have 8 exponent bits and 23 bits of mantissa or significand.

The following table summarizes the key facts about the float data type in Java –

Property                                                        Value
Size in bits                                                    32
Exponent bits                                                   8
Mantissa bits                                                   23
Smallest positive value that can be stored as float in Java     1.401298464324817E-45f
Highest value that can be stored as float in Java               3.4028234663852886E+38f

Double

Doubles are also referred to as double precision floating-point numbers. Doubles have 11 exponent bits and 52 bits of significand.

Double is the default floating-point data type in Java. So, when you refer to the number 2.71, you are implicitly dealing with double.
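A brief sketch of what this default means in practice; the class name LiteralTypes and the helper typeOf are hypothetical.

```java
class LiteralTypes {
    // Returns the simple name of the wrapper class a value is boxed into.
    static String typeOf(Object value) {
        return value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        double d = 2.71;     // a plain decimal literal is a double
        float f = 2.71f;     // a float literal needs the f (or F) suffix
        // float g = 2.71;   // does not compile: lossy conversion from double to float

        System.out.println(typeOf(2.71));   // Double
        System.out.println(typeOf(2.71f));  // Float
        System.out.println(d + " " + f);
    }
}
```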

The following table summarizes the key facts about the double data type in Java –

Property                                                               Value
Size in bits                                                           64
Exponent bits                                                          11
Mantissa bits                                                          52
Smallest positive normal value of a double (Double.MIN_NORMAL)         2.2250738585072014E-308
Smallest positive value that can be stored as double (Double.MIN_VALUE) 4.9E-324
Highest value that can be stored as double in Java                     1.7976931348623157E308
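These limits are available as constants on the Float and Double wrapper classes. The sketch below prints them and shows what happens just past either end of the range; the class name FloatingLimits is hypothetical.

```java
class FloatingLimits {
    public static void main(String[] args) {
        System.out.println(Float.MIN_VALUE);    // 1.4E-45
        System.out.println(Float.MAX_VALUE);    // 3.4028235E38
        System.out.println(Double.MIN_VALUE);   // 4.9E-324
        System.out.println(Double.MIN_NORMAL);  // 2.2250738585072014E-308
        System.out.println(Double.MAX_VALUE);   // 1.7976931348623157E308

        // Going past the largest finite value overflows to infinity
        System.out.println(Float.MAX_VALUE * 2);   // Infinity
        // Going below the smallest positive value underflows to zero
        System.out.println(Double.MIN_VALUE / 2);  // 0.0
    }
}
```

Note that MIN_VALUE is the smallest positive value, not the most negative one; the most negative finite double is -Double.MAX_VALUE.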

Floating-point Shockers: Can You Guess the Output Correctly?

Here are a few programs for you. If you can correctly guess the output of these programs, you are halfway there. If you can’t, don’t fret. We will discuss floating-point numbers in greater detail in other sections.

Program #1: Are all PIs equal?

Consider the program below –

What do you think will be the output of the above program?

Here is the output, explanations will follow later –

Program #2: Are these the same numbers?

Consider the second program below –

What do you think will be the output of the above program?

Here is the output, explanations will follow later –

Program #3: Are these the same numbers (2)?

Let’s consider a slight variation of the previous program by declaring the data type as float rather than double.

What do you think would be the output now? As opposed to the previous program, this time the else branch gets executed!
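A minimal sketch of one way such a float-versus-double flip can arise. The value 0.1 and the names FloatVersusDouble, compareAsDouble and compareAsFloat are assumptions chosen for illustration.

```java
class FloatVersusDouble {
    // A double compared with a double literal of the same value: the if branch runs.
    static String compareAsDouble(double d) {
        if (d == 0.1) {
            return "equal";
        } else {
            return "not equal";
        }
    }

    // A float is widened to double before the comparison. The float nearest
    // to 0.1 is a different number from the double nearest to 0.1, so the
    // else branch runs.
    static String compareAsFloat(float f) {
        if (f == 0.1) {
            return "equal";
        } else {
            return "not equal";
        }
    }

    public static void main(String[] args) {
        System.out.println(compareAsDouble(0.1)); // equal
        System.out.println(compareAsFloat(0.1f)); // not equal
    }
}
```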

Program #4: Am I dreaming? What happened to math?

Consider the program below. Your high school arithmetic is back with a twist –

What would be the output?

Program #5: Mixing Doubles in the already dicey situation

Consider the program below –

What do you think will be the output of the above program? (Hint: The above program gives a strange result for a totally different reason than the other programs in this section.)

Program #6: More confusion with NaN

Let’s add to the chaos by bringing the following program using NaN into the mix.

And this is the output –

Program #7: Even more confusion with NaN

Finally, consider the program below –

Here is the output –

Equality & Comparison of Floating-Point Numbers

All the above examples produce counter-intuitive results because equality and comparison of floating-point numbers are very different from equality and comparison of integers.

One CANNOT rely on the == operator to test computed floating-point numbers for equality.

Floating-point numbers are inherently imprecise, and their precision varies with the data type. The outcome of calculations and comparisons is therefore greatly affected by the number of mantissa bits of the underlying data type.
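A short sketch of this imprecision at work, using values that have no exact binary representation. The class name ImpreciseSums and the method sumOfTenths are hypothetical.

```java
class ImpreciseSums {
    // Adds 0.1 ten times; mathematically the result is 1, but each
    // addition rounds to the nearest representable double.
    static double sumOfTenths() {
        double sum = 0.0;
        for (int i = 0; i < 10; i++) {
            sum += 0.1;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2);             // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3);      // false
        System.out.println(sumOfTenths());         // 0.9999999999999999
        System.out.println(sumOfTenths() == 1.0);  // false
        System.out.println(Double.NaN == Double.NaN); // false: NaN never equals anything
    }
}
```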

Since the == operator is unreliable when dealing with floats, I present a few methods for comparing two floating-point numbers in Java.

Method #1: Constant Epsilon Method

This method for comparing two floating-point numbers is the most inaccurate of all methods presented here and must never be used in production code. Nevertheless, it builds up the intuition behind all other methods and is therefore a fundamental, must-know algorithm.

This fundamental algorithm acknowledges that floating-point arithmetic is imprecise and tries to correct for that using a pre-defined constant called ε (Greek letter Epsilon). ε is defined to be extremely small. So, if two numbers differ by a margin of less than ε, the algorithm deems them equal. ε is sometimes also referred to as error margin, tolerance margin or noise.

Here is an implementation of the constant epsilon method –
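A minimal sketch of the idea, assuming a tolerance of 1e-5; the class name ConstantEpsilon, the constant EPSILON and the method nearlyEqual are hypothetical names.

```java
class ConstantEpsilon {
    // Assumed tolerance; any fixed small value illustrates the technique.
    static final double EPSILON = 1e-5;

    // Deems a and b equal when they differ by less than the fixed EPSILON.
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);            // false
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3)); // true
        System.out.println(nearlyEqual(1.0, 1.1));       // false
    }
}
```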

When invoked with parameters

the output is

Why should the constant epsilon method never be used in production code?

This method relies on a constant ε. ε is very small, so the method behaves reasonably when the operands are much larger than ε (though not so large that the gap between adjacent representable values exceeds ε).

But if one or both operands of the method above are themselves small enough to be of the order of ε, the method breaks down. One possible fix is to reduce ε ‘sufficiently’. But that merely transfers or postpones the problem and does not solve it.

It would be much better if ε were scaled to the same order as the operands.

Method #2: Relative or Scaled Epsilon Method

Doug Gwyn suggested a fix for the above absolute constant ε method. This method treats the error margin ε as a fraction or percentage relative to the size of the operands.

The implementation uses the Java ternary operator, so you would be better off reading all about the Java conditional operator first.

Here is the method that implements floating-point comparison based on a scaled or relative ε –
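A minimal sketch of a scaled comparison along these lines, assuming ε is passed in as a relative tolerance and scaled by the larger operand magnitude; the class name RelativeEpsilon and the method nearlyEqual are hypothetical.

```java
class RelativeEpsilon {
    // Deems a and b equal when their difference is at most epsilon times
    // the larger of the two magnitudes (assumed scaling rule).
    static boolean nearlyEqual(double a, double b, double epsilon) {
        double absA = Math.abs(a);
        double absB = Math.abs(b);
        // Ternary operator picks the larger magnitude as the scale
        double largest = absA > absB ? absA : absB;
        return Math.abs(a - b) <= largest * epsilon;
    }

    public static void main(String[] args) {
        double a = 0.1 + 0.2;
        double b = 0.3;
        System.out.println(a == b);                   // false
        System.out.println(nearlyEqual(a, b, 1e-9));  // true
    }
}
```

Using <= rather than < lets two zeros compare as equal, since both sides of the test are then 0.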

When this method is invoked with the following arguments

the output is

As you can see, there is a difference between what the equality operator returns and what our method returns. Due to the relative closeness of the doubles, scaled epsilon method treats them as equal.

So, how exactly is this method better than the constant ε method? The difference comes to light when these methods are invoked with operands which are of the order of ε as in the example below –

The output of the two methods is –

The relative ε method correctly finds the two operands to be not equal.
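The contrast can be reproduced with a compact side-by-side sketch. The operands 1e-6 and 2e-6, the tolerance 1e-5 and all names here are assumptions chosen so that the operands are smaller than ε.

```java
class EpsilonComparison {
    static final double EPSILON = 1e-5; // assumed tolerance for both methods

    // Fixed tolerance: compares the difference against EPSILON itself.
    static boolean constantEq(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    // Scaled tolerance: compares the difference against EPSILON times
    // the larger operand magnitude.
    static boolean relativeEq(double a, double b) {
        return Math.abs(a - b) <= Math.max(Math.abs(a), Math.abs(b)) * EPSILON;
    }

    public static void main(String[] args) {
        double a = 1e-6;
        double b = 2e-6; // twice as large as a, yet both are below EPSILON
        System.out.println(constantEq(a, b)); // true  (wrongly deemed equal)
        System.out.println(relativeEq(a, b)); // false
    }
}
```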

Method #3: Using compare or compareTo method of Float or Double class

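Java’s Float and Double classes provide two comparison methods – compare and compareTo. Note that these methods compare exact values without any tolerance; they differ from the == operator in how they treat NaN and signed zeros. It’s pretty straightforward to use these methods –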
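A short usage sketch of both methods; the class name CompareDemo and the helper describe are hypothetical.

```java
class CompareDemo {
    // Translates the sign of Double.compare into a readable word.
    static String describe(double a, double b) {
        int c = Double.compare(a, b);
        return c < 0 ? "less" : (c > 0 ? "greater" : "equal");
    }

    public static void main(String[] args) {
        System.out.println(describe(1.5, 2.5));        // less
        System.out.println(describe(0.1 + 0.2, 0.3));  // greater: no tolerance is applied
        // compareTo is the instance-method counterpart on the wrapper class
        System.out.println(Double.valueOf(2.5).compareTo(2.5)); // 0

        // Unlike ==, compare treats NaN as equal to itself ...
        System.out.println(Double.NaN == Double.NaN);         // false
        System.out.println(describe(Double.NaN, Double.NaN)); // equal
        // ... and orders 0.0 above -0.0, although 0.0 == -0.0 is true
        System.out.println(0.0 == -0.0);                      // true
        System.out.println(describe(0.0, -0.0));              // greater
    }
}
```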

The output of the above method when invoked as follows

is

Conclusion

In previous articles on CodingRaptor.com we examined various data types in Java. Floating-point data types deserve extra attention because of their idiosyncrasies. We therefore began by examining how floating-point numbers are encoded, and then we examined the two floating-point data types in Java. We saw seven Java programs that yielded surprising results because of the way comparison of floating-point numbers works. These seven programs represent the most common pitfalls related to floats and doubles in Java.

Due to the way floating-point numbers are represented internally, their precision is always limited. Floating-point arithmetic is bound to give imprecise results at some point and therefore should not be used in situations where precision is important. If precision is important, one must use Java’s BigDecimal class.
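A brief sketch of the BigDecimal alternative, assuming the values are supplied as decimal strings; the class name ExactDecimals and the method sum are hypothetical.

```java
import java.math.BigDecimal;

class ExactDecimals {
    // Adds two decimal numbers given as strings, with no binary rounding.
    static BigDecimal sum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2);         // 0.30000000000000004
        System.out.println(sum("0.1", "0.2")); // 0.3
        System.out.println(sum("0.1", "0.2").compareTo(new BigDecimal("0.3")) == 0); // true

        // Constructing from a double carries the binary approximation along,
        // so the String constructor is the safer choice for decimal input.
        System.out.println(new BigDecimal(0.1).equals(new BigDecimal("0.1"))); // false
    }
}
```

compareTo is used for the equality check because BigDecimal.equals also compares scale, so 0.30 and 0.3 would not be equal under equals.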
