Floating-point in Java, as in any other modern programming language, is a complex subject. The versatility, power and complexity of floating-point numbers are cloaked in innocent-looking, simple digits. To harness the real power of floating-point numbers and to use them correctly in Java, it is essential to understand how they are represented internally in the JVM and how they differ from integers.

# What are Floating-Point Numbers?

A floating-point number is an IEEE 754 compliant data type.

If you need to represent a non-integer number in your program, you will typically use some kind of floating-point number. In Java, a floating-point number complies with the computation standard known as IEEE 754. This standard, first published in 1985, brought homogeneity to the previously diverse floating-point implementations. It has become so common that most modern processors have a dedicated hardware component called an FPU, or *floating-point unit*.

Among other things the IEEE 754 standard defines –

- Format of a number
- Operations allowed on the number
- Behavior in response to exceptions
- Special values such as *Not-a-Number*, *positive infinity* and *negative infinity*

## Format of Floating-Point Number

Understanding the format of a floating-point number is critical to understanding the behavior of these numbers.

Format of floating-point number specifies two main things –

- How to represent finite floating-point numbers
- Special values for positive and negative infinities and NaN

### Representing Finite Floating-Point Numbers Under IEEE 754

A floating-point number consists of

- Sign bit
- Fixed number of bits representing the exponent
- Fixed number of bits representing the fraction, a.k.a. *mantissa*, a.k.a. *significand*, a.k.a. *coefficient*

Under this scheme a finite floating-point number has the value

$$(-1)^{s} \times m \times b^{e-c}$$

where

- s is the sign bit, and therefore can be 0 or 1
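- m is the significand derived from the mantissa bits; for normal numbers it is 1 plus the stored fraction, because the leading 1 is implicit and not stored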
- b is the base, which is 2 in our case because we are dealing with binary numbers
- e is the unsigned integer represented by the exponent bits
- c is the exponent bias: one less than half the number of distinct values the exponent field can encode. For example, a single-precision floating-point number has 8 exponent bits, which can encode 2^8 = 256 distinct values (0 to 255). Therefore, c for single precision is 256/2 − 1 = 127. Similarly, c is 1023 for doubles, or double-precision floating-point numbers.
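To make the formula concrete, here is a minimal sketch that pulls the three fields out of a float and rebuilds its value from them. The class `FloatFields` and its method names are hypothetical helpers invented for this illustration; only `Float.floatToIntBits` and `Math.pow` come from the JDK. The sketch assumes a normal (finite, non-zero, non-subnormal) float.

```java
// Hypothetical helper class for illustration; not part of the JDK.
final class FloatFields {
    // Sign bit s: the top bit of the 32-bit pattern.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent field e: the next 8 bits, read as an unsigned integer.
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // The 23 stored mantissa bits.
    static int fractionBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Applies (-1)^s * m * 2^(e - 127); valid for normal floats only.
    static double rebuildNormal(float f) {
        double m = 1.0 + fractionBits(f) / (double) (1 << 23); // implicit leading 1
        double magnitude = m * Math.pow(2, biasedExponent(f) - 127);
        return sign(f) == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float f = -6.5f; // -1.625 * 2^2
        System.out.println("s = " + sign(f));
        System.out.println("e = " + biasedExponent(f));
        System.out.println("mantissa bits = 0x" + Integer.toHexString(fractionBits(f)));
        System.out.println("rebuilt = " + rebuildNormal(f));
    }
}
```

For -6.5f the exponent field holds 129, which is the true exponent 2 plus the bias c = 127.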

### Representing Special Values Under IEEE 754

As indicated earlier, IEEE 754 makes provisions for three special values: NaN, $+\infty$ and $-\infty$.

These special values have all their exponent bits set to 1. The mantissa bits then tell them apart: an all-zero mantissa indicates an infinity (the sign bit selects $+\infty$ or $-\infty$), while any non-zero mantissa indicates NaN.
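The bit patterns of the special values can be inspected directly. The following is a small sketch; `SpecialValueBits` and `bits` are hypothetical names made up for this example, and the output is grouped as sign, exponent and mantissa.

```java
// Hypothetical helper class for illustration; not part of the JDK.
final class SpecialValueBits {
    // Formats the 32 bits of a float as "sign exponent mantissa".
    static String bits(float f) {
        String raw = Integer.toBinaryString(Float.floatToIntBits(f));
        // toBinaryString drops leading zeros, so pad back to 32 characters.
        String padded = String.format("%32s", raw).replace(' ', '0');
        return padded.substring(0, 1) + " "
                + padded.substring(1, 9) + " "
                + padded.substring(9);
    }

    public static void main(String[] args) {
        System.out.println("+Infinity : " + bits(Float.POSITIVE_INFINITY));
        System.out.println("-Infinity : " + bits(Float.NEGATIVE_INFINITY));
        System.out.println("NaN       : " + bits(Float.NaN));
        // For contrast: the largest finite float does not have an all-ones exponent.
        System.out.println("MAX_VALUE : " + bits(Float.MAX_VALUE));
    }
}
```

Note that `Float.floatToIntBits` collapses every NaN to one canonical bit pattern, so this sketch shows only one of the many possible NaN encodings.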

# Floating-Point in Java

Java has a total of eight primitive data types. Two of these eight are floating-point data types: float and double. The very fact that 2 out of 8 primitive data types are floating-point numbers indicates their importance in computer science.

Note that for the purposes of this discussion we will ignore the reference data type equivalents Double and Float as they are just object based counterparts to primitive data types.

### Float

Floats are also referred to as single precision floating-point numbers. Floats have 8 exponent bits and 23 bits of mantissa or significand.

The following table summarizes the key pieces about float data type in Java –

| Property | Value |
| --- | --- |
| Size in bits | 32 |
| Exponent bits | 8 |
| Mantissa bits | 23 |
| Smallest positive value that can be stored as float in Java (Float.MIN_VALUE) | 1.401298464324817E-45f |
| Highest value that can be stored as float in Java (Float.MAX_VALUE) | 3.4028234663852886E+38f |

### Double

Doubles are also referred to as double precision floating-point numbers. Doubles have 11 exponent bits and 52 bits of significand.

Double is the default floating-point data type in Java. So, when you refer to the number 2.71, you are implicitly dealing with double.

The following table summarizes the key pieces about double data type in Java –

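| Property | Value |
| --- | --- |
| Size in bits | 64 |
| Exponent bits | 11 |
| Mantissa bits | 52 |
| Smallest positive value that can be stored as double in Java (Double.MIN_VALUE) | 4.9E-324 |
| Smallest positive normal value (Double.MIN_NORMAL) | 2.2250738585072014E-308 |
| Highest value that can be stored as double in Java (Double.MAX_VALUE) | 1.7976931348623157E308 |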
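The limits for both types are available as constants on the wrapper classes, so there is no need to memorize them. A minimal sketch (the class name `FloatingPointLimits` is just a placeholder for this example):

```java
// Placeholder class name; the constants themselves are standard JDK fields.
final class FloatingPointLimits {
    public static void main(String[] args) {
        System.out.println("float  size in bits      : " + Float.SIZE);
        System.out.println("float  smallest positive : " + Float.MIN_VALUE);
        System.out.println("float  highest           : " + Float.MAX_VALUE);

        System.out.println("double size in bits      : " + Double.SIZE);
        System.out.println("double smallest positive : " + Double.MIN_VALUE);
        System.out.println("double smallest normal   : " + Double.MIN_NORMAL);
        System.out.println("double highest           : " + Double.MAX_VALUE);

        // MIN_VALUE is the smallest POSITIVE value, not the most negative one.
        // The most negative finite double is -Double.MAX_VALUE.
        System.out.println("double most negative     : " + (-Double.MAX_VALUE));

        // A literal without a suffix is a double; a float literal needs 'f'.
        double d = 2.71;
        float f = 2.71f;
        System.out.println(d + " " + f);
    }
}
```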

# Floating-point Shockers: Can You Guess the Output Correctly?

Here are a few programs for you. If you can correctly guess the output of these programs, you are halfway there. If you can’t, don’t fret. We will discuss floating-point numbers in great detail in other sections.

## Program #1: Are all PIs equal?

Consider the program below –

```java
private static void unexpectedFloatPoint() {
    float f = 3.14f;
    double d = 3.14;

    if (f == d)
        System.out.println("A PI is a PI");
    else
        System.out.println("Not all PIs are equal");
}
```

What do you think will be the output of the above program?

Here is the output; explanations will follow later:

```
Not all PIs are equal
```
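One way to see what is going on is to print the exact values involved. In the comparison f == d, the float is widened to double, and the widened value is not the same number as the double literal 3.14. The sketch below is only an illustration; `WideningDemo` and its two helper methods are names invented here.

```java
import java.math.BigDecimal;

// Hypothetical demo class for illustration.
final class WideningDemo {
    // What the program above does: the float is widened to double first.
    static boolean equalAsDouble(float f, double d) {
        return f == d;
    }

    // Narrowing the double to float instead compares at float precision.
    static boolean equalAsFloat(float f, double d) {
        return f == (float) d;
    }

    public static void main(String[] args) {
        float f = 3.14f;
        double d = 3.14;

        // new BigDecimal(double) shows the exact binary value, digit for digit.
        System.out.println("3.14f stores: " + new BigDecimal(f));
        System.out.println("3.14  stores: " + new BigDecimal(d));

        System.out.println(equalAsDouble(f, d)); // false
        System.out.println(equalAsFloat(f, d));  // true
    }
}
```

Neither type can store 3.14 exactly; each holds the nearest value it can represent, and the two nearest values differ.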

## Program #2: Are these the same numbers?

Consider the second program below –

```java
private static void unexpectedFloatPoint2() {
    float f1 = 3.1499999f;
    float f2 = 3.14999991f;

    if (f1 == f2)
        System.out.println("What happened to that trailing 1?");
    else
        System.out.println("I know floating-point arithmetic");
}
```

What do you think will be the output of the above program?

Here is the output; explanations will follow later:

```
What happened to that trailing 1?
```

## Program #3: Are these the same numbers (2)?

Let’s consider a slight variation of the previous program by declaring the data type as double rather than float.

```java
private static void unexpectedFloatPoint3() {
    double d1 = 3.1499999;
    double d2 = 3.14999991;

    if (d1 == d2)
        System.out.println("What happened to that trailing 1?");
    else
        System.out.println("I know floating-point arithmetic");
}
```

What do you think would be the output now? As opposed to the previous program, this time the else part gets executed!

```
I know floating-point arithmetic
```
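The difference between Programs #2 and #3 comes down to the gap between adjacent representable values. `Math.ulp` reports that gap, and the two literals differ by only 0.00000001. A rough sketch (the class `UlpDemo` and the method `sameFloat` are made up for this example):

```java
// Hypothetical demo class for illustration.
final class UlpDemo {
    // True when both floats are stored with the identical bit pattern.
    static boolean sameFloat(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }

    public static void main(String[] args) {
        float f1 = 3.1499999f;
        float f2 = 3.14999991f;
        double d1 = 3.1499999;
        double d2 = 3.14999991;

        // Gap between neighbouring floats near 3.15: larger than 0.00000001,
        // so both literals round to the same float.
        System.out.println("float  gap: " + Math.ulp(f1));
        System.out.println("same float bits: " + sameFloat(f1, f2)); // true

        // Gap between neighbouring doubles near 3.15: far smaller than
        // 0.00000001, so the two literals stay distinct.
        System.out.println("double gap: " + Math.ulp(d1));
        System.out.println("same double: " + (d1 == d2)); // false
    }
}
```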

## Program #4: Am I dreaming? What happened to math?

Consider the program below. Your high school arithmetic is back with a twist –

```java
private static void unexpectedFloatPoint4() {
    double d1 = 3.1499999;

    if (d1 * 999999999 == (d1 * 1000000001 - 2 * d1))
        System.out.println("I confirm that 1000000001 - 2 = 999999999");
    else
        System.out.println("Am I dreaming?");

    double d2 = 3.149999993210931231293193919319392139129319391239013919310301391;

    if (d2 * 999999999 == (d2 * 1000000001 - 2 * d2))
        System.out.println("I confirm that 1000000001 - 2 = 999999999");
    else
        System.out.println("Am I dreaming? Suddenly, 1000000001 - 2 != 999999999");
}
```

What would be the output?

```
I confirm that 1000000001 - 2 = 999999999
Am I dreaming? Suddenly, 1000000001 - 2 != 999999999
```

## Program #5: Mixing Doubles in the already dicey situation

Consider the program below –

```java
private static void unexpectedFloatPoint5() {
    Double d1 = 3.149999993210931231293193919319392139129319391239013919310301391;
    Double d2 = 3.149999993210931231293193919319392139129319391239013919310301391;

    if (d1 == d2)
        System.out.println("You might expect this");
    else
        System.out.println("But be shocked to see this printed");
}
```

What do you think will be the output of the above program? (Hint: The above program gives a strange result for a totally different reason than the other programs in this section.)

```
But be shocked to see this printed
```
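The hint points at boxing: with two Double objects, == compares references rather than numeric values. Below is a minimal sketch of two ways to compare the values instead; `BoxedEquality` and its method names are hypothetical.

```java
// Hypothetical demo class for illustration.
final class BoxedEquality {
    // Compares the wrapped values via Double.equals.
    static boolean sameValue(Double a, Double b) {
        return a.equals(b);
    }

    // Unboxes explicitly, then uses the primitive == operator.
    static boolean sameValueUnboxed(Double a, Double b) {
        return a.doubleValue() == b.doubleValue();
    }

    public static void main(String[] args) {
        Double d1 = 3.15;
        Double d2 = 3.15;

        // Reference comparison: the two boxes are usually distinct objects.
        System.out.println("d1 == d2         : " + (d1 == d2));
        System.out.println("d1.equals(d2)    : " + sameValue(d1, d2));        // true
        System.out.println("unboxed ==       : " + sameValueUnboxed(d1, d2)); // true
    }
}
```

Keep in mind that `Double.equals` has its own rules for NaN and signed zeros, so it is not a drop-in replacement for primitive == in every case.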

## Program #6: More confusion with NaN

Let’s add to the chaos by bringing the following program, which uses NaN, into the mix.

```java
private static void unexpectedFloatPoint6() {
    double pi = Math.PI;
    double nan = Double.NaN;

    if (pi < nan)
        System.out.println("Pi is less than Nan.");
    else if (pi >= nan)
        System.out.println("Pi is greater than Nan");
    else
        System.out.println("Pinch me. It's neither greater than, nor equal nor less!!!");
}
```

And this is the output:

```
Pinch me. It's neither greater than, nor equal nor less!!!
```

## Program #7: Even more confusion with NaN

Finally, consider the program below –

```java
private static void unexpectedFloatPoint7() {
    float f1 = Float.NaN;
    float f2 = Float.NaN;

    if (f1 == f2)
        System.out.println("Nan should at least be Nan");
    else
        System.out.println("Nan is not Nan!!!");
}
```

Here is the output:

```
Nan is not Nan!!!
```
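Programs #6 and #7 behave this way because every ordered comparison involving NaN is false, and NaN is not equal to anything, itself included. The reliable way to detect it is `Double.isNaN` or `Float.isNaN`. A short sketch (the class `NanChecks` and the method `isUnordered` are names invented here):

```java
// Hypothetical demo class for illustration.
final class NanChecks {
    // True when the pair cannot be ordered because at least one side is NaN.
    static boolean isUnordered(double a, double b) {
        return Double.isNaN(a) || Double.isNaN(b);
    }

    public static void main(String[] args) {
        double nan = Double.NaN;

        System.out.println(nan == nan);        // false
        System.out.println(nan != nan);        // true
        System.out.println(Math.PI < nan);     // false
        System.out.println(Math.PI >= nan);    // false

        // The dedicated check works where == does not.
        System.out.println(Double.isNaN(nan)); // true
        System.out.println(isUnordered(Math.PI, nan)); // true
    }
}
```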

# Equality & Comparison of Floating-Point Numbers

All the above examples produce counter-intuitive results because equality and comparison of floating-point numbers are very different from equality and comparison of integers.

*One should NOT rely on the == operator for equality tests between floating-point numbers that have different precisions or that result from calculations.*

Floating-point numbers are inherently approximate, and float and double carry different amounts of precision. Therefore, the outcome of calculations and comparisons depends heavily on the number of mantissa bits in the underlying data type.

Since == is unreliable in these situations, I present a few methods for comparing two floating-point numbers in Java.

## Method #1: Constant Epsilon Method

This method for comparing two floating-point numbers is the most inaccurate of all methods presented here and must never be used in production code. Nevertheless, it builds up the intuition behind all other methods and is therefore a fundamental, must-know algorithm.

This fundamental algorithm acknowledges that floating-point arithmetic is imprecise and tries to correct for that using a pre-defined constant called ε (Greek letter Epsilon). ε is defined to be extremely small. So, if two numbers differ by a margin of less than ε, the algorithm deems them equal. ε is sometimes also referred to as error margin, tolerance margin or noise.

Here is an implementation of constant epsilon method –

```java
private static void constantEpsilon(double operand1, double operand2, double epsilon) {
    // Compare the absolute difference of the operands against epsilon
    if (Math.abs(operand1 - operand2) <= epsilon)
        System.out.println("Constant epsilon method deems these operands equal");
    else
        System.out.println("Constant epsilon method deems these operands unequal");

    // just for fun check
    if (operand1 == operand2)
        System.out.println("== operator deems these operands equal");
    else
        System.out.println("== operator deems these operands unequal");
}
```

When invoked with parameters

```java
double epsilon = 0.0000000000001;
double operand1 = 10.0001020303;
double operand2 = 10.00010203030001020303;
```

the output is

```
Constant epsilon method deems these operands equal
== operator deems these operands unequal
```

### Why should the constant epsilon method never be used in production code?

This method relies on a constant ε. Because ε is very small, the method works when the operands are much larger than ε.

But if one or both operands of the method above are themselves small enough to be of the order of ε, the method breaks down: every pair of such numbers is deemed equal. One possible fix is to reduce ε ‘sufficiently’. But that merely postpones the problem and does not solve it.

It would be much better if ε scaled with the magnitude of the operands.
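The breakdown is easy to reproduce with a boolean version of the constant ε check. This is only a sketch: `AbsoluteEpsilon.nearlyEqual` is a hypothetical helper, and the operand values are chosen purely for illustration.

```java
// Hypothetical helper class for illustration.
final class AbsoluteEpsilon {
    // Constant (absolute) epsilon test: equal when the difference is at most epsilon.
    static boolean nearlyEqual(double a, double b, double epsilon) {
        return Math.abs(a - b) <= epsilon;
    }

    public static void main(String[] args) {
        double epsilon = 1e-13;

        // Operands well above epsilon: the check behaves as intended.
        System.out.println(nearlyEqual(10.0001020303, 10.00010203030001020303, epsilon)); // true

        // Operands below epsilon: one is twice the other, yet they pass as equal.
        System.out.println(nearlyEqual(1e-15, 2e-15, epsilon)); // true

        // An observation beyond the text above: for very large operands the gap
        // between adjacent doubles exceeds epsilon, so even direct neighbours
        // fail and the check is no more tolerant than ==.
        double big = 1e20;
        System.out.println(nearlyEqual(big, Math.nextUp(big), epsilon)); // false
    }
}
```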

## Method #2: Relative or Scaled Epsilon Method

Doug Gwyn suggested a fix for the above absolute constant ε method. This method treats the error margin ε as fraction or percentage relative to the size of operands.

The implementation uses Java’s ternary (conditional) operator, so it helps to be familiar with it before reading on.

Here is the method that implements floating-point comparison based on a scaled or relative ε –

```java
private static void relativeEpsilon(double operand1, double operand2, double epsilon) {
    double bigger = Math.max(Math.abs(operand1), Math.abs(operand2));
    double result = bigger == 0.0 ? 0.0 : Math.abs(operand1 - operand2) / bigger;
    System.out.println(result);

    if (result <= epsilon)
        System.out.println("Operands are equal");
    else
        System.out.println("Rel. Epsilon method deems these operands not equal");

    // just for fun check
    if (operand1 == operand2)
        System.out.println("== operator deems these operands equal");
    else
        System.out.println("== operator deems these operands unequal");
}
```

When this method is invoked with the following arguments

```java
double epsilon = 0.0000000000001;
double operand1 = 10.0001020303;
double operand2 = 10.00010203030001020303;

relativeEpsilon(operand1, operand2, epsilon);
```

the output is

```
8.881693576815236E-16
Operands are equal
== operator deems these operands unequal
```

As you can see, there is a difference between what the equality operator returns and what our method returns. Due to the relative closeness of the doubles, scaled epsilon method treats them as equal.

So, how exactly is this method better than the constant ε method? The difference comes to light when these methods are invoked with operands which are of the order of ε as in the example below –

```java
relativeEpsilon(0.00004, 0.00005, 0.001);
constantEpsilon(0.00004, 0.00005, 0.001);
```

The relevant lines of the output of the two methods are:

```
Rel. Epsilon method deems these operands not equal
Constant epsilon method deems these operands equal
```

The relative ε method correctly finds the two operands to be not equal.
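In real code a comparison helper usually returns a boolean instead of printing. The sketch below is one possible shape, not the method from this article: the class `RelativeEpsilon`, the extra `absEps` floor for values near zero, and the NaN and infinity guards are all assumptions added for illustration.

```java
// Hypothetical helper class for illustration.
final class RelativeEpsilon {
    /**
     * Relative epsilon test with an absolute floor.
     * relEps: allowed difference as a fraction of the larger magnitude.
     * absEps: allowed difference for values close to zero (assumed extra parameter).
     */
    static boolean nearlyEqual(double a, double b, double relEps, double absEps) {
        if (a == b) {
            return true; // identical values, including equal infinities
        }
        if (Double.isNaN(a) || Double.isNaN(b)
                || Double.isInfinite(a) || Double.isInfinite(b)) {
            return false; // NaN equals nothing; infinity equals only itself
        }
        double diff = Math.abs(a - b);
        if (diff <= absEps) {
            return true; // close enough in absolute terms (near zero)
        }
        double bigger = Math.max(Math.abs(a), Math.abs(b));
        return diff <= relEps * bigger;
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(10.0001020303, 10.00010203030001020303, 1e-13, 0.0)); // true
        System.out.println(nearlyEqual(0.00004, 0.00005, 0.001, 0.0));                       // false
        System.out.println(nearlyEqual(Double.NaN, Double.NaN, 1e-9, 1e-12));                // false
    }
}
```

Suitable values for both tolerances depend on the calculation at hand, so treat the numbers above as placeholders.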

## Method #3: Using compare or compareTo method of Float or Double class

Java’s Float and Double classes provide two comparison methods: compare and compareTo. Unlike the epsilon methods above, they compare the exact stored values and apply no tolerance. It’s pretty straightforward to use these methods:

```java
private static void floatCompare(float f1, float f2) {
    Float floatObject1 = f1;
    Float floatObject2 = f2;

    int result = Float.compare(floatObject1, floatObject2);
    if (result == 0)
        System.out.println(floatObject1 + " is equal to " + floatObject2);
    else if (result > 0)
        System.out.println(floatObject1 + " is > to " + floatObject2);
    else
        System.out.println(floatObject1 + " is < to " + floatObject2);

    result = floatObject1.compareTo(floatObject2);
    if (result == 0)
        System.out.println(floatObject1 + " is equal to " + floatObject2);
    else if (result > 0)
        System.out.println(floatObject1 + " is > to " + floatObject2);
    else
        System.out.println(floatObject1 + " is < to " + floatObject2);
}
```

The output of the above method when invoked as follows

```java
floatCompare(10.1f, 10.0001020303f);
floatCompare(10.0001f, 10.0001020303f);
```

is

```
10.1 is > to 10.000102
10.1 is > to 10.000102
10.0001 is < to 10.000102
10.0001 is < to 10.000102
```
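It is worth knowing where compare disagrees with the == operator. The sketch below highlights three such cases; `CompareSemantics` and `exactlyEqual` are hypothetical names used only for this illustration.

```java
// Hypothetical demo class for illustration.
final class CompareSemantics {
    // Exact comparison via Double.compare: no tolerance is applied.
    static boolean exactlyEqual(double a, double b) {
        return Double.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // NaN: == says "not equal", compare says "equal".
        System.out.println(Float.NaN == Float.NaN);              // false
        System.out.println(Float.compare(Float.NaN, Float.NaN)); // 0

        // Signed zeros: == says "equal", compare orders 0.0f after -0.0f.
        System.out.println(0.0f == -0.0f);                       // true
        System.out.println(Float.compare(0.0f, -0.0f) > 0);      // true

        // Rounding error is not forgiven: compare is as strict as ==.
        System.out.println(exactlyEqual(0.1 + 0.2, 0.3));        // false
    }
}
```

In other words, compare and compareTo give a consistent ordering (useful for sorting), but they do not solve the tolerance problem that Methods #1 and #2 address.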

# Conclusion

In previous articles on CodingRaptor.com we examined various data types in Java. Floating-point data types deserve extra attention because of their idiosyncrasies. We therefore began by examining how floating-point numbers are encoded and then looked at the two floating-point data types in Java. We saw seven Java methods that yielded surprising results because of the way comparison of floating-point numbers works. These seven programs represent the most common pitfalls related to float and double in Java.

Due to the way floating-point numbers are represented internally, their precision is always limited. Floating-point arithmetic is bound to give imprecise results at some point and therefore should not be used in situations where precision is important. If precision is important, one must use Java’s BigDecimal class.
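As a quick illustration of the BigDecimal route, the sketch below adds 0.1 and 0.2 both ways. `ExactDecimal.sum` is a hypothetical helper; note that it builds each BigDecimal from a String, because constructing one from a double would carry the binary rounding error along.

```java
import java.math.BigDecimal;

// Hypothetical demo class for illustration.
final class ExactDecimal {
    // Adds two decimal numbers given as strings, with no binary rounding.
    static BigDecimal sum(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // double arithmetic: the result is close to 0.3 but not equal to it.
        System.out.println(0.1 + 0.2);          // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3);   // false

        // BigDecimal arithmetic: the decimal digits are kept exactly.
        BigDecimal total = sum("0.1", "0.2");
        System.out.println(total);              // 0.3
        System.out.println(total.compareTo(new BigDecimal("0.3")) == 0); // true
    }
}
```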
