Introduction

In computing, numbers come in two broad flavors: integers and real numbers. While integers are exact, real numbers—especially those involving fractions, irrational values, or scientific notation—require more sophisticated storage and arithmetic techniques. That’s where floating-point representation comes in.

Floating point allows computers to represent a wide range of real numbers, from incredibly small (like 0.000000001) to astronomically large (like 1.2 × 10²³), all while maintaining dynamic precision and compact memory usage.

But this flexibility comes with trade-offs: floating-point operations can be imprecise, unpredictable, and even misleading if not well understood.

What Is Floating Point?

Floating point is a method for representing real numbers in a scientific notation–like format within a binary system. Just like scientific notation in decimal (e.g., 6.022 × 10²³), floating-point represents a number using three components:

number = sign × mantissa × base^exponent

In binary floating-point (as defined by the IEEE 754 standard):

  • The base is always 2
  • The number is stored as:
    (-1)^sign × 1.mantissa × 2^(exponent - bias)

Components of a Floating Point Number

Let’s break down a 32-bit floating-point number (also called single precision):

1. Sign Bit (1 bit)

  • 0 = positive
  • 1 = negative

2. Exponent (8 bits)

  • Biased using a bias of 127
  • Stores exponent as an unsigned integer

3. Mantissa / Fraction (23 bits)

  • Represents significant digits
  • Includes only the fractional part; 1 is assumed (normalized numbers)

Example: Floating Point Format (IEEE 754 – 32 Bit)

Sign | Exponent (8 bits) | Fraction (23 bits)
  0  |     10000001      | 01000000000000000000000
  • Sign: 0 → positive
  • Exponent: 129 (binary 10000001) → actual exponent = 129 − 127 = 2
  • Fraction: 010000... → with the implicit leading 1, the significand is 1.01 in binary = 1.25
  • Final number: +1.25 × 2² = 5.0
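This decoding can be verified from Python itself. The sketch below (not part of the original example) uses the standard `struct` module to view the raw bits of 5.0 as a 32-bit float and pull out the three fields:

```python
import struct

# Pack 5.0 as a big-endian 32-bit float, then reinterpret the bytes as an integer
bits = struct.unpack('>I', struct.pack('>f', 5.0))[0]

sign = bits >> 31                # 1 bit
exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
fraction = bits & 0x7FFFFF       # 23 bits

print(f"{bits:032b}")            # 01000000101000000000000000000000
print(sign, exponent - 127)      # 0 2
print(1 + fraction / 2**23)      # 1.25
```

Multiplying the pieces back together, (-1)⁰ × 1.25 × 2² = 5.0, matching the walkthrough above.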

IEEE 754 Standard

The IEEE 754 standard defines formats for binary floating-point arithmetic.

Common Formats:

Name             | Bits | Exponent Bits | Fraction Bits | Bias | Approx Range
Half Precision   | 16   | 5             | 10            | 15   | ~±6.55 × 10⁴
Single Precision | 32   | 8             | 23            | 127  | ~±3.4 × 10³⁸
Double Precision | 64   | 11            | 52            | 1023 | ~±1.8 × 10³⁰⁸
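Python’s built-in `float` is an IEEE 754 double, so the limits in the double-precision row can be inspected directly via `sys.float_info` (a quick check, not a full treatment):

```python
import sys

# Python floats are IEEE 754 double precision (binary64)
print(sys.float_info.max)       # 1.7976931348623157e+308  (~±1.8 × 10³⁰⁸)
print(sys.float_info.min)       # 2.2250738585072014e-308  (smallest normal)
print(sys.float_info.mant_dig)  # 53  (52 fraction bits + implicit leading 1)
```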

Special Values

IEEE 754 floating-point can represent special values using reserved bit patterns:

Bit Pattern                            | Represents
All 0s in exponent + 0s in mantissa    | Zero
All 0s in exponent + non-zero mantissa | Subnormal numbers
All 1s in exponent + 0s in mantissa    | Infinity (+∞ or −∞)
All 1s in exponent + non-zero mantissa | NaN (Not a Number)
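All of these special values are observable from ordinary Python, as the sketch below shows. Note in particular that NaN compares unequal to everything, including itself:

```python
import math

inf = float('inf')     # all-1s exponent, zero mantissa
nan = float('nan')     # all-1s exponent, non-zero mantissa
tiny = 5e-324          # smallest positive subnormal double

print(inf > 1e308)     # True
print(nan == nan)      # False: NaN is unequal even to itself
print(math.isnan(nan)) # True: the correct way to detect NaN
print(tiny > 0.0)      # True, despite being below the normal range
```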

Floating Point Arithmetic

Floating-point addition, subtraction, multiplication, and division all round their results to the available precision and renormalize them back into the standard form.

Key behaviors:

  • Loss of precision (rounding errors)
  • Cancellation errors in subtraction of near-equal values
  • Overflow/underflow when results exceed representable range
  • Associativity is not guaranteed:
    (a + b) + c ≠ a + (b + c) in general
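The loss of associativity is easy to reproduce: because each intermediate sum is rounded, the grouping of operations changes the final result. A minimal demonstration:

```python
a, b, c = 0.1, 0.2, 0.3

# Each intermediate sum is rounded, so grouping changes the result
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False
```

This is why reordering a summation (e.g., by a compiler or a parallel reduction) can change numerical results.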

Common Issues

1. Precision Errors

print(0.1 + 0.2 == 0.3)  # Output: False

Explanation: Binary fractions can’t represent certain decimals exactly.
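The mismatch becomes visible if you print the exact value the literal `0.1` actually stores. Passing a float (rather than a string) to `Decimal` exposes its full binary expansion:

```python
from decimal import Decimal

# Decimal(float) converts the *exact* stored binary value, digit for digit
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

print(0.1 + 0.2)  # 0.30000000000000004
```

Neither 0.1 nor 0.2 is exactly representable in binary, and their rounded sum lands on a different double than the one nearest to 0.3.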

2. Rounding Modes

IEEE 754 supports different rounding strategies:

  • Round to nearest (default)
  • Round toward zero
  • Round toward positive/negative infinity
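CPython does not expose a way to switch the rounding mode of its binary floats, but the `decimal` module does, which makes the effect easy to illustrate. A sketch using a 3-digit context:

```python
from decimal import Decimal, getcontext, ROUND_DOWN, ROUND_CEILING

ctx = getcontext()
ctx.prec = 3                    # keep only 3 significant digits so rounding shows

ctx.rounding = ROUND_DOWN       # round toward zero
print(Decimal(1) / Decimal(3))  # 0.333

ctx.rounding = ROUND_CEILING    # round toward +infinity
print(Decimal(1) / Decimal(3))  # 0.334
```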

3. Subnormal Numbers

Very small numbers are stored without normalization (no implicit leading 1) to extend the range of representable values near zero, at the cost of reduced precision.
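Subnormals fill the gap between zero and the smallest normal double, as the sketch below shows; halving past the smallest subnormal finally underflows to zero:

```python
import sys

smallest_normal = sys.float_info.min  # 2.2250738585072014e-308
subnormal = smallest_normal / 2       # still nonzero, now stored subnormally

print(subnormal > 0.0)                # True: gradual underflow, not a jump to 0
print(subnormal < smallest_normal)    # True
print(5e-324 / 2)                     # 0.0: underflow past the smallest subnormal
```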

Decimal vs Binary Floating Point

Feature   | Decimal Floating Point          | Binary Floating Point
Base      | 10                              | 2
Used in   | Financial apps, databases       | Scientific computing, most systems
Advantage | Exact decimal representation    | Fast hardware implementation
Standard  | IEEE 754-2008 (decimal formats) | IEEE 754-1985/2008 (binary formats)

When to Avoid Floating Point

  • Currency or money: Use fixed-point or decimal representations
  • Exact comparisons: Avoid == for float equality
  • Sensitive control systems: Use integer logic when feasible

Techniques to Mitigate Float Issues

1. Use math.isclose() or tolerance checks

abs(a - b) < 1e-9
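A fixed absolute tolerance like the one above works for values near 1 but breaks down at large magnitudes; `math.isclose` defaults to a relative tolerance that scales with the operands. A sketch contrasting the two:

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)                            # False: raw equality fails
print(abs(a - b) < 1e-9)                 # True: absolute tolerance
print(math.isclose(a, b, rel_tol=1e-9))  # True: relative tolerance

# At magnitude 1e16 adjacent doubles are 2 apart, so 1e-9 is far too strict
big_a, big_b = 1e16, 1e16 + 2.0
print(abs(big_a - big_b) < 1e-9)         # False
print(math.isclose(big_a, big_b))        # True
```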

2. Use Decimal Types in Python

from decimal import Decimal
a = Decimal('0.1') + Decimal('0.2')
print(a == Decimal('0.3'))  # True

3. Use Double Precision when Needed

Prefer float64 (double precision) when accumulated rounding error matters: it carries roughly 15–16 significant decimal digits versus only about 7 for float32, at the cost of twice the memory.
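The precision gap is easy to see by round-tripping a value through a 32-bit float with the standard `struct` module; 16777217 = 2²⁴ + 1 needs 25 significand bits, one more than float32 provides:

```python
import struct

x = 16777217.0  # 2**24 + 1: exactly representable in float64, not in float32
as_float32 = struct.unpack('f', struct.pack('f', x))[0]

print(as_float32)       # 16777216.0: the trailing 1 was rounded away
print(x == as_float32)  # False; the value survives unchanged only in float64
```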

Real-World Applications

Domain               | Use Case
Scientific Computing | Simulation, astronomy, chemistry
Graphics             | 3D transformations, shaders
Machine Learning     | Gradient calculations (float32/16)
Finance              | Decimal floating point (exact cents)
Physics Engines      | Real-world simulation of forces

Floating Point in Hardware

Modern CPUs and GPUs include dedicated floating-point units (FPUs) and SIMD extensions:

  • Intel/AMD: SSE, AVX
  • ARM: NEON, SVE
  • GPUs: float16/float32/float64 support (NVIDIA, AMD)

Summary

Floating Point arithmetic enables computers to model real numbers and scientific calculations efficiently. But it comes with caveats: rounding errors, unexpected results, and representational limits. Understanding how it works at the binary level is key to writing numerically stable, bug-free, and efficient code — especially in scientific and high-performance computing domains.

Related Keywords

  • Binary Fraction
  • Denormal Number
  • Double Precision
  • Exponent Bias
  • Float Overflow
  • Floating Point Arithmetic
  • IEEE 754 Standard
  • Machine Epsilon
  • Mantissa
  • NaN
  • Rounding Error
  • Sign Bit