CMPT 295 - Unit - Data Representation

Lecture 7

Last Lecture

Today’s Menu

IEEE floating point representation (single precision)

How would 47.28 be encoded as an IEEE floating point number?

  1. Convert 47.28 to binary (using the positional notation R2B(X)) =>
    • 47 = $101111_{2}$
    • .28 = $.\overline{01000111101011100001}_{2}$
    • Also expressed as: .28 = $.01000111101011100001010001111010111000010100011110101110000101\ldots_{2}$
  2. Normalize binary number:
    • $101111.01000111101011100001010001111010111000010100011110101110000101\ldots_{2} \times 2^{0}$ becomes:
    • $1.011110100011110101110000101000111101011100001010001111010111000\ldots_{2} \times 2^{5}$
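The result of these steps can be checked against what the machine stores. The following is a minimal sketch, assuming `float` is a 32-bit IEEE single-precision value (true on most current platforms, but not guaranteed by the C standard). The names `ieee_fields` and `split_float` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single-precision number. */
typedef struct {
    uint32_t sign; /* 1 bit                  */
    uint32_t exp;  /* 8 bits, biased by 127  */
    uint32_t frac; /* 23 bits                */
} ieee_fields;

/* Assumption: float is a 32-bit IEEE single-precision value. */
ieee_fields split_float(float f) {
    uint32_t bits;
    ieee_fields out;
    memcpy(&bits, &f, sizeof bits);  /* copy the raw bytes, no numeric conversion */
    out.sign = bits >> 31;           /* top bit                                   */
    out.exp  = (bits >> 23) & 0xFFu; /* next 8 bits                               */
    out.frac = bits & 0x7FFFFFu;     /* low 23 bits                               */
    return out;
}
```

Under that assumption, `split_float(47.28f)` should report sign 0, exp 132 (that is, 5 + 127) and frac 0x3D1EB8, which is the first 23 bits after the leading 1 of the normalized number.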

A diagram of the above number not fitting inside the frac portion of a signed number due to its width.

Rounding

The choice between rounding up and rounding down is made by looking at the bit pattern around the rounding position.

Rounding (and error)

Example: rounding position -> round to nearest 1/4 (2 bits right of binary point)

| Value | Binary | Rounded | Action | Rounded Value |
|---|---|---|---|---|
| $2\frac{3}{32}$ | $10.00011_{2}$ | $10.00_{2}$ | less than 1/2: down | $2$ |
| $2\frac{3}{16}$ | $10.00110_{2}$ | $10.01_{2}$ | greater than 1/2: up | $2\frac{1}{4}$ |
| $2\frac{7}{8}$ | $10.11100_{2}$ | $11.00_{2}$ | exactly 1/2: up to even | $3$ |
| $2\frac{5}{8}$ | $10.10100_{2}$ | $10.10_{2}$ | exactly 1/2: down to even | $2\frac{1}{2}$ |
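The four cases in this example can be reproduced with integer arithmetic. The sketch below assumes each value is stored as an integer count of 1/32 (five bits right of the binary point), so rounding to the nearest 1/4 means dropping the three lowest bits. The function name `round_half_even` is hypothetical.

```c
#include <stdint.h>

/* Drop the `drop` lowest bits of x, rounding to nearest and breaking
   exact ties toward an even result. Assumes 0 <= drop < 32. */
uint32_t round_half_even(uint32_t x, unsigned drop) {
    if (drop == 0) {
        return x;                               /* nothing to round away     */
    }
    uint32_t kept = x >> drop;                  /* bits that survive         */
    uint32_t rem  = x & ((1u << drop) - 1u);    /* bits being discarded      */
    uint32_t half = 1u << (drop - 1);           /* pattern 100...0 = 1/2     */
    if (rem > half || (rem == half && (kept & 1u))) {
        kept += 1u;                             /* above half, or tie on odd */
    }
    return kept;
}
```

For instance, $2\frac{7}{8}$ is 92 in units of 1/32, and `round_half_even(92, 3)` gives 12 in units of 1/4, that is 3, matching the "up to even" case.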

Back to IEEE floating point representation

In the process of converting fractional decimal numbers to IEEE floating point numbers (i.e., bit patterns in fixed-size memory), we apply these same rounding rules …

Using the same numbers in our example:

Imagine that the 4th bit in the Binary column (the 2nd bit right of the binary point) is our 23rd bit of the frac => rounding position. The 5th bit is then the 24th bit, the first bit that does not fit in the frac.

| Value | Binary |
|---|---|
| $2\frac{3}{32}$ | $10.00011_{2}$ |
| $2\frac{3}{16}$ | $10.00110_{2}$ |
| $2\frac{7}{8}$ | $10.11100_{2}$ |
| $2\frac{5}{8}$ | $10.10100_{2}$ |
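The effect of rounding at the 23rd frac bit can be observed directly in C. This is a rough sketch, assuming `float` is IEEE single precision, `double` is IEEE double precision, and the default round-to-nearest-even mode is in effect. The function name `single_rounding_error` is hypothetical, and a `double` argument such as 47.28 is itself only an approximation of the decimal value, so the result is the error relative to that `double`.

```c
/* Error introduced when a double is rounded to single precision.
   Assumption: IEEE single and double precision, round to nearest even. */
double single_rounding_error(double x) {
    float rounded = (float)x;    /* keeps 1 implied bit + 23 frac bits   */
    return (double)rounded - x;  /* widening back to double is exact     */
}
```

For 47.28 the discarded bits start with 01, which is less than 1/2, so the value is rounded down and the error is negative and smaller in magnitude than $2^{-19}$ (half the spacing between neighbouring single-precision values near 47). A value such as 2.625 fits exactly, so its error is 0.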

Homework – Let’s practice converting and rounding!

How would 346.62 be encoded as an IEEE floating point number (single precision) in memory?

Also, can you compute the minimum value of the error introduced by the rounding process, given that 346.62 can only be approximated when encoded as an IEEE floating point number?

Denormalized values

Equations:

Example:

Smallest:

s exp frac
0 00000000 00000000000000000000001
$V = (-1)^{s} \, M \, 2^{E} = 0.00000000000000000000001_{2} \times 2^{-126} \approx 1.4 \times 10^{-45}$

Largest:

s exp frac
0 00000000 11111111111111111111111
$V = (-1)^{s} \, M \, 2^{E} = 0.11111111111111111111111_{2} \times 2^{-126} \approx 1.18 \times 10^{-38}$
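The two denormalized examples can be rebuilt from their bit patterns. The sketch below assumes `float` is a 32-bit IEEE single-precision value and that the platform does not flush denormalized values to zero. The helper names `float_from_bits` and `denorm_value` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a raw 32-bit pattern.
   Assumption: float is a 32-bit IEEE single-precision value. */
float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);  /* copy the raw bytes, no numeric conversion */
    return f;
}

/* Value of a pattern whose exp field is all zeros:
   V = (-1)^s * 0.frac * 2^(-126), computed in double so no bits are lost. */
double denorm_value(uint32_t bits) {
    double sign = (bits >> 31) ? -1.0 : 1.0;
    double M = (double)(bits & 0x7FFFFFu) * 0x1p-23; /* M = 0.frac       */
    return sign * M * 0x1p-126;                       /* E = 1 - 127      */
}
```

With these helpers, `float_from_bits(0x00000001u)` is the smallest positive denormalized value and `float_from_bits(0x007FFFFFu)` is the largest; the next pattern, `0x00800000u`, is the smallest normalized value.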

Special values

Condition: exp = 111…1
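As an illustration of how the exp field selects the kind of value, here is a minimal sketch of a classifier for single-precision bit patterns. It relies on the standard IEEE rule that, when exp is all ones, frac = 0 means infinity and any other frac means NaN. The type `fp_kind` and the function `classify_bits` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical labels for the kinds of single-precision values. */
typedef enum {
    KIND_DENORMALIZED, /* exp = 000...0 (includes zero) */
    KIND_NORMALIZED,   /* exp neither all 0s nor all 1s */
    KIND_INFINITY,     /* exp = 111...1, frac = 0       */
    KIND_NAN           /* exp = 111...1, frac != 0      */
} fp_kind;

/* Classify a raw 32-bit single-precision pattern by its exp and frac fields. */
fp_kind classify_bits(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;
    if (exp == 0xFFu) {
        return frac == 0 ? KIND_INFINITY : KIND_NAN;
    }
    if (exp == 0) {
        return KIND_DENORMALIZED;
    }
    return KIND_NORMALIZED;
}
```

The sign bit plays no role in the classification, so `0x7F800000` and `0xFF800000` both classify as infinity (positive and negative respectively).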

Axis of all floating point values

Starting with the lowest to the highest, what values can be produced:

What if floating point were represented with 8 bits?

Equations:

Annotation: To get a feel for all possible values expressible using an IEEE-like encoding, we use a small w. Here, instead of w = 32, we use w = 8. This way, we can enumerate all values.

| s | exp | frac | E | Value | Type | Notes |
|---|---|---|---|---|---|---|
| 0 | 0000 | 000 | -6 | $0$ | Denormalized | |
| 0 | 0000 | 001 | -6 | $\frac{1}{8}\times\frac{1}{64} = \frac{1}{512}$ | Denormalized | closest to zero |
| 0 | 0000 | 010 | -6 | $\frac{2}{8}\times\frac{1}{64} = \frac{2}{512}$ | Denormalized | |
| 0 | 0000 | 110 | -6 | $\frac{6}{8}\times\frac{1}{64} = \frac{6}{512}$ | Denormalized | |
| 0 | 0000 | 111 | -6 | $\frac{7}{8}\times\frac{1}{64} = \frac{7}{512}$ | Denormalized | largest denorm. |
| 0 | 0001 | 000 | -6 | $\frac{8}{8}\times\frac{1}{64} = \frac{8}{512}$ | Normalized | smallest norm. |
| 0 | 0001 | 001 | -6 | $\frac{9}{8}\times\frac{1}{64} = \frac{9}{512}$ | Normalized | |
| 0 | 0110 | 110 | -1 | $\frac{14}{8}\times\frac{1}{2} = \frac{14}{16}$ | Normalized | |
| 0 | 0110 | 111 | -1 | $\frac{15}{8}\times\frac{1}{2} = \frac{15}{16}$ | Normalized | closest to one below |
| 0 | 0111 | 000 | 0 | $\frac{8}{8}\times 1 = 1$ | Normalized | |
| 0 | 0111 | 001 | 0 | $\frac{9}{8}\times 1 = \frac{9}{8}$ | Normalized | closest to one above |
| 0 | 0111 | 010 | 0 | $\frac{10}{8}\times 1 = \frac{10}{8}$ | Normalized | |
| 0 | 1110 | 110 | 7 | $\frac{14}{8}\times 128 = 224$ | Normalized | |
| 0 | 1110 | 111 | 7 | $\frac{15}{8}\times 128 = 240$ | Normalized | largest norm. |
| 0 | 1111 | 000 | n/a | $\infty$ | Special | infinity (NaN when frac is not 000) |
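The rows of this table can be regenerated by decoding every 8-bit pattern. Below is a minimal sketch, assuming the layout used in the table (1 sign bit, 4 exp bits, 3 frac bits) and a bias of 7, which is what the E column implies. The function name `decode8` is hypothetical.

```c
#include <math.h>    /* INFINITY and NAN macros */
#include <stdint.h>

/* Decode an 8-bit IEEE-like pattern: 1 sign bit, 4 exp bits, 3 frac bits.
   Assumption: bias = 7, as implied by the E column of the table. */
double decode8(uint8_t bits) {
    int s    = (bits >> 7) & 0x1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double sign = s ? -1.0 : 1.0;
    int E;
    double M;

    if (exp == 0xF) {                 /* special values             */
        return frac == 0 ? sign * INFINITY : NAN;
    }
    if (exp == 0) {                   /* denormalized: M = 0.frac   */
        E = 1 - 7;
        M = frac / 8.0;
    } else {                          /* normalized: M = 1.frac     */
        E = exp - 7;
        M = 1.0 + frac / 8.0;
    }
    /* Scale by 2^E using exact multiplications and divisions by 2. */
    for (; E > 0; E--) M *= 2.0;
    for (; E < 0; E++) M /= 2.0;
    return sign * M;
}
```

Looping over all 256 patterns and printing `decode8` for each would enumerate every value this format can express; for example, the pattern 0 1110 111 (0x77) decodes to 240.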

Conversion in C

Demo - C code

Floating point arithmetic

Summary

Next Lecture