CMPT 295 - Unit - Data Representation

Lecture 7

Representing fractional numbers in memory
IEEE floating point representation, their arithmetic operations and float in C

Last Lecture

IEEE floating point representation
1. Normalized => exp ≠ 000…0 and exp ≠ 111…1
  - Single precision: bias = 127, exp: [1..254], E: [-126..127] => [10-38 … 1038]
  - Called “normalized” because binary numbers are normalized as part of the conversion process
  - Effect: “We get the leading bit for free” => leading bit is always assumed (never part of bit pattern) * Conversion: IEEE floating point number as encoding scheme
  - $\text{Fractional decimal number} \leftrightarrow \text{IEEE 754 (bit pattern)}$
    - $V = \text{(–1)}^{s} M 2^{E}$ ; s is sign bit, $M = 1 + \text{frac}$ , $E = \text{exp} – \text{bias}$ , $\text{bias} = 2^{k-1} – 1$ and k is width of exp
2. Denormalized
3. Special cases

Representing data in memory – Most of this is review
- “Under the Hood” - Von Neumann architecture
  - Bits and bytes in memory
  - How to diagram memory -> Used in this course and other references
  - How to represent series of bits -> In binary, in hexadecimal (conversion)
  - What kind of information (data) do series of bits represent -> Encoding scheme
  - Order of bytes in memory -> Endian
- Bit manipulation – bitwise operations
  - Boolean algebra + Shifting
Representing integral numbers in memory
- Unsigned and signed
- Converting, expanding and truncating
- Arithmetic operations
Representing real numbers in memory
- IEEE floating point representation
- Floating point in C – casting, rounding, addition, …

IEEE floating point representation (single precision)

How would 47.28 be encoded as IEEE floating point number?

Convert 47.28 to binary (using the positional notation R2B(X)) =>
- 47 = $\text{101111}_{2}$
- .28 = $.\overline{01000111101011100001}_{2}$
- Also expressed as: .28 = $\text{.01000111101011100001010001111010111000010100011110101110000101...}_{2}$
Normalize binary number:
- $\text{101111.01000111101011100001010001111010111000010100011110101110000101...}_{2} (\times 2^{0})$ becomes:
- $\text{1.011110100011110101110000101000111101011100001010001111010111000...}_{2} \times 2^{5}$

A diagram of the above number not fitting inside the frac portion of a signed number due to its width.

Rounding

This selection is done by looking at the bit pattern around the rounding position.

First, identify bit at rounding position
Then select which kind of rounding we must perform:
1. Round up
  - When the value of the bits to the right of the bit at rounding position is > half the worth of the bit at rounding position
  - We “round” up by adding 1 to the bit at rounding position
2. Round down
  - When the value of the bits to the right of the bit at rounding position is < half the worth of the bit at rounding position
  - We “round” down by discarding the bits to the right of the bit at rounding position
3. Round to “even number”
  - When the value of the bits to the right of the bit at rounding position is exactly half the worth of bit at rounding position, i.e., when these bits are 100…02
  - We “round” such that the bit at the rounding position becomes 0
  - If the bit at rounding position is 1 => then we “round to even number” by “rounding up” i.e., by adding 1
  - If the bit at rounding position is already 0 => then we “round to even number” by “rounding down” i.e., by discarding the bits to the right of the bit at rounding position

Rounding (and error)

Example: rounding position -> round to nearest 1/4 (2 bits right of binary point)

Value	Binary	Rounded	Action	Rounded Value
$2 \frac{3}{32}$	$\text{10.00011}_{2}$	$\text{10.00}_{2}$	(<1/2—down)	2
$2 \frac{3}{16}$	$\text{10.00110}_2$	$\text{10.01}_{2}$	(>1/2—up)	$2 \frac{1}{4}$
$2 \frac{7}{8}$	$\text{10.11100}_{2}$	$\text{11.00}_{2}$	(1/2—up to even)	3
$2 \frac{5}{8}$	$\text{10.10100}_{2}$	$\text{10.10}_{2}$	(1/2—down to even)	$2 \frac{1}{2}$

Explain value of a bit versus worth of a bit
value of a bit: ____
worth of a bit: ____

Back to IEEE floating point representation

In the process of converting fractional decimal numbers to IEEE floating point numbers (i.e., bit patterns in fixed-size memory), we apply these same rounding rules …

Using the same numbers in our example:

Imagine that the 4th bit in the binary column is our 23rd bit of the frac => rounding position. And the 5th bit is the 24th bit.

Value	Binary
$2 \frac{3}{32}$	$\text{10.00011}_{2}$
$2 \frac{3}{16}$	$\text{10.00110}_{2}$
$2 \frac{7}{8}$	$\text{10.11100}_{2}$
$2 \frac{5}{8}$	$\text{10.10100}_{2}$

Homework – Let’s practice converting and rounding!

How would 346.62 be encoded as IEEE floating point number (single precision) in memory?

Also, can you compute the minimum value of the error introduced by the rounding process since 346.62 can only be approximated when encoded as an IEEE floating point representation

Denormalized values

Equations:

$V = \text{(–1)}^{s} M 2^{E}$
$E = 1 – \text{bias}$
$\text{bias} = 2^{k-1} – 1$
$M = \text{frac}$

Example:

Condition: $\text{exp} = \text{00000000}$ (single precision)
Denormalized Values: $V = \text{(–1)}^{s} M 2^{E} = +/− 0.\text{frac} \times 2^{−126}$ $V=(–1)sM2E=+/−0.frac×2−126$
- Case 1: $\text{frac} = \text{000…0}$ -> +0 and –0
- Case 2: $\text{frac} \neq \text{000…0}$ -> numbers closest to 0.0 (equally spaced)

Smallest:

s	exp	frac
0	00000000	00000000000000000000001

V = \text{(–1)}^{s} M 2^{E} = \text{0.00000000000000000000001} \times 2^{-126} \approxeq 1.4 \times \text{10}^{-45}

Largest:

s	exp	frac
0	00000000	11111111111111111111111

V = \text{(–1)}^{s} M 2^{E} = \text{0.11111111111111111111111} \times 2^{-126} \approxeq 1.18 \times 10^{-38}

Special values

Condition: exp = 111…1

Case 1: frac = 000…0
- Represents value $\infty$ (infinity)
- Operation that overflows
- Both positive and negative
- E.g.:
  - $\frac{1.0}{0.0} = \frac{−1.0}{−0.0} = +\infty$
  - $1.0/−0.0 = −\infty$
Case 2: frac ≠ 000…0
- Not-a-Number (NaN)
- Represents case when no numeric value can be determined
- e.g.:
  - $\sqrt{–1}$
  - $\infty - \infty$
  - $\infty \times 0$
- NaN propagates other NaN: e.g., $\text{NaN} + x = \text{NaN}$

Axis of all floating point values

s = 1 bit
exp = k bits
1. if $\text{exp} \neq 0$ and $\text{exp} \neq \text{11...11}$ => Normalized
2. if $\text{exp} = \text{00...00}$ (all 0’s) => Denormalized
3. if $\text{exp} = \text{11...11}$ (all 1’s) => Special Cases n bits
frac = n bits

Starting with the lowest to the highest, what values can be produced:

NaN
$-\infty$
-Normalized
-Denormalized
-0
+0
+Denormalized
+Normalized
$+\infty$
NaN

What if floating point represented with 8 bits