CMPT 225 - 16 - Hash Tables
Bit Vector Review
- Suppose we want to store a set S ⊆ {0, 1, ..., d} for some d
- A bit vector representation of S is a Boolean array B of size d+1 s.t. B[i] = 1 if i ∈ S, and B[i] = 0 otherwise. E.g., d = 20, S = {3, 7, 9}, B =
0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
- Operations member(x), insert(x), remove(x) are all O(1).
- Only practical where d is small
- Space inefficient if |S| is much smaller than d
- Copy, Union, Intersection are all O(d)
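These operations can be sketched in Python (illustrative code, not from the slides):

```python
# Bit vector for subsets of {0, ..., d}: a Boolean array of size d + 1.
d = 20
B = [False] * (d + 1)

def insert(B, x):    # O(1)
    B[x] = True

def remove(B, x):    # O(1)
    B[x] = False

def member(B, x):    # O(1)
    return B[x]

# Union touches every cell, so it is O(d); intersection is analogous.
def union(B1, B2):
    return [a or b for a, b in zip(B1, B2)]

for x in {3, 7, 9}:
    insert(B, x)
# B now holds 1 exactly at indices 3, 7, 9
```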
Hash Functions
- A hash function for a set D is a function h : D → M where |M| < |D|, i.e., a map to a smaller set.
- E.g., h(x) = x mod 12
- There will be |D| possible inputs but only |M| possible hash values, so h cannot be injective
- Notation: Define h(S) = { h(x) : x ∈ S }
- E.g., h({3, 7, 15}) = {3, 7}
- If h(x) = h(y) for x ≠ y, we call it a collision (e.g., 3 and 15 collide under h(x) = x mod 12)
- We will want hash functions h s.t.
- ran h = [0, m-1] for some m (array indices)
- h tends to distribute S uniformly over [0, m-1]
- m will be prime
Hash Functions + Bit Vector
- Let h(x) = x mod 13, B a Boolean array of size m = 13.
- For a set S ⊆ D, set B[i] = 1 iff
- there is x ∈ S s.t. h(x) = i
- or equivalently: B[h(x)] = 1 for each x ∈ S
- E.g., S = {3, 7, 13, 15, 20}, h(S) = {3, 7, 0, 2, 7} = {0, 2, 3, 7}
- B =
1 0 1 1 0 0 0 1 0 0 0 0 0
- now:
{x : B[h(x)] = 1} = {0, 2, 3, 7, 13, 15, 16, 20, 26, ...}
- B[h(x)] = 1 "suggests" x ∈ S
- B[h(x)] = 0 "implies" x ∉ S
- i.e., there may be false positives but never false negatives
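The scheme can be sketched as follows, assuming h(x) = x mod 13 as in the example (illustrative Python):

```python
# One hash function + bit vector: a filter with false positives only.
m = 13
h = lambda x: x % m
B = [False] * m
S = {3, 7, 13, 15, 20}
for x in S:
    B[h(x)] = True

def possibly_member(x):
    return B[h(x)]   # True may be a false positive; False is definitive

# Never a false negative:
assert all(possibly_member(x) for x in S)
# False positive: h(16) = 3 = h(3), so 16 looks like a member.
assert possibly_member(16)
# B[1] = 0, so 1 is definitely not in S.
assert not possibly_member(1)
```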
Bloom Filters
- Let h_1, ..., h_k be a set of k distinct hash functions for a set D, each with range [0, m-1].
- For S ⊆ D, set: B[h_i(x)] = 1 for each x ∈ S and each i ∈ {1, ..., k}
- To test x for membership in S:
- if B[h_i(x)] = true for all i ∈ {1, ..., k}, return true
- o.w. return false
- We get a false positive only when h_i(x) is a collision for every i.
- B is a Bloom Filter for S
- If m is large enough relative to |S| and the h_i are good-quality, independent hash functions, then there will be few false positives.
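A minimal sketch, with illustrative values m = 100, k = 3 and made-up (not particularly good) hash functions:

```python
# Bloom filter sketch: k hash functions, one shared bit array.
m = 100
# Illustrative hash functions; real Bloom filters use better, independent hashes.
hs = [lambda x, a=a: (a * (x + 7)) % m for a in (31, 37, 41)]

B = [False] * m

def bf_insert(x):
    for h in hs:
        B[h(x)] = True

def bf_member(x):   # True = "probably in S", False = "definitely not in S"
    return all(B[h(x)] for h in hs)

S = {3, 7, 13, 15, 20}
for x in S:
    bf_insert(x)
assert all(bf_member(x) for x in S)   # no false negatives, by construction
```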
Hash Tables
- Let h be a hash function for D with range M = [0, m-1]
- Let A be an array of size m and type D ∪ {_}, where _ denotes "empty".
- e.g., h(x) = x mod 13
- For a set S ⊆ D, we want:
- A[h(x)] = x, for each x ∈ S
- A[i] = _ if h(x) ≠ i for every x ∈ S
- E.g.: S = {2, 12, 17, 21}, m = 13, h(x) = x mod 13, and h(S) = {2, 12, 4, 8}
- A =
_ _ 2 _ 17 _ _ _ 21 _ _ _ 12
- To check x for membership in S, check whether A[h(x)] = x.
- A is a hash table for S
- But what if we have collisions?
- Need collision handling. We will look at a few methods.
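The collision-free case above can be sketched as follows (Python's None plays the role of the empty marker _):

```python
# Ideal hash table: no collisions among the stored keys.
m = 13
h = lambda x: x % m
A = [None] * m               # None = "_" (empty)

S = {2, 12, 17, 21}          # h(S) = {2, 12, 4, 8}: all distinct
for x in S:
    A[h(x)] = x

def member(x):
    return A[h(x)] == x

# h(4) = 4, but A[4] holds 17, so 4 is correctly rejected.
assert member(17) and not member(4)
```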
Hashing with Separate Chaining
- Let A be a size m array of linked lists
- Set A[i] to be a list of the elements x ∈ S with h(x) = i
- To test x for membership in S:
- Return true iff x is in the list A[h(x)]
- E.g., m = 11, h(x) = x mod 11, S = {0, 1, 5, 7, 12, 20}
- A = ("->" indicates a linked list connection)
- A[0]: 0
- A[1]: 1 -> 12
- A[2]:
- A[3]:
- A[4]:
- A[5]: 5
- A[6]:
- A[7]: 7
- A[8]:
- A[9]: 20
- A[10]:
- To insert/remove x: insert/remove from A[h(x)].
- If h distributes S almost uniformly over M, the lists will be small, and time will be essentially O(1).
- In the worst case, some lists have length Θ(n) and performance degrades to that of a linked list, O(n).
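A sketch of separate chaining, using Python lists to stand in for linked lists (m = 11 and the sample keys are illustrative):

```python
# Separate chaining: each table cell holds a list of colliding keys.
m = 11
h = lambda x: x % m
A = [[] for _ in range(m)]

def insert(x):
    if x not in A[h(x)]:
        A[h(x)].append(x)

def member(x):
    return x in A[h(x)]

def remove(x):
    if x in A[h(x)]:
        A[h(x)].remove(x)

for x in (0, 1, 5, 7, 12, 20):
    insert(x)
assert A[1] == [1, 12]   # 1 and 12 collide: 12 mod 11 = 1
remove(12)
assert not member(12)
```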
Hashing with Probing (Open Addressing)
- Let A be an array of size m and type D ∪ {_}, h a hash function with range [0, m-1]
- Let f be a function f : N → N that has f(0) = 0 and is monotone increasing (e.g., f(i) = i)
- Define, for i ≥ 0: h_i(x) = (h(x) + f(i)) mod m
- To resolve collisions, probe the sequence of cells: h_0(x), h_1(x), h_2(x), ...
Hashing with Probing (Open Addressing)
- To check x for membership:
- Examine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- Stop at the first location containing x or _ (empty)
- return true if x was found, false otherwise.
- To insert x:
- Examine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- Stop at the first location containing _ and store x there
- Choice of f determines the properties of the scheme.
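The probe sequence h_i(x) = (h(x) + f(i)) mod m can be computed generically (an illustrative sketch; the probe functions below correspond to the schemes that follow):

```python
# Generic open addressing probe sequence: h_i(x) = (h(x) + f(i)) mod m.
m = 13
h = lambda x: x % m

def probe_sequence(x, f, steps):
    return [(h(x) + f(i)) % m for i in range(steps)]

linear    = lambda i: i        # linear probing: f(i) = i
quadratic = lambda i: i * i    # quadratic probing: f(i) = i^2

assert probe_sequence(5, linear, 4)    == [5, 6, 7, 8]
assert probe_sequence(9, quadratic, 4) == [9, 10, 0, 5]
```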
Hashing with Linear Probing
- Let f(i) = i
- The sequence of locations to probe is:
- A[h(x)], A[h(x)+1], A[h(x)+2], A[h(x)+3], … (+ is mod m)
- Ex: Suppose m = 13, h(x) = x mod 13, S = {2, 9, 18, 36} (so h(S) = {2, 5, 9, 10}) and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 5:
- compute h(5) = 5
- see that A[5] = 18 ≠ _
- see that A[6] = _, so set A[6] = 5
- Now A =
_ _ 2 _ _ 18 5 _ _ 9 36 _ _
- To check if 5 ∈ S:
- compute h(5) = 5
- see that A[5] = 18 ≠ 5
- see that A[6] = 5 and return true
- To check if 31 ∈ S:
- compute h(31) = 5
- see that A[5] = 18 ≠ 31
- see that A[6] = 5 ≠ 31
- see that A[7] = _ and return false
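The linear probing walkthrough can be reproduced in code (a sketch; no handling for a full table):

```python
# Linear probing: probe h(x), h(x)+1, h(x)+2, ... (mod m).
m = 13
h = lambda x: x % m
A = [None] * m                    # None = "_" (empty)

def insert(x):
    i = h(x)
    while A[i] is not None:       # assumes the table is not full
        i = (i + 1) % m
    A[i] = x

def member(x):
    i = h(x)
    while A[i] is not None:
        if A[i] == x:
            return True
        i = (i + 1) % m
    return False                  # hit an empty cell: x cannot be in S

for x in (2, 9, 18, 36, 5):
    insert(x)
assert A[5] == 18 and A[6] == 5   # 5 collided with 18 and moved one cell right
assert member(5) and not member(31)
```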
Hashing with Quadratic Probing
- Let f(i) = i^2
- The sequence of locations to probe is: h(x), h(x)+1, h(x)+4, h(x)+9, ... (+ is mod m).
- Ex: Suppose m = 13, h(x) = x mod 13, S = {2, 9, 18, 36} (so h(S) = {2, 5, 9, 10}) and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 35:
- compute h(35) = 9
- see that A[9] = 9 ≠ _
- see that A[(9+1) mod 13] = A[10] = 36 ≠ _
- see that A[(9+4) mod 13] = A[0] = _ and store 35 there
- Now A is:
35 _ 2 _ _ 18 _ _ _ 9 36 _ _
- To check if 35 ∈ S:
- compute h(35) = 9
- see that A[9] = 9 ≠ 35
- see that A[10] = 36 ≠ 35
- see that A[0] = 35 and return true
- To check if 22 ∈ S:
- compute h(22) = 9
- see that A[9], A[10], A[0], A[5] are not 22 or _
- see that A[(9+16) mod 13] = A[12] = _ and return false
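A quadratic probing sketch; note that with f(i) = i^2 the third probe for h(35) = 9 lands at (9 + 4) mod 13 = 0:

```python
# Quadratic probing: probe h(x), h(x)+1, h(x)+4, h(x)+9, ... (mod m).
m = 13
h = lambda x: x % m
A = [None] * m

def insert(x):
    for i in range(m):
        j = (h(x) + i * i) % m
        if A[j] is None:
            A[j] = x
            return True
    return False      # quadratic probing can fail even when free cells remain

def member(x):
    for i in range(m):
        j = (h(x) + i * i) % m
        if A[j] is None:
            return False
        if A[j] == x:
            return True
    return False

for x in (2, 9, 18, 36, 35):
    insert(x)
assert A[0] == 35     # probes A[9], A[10], then A[(9+4) mod 13] = A[0]
assert member(35) and not member(22)
```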
Double Hashing
- Let f(i) = i · h_2(x), where h_2 is a hash function for D that is different from h, and with h_2(x) ≠ 0 for every x
- The sequence of locations to probe is: h(x), h(x) + h_2(x), h(x) + 2·h_2(x), ... (+ is mod m)
- Ex: Suppose m = 13, h(x) = x mod 13, h_2(x) = 7 − (x mod 7), so h_2(15) = 6, and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 15:
- compute h(15) = 2
- see that A[2] = 2 ≠ _
- compute h_2(15) = 6
- see that A[(2+6) mod 13] = A[8] = _ and store 15 there
- Now A is:
_ _ 2 _ _ 18 _ _ 15 9 36 _ _
- To check if 15 ∈ S, check A[2], then A[8], and return true
- To check if 10 ∈ S:
- compute h(10) = 10
- see that A[10] = 36 ≠ 10
- compute h_2(10) = 4
- see that A[(10+4) mod 13] = A[1] = _ and return false.
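A double hashing sketch, assuming h_2(x) = 7 − (x mod 7) (a choice consistent with 15 landing in cell 8; the slide's exact h_2 was not preserved):

```python
# Double hashing: step size is h2(x), a second hash that is never 0.
m = 13
h  = lambda x: x % m
h2 = lambda x: 7 - (x % 7)    # in (1..7), so never 0

A = [None] * m

def insert(x):
    i, step = h(x), h2(x)
    while A[i] is not None:   # assumes the table is not full
        i = (i + step) % m
    A[i] = x

def member(x):
    i, step = h(x), h2(x)
    while A[i] is not None:
        if A[i] == x:
            return True
        i = (i + step) % m
    return False

for x in (2, 9, 18, 36, 15):
    insert(x)
assert A[8] == 15             # h(15) = 2 is taken, so probe (2 + 6) mod 13 = 8
```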
Removal with Open Addressing
- Suppose we have a hash table H for a set S containing x, and want to remove x.
- If H uses separate chaining, we just delete x from the list A[h(x)]
- If H uses open addressing, we cannot simply empty the cell, because x affects the probe sequence for other elements
- Ex: Suppose m = 13, h(x) = x mod 13, and A was obtained as in our Linear Probing example. A:
_ _ 2 _ _ 18 5 _ _ 9 36 _ _
- Suppose we now delete 18, so A:
_ _ 2 _ _ _ 5 _ _ 9 36 _ _
- Now, searching for 5 fails because A[h(5)] = A[5] = _!
- One solution is to mark cells where we have deleted elements.
Removal with Open Addressing
- Ex: In the previous example, to remove 18 we replace it with d; A:
_ _ 2 _ _ d 5 _ _ 9 36 _ _
- Now search & insert procedures perform as if A[5] held some key that we will never search for.
- To remove x:
- determine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- when x is found, replace it with d
- Notice that search & insert work correctly as they are
- Insert can be modified to reclaim space:
- To insert x:
- Examine the sequence of probe locations
- stop at the first one containing _ or d and store x there
- NB: In implementation, d and _ could be special values, or A could be an array of objects or structs with "empty" and "deleted" variables/fields.
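A sketch of open-addressing removal with tombstones (linear probing; EMPTY and DELETED play the roles of _ and d):

```python
# Tombstone deletion: DELETED cells stop insertion but not search.
m = 13
h = lambda x: x % m
EMPTY, DELETED = object(), object()   # sentinel values for "_" and "d"
A = [EMPTY] * m

def insert(x):
    i = h(x)
    while A[i] is not EMPTY and A[i] is not DELETED:
        i = (i + 1) % m               # linear probing
    A[i] = x                          # reclaims tombstone cells

def member(x):
    i = h(x)
    while A[i] is not EMPTY:          # DELETED cells do NOT stop the search
        if A[i] == x:
            return True
        i = (i + 1) % m
    return False

def remove(x):
    i = h(x)
    while A[i] is not EMPTY:
        if A[i] == x:
            A[i] = DELETED            # tombstone, not EMPTY
            return
        i = (i + 1) % m

for x in (2, 9, 18, 36, 5):
    insert(x)
remove(18)
assert member(5)   # still found: the tombstone keeps the probe chain intact
```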
Load Factor
- The "load factor" of a hash table H is: λ = n/m, where n is the number of stored keys and m is the table size.
- Good performance requires λ not too large.
- For separate chaining: λ should not be much larger than 1, as the average list length is λ.
- For open addressing, want λ ≤ 1/2, so that it is not too hard to find a place to make an insertion.
Some Properties with Open Addressing
- Linear Probing
- Insertion always successful if λ < 1
- Primary clustering is a serious problem.
- Quadratic Probing
- Avoids primary clustering
- Exhibits secondary clustering, but this is less problematic
- Insertion always succeeds if λ ≤ 1/2, but may fail if λ > 1/2 (even if there is space).
- Double hashing
- Requires design of a second suitable hash function
- Requires computing 2 hash functions whenever probing beyond A[h(x)] is needed.
Rehashing
- Rehashing hash table H means constructing a completely new hash table for the contents of H.
- We may want to do it if:
- λ is too large (close to 0.5 for open addressing, much larger than 1 for separate chaining)
- Performance has become poor (which may result from clustering, from long linked lists, or from many removals)
- Takes Θ(n) time, under the assumption that insert is O(1).
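A rehashing sketch for open addressing with linear probing; the growth rule m → 2m + 1 is illustrative (in practice one grows to roughly the next prime):

```python
# Rehash when the load factor n/m would exceed 1/2.
m = 7
h = lambda x, m: x % m
A = [None] * m
n = 0   # number of stored keys

def insert(x):
    global n
    if (n + 1) / m > 0.5:              # keep lambda = n/m at most 1/2
        rehash()
    i = h(x, m)
    while A[i] is not None:
        i = (i + 1) % m
    A[i] = x
    n += 1

def rehash():                          # Theta(n): reinsert every key into a bigger table
    global A, m
    old = [x for x in A if x is not None]
    m = 2 * m + 1                      # illustrative; real tables pick a prime near 2m
    A = [None] * m
    for x in old:
        i = h(x, m)
        while A[i] is not None:
            i = (i + 1) % m
        A[i] = x

for x in (2, 9, 18, 36, 5):
    insert(x)                          # the 4th insert triggers a rehash: m becomes 15
```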
Hashing Properties
- Well-designed hash tables are effective in practice, with fast insert, member, remove operations.
- Require a good hash function for the domain of application.
- Operations O(1) on average, under assumptions that may not hold in practice:
- all keys equally likely
- hash function distributes keys uniformly
- λ is small
- Do not support operations based on order of keys, such as:
- enumerate in order
- min, max, range lookups
- union, intersection
- (These are efficient with AVL Trees & B-Trees).