CMPT 225 - 16 - Hash Tables
Bit Vector Review
- Suppose we want to store a set S ⊆ {0, 1, ..., d} for some d
- A bit vector representation of S is a Boolean array B of size d+1 s.t. B[i] = 1 if i ∈ S, and B[i] = 0 otherwise. E.g., d = 20, S = {3, 7, 9}, B =
0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
- Operations member(x), insert(x), remove(x) are all O(1).
- Only practical where d is small
- Space inefficient if |S| is much smaller than d
- Copy, Union, Intersection are all O(d)
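These operations can be sketched in Python (illustrative code, not from the slides):

```python
# Bit vector for subsets of {0, ..., d}: a Boolean array of size d + 1.
d = 20
B = [False] * (d + 1)

def insert(B, x):    # O(1)
    B[x] = True

def remove(B, x):    # O(1)
    B[x] = False

def member(B, x):    # O(1)
    return B[x]

# Union touches every cell, so it is O(d); intersection is analogous.
def union(B1, B2):
    return [a or b for a, b in zip(B1, B2)]

for x in {3, 7, 9}:
    insert(B, x)
# B now holds 1 exactly at indices 3, 7, 9
```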
Hash Functions
- A hash function for a set D is a function h : D → M where |M| < |D|, i.e., a map to a smaller set.
- E.g., h(x) = x mod 12
- There will be |D| possible inputs but only |M| possible hash values, so h cannot be injective
- Notation: Define h(S) = { h(x) : x ∈ S }
- E.g., h({3, 7, 15}) = {3, 7}
- If h(x) = h(y) for x ≠ y, we call it a collision (e.g., 3 and 15 collide under h(x) = x mod 12)
- We will want hash functions h s.t.
- ran h = [0, m-1] for some m (array indices)
- h tends to distribute S uniformly over [0, m-1]
- m will be prime
Hash Functions + Bit Vector
- Let h(x) = x mod 13, B a Boolean array of size m = 13.
- For a set S ⊆ D, set B[i] = 1 iff
- there is x ∈ S s.t. h(x) = i
- or equivalently: B[h(x)] = 1 for each x ∈ S
- E.g., S = {3, 7, 13, 15, 20}, h(S) = {3, 7, 0, 2, 7} = {0, 2, 3, 7}
- B =
1 0 1 1 0 0 0 1 0 0 0 0 0
- now:
{x : B[h(x)] = 1} = {0, 2, 3, 7, 13, 15, 16, 20, 26, ...}
- B[h(x)] = 1 "suggests" x ∈ S
- B[h(x)] = 0 "implies" x ∉ S
- i.e., there may be false positives but never false negatives
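The scheme can be sketched as follows, assuming h(x) = x mod 13 as in the example (illustrative Python):

```python
# One hash function + bit vector: a filter with false positives only.
m = 13
h = lambda x: x % m
B = [False] * m
S = {3, 7, 13, 15, 20}
for x in S:
    B[h(x)] = True

def possibly_member(x):
    return B[h(x)]   # True may be a false positive; False is definitive

# Never a false negative:
assert all(possibly_member(x) for x in S)
# False positive: h(16) = 3 = h(3), so 16 looks like a member.
assert possibly_member(16)
# B[1] = 0, so 1 is definitely not in S.
assert not possibly_member(1)
```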
Bloom Filters
- Let h_1, ..., h_k be a set of k distinct hash functions for a set D, each with range [0, m-1].
- For S ⊆ D, set: B[h_i(x)] = 1 for each x ∈ S and each i ∈ {1, ..., k}
- To test x for membership in S:
- if B[h_i(x)] = true for all i ∈ {1, ..., k}, return true
- o.w. return false
- We get a false positive only when h_i(x) is a collision for every i.
- B is a Bloom Filter for S
- If m is large enough relative to |S| and the h_i are good-quality, independent hash functions, then there will be few false positives.
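A minimal sketch, with illustrative values m = 100, k = 3 and made-up (not particularly good) hash functions:

```python
# Bloom filter sketch: k hash functions, one shared bit array.
m = 100
# Illustrative hash functions; real Bloom filters use better, independent hashes.
hs = [lambda x, a=a: (a * (x + 7)) % m for a in (31, 37, 41)]

B = [False] * m

def bf_insert(x):
    for h in hs:
        B[h(x)] = True

def bf_member(x):   # True = "probably in S", False = "definitely not in S"
    return all(B[h(x)] for h in hs)

S = {3, 7, 13, 15, 20}
for x in S:
    bf_insert(x)
assert all(bf_member(x) for x in S)   # no false negatives, by construction
```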
Hash Tables
- Let h be a hash function for D with range M = [0, m-1]
- Let A be an array of size m and type D ∪ {_}, where _ denotes "empty".
- e.g., h(x) = x mod 13
- For a set S ⊆ D, we want:
- A[h(x)] = x, for each x ∈ S
- A[i] = _ if h(x) ≠ i for every x ∈ S
- E.g.: S = {2, 12, 17, 21}, m = 13, h(x) = x mod 13, and h(S) = {2, 12, 4, 8}
- A =
_ _ 2 _ 17 _ _ _ 21 _ _ _ 12
- To check x for membership in S, check whether A[h(x)] = x.
- A is a hash table for S
- But what if we have collisions?
- Need collision handling. We will look at a few methods.
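The collision-free case above can be sketched as follows (Python's None plays the role of the empty marker _):

```python
# Ideal hash table: no collisions among the stored keys.
m = 13
h = lambda x: x % m
A = [None] * m               # None = "_" (empty)

S = {2, 12, 17, 21}          # h(S) = {2, 12, 4, 8}: all distinct
for x in S:
    A[h(x)] = x

def member(x):
    return A[h(x)] == x

# h(4) = 4, but A[4] holds 17, so 4 is correctly rejected.
assert member(17) and not member(4)
```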
Hashing with Separate Chaining
- Let A be a size m array of linked lists
- Set A[i] to be a list of the elements x ∈ S with h(x) = i
- To test x for membership in S:
- Return true iff x is in the list A[h(x)]
- E.g., m = 11, h(x) = x mod 11, S = {0, 1, 5, 7, 12, 20}
- A = ("->" indicates a linked list connection)
- A[0]: 0
- A[1]: 1 -> 12
- A[2]:
- A[3]:
- A[4]:
- A[5]: 5
- A[6]:
- A[7]: 7
- A[8]:
- A[9]: 20
- A[10]:
- To insert/remove x: insert/remove from A[h(x)].
- If h distributes S almost uniformly over M, the lists will be small, and time will be essentially O(1).
- In the worst case, some lists have length Θ(n) and performance degrades to that of a linked list, O(n).
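A sketch of separate chaining, using Python lists to stand in for linked lists (m = 11 and the sample keys are illustrative):

```python
# Separate chaining: each table cell holds a list of colliding keys.
m = 11
h = lambda x: x % m
A = [[] for _ in range(m)]

def insert(x):
    if x not in A[h(x)]:
        A[h(x)].append(x)

def member(x):
    return x in A[h(x)]

def remove(x):
    if x in A[h(x)]:
        A[h(x)].remove(x)

for x in (0, 1, 5, 7, 12, 20):
    insert(x)
assert A[1] == [1, 12]   # 1 and 12 collide: 12 mod 11 = 1
remove(12)
assert not member(12)
```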
Hashing with Probing (Open Addressing)
- Let A be an array of size m and type D ∪ {_}, h a hash function with range [0, m-1]
- Let f be a function f : N → N that has f(0) = 0 and is monotone increasing (e.g., f(i) = i)
- Define, for i ≥ 0: h_i(x) = (h(x) + f(i)) mod m
- To resolve collisions, probe the sequence of cells: h_0(x), h_1(x), h_2(x), ...
Hashing with Probing (Open Addressing)
- To check x for membership:
- Examine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- Stop at the first location containing x or _ (empty)
- return true if x was found, false otherwise.
- To insert x:
- Examine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- Stop at the first location containing _ and store x there
- Choice of f determines the properties of the scheme.
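The probe sequence h_i(x) = (h(x) + f(i)) mod m can be computed generically (an illustrative sketch; the probe functions below correspond to the schemes that follow):

```python
# Generic open addressing probe sequence: h_i(x) = (h(x) + f(i)) mod m.
m = 13
h = lambda x: x % m

def probe_sequence(x, f, steps):
    return [(h(x) + f(i)) % m for i in range(steps)]

linear    = lambda i: i        # linear probing: f(i) = i
quadratic = lambda i: i * i    # quadratic probing: f(i) = i^2

assert probe_sequence(5, linear, 4)    == [5, 6, 7, 8]
assert probe_sequence(9, quadratic, 4) == [9, 10, 0, 5]
```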
Hashing with Linear Probing
- Let f(i) = i
- The sequence of locations to probe is:
- A[h(x)], A[h(x)+1], A[h(x)+2], A[h(x)+3], … (+ is mod m)
- Ex: Suppose m = 13, h(x) = x mod 13, S = {2, 9, 18, 36} (so h(S) = {2, 5, 9, 10}) and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 5:
- compute h(5) = 5
- see that A[5] = 18 ≠ _
- see that A[6] = _, so set A[6] = 5
- Now A =
_ _ 2 _ _ 18 5 _ _ 9 36 _ _
- To check if 5 ∈ S:
- compute h(5) = 5
- see that A[5] = 18 ≠ 5
- see that A[6] = 5 and return true
- To check if 31 ∈ S:
- compute h(31) = 5
- see that A[5] = 18 ≠ 31
- see that A[6] = 5 ≠ 31
- see that A[7] = _ and return false
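The linear probing walkthrough can be reproduced in code (a sketch; no handling for a full table):

```python
# Linear probing: probe h(x), h(x)+1, h(x)+2, ... (mod m).
m = 13
h = lambda x: x % m
A = [None] * m                    # None = "_" (empty)

def insert(x):
    i = h(x)
    while A[i] is not None:       # assumes the table is not full
        i = (i + 1) % m
    A[i] = x

def member(x):
    i = h(x)
    while A[i] is not None:
        if A[i] == x:
            return True
        i = (i + 1) % m
    return False                  # hit an empty cell: x cannot be in S

for x in (2, 9, 18, 36, 5):
    insert(x)
assert A[5] == 18 and A[6] == 5   # 5 collided with 18 and moved one cell right
assert member(5) and not member(31)
```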
Hashing with Quadratic Probing
- Let f(i) = i^2
- The sequence of locations to probe is: h(x), h(x)+1, h(x)+4, h(x)+9, ... (+ is mod m).
- Ex: Suppose m = 13, h(x) = x mod 13, S = {2, 9, 18, 36} (so h(S) = {2, 5, 9, 10}) and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 35:
- compute h(35) = 9
- see that A[9] = 9 ≠ _
- see that A[(9+1) mod 13] = A[10] = 36 ≠ _
- see that A[(9+4) mod 13] = A[0] = _ and store 35 there
- Now A is:
35 _ 2 _ _ 18 _ _ _ 9 36 _ _
- To check if 35 ∈ S:
- compute h(35) = 9
- see that A[9] = 9 ≠ 35
- see that A[10] = 36 ≠ 35
- see that A[0] = 35 and return true
- To check if 22 ∈ S:
- compute h(22) = 9
- see that A[9], A[10], A[0], A[5] are not 22 or _
- see that A[(9+16) mod 13] = A[12] = _ and return false
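A quadratic probing sketch; note that with f(i) = i^2 the third probe for h(35) = 9 lands at (9 + 4) mod 13 = 0:

```python
# Quadratic probing: probe h(x), h(x)+1, h(x)+4, h(x)+9, ... (mod m).
m = 13
h = lambda x: x % m
A = [None] * m

def insert(x):
    for i in range(m):
        j = (h(x) + i * i) % m
        if A[j] is None:
            A[j] = x
            return True
    return False      # quadratic probing can fail even when free cells remain

def member(x):
    for i in range(m):
        j = (h(x) + i * i) % m
        if A[j] is None:
            return False
        if A[j] == x:
            return True
    return False

for x in (2, 9, 18, 36, 35):
    insert(x)
assert A[0] == 35     # probes A[9], A[10], then A[(9+4) mod 13] = A[0]
assert member(35) and not member(22)
```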
Double Hashing
- Let f(i) = i · h_2(x), where h_2 is a hash function for D that is different from h, and with h_2(x) ≠ 0 for every x
- The sequence of locations to probe is: h(x), h(x) + h_2(x), h(x) + 2·h_2(x), ... (+ is mod m)
- Ex: Suppose m = 13, h(x) = x mod 13, h_2(x) = 7 − (x mod 7), so h_2(15) = 6, and A is:
_ _ 2 _ _ 18 _ _ _ 9 36 _ _
- To insert 15:
- compute h(15) = 2
- see that A[2] = 2 ≠ _
- compute h_2(15) = 6
- see that A[(2+6) mod 13] = A[8] = _ and store 15 there
- Now A is:
_ _ 2 _ _ 18 _ _ 15 9 36 _ _
- To check if 15 ∈ S, check A[2], then A[8], and return true
- To check if 10 ∈ S:
- compute h(10) = 10
- see that A[10] = 36 ≠ 10
- compute h_2(10) = 4
- see that A[(10+4) mod 13] = A[1] = _ and return false.
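A double hashing sketch, assuming h_2(x) = 7 − (x mod 7) (a choice consistent with 15 landing in cell 8; the slide's exact h_2 was not preserved):

```python
# Double hashing: step size is h2(x), a second hash that is never 0.
m = 13
h  = lambda x: x % m
h2 = lambda x: 7 - (x % 7)    # in (1..7), so never 0

A = [None] * m

def insert(x):
    i, step = h(x), h2(x)
    while A[i] is not None:   # assumes the table is not full
        i = (i + step) % m
    A[i] = x

def member(x):
    i, step = h(x), h2(x)
    while A[i] is not None:
        if A[i] == x:
            return True
        i = (i + step) % m
    return False

for x in (2, 9, 18, 36, 15):
    insert(x)
assert A[8] == 15             # h(15) = 2 is taken, so probe (2 + 6) mod 13 = 8
```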
Removal with Open Addressing
- Suppose we have a hash table H for a set S containing x, and want to remove x.
- If H uses separate chaining, we just delete x from the list A[h(x)]
- If H uses open addressing, we cannot simply empty the cell, because x affects the probe sequence for other elements
- Ex: Suppose m = 13, h(x) = x mod 13, and A was obtained as in our Linear Probing example. A:
_ _ 2 _ _ 18 5 _ _ 9 36 _ _
- Suppose we now delete 18, so A:
_ _ 2 _ _ _ 5 _ _ 9 36 _ _
- Now, searching for 5 fails because A[h(5)] = A[5] = _!
- One solution is to mark cells where we have deleted elements.
Removal with Open Addressing
- Ex: In the previous example, to remove 18 we replace it with d; A:
_ _ 2 _ _ d 5 _ _ 9 36 _ _
- Now search & insert procedures perform as if A[5] held some key that we will never search for.
- To remove x:
- determine the sequence of locations: h_0(x), h_1(x), h_2(x), ...
- when x is found, replace it with d
- Notice that search & insert work correctly as they are
- Insert can be modified to reclaim space:
- To insert x:
- Examine the sequence of probe locations
- stop at the first one containing _ or d and store x there
- NB: In implementation, d and _ could be special values, or A could be an array of objects or structs with "empty" and "deleted" variables/fields.
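A sketch of open-addressing removal with tombstones (linear probing; EMPTY and DELETED play the roles of _ and d):

```python
# Tombstone deletion: DELETED cells stop insertion but not search.
m = 13
h = lambda x: x % m
EMPTY, DELETED = object(), object()   # sentinel values for "_" and "d"
A = [EMPTY] * m

def insert(x):
    i = h(x)
    while A[i] is not EMPTY and A[i] is not DELETED:
        i = (i + 1) % m               # linear probing
    A[i] = x                          # reclaims tombstone cells

def member(x):
    i = h(x)
    while A[i] is not EMPTY:          # DELETED cells do NOT stop the search
        if A[i] == x:
            return True
        i = (i + 1) % m
    return False

def remove(x):
    i = h(x)
    while A[i] is not EMPTY:
        if A[i] == x:
            A[i] = DELETED            # tombstone, not EMPTY
            return
        i = (i + 1) % m

for x in (2, 9, 18, 36, 5):
    insert(x)
remove(18)
assert member(5)   # still found: the tombstone keeps the probe chain intact
```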
Load Factor
- The "load factor" of a hash table H is: λ = n/m, where n is the number of stored keys and m is the table size.
- Good performance requires λ not too large.
- For separate chaining: λ should not be much larger than 1, as the average list length is λ.
- For open addressing, want λ ≤ 1/2, so that it is not too hard to find a place to make an insertion.
Some Properties with Open Addressing
- Linear Probing
- Insertion always successful if λ < 1
- Primary clustering is a serious problem.
- Quadratic Probing
- Avoids primary clustering
- Exhibits secondary clustering, but this is less problematic
- Insertion always succeeds if λ ≤ 1/2, but may fail if λ > 1/2 (even if there is space).
- Double hashing
- Requires design of a second suitable hash function
- Requires computing 2 hash functions whenever probing beyond A[h(x)] is needed.
Rehashing
- Rehashing hash table H means constructing a completely new hash table for the contents of H.
- We may want to do it if:
- λ is too large (close to 0.5 for open addressing, much larger than 1 for separate chaining)
- Performance has become poor (which may result from clustering, from long linked lists, or from many removals)
- Takes Θ(n) time, under the assumption that insert is O(1).
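A rehashing sketch for open addressing with linear probing; the growth rule m → 2m + 1 is illustrative (in practice one grows to roughly the next prime):

```python
# Rehash when the load factor n/m would exceed 1/2.
m = 7
h = lambda x, m: x % m
A = [None] * m
n = 0   # number of stored keys

def insert(x):
    global n
    if (n + 1) / m > 0.5:              # keep lambda = n/m at most 1/2
        rehash()
    i = h(x, m)
    while A[i] is not None:
        i = (i + 1) % m
    A[i] = x
    n += 1

def rehash():                          # Theta(n): reinsert every key into a bigger table
    global A, m
    old = [x for x in A if x is not None]
    m = 2 * m + 1                      # illustrative; real tables pick a prime near 2m
    A = [None] * m
    for x in old:
        i = h(x, m)
        while A[i] is not None:
            i = (i + 1) % m
        A[i] = x

for x in (2, 9, 18, 36, 5):
    insert(x)                          # the 4th insert triggers a rehash: m becomes 15
```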
Hashing Properties
- Well-designed hash tables are effective in practice, with fast insert, member, remove operations.
- Require a good hash function for the domain of application.
- Operations O(1) on average, under assumptions that may not hold in practice:
- all keys equally likely
- hash function distributes keys uniformly
- λ is small
- Do not support operations based on order of keys, such as:
- enumerate in order
- min, max, range lookups
- union, intersection
- (These are efficient with AVL Trees & B-Trees).