Part 5 - Hash Tables

Posted on Nov 29, 2022
(Last updated: May 26, 2024)

Hash Tables

Hash tables are one of the most famous (and widely used) data structures. But why are they so popular and powerful? Before we can answer that question we need to look back at sets and maps.

Sets and Maps in Context

Just to refresh our memory - a set is a collection of items where duplicates aren’t allowed. A map is a set of keys, each with an associated value - or, put differently, a set of key-value pairs.

Maps are used in almost all programs and applications since they’re very powerful and intuitive. They can act as databases (some databases are just straight up maps as well):

  • Look up a person by their social security number.
  • Look up a file in a computer by its name/filetype/size.
  • Find all words appearing in a text/website/book.

The only problem is that our usual implementations of maps make search() and update() quite slow - the complexity is usually $\mathcal{O}(n)$, or $\mathcal{O}(n \cdot m)$ in case we’re using a multimap.

We would like a complexity of $\mathcal{O}(\log n)$ or $\mathcal{O}(1)$. Even if we implement maps using all the different data structures we have - we still can’t achieve this:

  • Dynamic Array of key-value pairs
    • Search takes linear time
    • Insertion takes linear time
  • A Linked List
    • Same as the dynamic array
  • A sorted Dynamic Array
    • Search takes logarithmic time (binary search).
    • Insertion still takes linear time…

The answer is Hash Tables!

Separate Chaining Hash Tables

One thing to know before we dive in: there are two different strategies when it comes to hash tables - separate chaining and open addressing. We’ll begin with separate chaining.

Hash Functions

Before we start talking about what a hash table is, we first need to look at what a hash function is. The problem with ordinary maps is that they are unordered data structures. We need some kind of function which calculates, based on the item itself, where item x should go in our map - because afterward we can just jump to that index, which is much faster. This is what hash functions are for!

At the same time our hash function needs to be ‘good’. We can’t have a hash function that only returns index 1 and stacks every item into the same slot. But we also can’t be lazy - take the example of distributing a collection of people by name: we can’t just go by the first letter of their name, since a lot more names begin with an ‘A’ than with a ‘Z’, for example.

So we want a hash function that distributes the objects evenly and, at the same time, is efficient to compute.
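
To make this concrete, here’s a tiny Python sketch (the name list, the slot count of 10 and the helper functions are all just made up for illustration) comparing the ‘first letter only’ idea with Python’s built-in hash:

from collections import Counter

names = ["Alice", "Adam", "Anna", "Axel", "Bob", "Zoe"]

def first_letter_hash(name: str) -> int:
    # Naive: only looks at the first character, so names starting
    # with 'A' all pile up in the same slot.
    return ord(name[0]) % 10

def builtin_hash(name: str) -> int:
    # Python's built-in hash mixes all characters of the string,
    # so the result is usually much more evenly spread out.
    return hash(name) % 10

print(Counter(first_letter_hash(n) for n in names))
print(Counter(builtin_hash(n) for n in names))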

With this in mind we can now define a hash table.

Hash Tables

A hash table uses a hash function to compute the array index. It’s not possible to put more than one object into one array slot, so if our hash function returns the same index for different objects (a collision - unavoidable in practice, even with a good hash function) we need a solution. Separate chaining is one of them.

Separate Chaining

In a separate chaining hash table, each array slot doesn’t hold a single value - it holds a pointer. This pointer points to a collection of all the values/objects that hash to that index.

Usually this is a linked list, but any searchable collection works (a dynamic array, for example).

Open Addressing Hash Tables

In the other case, open addressing, each slot contains at most one object/value. If we ever encounter a conflict, we simply move that value/object into a free slot - the tricky part is finding a free slot efficiently. We’ll see more of this later.

Separate Chaining: Hash Set

For our implementation of Hash tables using separate chaining we could use sets - and for the underlying collection, we usually use linked lists.

So a kind of implementation (in pseudocode) would be:

class SeparateChainingHashSet <Item> implements Set<Item>:
    table: Array of Set<Item>

    // A given good hash function
    hash(x : Item) -> int:
        // Returns good hash value

    contains(x : Item) -> boolean:
        bucket : Set<Item> = table[hash(x)]
        return bucket.contains(x)

    add(x : Item):
        bucket : Set<Item> = table[hash(x)]
        bucket.add(x)

    remove(x : Item):
        bucket : Set<Item> = table[hash(x)]
        bucket.remove(x)

As we can see there is quite a lot of ‘mental overhead’ here. Using a map interface instead gives us actual key-value pairs, which is usually what we want. In addition, we need to know the size of the hash table and handle null pointers - so let’s fix that as well!

class SeparateChainingHashMap<Key, Value> implements Map<Key, Value>:
    table: Array of Map<Key, Value>
    size : int = 0

    hash(k : Key) -> int:
        // Returns good hash value

    contains(k : Key) -> boolean:
        bucket : Map<Key, Value> = table[hash(k)]
        if bucket == NULL:
            return False
        return bucket.contains(k)

    get(k : Key) -> Value:
        bucket : Map<Key, Value> = table[hash(k)]
        if bucket == NULL:
            return NULL
        return bucket.get(k)

    put(k : Key, v : Value):
        bucket : Map<Key, Value> = table[hash(k)]
        if bucket == NULL:
            bucket = table[hash(k)] = new Map()
        if not bucket.contains(k):
            size += 1
        bucket.put(k, v)

    remove(k : Key):
        bucket : Map<Key, Value> = table[hash(k)]
        if bucket != NULL and bucket.contains(k):
            bucket.remove(k)
            size -= 1

If we actually try to implement it with the underlying data structure, the linked list, it would look something like:

class SeparateChainingHashMap<Key, Value> implements Map<Key, Value>:
    ...
    class Node:
        key : Key
        val : Value
        next : Node

    get(k : Key) -> Value:
        node : Node = table[hash(k)]
        while node != NULL:
            if k == node.key:
                return node.val
            node = node.next
        return NULL

    put(k : Key, v : Value):
        h : int = hash(k)
        node : Node = table[h]
        while node != NULL:
            if k == node.key:
                node.val = v
                return
            node = node.next
        // If we are given a new key
        table[h] = Node(k, v, table[h])
        size += 1
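
For reference, here’s a minimal runnable Python version of the same idea (the class and method names are my own, and the table stays at a fixed size for simplicity - resizing comes next):

class SeparateChainingHashMap:
    class Node:
        def __init__(self, key, val, next=None):
            self.key, self.val, self.next = key, val, next

    def __init__(self, buckets=16):
        self.table = [None] * buckets
        self.size = 0

    def _index(self, key):
        # Compress Python's hash code into a valid array index.
        return hash(key) % len(self.table)

    def get(self, key):
        node = self.table[self._index(key)]
        while node is not None:
            if key == node.key:
                return node.val
            node = node.next
        return None

    def put(self, key, val):
        i = self._index(key)
        node = self.table[i]
        while node is not None:
            if key == node.key:
                node.val = val   # existing key: just update the value
                return
            node = node.next
        # New key: prepend a node to this bucket's linked list.
        self.table[i] = self.Node(key, val, self.table[i])
        self.size += 1

m = SeparateChainingHashMap()
m.put("apple", 3)
m.put("pear", 5)
print(m.get("apple"))  # 3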

Load Factor

We should always assume that we have a good hash function, since in most libraries they are good! But to understand what ‘good’ buys us we need to understand something called the load factor. The load factor is the average number of elements per slot/index.

So the load factor is $\frac{N}{M}$, where $N$ is the number of elements and $M$ is the array size. We haven’t discussed it yet, but what’s the complexity of contains, add, and put, for example? Well, it depends on the total array size and the number of elements per slot - the load factor!

Therefore, the complexity of each operation is $\mathcal{O}(\frac{N}{M})$

Which tells us: if we keep $N$ and $M$ roughly proportional, all of our operations will be constant time! This is the magic of hash tables.
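
As a quick sanity check with made-up numbers (1000 elements, a few different table sizes):

n = 1000                       # number of stored elements
for m in (100, 1000, 10000):   # candidate table sizes
    print(f"M = {m:5d}  ->  load factor N/M = {n / m:.2f}")
# With M = 1000 each bucket holds about one element on average,
# so contains/add/put only touch a handful of elements.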

So our array must grow as the hash table grows. To achieve this we do the same thing as with dynamic arrays: when we need to resize the hash table, we create a new array and copy all the old elements over using put().

put(k : Key, v : Value):
    if size >= 8 * table.length:
        resize(2 * table.length)
    bucket : Map<Key, Value> = table[hash(k)]
    if bucket == NULL:
        bucket = table[hash(k)] = new Map()
    if not bucket.contains(k):
        size += 1
    bucket.put(k, v)

remove(k : Key):
    if size <= 2 * table.length:
        resize(table.length / 2)
    bucket : Map<Key, Value> = table[hash(k)]
    if bucket != NULL and bucket.contains(k):
        bucket.remove(k)
        size -= 1

resize(buckets : int):
    oldtable = table
    table = new Array with size buckets
    size = 0
    for bucket in oldtable:
        if bucket != NULL:
            for k, v in bucket:
                this.put(k, v)

In this example I’ve chosen to resize the table by a factor of 2. However, this can actually lead to some problems - with modular hashing, a table size that is a power of two only looks at the low-order bits of the hash code. The ‘optimal’ way is to choose a prime close to 2 * oldsize.
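
A minimal Python sketch of that idea (simple trial division is plenty here, since table sizes are modest):

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def next_table_size(old_size: int) -> int:
    # Smallest prime that is at least twice the old size.
    candidate = 2 * old_size
    while not is_prime(candidate):
        candidate += 1
    return candidate

print(next_table_size(16))  # 37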

Hash and compress

In actuality, the hash functions are a composition of functions. First we get the hash code for a given object/value. Then we need to compress it to fit inside our hash table.

How we compress the hash code is (often) quite simple - it is:

index = hash_code % array_size
i = h % M

This is called ‘modular’ hashing, and it’s the most common approach. One thing to remember is that the hash code itself never changes - only the compressed index can change, since the table size can change.

So we might see different indices for the same x depending on the table size - but remember, the actual hash code never changes.
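
A quick Python illustration, where Python’s built-in hash plays the role of the hash code:

x = "hello"
h = hash(x)            # the hash code: fixed for this object (within one program run)
for m in (8, 16, 37):  # different table sizes
    print(m, h % m)    # the compressed index changes with the table size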

Requirements on hash functions

There is one very strict requirement on hash functions:

  • Equal objects must have equal hash codes

Or in coding terms:

if x == y then x.hash() == y.hash()

There are some desirable properties:

  • If x and y have the same hash code, then they are ’equal’
  • The distribution should be uniform and independent
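
In Python, for example, this is why a class that overrides __eq__ should also override __hash__ so the two stay consistent - a minimal sketch:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must produce equal hash codes.
        return hash((self.x, self.y))

p, q = Point(1, 2), Point(1, 2)
print(p == q)              # True
print(hash(p) == hash(q))  # True - the strict requirement holds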

Open Addressing: Probing

As we defined earlier, open addressing is a method where we find a new empty slot if there’s a conflict. Probing is how we find this empty slot.

The easiest approach is so-called linear probing, where we simply go to the next index (increase by some constant, usually 1) and check if it’s empty, wrapping around the end of the array.

class LinearProbingHashSet<Item>:
    table: Array of Item
    size : int = 0

    add(x : Item):
        i = hash(x)
        while table[i] != NULL:
            if x == table[i]:
                return
            i = (i + 1) % table.length
        table[i] = x
        size += 1

Or in the Hash Map case:

class LinearProbingHashMap<Key, Value>:
    table: Array of (Key, Value) entries
    size : int = 0

    put(k : Key, v : Value):
        i = hash(k)
        while table[i] != NULL:
            if k == table[i].key:
                table[i].value = v
                return
            i = (i + 1) % table.length
        table[i] = (k, v)
        size += 1

Open Addressing compared to Separate Chaining

Let’s now compare both of these approaches now that we have a grasp of them.

  • Separate Chaining
    • The load factor can be > 1 and the table keeps working (performance only degrades gradually).
    • The extra list nodes take up extra memory.
    • Since the list nodes are (usually) scattered in memory, we can’t take advantage of CPU caching.
  • Open Addressing
    • The load factor must be < 1, and performance drops noticeably once it goes above roughly 3/4.
    • Doesn’t need extra memory for list nodes.
    • Elements with the same hash code tend to be close in memory, so we can take advantage of CPU caching.

Deletion

If we want to delete objects while using open addressing, we run into a very famous problem related to ‘clusters’. A short definition: due to open addressing, colliding elements get probed into neighbouring cells, forming small ‘clusters’ or ‘chunks’ of occupied slots. These clusters matter, because lookups rely on walking through them - we keep probing until we either find the object or hit an empty slot.

But to understand why deletion is tricky, let’s first see how we would delete items.

Naive approach

Our naive approach would look something like:

remove(x : Item):
    i = hash(x)
    while table[i] != NULL:
        if x == table[i]:
            table[i] = NULL
            return
        i = (i + 1) % table.length

And in the map case:

remove(k : Key):
    i = hash(k)
    while table[i] != NULL:
        if k == table[i].key:
            table[i] = NULL
            return
        i = (i + 1) % table.length

But this won’t work - if we later want to find something that was stored to the right of the deleted cell, but whose hash index lies at or before the deleted cell, we won’t be able to: the search hits the NULL value and stops before reaching the actual object.
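
Here’s a small Python demonstration of that failure (the keys 3 and 11 and the table size 8 are picked purely so that they collide):

M = 8
table = [None] * M

def probe_insert(x):
    i = x % M
    while table[i] is not None:
        i = (i + 1) % M
    table[i] = x

def probe_contains(x):
    i = x % M
    while table[i] is not None:
        if table[i] == x:
            return True
        i = (i + 1) % M
    return False

probe_insert(3)    # lands in slot 3
probe_insert(11)   # also hashes to 3, probed into slot 4
table[3] = None    # naive deletion of 3
print(probe_contains(11))  # False - 11 is still in the table, but unreachable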

There are two possible solutions:

  • Lazy deletion:
    • We don’t necessarily ‘delete’ the item, just ‘mark’ it deleted, rather than empty.
  • Recalculation:
    • We reinsert all elements in the cluster that are to the right of the deleted element.

One problem we will also encounter with lazy deletion is that ‘deleted’ cells still take up space in the table - so if we don’t count them, our resizing calculations will suffer. The solution is to keep a counter of deleted cells (n_deleted below); if the load factor, counting deleted cells, exceeds some threshold we resize.

Final implementations

So, we have covered all topics and problems that come with implementing Hash tables - so let’s implement the final version.

For a Hash Set:

class LinearProbingHashSet<Item>:
    table: Array of HashCell
    size: int = 0
    n_deleted: int = 0
    DELETED: HashCell = HashCell(value = NULL)

    class HashCell:
        value: Item

    load_factor() -> float:
        return (size + n_deleted) / table.length

    add(x : Item):
        if load_factor() > 0.75:
            resize(2 * table.length)
        i = hash(x)
        spot = -1                      // remembers the first DELETED slot we pass
        while table[i] != NULL:
            if table[i] == DELETED:
                if spot == -1:
                    spot = i
            else if x == table[i].value:
                return                 // already in the set
            i = (i + 1) % table.length
        if spot != -1:                 // reuse a deleted slot if we found one
            i = spot
            n_deleted -= 1
        table[i] = HashCell(value = x)
        size += 1

    contains(x : Item) -> boolean:
        i = hash(x)
        while table[i] != NULL:
            if x == table[i].value:
                return True
            i = (i + 1) % table.length
        return False

    remove(x : Item):
        i = hash(x)
        while table[i] != NULL:
            if x == table[i].value:
                table[i] = DELETED
                size -= 1
                n_deleted += 1
                if load_factor() < 0.25:
                    resize(table.length // 2)
                return
            i = (i + 1) % table.length

    resize(buckets: int):
        oldtable = table
        table = new Array with size buckets
        size = n_deleted = 0
        for cell in oldtable:
            if cell != NULL and cell != DELETED:
                add(cell.value)

And in the Map case:

class LinearProbingHashMap<Key, Value>:
    table: Array of HashCell
    size: int = 0
    n_deleted: int = 0
    DELETED: HashCell = HashCell(key = NULL,value = NULL)

    class HashCell:
        key: Key
        value: Value

    load_factor() -> float:
        return (size + n_deleted) / table.length

    put(k : Key, v : Value):
        if load_factor() > 0.75:
            resize(2 * table.length)
        i = hash(k)
        spot = -1                      // remembers the first DELETED slot we pass
        while table[i] != NULL:
            if table[i] == DELETED:
                if spot == -1:
                    spot = i
            else if k == table[i].key:
                table[i].value = v     // existing key: just update the value
                return
            i = (i + 1) % table.length
        if spot != -1:                 // reuse a deleted slot if we found one
            i = spot
            n_deleted -= 1
        table[i] = HashCell(key = k, value = v)
        size += 1

    contains(k : Key) -> boolean:
        i = hash(k)
        while table[i] != NULL:
            if k == table[i].key:
                return True
            i = (i + 1) % table.length
        return False

    remove(k : Key):
        i = hash(k)
        while table[i] != NULL:
            if k == table[i].key:
                table[i] = DELETED
                size -= 1
                n_deleted += 1
                if load_factor() < 0.25:
                    resize(table.length // 2)
                return
            i = (i + 1) % table.length

    resize(buckets: int):
        oldtable = table
        table = new Array with size buckets
        size = n_deleted = 0
        for cell in oldtable:
            if cell != NULL and cell != DELETED:
                put(cell.key, cell.value)

Clustering

As mentioned before, clusters are a by-product of open addressing - and they make searches slower; in the worst case a search becomes linear!

The famous Donald E. Knuth formulated the so-called ‘parking problem’, which shows just how rapidly these clusters grow as the table fills up.
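
As a rough, unscientific illustration (the table size and key range are picked arbitrarily), we can insert random keys with linear probing and watch the average probe count climb as the load factor grows:

import random

M = 1024
table = [None] * M

def insert(x):
    # Linear probing; returns how many slots we had to inspect.
    i = x % M
    probes = 1
    while table[i] is not None:
        i = (i + 1) % M
        probes += 1
    table[i] = x
    return probes

random.seed(1)
checkpoints = {int(M * f): f for f in (0.25, 0.5, 0.75, 0.9)}
total_probes = 0
for n in range(1, int(0.9 * M) + 1):
    total_probes += insert(random.randrange(10**9))
    if n in checkpoints:
        print(f"load factor {checkpoints[n]:.2f}: "
              f"average probes per insertion so far = {total_probes / n:.2f}")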

Alternatives to linear probing

Instead of increasing linearly, we could probe quadratically, use double hashing (which depends on a second hash function), or use schemes that move existing elements around, such as cuckoo, hopscotch and Robin Hood hashing. We will only use linear probing though…
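
For intuition, here’s a small sketch of the first few probe indices each scheme would visit (the table size of 11 and the second hash value are arbitrary choices, just for illustration):

M = 11  # a small prime table size, just for illustration

def linear_probes(h, k=5):
    return [(h + step) % M for step in range(k)]

def quadratic_probes(h, k=5):
    return [(h + step * step) % M for step in range(k)]

def double_hash_probes(h, h2, k=5):
    # h2 must be non-zero, otherwise we would never move off the first slot.
    return [(h + step * h2) % M for step in range(k)]

print(linear_probes(3))           # [3, 4, 5, 6, 7]
print(quadratic_probes(3))        # [3, 4, 7, 1, 8]
print(double_hash_probes(3, 7))   # [3, 10, 6, 2, 9]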

Summary

There’s a lot to hashing and hash tables in general, but I think we’ll leave it here. One final thing I want to bring up:

Hash tables are not ordered! If we want to find the maximum value in a hash table, it will take linear time, since there’s no underlying order.

In the next part we’ll begin looking at trees - starting with Balanced Search Trees, or BSTs for short. We’ll actually compare them to hash maps, since they are quite similar in some aspects.