In this article, we will examine why bad hash functions can clobber typical expandable hash tables whose table sizes are powers of 2. We will present Prime Double Hash Table as an attractive algorithm when the quality of the hash function cannot be assured in advance.
Prime Double Hash Table always uses prime number table sizes. The table is expandable, and the cost of rehashing is amortized across each insertion or deletion operation.
Hash table is a high performance data structure used for data look up. Key-value pairs can be inserted into the hash table, and given a key, its associated value can be very quickly retrieved.
A part of the definition of hash table involves hash function. A hash function maps a key to an integer. As key-value pairs are stored in array of buckets, this integer determines which bucket the key-value pair will be stored.
Commonly, the hash table bucket address is calculated as follows:
bucket_address = hash(key) % hash_table_size;
By taking the remainder of the hash table size, the bucket address is assured to never exceed the size of hash table. The bucket is represented by a linked list of key-value pairs.
There are many approaches in designing a hash table. They can be found in introductory data structure textbooks. A design can be evaluated on terms of the desired properties of a hash table:
We will look at the well known Expandable Hash Table algorithm, pointing out how it does not meet the last property. Lastly we will introduce Prime Double Hash Table, and show how it matches up against these properties.
Classical expandable hash table algorithms use hash table sizes that are always a power of 2, eg. 2, 4, 8, 16, 32... With each expansion of the hash table, one more bit of the hash function result are used for bucket address calculation. For example, when the hash table size is 128, the lower 7 bit of the hash function is read out to select the bucket address.
A poor choice of the hash function can cause all objects in the hash table to happen to contain some identical lower bits in hash results. This severe funneling matches exactly with the table size to cause all objects in the hash table to hash to the same bucket. The performance of the hash table then becomes the same as linear searching.
In a concrete demonstration of this problem, we will look at an application that uses Expandable Hashing to store a number of 2 character variable names. In this example, all variable names have names such as "a1", "b1", "d1", "X1", and "Z1". The fatal hash function is defined as:
hash(str) = (str[0] * 256 + str[1]) * 2654435761; /* unsigned 32 bit arithmetic */
This method is Knuth's Multiplicative Method, found in "The Art of Computer Programming". Unfortunately, in this case its use is not suitable.
If the hash table size is 128, some simple calculation would reveal that all the variables would hash to the same bucket!
('a' * 256 + '1') * 2654435761 % 128 = 97
('b' * 256 + '1') * 2654435761 % 128 = 97
('d' * 256 + '1') * 2654435761 % 128 = 97
('X' * 256 + '1') * 2654435761 % 128 = 97
('Z' * 256 + '1') * 2654435761 % 128 = 97
A defense against hash function of unknown quality is to use hash table whose size is a prime number. If we repeat the previous example using a hash table size of 29, then all variables would be hashed to different buckets. The prime numbers picked should seldomly be a typical factor of the key. Numbers such as 2, or 5 are not attractive for this reason.
Prime Double Hash Table is an expandable hash table whose table sizes are always prime numbers. It does not contain significant pauses during hash table expansion.
The secret to the Prime Double Hash Table is using two hash tables of prime number size. One hash table is smaller than the other one. During expansion, elements in the smaller hash table are gradually moved to the larger hash table. During shrinkage, elements in the larger hash table are gradually moved to the smaller hash table.
On average, 1.5 lookups are required to find an entry in the Prime Double Hash Table, since there are two hash tables in the lookup path. However, this may be a good trade-off to avoid the danger of degenerate linear searching.
Normally, key-value pairs are single linked in a bucket. In the Prime Double Hash Table, key-value pairs are additionally double linked to a 'membership' linked list for quick hash table migration.
Hash table traversal is the capability to retrieve key-value pair entries in the hash table sequentially. The traversal ordering refers to the sequence of key-value pairs that are returned.
For two hash tables containing identical key-value pairs, should their traversal ordering be identical as well, without regard to the order how the key-value pairs were inserted?
For applications that use hash tables as values, stable traversal ordering is required. For some applications, stable traversal ordering would be desirable for ease of debugging. For other applications, stable traversal ordering is not a big issue.
In the case of the Prime Double Hash Table, stable traversal ordering is hard to implement. This is because the expansion process migrates an item to an unpredictable place in the larger hash table. The shrinkage process has a hard time to return all items to the exact same places as before.
Sorted Linear Hash Table has a stable traversal ordering. Unfortunately, its power of 2 sized hash table makes it susceptible to bad hash functions.
In this article, we have shown that the use of hash table whose size is a power of 2 can be dangerous if the hash function is not of high quality. We also presented the Prime Double Hash Table, an expandable hash table that always uses table sizes consisted of prime numbers. Prime Double Hash Table is suitable for use as a general purpose hash table when the choice of the hash function cannot be determined ahead of time.