Hashing and Hash Tables

Summary

The Basic Idea of Hashing

Hashing is used for data structures whose design objective is providing efficient insertion and find operations. In order to meet this design objective, certain concessions are made: 1) we do not require that there be specific ordering of items in the container and 2) it is not a primary objective to make removal as efficient as insertion and find.

To meet the design objective of providing efficient insertion and find operations, we need a data structure for which both insertion and find operations are O(1) in the worst case. This constant time performance objective immediately leads us to the following conclusion: Our implementation must based on an array rather than on a linked list. This is because we access the kth element in an array in constant time whereas the same operation in a linked list takes O(k) time.

Previously two searchable containers were considered - the ordered list and the sorted list. For an ordered list, the cost of insertion is O(1) while the cost of find is O(n). For a sorted list, the cost of insertion is O(n) and the cost of find is O(log n). In order to meet this performance objective of constant time insertion and find, we need a way to do them without performing a search. That is, given an item x, we need to be able to determine directly from x the array position where it is stored.

Example

We wish to implement a searchable container that will be used to store strings from the set of strings K:

S = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}

Suppose that we also define a function f(x) given by the following table:

x f(x)
"one" 1
"two" 2
"three" 3
"four" 4
"five" 5
"six" 6
"seven" 7
"eight" 8
"nine" 9
"ten" 10

We implement the searchable container by creating an array whose length is 10. To insert any of the above items, we store it at position f(x) - 1. Similarly, to locate an item x we check if it is at position f(x) - 1. For example, to store item "three", we evaluate f("three") and get back 3. Therefore, we store item "three" at location 2. Similarly, to locate item "three", we evaluate f("three") which returns 3 and check if the item exists at location 2. If the function f(x) can be evaluated in constant time, then both the find and insert operations are O(1).

Keys and Hashing Functions

The object so far is designing a container that will hold some number of items of a given set K. In this context, the elements of K are called keys. The general approach is to store the keys in an array where the position of each key is determined by a function f(key) called the hash function. The hash function f(key) determines the position of the key directly from the key's value.

In the general case, we expect the size of the set of keys K to be relatively large or even unbounded. On the other hand, we also expect the number of items stored in the container to be significantly less than the size of K. In other words, if |K| is the size of K and n is the number of items stored in the container, then we expect n << |K|. Therefore, it seems prudent to use an array of size M where M |K|.

Consequently, what we need is a function that will map the set of value to be stored in the container to subscripts in an array of length M. This function is called the hash function:

h: K à {0, 1, 2, ... M-1}

In general, because M |K|,  the mapping defined by the hash function will be many-to-many. That is, there will exist many pairs of distinct keys x and y such that x y, for which h(x) = h(y). This situation is called a collision (several approaches for dealing with collisions are explored in the following sections).

So what are the characteristics of a good hash function?

Avoiding Collisions

Ideally, given a set of keys {k1, k2, ..., kn}, the set of hash values {h(k1), h(k2), ..., h(kn)} contains no duplicates. In practice, unless we know something about the keys chosen, we cannot guarantee that there will be no collisions. For example, if the keys in our application are telephone numbers from a specific area code, then the area code should not be part of the hash function, as it will be the same for each telephone number.

Spreading Keys Evenly

Spreading keys evenly means that the hash values computed by the hash function h(x) are uniformly distributed. Unfortunately, in order to say something about the distribution of hash vales, we need to know something about the distribution of the keys.

Ease of Computation

This does not mean that it is easy to write the algorithm to compute the hash function. It just means that the running time of the hash function should be O(1).

Hashing Methods

In the following sections we explore several hashing methods. Here we assume that we are dealing with integer-values keys and that the value of the hash function falls between 0 and M-1:

 

Division Method

Function

Procedure Divide the key x (recall we are dealing with integers only) by M and use the remainder modulo,
Notes M is often chosen as a prime number (5,7,11,13,17,... ) and can be determined at runtime.

For example:

public class DivisinHash
{
    private static readonly int M = 1023;
    public static int Hash(int x)
   {
      return Math.Abs(x) % M;
   }
}

A potential disadvantage is that consecutive keys map to consecutive has values. While this ensures that consecutive keys do not collide, it does mean that consecutive array locations will be occupied. The other hashing methods that follow tend to scatter consecutive keys.

Middle-Square Method

This method avoids doing any division since integer division is slower than integer modification. By avoiding division we can potentially improve the performance of the hashing algorithm. We can avoid division by making use of the fact that a computer does finite-precision integer arithmetic. For example, all arithmetic is done modulo W where W = 2w is a power of 2 such that w is the word size of the computer.

Function

Procedure

Assume that M is a power of 2, M = 2K, where k ≥ 1. Because M and W are both a power of 2, the ratio M/W is equal to 2w-k. Therefore, in order to multiply the term (x2 mod W) by M/W we simply shift to the right by w-k bits. In effect, we are extracting k bits from the middle of the square of the key x, hence the name.

termine the number of bits in your OS's integer representation. (in C# do a typeof(int) to get number of bytes) T

Notes Middle-square method scatters consecutive keys nicely, , however, since this method only considers a subset of bits in the middle of x2,  keys which have a large number of leading or trailing zeros will collide.

For example:

public class MiddleSquareHash
{
    private const int k = 10;
    private const int w = 32;        // size of System.Int32

    public static int Hash(int x)
    {
        // Note that we cast to uint before the shift so as to cause the right-shift operator to insert zeros on the left.
        // Therefore, the result always falls between 0 and M-1

        return (int)((uint)(x * x)) >> (w-k);
    }
}

Multiplication Method

Multiplication method is a very simple variation of the middle-square method. Instead of multiplying the key x by itself, we multiply x by a carefully chosen value a, and then extract the middle k bits from this result.

Function

Procedure a should be chosen such that it does not have leading or trailing zeros.
Notes

If we choose a such that it is relatively prime to W, then there exists another number a' such that aa'  = 1 (mod W). In other words, a' is the inverse of (a mod W). Such a number has the nice property that ax multiplied by a' will give back x. In other words, axa' = aa'x = 1x. There are many possible constants with the desired properties, and one possibility which is suitable to 32-bit arithmetic (W = 232), is a =2654435769 with binary representation of 1001 1110 0011 0111 0111 1001 1011 1001

For example:

public class MultiplicationHash
{
    private const int k = 10;
    private const int w = 32;        // size of System.Int32
   
private const uint a = 2654435769U;

    public static int Hash(int x)
    {
        // Note that we cast to uint before the shift so as to cause the right-shift operator to insert zeros on the left.
        // Therefore, the result always falls between 0 and M-1

        return (int)((uint)(x * a)) >> (w-k);
    }
}

Hash Function Implementations

The hashing methods presented earlier dealt only with integer keys. In reality keys might be strings or even objects of other classes. In general, given a set of keys K, and a positive constant M, a hash function is a function of the form:

h: K à {0, 1, 2, ... M-1}

In practice, it is convenient to implement the hash function h as a composition of two functions, f and g. The function f maps keys into integers and function g maps integers to {0, 1, ..., M-1}. Therefore, the hash function is given by:

h(x) = g( f(x) )

In other words, we have decomposed the problem into two parts. The first involves finding a suitable mapping from the set of keys K to non-negative integers, and the second problem involves mapping these non-negative integers into the internal [0, M-1]. The hashing functions described earlier (division, middle-square, and multiplication) all deal with integer values which is what function g is required to do. On the other hand, the choice for function f depends on the characteristics of its domain.

The following considres various different domains (sets of keys) and discusses a suitable hash function for each. Each domain considered corresponds to a C# class.Recall that a C# class ultimately derives from System.Object which declares a method called GetHashCode(). object.GetHashCode() therefore, corresponds to the function f which maps keys into integers.

Integral Keys

C# integral types (byte, sbyte, char, short, ushort, int, uint, long, ulong) are the easiest to hash into integers. A suitable hash function is therefore the identity function f(x) = x:

public override int GetHashCode()
{
    return (int)obj;        // recall the key is an integer so that cast should not fail
}

Floating-Point Keys

C# floating point types (float and double) are just as easy as integral types. We seek a function f(x) which maps x (a floating point value) to a non-negative integer. This must be able to deal with the extremely wide range of floating-point values. For example, what is the integer representation for 0.0000003? 123.456768798? 999999.123?

Every non-zero floating-point number can be represented as:

-1s x m x 2e   s e{0,1}. ≤  m ≤

where s is called the sign, m is called the mantissa or significant, and e is called the exponential. This suggests the following function definition for f(x):

Where W i=2w such that w is the word size of the machine. In the IEEE standard floating-point format, the least significant 52 bits of a  64-bit floating point number represent the quantity m' = (m - 0.5) x 253. Since int is 32 bits, W = 232. f(x) then becomes m' 233/ 253 = m' / 220.

In other words, we compute the hash function simply by right-shifting the binary representation of the floating-point number by 20 bits. The following program illustrates:

public override int GetHashCode()
{
    long lbits = System.BitConverter.DoubleToInt64Bits((double)obj);
    return(int)((ulong)lbits >> 20);
}

Character Strings Keys

The basic ideas is that a string is simply a sequence of characters that may be arbitrarily long. The challenge is to devise a function that allows us to map from an unbounded domain into the finite range of int. Please refer to Brun Preiss book on data structures for more information.

Using Associations

Recall that hashing provides a way to determine the position of an object directly from the object itself. Given an object x, we determine its position by evaluating the appropriate hash function h(x). Assume we determined the position and inserted the object at that position. Assume now that we would like to retrieve object x. But in order to find object x, we need to determine its location, and to determine its location, we need to evaluate h(x) ! That is we need to have object x in the first place.

In practice, when using hashing, we are dealing with keyed data. In other words, we are dealing with data of the form (key1, value1), (key2, value2), etc. where keyn is used to locate valuen.

 

 

 

 

.