CSE 465 Programming Assignment 1 : Frequently Asked Questions

  1. Why is this assignment related to hash tables and hash functions?

    There are several different ways to attack the problem of determining whether two files contain the same multi-set of words. To get a linear time solution, a natural algorithm is to insert all the words of the first file into a hash table, then try deleting all the words from the second file. If all the deletions succeed and the files both have the same number of words, then you know they contain the same set of words (with the same frequencies). Note that your hash table should be able to handle repeated keys properly.

    There are other, similar solutions (for example, you can create two hash tables, one for each file, and then compare them for equality). You can also solve the problem without using hash tables, but the assignment specifically asks you to use hashing with chaining.

  2. What type of hash function should I use to hash words?

    This is really several questions. First, how big should my hash table be? Second, what hash function should I use?

  3. How big should my hash table be?

    Recall that inserting or deleting an item into a hash table takes expected constant time (under the simple uniform hashing assumption) as long as the number of items is within a constant factor of the number of cells in the hash table. The solution outlined above performs at most n insertions, so it would be sufficient to use a hash table with, say, 2n cells (better yet, use a prime number close to n). You could find n by first making a preliminary pass through the input files which counts how many words they contain.

  4. What type of hash function should I be using?

    There are several good choices for hash functions, as discussed in Section 11.3 of the book. The problem is that interpreting strings -- even fairly short ones -- as integers in the straightforward way leads to integers so large they quickly overflow the word size of most computers. Exercise 11.3-2 asks you to think of a way around this for the division method, viewing the string as an integer represented in base (radix) 128.

    Suppose the string s has t characters. Then we can interpret it as an integer k as follows: k = Σ i=1...t s[i] 128i-1 . We want to compute k mod m. We can use the fact that the "mod" operator can be exchanged with addition and multiplication (that is, (a+b) mod m = (a mod m + b mod m) mod m, and (a*b mod m) = ((a mod m) * (b mod m)) mod m ).

    The following pseudocode computes k mod m.

    GETHASH(s,m)
      multiplier := 1
      output :=0 
      for i = 1 to length(s) do
          output := (output + (multiplier * s[i]) ) mod m
          multiplier := (multiplier * 128) mod m
      return output
    
    A slightly more efficient way of doing the same thing uses "Horner's rule" for evaluating polynomials:

    k mod m = ((((s[t]*128 + s[t-1])*128 + s[t-2])*128 + s[t-3])* 128 + ...)

    That leads to the following, cleaner pseudocode:

    GETHASH(s,m)
       output := s[length(s)]
       for i=length(s)-1 downto 1 do
            output := (128*output + s[i]) mod m
       return m
    
    It is convenient to use base 128 since (normal) ASCII characters are represented using 7 bits. However, in this particular context, you could work with 26 as the base since you know the characters are lower case latin letters.

    A technique similar to the one above allows one to use the multiplication method (also described in Section 11.3) to hash strings. The multiplication method distributes values a bit more uniformly in the hash table.

  5. Does my program have to work for all inputs, or only for the examples you provided?

    Your program should work for all pairs of inputs files which consist of only spaces and lower case English letters, subject to the memory limits of the computer it is running on.


Adam D Smith
Last modified: Sat Mar 17 21:26:34 EDT 2007