There are several different ways to attack the problem of determining whether two files contain the same multi-set of words. To get a linear time solution, a natural algorithm is to insert all the words of the first file into a hash table, then try deleting all the words from the second file. If all the deletions succeed and the files both have the same number of words, then you know they contain the same set of words (with the same frequencies). Note that your hash table should be able to handle repeated keys properly.
There are other, similar solutions (for example, you can create two hash tables, one for each file, and then compare them for equality). You can also solve the problem without using hash tables, but the assignment specifically asks you to use hashing with chaining.
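As a rough sketch of the insert-then-delete scheme (in Python, with Python's built-in `hash` standing in for a string hash function, and a crude table-size choice -- both assumptions, not part of the assignment's required solution):

```python
def same_word_multiset(file1_words, file2_words):
    # Chained hash table: a list of buckets, each bucket a list of
    # [word, count] pairs. Table size is roughly 2n (a prime near 2n
    # would be a better choice).
    n = len(file1_words)
    m = max(11, 2 * n + 1)
    table = [[] for _ in range(m)]

    def bucket(word):
        return table[hash(word) % m]  # built-in hash as a stand-in

    # Insert every word of the first file, counting repeats.
    for w in file1_words:
        b = bucket(w)
        for entry in b:
            if entry[0] == w:
                entry[1] += 1
                break
        else:
            b.append([w, 1])

    # Delete every word of the second file; fail on a miss.
    for w in file2_words:
        b = bucket(w)
        for entry in b:
            if entry[0] == w and entry[1] > 0:
                entry[1] -= 1
                break
        else:
            return False

    # Every insertion must have been matched by a deletion.
    return all(entry[1] == 0 for b in table for entry in b)
```

The final check over the remaining counts also covers the case where the first file has more words than the second, so no separate length comparison is needed.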
This is really several questions. First, how big should my hash table be? Second, what hash function should I use?
Recall that inserting an item into or deleting an item from a hash table takes expected constant time (under the simple uniform hashing assumption), as long as the number of items is within a constant factor of the number of cells in the hash table. The solution outlined above performs at most n insertions, so it would be sufficient to use a hash table with, say, 2n cells (better yet, a prime number of cells close to 2n). You can find n by first making a preliminary pass through the input files that counts how many words they contain.
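For example, a simple way to pick a prime table size near 2n (a one-off setup cost, so naive trial division is fine; the helper name is just for illustration):

```python
def table_size(n):
    # Smallest prime >= 2n, found by trial division.
    def is_prime(x):
        if x < 2:
            return False
        i = 2
        while i * i <= x:
            if x % i == 0:
                return False
            i += 1
        return True

    m = max(2, 2 * n)
    while not is_prime(m):
        m += 1
    return m
```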
This will not work well, since each of the cells in your hash table will contain Ω(n) items on average. Searching and deleting in this hash table will then take linear time, so the overall running time of the algorithm will be Ω(n²).
There are several good choices for hash functions, as discussed in Section 11.3 of the book. The problem is that interpreting strings -- even fairly short ones -- as integers in the straightforward way leads to integers so large they quickly overflow the word size of most computers. Exercise 11.3-2 asks you to think of a way around this for the division method, viewing the string as an integer represented in base (radix) 128.
Suppose the string s has t characters. Then we can interpret it as an integer k as follows: k = Σ_{i=1}^{t} s[i] * 128^(i-1). We want to compute k mod m. We can use the fact that the "mod" operator can be exchanged with addition and multiplication (that is, (a+b) mod m = ((a mod m) + (b mod m)) mod m, and (a*b) mod m = ((a mod m) * (b mod m)) mod m).
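These identities are easy to spot-check, for instance in Python (the particular values of a, b, and m are arbitrary):

```python
# Spot-check that mod commutes with addition and multiplication,
# which is what lets us reduce mod m at every intermediate step.
a, b, m = 123456, 789012, 97
assert (a + b) % m == ((a % m) + (b % m)) % m
assert (a * b) % m == ((a % m) * (b % m)) % m
```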
The following pseudocode computes k mod m.
GETHASH(s, m)
    multiplier := 1
    output := 0
    for i := 1 to length(s) do
        output := (output + multiplier * s[i]) mod m
        multiplier := (multiplier * 128) mod m
    return output

A slightly more efficient way of doing the same thing uses "Horner's rule" for evaluating polynomials:
k mod m = ( ( ... ( (s[t]*128 + s[t-1])*128 + s[t-2] )*128 + ... )*128 + s[1] ) mod m
That leads to the following, cleaner pseudocode:
GETHASH(s, m)
    output := s[length(s)] mod m
    for i := length(s) - 1 downto 1 do
        output := (128 * output + s[i]) mod m
    return output

It is convenient to use base 128 since (normal) ASCII characters are represented using 7 bits. However, in this particular context, you could work with 26 as the base since you know the characters are lowercase Latin letters.
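In Python the Horner-style version might look like this (a sketch; `reversed` matches the convention above, where s[1] is the least significant character):

```python
def gethash(s, m, radix=128):
    # Horner's rule: process the most significant character first and
    # reduce mod m at every step, so intermediate values stay small.
    output = 0
    for ch in reversed(s):
        output = (radix * output + ord(ch)) % m
    return output
```

The result agrees with computing k directly and then reducing, e.g. `gethash("ab", 97)` equals `(ord('a') + 128 * ord('b')) % 97`.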
A technique similar to the one above allows one to use the multiplication method (also described in Section 11.3) to hash strings. The multiplication method distributes values a bit more uniformly in the hash table.
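One way to sketch this (assumptions: 64-bit words, table size a power of two 2^p, and A chosen as the 64-bit fixed-point approximation of Knuth's suggested (√5 - 1)/2; none of these specifics come from the assignment):

```python
def mult_hash(s, p, radix=128, A=0x9E3779B97F4A7C15):
    # Interpret s as a base-128 number mod 2^64 via Horner's rule,
    # then apply the multiplication method: multiply by the odd
    # constant A and keep the top p bits of the low 64-bit word.
    MASK = (1 << 64) - 1
    k = 0
    for ch in s:
        k = (k * radix + ord(ch)) & MASK
    return ((k * A) & MASK) >> (64 - p)  # slot in [0, 2^p)
```

Reducing mod 2^64 at each step plays the same role as the "mod m" in GETHASH: it keeps intermediate values within the machine word size.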
English letters are not uniformly distributed, so this hash function will not use the space in the hash table well (for example, the cell labeled "abd" would almost never be used, while the cell labeled "pre" might be heavily loaded).
Your program should work for all pairs of input files that consist only of spaces and lowercase English letters, subject to the memory limits of the computer it is running on.