The inverted index is a tool that maps occurrences of data (in our case, words) back to locations in a structure. This is frequently used to support full text support for information stored in databases. It is typically used to return records or documents that contain the queried word. We are going to use it to return lines of text.
There are a number of ways that we can implement the inverted index — it basically just requires an associative structure. Of course, we will be implementing it with a hash map. The keys in this case will be words. The values will be a list of the lines in which that word occurred. When a user queries the index with a word, she will be able to see all of the lines containing the queried word.
Part one: The hash map [15 pts]
The first thing you will need to implement is a hash map in a class called
HashMap. This should fully
implement the Map interface we have been working with. The storage within the
HashMap should be provided by an array, and you should use external chaining to resolve collisions. For a hashing function, use the
hashCode() method that is already defined for all
In addition to the methods required by the
Map interface, you should implement the following additional methods on the
public HashMap(int size)
- This is the constructor. The size variable determines the number of slots in the underlying hash map.
public int getMaxChainLength()
This method should return the length of the longest chain in the
public double getLoadFactor()
This method should return the load factor of the
Part two: Building the index [35 pts]
The index should be implemented in a new class called
InvertedIndex. The interface of this class should look like this:
public InvertedIndex(File file)
The constructor of the class. This reads in a
Fileand builds the index.
public InvertedIndex(File file, int tableSize)
- This performs the same function as above, but allows the size of the hash table to be specified (the first constructor should just call this one with a default value).
public Iterator<Integer> getLineNumberIterator(String word)
Iteratorthat walks through each line number associated with a particular word. The
Iteratordoes not need to implement
public Iterator<String> getHighlightedLines(String word)
Iteratorthat returns Strings containing the line number, followed by a colon, followed by a line containing at least one instance of the word. Each instance of the word should be surrounded by # characters. This should also not implement remove.
public int numOccurrences(String word)
- Return the number of occurrences of the word in the document.
The constructor will take the file and store it, line by line, in an ArrayList (use the one in
java.util). This is just a copy of the document that we can easily access by line number. The
HashTable will be used to maintain our inverted index, which will map words (the keys) to occurrences in the document. We want to store both line number and word number within the line. You should create a simple wrapper class for holding these two pieces of information. Most words will occur more than once in a document, so the “value” stored in the hash table will be a
List of the small wrapper objects. Use the
LinkedList class from
java.util for this purpose. As an implementation detail, convert all of the words to lowercase before you store them in the hash table, this will allow for case insensitive searching.
Part three: An application [10 pts]
main method to your
InvertedIndex class. This should take two arguments on the command line: the first should be a number which should determine the number of slots in the hash table, and the second will be a file name. You will, of course, use the file and size to build an
InvertedIndex instance. Print out the two hash table statistics available through the methods specified earlier. You should add some methods to the
InvertedIndex to allow acces to these.
Once the statistics are printed out, your application should enter an interactive mode. When the user enters a word, your index should look it up and return the lines that contain it. Each line will be printed out in the following format (the < and > just indicate that they are fields being described, they shouldn’t be included in the actual output):
<line number>: <text of the line with all instances of the word surrounded by the # character.>
If there are no matching lines, there should be an appropriate error message. When the user uses the
RETURN key without entering any text, the application should quit.