Projects‎ > ‎

Project 1: Inverted Index

Summary 

For this project, you will write a Java program that processes several text files and builds an inverted index that stores a mapping from words to the documents (and locations within those documents) where those words were found. For example, suppose we have the following mapping stored in our inverted index:

elephant → { ( mammals.txt, [ 3, 8 ] ),
             ( endangered.txt, [ 2 ] ) }

This indicates that the word elephant is found in two files, mammals.txt and endangered.txt. In the file mammals.txt, it is found in two locations (the 3rd and 8th word). In file endangered.txt, it is found in one place as the 2nd word in the file.

Submission

You must submit your project to your SVN repository at:

https://www.cs.usfca.edu/svn/<username>/cs212/project1

where <username> (including the < and > symbols) should be replaced with your CS username. Once you have your project submitted, please fill out the Project Submission form.

You should include the following files in this directory:

  • a jar file named project1.jar that includes all of the necessary *.class files to run your program

  • a src directory with all of the *.java files necessary to compile your program

  • a readme.txt file with your name, email address, student id, and brief description/justification of your approach

To continue and make good progress in this class, you should aim to complete this project by the suggested deadline. To be eligible to still receive an "C" project grade in this course, you must submit this project by the listed cutoff date. These dates for this project are:

 Suggested Deadline:  
 Friday, February 17, 2012
 Cutoff Date:  
 Friday, March 23, 2012

Recall that you may only have one project submitted at any time, and you must allow two weeks after your submission date for grading.

Execution

Your code must run on the lab computers. If you are developing your code on a home computer or laptop, be sure to check out your code on a lab computer and test it. Your main method must be placed in a class named Driver. Your code will be tested using the following commands:

svn export https://www.cs.usfca.edu/svn/<username>/cs212/project1
cd project1
java -cp project1.jar Driver <arguments>

where <arguments> will be the following command-line arguments (in any order):

  • -d <directory> where -d indicates the next argument is a directory, and <directory> is the directory of text files that must be processed

The output of your program should be a file invertedindex.txt that contains the contents of your inverted index in the following output format:

elephant
"/home/sjengle/files/mammals.txt", 3, 8
"/home/sjengle/files/endangered.txt", 2

cow
"/home/sjengle/files/mammals.txt", 11

where the word is listed alone on a single line, followed by lines with the absolute file path, and a comma-separated list of locations. An empty line should separate entries.

If there are any issues with your submission and running your code, you will be asked to resubmit.

Functionality

For this program, you must traverse the supplied directory and process all *.txt files found in that directory (including all subdirectories). For each text file, you must parse out each word. For each word, you must store a mapping of the word to the file and position that word was found in a custom inverted index data structure. See below for details:

  • Your inverted index must store a mapping from word to the file(s) it was found, and the position(s) in that file it is located. This will require nesting multiple data structures. You should choose these data structures wisely.

  • The positions stored in your inverted index should start at 1. For example, if a file has words "apple banana carrot", then "apple" is in position 1, "banana" is in position 2, and "carrot" is in position 3.

  • Your program must separate a file into words by any whitespace, including spaces, tabs, and new line characters.

  • Your program must be case-insensitive. For example, the words APPLE, Apple, apple, and aPpLe should all be seen as the same word.

  • Your program must ignore all characters except letters and digits. For example, the word "age-long" should be seen as "agelong" (since the dash "-" should be ignored) and the word "Hello!" should be seen as "hello" without the exclamation "!" mark. One way to accomplish this is to replace any special characters with the empty string before parsing.

  • Your program must support large text files. As a result, you should not read the entire file into memory at once. It is, however, okay to read a single line from the file into memory at once.

  • Your program must be designed to object-oriented. For example, you should separate directory traversing, file parsing, and data maintenance into different classes. 

  • You must also protect the integrity of your inverted index, making sure you have proper encapsulation of any private data members. For example, do not return a reference to a private data member in a public method.

See the next section for recommendations on how to approach this project.

Getting Started

It is recommended you work on this project in two phases. For phase 1, you should:

  • Build a basic directory traverser, that is capable of getting all the text files from a directory.

  • Build a basic file parser, that is capable of splitting words by whitespace characters.

  • Create a basic inverted index that stores mappings from words to files, but not positions.

For phase 2 of this project, you should:

  • Modify your directory traverser to also get text files from all subdirectories.

  • Modify your file parser to ignore special characters.

  • Modify your inverted index to also store file positions in addition to files.

Of course, this is just a recommendation. You should try to tackle all projects iteratively, but you can decide how best to break the project into parts.

Project Hints

Here are some recently posted hints to help you on this project:

  • Resubmission You should be learning a lot from the resubmission process with this project. For example:ALWAYS check that your code works on the lab computers. Run the commands provided on ...
    Posted Feb 27, 2012 12:31 PM by Sophie Engle
  • Comparing Output For some of the test cases, you can visually compare that your output exactly matches the test case. However, in other situations, you may need to automate the process.Often ...
    Posted Feb 22, 2012 8:22 PM by Sophie Engle
  • Inverted Index Output To properly output the inverted index, you need to iterate through your nested data structure. See the IterationExamples.java code example on the Data Structures lecture page for how to ...
    Posted Feb 5, 2012 1:19 PM by Sophie Engle
  • Regular Expressions We'll cover regular expressions in more detail later in the semester. For now, you should take a look at the Pattern class and the RegEx Tutorial for the basics ...
    Posted Feb 5, 2012 1:17 PM by Sophie Engle
Showing posts 1 - 4 of 4. View more »

Testing

You should thoroughly test your own code. Make sure it meets the functionality requirements, performs proper exception and error handling, and produces the correct output. To assist you in your testing, several test files have been provided at the following location on the lab computers:

/home/sjengle/cs212/tests/index/

Note that there are several subdirectories within the index directory above. You should test each of these subdirectories individually to start, and compare your results to the individual result files below. For example, test /home/sjengle/cs212/tests/index/subdir/simple first and compare your results with invertedindex.simple.txt.

You should not submit your code until you are able to produce the test results. However, just because you produce the test results, does not mean your code is ready for submission. See the Projects page for what is required for the project to be considered "complete".

Note: For this particular project, the order of output may differ. However, the words and positions should not.

 Results
invertedindex.cases.txt
Download
Results from the /home/sjengle/cs212/tests/index/cases directory.   2k v. 2 Feb 5, 2012 2:09 PM Sophie Engle
invertedindex.extended.txt
Download
Results from the /home/sjengle/cs212/tests/index/extended directory.  5044k v. 2 Feb 5, 2012 2:34 PM Sophie Engle
invertedindex.simple.txt
Download
Results from the /home/sjengle/cs212/tests/index/simple directory.  9k v. 3 Feb 5, 2012 12:48 PM Sophie Engle
invertedindex.subdir.txt
Download
Results from the /home/sjengle/cs212/tests/index/subdir directory.  1k v. 2 Feb 5, 2012 12:48 PM Sophie Engle
invertedindex.txt
Download
Results from the /home/sjengle/cs212/tests/index directory.  5054k v. 3 Feb 5, 2012 2:36 PM Sophie Engle
 Tests
cases.tar.gz
Download
Test cases from the /home/sjengle/cs212/tests/index/cases directory.  1k v. 2 Feb 5, 2012 2:42 PM Sophie Engle
extended.tar.gz
Download
Test cases from the /home/sjengle/cs212/tests/index/extended directory.  871k v. 2 Feb 5, 2012 2:42 PM Sophie Engle
index.tar.gz
Download
Test cases from the /home/sjengle/cs212/tests/index directory.  873k v. 2 Feb 5, 2012 2:42 PM Sophie Engle
simple.tar.gz
Download
Test cases from the /home/sjengle/cs212/tests/index/simple directory.  1k v. 2 Feb 5, 2012 2:42 PM Sophie Engle
subdir.tar.gz
Download
Test cases from the /home/sjengle/cs212/tests/index/subdir directory.  1k v. 2 Feb 5, 2012 2:42 PM Sophie Engle
Subpages (1): Hints

Comments

Sophie Engle - Feb 5, 2012 1:21 PM

There is now a "Hints" section to help you as you develop your project. It will be updated periodically.