Instructions:
- You can write your programs in any language available on the Computing Science instructional lab machines; however, you may only use standard libraries in those languages.
- If you want to use a third-party or non-standard library, ask the instructor for permission first. You will not be allowed to use any library that implements the work you are asked to do in the assignment. Also, even with the instructor's permission, it is your responsibility to make sure that the libraries you need are installed on the lab machines, or can be installed without superuser privileges, so that your code can be graded by the TAs.
- This programming assignment is meant to be completed individually or in pairs, under the Consultation model of collaboration as per the Computing Science Department Course Policies.
- You must not upload binary files (e.g. SQLite databases) or large text files of any kind to your GitHub repository.
- All of your code and/or data must be on the GitHub repository created by following the instructions on eClass.
- That repository already has the folder structure for the assignment and the configuration file for automated testing. DO NOT MODIFY the folder structure.
- Remember to submit the URL of your repository through eClass before the deadline.
Learning Objectives
This assignment is intended for you to become familiar with:
- Building an inverted index to enable fast document retrieval.
- Answering keyword and phrase queries.
Required Reading
The background material for this assignment is in Chapter 6 and in Section 7.3 of the textbook.
Overview
Your goal is to design, implement, and document programs to build inverted indexes allowing the efficient processing of keyword and phrase queries over a collection of documents given in a JSON file formatted exactly as in the previous assignment.
Unlike in the previous assignment, zones must be ignored for the purposes of this assignment. To do that, your program must treat the content of all zones as the document content. That is, your program should concatenate the contents of the zones in the order in which they appear, adding a blank space between the contents of consecutive zones.
Here is what the content of the document with id 0 in the Dr. Seuss corpus would look like:

Unless someone like you cares a whole awful lot, nothing is going to get better. It’s not. The Lorax
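As a concrete illustration, here is a minimal Python 3 sketch of the concatenation step. The field names ("doc_id" and the zone keys) and the file name are assumptions; use the actual corpus format from the previous assignment.

```python
import json

def document_text(doc: dict) -> str:
    # Concatenate the content of every zone, in the order the zones appear,
    # separated by a single blank space. Assumes every field other than
    # "doc_id" is a zone whose value is a string; adjust to the real format.
    return " ".join(value for key, value in doc.items() if key != "doc_id")

# Hypothetical usage on the Dr. Seuss corpus:
with open("dr_seuss.json") as f:
    corpus = json.load(f)
print(document_text(corpus[0]))
```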
Tasks
Note: if you use Python 3 for this assignment, you can install libraries using Virtual Environments. If you do so, write down the step-by-step instructions in your README.md file.
Index Creation (1.5 marks)
Write a program in folder src/ of the repository that takes two command line arguments, in this order: the first is a path to the JSON file with the corpus, and the second is a path to the directory where the single index file (again, ignoring zones) will be stored. Note: you are allowed to use a library to load JSON data (e.g., Python's json library) for this task.
- Your index should support ranked retrieval using keyword and phrase queries.
- The index must be persistent and written to a TSV (tab-separated values) file in which the first column is the token, and the other columns contain all the information necessary for answering the queries (one possible layout is sketched after this list).
- The index file must be sorted by token, and each posting list must be sorted by document id.
- All documents in the corpus must be indexed, unless there are errors in the corpus.
- Your program should print error messages as necessary; if an error occurs while processing a document, your program should stop, and print an error message. No index files should be written if there is an error.
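The exact columns beyond the first are up to you; the sketch below shows one possible (not prescribed) layout that stores a positional posting list per document, which is enough for both keyword and phrase queries. The tokenizer and the "doc_id" field name are placeholders, and document_text is the zone-concatenation helper sketched earlier.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Placeholder tokenizer; reuse the tokenization from your previous assignment.
    return re.findall(r"\w+", text.lower())

def build_index(corpus):
    # token -> {doc_id -> list of positions, in increasing order}
    index = defaultdict(lambda: defaultdict(list))
    for doc in corpus:
        doc_id = doc["doc_id"]  # assumed field name
        for pos, token in enumerate(tokenize(document_text(doc))):
            index[token][doc_id].append(pos)
    return index

def write_index(index, path):
    # One row per token: the token, then one "doc_id:pos,pos,..." column per posting,
    # with rows sorted by token and postings sorted by document id.
    with open(path, "w") as out:
        for token in sorted(index):
            postings = ["{}:{}".format(d, ",".join(map(str, ps)))
                        for d, ps in sorted(index[token].items())]
            out.write(token + "\t" + "\t".join(postings) + "\n")
```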
Answering Keyword and Phrase Queries (3.5 marks)
Write a program in the src/ folder that takes three parameters as command line arguments, in this order: a path to a directory containing the index, the number k of ranked documents to be returned, and a single string with a complete query.
Your program must print the following to STDOUT, in this order:
- One line with the number of documents that were considered for the query (see Step 1 below).
- One line with the number of documents with non-zero scores among those that were considered for the query.
- Up to k lines, each with a document id and a non-zero similarity score (one beside the other, with the id first and a tab in between), sorted by decreasing score. A sketch of this output format is shown below.
Note: the program should print fewer than k lines only if there are fewer than k documents with non-zero scores with respect to the given query.
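To make the output contract concrete (this sketch covers only the printing, not the retrieval logic, and the names are made up):

```python
def print_results(considered, doc_scores, k):
    # doc_scores: list of (doc_id, score) pairs with non-zero scores only.
    print(considered)                  # documents considered for the query
    print(len(doc_scores))             # documents with non-zero scores
    ranked = sorted(doc_scores, key=lambda pair: pair[1], reverse=True)
    for doc_id, score in ranked[:k]:   # at most k lines
        print("{}\t{}".format(doc_id, score))
```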
Kinds of Queries
Keyword Queries contain only query terms separated by spaces (e.g., Lorax, cares better, etc.). Phrase Queries contain phrases delimited by colons (e.g., :someone like you cares:). To get full marks, your program must handle arbitrary queries with one or more keywords and one or more phrases as a single query (a parsing sketch follows the examples below).
Examples:
:get better: going :like you: lot
lot like you someone
:The Lorax:
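One way to separate phrases from keywords, sketched under the assumption that colons appear only as phrase delimiters:

```python
import re

def parse_query(query):
    # Phrases are the spans between pairs of colons; keywords are whatever remains.
    phrases = [p.strip() for p in re.findall(r":([^:]+):", query)]
    keywords = re.sub(r":[^:]+:", " ", query).split()
    return keywords, phrases

keywords, phrases = parse_query(":get better: going :like you: lot")
# keywords == ["going", "lot"], phrases == ["get better", "like you"]
```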
Scoring
Queries containing phrases must be processed in two steps. In the first, we attempt to collect a minimum number of documents containing as many of the original phrases as possible. In the second, we score the documents in that pool, or all documents if we cannot create a large enough pool. For queries with keywords only, we skip the first step.
Step 1: creating a pool of documents. In this step your program must use the phrases in the query, if any, to create a pool with at least 5k (5 times the parameter k) documents to be ranked. Documents outside of this pool should not be scored.
Ideally, every document in the pool should contain all phrases in the query. If this is not the case, your program should "relax" or remove phrases in the query with the intent of increasing the pool. This process must be done in order: phrases that appear in fewer documents should be relaxed or removed before phrases that appear in more documents.
Relaxing a phrase means replacing it by all sub-phrases with 2 consecutive words. For example, the phrase :someone like you cares: would be relaxed into :someone like:, :like you:, and :you cares:. A phrase with two words cannot be relaxed, but it can be removed from consideration (in which case your program will attempt to create the pool of documents using the remaining phrases).
If, after attempting all relaxations and/or removing all two-word phrases as described above, your program still cannot find a large enough pool of documents, the program should proceed to score all documents in the corpus instead.
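The sketch below is one possible reading of this step (strict intersection of the phrases' document sets, relaxing or dropping the rarest phrase first). Here docs_matching is a hypothetical helper that returns the set of document ids containing a phrase; the exact pool semantics are for you to decide and document.

```python
def relax(phrase):
    # Replace a phrase by all sub-phrases of 2 consecutive words:
    # "someone like you cares" -> ["someone like", "like you", "you cares"]
    words = phrase.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

def build_pool(phrases, k, docs_matching):
    # Try to assemble a pool of at least 5*k documents containing all current
    # phrases; return None if no such pool exists, meaning all documents are scored.
    while phrases:
        pool = set.intersection(*(docs_matching(p) for p in phrases))
        if len(pool) >= 5 * k:
            return pool
        # Relax or drop the phrase matching the fewest documents first.
        phrases.sort(key=lambda p: len(docs_matching(p)))
        rarest = phrases.pop(0)
        if len(rarest.split()) > 2:
            phrases = relax(rarest) + phrases
    return None
```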
Step 2: scoring. Regardless of whether the program is scoring a pool or all documents, the scoring should follow the lnc.btn "scheme" (see textbook for details), and the query vector should be formed by considering all terms that appear in the query either as a keyword or inside a phrase.
Documentation Requirements
You must include the following in the README.md file in the repository:
- How to install any libraries that are needed and for which you have the approval of the instructor. Make sure you test your own instructions on a lab machine.
- Clear instructions on how to use your programs, including at least one example showing all parameters.
- A discussion of which errors your programs detect and how these errors are handled.
- Links to videos (one per student working on the assignment) explaining the design and implementation of the code, including a discussion of the data structures, algorithms, and query optimizations that are used.
Suggestions:
- Use Python 3.
- Reuse your solution to homework 1, and take the TAs' feedback into account.
- Start by figuring out how the bash shell (which is the one we will use for grading) handles command line arguments, how to pass a query as a single argument, and how to parse it into a format you can handle (see the sketch after this list).
- Design and modularize your code well enough that different team members can work on different parts of the code.
- Use version control to your advantage: commit often and with small changes instead of in big chunks.
- Design your index builder to be modular: break it down into high level functions and document your code with comments.
- Test your code extensively by manually inspecting the positions in which some words appear in the corpus and always checking they are indexed correctly.
- Start small: work on the Dr. Seuss corpus first (or on an even smaller one that you create for this), build all indexes by hand, and use your answers to test your program.
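Regarding the bash suggestion above, a single-quoted query reaches your program as one argv element; the file and directory names below are hypothetical.

```python
import sys

# Invoked from bash as, for example:
#   python3 src/query.py path/to/index_dir 3 ':The Lorax: cares'
if len(sys.argv) != 4:
    sys.exit("usage: query.py <index-dir> <k> <query>")
index_dir = sys.argv[1]
k = int(sys.argv[2])
query = sys.argv[3]   # the whole quoted string, colons and spaces intact
```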
Grading
Each aspect of your submission will be marked according to the following scheme. The code will be tested on both valid and invalid input from the data set provided and from another used for our own testing. We will be performing all marking on standard CS instructional lab machines, so make sure all code runs on them exactly as described in the instructions you provide.
| Percentage | Correctness (60%) | Documentation (20%) | Instructions (20%) |
|---|---|---|---|
| 100 | (a) The code meets all requirements and behaves as expected on correct and incorrect input, accepting the command line arguments as specified; (b) the code uses suitable data structures and the best possible algorithms; (c) error messages are clear; (d) the program does not crash (e.g., through an uncaught exception) with unexpected or incorrect input (e.g., missing or invalid parameters). | The code is well documented: (a) all functions have meaningful comments; (b) all variables and functions have meaningful names; (c) the README.md file explains the data structures and algorithms, and covers all assumptions and the error handling strategy. | The instructions for executing the code are complete and clear: the TA is able to compile and/or execute the code on a lab machine by following exactly the instructions provided. |
| 75 | The code correctly implements most, but not all, of the requirements using suitable data structures and the best possible algorithms. | The documentation covers only the key functions in the code. | The instructions are insufficient, but all missing steps are obvious. |
| 50 | The code correctly implements some of the features, OR the code correctly implements the functionality using inappropriate data structures and/or algorithms. | The documentation covers only some of the key functions, OR a video explaining the code is missing. | The instructions are incomplete and the TA has to guess how to compile and run the code. |
| 25 | The code correctly implements only one or two of the requirements. | Minimal documentation that does not explain the ideas behind the design decisions. | The TA needs to contact the student(s) to clarify execution instructions. |
| 0 | The code does not work on any test cases; OR the code has syntax errors; OR the code requires libraries that cannot be installed on the lab machines or libraries for which you did not obtain written approval of the instructor. | Missing. | Missing. |
Recap of Requirements
Index creation
- The program must take two command line arguments in this order: (1) a path to the JSON file with the corpus and (2) a path to the directory where the index file must be stored.
- There should be one index for the entire input file.
- The index file must have the token as the leftmost column followed by all necessary data for answering phrase and keyword queries in the following columns; the file must be sorted in ascending order by token.
- The program must check that the input is correct. If this is not the case, the program must print an error message and write no files as output.
Keyword and Phrase Queries
- The program must take three command line arguments in this order: (1) a path to the directory containing the index; (2) the number k of expected answers; and (3) a query with one or more keywords and/or phrases.
- If the query is properly written, the program must print the ids of the k highest ranked documents, one per line, sorted by descending relevance.
- Your program must attempt to find a pool of at least 5k documents containing all phrases, relaxing or removing them from contention until such a pool is found. If fewer than 5k documents can be found meeting these criteria, all documents should be scored.
- Your program must score and rank documents using the lnc.btn "scheme", and the query vector should have all terms that appear in the query either as a keyword or in a phrase.