Instructions:
- You can write your programs in any language available on the Computing Science instructional lab machines; however, you may only use the standard libraries of those languages.
- If you want to use a third-party or otherwise non-standard library, ask the instructor for permission first. You will not be allowed to use any library that implements the work that you are asked to do in the assignment. Also, even with the permission of the instructor, it is your responsibility to make sure that the libraries that you need are installed on the lab machines, or can be installed without superuser privileges, so that your code can be graded by the TAs.
- This programming assignment is meant to be completed individually or in pairs, under the Consultation model of collaboration as per the Computing Science Department Course Policies.
- You must not upload binary files (e.g. SQLite databases) or large text files of any kind to your GitHub repository.
- All of your code and/or data must be on the GitHub repository created by following the instructions on eClass.
- That repository already has the folder structure for the assignment and the configuration file for automated testing. DO NOT MODIFY the folder structure.
- Remember to submit the URL of your repository through eClass before the deadline.
Learning Objectives
In this assignment you will implement text classifiers based on the "bag of words" model.
Required Reading
The background material for this assignment is in Chapters 13 and 14 of the textbook.
Overview
Your goal is to design, implement, and document programs to perform text classification using the Naive Bayes and the kNN classifiers. You are not allowed to use libraries that already implement any component of those classifiers in your solution.
All of your programs must accept parameters via command line arguments. Programs that do not accept command line arguments will get a grade of zero for their correctness.
Train and test sets
In the `data/` folder of your repository there are two files: `train.json`, which should be used for building the model used by your classifier, and `test.json`, which should be used for evaluating the classifiers. You are strongly encouraged to create, by hand, smaller files for training and testing, which you should use to debug your programs.
Each entry in those JSON files has two parts as in the example below:
```json
{
  "category": "class_name",
  "text": "text in one news story"
}
```
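For debugging, it helps to load and inspect these entries early. A minimal loader, assuming the file holds a single JSON array of such objects (check the actual files: a one-object-per-line layout would need per-line parsing instead):

```python
import json

def load_documents(path):
    """Return a list of {"category": ..., "text": ...} entries.

    Assumes the file contains one JSON array of such objects; if the
    data is stored as one JSON object per line, parse line by line.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```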
The TAs will also use a held-out test set, for their own independent evaluation, which will not be given to students until after the assignment is graded. Ideally, the F1 on the held-out set should be very close to that on the test set given upfront; if your model is trained correctly, this is very likely to happen.
Here are the numbers of documents in each class:

| class | train | test | held |
|---|---|---|---|
| business | 460 | 26 | 24 |
| entertainment | 348 | 20 | 18 |
| politics | 376 | 21 | 20 |
| sport | 460 | 26 | 25 |
| tech | 361 | 21 | 19 |
Tasks
Naive Bayes Classification and Feature Selection (10 marks)
Task 1 (3 marks). Write a program nbc/nbc_train.py that should take two command line arguments in this order: (1) path to a JSON file with training data; and (2) path to a file where the model will be written. Here is one example of how to call your program (assuming you use Python 3):
% python3 ./nbc/nbc_train.py ./data/train.json ./full_bbc_model.tsv
If the path provided for writing the model refers to an existing file, we strongly suggest that your program asks the user for confirmation before overwriting the file. Also, your program should print error messages if called with a wrong number of arguments or if the path to the training data is invalid.
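A sketch of how such checks might look (the exact wording of the messages and the prompt is up to you):

```python
import os
import sys

def parse_args():
    # Expect exactly two arguments: the training data and the model path.
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} TRAIN_JSON MODEL_TSV")
    train_path, model_path = sys.argv[1], sys.argv[2]
    if not os.path.isfile(train_path):
        sys.exit(f"error: training file not found: {train_path}")
    if os.path.exists(model_path):
        answer = input(f"{model_path} exists; overwrite? [y/N] ")
        if answer.strip().lower() != "y":
            sys.exit("aborted: model file left untouched")
    return train_path, model_path
```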
The model must be written to a TSV (tab-separated values) file with two kinds of lines, for representing priors and likelihoods. The first column in each line should be a single token: either `prior` or `likelihood`. If the line is for a prior, the next two columns should be the class name and its prior. If the line is for a likelihood, the next three columns should be the class, the term, and the probability.
All probabilities must be estimated using add-one smoothing.
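As a rough sketch of the estimation step, assuming a simple whitespace tokenizer (the assignment does not prescribe one, so this is an assumption):

```python
import math
from collections import Counter, defaultdict

def train_nbc(docs):
    """Estimate priors and add-one-smoothed likelihoods.

    docs: list of {"category": ..., "text": ...} entries.
    Returns plain probabilities; take logs at inference time, or
    store log-probabilities directly if you prefer.
    """
    class_counts = Counter(d["category"] for d in docs)
    term_counts = defaultdict(Counter)          # class -> term -> count
    for d in docs:
        term_counts[d["category"]].update(d["text"].split())
    vocab = {t for counts in term_counts.values() for t in counts}
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    likelihoods = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())
        for t in vocab:
            # Add-one smoothing: every term gets a pseudo-count of 1.
            likelihoods[(c, t)] = (counts[t] + 1) / (total + len(vocab))
    return priors, likelihoods
```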
Task 2 (2 marks). Write a program nbc/nbc_inference.py that should take two command line arguments in this order: (1) a path to a model created by your previous program; and (2) a path to a JSON test file like the one provided. Here is an example of how to call your program (assuming you use Python 3):
% python3 ./nbc/nbc_inference.py ./full_bbc_model.tsv ./data/test.json
Your program should print error messages if called with a wrong number of arguments or if the model file or the test file is not as specified above.
Your program must print to STDOUT:
- One line for each class in the test file with the true positive, false positive, false negative and true negative counts, followed by the precision, recall and F1 score for that class.
- One line with the micro-averaged F1 for the entire test.
- One line with the macro-averaged F1 for the entire test.
All quantities above must be calculated according to the smoothed Naive Bayes Classifier as explained in the textbook.
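The per-class and averaged scores follow directly from the confusion counts. A sketch, where `counts` maps each class to its `(tp, fp, fn)` triple (names here are illustrative):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro_f1(counts):
    """counts: dict mapping class -> (tp, fp, fn)."""
    # Micro-averaging pools the counts before computing F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf1(tp, fp, fn)[2]
    # Macro-averaging takes the unweighted mean of per-class F1.
    macro = sum(prf1(*c)[2] for c in counts.values()) / len(counts)
    return micro, macro
```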
Task 3 (4 marks). Write a program nbc/feature_selection.py that should take three command line arguments in this order: (1) a path to the training data as in the previous program; (2) a value `k`; and (3) a path to a file where the filtered training data will be written. Here is an example of how to call your program (assuming you use Python 3):
% python3 ./nbc/feature_selection.py ./data/train.json 10 ./data/train_top_10.json
Your program must calculate the mutual information between terms and classes in the training data and write a new training file in which the features are only those in the top-k for each class. The output of your program should be formatted exactly like the input training data, as it will be used to train other models.
If the path provided for writing the output refers to an existing file, we strongly suggest that your program asks the user for confirmation before overwriting the file. Also, your program should not overwrite the input file, and it should print error messages if called with a wrong number of arguments or if the path to the training data is invalid.
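For the mutual information itself, the textbook's formula works from four document counts per (term, class) pair. A sketch, with `n11`/`n10`/`n01`/`n00` denoting documents that are (in the class, containing the term), (outside the class, containing the term), and so on:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a term and a class from document counts.

    n11: docs in the class containing the term; n10: docs outside the
    class containing the term; n01/n00: the same for docs without it.
    Zero counts contribute zero (the limit of x log x as x -> 0).
    """
    n = n11 + n10 + n01 + n00
    def part(nij, row, col):
        return (nij / n) * math.log2(n * nij / (row * col)) if nij else 0.0
    return (part(n11, n11 + n10, n11 + n01) +
            part(n01, n01 + n00, n11 + n01) +
            part(n10, n11 + n10, n10 + n00) +
            part(n00, n01 + n00, n10 + n00))
```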
Task 4 (1 mark). Use your programs above to study the effect of feature selection. Using 10, 20, 40, 80, and 160 as the values of `k`, create different training sets using your feature selection code. For each training set, train and evaluate a model using the provided test set. Report the per-class F1 scores as well as the micro- and macro-averaged F1 scores.
Write these values in the `README.md` file in your repository.
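If you prefer to script the five runs rather than type them, a driver along these lines works (a sketch; it assumes the output files do not already exist, since your programs may prompt before overwriting):

```python
import subprocess

for k in (10, 20, 40, 80, 160):
    train = f"./data/train_top_{k}.json"   # illustrative file names
    model = f"./model_top_{k}.tsv"
    subprocess.run(["python3", "./nbc/feature_selection.py",
                    "./data/train.json", str(k), train], check=True)
    subprocess.run(["python3", "./nbc/nbc_train.py", train, model],
                   check=True)
    subprocess.run(["python3", "./nbc/nbc_inference.py", model,
                    "./data/test.json"], check=True)
```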
kNN Classification (5 marks)
Task 5 (2 marks). Write a program knn/knn_create_model.py that should take two command line arguments in this order: (1) path to a JSON file with training data; and (2) path to a file where the document vectors in the training data will be written. Here is one example of how to call your program (assuming you use Python 3):
% python3 ./knn/knn_create_model.py ./data/train.json ./bbc_model.tsv
If the path provided for writing the model refers to an existing file, we strongly suggest that your program asks the user for confirmation before overwriting the file. Also, your program should print error messages if called with a wrong number of arguments or if the path to the training data is invalid.
All documents in the train and test sets should be represented by vectors whose weights are calculated using the `ltn` weighting scheme. To calculate IDF, assume the corpus to be the training set only.
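In SMART notation, `ltn` means a logarithmic term frequency, a plain (t) IDF, and no normalization. A sketch of those two pieces, again assuming whitespace tokenization:

```python
import math
from collections import Counter

def idf_scores(docs):
    """t-scheme IDF: log10(N / df), with df counted over training docs."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d["text"].split()))
    return {term: math.log10(n / count) for term, count in df.items()}

def ltn_vector(text, idf):
    """Sparse ltn vector: (1 + log10 tf) * idf, zero entries omitted.

    Terms unseen in training have no IDF and are dropped here; how you
    handle them is a design decision worth documenting in your README.
    """
    tf = Counter(text.split())
    return {t: (1 + math.log10(c)) * idf[t]
            for t, c in tf.items() if t in idf}
```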
The output file should be a TSV (tab-separated values) file with the IDF scores of the terms in the training corpus (which are needed to compute the vectors of the test documents) and the vectors of the training documents. There should be two kinds of lines in this file, `idf` and `vector`, and the first column should indicate which kind the line is (i.e., it should be one of those two values).
In an `idf` line, the next two columns should be the term and the actual IDF of that term, computed from the training corpus using the `t` weighting scheme. In a `vector` line, the second column should be the class of the training document, and the third column should be a single string representing the vector of the document (which is a training example of that class). You can (and should) omit zero-weight entries in your vectors. Your output file should have as many `vector` lines as there are documents in the train set.
Task 6 (2 marks). Write a program knn/knn_inference.py that should take three command line arguments in this order: (1) a path to a TSV file with vectors computed as in the previous step; (2) a value `k`; and (3) a path to a JSON test file like the one provided. Here is one example of how to call your program (assuming you use Python 3):
% python3 ./knn/knn_inference.py ./bbc_doc_vectors.tsv 11 ./data/test.json
Your program should print error messages if called with a wrong number of arguments or if the model file or the test file is not as specified above.
Your program must print to STDOUT:
- One line for each class in the test file with the true positive, false positive, false negative and true negative counts, followed by the precision, recall and F1 score for that class.
- One line with the micro-averaged F1 for the entire test.
- One line with the macro-averaged F1 for the entire test.
All quantities above must be calculated according to the kNN classification algorithm as explained in the textbook, using the Euclidean distance and the provided value of `k`.
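With sparse vectors, the squared Euclidean distance only needs to visit the union of non-zero entries, and the square root can be skipped since it does not change the ranking. A sketch of the distance and the majority vote (tie-breaking here is arbitrary; document whatever policy you choose):

```python
from collections import Counter

def sq_euclidean(u, v):
    """Squared Euclidean distance between two sparse {term: weight} dicts."""
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2
               for t in u.keys() | v.keys())

def knn_predict(query, training, k):
    """training: list of (class_name, sparse_vector) pairs."""
    nearest = sorted(training, key=lambda cv: sq_euclidean(query, cv[1]))[:k]
    votes = Counter(c for c, _ in nearest)
    return votes.most_common(1)[0][0]  # ties broken by insertion order
```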
Task 7 (1 mark). Use your programs above to study the effect of the parameter `k`. Using 1, 5, 11, and 23 as the values of `k`, compute the per-class and the aggregate (micro- and macro-averaged) F1 scores using the vectors of all documents in the train set and all documents in the provided test set.
Write these values in the `README.md` file in your repository.
WARNING about computation time
To complete this task you will have to perform about 228 thousand distance calculations for each of the 4 values of `k`, for a total of close to 1 million calculations.
You should not underestimate how long this will take, and you should write clean and efficient code. You are strongly encouraged to use a heap data structure to keep the `k` vectors closest to the test document.
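With Python's standard library this can be a bounded max-heap (or simply `heapq.nsmallest`), avoiding the full sort in the earlier sketch. A sketch of the bounded-heap variant, with distances negated because `heapq` is a min-heap:

```python
import heapq

def k_nearest(query, training, k):
    """Keep the k closest training vectors using a bounded max-heap."""
    heap = []  # entries: (-distance, index); the root is the farthest kept
    for i, (label, vec) in enumerate(training):
        d = sq_euclidean(query, vec)  # from the earlier sketch
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:
            # The farthest kept neighbour is worse than this candidate.
            heapq.heapreplace(heap, (-d, i))
    return [training[i][0] for _, i in heap]  # classes of the k nearest
```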
For reference, if your program takes more than 5 minutes to run on the provided dataset with a single value of `k`, you should contact a TA to discuss how to improve your code. See the notes in the Grading scheme about execution time.
Grading
| Percentage | Correctness (60%) | Documentation (20%) | Instructions (20%) |
|---|---|---|---|
| 100 | (a) The code meets all requirements and behaves as expected on correct and incorrect input, accepting the command line arguments as specified; (b) the code uses suitable data structures and the best possible algorithms; (c) error messages are clear; (d) the program does not crash (e.g., through an uncaught exception) with unexpected or incorrect input (e.g., missing or invalid parameters). | The code is well documented: (a) all functions have meaningful comments; (b) all variables and functions have meaningful names; (c) the README.md file explains the data structures and algorithms, and covers all assumptions and the error handling strategy. | The instructions for executing the code are complete and clear: the TA is able to compile and/or execute the code following exactly the instructions provided, on a lab machine. |
| 75 | The code correctly implements most, but not all, of the requirements using suitable data structures and the best possible algorithms. | The documentation covers only the key functions in the code. | The instructions are insufficient, but all missing steps are obvious. |
| 50 | The code correctly implements some of the features, OR the code implements the functionality using inappropriate data structures and/or algorithms. | The documentation covers only some of the key functions, OR a video explaining the code is missing. | The instructions are incomplete and the TA has to guess how to compile and run the code. |
| 25 | The code correctly implements only one or two of the requirements. | Minimal documentation that does not explain the ideas behind the design decisions. | The TA needs to contact the student(s) to clarify execution instructions. |
| 0 | The program does not accept command line arguments as specified, OR does not produce the expected answer with the provided inputs after 5 minutes of execution on a lab machine, OR does not work on any test case, OR has syntax errors, OR requires libraries that cannot be installed on the lab machines or libraries for which you did not obtain written approval from the instructor. | Missing. | Missing. |