Instructions:
- This assignment is meant to be completed individually or in pairs, under the Consultation model of collaboration as per the Computing Science Department Course Policies.
- You must not upload binary files or large text files of any kind to your GitHub repository.
- Your answers must be on the GitHub repository created by following the instructions on eClass.
- That repository already has the folder structure for the assignment. DO NOT MODIFY that directory structure.
- Remember to submit the URL of your repository through eClass before the deadline.
Learning Objectives
This unit is intended for you to:
- Familiarize yourself with a modern cloud-based noSQL system.
- Learn non-relational extensions of SQL.
Required Reading
- From the Amazon Athena documentation, at least:
- What is Amazon Athena?
- Getting Started
- Accessing Amazon Athena
- Creating Databases and Tables
- Running Queries
- From the SQL Reference for Amazon Athena
- Data Types in Athena
- DML Queries, Functions, and Operators
- From the Running SQL Queries Using Amazon Athena
- All of the Amazon S3 documentation
- The description of the IMDB dataset.
Tasks
Step 1: Create the S3 buckets
Please note that you will deal with a small subset of the IMDB database.
- Following the instructions in the S3 documentation create a single bucket for the assignment with two main folders: one for the tables and another for the results.
- Download the files from https://webdocs.cs.ualberta.ca/~denilson/teaching/cmput391/imdb2020/ into your computer.
- Under the folder for the tables, create a separate folder for each of those files in the data folder of your S3 bucket.
- Upload each file to its respective folder.
Describe the steps you completed in your video and in the README.md
file. Add screenshots to your README.md
showing the results.
NOTE: do not remove any file nor any table definition from your AWS account. The TAs will login to AWS and inspect your work to make sure it works correctly.
Step 2: Create the tables in Athena
Following the instructions in the Athena documentation:
- Create a database called
imdb
. - Add a table to the database for each of the files in the data folder.
- Name each table by ignoring the
.tsv.gz
extension and replacing the remaining period (.
) by an underscore (_
). For example, one of the tables will be calledname_basics
. - Name and define each attribute so that they match the description of the IMDB dataset exactly.
- Name each table by ignoring the
Notes:
- The field separators are tabs (
\t
). - There are inconsistencies in the way the data was released by IMDB itself and those inconsistencies were left in the data provided to you.
- Athena's SERDE (SERializer/DEserializer) on CSV and TSV files does not recognize non-relational fields with lists of values. Instead, you should set up your tables in a way that those fields are recognized as strings.
Describe the steps you completed in your video and in the README.md
file. Add screenshots to your README.md
showing the results. Make sure to explain how you handled non-relational attributes in the data.
Queries
Write queries in Athena to answer the questions below. Add your query and a file with the answer to the corresponding folder in the Github repository.
0.25 marks each
Q1 -- How many comedies with average ratings higher than 7 are there in the database?
Q2 -- What is the highest rating among dramas with more than 2000 votes?
Q3 -- Who is in the cast of "Lost in Translation" and which characters did they play in the movie?
Q4 -- Who directed the highest rated comedy with Amy Schumer?
0.5 marks each
Q5 -- Who directed the most movies with the word "love" in the title? Your query must ignore the casing of the letters in the words.
Q6 -- How many people have worked in a movie that they are known for while doing something other than one of their primary professions?
Q7 -- Who are the people born after the year 2000 who acted in more than one movie and yet had the same director for all of their movies?
0.75 marks each
Q8 -- What is the fraction of movies having more actresses than actors?
Q9 -- What is the fraction of movies where the number of actors and actresses is the same by decade?
1 mark each
Q10 -- Which words (ignoring case) appearing more than 20 times in the title of a movie appear most frequently, by genre? You answer should exclude numbers, symbols, punctuation and stopwords.
Stopwords are those words that do not carry meaning, and instead are used to form coherent sentences. Here is a list of stopwords to be removed in this assignment: a
, an
, and
, are
, as
, at
, be
, by
, for
, from
, has
, he
, in
, is
, it
, its
, of
, on
, that
, the
, to
, was
, were
, will
, with
.
How to share videos
All of your videos should be kept in the Google drive associated with your CCID. Those videos should be shared with the TA and the instructor only.
In your GitHub repository, provide links to the videos in the README.md files. If you work in a pair, both members have to create and share their own videos explaining the answer in their own words.
Grading
Each query will be scored on a four-point scale on three criteria: correctness, clarity, and documentation as follows:
Scale | Correctness | Clarity | Documentation |
---|---|---|---|
A | The query computes exactly the expected answer: it has all columns and all rows as expected and all values are computed as required. The query does not perform any unnecessary steps and all data manipulation expressions and operations are appropriate. The query is generic and would compute the correct answer on any legal instance of the schema. | The query is formatted and indented properly; all necessary sub-queries are declared in CTEs; all tuple variables have readable and sensible names; all tests in WHERE clauses are needed and sensible. | The query has appropriate comments; the README file explains the intuition behind each step in the query; the video explains each step of the query instead of essentially reading the query. |
B | The query computes the expected answer on the provided database, but it contains unnecessary columns or rows OR some of the data manipulations are inappropriate. The query is generic and would compute the correct answer on any legal instance of the schema. | The query does not use CTEs when possible OR tuple variables are not properly named OR query is not properly formatted. | Comments and explanations (README and/or video) cover some but not all steps in the query. |
C | The query correctly computes at least one of the steps necessary for the correct solution OR the query computes correctly only some of the tuples in the expected answer OR the query has unnecessary steps or inappropriate operations, even if otherwise correct. | CTEs are not used OR tuple variables are confusing OR query uses unnecessary complicated steps that could be replaced by simpler and clearer ones. | Missing comments OR explanations (README and/or video) cover just a few of the steps in the query. |
D | Query does not correctly compute any logical steps needed for the correct solution. | Missing. | Missing. |
Below are the cut-offs for each grade. Partial grades between brackets may be awarded:
Percentage | Description |
---|---|
100 | Correctness = A; Clarity = A; Documentation = A |
75 | Correctness = A; Clarity = A; Documentation = B |
50 | Correctness = B; Clarity = B; Documentation = C |
25 | Correctness = C; Clarity = C; Documentation = C |
0 | Any one D. |