CMPUT291 - Fall 2015
Mini-Project 2
Due : November 24th at 5pm
Clarifications:
You are responsible for monitoring the course discussion forum in eclass and this section of the project
specification for more details or clarifications. No clarification will be posted after 5pm on November
23rd.
Nov 6, 2015: There is a typo in the description of the file scores.txt. Scores and review ids are
separated by a comma (and not a colon). The provided data files are correct.
Nov 18, 2015: Here is a marking rubric for the project.
Introduction
The goal of this project is to teach the concept of working with data in the physical layer. This is done by
building an information retrieval system, using the Berkeley DB library for operating on files and indices.
Your job in this project is to write programs that keep data in files and maintain indices that provide basic
searches over data. 80% of the project mark would be assigned to your implementation which would be
assessed in a demo session, and is further broken down to three phases with 10% of the mark allocated for
Phase 1, 5% of the mark for Phase 2, and 65% for Phase 3. Another 15% of the mark will be assigned for
the documentation and quality of your source code and for your design document. 5% of the mark is
assigned for your project task break-down and your group coordination.
Group work policy
You will be doing this project with one or two other partners from the 291 class. Your group information
from mini-project 1 is copied to mini-project 2 groups page on the assumption that you would be working
in the same group. If you decide to change groups, please notify the instructor. It is assumed that all group
members contribute somewhat equally to the project, hence they would receive the same mark. In case of
difficulties within a group and when a partner is not lifting his/her weight, make sure to document all your
contributions. If there is a break-up, each group member will get credit only for his/her portion of the
work completed (losing the mark for any work either not completed or completed by the partner). For the
same reason, a break-up should be your last resort.
Task
You are given a data file, which you will use to construct your indices. Here is a small data file with only
10 records and here is one with 1000 records. The data includes product reviews on Amazon; each review
record consists of a product id, a product title, a product price, userid and profile name of the reviewer,
helpfulness of the review, review score, review timestamp, summary and full text of the review, Your job
is to create indices, following Phases 1 and 2, and use those indices to process the queries in Phases 3.
Phase 1: Preparing Data Files
Write a program that reads reviews from standard input and construct the following four files. Before
constructing these files, escape every double quote " in the input by replacing it with " and escape
every backslash () by replacing it with double backslash (\). Backslash is escaped since Berkeley DB
does not like it in the input; double quote is also escaped so the symbol can be used to encode the
character strings.
reviews.txt: This file has one row for each review record. The fields are ordered as given in the input, and
the consecutive fields are separated by a comma. The fields product title, profile name (of the
reviewer), review summary and review text are placed inside quotations to avoid a possible mixup in the
separator character, for example when a comma appears inside a text field. The first field is the record
number (or record id) which is assigned sequentially to input records with the first record assigned id 1.
Here is this file for our sample file with 10 records.
pterms.txt: This file includes terms of length 3 or more characters extracted from product titles; a term is
a consecutive sequence of alphanumeric and underscore '' characters, i.e [0-9a-zA-Z] or the character
class \w in Perl or Python. The format of the file is as follows: for every term T in a product title of a
review with id I, there is a row in this file of the form T',I where T' is the lowercase form of T. That
means, terms must be converted to all lowercase before writing them in this file. Here is the respective file
for our sample file with 10 records.
rterms.txt: This file includes terms of length 3 or more characters extracted from the fields review
summary and review text. The file format and the way a term is defined is the as given above for the
filepterms.txt. Here is the respective file for our sample file with 10 records.
scores.txt: This file includes one line for each review record in the form of sc:I where sc is the review
score and I is the review id. Here is the respective file for our sample file with 10 records.
These files can also be found at directory ~drafiei/291/pub on the lab machines. In the same directory, you
would also find larger size files (such as 1k and 10k records) that you may want to use in the testings of
your programs.
Phase 2: Building Indexes
Sort all the files built in Phase 1 (except the file reviews.txt which is sorted) using the
Linux sort command; pass the right option to the sort command to keep only the unique rows (see the
man page for sort). You can keep the sorted data under the same file names or pass sorted records to
stdout so they can be piped to your loading program (as described next). Suppose the sorted files are
named as before (to simplify our presentation here). Given the sorted
files reviews.txt, pterms.txt, prterms.txt and scores.txt, create the following four indexes: (1) a hash index
on reviews.txt with review id as key and the full review record as data, (2) a B+-tree index on pterms.txt
with terms as keys and review ids as data, (3) a B+-tree index on rterms.txt with terms as keys and review
ids as data, (4) a B+-tree index on scores.txt with scores as keys and review ids as data. You should note
that the keys in all 4 cases are the character strings before the first comma and the data is everything that
comes after the comma. Use the db_load command to build your indexes. db_load by default expects
keys in one line and data in the next line. Here is a simple Perl script that converts input records into the
format db_load expects. Your program for Phase 2 would produces four indexes which should be
named rw.idx, pt.idx, rt.idx, and sc.idx respectively corresponding to indexes 1, 2, 3, and 4, as discussed
above.
In addition to db_load, you may also find db_dump with option p useful as you are building and testing
the correctness of your indexes.
Phase 3: Data Retrieval
Given the index files rw.idx, pt.idx, rt.idx and sc.idx created in Phase 2 respectively on review ids, product
title terms, review terms, and scores, write a program that processes queries as follows. Each query
returns the full record of the matching review, with review id given first, followed by the rest of the fields
formatted for output display, in some readable format. Here are some examples of queries:
-
p:camera
-
r:great
-
camera
-
cam%
-
r:great cam%
-
rscore > 4
-
camera rscore < 3
-
pprice < 60 camera
-
camera rdate > 2007/06/20
-
camera rdate > 2007/06/20 pprice > 20 pprice < 60
The first query returns all records that have the term camera in the product title. The second query return
all records that have the term great in the review summary or text. The third query returns all records that
have the term camera in one of the fields product title, review summary or review text. The fourth query
returns all records that have a term starting with cam in one of the fields product title, review summary or
review text. The fifth query returns all records that have the term great in the review summary or text and
a term starting with cam in one of the fields product title, review summary or review text. The sixth query
returns all records with a review score greater than 4. The 7th query is the same as the third query except
it returns only those records with a review score less than 3. The 8th query is the same as the third query
except the query only returns those records where price is present and has a value less than 60. Note that
there is no index on the price field; this field is checked after retrieving the candidate records using
conditions on which indexes are available (e.g. terms). The 9th query returns the records that have the
term camera in one of the fields product title, review summary or review text, and the review date is after
2007/06/20. Since there is no index on the review date, this condition is checked after checking the
conditions on terms. Also the review date stored in file reviews.txt is in the form of a timestamp, and the
date give in the query must be converted to a timestamp before a comparison (e.g. check out the date
object in the datetime package for Python). Finally the last query returns the same set of results as in the
9th query except the product price must be greater than 20 and less than 60.
More formally, each query defines some conditions that must be satisfied by one or more of the following
fields: product title, product price (referred to as pprice in queries), review score (referred to as rscore in
queries), review date (referred to as rdate in queries), review summary and review text. A condition on
terms from product title, review summary and review text can be either an exact match or a partial match;
for simplicity, partial matches are restricted to prefix matches only (i.e. the wild card % can only appear at
the end of a term). All matches are case-insensitive, hence the queries "Camera", "camera", "cAmera"
would retrieve the same results; for the same reason the extracted terms in previous phases are all stored
in lowercase. Matches on review scores, review dates and product price are range conditions denoted
by rscore > and rscore < for review scores, rdate > and rdate < for review dates and pprice > and pprice
< for product price. There is zero or more spaces between any of the terms rscore, rdate, andpprice and
the symbols < and >. There is also zero or more spaces between any of the symbols < and > and the
number or the date that follows it. Hence, rscore<20, rscore< 20, rscore <20, and rscore < 20 are all
valid and would return the same matches. Queries can use two range predicates on the same field as is the
cases in Query 10. Matches on terms can be exact (as in queries 1-3) or partial (as in query 4). A query
can have multiple conditions (as in queries 5, 7, 8, 9 and 10) in which case the result must match all those
conditions (i.e. the and semantics), and there is one or more spaces between the conditions. The
keywords rscore, pprice and rdate are reserved for searches (as described above) and would not be used
for any other purposes. The dates are formatted as yyyy/mm/dd in queries but they are stored as
timestamps in the data file; this means dates in the input queries must be converted to timestamp before a
search can be performed. You can assume every query has at least one condition on an indexed column,
meaning the conditions on rscore, pprice and rdate can only be used if a condition on review/product
terms or review scores is also present.
Testing
At demo time, your code will be tested under a TA account. You will be given the name of a data file, and
will be asked (1) to prepare reviews.txt, pterms.txt, rterms.txt, scores.txt, (2) build Berkeley DB indexes
rw.idx, pt.idx, rt.idx and sc.idx, and (3) provide a query interface, which will allow us to test your system
for Phase 3. We typically follow a 5 minutes rule, meaning that you are expected to prepare the data and
build your indices in Phases 1 and 2 in less than 5 minutes; if not, we may have to use our own indexes,
in which case you would lose the marks for Phases 1 and 2.
The demo will be run using the source code submitted and nothing else. Make sure your submission
The demo will be run using the source code submitted and nothing else. Make sure your submission
includes every file that is needed. There will be a limited time for each demo. Every group will book a
time slot convenient to all group members to demo their projects. At demo time, all group members must
be present. The TA will be asking you to perform various tasks and show how your application is
handling each task. A mark will be assigned to your demo on the spot after the testing.
Important note: You can make no assumption on the number of review records (and the size of any of
the files and indexes that are built from the input file). We will be using a relatively large file for testing
and you don't want your program to break down in our tests, resulting in a poor mark! That said, you can
make very few assumptions (such as the inverted list of a term can fit in main memory) and clearly state
them in your report. You can also assume that the length of a review text cannot exceed 33000 characters
(or bytes).
Instructions for Submissions
Your submission includes (1) the source code for phases 1, 2 and 3 and any makefile or script that you
may need to compile your code, and (2) a short report. Your source code must include at least three
programs, i.e. one for each phase. Your program for Phase 3 would implement a simple query interface in
your favorite programming language (e.g. Python, C or C++, Java). For phases 1 and 2, you can have
more than one program (for example, in Python, C, C++, java or Perl) and can make use of any Unix
command and scripting language (e.g. Perl, bash) that runs under Linux on lab machines, as long as you
clearly document in your report how and in what sequence the programs or commands should run. The
source code is submitted as follows:
Create a single gzipped tar file with all your source code and additional files you may need for your
demo. Name the file prj2code.tgz.
Submit your project tarfile in the project submission site by the due date at the top of this page.
Your report must be type-written and submitted in hardcopy at the designated drop boxes located on the
first floor of CSC building, across from the room 1-45 before the due date. Your report cannot exceed 3
pages in length plus one additional cover page. Your project report is due on Friday November 27th,
at 10am, and no late submission is allowed for the project report. All project members are expected to
submit their code online but only one copy of the project report will be submitted.
The report should include (a) a general overview of your system with a small user guide, (b) a description
of your algorithm for evaluating queries, in particular evaluating queries with multiple conditions and
wild cards and range searches and an analysis of the efficiency of your algorithm, (c) your testing
strategy, and (d) your group work break-down strategy. The general overview of the system gives a high
level introduction and may include a diagram showing the flow of data between different components;
this can be useful for both users and developers of your application. The user guide should have
instructions for running your code for phases 1, 2 and 3. The testing strategy discusses your general
strategy for testing, with the scenarios being tested and the coverage of your test cases. The group work
strategy must list the break-down of the work items among partners, both the time spent (an estimate) and
the progress made by each partner, and your method of coordination to keep the project on track. The
design document should also include any assumption you have made or any possible limitations your
code may have.
Last modified: Wednesday, 18 November 2015, 9:33 PM