The following steps illustrate retrieval & indexing.
- Choose a folder for your data. For the remainder of these steps, assume the data folder is `data/`. The immediate subfolders of the data folder are the categories.
- Use Websphinx to crawl pages and save them in your `data/` folder.
- Choose a folder for your index. For the remainder of these steps, assume the index folder is `index/`.
- Run `python app/indexer.py index "data/**/*"`. The first parameter is the index folder. The second parameter is a Unix glob pattern matching all the files to index.
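
To make the glob pattern and the category convention concrete, here is a minimal sketch of how a pattern like `data/**/*` could be expanded and each file mapped to its category. The function name `files_by_category` is hypothetical and not part of `app/indexer.py`; the sketch assumes the pattern is expanded with recursive matching, which may differ from what the indexer does internally.

```python
import glob
import os


def files_by_category(data_dir, pattern="**/*"):
    """Group files under data_dir by their immediate subfolder (the category).

    Hypothetical helper for illustration; not taken from app/indexer.py.
    """
    grouped = {}
    # recursive=True lets "**" match any number of nested directories.
    for path in glob.glob(os.path.join(data_dir, pattern), recursive=True):
        if not os.path.isfile(path):
            continue  # the glob also matches directories; skip them
        parts = os.path.relpath(path, data_dir).split(os.sep)
        if len(parts) < 2:
            continue  # file sits directly in data_dir, so it has no category
        grouped.setdefault(parts[0], []).append(path)
    return grouped
```

Calling `files_by_category("data")` on a folder containing `data/ece/page.html` would place that file under the key `ece`.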
The following steps illustrate running a search query. Retrieval is done using Lucene's TFIDFSimilarity class.
- Run `python app/retrieve.py <index_folder>`, where `<index_folder>` is the path to your index.
- Enter a query.
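
For readers unfamiliar with TF-IDF ranking, the sketch below shows the general idea in plain Python: a term counts for more when it occurs often in a document and rarely across the collection. It is a simplified illustration, not Lucene's scoring formula, which also applies length normalization and other factors. The function name `tfidf_scores`, the whitespace tokenizer, and the particular dampening (square root for term frequency, logarithm for inverse document frequency) are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document against the query with a simplified TF-IDF sum.

    docs maps a document name to its text. Illustrative only; this is not
    the formula used by Lucene's TFIDFSimilarity.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = math.sqrt(counts[term])  # dampened term frequency
            idf = 1.0 + math.log(n_docs / (doc_freq[term] + 1))  # rarer terms weigh more
            score += tf * idf
        scores[name] = score
    return scores
```

Sorting the returned dictionary by value in descending order gives a ranked result list for the query.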
The questions for the assignment are answered by running `python app/questions.py <index_folder>`, where `<index_folder>` is the path to the index.
The general flow of the `questions.py` script follows:
- Initialize the retriever.
- Retrieve all documents organized by category, that is, the original subfolder that contained the file. For example, if the file was originally saved in `data/ece`, then the category of the file is `ece`.
- Perform KMeans clustering on the sentiments of all documents, producing 3 sentiment centroids ("negative", "neutral", "positive").
- Give each category a sentiment ranking as follows:
  i. Classify each document as "negative", "neutral", or "positive".
  ii. Assign each negative document a value of -1, each neutral document a value of 0, and each positive document a value of 1.
  iii. Sum the values for a category.
- Perform KMeans clustering on the overall sentiments of the categories, producing 3 centroids ("negative", "neutral", "positive").
- Classify each category as "negative", "neutral", or "positive".
This method classifies a category as "negative", "neutral", or "positive" by examining how many documents in the category are "negative", "neutral", or "positive". The documents themselves influence the class boundaries because the boundaries are computed using KMeans rather than fixed in advance.
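
The two-stage clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `questions.py`: each document is assumed to already have a single numeric sentiment score (how that score is produced is not shown), a small deterministic 1-D k-means stands in for whatever KMeans implementation the script uses, and the lowest, middle, and highest centroids are assumed to correspond to "negative", "neutral", and "positive". All function names are hypothetical.

```python
LABELS = ("negative", "neutral", "positive")


def kmeans_1d(values, k=3, iterations=100):
    """Plain 1-D k-means; returns centroids sorted in ascending order."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def classify(value, centroids):
    """Label a value by its nearest centroid (lowest centroid = "negative")."""
    nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
    return LABELS[nearest]


def rank_categories(doc_sentiments):
    """doc_sentiments maps category -> list of per-document sentiment scores."""
    # Stage 1: cluster every document's sentiment, regardless of category.
    all_scores = [s for scores in doc_sentiments.values() for s in scores]
    doc_centroids = kmeans_1d(all_scores)

    # Score each category: -1 per negative, 0 per neutral, +1 per positive document.
    weight = {"negative": -1, "neutral": 0, "positive": 1}
    totals = {
        category: sum(weight[classify(s, doc_centroids)] for s in scores)
        for category, scores in doc_sentiments.items()
    }

    # Stage 2: cluster the category totals and label each category.
    category_centroids = kmeans_1d(list(totals.values()))
    return {c: classify(t, category_centroids) for c, t in totals.items()}
```

Because both sets of centroids are learned from the data, adding or removing documents can shift the boundaries and change how an unchanged category is labeled.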