Application Flow

Indexing

The following steps illustrate crawling and indexing.

1. Choose a folder for your data. The remaining steps assume the data folder is data/. The immediate subfolders of the data folder are the categories.
2. Use Websphinx to crawl pages and save them in your data/ folder.
3. Choose a folder for your index. The remaining steps assume the index folder is index/.
4. Run python app/indexer.py index "data/**/*". The first parameter is the index folder. The second parameter is a Unix glob pattern matching all the files to index.
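
To make the glob pattern and the category convention concrete, here is a minimal sketch of how a pattern such as "data/**/*" could be expanded and how each file could be mapped to its category (the immediate subfolder of the data folder). It assumes Python's standard glob module; the function name files_by_category is hypothetical and is not taken from app/indexer.py, which may expand the pattern differently.

```python
import glob
import os


def files_by_category(pattern, data_root):
    """Group files matched by a recursive glob by their top-level subfolder.

    Hypothetical helper for illustration only; not part of app/indexer.py.
    """
    grouped = {}
    for path in sorted(glob.glob(pattern, recursive=True)):
        if not os.path.isfile(path):
            continue  # "**/*" also matches directories, which are skipped
        parts = os.path.relpath(path, data_root).split(os.sep)
        if len(parts) < 2:
            continue  # a file directly inside the data folder has no category
        grouped.setdefault(parts[0], []).append(path)
    return grouped
```

For example, files_by_category("data/**/*", "data") would place data/ece/page.html under the key ece. Quoting the pattern on the command line, as in the step above, keeps the shell from expanding it before the script sees it.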

Retrieval

The following steps illustrate running a search query. Retrieval is done using Lucene's TFIDFSimilarity class.

1. Run python app/retrieve.py <index_folder>, where <index_folder> is the path to your index.
2. Enter a query.
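
For intuition about TF-IDF ranking, the sketch below scores documents against a query using a simplified formula: the square root of the term frequency multiplied by a logarithmic inverse document frequency. This is an illustrative approximation only. It omits the normalization and boost factors that Lucene applies, and the function name tfidf_scores and the whitespace tokenizer are assumptions, not part of app/retrieve.py.

```python
import math
from collections import Counter


def tfidf_scores(query, documents):
    """Return one simplified TF-IDF score per document (higher is better)."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # the term does not occur in this document
            tf = math.sqrt(counts[term])
            idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))
            score += tf * idf
        scores.append(score)
    return scores
```

A term that appears in few documents gets a larger idf, so documents matching rare query terms rank above documents matching only common ones.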

Questions

The questions for the assignment are answered by running python app/questions.py <index_folder>, where <index_folder> is the path to the index.

The general flow of the questions.py script follows:

1. Initialize the retriever.
2. Retrieve all documents, organized by category. A document's category is the subfolder it was originally saved in. For example, a file saved in data/ece has the category ece.
3. Perform KMeans clustering on the sentiments of all documents, producing 3 sentiment centroids ("negative", "neutral", "positive").
4. Give each category a sentiment ranking:
   i. Classify each document as "negative", "neutral" or "positive".
   ii. Assign each negative document a value of -1, each neutral document a value of 0, and each positive document a value of 1.
   iii. Sum the values for the category.
5. Perform KMeans clustering on the overall sentiments of the categories, producing 3 centroids ("negative", "neutral", "positive").
6. Classify each category as "negative", "neutral" or "positive".

This method classifies a category as "negative", "neutral" or "positive" by examining how many documents in the category are "negative", "neutral" or "positive". The documents themselves influence the class boundaries because the boundaries are computed using KMeans.
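
The two-stage clustering described above can be sketched as follows. This is a minimal illustration under several assumptions: each document already has a single numeric sentiment score, the lowest centroid is labelled "negative" and the highest "positive", and a small pure-Python one-dimensional k-means stands in for whatever KMeans implementation questions.py actually uses. All function names here are hypothetical.

```python
LABELS = ("negative", "neutral", "positive")
VALUE_OF = {"negative": -1, "neutral": 0, "positive": 1}


def kmeans_1d(values, k=3, iterations=50):
    """Tiny 1-D k-means; returns k centroids sorted in ascending order.

    Expects at least k values. Initial centroids are spread across the
    sorted values so that the result is deterministic.
    """
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def classify(value, centroids):
    """Label a value by its nearest centroid (assumes ascending centroids)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
    return LABELS[nearest]


def category_scores(sentiments_by_category):
    """Steps 3 and 4: cluster document sentiments, then sum -1/0/+1 per category."""
    all_sentiments = [s for docs in sentiments_by_category.values() for s in docs]
    doc_centroids = kmeans_1d(all_sentiments)
    return {
        category: sum(VALUE_OF[classify(s, doc_centroids)] for s in docs)
        for category, docs in sentiments_by_category.items()
    }


def classify_categories(sentiments_by_category):
    """Steps 5 and 6: cluster the category sums and label each category."""
    scores = category_scores(sentiments_by_category)
    category_centroids = kmeans_1d(list(scores.values()))
    return {c: classify(score, category_centroids) for c, score in scores.items()}
```

Because both sets of centroids are learned from the data, a category is labelled relative to the other categories in the collection rather than against a fixed threshold.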
