youngsydney/inforetrieval
READ_ME

Project Part 1: Pre-Processing Documents & Building Inverted Index
Sydney Young
October 9, 2015
COSC 488: Introduction to Information Retrieval

How to Run:
Run the run.py file. Its main method controls the processing and indexing of all the documents.

Note: Set the memory constraint at the top of indexer.py, and set the file path for the folder where the data files are kept at the top of run.py.

The output to the screen is the execution time for the various indexes. The lexicon/posting list is stored in indexType_out_path, and the term list with document frequencies is stored in indexType_terms.

term index = {term:docFreq, term:docFreq}
lexicon = {termID: {docID:tf, docID:tf}, termID: {docID:tf}}
lexicon_positional = {termID: {docID:[pos, pos, pos]}, termID: {docID:[pos]}}

Notes about processing:
For the phrase index, special cases get replaced with " STOP " so that, during the phrase identification stage, phrases can't cross over special cases.
For hyphenated words, I merged the word (black-tie became blacktie) and stored only the combined form in the positional index.

To Fix -- .., .,.0, 1$.4, 1$.6
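
As an illustration of the three index layouts and the two processing notes above, here is a minimal sketch in Python. It is not the code from run.py or indexer.py: the function names (tokenize, build_indexes, two_word_phrases), the regex-based tokenizer, and the choice to assign term IDs in order of first appearance are all assumptions made for the example. It also assumes phrases are two-word pairs, which the notes above do not specify.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase, merge hyphenated words, split on non-alphanumerics."""
    text = text.lower()
    # Merge hyphenated words, e.g. "black-tie" -> "blacktie".
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    return re.findall(r"[a-z0-9]+", text)


def build_indexes(docs):
    """Build the three structures from a {docID: text} mapping.

    Returns (term_ids, term_index, lexicon, lexicon_positional), where
      term_index         = {term: docFreq}
      lexicon            = {termID: {docID: tf}}
      lexicon_positional = {termID: {docID: [pos, ...]}}
    term_ids maps each term to its ID (assumed: assigned on first appearance).
    """
    term_ids = {}
    term_index = defaultdict(int)
    lexicon = defaultdict(dict)
    lexicon_positional = defaultdict(dict)

    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            term_id = term_ids.setdefault(term, len(term_ids))
            positions = lexicon_positional[term_id].setdefault(doc_id, [])
            if not positions:
                # First occurrence of this term in this document.
                term_index[term] += 1
            positions.append(pos)
            lexicon[term_id][doc_id] = len(positions)

    return term_ids, dict(term_index), dict(lexicon), dict(lexicon_positional)


def two_word_phrases(tokens):
    """Pair adjacent tokens, skipping any pair that touches a "STOP" marker."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:]) if "STOP" not in (a, b)]


if __name__ == "__main__":
    docs = {1: "black-tie event", 2: "the event the end"}
    term_ids, term_index, lexicon, lexicon_positional = build_indexes(docs)
    print(term_index)
    print(lexicon)
    print(lexicon_positional)
    print(two_word_phrases(["new", "york", "STOP", "city", "hall"]))
```

Because the "STOP" marker sits where a special case used to be, no phrase can span it: in the last call, "york city" is never produced.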
About
Search Engine, COSC 488