NLP naive inverted index

Using the Reuters21578 corpus, Build a naive inverted index which would allow you to query words and return as a result all the documents (i.e. reuters' news articles containing said word).

To download the corpus on your machine type ./download_corpus.sh or sh download_corpus.sh at the root of the project (provided you have a unix based OS, otherwise download the data manually here and put it in a folder called data at the root of the project).

Make sure you have nltk installed by typing conda install nltk or pip install nltk depending on the package manager that you use.

Creating the index

type python indexer.py to generate the index and store it in output/inverted_index.txt if you wish to specify another output file just pass it as argument with the -o or --ouput_file flag.

Compressing the index

To compress your index type python compressor.py. By default, the compresser will use output/inverted_index.txt as it's input file and output/compressed_index.txt as its output file. You can change both these defaults with -i or --input_file and -o or --output_file respectively.

Querying your index

type python query.py -q <insert a single word>. By default the query script will take as input the index stored in output/inverted_index.txt and will store the result of the query in output/sampleQueries.json. Similarly, you can change the defaults with -i or --input_file and -o or --output_file respectively.

Keep in mind that the uncompressed index contains terms with variable casing, so if you query the word airplane, you will get case-sensitive results. However, if you are querying a compressed index, you should keep all characters lower-cased because the compression step will lowercase all dictionary terms. In this case, querying the word airplane will returns case-insenstive results and having a single upper case character while querying a compressed index is garanteed to return 0 results.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
output		output
.gitignore		.gitignore
README.md		README.md
compressor.py		compressor.py
download_corpus.sh		download_corpus.sh
indexer.py		indexer.py
query.py		query.py
test_index_loader.py		test_index_loader.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output

output

.gitignore

.gitignore

README.md

README.md

compressor.py

compressor.py

download_corpus.sh

download_corpus.sh

indexer.py

indexer.py

query.py

query.py

test_index_loader.py

test_index_loader.py

utils.py

utils.py

Repository files navigation

NLP naive inverted index

Creating the index

Compressing the index

Querying your index

About

Releases

Packages

Languages

ribal-aladeeb/NLP_naive_inverted_index

Folders and files

Latest commit

History

Repository files navigation

NLP naive inverted index

Creating the index

Compressing the index

Querying your index

About

Resources

Stars

Watchers

Forks

Languages