TF-IDF

Information Retrieval search by cosine-distance ranking of document vector representations. The document vector representations are based on the term frequency - inverse document frequency (TF-IDF) score, per word, per document.

Given a scientific paper, determine the most relevant wikipedia articles. The scientific paper is the 'query', the returned ranked wikipedia aritlces are the search results. The purpose of this project was to determine the feasibility of applying topic modeling by LDA. This information retrieval search can be viewed as a rudimentary topic decomposition of the statistics papers, by statistical and mathematical wikipedia articles. This project is a step towards the Hopper Project's overall goal of developing statistical machine learning algorithms for extracting representations of text and mathematical expressions and building new search tools for the published scientific literature

The code takes in the paths to two directories, one for the corpus of scientific papers, and the other for the corpus of wikipedia articles. It first determines a common vocabulary between the two corpora (unigrams, bi-grams, frequency cutoffs, etc). Next, document-term matrices are constructed per corpus. Each row is a document, each column a member of the vocabulary. In the cells are the frequency counts. From the document-term matrix, document-TFIDF matrices are constructed. For v in vocabulary and document d in a corpus,

TFIDF(v,d) = TF(v,d) * IDF(v,d)

where TF(v,d) = (count of v in d)/(number of vocab terms in d) and IDF(v,d) = log(number of documents in a corpus/number of documents which contain v)

We now have two document-TFIDF matrices, one for the corpus of scientific papers, the other for the corpus of wikipedia articles. A final matrix is constructed from these two document-TFIDF corpora, the cosine distance between the TFIDF vector representation of each scientific paper (rows) and each wikipedia article (columns).

To perform search over the wikipedia articles given a scientific paper query, find the corresponding row in the matrix and then order the wikipedia articles by their cosine distance scores.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.ipynb_checkpoints		.ipynb_checkpoints
data_objects		data_objects
run1_firstserial/rough		run1_firstserial/rough
run2_bettertfidf		run2_bettertfidf
run3_keyword		run3_keyword
run4_tfidfexpand		run4_tfidfexpand
run5_conditional_boldphrases		run5_conditional_boldphrases
run6_self_differentiate_expanded		run6_self_differentiate_expanded
run7_clipped_testing		run7_clipped_testing
testing/rough		testing/rough
tools/clipper		tools/clipper
.gitignore		.gitignore
.tfidf_data_process.py.swp		.tfidf_data_process.py.swp
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.ipynb_checkpoints

.ipynb_checkpoints

data_objects

data_objects

run1_firstserial/rough

run1_firstserial/rough

run2_bettertfidf

run2_bettertfidf

run3_keyword

run3_keyword

run4_tfidfexpand

run4_tfidfexpand

run5_conditional_boldphrases

run5_conditional_boldphrases

run6_self_differentiate_expanded

run6_self_differentiate_expanded

run7_clipped_testing

run7_clipped_testing

testing/rough

testing/rough

tools/clipper

tools/clipper

.gitignore

.gitignore

.tfidf_data_process.py.swp

.tfidf_data_process.py.swp

README.md

README.md

Repository files navigation

TF-IDF

About

Releases

Packages

Languages

jerry-chee/TF-IDF

Folders and files

Latest commit

History

Repository files navigation

TF-IDF

About

Resources

Stars

Watchers

Forks

Languages