HZ-SKOS*

The Docs Indexer is a simple Python script that uses Apache Tika (via the mapper plugin) to index files, which are submitted in Base64 format. Its main purpose is to index documents such as PDF, Word and Excel files. Once uploaded, the documents are available for searching.

The Calculator is still in development. Its main purpose is to calculate the relevance between every term in the documents and the SKOS concepts. The expected result is a relevance score between the SKOS terms and each term in the documents.

#Docs_Indexer

The main index (structure) of the docs_indexer is defined as follows:

    PUT hzbwnature
    {
      "mappings": {
        "attachment": {
          "properties": {
            "content": {
              "type": "attachment",
              "fields": {
                "content": { "term_vector": "yes", "store": "yes" },
                "author": { "store": "yes" },
                "title": { "store": "yes" },
                "date": { "store": "yes" },
                "keywords": { "store": "yes", "analyzer": "keyword" },
                "name": { "store": "yes" },
                "content_length": { "store": "yes" },
                "content_type": { "store": "yes" }
              }
            }
          }
        }
      }
    }

Storing term vectors makes it easier to do analysis such as finding terms and their scores.
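The attachment field in this mapping takes the file content as a Base64 string. As a minimal sketch of how a file could be prepared for indexing (the helper name `build_attachment_doc`, the client variable `es`, and the file name in the usage comment are assumptions for illustration, not taken from Docsindexer.py):

```python
import base64


def build_attachment_doc(path):
    """Read a file and wrap its Base64-encoded bytes in a document body.

    The "content" key matches the attachment field in the mapping.
    """
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"content": encoded}


# Hypothetical usage with an elasticsearch-py client named `es`:
# es.index(index="hzbwnature", doc_type="attachment",
#          id=1, body=build_attachment_doc("report.pdf"))
```

The helper only prepares the request body; sending it is left to whichever client the script uses.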

You can run the query below to search for relevant indexed terms. Note that if you run it in Sense, the content is returned in Base64 format rather than as readable text. If you want to see the readable text, look it up in Kibana.

    POST /hzbwnature/_search
    {
      "query": {
        "query_string": { "query": "dijk" }
      }
    }
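The same search can be issued from Python. The sketch below builds the request body and pulls ids and scores out of a response; the function names `build_query` and `summarize_hits` and the client variable `es` are hypothetical, and the response layout assumed is the standard `hits.hits` list with `_id` and `_score` keys.

```python
def build_query(text):
    """Build a query_string search body for the given text."""
    return {"query": {"query_string": {"query": text}}}


def summarize_hits(response):
    """Return (document id, score) pairs from a search response dict."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Hypothetical usage with an elasticsearch-py client named `es`:
# response = es.search(index="hzbwnature", body=build_query("dijk"))
# for doc_id, score in summarize_hits(response):
#     print(doc_id, score)
```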

#Cosine Similarity Calculator

To complete this step, store term vectors for the fields you want to compare. In my case these are the document content, the title and the SKOS concept terms. After indexing these fields, we request their term vectors and calculate the similarity between them.

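    # es is an elasticsearch-py client instance
    tvjson = es.termvector(index=index_name, doc_type="page", id=doc_id)

To get the term vector out of the statistics returned, we write a function:

    def get_tv_dict(tvjson):
        """Map each term of the 'page_text' field to its term frequency."""
        terms = tvjson.get('term_vectors').get('page_text').get('terms')
        # .items() works on both Python 2 and Python 3
        return dict((k, v['term_freq']) for k, v in terms.items())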

Once we have the term vectors for the documents, we can calculate the cosine similarity score. The function below calculates the cosine similarity score of two documents given their term vectors.
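A minimal sketch of such a function, assuming each term vector is a plain dict of term to term frequency as returned by `get_tv_dict`. The name `cosine_similarity` and the use of raw term frequencies as weights are assumptions for illustration and may differ from what Vectorspacecalculator.py does.

```python
import math


def cosine_similarity(tv1, tv2):
    """Cosine of the angle between two term-frequency dicts."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(tv1) & set(tv2)
    dot = sum(tv1[t] * tv2[t] for t in shared)
    norm1 = math.sqrt(sum(f * f for f in tv1.values()))
    norm2 = math.sqrt(sum(f * f for f in tv2.values()))
    # An empty vector has no direction; treat its similarity as 0.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


doc_tv = {"dijk": 2, "water": 1}
concept_tv = {"dijk": 1}
score = cosine_similarity(doc_tv, concept_tv)
```

The score ranges from 0 (no shared terms) to 1 (identical term distributions), since term frequencies are never negative.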
