Build docker image

(from the project root dir)

docker build -t kennissessie-text-classification docker/

Run script

(from the project root dir)

docker run -it --rm --name kennissessie-text-classification -v "$PWD":/code kennissessie-text-classification python3 script.py

Assignments

1 Add the TF-IDF vectorizer as a feature extraction algorithm

See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Hint: the TF-IDF vectorizer is used in much the same way as the word count vectorizer, so it can be added alongside it with very few changes.
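
To illustrate the hint, here is a minimal sketch that runs both vectorizers on a few made-up sentences. It assumes the word count vectorizer is scikit-learn's CountVectorizer; the toy documents and variable names are only for illustration and do not come from script.py or common.py.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Word counts: each cell is how often a term occurs in a document.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# TF-IDF: same interface, but terms that occur in many documents are
# down-weighted and each row is L2-normalized by default.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# Both produce a (documents x vocabulary) sparse matrix of the same shape.
print(counts.shape, tfidf.shape)
print(tfidf.toarray().round(2))
```

Because both classes share fit_transform and transform, swapping one for the other usually only means changing the constructor call.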

2 Doc2Vec document similarity

The doc2vec model can return the top 10 most similar documents for a given file id. See: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar

Pick a random document from any of the data sets and find its most similar documents using doc2vec. After reading these texts, do you agree that they are similar?

Hint: you can access the similar docvecs for a document using:

document_id = 'training/10335' # (stored in document['file_id'])
document_vector = doc2vec.docvecs[document_id]
print(doc2vec.docvecs.most_similar(positive=[document_vector]))
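
To build intuition for what most_similar returns, the sketch below ranks documents by cosine similarity, which is the measure gensim uses for this lookup. It is plain Python and not gensim's implementation; the toy vectors, the document ids and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id, vectors, topn=10):
    """Rank all other documents by cosine similarity to the query document."""
    query = vectors[query_id]
    scores = [
        (doc_id, cosine_similarity(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id != query_id  # skip the query document itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hypothetical 3-dimensional document vectors; real doc2vec vectors are much longer.
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [-1.0, 0.0, 0.0],
}

ranking = most_similar("doc_a", toy_vectors, topn=3)
print(ranking)
```

The result is a list of (id, similarity) pairs, the same shape of output as in the hint. One difference to keep in mind: this sketch excludes the query document, whereas querying by raw vector as in the hint will most likely return the document itself as the first hit.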

3 Try out another classification algorithm

Scikit-learn offers many off-the-shelf classification algorithms besides the one currently used in the "classify" function in common.py.

Most of these classification algorithms have many parameters. Try tweaking them to see whether you get better results.

For an overview, see: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
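
As a starting point, the sketch below trains a few scikit-learn classifiers on the same features and compares their accuracy. It does not reflect how "classify" in common.py is written; the toy texts, the labels, the selection of classifiers and the parameter values are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up data set; replace with the real training and test documents.
train_texts = [
    "stocks fell as the market closed lower",
    "the bank raised interest rates again",
    "the team won the final match",
    "the striker scored two goals",
]
train_labels = ["finance", "finance", "sport", "sport"]
test_texts = ["interest rates and the stock market", "a late goal won the match"]
test_labels = ["finance", "sport"]

# Candidate classifiers with a few explicitly set parameters to tweak.
candidates = {
    "naive_bayes": MultinomialNB(alpha=0.5),
    "logistic_regression": LogisticRegression(C=10.0, max_iter=1000),
    "linear_svc": LinearSVC(C=1.0),
}

results = {}
for name, classifier in candidates.items():
    # Same feature extraction for every classifier, so only the model differs.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    results[name] = model.score(test_texts, test_labels)  # accuracy on test set
    print(name, results[name])
```

Keeping the feature extraction fixed while swapping the classifier makes the comparison fair. On a data set this small the scores say little, so run the comparison on the real data before drawing conclusions.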

4 Bonus! Find another labeled text data set and try to add it, visualize it and classify it

mvanzelst/kennissessie-text-classification
