PAN16

Software submitted to the PAN16 Author Clustering task (http://pan.webis.de/clef16/pan16-web/author-identification.html).
The submission (team: sari16) ranked 3rd and 4th best for Mean F-Score and MAP. The complete results are available here (http://www.tira.io/task/author-clustering/dataset/pan16-author-clustering-test-dataset2-2016-04-12/).
The notebook paper will be published soon at http://pan.webis.de/publications.html

Before running the code, please make sure to install all software dependencies (sklearn, gensim).
To run the software, type the following command in a terminal:

  python main.py -c $inputDataset -o $outputDir

The system uses character n-gram features together with K-means clustering. The number of clusters was optimized using the Silhouette Coefficient. <br> Initially, we also tried using word embeddings as features. However, since the results did not show any significant improvement, we decided to use character n-grams in our final software.<br>
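The approach described above can be illustrated with a minimal sketch. It is not the code from this repository: the function name `cluster_documents`, the TF-IDF weighting, the n-gram size, and the search range for the number of clusters are all assumptions chosen for illustration. It only shows the general idea of extracting character n-grams, running K-means for several cluster counts, and keeping the count with the highest Silhouette Coefficient.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_documents(texts, ngram_range=(3, 3), random_state=0):
    """Cluster texts with K-means, picking k by the best silhouette score.

    Hypothetical helper: TF-IDF weighting and the k search range are
    assumptions, not settings taken from the submitted software.
    Needs at least 3 texts; otherwise returns (None, None).
    """
    # Character n-gram features (here: character trigrams by default).
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=ngram_range)
    features = vectorizer.fit_transform(texts)

    best_k, best_score, best_labels = None, -1.0, None
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    for k in range(2, len(texts)):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        # Skip degenerate runs where K-means used too few or too many labels.
        if not 2 <= len(set(labels)) <= len(texts) - 1:
            continue
        score = silhouette_score(features, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

For example, `cluster_documents(list_of_document_strings)` returns the selected number of clusters and one cluster label per document.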

We created a word2vec model for Dutch (see the "model" directory) and used the Google word2vec binary model (you have to download it yourself and put it in the "model" directory).
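One common way to turn word embeddings into a document feature is to average the vectors of the words in the document. The sketch below assumes that scheme; the function name `average_vector` and the toy embedding table are hypothetical, and a plain dict stands in for a loaded word2vec model so that the example runs without downloading anything.

```python
import numpy as np


def average_vector(tokens, embeddings, dim):
    """Return the mean embedding of the tokens found in `embeddings`.

    `embeddings` is any mapping from token to vector (a dict here,
    used as a stand-in for a loaded word2vec model). Unknown tokens
    are skipped; a zero vector is returned if no token is known.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy two-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "een": np.array([1.0, 0.0]),
    "twee": np.array([0.0, 1.0]),
}
doc_vector = average_vector(["een", "twee", "onbekend"], toy_embeddings, dim=2)
```

The resulting document vectors could then be clustered in the same way as the character n-gram features.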
