# Clustering: tutorial and lab

## Part I. Tutorial

### Clustering small documents

To run the code you need to install the Pillow (PIL) graphics library, the same library we used in the decision tree lab.
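If you don't have it yet, it can be installed with pip:

```
pip install Pillow
```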

  1. Read file `titles.txt`. Each line represents a paper title. There are two evident paper types: titles 1-5 represent papers on Human-Computer Interaction, and titles 6-9 papers on Theory of Computing.
  2. Convert the documents into a word matrix using `titles_to_vectors.py`. Look at the matrix in file `titles_vectors.txt`. Note that stopwords, as well as words that occur only once, have been removed. Why is that? (A sketch of this conversion appears after this list.)
  3. Explore different distance metrics by running `euclidean_documents.py`, `tanimoto_documents.py`, `cosine_documents.py`, and `pearson_documents.py`. For example, the distance between d4 and d8 should be significantly larger than the distance between d7 and d8, because d7 and d8 belong to the same topic. What do you observe? Compare the results for vector-based and geometric distances: which ones work better for documents? (Reference versions of the four measures are sketched after this list.)
  4. Now let's explore the distance between words based on their occurrence in the documents. For that, run `pearson_words.py`.
  5. Create 2 clusters using the k-means algorithm implemented in `clusters.py`. For this, run `kmclustertitles.py`. Did k-means find the expected clusters? (A minimal k-means driver is sketched after this list.)
  6. Now cluster words by the documents in which they occur by running `kmclusterwords.py`. Do the groups of words make sense? Try playing with different numbers of clusters. (See the word-clustering sketch after this list.)
  7. Finally, try hierarchical clustering of the documents by running `hclustertitles.py`. Hierarchical clustering seems to work much better here. Why do you think that is? (See the last sketch below.)
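For reference, here is a minimal sketch of the kind of conversion `titles_to_vectors.py` performs in step 2; the stopword list and tokenization here are assumptions, so treat the script itself as authoritative:

```python
# Hypothetical re-implementation of the titles -> word-matrix conversion.
# The real logic lives in titles_to_vectors.py; the stopword list is assumed.
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def titles_to_matrix(lines):
    docs = [re.findall(r"[a-z]+", line.lower()) for line in lines]
    totals = Counter(word for doc in docs for word in doc)
    # Drop stopwords and words that occur only once: a word appearing in a
    # single title cannot make any pair of titles look more similar.
    vocab = sorted(w for w, n in totals.items()
                   if w not in STOPWORDS and n > 1)
    return vocab, [[doc.count(w) for w in vocab] for doc in docs]

if __name__ == "__main__":
    with open("titles.txt") as f:
        vocab, matrix = titles_to_matrix(f.readlines())
    print(vocab)
    for title_id, row in enumerate(matrix, start=1):
        print("d%d" % title_id, row)
```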
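The four distance measures from step 3 are standard, so reference versions are easy to write down; the lab's own implementations (in the scripts above and in `clusters.py`) may differ in small details. All four are written as distances, so larger means less similar:

```python
# Reference versions of the four distance measures compared in step 3.
from math import sqrt

def euclidean(v1, v2):
    # Geometric distance: sensitive to absolute word counts.
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def tanimoto(v1, v2):
    # Set-based: ignores counts, compares which words are present at all.
    c1 = sum(1 for a in v1 if a)
    c2 = sum(1 for b in v2 if b)
    shared = sum(1 for a, b in zip(v1, v2) if a and b)
    return 1.0 - shared / (c1 + c2 - shared)

def cosine(v1, v2):
    # One minus the cosine of the angle: unaffected by document length.
    dot = sum(a * b for a, b in zip(v1, v2))
    return 1.0 - dot / (sqrt(sum(a * a for a in v1)) *
                        sqrt(sum(b * b for b in v2)))

def pearson(v1, v2):
    # One minus the correlation: like cosine, but centers each vector
    # on its mean first.
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    den = (sqrt(sum((a - m1) ** 2 for a in v1)) *
           sqrt(sum((b - m2) ** 2 for b in v2)))
    return 1.0 - num / den
```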
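Step 5 can also be driven by hand. This sketch assumes `clusters.py` follows the well-known Programming Collective Intelligence interface (`readfile`, `kcluster`, `pearson`); check `kmclustertitles.py` for the actual calls used in this lab:

```python
# Minimal k-means driver, assuming a PCI-style clusters.py API.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
k = 2
# kcluster starts from random centroids, so results can vary between runs;
# rerun it a few times if the clusters look off.
best = clusters.kcluster(data, distance=clusters.pearson, k=k)
for i in range(k):
    print("cluster %d:" % i, [rownames[j] for j in best[i]])
```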
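Clustering words instead of documents (step 6) amounts to transposing the matrix first, so rows become words and columns become documents. PCI-style `clusters.py` includes a `rotatematrix` helper for this; the name is an assumption about your copy:

```python
# Cluster words by the documents they occur in: transpose, then k-means.
# rotatematrix is assumed from the PCI-style clusters.py API.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
rdata = clusters.rotatematrix(data)  # rows are now words, columns documents
best = clusters.kcluster(rdata, distance=clusters.pearson, k=5)
for i, group in enumerate(best):
    print("word cluster %d:" % i, [words[j] for j in group])
```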
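The hierarchical run from step 7 renders a dendrogram, which is where Pillow is actually used. Again assuming the PCI-style API (`hcluster`, `drawdendrogram`):

```python
# Hierarchical clustering of the titles with a dendrogram image.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
tree = clusters.hcluster(data, distance=clusters.pearson)
# drawdendrogram renders the tree with Pillow and writes it as a JPEG.
clusters.drawdendrogram(tree, rownames, jpeg="titles_dendrogram.jpg")
```

Open the generated JPEG and compare the tree against the two known topics.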

## Part II. Homework

### Clustering countries

See the Google doc.
