Clustering: tutorial and lab

Part I. Tutorial

To run the code you need to install the graphics library Pillow (PIL), the same library we used in the decision tree lab.

Read the file `titles.txt`. Each line represents a paper title. There are two evident clusters: the first contains papers on Human-Computer Interaction, and the second contains papers on Theory of Computing.
Convert the documents into a word matrix using `titles_to_vectors.py`.
Explore different distance metrics by running `cosine_documents.py`, `pearson_documents.py`, and `tanimoto_documents.py`. What do you observe? Are you satisfied with the distances between each pair of documents? Would Euclidean distance give better results? Check it out.
Create two clusters using the k-means algorithm implemented in `clusters.py`. To do this, run `kmclustertitles.py`. Would changing the number of clusters give better results? Explore.
Now cluster words by the documents in which they occur by running `kmclusterwords.py`. It seems that this clustering works better. Why do you think that is?
Finally, try hierarchical clustering of documents by running `hclustertitles.py`. Hierarchical clustering seems to work much better. Why do you think that is?
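
To make the distance-metric step above more concrete, here is a minimal sketch of how titles might be turned into word-count vectors and compared. It is not the code in `titles_to_vectors.py` or the `*_documents.py` scripts, which may tokenize, filter stop words, and define the metrics differently. The function names and the three toy titles are hypothetical and are not taken from `titles.txt`.

```python
import math
from collections import Counter


def titles_to_matrix(titles):
    """Return (vocabulary, rows); rows[i][j] counts word j in title i."""
    tokenized = [title.lower().split() for title in titles]
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


def cosine_distance(v1, v2):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 1.0
    return 1.0 - dot / (norm1 * norm2)


def pearson_distance(v1, v2):
    """1 - Pearson correlation; 1.0 if either vector has zero variance."""
    n = len(v1)
    mean1, mean2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    if var1 == 0 or var2 == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(var1 * var2)


def tanimoto_distance(v1, v2):
    """1 - |shared words| / |words in either title| (presence only)."""
    in1 = sum(1 for a in v1 if a != 0)
    in2 = sum(1 for b in v2 if b != 0)
    shared = sum(1 for a, b in zip(v1, v2) if a != 0 and b != 0)
    union = in1 + in2 - shared
    if union == 0:
        return 0.0  # two empty titles are treated as identical
    return 1.0 - shared / union


def euclidean_distance(v1, v2):
    """Straight-line distance between the two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# Hypothetical toy titles, for illustration only.
titles = [
    "user interface design",
    "interface design study",
    "graph algorithm complexity",
]
vocabulary, rows = titles_to_matrix(titles)

metrics = [cosine_distance, pearson_distance,
           tanimoto_distance, euclidean_distance]
for metric in metrics:
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            print(metric.__name__, i, j, round(metric(rows[i], rows[j]), 3))
```

Note that titles are short, so most pairs share few or no words. Under cosine and Tanimoto distance, any two titles with no words in common are at the maximum distance of 1.0 regardless of topic, which is worth keeping in mind when judging the results of the scripts above.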
