gitkarsh/clustering_lab

Clustering: tutorial and lab

Part I. Tutorial

Clustering small documents

To run the code, you need to install the Pillow graphics library (the maintained fork of PIL), the same library we used in the decision tree lab.

  1. Read file `titles.txt`. Each line represents a paper title. There are 2 evident clusters: the first cluster represents papers on Human-Computer Interaction, and the second one on Theory of Computing.
  2. Convert the documents into a word matrix using `titles_to_vectors.py`.
  3. Explore different distance metrics by running `cosine_documents.py`, `pearson_documents.py`, and `tanimoto_documents.py`. What do you observe? Are you satisfied with the distances between each pair of documents? Would Euclidean distance give better results? Check it out.
  4. Create 2 clusters using the k-means algorithm implemented in `clusters.py` by running `kmclustertitles.py`. Would a different number of clusters give better results? Explore.
  5. Now cluster words by the documents in which they occur by running `kmclusterwords.py`. This clustering seems to work better. Why do you think that is?
  6. Finally, try hierarchical clustering of documents by running `hclustertitles.py`. Hierarchical clustering seems to work much better. Why do you think that is?
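
To get a feel for how the distance metrics in step 3 differ, here is a minimal, self-contained sketch. It is not the code from `clusters.py` or the `*_documents.py` scripts. The three titles, the helper `title_to_counts`, and the function names are hypothetical stand-ins, and each metric is written in a common textbook form that may differ in detail from the lab's implementation.

```python
import math
import re

# Hypothetical stand-in titles (NOT taken from titles.txt).
titles = [
    "human computer interaction survey",
    "user interface for human computer systems",
    "graph minors and trees",
]

def tokenize(title):
    """Lowercase a title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())

# Shared vocabulary: every distinct word across all titles, in sorted order.
vocabulary = sorted({word for title in titles for word in tokenize(title)})

def title_to_counts(title):
    """Turn a title into a word-count vector over the shared vocabulary."""
    words = tokenize(title)
    return [words.count(word) for word in vocabulary]

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means no shared words."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def pearson_distance(a, b):
    """1 - Pearson correlation of the two count vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return 1.0 - cov / denom if denom else 1.0

def tanimoto_distance(a, b):
    """1 - (shared words / words in either title), ignoring counts."""
    in_a = sum(1 for x in a if x)
    in_b = sum(1 for y in b if y)
    shared = sum(1 for x, y in zip(a, b) if x and y)
    union = in_a + in_b - shared
    return 1.0 - shared / union if union else 1.0

vectors = [title_to_counts(title) for title in titles]

metrics = [euclidean, cosine_distance, pearson_distance, tanimoto_distance]
for metric in metrics:
    print(metric.__name__)
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            print(f"  titles {i} and {j}: {metric(vectors[i], vectors[j]):.3f}")
```

Titles are short, so most pairs share few or no words. That makes the count vectors sparse, which is worth keeping in mind when you compare the metrics and when you think about why clustering words or clustering hierarchically might behave differently.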
