# Clustering: tutorial and lab

## Part I. Tutorial

### Clustering small documents

To run the code you need to install the Pillow (PIL) graphics library, the same library we used in the decision tree lab.
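If you don't have it yet, it can be installed with pip:

```
pip install Pillow
```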

  1. Read file `titles.txt`. Each line represents a paper title. There are two evident paper types: titles 1-5 represent papers on Human-Computer Interaction, and titles 6-9 papers on Theory of Computing.
  2. Convert the documents into a word matrix using `titles_to_vectors.py`. Look at the matrix in file `titles_vectors.txt`. Note that stopwords, as well as words that occur only once, have been removed. Why is that? (A sketch of this conversion appears after this list.)
  3. Explore different distance metrics by running `euclidean_documents.py`, `tanimoto_documents.py`, `cosine_documents.py`, and `pearson_documents.py`. For example, the distance between d4 and d8 should be significantly larger than the distance between d7 and d8, because d7 and d8 belong to the same topic. What do you observe? Compare the results for vector-based and geometric distances: which ones work better for documents? (Reference versions of the four measures are sketched after this list.)
  4. Now let's explore the distance between words based on their occurrence in the documents. For that, run `pearson_words.py`.
  5. Create 2 clusters using the k-means algorithm implemented in `clusters.py`. For this, run `kmclustertitles.py`. Did k-means find the expected clusters? (A minimal k-means driver is sketched after this list.)
  6. Now cluster words by the documents in which they occur by running `kmclusterwords.py`. Do the groups of words make sense? Try playing with different numbers of clusters. (See the word-clustering sketch after this list.)
  7. Finally, try hierarchical clustering of the documents by running `hclustertitles.py`. Hierarchical clustering seems to work much better here. Why do you think that is? (See the last sketch below.)
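For reference, here is a minimal sketch of the kind of conversion `titles_to_vectors.py` performs in step 2; the stopword list and tokenization here are assumptions, so treat the script itself as authoritative:

```python
# Hypothetical re-implementation of the titles -> word-matrix conversion.
# The real logic lives in titles_to_vectors.py; the stopword list is assumed.
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def titles_to_matrix(lines):
    docs = [re.findall(r"[a-z]+", line.lower()) for line in lines]
    totals = Counter(word for doc in docs for word in doc)
    # Drop stopwords and words that occur only once: a word appearing in a
    # single title cannot make any pair of titles look more similar.
    vocab = sorted(w for w, n in totals.items()
                   if w not in STOPWORDS and n > 1)
    return vocab, [[doc.count(w) for w in vocab] for doc in docs]

if __name__ == "__main__":
    with open("titles.txt") as f:
        vocab, matrix = titles_to_matrix(f.readlines())
    print(vocab)
    for title_id, row in enumerate(matrix, start=1):
        print("d%d" % title_id, row)
```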
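The four distance measures from step 3 are standard, so reference versions are easy to write down; the lab's own implementations (in the scripts above and in `clusters.py`) may differ in small details. All four are written as distances, so larger means less similar:

```python
# Reference versions of the four distance measures compared in step 3.
from math import sqrt

def euclidean(v1, v2):
    # Geometric distance: sensitive to absolute word counts.
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def tanimoto(v1, v2):
    # Set-based: ignores counts, compares which words are present at all.
    c1 = sum(1 for a in v1 if a)
    c2 = sum(1 for b in v2 if b)
    shared = sum(1 for a, b in zip(v1, v2) if a and b)
    return 1.0 - shared / (c1 + c2 - shared)

def cosine(v1, v2):
    # One minus the cosine of the angle: unaffected by document length.
    dot = sum(a * b for a, b in zip(v1, v2))
    return 1.0 - dot / (sqrt(sum(a * a for a in v1)) *
                        sqrt(sum(b * b for b in v2)))

def pearson(v1, v2):
    # One minus the correlation: like cosine, but centers each vector
    # on its mean first.
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    den = (sqrt(sum((a - m1) ** 2 for a in v1)) *
           sqrt(sum((b - m2) ** 2 for b in v2)))
    return 1.0 - num / den
```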
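Step 5 can also be driven by hand. This sketch assumes `clusters.py` follows the well-known Programming Collective Intelligence interface (`readfile`, `kcluster`, `pearson`); check `kmclustertitles.py` for the actual calls used in this lab:

```python
# Minimal k-means driver, assuming a PCI-style clusters.py API.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
k = 2
# kcluster starts from random centroids, so results can vary between runs;
# rerun it a few times if the clusters look off.
best = clusters.kcluster(data, distance=clusters.pearson, k=k)
for i in range(k):
    print("cluster %d:" % i, [rownames[j] for j in best[i]])
```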
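Clustering words instead of documents (step 6) amounts to transposing the matrix first, so rows become words and columns become documents. PCI-style `clusters.py` includes a `rotatematrix` helper for this; the name is an assumption about your copy:

```python
# Cluster words by the documents they occur in: transpose, then k-means.
# rotatematrix is assumed from the PCI-style clusters.py API.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
rdata = clusters.rotatematrix(data)  # rows are now words, columns documents
best = clusters.kcluster(rdata, distance=clusters.pearson, k=5)
for i, group in enumerate(best):
    print("word cluster %d:" % i, [words[j] for j in group])
```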
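The hierarchical run from step 7 renders a dendrogram, which is where Pillow is actually used. Again assuming the PCI-style API (`hcluster`, `drawdendrogram`):

```python
# Hierarchical clustering of the titles with a dendrogram image.
import clusters

rownames, words, data = clusters.readfile("titles_vectors.txt")
tree = clusters.hcluster(data, distance=clusters.pearson)
# drawdendrogram renders the tree with Pillow and writes it as a JPEG.
clusters.drawdendrogram(tree, rownames, jpeg="titles_dendrogram.jpg")
```

Open the generated JPEG and compare the tree against the two known topics.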

## Part II. Homework

### Clustering countries

See the Google doc.
