This is a fork from Genetic-Algorithm-on-K-Means-Clustering and was developed as a task for the Evolutive Systems class of 2019 for the Computer Science Masters Course at UFPel.
The resulting paper is available here (in portuguese).
PHD program clustering using Genetic K-Means and BM25 scores
- Corpus of each PHD program made of the title and resume of all their indexed production at the official brazilian plataform Sucupira -- articles, books, presentations, thesis, and etc, with resume being available mostly for dissertation and thesis only
- Using ElasticSearch and the generated dataset, extracted the top 50 scoring keywords for each program with Okapi BM-25
- For each unique term (around 11k), gets the score of that term for each program, generating a matrix
- Minmax normalization for standardization
- Davies–Bouldin index for evaluation of each cluster
- In Genetic
- Rank based selection
- One point crossover
- panda
- numpy
python __main__.py
config.txt
contain control parameters- kmax : maximum number of clusters
- budget : budget of how many times run GA
- numOInd : number of Individual
- Ps : probability of ranking Selection
- Pc : probability of crossover
- Pm : probability of mutation
norm_data.csv
is normalization datacluster_json
is centroid of each clusterresult.csv
is data with labeled to each cluster