To run, fork this repository, then open "Course-Industry Matching.ipynb" in IPython Notebook. All important functions are explained there.
This repository estimates the likelihood of a match between two independent sets of data (e.g. Course to Industry). The algorithm first applies content-based filtering to features extracted from text, then updates the result dynamically through collaborative filtering on existing user profiles.
The likelihood is quantified as a matrix in which each entry gives the relative likelihood of a match between one item from each set. This representation scales as new data is added, and it extends to multi-criteria likelihoods (e.g. Course to Industry to Jobs): multiplying the respective matrices yields the new likelihood relationship.
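As a minimal sketch of the chaining idea, assume two small likelihood matrices with made-up values. The labels, the numbers, and the variable names `course_industry`, `industry_job`, and `course_job` are hypothetical and are not taken from the notebook.

```python
import numpy as np

# Hypothetical labels and values, for illustration only.
courses = ["MARKETING", "ACCOUNTING"]
industries = ["FINANCE", "RETAIL", "MEDIA"]
jobs = ["ANALYST", "SALES"]

# course_industry[i, j]: relative likelihood that course i matches industry j.
course_industry = np.array([
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
])

# industry_job[j, k]: relative likelihood that industry j matches job k.
industry_job = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.5, 0.5],
])

# Chaining Course -> Industry -> Jobs is a plain matrix product.
course_job = course_industry @ industry_job

print(course_job)
```

The result has one row per course and one column per job, so the intermediate set (industries) drops out of the final relationship.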
The steps of the algorithm are as follows:
- Data Mining / Data Gathering
- Data Cleaning
  - text normalization
  - prefix removal
  - abbreviation mapping
  - internal respelling
- Clustering
  - Uses word stemming and word frequency
- Creation of Likelihood Matrix
  - Content-based Filtering
    - Uses cosine similarity of features
    - TF-IDF vectorization of text
- Dynamic Update of Likelihood
  - Collaborative Filtering
    - Uses cosine similarity as well
    - Increases likelihood with each new user record (example below)
      - user course: MARKETING
      - user work industry: FINANCE INDUSTRY
      - result: likelihood match of MARKETING and FINANCE increases
    - Uses cross product of all possible keyword matches
- Repeat the previous step (step 5, Dynamic Update of Likelihood)
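The matrix creation and update steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: the text snippets are invented, `update_likelihood` and its `weight` parameter are hypothetical, and the update is a simple additive bump that stands in for the cosine-similarity and cross-product update described above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-cleaned descriptions (illustrative data only).
course_text = {
    "MARKETING": "advertising consumer sales brand market research",
    "ACCOUNTING": "audit tax financial statement bookkeeping",
}
industry_text = {
    "FINANCE": "bank investment financial audit tax",
    "RETAIL": "store sales consumer brand merchandise",
}
courses = list(course_text)
industries = list(industry_text)

# Content-based step: fit one TF-IDF vocabulary over both sets so that
# course vectors and industry vectors share the same feature space.
vectorizer = TfidfVectorizer()
vectorizer.fit(list(course_text.values()) + list(industry_text.values()))
course_vecs = vectorizer.transform(list(course_text.values()))
industry_vecs = vectorizer.transform(list(industry_text.values()))

# likelihood[i, j]: cosine similarity between course i and industry j.
likelihood = cosine_similarity(course_vecs, industry_vecs)


def update_likelihood(matrix, course, industry, weight=0.1):
    """Return a copy of `matrix` with the (course, industry) entry raised.

    Assumption: a fixed additive `weight` per user record; the real update
    rule may differ.
    """
    updated = matrix.copy()
    updated[courses.index(course), industries.index(industry)] += weight
    return updated


# A new user studied MARKETING and works in the FINANCE industry.
updated = update_likelihood(likelihood, "MARKETING", "FINANCE")

print(likelihood)
print(updated)
```

In this toy data MARKETING and FINANCE share no words, so their content-based likelihood starts at zero and only the user record makes the pair a candidate match.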
The following Python packages are required:
1) pyenchant
   - with AbiWord Enchant
2) stemming
3) numpy
4) scipy
5) sklearn
6) pandas