IWTBA

I Want To Be A (IWTBA) helps you bridge the gap between the job you want and the skills you need.

You can find the live app at IWTBA.io.

Motivation

So you want to learn math, marketing, maybe some statistics. And you're one of the 13+ million people who have chosen to use online resources like Coursera to get yourself there. Great! Browse away: Coursera is structured to be easy to navigate for searches like that.

But what if you want to be something more specific? You want to be a programmer. What's more, you don't want to be just any programmer: you've found your dream job on Stack Overflow Careers and you've got the job listing to prove it.

Put that job listing into IWTBA and it will return a list of courses you should take, structured and categorized so that they are easy to navigate.

Data

IWTBA is intended to sit in the space between jobs and courses, and allow users to find the path from one to the other using unstructured text queries. To train the recommender, I needed data about jobs and courses.

Courses

Coursera has a robust API which I used to scrape the raw text used for training (course title, course descriptions, about sections and syllabi) and metadata (id, categories and icons). Some additional fields seemed promising but either were too generic (course FAQs, recommended backgrounds) or had too many missing values (suggested readings, target audience) to be used.

Jobs

Two sources were used for job listings: GitHub and the NYC government.

GitHub: Listings and job titles were scraped from GitHub Jobs using BeautifulSoup.
NYC: NYC publishes a JSON data dump of job postings as part of the open government movement.

Model

The heart of the model is a matrix mapping courses and jobs to the latent topics discovered in both. Cosine similarity is used to find the most similar courses and jobs to a user-inputted job listing.

Matrix

Job listings and course text data are cleaned and tokenized using stopword removal, stemming and regular expressions.
These text documents are converted into a vector space using TF-IDF. TF-IDF is a numeric representation of text documents that attempts to quantify how important each word is to a document.
The 24,000-dimension TF-IDF matrix is then reduced using singular value decomposition (SVD) to a 1000-dimension matrix. The reduced matrix describes each document in terms of a small number of latent topics instead of many individual words.

These topics have interesting properties, one of which is handling multiple words with similar meanings. A latent topic can encode "programming" and "coding" as related concepts, whereas in TF-IDF each word is a separate feature.
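The TF-IDF and SVD steps can be sketched with sklearn roughly as follows. This is a minimal illustration rather than the project's actual code: the toy corpus, the variable names and the choice of `TruncatedSVD` with 3 components are assumptions made to keep the example small.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned job listings and course texts (assumption).
documents = [
    "python programming and software development",
    "coding in python for data analysis",
    "statistics and probability for data analysis",
    "marketing strategy and brand management",
    "software engineering and coding practices",
]

# Step 1: TF-IDF turns each document into a sparse vector, one feature per word.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# Step 2: truncated SVD projects those vectors onto a few latent topics.
# The README reduces about 24,000 features to 1000; 3 is used here only
# because the toy vocabulary is tiny.
svd = TruncatedSVD(n_components=3, random_state=0)
topic_matrix = svd.fit_transform(tfidf_matrix)

print(tfidf_matrix.shape)  # (number of documents, vocabulary size)
print(topic_matrix.shape)  # (number of documents, number of topics)
```

A new job listing would be passed through `vectorizer.transform` and then `svd.transform` so that it lands in the same topic space as the stored jobs and courses.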

Cosine Similarity

A similarity score is computed between the input and each job and course in the dataset, giving us an unordered bag of likely recommendations.
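Cosine similarity between two vectors is their dot product divided by the product of their lengths. A minimal NumPy sketch of scoring one input against every stored row is shown below; the function name, the zero-vector handling and the example vectors are assumptions for illustration.

```python
import numpy as np


def cosine_similarities(query, matrix):
    """Cosine similarity between one query vector and every row of a matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    # Zero-length rows (or a zero query) get similarity 0 instead of dividing by zero.
    safe_denom = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, 0.0, matrix @ query / safe_denom)


# Hypothetical topic-space vectors: one user query and three stored documents.
query_vector = [1.0, 0.0, 1.0]
stored_vectors = [
    [2.0, 0.0, 2.0],  # same direction as the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [0.0, 0.0, 0.0],  # empty document
]
scores = cosine_similarities(query_vector, stored_vectors)
print(scores)
```

Because the score depends only on direction, a long job listing and a short course description can still match closely if they emphasize the same topics.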

Topic Classification

A support vector classifier (SVC) is used to score each job listing with one or more of 26 different categories (Math, Engineering, etc.). The model was tuned using Coursera course data with category tags as ground truth, and has a cross-validated 40% true positive rate and a 5% false positive rate.
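One way to let an SVC assign several categories at once is to train one binary classifier per category. The sketch below does this with sklearn's `OneVsRestClassifier`; the one-vs-rest setup, the linear kernel, the two-dimensional vectors and the tags are all assumptions for illustration, since the README does not describe the exact configuration.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Hypothetical topic-space vectors for six courses (2 topics for readability).
course_vectors = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
    [0.1, 0.9], [0.7, 0.7], [0.8, 0.6],
])
# Category tags as ground truth; a course may carry several tags.
course_tags = [
    ["Math"], ["Math"], ["Engineering"],
    ["Engineering"], ["Math", "Engineering"], ["Math", "Engineering"],
]

binarizer = MultiLabelBinarizer()
tag_matrix = binarizer.fit_transform(course_tags)  # one 0/1 column per category

# One binary SVC per category, so an input can receive one or many categories.
classifier = OneVsRestClassifier(SVC(kernel="linear"))
classifier.fit(course_vectors, tag_matrix)

# Classify a new (hypothetical) topic-space vector.
predicted = classifier.predict(np.array([[0.9, 0.05]]))
print(binarizer.inverse_transform(predicted))
```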

Additionally, I ran the tuned model on each Coursera course to determine which of its categories is its primary category.

Job Predictions and Recommendations

I return the 3 most similar jobs above a minimum threshold as job title predictions. The job title gives the user some insight into what data is driving the recommendations and confidence about those recommendations.
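A minimal sketch of selecting the job title predictions follows. The function name and the 0.3 cutoff are hypothetical; the README does not state the actual minimum threshold.

```python
def predict_job_titles(similarities, titles, min_similarity=0.3, top_n=3):
    """Return up to top_n titles whose similarity meets the minimum, best first."""
    ranked = sorted(zip(similarities, titles), key=lambda pair: pair[0], reverse=True)
    return [title for score, title in ranked[:top_n] if score >= min_similarity]


# Hypothetical similarity scores for five stored job listings.
example_scores = [0.9, 0.2, 0.5, 0.4, 0.35]
example_titles = ["Data Scientist", "Accountant", "Data Analyst",
                  "Statistician", "Software Engineer"]
print(predict_job_titles(example_scores, example_titles))
```

If fewer than three jobs clear the threshold, the function simply returns fewer titles.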

The topic classifications are used to structure my recommendations. I recommend all courses that meet a very strict similarity threshold, and then, for each category assigned to the job, I recommend courses that meet a lower threshold.
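The two-tier logic can be sketched as below. The threshold values, the tuple layout and the choice to list a strict match only once (rather than repeating it under its category) are assumptions; the README does not specify them.

```python
from collections import defaultdict


def recommend_courses(courses, job_categories, strict=0.8, relaxed=0.5):
    """Group courses into a top tier plus one list per job category.

    courses: iterable of (title, primary_category, similarity) tuples.
    strict / relaxed: hypothetical thresholds, not the project's real values.
    """
    top_matches = []
    by_category = defaultdict(list)
    for title, category, similarity in courses:
        if similarity >= strict:
            # Very similar courses are recommended regardless of category.
            top_matches.append(title)
        elif category in job_categories and similarity >= relaxed:
            # Otherwise the course must share a category with the job.
            by_category[category].append(title)
    return top_matches, dict(by_category)


# Hypothetical scored courses and job categories.
example_courses = [
    ("Linear Algebra", "Math", 0.85),
    ("Calculus", "Math", 0.6),
    ("Circuits", "Engineering", 0.55),
    ("Art History", "Arts", 0.7),
    ("Probability", "Math", 0.4),
]
top, grouped = recommend_courses(example_courses, {"Math", "Engineering"})
print(top)
print(grouped)
```

Categories with no qualifying course never appear in the returned dictionary, so a page built from it can omit those headings.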

The Webapp

The webapp was built using Flask on a Bootstrap template, and is hosted on AWS EC2. The recommendation page is responsive: headings change or disappear according to which categories I have recommendations for.

You can find the live app at IWTBA.io.

Future Extensions

Curriculum generation: recommending courses in a specific order.
Additional learning resources: adding sources such as YouTube and books.

Packages Used

BeautifulSoup
NLTK
sklearn
NumPy
Flask
