Kaggle - Job salary prediction competition

I competed in another Kaggle competition, Job Salary Prediction. The goal of the competition was to predict the salary from a job ad. It is an NLP competition: we have the full text of the Title, Full description and Location of each ad. We also know where in the UK the ad was placed and on which website.

The error metric is mean absolute error.
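As a quick illustration of the metric, here is a minimal sketch in plain Python. The function name and the sample salaries are made up for the example and are not taken from the competition data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all ads."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one value")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical salaries, for illustration only.
actual = [25000.0, 30000.0, 45000.0]
predicted = [27000.0, 29000.0, 40000.0]
mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))
```

Every error counts in proportion to its size, so one badly mispredicted high salary hurts less here than it would under a squared-error metric.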

Stats

I tried 291 combinations of models, parameters and features. I spent 75 hours on this competition, starting in March. I made 9 submissions to the public leaderboard.

Models

I used 2 different models:

Vowpal Wabbit
Extra Trees regressor (ETr) from the scikit-learn Python library

I tried many other models, but they didn't work as well. Random forest was a little worse than ETr.

All models predicted the log of the salary unless noted otherwise.
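Predicting the log of the salary can be sketched as a thin wrapper around any model with `fit` and `predict`. This is only an illustration: `LogTargetRegressor` and the toy `MeanRegressor` are hypothetical names, not code from this repository.

```python
import math


class MeanRegressor:
    """Toy stand-in model: always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LogTargetRegressor:
    """Trains the wrapped model on log(salary) and converts predictions back."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, salaries):
        self.model.fit(X, [math.log(s) for s in salaries])
        return self

    def predict(self, X):
        # The wrapped model outputs log-salaries; exp() returns them to pounds.
        return [math.exp(p) for p in self.model.predict(X)]


# Hypothetical data, for illustration only.
X_train = [[0], [1], [2]]
salaries = [20000.0, 40000.0, 80000.0]
model = LogTargetRegressor(MeanRegressor()).fit(X_train, salaries)
pred = model.predict([[3]])[0]
print(round(pred, 2))
```

With the toy model the prediction is the geometric mean of the training salaries rather than the arithmetic mean, which shows how the log transform reduces the pull of very high salaries.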

Features

For Vowpal Wabbit I used one model with all features, and another with the location split into 5 parts. These are the same models as described here. Sadly, I didn't try to improve the Vowpal Wabbit score; if I had, my result would probably be better.
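For readers who have not used Vowpal Wabbit, the sketch below shows how one ad could be turned into a line of its text input format (`label |namespace feature feature ...`). The namespace letters `t`, `d` and `l`, the cleaning rule and the function names are assumptions made for this example; they are not necessarily what the scripts in this repository do, and the 5-part location split is not shown.

```python
import math
import re


def clean(text):
    """Lowercase and keep only letters, digits and spaces.

    ':' and '|' have a special meaning in the VW input format,
    so they must not appear inside feature names.
    """
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def to_vw_line(salary, title, description, location):
    """Build one VW training line with the log-salary as the label."""
    parts = ["%.6f" % math.log(salary)]
    for namespace, text in (("t", title), ("d", description), ("l", location)):
        parts.append("|" + namespace + " " + " ".join(clean(text)))
    return " ".join(parts)


# Hypothetical ad, for illustration only.
line = to_vw_line(30000, "Senior Developer: Python",
                  "Great role | apply now", "London")
print(line)
```

Putting each text field in its own namespace lets Vowpal Wabbit treat the same word differently depending on whether it appears in the title, the description or the location.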

For ETr I used different features:

1. 200 most frequent words in Title, FullDescription and LocationRaw
2. Same as the first, but with tf-idf normalized values
3. Label encoded values in: Category, Contract Time, Contract Type
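The three feature types above can be sketched with the standard library alone. This is a simplified illustration, not the repository's code: the function names, the whitespace tokenizer and the smoothed IDF formula are assumptions. In scikit-learn, `CountVectorizer(max_features=200)`, `TfidfVectorizer(max_features=200)` and `LabelEncoder` cover the same ground.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_vocabulary(texts, size=200):
    """The `size` most frequent words over all texts."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return [word for word, _ in counts.most_common(size)]


def count_features(text, vocabulary):
    """Feature type 1: raw counts of the vocabulary words in one text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def tfidf_features(texts, vocabulary):
    """Feature type 2: counts weighted by a smoothed IDF, L2-normalized per row."""
    n = len(texts)
    doc_freq = Counter(w for text in texts for w in set(tokenize(text)))
    idf = [math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocabulary]
    rows = []
    for text in texts:
        weighted = [tf * i for tf, i in zip(count_features(text, vocabulary), idf)]
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        rows.append([w / norm for w in weighted])
    return rows


def label_encode(values):
    """Feature type 3: map each distinct category to an integer code."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical ads, for illustration only.
titles = ["senior python developer", "junior python developer", "care assistant"]
vocab = build_vocabulary(titles, size=2)
counts = count_features("python python tester", vocab)
tfidf_rows = tfidf_features(titles, vocab)
codes, mapping = label_encode(["permanent", "contract", "permanent"])
```

Integer codes imply an ordering that the categories do not have; tree-based models such as ETr tend to tolerate this better than linear models do.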

I also tried to get better features with Gensim but it didn't work out this time.

Final model

The final model was the average of 6 models:

Vowpal Wabbit with all features
Vowpal Wabbit with all features and location split in 5 parts
ETr with 30 trees with features 1 and 3
ETr with 40 trees with features 1 and 3
ETr with 40 trees with features 1 and 3, with normal predictions (non-log)
ETr with 40 trees with features 2 and 3

I chose this combination because these submodels gave the best results in cross-validation. I was 26/285 on the public leaderboard, and TBD on the private one.
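Averaging models that predict on different scales needs one extra step: log predictions have to be converted back to salaries before they are mixed with the non-log model. The sketch below assumes a plain unweighted mean taken on the salary scale; the README does not say on which scale the averaging was done, and the function name is hypothetical.

```python
import math


def average_predictions(model_outputs):
    """Blend several models by unweighted mean on the salary scale.

    model_outputs is a list of (predictions, is_log) pairs, where
    is_log says whether that model predicted log(salary).
    """
    n_rows = len(model_outputs[0][0])
    blended = []
    for row in range(n_rows):
        total = 0.0
        for preds, is_log in model_outputs:
            value = preds[row]
            total += math.exp(value) if is_log else value
        blended.append(total / len(model_outputs))
    return blended


# Hypothetical predictions from two models, for illustration only.
outputs = [
    ([math.log(30000), math.log(50000)], True),   # model trained on log salary
    ([34000.0, 46000.0], False),                  # model trained on raw salary
]
blend = average_predictions(outputs)
print([round(b, 2) for b in blend])
```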

What I learned

Cloud can be very useful. (I used PiCloud.)
A 32-bit computer, even with PAE, cannot use 8 GB of memory in scikit-learn.
Make everything scriptable. (I am getting better at this, but it still takes too long to change some parameters and run the same model on the test data and validation data. Ramp would probably help, but I didn't want to use it now because it uses pandas. I didn't like pandas because it puts the whole dataset in RAM and I thought its read speed was poor. I saw too late that its read speed is actually much better.)
Sleep is good. (Better to sleep than to work until 2-3 AM.)

Things to look into

Ramp
Spearmint (automatic parameter tuning)
Vowpal wabbit

Things I already do

Split the train set into 60% train, 20% validation and 20% test
Cross validation
Version control
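The 60/20/20 split from the list above can be sketched as a shuffled index split. The function name, the fixed seed and the rounding rule are assumptions made for the example.

```python
import random


def split_indices(n_rows, seed=0, train=0.6, validation=0.2):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split repeatable
    n_train = round(n_rows * train)
    n_valid = round(n_rows * validation)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_valid],
        indices[n_train + n_valid:],       # whatever is left is the test part
    )


train_idx, valid_idx, test_idx = split_indices(100)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Keeping the seed fixed means every model is scored on the same validation and test rows, so scores from different runs stay comparable.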


oasic/kaggle-job-salary
