jhamilius/data-challenge-spark

Data Challenge Spark DSSP4, 07/16

How to execute on the cluster

To train the model:

spark-submit --master yarn --num-executors 8 --driver-memory 2g --conf spark.ui.port=7770 code/evaluation.py

To generate the predictions.txt file:

spark-submit --master yarn --num-executors 8 --driver-memory 2g --conf spark.ui.port=7770 code/classify.py

To test the predictions:

spark-submit --master yarn --num-executors 8 --driver-memory 2g --conf spark.ui.port=7770 code/evaluate_F.py
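
The three commands above differ only in the script they submit, so they can be chained from a small driver. The following is a minimal sketch, not part of this repository: the names COMMON_ARGS, STEPS, build_command and run_all are hypothetical, and it assumes spark-submit is on the PATH and that it is run from the repository root. By default it only prints the commands (dry run).

```python
import subprocess

# Options shared by all three commands in this README.
COMMON_ARGS = [
    "spark-submit",
    "--master", "yarn",
    "--num-executors", "8",
    "--driver-memory", "2g",
    "--conf", "spark.ui.port=7770",
]

# Scripts in the order the README runs them.
STEPS = [
    "code/evaluation.py",   # train the model
    "code/classify.py",     # generate the predictions.txt file
    "code/evaluate_F.py",   # test the predictions
]


def build_command(script):
    """Return the spark-submit argument list for one script."""
    return COMMON_ARGS + [script]


def run_all(dry_run=True):
    """Print each command in order; when dry_run is False, also execute it.

    check=True makes subprocess.run raise on a non-zero exit code,
    so a failed step stops the remaining ones.
    """
    for script in STEPS:
        cmd = build_command(script)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling run_all(dry_run=False) on the cluster would submit the three jobs one after the other, stopping if any of them fails.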

List of files

  • evaluation.py : trains the model on the training data (main file)
  • preProcessing.py : cleans the data before training
  • extract_terms.py : applies feature transformations to the dataset
  • helpers.py : other helper functions
  • loadFiles.py : loads the data

Predictions
