To run on EMR cluster without docker :
Create a s3 bucket named myprogrambucket123
upload ValidationDataset.csv, TrainingDataset.csv , testerD115.py and trainerD115.py in it
aws s3 cp s3://myprogrambucket123/TrainingDataset.csv ~ #to copy csv file from s3 to emr
aws s3 cp s3://myprogrambucket123/ValidationDataset.csv ~
spark submit trainerD115.py #saves the model in the s3 bucket
in current folder use command (to download model that was uploaded by trainerD115.py from s3 bucket if not present) :
aws s3 cp s3://myprogrambucket123/ . --recursive
Format : spark-submit testerD115.py #to run testerD115.py eg : spark-submit testerD115.py
For Docker : Download spark 3.0 tar and untar it in "spark" folder in the current directory.
sudo docker build -t ishani/cs_643_proj
sudo docker run -v /root/ishani/ValidationDataset.csv:/root/ishani/ValidationDataset.csv ishani/cs_643_proj /root/ishani/ValidationDataset.csv
docker pull ishani/cs_643_proj:latest