Two scripts handle the data preprocessing: Judge_Bio_Dataset_Preprocess.py runs first, and Data+prep1.py runs second.
Before preprocessing, unzip the sentencing text files and change the paths in the scripts to relative paths. Preprocessing may take hours or even days, so we have also uploaded the preprocessed data, which lets you run the models directly.
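A minimal sketch of the relative-path step above, assuming the unzipped texts live in a directory such as "sentencing_texts" (the directory and file names here are illustrative, not the project's actual layout):

```python
from pathlib import Path

# Sketch: point the preprocessing scripts at the unzipped sentencing text
# files via relative paths. "sentencing_texts" and the file name below are
# assumed example names, not the repository's actual ones.
def make_relative(path, root=None):
    """Return path relative to root (default: current directory) when possible."""
    path = Path(path)
    root = Path(root) if root is not None else Path.cwd()
    try:
        return path.relative_to(root)
    except ValueError:  # path lies outside root; leave it unchanged
        return path

print(make_relative(Path.cwd() / "sentencing_texts" / "case_0001.txt"))
```

Paths that fall outside the chosen root are returned untouched, so absolute paths on another drive or mount are left as-is.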
In DeepOLS&SecondStage.py and model_performance.py, we compare the vectorizers. The file cc_merged_0429.csv is a table of raw text data; it is too large to upload to GitHub, so we provide a separate download link: https://drive.google.com/file/d/1b8OGjZf__hxe_olYdPzYqCofTtbusXhr/view?usp=sharing
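To illustrate what a vectorizer comparison involves, here is a toy, standard-library-only sketch (not the project's actual code) contrasting raw term counts with TF-IDF weighting, which downweights terms that appear in many documents; the example documents are invented:

```python
import math
from collections import Counter

# Toy corpus (invented for illustration only).
docs = [
    "the judge imposed a long sentence",
    "the judge reduced the sentence",
    "the appeal was denied",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for d in tokenized for w in d))
df = Counter(w for d in tokenized for w in set(d))  # document frequency
n = len(docs)

def count_vector(doc):
    """Raw term counts over the shared vocabulary."""
    c = Counter(doc)
    return [c[w] for w in vocab]

def tfidf_vector(doc):
    """Term count scaled by smoothed inverse document frequency."""
    c = Counter(doc)
    return [c[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in vocab]

for d in tokenized:
    print(count_vector(d))
    print([round(x, 2) for x in tfidf_vector(d)])
```

Under TF-IDF, a word like "the" that occurs in every document gets weight 1.0 per occurrence, while rarer words like "judge" score higher, which is the basic behavioral difference a vectorizer comparison measures.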
The models can be run with the bash script bashscript.sh; please put the code in the same directory as the data. Both the Jupyter notebooks and the .py files should then run without issues. We were not able to test on the server, because a permission issue prevented us from installing a virtual environment there.
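A sketch of the intended invocation, with a guard in case the script and data are not yet in the working directory (the guard and message are ours, not part of bashscript.sh):

```shell
# Run the pipeline; bashscript.sh must sit in the same directory as the data.
if [ -f bashscript.sh ]; then
    bash bashscript.sh
else
    echo "bashscript.sh not found: put the code and data in one directory"
fi
```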
We also provide Python notebooks that illustrate the code.