Two scripts handle the data preprocessing: Judge_Bio_Dataset_Preprocess.py runs first, and Data+prep1.py runs second.
Before preprocessing, unzip the sentencing text files and change the paths in the scripts to relative paths. Preprocessing may take hours or even days, so we have also uploaded the preprocessed data, which lets you run the models directly.
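A minimal sketch of the relative-path step above, assuming the unzipped texts live in a directory such as "sentencing_texts" (the directory and file names here are illustrative, not the project's actual layout):

```python
from pathlib import Path

# Sketch: point the preprocessing scripts at the unzipped sentencing text
# files via relative paths. "sentencing_texts" and the file name below are
# assumed example names, not the repository's actual ones.
def make_relative(path, root=None):
    """Return path relative to root (default: current directory) when possible."""
    path = Path(path)
    root = Path(root) if root is not None else Path.cwd()
    try:
        return path.relative_to(root)
    except ValueError:  # path lies outside root; leave it unchanged
        return path

print(make_relative(Path.cwd() / "sentencing_texts" / "case_0001.txt"))
```

Paths that fall outside the chosen root are returned untouched, so absolute paths on another drive or mount are left as-is.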
In DeepOLS&SecondStage.py and model_performance.py, we compare the vectorizers. The file cc_merged_0429.csv is a table of raw text data; it is too large to upload to GitHub, so we provide a separate download link: https://drive.google.com/file/d/1b8OGjZf__hxe_olYdPzYqCofTtbusXhr/view?usp=sharing
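To illustrate what a vectorizer comparison involves, here is a toy, standard-library-only sketch (not the project's actual code) contrasting raw term counts with TF-IDF weighting, which downweights terms that appear in many documents; the example documents are invented:

```python
import math
from collections import Counter

# Toy corpus (invented for illustration only).
docs = [
    "the judge imposed a long sentence",
    "the judge reduced the sentence",
    "the appeal was denied",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for d in tokenized for w in d))
df = Counter(w for d in tokenized for w in set(d))  # document frequency
n = len(docs)

def count_vector(doc):
    """Raw term counts over the shared vocabulary."""
    c = Counter(doc)
    return [c[w] for w in vocab]

def tfidf_vector(doc):
    """Term count scaled by smoothed inverse document frequency."""
    c = Counter(doc)
    return [c[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in vocab]

for d in tokenized:
    print(count_vector(d))
    print([round(x, 2) for x in tfidf_vector(d)])
```

Under TF-IDF, a word like "the" that occurs in every document gets weight 1.0 per occurrence, while rarer words like "judge" score higher, which is the basic behavioral difference a vectorizer comparison measures.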
The models can be run with the bash script bashscript.sh; please put the code in the same directory as the data. Both the Jupyter notebooks and the .py files should then run without issues. We were not able to test on the server, because a permission issue prevented us from installing a virtual environment there.
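A sketch of the intended invocation, with a guard in case the script and data are not yet in the working directory (the guard and message are ours, not part of bashscript.sh):

```shell
# Run the pipeline; bashscript.sh must sit in the same directory as the data.
if [ -f bashscript.sh ]; then
    bash bashscript.sh
else
    echo "bashscript.sh not found: put the code and data in one directory"
fi
```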
We also provide Python notebooks that illustrate the code.