Anti fungal peptide prediction

-----> File description

train.csv - Training data from Kaggle

test.csv - Test data from Kaggle, for which the model has to produce predictions

result.csv - Predictions produced by mainCode.py (from our model)

Estimation.py - This Python script selects the best possible model with respect
to ROC AUC and then estimates its best parameter

mainCode.py - This commented Python script trains and tests the model, evaluates its accuracy,
and produces the results

Cross_Validation_score.png - Cross-validation scores from model selection in Estimation.py

Roc_graph.png - ROC graph

-----> Main idea

A machine learning model cannot take characters as input, only numbers, and the size of that numeric input must be constant. Here we have to train a model on protein sequences, which are strings of characters, so we convert the characters into meaningful numbers, i.e. numbers that preserve the importance of each character while keeping the input size fixed regardless of sequence length. To do this we have the following encodings:

Frequency matrix
Binary array conversion
Composition matrix using pfeatures

Of the options above, the frequency matrix gives us the best result. For each sequence we build an array of length 20 in which each index stores the frequency of the corresponding character: for example, index 0 represents the character A, so the number at index 0 is the frequency of A in the entire sequence. Because it records how often each character occurs, it retains information relevant to the sequence. And because the frequency matrix of every sequence has length 20, the input length is constant.
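
The encoding described above can be sketched as follows. This is a minimal illustration, not the code from mainCode.py: the function name `frequency_vector`, the alphabetical residue order, the use of raw counts rather than normalised fractions, and the choice to skip non-standard characters are all assumptions.

```python
# Assumed ordering: the 20 standard amino acids, alphabetical by one-letter
# code, so that index 0 corresponds to A as described in the text.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(sequence):
    """Return a length-20 list of residue counts for one peptide sequence."""
    counts = [0] * len(AMINO_ACIDS)
    for residue in sequence.upper():
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # assumption: characters outside the 20 residues are skipped
            counts[idx] += 1
    return counts


# Sequences of different lengths map to vectors of the same length.
short = frequency_vector("AACKK")
longer = frequency_vector("GLFDIVKKVVGALGSL")
print(len(short), len(longer))
```

Whatever the length of the input sequence, the output always has 20 entries, which is what makes it usable as a fixed-size model input.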

-----> Fitting frequency matrices into model
After calculating the frequency matrices, we run them through a model selection function that plots a graph of model vs. roc_auc score, which is stored in Cross_Validation_score.png. From this graph we can clearly see that AdaBoost with random forest performs better than the other classifiers. There are many classifiers available, but we chose the types of classifier that are best suited to classifying sequences.
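
A comparison of this kind could look roughly like the sketch below, using scikit-learn. It is not the code from Estimation.py: the candidate classifiers, their settings, and the synthetic 20-feature data standing in for the frequency matrices are assumptions made so the example runs on its own, and the plotting step is left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: 20 numeric features per sample, like a length-20 frequency vector.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical candidate list; the source does not name the other classifiers.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
    # The base estimator is passed positionally so the call works whether the
    # installed scikit-learn names the parameter `estimator` or `base_estimator`.
    "adaboost_random_forest": AdaBoostClassifier(
        RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0),
        n_estimators=10,
        random_state=0,
    ),
}

# Mean cross-validated ROC AUC for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On synthetic data the ranking carries no meaning; the point is only the pattern of scoring every candidate with the same cross-validation split and metric.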

After choosing AdaBoost with random forest as our model, we have to find its best parameter. For this we used a grid search function with ROC as the scoring metric, which found that the best n_estimator is 300 from the list of [1,10,30,50,70,100,130,150,170,200,230,250,300] estimators.

The code for these two steps is stored in the Estimation.py file.
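
For illustration, a grid search over the number of estimators might be set up as below with scikit-learn's `GridSearchCV`. This is a sketch under assumptions, not the contents of Estimation.py: the data is synthetic, the random forest settings are invented, and the grid is shortened to three values so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data with 20 features per sample.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# AdaBoost with a small random forest as its base estimator (passed positionally
# for compatibility across scikit-learn versions).
model = AdaBoostClassifier(
    RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0),
    random_state=0,
)

# Shortened grid (assumption); the project searched a longer list up to 300.
param_grid = {"n_estimators": [1, 10, 30]}

search = GridSearchCV(model, param_grid=param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_params_` holds the winning value from the grid, and `search.best_estimator_` is the model refitted on all of the data with that value.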

After getting the model and its best parameter, we fit the model on the frequency matrices and calculate our results for the Kaggle test data, stored in the result.csv file for Kaggle.
Accuracy score on internal data (test data obtained by splitting the Kaggle train data) is 89%; AUC score for ROC on internal data is approx. 93%.
Accuracy score on external data (test file from Kaggle) is approx. 92%.
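
The internal evaluation can be sketched as a hold-out split scored with accuracy and ROC AUC. The example below is an assumption-laden illustration rather than the code in mainCode.py: the data is synthetic, and the 80/20 split ratio and model settings are invented, so its scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the frequency matrices built from train.csv.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out part of the training data as the "internal" test set (ratio assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = AdaBoostClassifier(
    RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0),
    n_estimators=10,
    random_state=0,
)
model.fit(X_train, y_train)

# Accuracy uses hard class labels; ROC AUC uses the positive-class probability.
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"accuracy: {accuracy:.3f}  roc_auc: {auc:.3f}")
```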
