Protein-Purification-Model-Public

Surrodash by Just: a Capstone Project (JCP)

A surrogate modeling approach that predicts yield and purity from molecular interaction parameters 1600 times faster than the mechanistic model

Project Background

This project was sponsored by Just - Evotec Biologics, a company focused on designing technologies to accelerate the development of biotherapeutics while reducing the manufacturing cost. Just's strengths are in molecular design, process and product design, and manufacturing plant design.

This project exists at the interface of molecular and process design; the best way to build an affordable manufacturing process is to start with molecules and known operating conditions that produce high protein yields and high purity.

So, how do we find the combination of molecular interaction parameters and operating conditions that gives the highest yield and purity?

Companies have used mechanistic models (built on scientific principles and run through computer simulations) in the past to iterate through a list of molecular interaction parameters/operating conditions to predict yield and purity. However, this process requires a lot of time and computing power. What if machine learning could take this already generated data and accurately predict yield and purity based on the same input parameters in a fraction of the time?

Surrodash by JCP uses the provided mechanistic model to produce datasets, then trains surrogate models on those datasets to predict yield and purity using less time and computational power. The trained predictive models can then be used to pinpoint the molecular interaction parameters and operating conditions that produce the highest yield and purity, allowing Just to focus on the best initial candidates when it comes to molecular and process design.

Project Use Cases and Components

Use Case 1

Module: Mechanistic Model by Just

Function: Generate testing and training datasets.

The mechanistic model (MM), provided by our sponsor Just, was modified to use the most accurate parameters we found in the literature. It produces datasets that pair molecular species interaction parameters with the resulting yield and purity. Data was generated by sampling parameters from a uniform distribution to give the widest range of possible outputs. An example notebook for viewing the data produced by the MM can be found in the surrogate_models/notebooks directory.
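The sampling step described above can be sketched with the standard library alone. This is only an illustration: the parameter names, the bounds, and the `sample_parameter_sets` helper are hypothetical and are not taken from the MM code.

```python
import random

# Hypothetical parameter names and bounds, chosen only for illustration;
# the real MM inputs and their literature ranges may differ.
PARAMETER_BOUNDS = {
    "binding_constant": (0.1, 10.0),
    "salt_concentration": (0.05, 1.0),
    "column_load": (1.0, 50.0),
}


def sample_parameter_sets(bounds, n_samples, seed=0):
    """Draw n_samples parameter sets, each value uniform within its bounds."""
    rng = random.Random(seed)  # seeded so the sampled inputs are reproducible
    return [
        {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        for _ in range(n_samples)
    ]


samples = sample_parameter_sets(PARAMETER_BOUNDS, n_samples=5)
for row in samples:
    print(row)
```

Each sampled row would then be passed to the MM, and the resulting yield and purity recorded next to the inputs.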

Use Case 2

Module: Surrogate Model by JCP

Function: Predict yield and purity for a given set of molecular species interaction parameters faster than the MM.

This Python package cleans the input data produced by the MM and trains four models (deterministic linear regression, probabilistic linear regression, deterministic neural network (NN), and probabilistic NN) to predict yield and purity for a given set of molecular species interaction parameters. An example notebook that walks through training a model, visualizing its accuracy, and saving it for later use can be found in the surrogate_models/notebooks directory. We have also included an example notebook on how to use our functions to run K-fold cross validation on your models.
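For readers unfamiliar with K-fold cross validation, the idea can be sketched without any dependencies. This is not the package's own implementation: the helper names are hypothetical, the model is a one-input least-squares line rather than one of the four models above, and the data is a toy stand-in for MM output.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Toy data: a stand-in "yield" that depends linearly on one input, plus noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 1.0) for _ in range(50)]
ys = [0.8 * x + 0.1 + rng.gauss(0.0, 0.01) for x in xs]
print(cross_validate(xs, ys, k=5))
```

Every sample is held out exactly once, so the spread of the per-fold errors gives a rough sense of how sensitive the model is to the particular training split.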

Use Case 3

Module: Dash App by JCP

Function: Visualize input datasets, load/train/test/save models, visualize model accuracy, and produce training curves.

This Dash app visualizes data produced by the MM to show general trends in the data. The app also lets the user train, test, load, and save their own models, visualize model accuracy, and validate the training process through training curves. Users can lasso-select points on a plot to inspect the predictions that fall furthest from the true values, which shows which regions of input parameter space each model struggles with most. A demo video showing the functionality of the Dash app can be found in the dash_apps/apps/assets directory on GitHub.

Setup and Operating Instructions

How to Install and Run

Clone this repository.

git clone https://github.com/Just-DIRECT-Capstone/Protein-Purification-Model-Public

Run the setup.sh file in your chosen bash shell.

source setup.sh

Launch the dash app.

source launch.sh

Model Characterization

We compared our models' performance on datasets of different sizes and on data generated with different isotherm types and resin types. Our comparison notebooks can be found in the surrogate_models/notebooks/development_notebooks folder. They indicate that the model remains fairly accurate when trained and tested across different isotherm types. Model accuracy decreases greatly as dataset size decreases, and the model cannot predict on data from a resin type other than the one it was trained on. This is due to the column parameters tied to each resin type in the mechanistic model. To predict accurately for a given resin type, you have to train the model on that same resin type.

Future Goals and Next Steps

We have a number of steps we'd like to take to improve both the mechanistic model provided to us and our own surrogate modeling/Dash visualization Python package.

Mechanistic Model

Improvements to the MM

One of our main challenges with this project was generating datasets for the model to train/test on that contained realistic input parameters. While we modified areas of the model based on our literature search to include the widest range of valuable information, we need to both improve the Langmuir isotherm model and validate further variable approximations to have the maximum confidence in our datasets.

The MM code could also be made more usable. For this project, every time we needed to change an input parameter, we changed it manually in the code, which often meant scouring the code for every instance of a single variable. In the future, we'd like to clarify how each variable is used so that each one is defined in a single place and one change propagates throughout the code.
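One possible way to get there is to collect the inputs in a single immutable object that every function reads from. The sketch below only illustrates the pattern: the class name, field names, and default values are hypothetical and do not come from the MM code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelParameters:
    """Single source of truth for model inputs (field names are hypothetical)."""
    isotherm: str = "langmuir"
    resin: str = "resin_a"
    column_length_cm: float = 20.0
    flow_rate_ml_min: float = 1.0


def residence_time_min(params, column_volume_ml):
    """Example consumer: reads the flow rate from params, not from a literal."""
    return column_volume_ml / params.flow_rate_ml_min


defaults = ModelParameters()
# Changing a value in one place yields a new parameter set used everywhere.
faster = replace(defaults, flow_rate_ml_min=2.0)
print(residence_time_min(defaults, 10.0), residence_time_min(faster, 10.0))
```

Because the dataclass is frozen, a parameter cannot be silently overwritten halfway through a run; a new value always means a new, explicit parameter set.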

We would also like to make the MM code more manageable by providing an easier user interface for changing variables and generating new data. The largest improvement could come from automating data generation with a command line interface that prompts the user to choose their parameters and then generates the corresponding dataset.
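A minimal sketch of what such a command line interface might look like, using Python's standard `argparse` module. It takes parameters as flags rather than interactive prompts. The flag names, defaults, and the "linear" isotherm choice are assumptions made for illustration, and the call into the mechanistic model is left as a placeholder.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical dataset-generation command."""
    parser = argparse.ArgumentParser(description="Generate an MM dataset (sketch).")
    parser.add_argument("--isotherm", choices=["langmuir", "linear"], default="langmuir")
    parser.add_argument("--resin", default="resin_a")
    parser.add_argument("--n-samples", type=int, default=1000)
    parser.add_argument("--output", default="dataset.csv")
    return parser


def main(argv=None):
    """Parse the arguments and report what would be generated."""
    args = build_parser().parse_args(argv)
    # A real implementation would call the mechanistic model here.
    print(
        f"Would generate {args.n_samples} samples "
        f"({args.isotherm} isotherm, {args.resin}) into {args.output}"
    )
    return args


# Example invocation with an explicit argument list instead of sys.argv.
main(["--isotherm", "langmuir", "--n-samples", "500"])
```

Passing `argv` explicitly keeps the function easy to test; a script entry point would simply call `main()` with no arguments.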

Dataset Production

For this project, we probed model performance by comparing the accuracy of models trained and tested across different isotherms, resin types, and dataset sizes. In the future, we would like to see how the model performs when two or more of these inputs change at once. How does the model do when trained on different resin types and isotherms together? How does dataset size affect accuracy across such a wide range of input parameters?

We would also like to modify the MM code to generate datasets with multiple impurities. The data analyzed here includes only one impurity, which is unlikely to be the case in most real-world scenarios. It would be interesting to see how model performance is affected by multiple impurities (and thus additional input parameters).

Surrogate Model and Dash Platform

In the future, we'd also be interested in comparing our current model performance to that of a Gaussian process regression (GPR) model, which can work well on small datasets and provides uncertainty estimates.

Further visualizations in the Dash platform could also help our users; if you're interested in seeing a new visualization added, open an issue and let us know!

In the data comparison Dash tool, it would also be helpful to provide quantitative statistical analyses (such as analysis of variance) that show whether the "inaccurate" data in a graph are statistically different from the "accurate" data, so that the user can understand which parameters break the model.
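As a rough illustration of the analysis of variance idea, the one-way ANOVA F statistic can be computed directly from its definition: the between-group mean square divided by the within-group mean square. The function name and the toy values below are hypothetical; in practice a library routine such as `scipy.stats.f_oneway` would also return a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each value around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy values of one input parameter for well-predicted and poorly-predicted points.
accurate = [1.0, 2.0, 3.0]
inaccurate = [7.0, 8.0, 9.0]
print(one_way_anova_f([accurate, inaccurate]))
```

A large F value suggests the parameter takes different values in the two groups, which would flag it as a region of input space worth inspecting.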

ML Modeling in Biotech

The predictive process could be sped up even more by removing the need for the mechanistic model entirely. An NN that maps protein sequence/structure directly to yield and purity would greatly speed up the molecular design process. The main challenge would likely be finding enough data that pairs sequence and structure with measured yield and purity to train a robust NN. We hope to investigate this idea in the future.
