NC717/Genetic_engineering_attribution_challenge

This repository contains a novel solution for the Genetic Engineering Attribution Challenge, organized by ALT labs on DrivenData.

Challenge Overview

The goal was to create an algorithm that identifies the most likely lab of origin for genetically engineered DNA.

Applications for genetic engineering are rapidly diversifying. Researchers across the world are using powerful new techniques in synthetic biology to solve some of the world’s most pressing challenges in medicine, agriculture, manufacturing and more. At the same time, increasingly powerful genetically engineered systems could yield unintended consequences for people, food crops, livestock, and industry. These incredible advances in capability demand tools that support accountable innovation.

Genetic engineering attribution is the process of identifying the source of a genetically engineered piece of DNA. This ability ensures that scientists who have spent countless hours developing breakthrough technology get their due credit, intellectual property is protected, and responsible innovation is promoted. By connecting a genetically engineered system with its designers, society can examine the policies, processes, and decisions that led to its creation. As has been observed in other disciplines, reducing anonymity encourages more prudent behavior within scientific and entrepreneurial communities—without stifling innovation.

Development of attribution capabilities is critical for the maturation of genetic engineering as a field, protecting the significant benefits it promises society while promoting accountability, responsibility, and dialog. In this competition, the challenge was to advance the state-of-the-art in this exciting new domain!

Results from the hackathon leaderboard

[Image: hackathon leaderboard results]

Final approach

To predict the lab of origin for plasmid sequences, a combination of features was used to train a final XGBoost classifier. The features came from two sources: graph representation learning, which captures structural information for the protein sequences, and n-gram features, which capture positional information in the sequences.
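As an illustration of the n-gram side of this feature set, here is a minimal sketch that turns a sequence into a fixed-length vector of n-gram frequencies. The function names, the default `ACGT` alphabet, and the choice of `n` are assumptions made for this example; the repository's actual feature extraction may differ.

```python
from collections import Counter
from itertools import product
from typing import List


def ngram_counts(sequence: str, n: int = 3) -> Counter:
    """Count every overlapping n-gram in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_vector(sequence: str, alphabet: str = "ACGT", n: int = 3) -> List[float]:
    """Return n-gram frequencies in a fixed vocabulary order.

    The vocabulary is every length-n string over `alphabet` (an assumed
    alphabet for this sketch). N-grams containing other symbols, such as
    'N', are ignored.
    """
    counts = ngram_counts(sequence, n)
    vocab = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[g] for g in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[g] / total for g in vocab]


if __name__ == "__main__":
    vec = ngram_vector("ATGATG", n=2)
    print(len(vec))  # one entry per possible 2-gram over the alphabet
```

Normalizing by the total count keeps vectors comparable across sequences of different lengths, which is one reasonable choice when the result is fed to a tree-based classifier.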

Step 1: Conversion of protein sequences into SMILES (Simplified Molecular Input Line Entry System) notation

I used the open-source library RDKit to convert the protein sequences into SMILES. The resulting structure for a sample protein sequence is shown below.

[Image: structure of a sample protein sequence]
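A conversion of this kind can be sketched with RDKit's `Chem.MolFromSequence` and `Chem.MolToSmiles`. The wrapper name `sequence_to_smiles` and the sample input are hypothetical, and this is only one plausible way to perform the step, not necessarily the exact code used in this repository.

```python
from rdkit import Chem


def sequence_to_smiles(sequence: str) -> str:
    """Convert a one-letter amino-acid sequence to a SMILES string.

    Hypothetical helper for illustration only.
    """
    # MolFromSequence builds a molecule from a one-letter-code sequence;
    # with its default settings the input is read as L-amino acids.
    mol = Chem.MolFromSequence(sequence)
    if mol is None:
        raise ValueError(f"RDKit could not parse sequence: {sequence!r}")
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    # Glycine-alanine dipeptide as a tiny sample input
    print(sequence_to_smiles("GA"))
```

The resulting SMILES string can be parsed back into a molecule whose atoms and bonds form the graph that a graph neural network consumes.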

Step 2: Final model architecture used

I tried a variety of graph neural network based approaches to build the complete model. The final model was a graph attention network that learns embeddings for the protein graphs. These embeddings were then concatenated with the n-gram features, and an XGBoost classifier was trained on top of the combined features to predict the lab of origin for the protein sequences.

[Image: final model architecture]
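To make the pipeline above more concrete, here is a minimal NumPy sketch of a single-head graph attention layer, followed by mean pooling into a graph embedding and concatenation with an n-gram vector. Everything here is an assumption made for illustration: the function names, the single head, the ELU activation, the mean-pool readout, and the random weights (a real model would learn `W` and `a` during training). The XGBoost step is left out; the concatenated vector is what such a classifier would receive.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_layer(h, adj, W, a):
    """Single-head graph attention layer.

    h:   (N, F) node features
    adj: (N, N) adjacency matrix, nonzero where an edge exists
    W:   (F, F_out) shared linear transform
    a:   (2 * F_out,) attention vector
    Returns the updated node features (N, F_out) and the attention matrix.
    """
    z = h @ W
    f_out = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into two dot products
    src = z @ a[:f_out]
    dst = z @ a[f_out:]
    e = leaky_relu(src[:, None] + dst[None, :])
    # Attend only to neighbours; self-loops keep every row non-empty
    mask = (adj + np.eye(len(adj))) > 0
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)  # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


def graph_embedding(h, adj, W, a):
    """ELU activation followed by mean pooling over nodes (assumed readout)."""
    out, _ = gat_layer(h, adj, W, a)
    out = np.where(out > 0, out, np.expm1(np.minimum(out, 0.0)))
    return out.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy graph: 4 nodes in a chain, 5 input features per node
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    h = rng.normal(size=(4, 5))
    W = rng.normal(size=(5, 8))
    a = rng.normal(size=16)
    ngram_features = rng.random(16)  # stand-in for real n-gram features
    combined = np.concatenate([graph_embedding(h, adj, W, a), ngram_features])
    print(combined.shape)  # input row for the downstream classifier
```

Each row of the attention matrix sums to one and is zero for non-neighbours, so every node's new representation is a weighted average over itself and its direct neighbours.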

About

This repository contains the 130th-place solution for the GE attribution challenge hosted by ALT labs.
