Predicting Protein Developability via Convolutional Sequence Representation
ORAL
Abstract
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability - quantified by expression, solubility, and stability - hinders commercialization. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10^5 of 10^20 possible variants of protein scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from the HT dataset and transfer the knowledge to predict recombinant expression beyond the observed sequences. Mimicking protein theory, our model convolves learned amino acid properties to predict expression levels 42% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine and the importance of hydrophobicity and charge, and unimportance of aromaticity, when aiming to improve developability. We identify clusters of similar sequences with increased developability through nonlinear dimensionality reduction (UMAP) and explore the inferred developability landscape via nested sampling.
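To make the idea of convolving learned amino acid properties concrete, the following is a minimal sketch of a forward pass, not the architecture used in this work. Each residue is mapped to a small embedding vector, filters slide along the sequence, and the pooled activations feed a linear output. The names `init_params` and `predict`, the embedding size, kernel width, filter count, ReLU activation, and max pooling are all assumptions made for illustration, and the weights here are random rather than trained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def init_params(embed_dim=3, kernel_width=3, n_filters=4, seed=0):
    """Random, untrained parameters (all sizes are illustrative assumptions)."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    return {
        # one learned property vector per amino acid
        "embedding": {aa: [rand() for _ in range(embed_dim)] for aa in AMINO_ACIDS},
        # filters[f][offset][dim]: weights over a window of neighboring residues
        "filters": [
            [[rand() for _ in range(embed_dim)] for _ in range(kernel_width)]
            for _ in range(n_filters)
        ],
        "out_weights": [rand() for _ in range(n_filters)],
        "out_bias": 0.0,
    }


def predict(sequence, params):
    """Embed residues, convolve along the sequence, max-pool, apply a linear output."""
    embedded = [params["embedding"][aa] for aa in sequence]
    width = len(params["filters"][0])
    if len(embedded) < width:
        raise ValueError("sequence is shorter than the convolution window")

    pooled = []
    for filt in params["filters"]:
        activations = []
        for start in range(len(embedded) - width + 1):
            total = 0.0
            for offset in range(width):
                for w, x in zip(filt[offset], embedded[start + offset]):
                    total += w * x
            activations.append(max(0.0, total))  # ReLU
        pooled.append(max(activations))  # max pooling over positions

    return sum(w * p for w, p in zip(params["out_weights"], pooled)) + params["out_bias"]


params = init_params()
score = predict("KFWATVESS", params)  # arbitrary example sequence, not a Gp2 variant
```

In a trained model the embedding table would be fit jointly with the filters, which is what would allow the learned residue vectors to be inspected afterwards for properties such as hydrophobicity or charge.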
–
Presenters
Alexander Golinski
- University of Minnesota
- Department of Chemical Engineering and Materials Science, University of Minnesota