ChatGPT-esque Technology Redefining Protein Engineering
Written by William Boyce '25
Edited by Lorenzo Mahoney '24
It’s a “new era in digital biology.” Researchers at Salesforce.com and the University of California, San Francisco (UCSF) have created an artificial intelligence (AI) system capable of generating novel proteins with biological functionality. The team’s AI program, called ProGen, is modeled on natural language processing technology like OpenAI’s ChatGPT and uses deep learning not to answer essay prompts or check emails, but to assemble amino acids into artificial proteins.
Traditionally, engineering a protein means modifying the genetic instructions that encode it to achieve a desired outcome. However, this process can be time-consuming, expensive, and technically challenging. Hoping to revolutionize it, researchers at Salesforce and UCSF developed ProGen, a protein language model that is trained on millions of raw protein sequences, integrates the data into an “understanding” of protein design, then applies that knowledge to generate artificial, de novo proteins for specialized functions. Amino acid sequences are the fundamental protein “code,” and they can be reordered and combined to generate a vast array of proteins. Working from this natural base, the researchers fed ProGen the amino acid sequences of 280 million different proteins and let it digest the information for a couple of weeks. Then, they primed the model with 56,000 sequences from five families of lysozymes, a comprehensively understood class of enzymes, along with some contextual information. ProGen went on to generate over one million amino acid sequences for “potentially viable” lysozyme analogs. But how closely would the AI’s protein sequences resemble those found in nature, and would they even be functional?
To answer this question, the researchers selected ProGen’s most promising amino acid sequences, then synthesized five artificial proteins to test in cells. Since lysozymes cleave bacterial cell walls, the researchers compared the artificial proteins to hen egg white lysozyme (HEWL), an enzyme found in the whites of chicken eggs. Two of ProGen’s artificial enzymes broke down bacterial cell walls with activity comparable to HEWL. Despite their comparable function, the two artificial sequences were only about 18% identical to one another, and they were about 90% and 70% identical to any known protein across any organism. The AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein, which is striking given that a single mutation can be enough to abolish a protein’s function.
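To make the identity figures above more concrete, here is a minimal Python sketch of how percent identity between two already-aligned sequences can be computed. It assumes one common convention (matches divided by the number of columns where neither sequence has a gap); the article does not say which definition the researchers used, and the function name and toy sequences are invented purely for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared columns in which both sequences share a residue.

    Assumption: the inputs are already aligned to equal length, with '-'
    marking gaps. Columns containing a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where both sequences have a residue
    matches = 0   # compared columns where the residues are identical
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue
        compared += 1
        if residue_a == residue_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not real lysozyme sequences).
toy_a = "MKV-LAAGIT"
toy_b = "MKIALSAG-T"
print(percent_identity(toy_a, toy_b))  # 75.0
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so reported identities can differ slightly between tools.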
ProGen was able to learn how the enzymes should be shaped simply by studying raw sequence data. When examined with X-ray crystallography, the atomic structures of the artificial proteins looked just as they should for lysozymes, even though their sequences were novel. ProGen’s potential design choices are almost limitless: each position in a protein can hold any of 20 standard amino acids, so the number of possible sequences grows exponentially with length.
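The scale of that design space can be sketched with a few lines of Python. This is a back-of-the-envelope illustration rather than anything from the study: it assumes only the 20 standard amino acids, and it uses 129 residues (the length of hen egg white lysozyme) as an example length. The function names are hypothetical.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space_size(length: int) -> int:
    """Number of distinct sequences of the given length (20 ** length)."""
    return len(AMINO_ACIDS) ** length


def single_substitution_neighbors(length: int) -> int:
    """Number of sequences that differ from a given one at exactly one position."""
    return (len(AMINO_ACIDS) - 1) * length


example_length = 129  # illustrative: length of hen egg white lysozyme
total = sequence_space_size(example_length)
print(len(str(total)))                                # 168 (digits in 20 ** 129)
print(single_substitution_neighbors(example_length))  # 2451
```

Even for a small enzyme, the full space is a number with 168 digits, while only 2,451 sequences sit a single substitution away from any starting point. That gap suggests why a model able to propose distant yet functional sequences is so notable.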
The implications of this research are vast. Over the last 50 years, protein engineering has become increasingly advanced, and ProGen’s development could significantly speed up the generation of new proteins designed for 21st-century challenges. The new AI-based approach has the potential to become more powerful than even directed evolution, the current laboratory method for mimicking and accelerating natural selection through iterative rounds of gene diversification and screening. ProGen’s technology has applications in many fields, including medicine, biotechnology, and environmental protection. For example, researchers could design enzymes that are highly thermostable, that tolerate acidic environments, or that do not interact with other proteins. ProGen’s advantage stems from its specificity and its ability to generate proteins based on their intended function. This level of control over protein design has not been possible before and could be especially useful in developing new personalized drugs or cleaning up pollution.
Additionally, the technology has several advantages over traditional protein engineering methods. While traditional methods rely on the slow and laborious process of directed evolution, the AI model can generate a far larger number of sequences far more quickly. This means that it can explore a much larger space of possible designs and may discover proteins that traditional methods would have missed.
However, this technology also raises ethical concerns. As with any new technology, the potential risks and consequences of using AI-generated proteins need to be carefully considered. Such proteins could have unintended effects on living organisms or the environment. For instance, generating unnatural proteins could potentially give rise to new infectious particles that modern medicine has never seen before. The risks of using AI-generated proteins in humans need to be carefully studied to ensure that the proteins are safe. Additionally, the widespread adoption of AI-generated proteins could lead to job displacement in the protein engineering field.
In conclusion, Salesforce and UCSF’s recent breakthroughs in protein engineering using AI have the potential to revolutionize humanity’s interactions with our proteomic makeup. Although much additional computational research is needed, the so-called “new era in digital biology” appears brighter every day.
 Adeletron. Wikimedia Commons [Internet]. 2021 [cited 2023 Mar 6]. Available from: https://commons.wikimedia.org/wiki/File:Artificial-Intelligence.jpg
 Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023 Jan 26;1–8.
 Goodsell D. PDB101: Molecule of the Month: Lysozyme [Internet]. RCSB: PDB-101. [cited 2023 Mar 6]. Available from: http://pdb101.rcsb.org/motm/9
 Wang Y, Xue P, Cao M, Yu T, Lane ST, Zhao H. Directed Evolution: Methodologies and Applications. Chem Rev. 2021 Oct 27;121(20):12384–444.