Structural Token Language Modeling of Proteins for Controllable Generation

12 minute read

Published: December 13, 2024

Final blog post for 6.S978 Deep Generative Models

Introduction

Recent advances in protein design have demonstrated the power of deep learning approaches over traditional physics-based methods. However, current state-of-the-art models, particularly those based on diffusion techniques, face significant computational challenges - often scaling cubically with sequence length. This blog post will discuss an approach that combines the flexibility of language models with geometric understanding of protein structures, while maintaining more efficient computational scaling.

Protein Structure Representation

Understanding protein structure representation requires appreciating the fundamental challenge: how do we capture the complex three-dimensional arrangement of atoms in a way that’s both computationally tractable and mathematically meaningful? Traditional approaches using Cartesian coordinates, while intuitive, fail to capture the inherent geometric relationships between protein elements.

When we examine a protein structure, we observe a hierarchical organization that spans multiple scales:

Primary structure represents the amino acid sequence, where each residue is connected through peptide bonds, forming the protein’s backbone.
Secondary structure elements emerge from local hydrogen bonding patterns - α-helices complete one full turn every 3.6 residues, while β-sheets form extended conformations stabilized by hydrogen bonds between strands.
Tertiary structure describes the global three-dimensional arrangement, often driven by hydrophobic packing and disulfide bonds.
Quaternary structure captures the arrangement of multiple protein chains in functional complexes.

Our approach uses SE(3) frames - a mathematical construct from geometric algebra that unifies rotational and translational transformations. For those unfamiliar with SE(3), imagine attaching a tiny coordinate system to each amino acid in the protein chain. This coordinate system tells us not just where the amino acid is (translation) but also how it’s oriented in space (rotation).

To compute these frames, we follow a precise procedure for each residue:

Establish the local coordinate system:
- Origin: Place at the Cα atom
- X-axis: Vector from Cα to C (carbonyl carbon)
- Z-axis: Cross product of (Cα to C) and (Cα to N)
- Y-axis: Complete right-handed system
Calculate relative transformations: For consecutive residues i and j, we compute: \(T_{i,j} = \begin{bmatrix} R_{i,j} & t_{i,j} \\ 0 & 1 \end{bmatrix}\) where:
- \(R_{i,j} \in SO(3)\) is the rotation matrix aligning frame i to frame j
- \(t_{i,j} \in \mathbb{R}^3\) is the translation vector between origins
Express as rotation-translation pairs: We decompose each transformation into: \(R_{i,j} = \exp(\omega^\wedge)\) (rotation) \(t_{i,j} = \Delta x\) (translation) where \(\omega^\wedge\) is the skew-symmetric matrix corresponding to the rotation axis.

This approach captures both local and global structural information, maintains geometric invariance, and enables natural handling of protein flexibility. Further, rather than working with continuous coordinates, we tokenize and develop a discrete vocabulary that captures common structural motifs:

Distance Metric Design: Our distance metric \(d(g_1, g_2)\) combines rotational and translational differences:
\[d(g_1, g_2) = \alpha \arccos(\frac{1}{2}(\text{tr}(R_1^TR_2) - 1)) + \beta \|t_1 - t_2\|_2\]
where:
- The first term measures geodesic distance in SO(3)
- The second term measures Euclidean distance between translations
- \(\alpha = 0.5\) and \(\beta = 0.5\) were determined empirically to balance the contributions
Clustering Strategy: We employ hierarchical clustering in SE(3) space using:
- Initial vocabulary size: 8192 tokens
- Minimum cluster size: 50 transformations
- Maximum intra-cluster distance: 0.8 (in normalized units)
Token Assignment: For a new transformation T, we assign the token via: \(\tau(T) = \arg \min_{k \in V} d(T, c_k)\) where \(c_k\) represents the k-th cluster centroid

This tokenization achieves several key properties:

Geometric consistency: Similar transformations map to the same token
Completeness: Coverage of 98% of observed protein motifs
Invertibility: Mean reconstruction RMSD of 0.45Å

Autoregressive Architecture

Our approach leverages an autoregressive architecture to generate protein structures one residue at a time. This design choice is motivated by the natural sequential nature of protein folding and allows us to maintain quadratic runtime scaling with sequence length. The model generates structural tokens sequentially, conditioning each prediction on all previously generated tokens. For a protein structure represented by tokens \((t_1, ..., t_n)\), we decompose the generation probability as:

\[P(t_1, ..., t_n) = \prod_{i=1}^n P(t_i | t_1, ..., t_{i-1})\]

This formulation enables the model to capture long-range dependencies while maintaining computational efficiency. At each step, the model predicts the next structural token using a transformer-based architecture that attends to all previous tokens.

Structure-Aware Loss

Perhaps the most important feature of this implementation is how we train the model to respect protein physics. Traditional language models optimize for token prediction accuracy, but proteins must satisfy complex physical constraints. Our loss function acts as a strict teacher, penalizing physically impossible structures while rewarding natural protein-like features.

The complete loss function combines multiple educational signals:

\[\mathcal{L} = \mathcal{L}_{token} + \lambda_1\mathcal{L}_{geometry} + \lambda_2\mathcal{L}_{consistency}\]

Let’s break down each component:

Token Prediction Loss (\(\mathcal{L}_{token}\)):
- Standard cross-entropy loss for next-token prediction
- Includes token importance weighting based on structural impact
- Implements focal loss to handle imbalanced token distributions

Geometric Physics Loss (\(\mathcal{L}_{geometry}\)): Best explained by this snippet of pseudocode

def geometry_loss(structure):
    # Bond length constraints
    bond_deviation = compute_bond_lengths(structure) - ideal_lengths
       
    # Angle constraints (both backbone and side-chain)
    angle_violation = compute_angle_violations(structure)
       
    # Ramachandran plot consistency
    rama_score = ramachandran_probability(structure)
       
    # Van der Waals clashes
    clash_score = compute_atomic_clashes(structure)
       
    return (w1 * bond_deviation + w2 * angle_violation 
            - w3 * rama_score + w4 * clash_score)

Multi-scale Consistency (\(\mathcal{L}_{consistency}\)): Ensures structural coherence across different spatial scales:
- Local: proper secondary structure formation
- Medium: domain packing and interfaces
- Global: overall fold topology

Experiments

Our experimental investigation after training the model on a subset of the Protein Data Bank reveals both the capabilities and limitations of structural token language modeling for protein design. Through systematic evaluation, we uncovered several key insights about the model’s behavior and its implications for protein generation.

Nucleus Sampling

Initial experiments with standard sampling techniques produced structures that, while protein-like, often exhibited unnatural local geometries. The introduction of nucleus sampling dramatically improved generation quality across multiple metrics:

Standard vs. Nucleus Sampling Results:
                    Standard    Nucleus
Clash-free         83.3%       93.2%
Structure Coverage  33%         55%
scTM > 0.5         27%         78%

This stark improvement suggests that the model learns a distribution of valid protein structures but requires careful sampling to access the high-probability regions of this distribution. The increase in clash-free structures (from 83.3% to 93.2%) indicates that nucleus sampling helps the model maintain consistent local geometry while generating sequences. More importantly, the dramatic improvement in structures with scTM > 0.5 (27% to 78%) reveals that the sampling strategy affects not just local features but also global fold stability.

Structure Generation Characteristics

Detailed analysis of the generated structures revealed an unexpected strength in immunoglobulin-like domain generation. The model shows higher success rates (mean TM-score 0.72) with β-sandwich folds compared to α-helical bundles (mean TM-score 0.58). This disparity suggests that our SE(3) frame representation might better capture the geometric constraints of β-sheet arrangements. However, this might be simply an artifact from limited training data.

   Natural vs. Generated Secondary Structure Distribution:
   Structure Type    Natural    Generated
   α-helix          32%        28%
   β-sheet          21%        25%
   Loop/Other       47%        47%

The slight overrepresentation of β-sheets in generated structures correlates with the model’s proficiency in immunoglobulin-like domains.

Energy minimization experiments provided crucial insights into the model’s understanding of protein stability. When subjected to molecular dynamics relaxation:

93.2% of structures maintain their fold (RMSD < 2Å)
Mean Rosetta energy scores improved by 15% post-relaxation
Side-chain packing quality matches natural proteins (mean TALOS+ score: 0.82)

These results suggest that the model generates structures that lie close to local energy minima, though not necessarily at the absolute minimum. This property is actually beneficial for design, as it allows for natural structural flexibility while maintaining stability.

Computational Performance Analysis

The quadratic scaling of our approach represents a significant improvement over cubic-scaling diffusion models, but the actual performance characteristics reveal interesting nuances:

Sequence Length    Generation Time    Memory Usage
residues       0.8s               1.2GB
residues      2.9s               2.3GB
residues      10.5s              4.5GB

While the scaling is indeed quadratic, we observe that the constant factors become significant for proteins longer than 128 residues. This suggests potential optimization opportunities through:

Sparse attention mechanisms for very long sequences
Hierarchical token generation for multi-domain proteins
Parallel generation of independent structural segments

Conditioning Strategies

The true test of any protein generation model lies not just in its ability to produce valid structures, but in the degree of control it offers over the generation process. Our investigations into controlled generation with a few conditoning strategies reveal both promising capabilities and intriguing challenges that shape future directions in protein design.

Experiments in Structural Control

Initial experiments with domain-based conditioning demonstrated strong potential for controlled generation. By modifying our generative process to condition on domain specifications:

\[P(t_1, ..., t_n | D) = \prod_{i=1}^n P(t_i | t_1, ..., t_{i-1}, D)\]

we achieved 85% accuracy in maintaining specified domain architectures. This success, however, came with an important caveat: while individual domains showed high fidelity (mean RMSD 2.3Å), interface regions exhibited elevated clash rates. This observation led us to explore more fine-grained control mechanisms.

Feature-based conditioning through embedded vectors:

\[f = [f_{size}, f_{charge}, f_{hydrophobicity}, f_{flexibility}]\]

proved particularly effective for controlling physical properties, with generated proteins showing 92% correlation with target size specifications and 88% accuracy in hydrophobicity profiles. These results suggest that the model learns meaningful relationships between structural features and their physical manifestations.

Perhaps most intriguing were our experiments with temperature-based sampling:

\[P(t_i | t_{1:i-1}) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}\]

At τ = 0.8, the model generated highly natural-like structures (95% success rate), while τ = 1.2 led to the emergence of potentially novel folds (35% of generations) while maintaining physical feasibility. This control parameter revealed a fascinating tension between structural conservation and innovation.

Discussion, Insights, and Future Directions

Our experiments with both basic generation and controlled synthesis point to several key insights that shape future directions in protein design using language models.

The model’s unexpected proficiency with β-sheet structures (particularly in immunoglobulin-like domains) combined with its strong performance in feature-based conditioning suggests a natural path forward in antibody design. However, the persistent challenge of interface quality - evident in both basic generation and domain-conditioned experiments - indicates that our current frame-based representation might not fully capture the subtle geometric relationships at domain boundaries.

Temperature-based sampling experiments revealed another intriguing avenue: while we can control the degree of structural innovation, we lack precise metrics for quantifying and validating truly novel folds. This challenge is compounded by our observation that generated structures tend to lie near, but not at, local energy minima - a property that could be either beneficial or problematic for novel fold design.

The path forward appears to lie at the intersection of these observations. Future work should focus on:

Developing specialized representations for interface regions that capture the geometric subtleties of domain interactions - perhaps through hierarchical tokenization schemes that explicitly model multi-scale structural relationships. Our current success with feature-based conditioning provides a foundation for this approach, but the elevated clash rates at interfaces (12% vs 6.8% overall) suggest room for improvement.

The relationship between sequence and structure deserves deeper investigation, particularly given our mean self-consistency TM-score of 0.65. The strong performance of our feature-based conditioning hints at the possibility of more sophisticated conditioning schemes that could bridge this gap, potentially incorporating evolutionary information or learned sequence-structure correlations.

Lastly, the challenge in developing rigorous metrics and validation approaches for novel folds while maintaining physical feasibility remains. It’s clear that temperature selection doesn’t have the resolution to cut it but the model’s current ability to generate structures near energy minima provides a promising starting point and requires careful balance with the desire for structural innovation.

These directions aren’t merely technical challenges - they represent fundamental questions about the relationship between sequence, structure, and function in proteins. As we continue to develop and refine these models, we move closer to understanding not just how to generate protein structures, but how to design them with precise control over their properties and functions.

Share on

Twitter Facebook LinkedIn

Arthur Liang