Autoregressive Boltzmann Generators

Autoregressive Boltzmann Generators (ArBG)

Core Idea

Boltzmann Generators (BGs) aim to generate independent equilibrium samples from a target Boltzmann distribution $p(x) \propto e^{-U(x)/kT}$. They train a generative model $q_\theta(x)$ with exact likelihoods, then correct the remaining mismatch between $q_\theta$ and $p$ by importance-sampling reweighting. Prior BGs rely on normalizing flows (NFs), which are either constrained by invertibility (discrete-time NFs) or costly to evaluate (continuous-time NFs, whose likelihoods require integrating the Jacobian trace along the ODE trajectory). ArBG departs from flows by using an autoregressive model, which generates coordinates sequentially and yields tractable likelihoods without invertibility constraints.
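
The reweighting step can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names (`importance_weights`, `effective_sample_size`, `reweighted_mean`) and the toy energies are hypothetical, and it assumes the energies $U(x)$ and model log-densities $\log q_\theta(x)$ of the samples are already available.

```python
import math

def importance_weights(energies, log_q, kT=1.0):
    """Self-normalized weights proportional to exp(-U(x)/kT) / q(x)."""
    # Unnormalized log-weights: log p~(x) - log q(x), with log p~(x) = -U(x)/kT.
    log_w = [-u / kT - lq for u, lq in zip(energies, log_q)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def effective_sample_size(weights):
    """Kish effective sample size for normalized weights."""
    return 1.0 / sum(w * w for w in weights)

def reweighted_mean(values, weights):
    """Estimate an expectation under p from samples drawn from q."""
    return sum(v * w for v, w in zip(values, weights))

# Toy example with made-up numbers: three samples, equal model density.
energies = [0.0, 1.0, 2.0]
log_q = [-1.0, -1.0, -1.0]
weights = importance_weights(energies, log_q, kT=1.0)
ess = effective_sample_size(weights)
```

The effective sample size is a common diagnostic here: the closer $q_\theta$ is to $p$, the more uniform the weights and the closer the value is to the number of samples.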

Methodology

ArBG factorizes the joint distribution of molecular coordinates $x = (x_1, \dots, x_N)$ (where each $x_i$ may represent a group of atoms) as: $$ q_\theta(x) = \prod_{i=1}^N q_\theta(x_i \mid x_{<i}). $$ Each conditional is modeled as a mixture of Gaussians or a more flexible distribution (e.g., a real-valued non-volume preserving (real NVP) block). The model is trained by minimizing the forward Kullback-Leibler divergence $\mathrm{KL}(p \,\|\, q_\theta)$, which reduces to maximum likelihood when samples from the target are available; otherwise training proceeds by self-training with importance sampling. The key advantage: because the chain rule is exact, an autoregressive factorization with sufficiently expressive conditionals can represent any joint distribution, and no conditional has to be invertible, which leaves more freedom in how each conditional is parameterized.
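
The factorization can be made concrete with a small standard-library sketch in which each $x_i$ is a single scalar and each conditional is a two-component Gaussian mixture. The function `toy_conditional` is a hypothetical stand-in for the neural network that would map $x_{<i}$ to mixture parameters; the parameter values are arbitrary.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (((x - mu) / sigma) ** 2 + 2.0 * math.log(sigma) + math.log(2.0 * math.pi))

def mixture_log_pdf(x, weights, mus, sigmas):
    """log sum_k w_k N(x; mu_k, sigma_k), computed with log-sum-exp."""
    terms = [math.log(w) + gaussian_log_pdf(x, m, s)
             for w, m, s in zip(weights, mus, sigmas)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def toy_conditional(prefix):
    """Hypothetical stand-in for a network: mixture parameters from x_{<i}."""
    center = sum(prefix) / len(prefix) if prefix else 0.0
    return [0.5, 0.5], [center - 1.0, center + 1.0], [0.5, 0.5]

def joint_log_prob(x):
    """log q(x) = sum_i log q(x_i | x_{<i}), following the chain rule."""
    total = 0.0
    for i, xi in enumerate(x):
        weights, mus, sigmas = toy_conditional(x[:i])
        total += mixture_log_pdf(xi, weights, mus, sigmas)
    return total
```

Each conditional only needs to be a normalized density over $x_i$; nothing requires the map from $x_{<i}$ to the mixture parameters to be invertible.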

Sequential generation also permits interventions at inference time: one can fix some degrees of freedom (e.g., a dihedral angle) and sample the remaining ones conditioned on them, which is not straightforward with flows.
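
A minimal sketch of such an intervention, assuming the fixed coordinates come first in the autoregressive ordering, is shown below. The `toy_conditional` function is again a hypothetical stand-in for the learned conditionals, here returning the mean and standard deviation of a single Gaussian.

```python
import random

def toy_conditional(prefix):
    """Hypothetical stand-in for a network: Gaussian mean and std from x_{<i}."""
    mean = 0.8 * prefix[-1] if prefix else 0.0
    return mean, 0.3

def sample(n, fixed_prefix=(), rng=None):
    """Sample x_1..x_n sequentially, keeping any fixed leading coordinates."""
    rng = rng or random.Random()
    x = list(fixed_prefix)
    while len(x) < n:
        mean, std = toy_conditional(x)
        x.append(rng.gauss(mean, std))
    return x

rng = random.Random(0)
free = sample(4, rng=rng)                         # unconditional sample
clamped = sample(4, fixed_prefix=[1.5], rng=rng)  # first coordinate held at 1.5
```

Exact conditioning of this kind works only for variables that precede the sampled ones in the ordering; fixing a later variable would require a different ordering or an additional correction such as reweighting.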

Training

ArBG is trained either on a dataset of conformations or with a self-training procedure that reweights the model's own samples by importance sampling. The loss is the negative log-likelihood averaged over a batch of $B$ conformations $x^{(1)}, \dots, x^{(B)}$: $$ \mathcal{L} = -\frac{1}{B} \sum_{j=1}^B \sum_{i=1}^N \log q_\theta(x_i^{(j)} \mid x_{<i}^{(j)}). $$ The architecture uses Transformer-like layers for scalability, with masked attention enforcing the autoregressive ordering. The authors introduce a 132M-parameter model called Robin, trained on a diverse set of peptide systems.
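
The link between Kullback-Leibler minimization and this loss can be written out in a few lines. This is the standard derivation rather than anything specific to the source, with $p$ the target distribution and $x^{(1)}, \dots, x^{(B)}$ samples drawn from it:

$$ \begin{aligned} \mathrm{KL}(p \,\|\, q_\theta) &= \mathbb{E}_{x \sim p}[\log p(x)] - \mathbb{E}_{x \sim p}[\log q_\theta(x)] \\ &\approx \text{const} - \frac{1}{B} \sum_{j=1}^B \log q_\theta(x^{(j)}) \\ &= \text{const} - \frac{1}{B} \sum_{j=1}^B \sum_{i=1}^N \log q_\theta(x_i^{(j)} \mid x_{<i}^{(j)}) = \text{const} + \mathcal{L}. \end{aligned} $$

The first expectation does not depend on $\theta$, so minimizing the divergence and minimizing $\mathcal{L}$ select the same parameters; the approximation in the second line is the Monte Carlo estimate over the batch.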

Code Snippet: Model Forward Pass

import math

import torch
import torch.nn as nn

class AutoregressiveBoltzmannGenerator(nn.Module):
    def __init__(self, d_input, d_model, n_layers, n_heads):
        super().__init__()
        self.embed = nn.Linear(d_input, d_model)
        # Learned start token, so the first conditional sees no coordinates
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers
        )
        self.output_mu = nn.Linear(d_model, d_input)
        self.output_log_sigma = nn.Linear(d_model, d_input)

    def forward(self, x):
        # x: (batch, seq_len, d_input)
        batch, seq_len, _ = x.shape
        # Causal mask: -inf above the diagonal blocks attention to later positions
        mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=x.device), diagonal=1
        )
        # Shift inputs right by one so the output at position i depends only on x_{<i}
        h = torch.cat([self.start.expand(batch, -1, -1), self.embed(x[:, :-1])], dim=1)
        h = self.transformer(h, mask=mask)
        mu = self.output_mu(h)
        log_sigma = self.output_log_sigma(h)
        # conditional log prob: diagonal Gaussian
        log_prob = -0.5 * (((x - mu) / log_sigma.exp())**2 + 2*log_sigma + math.log(2*math.pi))
        return log_prob.sum(dim=-1).sum(dim=-1)  # sum over dimensions and residues

Results

ArBG is reported to outperform flow-based BGs on all benchmarks, especially on larger systems such as Chignolin (10 residues). The Robin model achieves a 60% reduction in zero-shot energy Wasserstein-2 error on 8-residue peptides compared to the prior state of the art.
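
For readers unfamiliar with the metric, the sketch below computes a one-dimensional Wasserstein-2 distance between two sets of energies. How the source evaluates its energy error (units, sample counts, any reweighting) is not stated here, so this assumes unweighted samples of equal size; the function name `wasserstein2_1d` and the energy values are made up for illustration.

```python
import math

def wasserstein2_1d(a, b):
    """Empirical 1-D Wasserstein-2 distance between equal-sized samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal coupling pairs the sorted samples.
    pairs = zip(sorted(a), sorted(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(a))

reference_energies = [0.0, 1.0, 2.0, 3.0]  # made-up values
generated_energies = [3.5, 0.5, 2.5, 1.5]  # made-up values
error = wasserstein2_1d(reference_energies, generated_energies)
```

A lower value means the energy histogram of the generated samples sits closer to that of the reference ensemble.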
