Langevin Sampling

Introduction

We often know a target density only up to a normalizing constant:

$$p(x) = \frac{1}{Z}\,\tilde p(x), \qquad \nabla \log p(x) = \nabla \log \tilde p(x).$$

Langevin sampling (Langevin Monte Carlo, LMC) exploits gradients of the log-density to construct a continuous-time diffusion whose stationary distribution is $p$. Discretizing that diffusion yields a Markov chain that (under suitable conditions) approaches $p$ and can mix faster than random-walk Metropolis in high dimensions.

Define a potential $U(x) = -\log p(x) + \mathrm{const}$; then $\nabla \log p(x) = -\nabla U(x)$.

Brownian Motion and ItΓ΄ Calculus

A (scalar) Brownian motion $B_t$ satisfies:

  • B0=0B_0 = 0
  • Independent increments
  • Bt2βˆ’Bt1∼N(0,t2βˆ’t1)B_{t_2} - B_{t_1} \sim \mathcal{N}(0, t_2 - t_1) for t2>t1t_2 > t_1
  • Almost surely continuous paths, nowhere classically differentiable.

Heuristically, partitioning $[0,T]$ finely gives

$$\sum_i (B_{t_{i+1}} - B_{t_i})^2 \to T,$$

suggesting the mnemonic $(dB_t)^2 = dt$ (not an algebraic identity, but a bookkeeping rule in stochastic calculus).
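The quadratic-variation heuristic is easy to check numerically. A minimal sketch (assuming nothing beyond the definition above): simulate one Brownian path on $[0, T]$ by summing independent $\mathcal{N}(0, dt)$ increments, then sum the squared increments.

```python
import math
import random

# Simulate one Brownian path on [0, T] via independent N(0, dt) increments
# and check that the quadratic variation (sum of squared increments) is near T.
T, n = 1.0, 100_000
dt = T / n
rng = random.Random(0)

increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
quad_var = sum(db * db for db in increments)
```

As the partition is refined, `quad_var` concentrates around $T$; its fluctuations shrink like $\sqrt{dt}$.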

ItΓ΄'s Lemma

For an ItΓ΄ process in $\mathbb{R}^d$:

$$dX_t = a(X_t,t)\,dt + B(X_t,t)\,dW_t,$$

with $W_t$ a $k$-dimensional Brownian motion and diffusion matrix $D = B B^\top$. For smooth $f : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$:

$$df = \Big(\partial_t f + \nabla f^\top a + \tfrac{1}{2}\,\mathrm{Tr}\big(D\,\nabla^2 f\big)\Big)\,dt + (\nabla f)^\top B\,dW_t.$$
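As a quick sanity check (an illustrative special case, not from the original text), take $f(x) = \tfrac{1}{2}\|x\|^2$ with no explicit time dependence, so $\partial_t f = 0$, $\nabla f = x$, and $\nabla^2 f = I$:

```latex
d\left(\tfrac{1}{2}\|X_t\|^2\right)
  = \Big( X_t^\top a + \tfrac{1}{2}\,\mathrm{Tr}(D) \Big)\,dt
  + X_t^\top B\,dW_t .
```

The $\tfrac{1}{2}\mathrm{Tr}(D)$ term is the ItΓ΄ correction that ordinary calculus would miss.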

Fokker-Planck (Forward Kolmogorov) Equation

The density $p(x,t)$ of $X_t$ evolves as

$$\frac{\partial p}{\partial t} = -\nabla \cdot \big(a(x,t)\,p(x,t)\big) + \tfrac{1}{2}\sum_{i,j}\frac{\partial^2}{\partial x_i\,\partial x_j}\Big(D_{ij}(x,t)\,p(x,t)\Big).$$

Overdamped Langevin Dynamics

Choose the SDE

$$dX_t = \nabla \log p(X_t)\,dt + \sqrt{2}\,dW_t \qquad \text{(equivalently } dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dW_t\text{)}.$$

Here $a(x) = \nabla \log p(x)$ and $D = 2I$. Substituting into the Fokker-Planck equation:

$$\partial_t p_t = -\nabla \cdot \big(\nabla \log p\;p_t\big) + \nabla^2 p_t = -\nabla \cdot \big(p_t\,\nabla \log p\big) + \nabla \cdot (\nabla p_t).$$

When $p_t = p$, note $p\,\nabla \log p = \nabla p$, so the right-hand side vanishes; hence $p$ is stationary. Under regularity and confining-tail conditions (e.g. $U$ strongly convex outside a ball), the process is ergodic and $X_t \Rightarrow p$.

(Zero-probability-flux / detailed-balance viewpoint: the stationary current $J = a p - \nabla p = 0$.)

Euler-Maruyama Method

Time-step the SDE with step $\eta > 0$:

$$x_{k+1} = x_k + \eta\,\nabla \log p(x_k) + \sqrt{2\eta}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$

This introduces discretization bias: the chain has invariant distribution $p_\eta \neq p$. Under smoothness and dissipativity assumptions, $W_2(p_\eta, p) = O(\eta)$.

Often we only know $\nabla \log \tilde p$; since $\nabla \log p = \nabla \log \tilde p$, we can use it directly.

Practical Pseudocode

  1. Initialize $x_0$
  2. For $k = 0, \dots, K-1$:
    • $g = \nabla \log p(x_k)$
    • $\xi \sim \mathcal{N}(0, I)$
    • $x_{k+1} = x_k + \eta g + \sqrt{2\eta}\,\xi$
  3. Discard burn-in, thin if desired.
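A minimal runnable version of these steps, assuming a standard-normal target as a stand-in example (so $\nabla \log p(x) = -x$; the function names are illustrative, not from any library):

```python
import math
import random

def grad_log_p(x):
    # Assumed example target: standard normal, log p(x) = -x^2/2 + const.
    return -x

def ula(x0, eta, n_steps, burn_in, seed=0):
    """Unadjusted Langevin algorithm: drift step plus sqrt(2*eta) Gaussian noise."""
    rng = random.Random(seed)
    x, samples = x0, []
    for k in range(n_steps):
        x = x + eta * grad_log_p(x) + math.sqrt(2 * eta) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(x)
    return samples

samples = ula(x0=5.0, eta=0.05, n_steps=20_000, burn_in=2_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With this step size the empirical mean and variance land near the target's 0 and 1, up to the $O(\eta)$ bias discussed above (for a Gaussian target the stationary variance of the chain works out to $1/(1-\eta/2)$, slightly above 1).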

Metropolis-Adjusted Langevin Algorithm (MALA)

Reduce the discretization bias with a Metropolis-Hastings correction, using the Langevin step as the proposal. Proposal density:

$$q(x' \mid x) = \mathcal{N}\big(x';\; x + \eta\,\nabla \log p(x),\; 2\eta I\big).$$

Accept with probability

$$\alpha = \min\Big(1,\ \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\Big).$$

The normalizing constant $Z$ cancels in the ratio, so $\tilde p$ suffices. For small $\eta$ the acceptance rate is close to one; diffusion-limit heuristics suggest an optimal scaling of $\eta \propto d^{-1/3}$ in high dimension.
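A sketch of the accept/reject step, again assuming a standard-normal example target (all names here are illustrative). The Gaussian normalizers of the forward and reverse proposals cancel, so only the exponents enter the log acceptance ratio:

```python
import math
import random

def log_p_tilde(x):
    # Assumed example target: unnormalized standard-normal log-density.
    return -0.5 * x * x

def grad_log_p(x):
    return -x

def mala(x0, eta, n_steps, seed=0):
    rng = random.Random(seed)
    x, lp = x0, log_p_tilde(x0)
    samples, accepts = [], 0
    for _ in range(n_steps):
        mean = x + eta * grad_log_p(x)
        prop = mean + math.sqrt(2 * eta) * rng.gauss(0.0, 1.0)
        lp_prop = log_p_tilde(prop)
        # log q(x | x') - log q(x' | x): both proposals are N(., 2*eta),
        # so only the squared deviations from the drifted means remain.
        back = x - (prop + eta * grad_log_p(prop))
        fwd = prop - mean
        log_alpha = lp_prop - lp + (fwd * fwd - back * back) / (4 * eta)
        if math.log(rng.random()) < log_alpha:
            x, lp = prop, lp_prop
            accepts += 1
        samples.append(x)
    return samples, accepts / n_steps

samples, acc_rate = mala(x0=3.0, eta=0.5, n_steps=20_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Unlike ULA, the accepted chain targets $p$ exactly; the step size trades off acceptance rate against move size.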

Choosing the Step Size

Guidelines:

  • ULA: pick Ξ·\eta so drift and noise scales comparable: monitor acceptance surrogate (norm of update).
  • MALA: tune Ξ·\eta to target acceptance 0.55–0.6 (empirical).
  • Strongly convex regions: larger Ξ·\eta possible; multimodal / rugged: smaller Ξ·\eta.
  • Use adaptive schemes (but beware of breaking detailed balance unless stabilization after burn-in).

Preconditioning and Variants

To accelerate mixing, use a positive definite matrix $M$:

$$dX_t = M\,\nabla \log p(X_t)\,dt + \sqrt{2M}\,dW_t.$$

Discretization:

$$x_{k+1} = x_k + \eta M\,\nabla \log p(x_k) + \sqrt{2\eta M}\,\xi_k.$$

Choices:

  • Diagonal MM from running variance estimates.
  • Fisher information (Riemannian Langevin) when available.
  • Stochastic gradients (SGLD) replace exact gradient with minibatch estimate + added noise; requires decreasing step schedule Ξ·k\eta_k to control bias.