We often know a target density only up to a normalizing constant:
$$p(x) = \frac{1}{Z}\,\tilde p(x), \qquad \nabla \log p(x) = \nabla \log \tilde p(x).$$
Langevin sampling (Langevin Monte Carlo, LMC) exploits gradients of the log-density to construct a continuous-time diffusion whose stationary distribution is $p$. Discretizing that diffusion yields a Markov chain that (under conditions) approaches $p$ and can mix faster than random-walk Metropolis in high dimensions.
Define a potential $U(x) = -\log p(x) + \mathrm{const}$; then $\nabla \log p(x) = -\nabla U(x)$.
A (scalar) Brownian motion $B_t$ satisfies:

- $B_0 = 0$
- Independent increments
- $B_{t_2} - B_{t_1} \sim \mathcal{N}(0,\, t_2 - t_1)$ for $t_2 > t_1$
- Almost surely continuous paths, nowhere classically differentiable.
Heuristically, partitioning $[0,T]$ finely gives

$$\sum_i \big(B_{t_{i+1}} - B_{t_i}\big)^2 \to T,$$

suggesting the mnemonic $(dB_t)^2 = dt$ (not an algebraic identity, but a bookkeeping rule in stochastic calculus).
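This heuristic is easy to check numerically: simulate Brownian increments over a fine partition of $[0,T]$ and sum their squares (a minimal sketch; the step count and seed are illustrative choices).

```python
import numpy as np

# Numerical check of the quadratic-variation heuristic (dB_t)^2 = dt:
# partition [0, T] into n steps and sum the squared Brownian increments.
rng = np.random.default_rng(0)
T, n = 2.0, 1_000_000
dB = rng.normal(0.0, np.sqrt(T / n), size=n)  # increments B_{t_{i+1}} - B_{t_i}
qv = np.sum(dB**2)
print(qv)  # close to T = 2.0 for large n
```

The sum concentrates around $T$ because each squared increment has mean $T/n$ and the fluctuations average out at rate $O(1/\sqrt{n})$.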
For an Itô process in $\mathbb{R}^d$:

$$dX_t = a(X_t, t)\, dt + B(X_t, t)\, dW_t,$$

with $W_t$ a $k$-dimensional Brownian motion and diffusion matrix $D = B B^\top$.
For smooth $f : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$, Itô's lemma gives

$$df = \Big(\partial_t f + \nabla f^\top a + \tfrac{1}{2}\,\mathrm{Tr}(D\, \nabla^2 f)\Big)\, dt + (\nabla f)^\top B\, dW_t.$$
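As a quick worked instance (a standard textbook check, using the symbols above): take scalar Brownian motion itself, $dX_t = dW_t$, so $a = 0$, $B = 1$, $D = 1$, and $f(x) = x^2$:

```latex
df = \Big(\underbrace{\partial_t f}_{0}
        + \underbrace{\nabla f^\top a}_{0}
        + \underbrace{\tfrac{1}{2}\,\mathrm{Tr}(D\,\nabla^2 f)}_{1}\Big)\, dt
   + \underbrace{(\nabla f)^\top B}_{2 X_t}\, dW_t
```

i.e. $d(W_t^2) = dt + 2 W_t\, dW_t$: the extra $dt$ term, absent from ordinary calculus, is exactly the $(dB_t)^2 = dt$ bookkeeping rule at work.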
The density $p(x,t)$ of $X_t$ evolves as

$$\frac{\partial p}{\partial t} = -\nabla \cdot \big(a(x,t)\, p(x,t)\big) + \tfrac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\Big(D_{ij}(x,t)\, p(x,t)\Big).$$
Choose the SDE

$$dX_t = \nabla \log p(X_t)\, dt + \sqrt{2}\, dW_t \quad \text{(equivalently } dX_t = -\nabla U(X_t)\, dt + \sqrt{2}\, dW_t\text{)}.$$
Here $a(x) = \nabla \log p(x)$ and $D = 2I$. Plugging into Fokker-Planck:

$$\partial_t p_t = -\nabla \cdot (p_t\, \nabla \log p) + \nabla^2 p_t = -\nabla \cdot (p_t\, \nabla \log p) + \nabla \cdot (\nabla p_t).$$
When $p_t = p$, note $p\, \nabla \log p = \nabla p$, so the right-hand side is zero; hence $p$ is stationary. Under regularity and confining tails (e.g. $U$ strongly convex outside a ball), the process is ergodic and $X_t \Rightarrow p$.

(Zero probability flux / detailed balance viewpoint: the stationary current $J = a\, p - \nabla p$ vanishes.)
Time-step the SDE with step $\eta > 0$:

$$x_{k+1} = x_k + \eta\, \nabla \log p(x_k) + \sqrt{2\eta}\,\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I).$$

This is the unadjusted Langevin algorithm (ULA). It introduces discretization bias: the chain has invariant distribution $p_\eta \neq p$. Under smoothness and dissipativity assumptions, $W_2(p_\eta, p) = O(\eta)$.
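The bias is visible on the simplest possible target. For $p = \mathcal{N}(0,1)$ we have $\nabla \log p(x) = -x$, so the update is the AR(1) process $x' = (1-\eta)x + \sqrt{2\eta}\,\xi$ whose stationary variance is $2\eta / (1 - (1-\eta)^2) = 1/(1 - \eta/2)$, not $1$ (a minimal sketch; the step size and chain length are illustrative):

```python
import numpy as np

# ULA on the standard Gaussian target p = N(0, 1): grad log p(x) = -x.
# The biased invariant variance is 1 / (1 - eta/2) in closed form.
rng = np.random.default_rng(1)
eta, n = 0.1, 500_000
x = 0.0
samples = np.empty(n)
for k in range(n):
    x = x + eta * (-x) + np.sqrt(2 * eta) * rng.normal()
    samples[k] = x
print(samples[50_000:].var())  # near 1 / (1 - eta/2) ~ 1.053, not 1
```

Halving $\eta$ roughly halves the variance inflation, consistent with the $O(\eta)$ bias above.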
Often we only know $\nabla \log \tilde p$, so use it directly (the gradients coincide):

- Initialize $x_0$.
- For $k = 0, \dots, K-1$:
  - $g = \nabla \log \tilde p(x_k)$
  - $\text{noise} \sim \mathcal{N}(0, I)$
  - $x_{k+1} = x_k + \eta\, g + \sqrt{2\eta}\,\text{noise}$
- Discard burn-in, thin if desired.
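The loop above is a few lines of NumPy (a minimal sketch; the example target, step size, and chain length are illustrative assumptions):

```python
import numpy as np

# Minimal ULA sampler following the loop above. grad_log_p may be the
# gradient of the *unnormalized* log-density, since Z drops out.
def ula(grad_log_p, x0, eta, n_steps, burn_in, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    out = []
    for k in range(n_steps):
        x = x + eta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)
        if k >= burn_in:
            out.append(x.copy())
    return np.array(out)

# Example target (an illustrative choice): N(0, diag(1, 4)),
# whose score is -Sigma^{-1} x, i.e. (-x_1, -x_2 / 4).
prec = np.array([1.0, 0.25])
samples = ula(lambda x: -prec * x, x0=np.zeros(2), eta=0.05,
              n_steps=110_000, burn_in=10_000,
              rng=np.random.default_rng(2))
print(samples.var(axis=0))  # near (1, 4) up to O(eta) bias
```

Note that only the score function is supplied: the sampler never evaluates $p$ or $Z$.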
Reduce bias via Metropolis-Hastings with the Langevin proposal (MALA, the Metropolis-adjusted Langevin algorithm). Proposal density:

$$q(x' \mid x) = \mathcal{N}\big(x';\; x + \eta\, \nabla \log p(x),\; 2\eta I\big).$$

Accept with probability

$$\alpha = \min\Big(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\Big).$$

For small $\eta$, acceptance is high; optimal scaling in high dimension behaves like $\eta \propto d^{-1/3}$ (a heuristic from diffusion limits).
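A sketch of the accept/reject correction (assuming, as with ULA, that only the unnormalized log-density is available; $Z$ and the Gaussian normalizing constants cancel in the ratio):

```python
import numpy as np

# MALA sketch: a ULA step as proposal, corrected by Metropolis-Hastings.
def mala(log_p, grad_log_p, x0, eta, n_steps, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    out, accepts = [], 0

    def log_q(xp, x):  # log N(xp; x + eta * grad log p(x), 2 eta I), up to const
        mu = x + eta * grad_log_p(x)
        return -np.sum((xp - mu) ** 2) / (4 * eta)

    for _ in range(n_steps):
        xp = x + eta * grad_log_p(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)
        log_alpha = log_p(xp) + log_q(x, xp) - log_p(x) - log_q(xp, x)
        if np.log(rng.uniform()) < log_alpha:
            x, accepts = xp, accepts + 1
        out.append(x.copy())
    return np.array(out), accepts / n_steps

# Standard 1-d Gaussian target: log p(x) = -x^2 / 2 up to a constant.
samples, acc = mala(lambda x: -0.5 * np.sum(x**2), lambda x: -x,
                    x0=np.zeros(1), eta=0.5, n_steps=50_000,
                    rng=np.random.default_rng(3))
print(acc, samples.var())  # variance near 1: the correction removes the bias
```

Unlike ULA, the accepted chain targets $p$ exactly, at the cost of occasional rejections.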
Guidelines:

- ULA: pick $\eta$ so the drift and noise scales are comparable; monitor an acceptance surrogate (e.g. the norm of the update).
- MALA: tune $\eta$ to target an acceptance rate of roughly 0.55-0.6 (empirical).
- Strongly convex regions allow larger $\eta$; multimodal or rugged landscapes call for smaller $\eta$.
- Adaptive schemes help, but beware of breaking detailed balance unless adaptation is frozen after burn-in.
To accelerate mixing, use a positive definite matrix $M$:

$$dX_t = M\, \nabla \log p(X_t)\, dt + \sqrt{2M}\, dW_t.$$

Discretization:

$$x_{k+1} = x_k + \eta\, M\, \nabla \log p(x_k) + \sqrt{2\eta M}\,\xi_k.$$
Choices:

- Diagonal $M$ from running variance estimates.
- Fisher information (Riemannian Langevin) when available.
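For diagonal $M$, $\sqrt{2\eta M}$ acts coordinatewise, so the preconditioned update is a small change to ULA. A sketch with a fixed diagonal $M$ (in practice $M$ might come from running variance estimates as noted above; the target and constants here are illustrative):

```python
import numpy as np

# Preconditioned ULA with a fixed diagonal M.
def precond_ula(grad_log_p, x0, M_diag, eta, n_steps, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    noise_scale = np.sqrt(2 * eta * M_diag)  # coordinatewise sqrt(2 eta M)
    out = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x + eta * M_diag * grad_log_p(x) + noise_scale * rng.normal(size=x.shape)
        out[k] = x
    return out

# Ill-conditioned Gaussian target with variances (1, 100); choosing
# M = diag(1, 100) equalizes the effective step across coordinates.
var = np.array([1.0, 100.0])
samples = precond_ula(lambda x: -x / var, np.zeros(2), M_diag=var,
                      eta=0.1, n_steps=200_000,
                      rng=np.random.default_rng(4))
print(samples[20_000:].var(axis=0))  # roughly (1, 100), up to O(eta) bias
```

Without preconditioning, the stable step size is dictated by the stiffest coordinate, so the slow coordinate mixes badly; $M$ removes that mismatch.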
Stochastic gradients (SGLD) replace the exact gradient with a minibatch estimate plus the injected noise; this requires a decreasing step schedule $\eta_k$ to control bias.
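A sketch of SGLD in the notation above, for a Bayesian posterior over data. The schedule $\eta_k = a/(b+k)^{0.55}$ and the constants $a$, $b$, the toy model, and the helper names are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

# SGLD sketch: the exact gradient of log p(x | data) is replaced by an
# unbiased minibatch estimate, with a decreasing step schedule eta_k.
def sgld(grad_log_prior, grad_log_lik, data, x0, a, b, n_steps,
         batch_size, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    N = len(data)
    out = np.empty((n_steps, x.size))
    for k in range(n_steps):
        eta = a / (b + k) ** 0.55  # decreasing step schedule (a common choice)
        batch = data[rng.choice(N, size=batch_size, replace=False)]
        # Unbiased estimate of grad log prior(x) + sum_i grad log lik(x, d_i)
        g = grad_log_prior(x) + (N / batch_size) * grad_log_lik(x, batch).sum(axis=0)
        x = x + eta * g + np.sqrt(2 * eta) * rng.normal(size=x.shape)
        out[k] = x
    return out

# Toy model: data ~ N(mu, 1) with prior mu ~ N(0, 10); for N = 1000
# observations the posterior mean is close to the sample mean.
rng = np.random.default_rng(5)
data = 2.0 + rng.normal(size=1000)
samples = sgld(lambda x: -x / 10.0,
               lambda x, d: (d - x).reshape(-1, 1),
               data, np.zeros(1), a=1e-3, b=10.0,
               n_steps=3000, batch_size=50, rng=rng)
print(samples[-1000:].mean())  # close to the sample mean of the data
```

The minibatch gradient adds extra variance on top of the injected $\sqrt{2\eta}$ noise; shrinking $\eta_k$ makes that extra variance (which scales as $\eta^2$) negligible relative to the injected noise (which scales as $\eta$).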