What the hell is HMC, anyways?

It’s like soap sliding around a bathtub. It’s like a person skiing. It’s so simple, just <insert complicated physics analogy here>. These are descriptions you may have heard for Hamiltonian Monte Carlo (HMC). Perhaps they left you feeling like you understand what’s going on (I certainly did, at least for the first two). But then I started working with HMC. And I found myself somewhere in the middle of the ocean in a rowboat with one paddle and a half-torn map. (Dan Simpson has a much more colorful lake-based metaphor for being confused in statistics, which is no less apt.) What follows are some of the lessons I (think I) have learned about HMC, in the hopes that they might help others—or at the very least future me—understand. It is not a particularly rigorous introduction; for that, you might prefer Michael Betancourt’s piece.

A brief intro to Metropolis-Hastings MCMC

Markov chain Monte Carlo (MCMC) is a way to sample from a probability distribution. Perhaps you’ve heard the analogy that MCMC is like a robot wandering around on a hill. Maybe you even played around with Paul Lewis’ MCMC robot. If you haven’t, and for posterity, the analogy is as follows. First, you parachute a robot down onto a hill. Then you give it a simple set of instructions:

  1. Pick a random direction.
  2. Take a step to a proposed new location.
  3. If it’s higher, stay there. If not, compute the ratio, A, of the height of the just-stepped-to location to the height of the old location, and draw $u \sim \text{Uniform}(0,1)$. If $u \leq A$, stay there; otherwise go back to where you were.
  4. Record the location.
  5. Repeat 1-4 “long enough” (we’re not even going to touch on that vagary here).

When done, the robot will have sampled locations proportional to their height, and you’ll have an idea of the shape of the hill. The ability to step down can be mystifying, but suffice it to say that if your goal is to explore the surface of the mountain, you have to move downhill eventually.
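
To make those rules concrete, here’s a minimal sketch in Python. The “hill” (a standard normal density), the step size, and the number of iterations are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def height(x):
    # A made-up hill for the robot to explore: an unnormalized normal density.
    return np.exp(-0.5 * x**2)

x = 3.0                                   # wherever the parachute dropped us
samples = []
for _ in range(5000):
    proposal = x + rng.normal(0, 0.5)     # steps 1-2: random direction, take a step
    A = height(proposal) / height(x)      # step 3: ratio of new height to old
    if rng.uniform() <= A:                # uphill moves (A >= 1) are always accepted
        x = proposal
    samples.append(x)                     # step 4: record the location
```

A histogram of `samples` should trace out the shape of the hill.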

Aside: is Metropolis-Hastings MCMC inelegant?

I’ve heard people bash MCMC (especially Metropolis-Hastings MCMC) as inelegant. It’s true that, in general, we use MCMC as a last resort; if we had a more efficient approach, we would use it. And it’s also true that sometimes “long enough” is longer than you have available (and possibly even beyond the heat death of the universe). But MCMC is elegantly simple. With MCMC, you can sample from any imaginable distribution with a little patience, a few simple rules, and a way to propose new states. And that’s cool.

Just use gradients

The one thing I knew for sure about HMC was that it needs gradients. Gradients are vectors of partial derivatives of a function with respect to the vector of variables/parameters. Logically, if we’re trying to explore a space (or a hill), it is helpful to know which way is up and which is down. Since we are basically never starting our MCMC run near the actual bulk of the posterior mass, gradients should help us move toward where we want to be sampling: the peak.
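
As a tiny, concrete example (a standard bivariate normal standing in for a real posterior, not anything from a particular model):

```python
import numpy as np

def log_post(theta):
    # Unnormalized log density of a standard bivariate normal (our stand-in posterior).
    return -0.5 * np.sum(theta**2)

def grad_log_post(theta):
    # Vector of partial derivatives of log_post with respect to each parameter.
    return -theta

theta = np.array([2.0, -1.0])
print(grad_log_post(theta))   # [-2.  1.], pointing back toward the mode at the origin
```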

Hang on, isn’t that just hill climbing?

At this point, you may be wondering how we use derivatives. If we just follow the slopes, eventually we’d end up at the posterior mode. Thinking back to our intrepid robot, this would be something like never allowing it to step down. That would make this a hill-climbing algorithm, which is precisely what we do not want. We want to explore the hill. So, well, what are we to do? We inject some noise.

A much simpler gradient-based MCMC algorithm is the Metropolis-adjusted Langevin algorithm (MALA). This approach is basically to run Metropolis-Hastings where the proposal is centered not on where the robot is now, but on a location a little ways up the gradient from it. The gradient helps us orient toward the high point of the posterior distribution, the randomness helps us explore, and we carefully compute the acceptance ratio (step 3) to make things work.
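
Here’s a sketch of one MALA update, reusing the `log_post` and `grad_log_post` stand-ins from above; the step size `eps` is a tuning knob, not a recommended value.

```python
def mala_step(theta, eps, rng):
    # Propose from a normal centered a little ways up the gradient from where we are.
    mean_fwd = theta + 0.5 * eps**2 * grad_log_post(theta)
    proposal = mean_fwd + eps * rng.normal(size=theta.shape)

    # The proposal isn't symmetric, so the acceptance ratio needs both the
    # forward and reverse proposal densities (the "carefully" part of step 3).
    mean_rev = proposal + 0.5 * eps**2 * grad_log_post(proposal)
    log_q_fwd = -0.5 * np.sum((proposal - mean_fwd)**2) / eps**2
    log_q_rev = -0.5 * np.sum((theta - mean_rev)**2) / eps**2
    log_A = log_post(proposal) - log_post(theta) + log_q_rev - log_q_fwd

    if np.log(rng.uniform()) <= log_A:
        return proposal
    return theta
```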

That isn’t how HMC works, though. HMC uses the gradients to move along long sweeping paths through the posterior distribution. Why is this useful? Well, MCMC is a delicate balancing act between proposing new states that are close enough to be accepted but far enough away to be useful. Given that MCMC moves from one place in the posterior to a nearby place, the samples are correlated. The closer the next sample is to the last, the more autocorrelation and the longer you need to run the chain to get enough effectively independent samples. So if we can use gradients to move far away, to states nearly independent of the current one, while keeping a good acceptance rate, then we’ll be exploring much more efficiently.
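
If you want to put a rough number on that correlation, the lag-1 autocorrelation of a chain is a quick (if crude) check; values near 1 mean each new draw adds little information. A small helper, assuming numpy as above:

```python
def lag1_autocorr(chain):
    # Correlation between each draw and the next; near 0 means nearly independent samples.
    chain = np.asarray(chain)
    return np.corrcoef(chain[:-1], chain[1:])[0, 1]
```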

Great, but we’re still talking about paths based on gradients, and we haven’t explained how this isn’t just climbing hills. Randomness gets injected into the system through the so-called momentum. Someone once compared HMC to other MCMC by saying it’s like putting the robot on skis. Instead of stepping in a random direction, the robot picks a random direction, gives itself a shove of some random strength, and goes. Then the momentum, plus the contours of the posterior distribution, determine the path it takes to explore and where it ends up.

Spherical cows on frictionless planes

For a long time when picturing HMC, I assumed that, like someone skiing, the MCMC chain would get a push, go for a ways, and stop. But we’re not simulating a system with friction here. HMC lives in the same part of physics where the spherical cows graze. As our skiing robot moves uphill and slows down, it is perfectly converting its kinetic energy into potential energy. That will be converted perfectly back into kinetic energy as it heads back downhill. Rinse and repeat ad infinitum; not even bumping into spherical cows is going to stop it. That means that after we draw the momentum and get the MCMC chain speeding along its path, we have to decide how long to let it go before we make it stop. Turns out, there is no free lunch, though the cost of tuning an additional knob (how long to integrate) in the MCMC setup isn’t all that much compared to the potential gain in efficiency.

But wait, paths contain a lot of points

We’re talking about moving along paths of some length. But we’re moving using gradients, which depend on the parameters and thus change along the path, and not in ways we have closed-form solutions for. This means what we actually have to do is approximate the Hamiltonian dynamics numerically. That is, compute the gradients needed, move a bit, recompute. Rinse, repeat (give or take clever tricks about interleaving updates to momentum and position that stabilize things a lot). Then we can use this to get from where we started to where we’re going.
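
The standard version of that interleaving trick is the leapfrog integrator: half a step of momentum, a full step of position, another half step of momentum. A sketch, reusing `grad_log_post` from above and treating the step size `eps` and number of steps as knobs to tune:

```python
def leapfrog(theta, rho, eps, n_steps):
    # Numerically approximate the Hamiltonian dynamics by alternating
    # momentum (rho) and position (theta) updates.
    theta, rho = theta.copy(), rho.copy()
    rho += 0.5 * eps * grad_log_post(theta)       # half step for momentum
    for _ in range(n_steps - 1):
        theta += eps * rho                        # full step for position
        rho += eps * grad_log_post(theta)         # full step for momentum
    theta += eps * rho                            # last full step for position
    rho += 0.5 * eps * grad_log_post(theta)       # final half step for momentum
    return theta, rho
```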

But now we’re not really doing what we said we were doing (moving according to Hamiltonian dynamics). We’re approximating it. So we’ve got some error from that. We handle this by, drumroll please, using an accept/reject step at the very end.
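
Putting the pieces together, one HMC iteration looks roughly like this (a sketch assuming the `log_post`, `grad_log_post`, and `leapfrog` helpers above, with unit-mass momenta): draw a momentum, run the leapfrog, then accept or reject based on how much the total energy (the Hamiltonian) drifted during the approximate integration.

```python
def hmc_step(theta, eps, n_steps, rng):
    rho = rng.normal(size=theta.shape)            # the random shove of random strength

    theta_new, rho_new = leapfrog(theta, rho, eps, n_steps)

    # Hamiltonian = potential energy (-log posterior) + kinetic energy.
    h_old = -log_post(theta) + 0.5 * np.sum(rho**2)
    h_new = -log_post(theta_new) + 0.5 * np.sum(rho_new**2)

    # Exact dynamics would conserve the Hamiltonian and we'd always accept;
    # the accept/reject step mops up the integrator's error.
    if np.log(rng.uniform()) <= h_old - h_new:
        return theta_new
    return theta
```

Calling `theta = hmc_step(theta, eps=0.1, n_steps=20, rng=rng)` in a loop and recording each `theta` gives you the chain.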

TL;DR

After all this, we arrive full circle at a bit of an oddity. HMC is, effectively, a fancy MCMC proposal. We draw momentum, numerically integrate a path, and treat the endpoint as a proposal for the Metropolis-Hastings algorithm. So why is it worth our time? Because we get to explore the distribution quite efficiently. Efficiently enough that even though we have to compute the gradient of the log posterior many times per sample, those samples are more useful than if we had just randomly proposed new states. Because gradients tell us useful things about the shape of the posterior, beyond just whether one state is better than another.
