Partition functions and statistical inference, done by Nature
Deduction and inference
There are two types of truth in science: Things can be true mathematically, if they’re derived in a logically consistent way from fundamental axioms and if these axioms are free of contradiction. But things can be true empirically as well, meaning that they are consistent with an experiment and that the experiment is reproducible. The process of deriving statements from axioms is referred to as deduction, and making statements about physical laws based on experimental data is called inference. Please note that the famous detective Sherlock Holmes does both: He formulates hypotheses about the crime using deduction, and isolates likely conclusions from observations by inference, despite the fact that his author Arthur Conan Doyle calls the entire process deduction.
Bayes’ theorem and evidence
Of course Nature is not perfect: all observations come with some intrinsic amount of uncertainty. To stay with the picture of a crime scene analysed by Sherlock Holmes, there is never a smoking-gun signature from which one infers a hypothesis with absolute certainty. Instead, one needs to apply Bayes’ theorem, which constructs the posterior distribution p(h∣data) from the likelihood L(data∣h), i.e. an informed statement about the distribution of possible parameters h that takes the observation of data into account:
p(h∣data) = L(data∣h) π(h) / p(data)    (1)
where
p(data) = ∫ dh L(data∣h) π(h).    (2)
At its heart, Bayes’ theorem is a statement about conditional probabilities: When the random variable is interchanged with the condition in transitioning from L(data∣h) to p(h∣data), one needs to apply a correction π(h)/p(data) involving the prior distribution π(h) and the Bayesian evidence p(data). The prior π(h) encapsulates the uncertainty on the parameter h before the data is taken, and might be sourced from previous experiments or from theory, while the evidence p(data) serves as a normalisation for p(h∣data).
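As a minimal numerical sketch of eqs. (1) and (2), one can discretise the parameter h on a grid and carry out the Bayesian update explicitly; the Gaussian likelihood and the flat prior below are illustrative assumptions, not part of any particular physical example:

```python
import numpy as np

# Bayes' theorem on a grid, cf. eqs. (1) and (2): posterior = likelihood * prior / evidence.
# The Gaussian likelihood and the flat prior are illustrative assumptions.
h = np.linspace(0.0, 10.0, 1001)           # grid of possible parameter values
dh = h[1] - h[0]

likelihood = np.exp(-0.5 * (h - 3.0)**2)   # L(data|h): the data prefer h near 3
prior = np.ones_like(h) / (h[-1] - h[0])   # flat prior pi(h) on [0, 10]

evidence = np.sum(likelihood * prior) * dh # eq. (2) as a Riemann sum
posterior = likelihood * prior / evidence  # eq. (1)

print(np.sum(posterior) * dh)              # ~1.0: the evidence normalises the posterior
```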
Clearly, π(h)/p(data) is not unity in general, so it matters for computing p(h∣data)! In fact, one can be charged with procedural misconduct in court for committing the prosecutor’s fallacy: arguing that the probability of somebody being innocent, given the evidence, has to be small because damning evidence would be produced only with low probability by an innocent person. This is particularly perfidious if one then continues to make statements about guilt given an observation. Clearly, one needs to take the probability of observing the evidence at all and the prior probability that a given person is innocent into account: The evidence could have been generated by pure chance, uncorrelated with the innocent bystander.
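To see the fallacy in numbers, here is a small sketch; the probabilities (damning evidence from an innocent person with probability 10⁻³, a prior of one culprit among a hundred thousand plausible suspects) are invented purely for illustration:

```python
# Prosecutor's fallacy in numbers: a small p(evidence|innocent) does not imply
# a small p(innocent|evidence) -- the prior and the evidence term matter.
p_evidence_given_innocent = 1e-3   # an innocent person produces the damning evidence
p_evidence_given_guilty = 1.0      # the guilty person produces it for certain
prior_guilty = 1e-5                # one culprit among ~100000 plausible suspects
prior_innocent = 1.0 - prior_guilty

# Bayesian evidence p(data): total probability of observing the damning evidence
p_evidence = (p_evidence_given_guilty * prior_guilty
              + p_evidence_given_innocent * prior_innocent)

# Bayes' theorem, eq. (1): posterior probability of innocence given the evidence
p_innocent_given_evidence = p_evidence_given_innocent * prior_innocent / p_evidence
print(f"{p_innocent_given_evidence:.3f}")  # ~0.990: innocence remains overwhelmingly likely
```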
Boltzmann’s probability (or rather: Boltzmann’s likelihood)
The core object in Bayesian reasoning is the likelihood: For all intents and purposes, it is a probability, namely the probability of observing the data given a choice of the parameter h. It encapsulates our understanding of the physical law and of the measurement process: Every physical theory needs to be able to predict the range of possible outcomes of an experiment together with their associated probabilities. Let’s turn away from criminology towards an innocent example: the barometric formula. Molecules in the air undergo collisions all the time and constantly reshuffle their energy, using it to rise up in the gravitational field of the Earth to a height h until their kinetic energy is spent. The likelihood of observing a molecule at height h in an isothermal atmosphere with inverse temperature β = 1/T (in units where Boltzmann’s constant is unity) is given by Boltzmann’s probability
L(data∣h) ∝ exp(−βϵ(h))   with   ϵ(h) = mgh    (3)
which is in fact a likelihood, as it makes a statement about the probability that an observation occurs: What we called "data" before is the mere observation of a molecule at height h. Boltzmann’s probability translates energy differences to probability ratios, and in our case the energy is simply the potential energy mgh, assuming that the gravitational field is homogeneous.
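As a small sketch of how Boltzmann’s probability translates energy differences into probability ratios, consider a molecule of air in the gravitational field of the Earth; the temperature of 288 K and the mean molecular mass below are illustrative assumptions:

```python
import math

# Boltzmann's factor exp(-beta*epsilon) translates energy differences into
# probability ratios. Illustrative assumptions: T = 288 K, mean molecular
# mass of air ~29 u; epsilon(h) = m*g*h as in eq. (3).
k_B = 1.380649e-23       # Boltzmann constant in J/K
T = 288.0                # temperature in K
beta = 1.0 / (k_B * T)   # inverse temperature
m = 4.81e-26             # mean mass of an air molecule in kg (~29 u)
g = 9.81                 # gravitational acceleration in m/s^2

def likelihood_ratio(h1, h2):
    """Relative probability of observing a molecule at h1 versus h2 (in m)."""
    return math.exp(-beta * m * g * (h1 - h2))

# climbing one scale height (~8.4 km) costs roughly a factor of e in probability:
print(likelihood_ratio(8.4e3, 0.0))   # ~0.37
```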
Partitions and the posterior distribution
But how would one now derive the distribution of heights h at which one can observe molecules from the probability for an observation given a height? Clearly through Bayes’ theorem! But for proper application one would need a prior: In the case of molecules in the atmosphere, this could simply be the statement that observations are impossible below ground level, so h is necessarily positive. Then, one can write down an explicit expression for the evidence p(data)
p(data) = ∫ dh L(data∣h) π(h) = ∫_0^∞ dh exp(−βmgh) = 1/(βmg)    (4)

where the prior makes the integral converge, and Bayes’ theorem implies for the posterior distribution of the heights

p(h∣data) = βmg exp(−βmgh).    (5)
Of course it is not surprising at all that the air density decreases exponentially with height, ρ ∝ exp(−h/h_scale), with a scale height h_scale = 1/(βmg) that takes a numerical value of ∼8×10³ m for air in the gravitational potential of the Earth: After all, the density of air molecules must be proportional to the probability that an air molecule makes it up to a certain height with the thermal energy borrowed from the environment.
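A quick numerical check of the scale height and of the normalisation of the posterior (5); again, the temperature of 288 K and the mean molecular mass of air are assumptions for illustration:

```python
import numpy as np

# Scale height h_scale = 1/(beta*m*g) = k_B*T/(m*g) and a normalisation check
# of the posterior (5). T = 288 K and the mean molecular mass are assumptions.
k_B = 1.380649e-23   # J/K
T = 288.0            # K
m = 4.81e-26         # kg, mean mass of an air molecule (~29 u)
g = 9.81             # m/s^2

h_scale = k_B * T / (m * g)
print(f"h_scale = {h_scale:.0f} m")         # ~8.4e3 m, the quoted ~8 * 10^3 m

h = np.linspace(0.0, 30 * h_scale, 300001)  # truncate the h-integration at 30 h_scale
dh = h[1] - h[0]
posterior = np.exp(-h / h_scale) / h_scale  # p(h|data) = beta*m*g * exp(-beta*m*g*h)
print(np.sum(posterior) * dh)               # ~1.0: the posterior is normalised
```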
Constructing the posterior distribution through its moments
One can take this idea a tiny bit further, by constructing a (canonical) partition function Z[β,J] out of the Bayesian evidence p(data),
Z[β,J] = ∫_0^∞ dh exp(−βmgh) exp(−Jh)    (6)
by introducing a new term exp(−Jh), which effectively constructs a Laplace transform of the posterior, replacing h by the conjugate variable J. It has the neat property that moments can be computed,
⟨h^n⟩ = ∫_0^∞ dh p(h∣data) h^n = (−1)^n (1/Z) ∂^n/∂J^n Z[β,J] |_{J=0}    (7)
by differentiation instead of integration: Each derivative with respect to J generates an additional factor of −h inside the integral, and the prefactor (−1)^n in eq. (7) compensates the accumulated signs.
Please note that the partition function is essentially enabled by the particular form of Bayes’ theorem, where the evidence appears in the denominator and its derivatives appear in the numerator. Let’s compute the partition function explicitly,

Z[β,J] = ∫_0^∞ dh exp(−(βmg+J)h) = 1/(βmg+J),    (8)

differentiate n times with respect to J,

∂^n/∂J^n Z[β,J] = (−1)^n n!/(βmg+J)^(n+1),    (9)

set J = 0,

∂^n/∂J^n Z[β,J] |_{J=0} = (−1)^n n! h_scale^(n+1),    (10)

and divide by Z[β,0] = h_scale to obtain the moments of order n, which leads to
⟨h^n⟩ = n! h_scale^n,    (11)

exactly the moments of an exponential distribution, and clearly what is expected from dimensional arguments.
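One can verify the moments (11) symbolically, differentiating the partition function (8) with sympy; the variable beta_mg below stands for the combination βmg = 1/h_scale:

```python
import sympy as sp

# Verify <h^n> = n! * h_scale^n, eq. (11), by differentiating the partition
# function of eq. (8) symbolically; beta_mg stands for beta*m*g = 1/h_scale.
J = sp.symbols('J')
beta_mg = sp.symbols('beta_mg', positive=True)

Z = 1 / (beta_mg + J)             # eq. (8), valid for J > -beta*m*g
h_scale = 1 / beta_mg

for n in range(1, 5):
    # eq. (7): <h^n> = (-1)^n / Z * d^n Z / dJ^n at J = 0
    moment = (-1)**n * sp.diff(Z, J, n).subs(J, 0) / Z.subs(J, 0)
    assert sp.simplify(moment - sp.factorial(n) * h_scale**n) == 0
    print(n, sp.simplify(moment))  # n!/beta_mg**n, i.e. n! * h_scale^n
```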
We can assemble the distribution from its moments (neglecting all difficulties implied by the Stieltjes moment problem): The Laplace transform of the distribution p(h∣data) is obtained by expanding the transforming exponential in a power series,

∫_0^∞ dh p(h∣data) exp(−Jh) = Σ_{n=0}^∞ (−J)^n/n! ⟨h^n⟩ = Σ_{n=0}^∞ (−J h_scale)^n = 1/(1 + J h_scale),    (12)

which one inverts, technically through integration by substitution, leading directly back to the barometric formula p(h∣data) = exp(−h/h_scale)/h_scale.
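And one can let sympy carry out the reassembly: resumming the moment series into the Laplace transform (12) and inverting it returns the exponential profile, with the prior h ≥ 0 reappearing as a Heaviside step:

```python
import sympy as sp

# Reassemble p(h|data) from its moments: resum the series of eq. (12) into the
# Laplace transform 1/(1 + J*h_scale) and invert it symbolically with sympy.
J, h, n = sp.symbols('J h n')
h_scale = sp.symbols('h_scale', positive=True)

# sum_n (-J)^n/n! <h^n> with <h^n> = n! h_scale^n is a geometric series:
series = sp.Sum((-J * h_scale)**n, (n, 0, sp.oo)).doit()
print(series)   # 1/(J*h_scale + 1) for |J*h_scale| < 1 (a Piecewise in sympy)

# inverting the Laplace transform recovers the posterior:
p = sp.inverse_laplace_transform(1 / (1 + J * h_scale), J, h)
print(sp.simplify(p))   # exp(-h/h_scale)*Heaviside(h)/h_scale: the barometric
                        # formula, with the prior h >= 0 as the Heaviside step
```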
Summary
A seemingly innocuous result like the barometric formula combines many aspects of physics and statistics, most importantly the idea of statistical, Bayesian inference, extended to a partition function, which in turn is a neat way to generate the moments of a distribution. But all technicalities aside, the partition function itself is a weird object: It assembles all possible parameters h that would in principle be compatible with an observation, with relative weights according to the likelihood L(data∣h) (i.e. the probability that a choice h generates the observation) as well as the prior π(h), which encodes the state of knowledge before a measurement is done.
The second aspect is the physics one, namely the Boltzmann factor: It states how likely the observation of a particle with a certain energy would be if the system is in thermodynamic equilibrium at (inverse) temperature β. There is in fact a more fundamental principle from which the Boltzmann factor originates, and that is the principle of maximised information entropy, but clearly that would be the topic of another blog article.