A recent paper published in PNAS titled “The Fermi-Dirac distribution provides a calibrated probabilistic output for binary classifiers” caught my attention, because it describes a surprising relationship between machine learning and quantum physics. In fact, surprising is an understatement. Mind-boggling is more like it. According to the analogy developed by the authors, positive samples in binary classification problems are like… fermions?! What?! I decided that I should try to understand the gist of this paper, at least to the extent that I can.
First of all, let me say that this paper can be read purely as a theoretical machine learning paper, with several insightful, even practical, points about $AUC$ and ensemble classifiers. However, in this post, I will only talk about the machine learning/quantum analogy, because that’s what got my imagination running wild.
Drawing analogies between physics and machine learning is not new. There are many statistical mechanics terms in machine learning (hello, Boltzmann machines and Helmholtz machines). However, the ones I have seen before all come from classical physics. Where do fermions come in? How can these guys possibly have anything to do with something as basic as binary classifiers? That’s what I had to find out.
Let’s start with any binary classifier. Say we have a test set with $N$ samples, $N_1$ of which are positive. For every sample in the test set, we first use the classifier to generate a score $s_i$ (a continuous value, where $i$ is the index of the sample), and then threshold it to decide whether the sample is classified positive ($s_i$ > threshold) or negative (otherwise). A slightly different way to describe this is that we first rank the samples from high $s_i$ to low $s_i$, and then assign the positive label to all samples ranked above a certain cutoff rank (i.e., those with the highest scores). All other samples are assigned the negative label. If we randomly sample the test set many times, what is the probability that the sample at rank $r_i$ is actually positive? In other words, we want to know $P(y=1|r_i)$, the class probability conditioned on the rank ($y$ denotes the true label). I don’t normally think about this probability when I think about classifiers, but ranking is very natural to classification; that’s how we draw ROC curves. The shape of $P(y=1|r_i)$ obviously varies from algorithm to algorithm. However, without knowing the specifics of the algorithm, it can be shown that the most we can say about this conditional probability is that it takes a particular sigmoid form.
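To make this concrete, here is a minimal sketch of estimating $P(y=1|r)$ empirically. This is my own toy illustration, not the paper’s procedure: rank the samples by score and average the labels found at each rank position over bootstrap resamples of the test set. The `scores`/`labels` arrays and the resampling scheme are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_class_prob_by_rank(scores, labels, n_boot=2000):
    """Estimate P(y=1 | rank) by averaging labels at each rank position
    over bootstrap resamples of the test set."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)
    counts = np.zeros(n)                   # how often each rank position holds a positive
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the test set with replacement
        order = np.argsort(-scores[idx])   # rank 1 = highest score
        counts += labels[idx][order]
    return counts / n_boot                 # estimated P(y=1 | r) for r = 1..n

# Toy data: well-separated scores give a sharp, step-like P(y=1 | r).
scores = np.concatenate([rng.normal(1.0, 0.5, 50), rng.normal(-1.0, 0.5, 150)])
labels = np.concatenate([np.ones(50), np.zeros(150)])
print(empirical_class_prob_by_rank(scores, labels)[:10])
```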
The authors derived this function by observing a mapping between the classification problem and a problem in quantum physics. Consider a scenario where we have $N$ quantum states (indexed by $i$) and $N_1$ particles. We can envision these quantum states as discrete points on a line, because each state is associated with an energy level. Now, what is the expected number of particles in each of these states? The answer depends on the species of the particles. Fermions are a family of particles that includes quarks and leptons, but in this context we don’t need to worry about quantum characteristics such as spin. All we need to know is that fermions obey Pauli’s exclusion principle, which means each state can hold either 0 or 1 fermion. The expected number of fermions in a state, $\langle n_i \rangle$, is therefore a function bounded by 1. What does this function look like? You guessed it… it’s a sigmoid function called the Fermi-Dirac distribution.
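For reference, the standard form of the Fermi-Dirac distribution for a state with energy $\epsilon_i$, chemical potential $\mu$, and absolute temperature $T$ is

$$\langle n_i \rangle = \frac{1}{e^{(\epsilon_i - \mu)/k_B T} + 1},$$

which is close to 1 for states far below $\mu$ and close to 0 for states far above it.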
So the analogy is (roughly) this: the ranks $r_i$ produced by the classifier are like quantum states, with energy increasing with rank. Positive samples are like fermions, which occupy some of those states (although no state can hold more than one fermion). The analogy is built on the observation that, given a fixed number of fermions and a fixed total energy, the maximum-entropy solution for $\langle n_i \rangle$ can be mapped to our classification problem. Whereas in physics entropy maximization describes the statistical properties of the system, in classification it is a principle of inference. It means that if we don’t know anything about the classification algorithm beyond $N_1$ and $AUC$, the most we can say about its $P(y=1|r)$ is that it has the same form as $\langle n_i \rangle$.
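In code, that claimed functional form is just a sigmoid in the rank. The parametrization below, with `mu` and `T` as free parameters, is a sketch of the shape under the assumption that rank plays the role of energy; it is not the paper’s exact fitting procedure.

```python
import numpy as np

def fermi_dirac_prob(rank, mu, T):
    """Fermi-Dirac-shaped class probability, with rank playing the role of energy.
    mu behaves like the chemical potential (roughly the cutoff rank) and T like
    the temperature (how gradual the transition from 'mostly positive' to
    'mostly negative' is)."""
    return 1.0 / (np.exp((np.asarray(rank) - mu) / T) + 1.0)

ranks = np.arange(1, 201)
print(fermi_dirac_prob(ranks, mu=50, T=5)[:5])   # close to 1 at the top ranks
print(fermi_dirac_prob(ranks, mu=50, T=5)[-5:])  # close to 0 at the bottom ranks
```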
What is funny here is that although I tend to see elementary particles as really weird objects, it is exactly the quantum-ness that makes our scenario non-weird. Without the discreteness of quantum energy, we couldn’t associate ranks with quantum states. Without fermions obeying Pauli’s exclusion principle, two different positive samples could share the same rank, which violates the assumption that the score decreases monotonically with rank. I would never have imagined that to map a simple idea like $P(y=1|r_i)$ to physics we have to turn to fermions, but where else in physics do we get this kind of density function?
Because the Fermi-Dirac distribution depends on two parameters with physical meanings, one of which is $T$, the absolute temperature, we can take this analogy further and talk about the temperature of a classifier. $T$ turns out to be related to $AUC$. Let’s first consider a perfect binary classifier with $AUC=1$. In this situation, positive samples are perfectly separated from negative samples by the classifier, so positive samples, and only positive samples, occupy the lowest ranks. This is like a fermion system at absolute zero, where all the fermions are found in the lowest energy states. At the other extreme, a random classifier with $AUC=0.5$ is like a fermion system at very high temperature ($T$ approaching infinity), where the fermions can be in any state. Indeed, for a random classifier, positive samples can appear at any rank. A better classifier is therefore like a fermion system with lower temperature. This gives a very interesting interpretation of a new ensemble classification algorithm proposed in the paper, which weights the outputs of the individual classifiers by their temperatures.
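To see this correspondence qualitatively, here is a toy simulation of my own (not the paper’s temperature estimator): assign the positive label at each rank with Fermi-Dirac probability and measure the resulting AUC. As the assumed temperature `T` drops, the AUC climbs toward 1; as it grows, the AUC falls toward 0.5. The values of `mu`, `n`, and `n_trials` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc_from_ranks(positive_ranks, negative_ranks):
    """Probability that a random positive outranks (has a lower rank number than)
    a random negative -- the rank-based view of AUC, assuming no tied ranks."""
    pos = np.asarray(positive_ranks)[:, None]
    neg = np.asarray(negative_ranks)[None, :]
    return np.mean(pos < neg)

def simulate_auc(T, n=1000, mu=200, n_trials=20):
    """Draw a positive label at each rank with Fermi-Dirac probability and
    measure the resulting AUC, averaged over a few random trials."""
    ranks = np.arange(1, n + 1)
    p = 1.0 / (np.exp((ranks - mu) / T) + 1.0)
    aucs = []
    for _ in range(n_trials):
        labels = rng.random(n) < p            # True = positive sample at that rank
        aucs.append(auc_from_ranks(ranks[labels], ranks[~labels]))
    return np.mean(aucs)

for T in [1, 10, 100, 1000]:
    print(T, round(simulate_auc(T), 3))       # AUC decreases as T increases
```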
Instead of giving quantum interpretations to classification, we can also turn the tables and ask questions like “What does a false positive or false negative correspond to in quantum physics?” That turns out to be related to the other parameter in the Fermi-Dirac distribution, the chemical potential. For brevity, I won’t go into this topic, although it is quite a useful discussion because it relates to the optimal threshold for classification.
So, I think it’s pretty cool that I was able to get something useful out of this paper with my minimal knowledge of quantum physics. But I wonder whether it is really useful to tell a quantum story about binary classifiers. It depends on the perspective, but for me, analogies always stimulate the imagination. I have spent the entire afternoon today imagining how a system of bosons following the Bose-Einstein distribution, or a classical system following the Boltzmann distribution, might map onto machine learning problems. My undisciplined daydreams didn’t produce anything meaningful, but in the hands of clever people, maybe this analogy can lead to new directions in algorithm development. It might also serve to unify other classes of machine learning algorithms into a broad framework.
The other thing that excites me is that this is yet another example of a deep relationship between physical processes and inference. The former is traditionally seen as a matter of objective reality, while the latter is about subjective beliefs. The classic 1957 paper by physicist E.T. Jaynes, which I found in the references of this paper, argued that the close relationship between the two can have profound, even disturbing, implications. I am out of my depth here, but it just goes to show that this is a particularly stimulating paper.
Note: The opinions in this post (and all posts in this blog) are my own.