CMP’s Substack

Quantifying Prediction Difficulty

Carlos Miguel Patiño — Mon, 23 Dec 2024 12:46:43 GMT

I’ve been working on uncertainty quantification for deep learning models. Most recently, I collaborated on the paper Decision-Focused Uncertainty Quantification, where we explored ways to extend the standard conformal prediction method to generate more useful prediction sets for decision-makers.

As part of the project, we needed to quantify the difficulty of classifying an image using a specific model and connect that difficulty to the set size that resulted from conformal prediction. More specifically, we wanted to prove that difficult images resulted in larger predicted sets. This post discusses the two methods we considered for measuring prediction difficulty and why we chose one over the other.

Entropy of the Softmax Distribution

The intuition behind this method is to leverage the shape of the softmax distribution to determine the certainty of the model’s prediction. If the model is sure about its prediction, the output softmax distribution will have a dominant class with a score of around 0.99, while the other classes will have tiny scores. If the model is unsure which class to assign the image to, the softmax distribution will be mostly flat—i.e., all images will have roughly the same score.

As a former physicist, I can’t help but think of the entropy as a measurement of the flatness of the softmax distribution. We are going to use Shannon’s entropy in the context of information theory, which is given by

The entropy of the softmax distribution will be large if we have a flat distribution because it’s the state of largest uncertainty. In that case, the flat distribution resembles a uniform distribution where we really don’t know which class we should assign to the image.

In contrast, the entropy of the softmax output will be small if we have a peak in one of the classes. Intuitively, this will be where we have less uncertainty about our prediction. Mathematically, we can use the entropy equation to analyze why the entropy will be small. The class that has the peak p_k ~ 1 and log(p_k) ~ 0. For all the other classes, p_k ~ 0. That means all the terms on the sum in the entropy equation will be small.

We can verify this with a toy example where we vary the score of one of the classes and distribute the remaining density across the other classes. The snippet below does exactly that and produces the plot we expect.

import matplotlib.pyplot as plt
import numpy as np

from scipy import stats

true_class_scores = np.linspace(0, 1, 100)
n_classes = 100
entropies = []

for true_class_score in true_class_scores:
    extra_class_score = (1 - true_class_score) / (n_classes - 1)
    extra_classes = [extra_class_score] * (n_classes - 1)
    softmax_output = np.array(
        [true_class_score] + extra_classes
    )
    entropies.append(stats.entropy(softmax_output, base=2))

Score of the True Class

This is a short description because the method is pretty simple. Another way to quantify the difficulty of an image for a model is to quantify the model's expected accuracy. This calculation is easier because we can just use the true label's softmax score to measure the model's expected accuracy for a specific image. The score will not be perfectly calibrated to a probability, but it will be good enough to measure difficulty. If the softmax score is small for the true class, the probability of the model being right for that image is also small.

The Verdict

We ended up using the score of the true class for the following reasons related to how conformal prediction impacts the predicted set:

Conformal prediction, by definition, will output a larger set if the distribution is flatter because we need more classes to reach the conformal score threshold. That means the “flatness” of the distribution is not a good metric because the relationship between difficulty and set size is pre-defined—i.e., we can’t have difficult examples with small sets.
The score of the true class ignores the score of the other classes. If the true class has a low score, it doesn’t necessarily mean the set will be large because other classes may have large scores that reach the threshold with just a few classes. In other words, we could have a small set size and a model that assigns a high score to the wrong class. That means the example is difficult to classify because it confuses the model and has a small set size. Since our objective was to prove that our method had large sets for difficult images, this difficulty definition was perfect for us.

I really enjoyed thinking about this problem because both methods are good ways to quantify the model's uncertainty. However, in conformal prediction, we have a clear winner when determining the effect of prediction difficulty on the set size.

Mean vs. Total Cross-Entropy Loss

Carlos Miguel Patiño — Fri, 08 Nov 2024 15:28:46 GMT

I recently started my MSc in Artificial Intelligence at the University of Amsterdam, so I am reviewing deep learning fundamentals. Those fundamentals include using pure Numpy to write the components of a neural network and use them to train a simple model on CIFAR10 with linear layers and ELU activations. The dataset has 10 classes, so using the softmax function in the last layer and the cross-entropy as the loss function is natural.

I started implementing the forward and backward passes, feeling lucky to have frameworks like PyTorch and JAX that save us from implementing these parts (especially the backward pass) when using neural networks. Everything looked good with the forward and backward passes, so it was time to move on to training the neural network.

Fortunately, the training loop ran quickly and got the first results in a few minutes. Unfortunately, they looked like this:

There was something clearly wrong with the model, and since I wasn’t standing on the shoulders of PyTorch or JAX, I wasn’t sure where to begin to debug.

My first instinct was to implement the model in PyTorch and check layer by layer to see if I was getting the same result as my Numpy implementation. This wasn’t fun. I had to initialize the numpy and PyTorch networks to the same values and make the layer values reproducible. After hours of checking tensors and arrays of outputs and gradients, I convinced myself that the layer implementations had identical results.

Could it be something with the initialization? I was getting warnings about potential overflows and wondered if my implementation of the Kaiming initialization was causing large values and exploding the values and gradients of the network. I tried multiple initialization methods and was still getting the same curve.

It was getting dark outside, so I returned to the pipeline I had replicated in PyTorch and began reading the parameters of each of the layers. I had defaulted to using the sum reduction for the cross-entropy

-np.sum(y_ohe * np.log(x + 1e-6))

which in PyTorch is equivalent to

nn.CrossEntropyLoss(reduction=”sum”)

I realized I was using a parameter different from PyTorch’s default. Changing to the default mean reduction resulted in the plot below. Beautiful.

I then changed my numpy implementation of the cross entropy loss and got the same nice (and expected!) curve above. The implementation now looked like this

(1 / y.shape[0]) * -np.sum(y_ohe * np.log(x + 1e-6))

A small but very consequential change! It basically was the difference between a usable and unusable model, and it is a mistake I could have also made in PyTorch by setting the wrong parameter value.

In retrospect, I should have remembered that using the sum of the loss function brings problems:

Training instability if the batch size is large because the gradient—and therefore the parameter update—becomes large. This is true, especially if you haven’t tuned your learning rate.
You can’t easily compare losses between runs if you change the batch size between them.

Number 1 was the cause of my problems this time, but number 2 was problematic when running multiple experiments.

So, as a reminder to my future self, use the mean loss function unless there is a good reason not to!

Initializing the Bias in Output Layers

Carlos Miguel Patiño — Fri, 01 Mar 2024 13:01:19 GMT

One of Andrej Karpathy’s recommendations in his famous A Recipe for Training Neural Networks is to “initialize well.”

If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.

I have used his advice to initialize my binary classifiers and, so far, have achieved good results. However, I never stopped to test whether initializing the bias with that recommendation helped model performance significantly. This post aims to answer that question with a quick experiment.

The Setup

Dataset: We are going to use the CIFAR10 dataset with a twist. The twist is that we want to benchmark the bias initialization in a binary task, but CIFAR10 has ten classes. We will follow the Hot Dog-Not Hot Dog approach to turn the task into a binary classification problem. We don’t have hot dogs in CIFAR10, so we will train a model that classifies Frog-Not Frog.
Model: We used EfficientNet B1 with ~7.8M parameters available in torchvision. This model strikes a good balance between speed and performance: it is small enough to run multiple experiments quickly but performs well, as measured by its acc@1 of 79.838 on ImageNet. We ran experiments with and without pre-trained ImageNet weights to check whether the pre-training impacted our results.
Training: We trained each model for five epochs and repeated each training process ten times to account for randomness in initialization and training. The optimizer is the vanilla SGD with a learning rate of 0.0001. You can see all the training details in this GitHub link.
Hardware: We ran the experiments in a T4 GPU, which is right for this experiment because the model and the data aren’t too large. Training the two models ten times each costs less than USD 10 on a single T4. I ran everything using Lightning Studios, which has proven great for quickly spinning up instances with GPUs and running experiments.

Results

ImageNet Weights

The first experiment was to train the model using the pre-trained ImageNet weights. The learning curve at epoch 0 shows how initializing the model at the positive class rate starts with a lower loss before training the model. However, the model with the vanilla initialization quickly closes the gap, and both models have identical performance by the 5th training epoch.

Despite both models achieving nearly identical performance after a few epochs, the positive rate initialization avoids the hockey stick shape in the learning curve. This head start in the loss may be crucial for larger datasets or models you can’t afford to train for multiple epochs.

Results for a model initialized with pre-trained ImageNet weights.

Trained without ImageNet Weights

The second experiment was to train the model without the pre-trained ImageNet weights. The first—perhaps expected—conclusion is that we have a harder time training the model without the ImageNet weights. Both initializations achieved less than 0.9 ROC-AUC in the fifth epoch, while the model with ImageNet weights was already close to perfect ROC-AUC at that point in training.

As before, the positive rate initialization has a head start from the model initialized with the vanilla output bias. If we focus on the ROC-AUC learning curve, the red curve has a harder time catching up with the blue curve up to the third epoch. This indicates that the advantage given by the positive rate initialization is more valuable in the case of models without pre-trained weights.

What’s interesting about the plot below is how the vanilla initialization model surpasses the positive rate initialization model. None of the two models have converged, but you would benefit from choosing the vanilla initialization if your FLOPs budget only allowed for training five epochs.

Results for a model initialized without ImageNet weights.

To compare the model with the pre-trained weights, we trained both models without the ImageNet weights until they converged. The plots below are just from training the models twice instead of ten as from the previous two plots.

The conclusion is the same as with the ImageNet weights: both models achieve similar metrics if we can afford to train them to convergence.

Models initialized with pre-trained weights and trained until convergence.

Takeaways

Initializing the bias in the model’s output layer with the positive class rate, as Karpathy mentions in his guide, avoids the hockey stick shape in the learning curve. This head start proves useful when your FLOPs budget can only afford a few training epochs. However, the model initialized with the vanilla method closes the gap quickly for the pre-trained model and after a few epochs for the model trained from scratch.

Also, always use pre-trained weights when possible! The strongest conclusion from this set of runs is that models with pre-trained weights converge significantly faster than models trained from scratch.

Why is Numpy Faster than Pure Python?

Carlos Miguel Patiño — Fri, 02 Feb 2024 13:20:53 GMT

I've used Numpy since my early programming days in Python during my undergrad in physics. I was taught we should always use Numpy when dealing with numeric computations like matrix multiplication. I knew Numpy has optimized C implementations of some operations, but I wanted to dig deeper into what's going on under the hood to make Numpy more efficient than pure Python for matrix-like structures. After all, Python also runs C in the background.

I found good answers in the High Performance Python book, which I decided to summarize in this post.

Vectorization

Let’s assume for a moment that we can have all the data we need to run an operation on the CPU. Vectorization is a process where you apply the same operation simultaneously to multiple elements. For example, you can multiply parts of arrays at once instead of multiplying element by element.

The caveat of vectorized operations is that they run on a different part of the CPU and with different instructions than non-vectorized operations. Python doesn't leverage vectorized operations that are possible in the CPU, so Numpy has specialized code that takes advantage of the vectorizations enabled by the CPU. That’s why vectorization is one of the reasons Numpy is faster than pure Python.

Fragmented vs. Contiguous Memory

Fragmented memory results in irrelevant data (green squares) in each memory transfer, so we need more memory transfers to pass the relevant data (blue squares) to the CPU cache.

We now know that vectorization requires all the data in the CPU. However, CPUs have limited memory, so we need to figure out how to transfer data between the RAM and the CPU’s cache.

Python lists and Numpy arrays handle data differently. Python lists store pointers to the data; meaning lists don't hold the data we care about. Storing pointers of the data allows Python to hold multiple types of data in a list, resulting in our relevant data being fragmented in different memory locations. This fragmentation is fine for most cases, but a potential optimization becomes clear when we understand communications between the RAM and the CPU.

The data we use is initially stored in RAM and then moved to the CPU when we need to run calculations with that data. Communication between the two devices is costly in terms of time, so CPUs have a cache memory where they can store data they know they will require to run the calculations we request. That's why the CPU tries to predict which data will be required and tries to transfer that data to its cache. The CPU usually makes this prediction well (using techniques like branch prediction and pipelining), so the real bottleneck is moving data quickly between the RAM and the CPU cache.

Transferring between the RAM and the CPU cache—also known as the L1/L2 cache—is done by a bus that transfers memory in blocks. We usually want to run an operation over the entire object for data structures like matrices, so we know that, eventually, we want to transfer all the elements in the matrix to the CPU. If the data we want to use is fragmented across our RAM, the transferred blocks will contain pieces that are not relevant to our calculation. If our data is stored in contiguous blocks, most of our data will be relevant to the calculation. That's why we need fewer transfer operations than in the fragmented memory case. Using contiguous memory gets our relevant data to the CPU faster, and this optimization results in faster runtimes. You can see a comparison of the two cases in the diagram below.

Takeaways

Python is a very flexible language, but that flexibility often comes with a price we pay in performance. In our case, Python allows having lists containing elements of different types, which are challenging to store in contiguous memory without causing issues. Numpy then enforces elements in an array to be the same type to speed up calculations by reducing data transfers and leveraging vectorization. That limitation of having arrays of the same data type can be limiting for some use cases but isn’t a problem for numeric computing. That’s why Numpy works great for handling operations between numerical tensors and is faster than pure Python for those cases.

Coming soon

Carlos Miguel Patiño — Fri, 24 Nov 2023 17:55:40 GMT

This is CMP’s Substack.

Subscribe now