You painstakingly stack layers one after another, carefully choosing the activation function, debating whether to add batch norm before or after the activations and whether to go fully convolutional or add a linear layer on top. After several computational and mental epochs, the model is ready.
You feed it a perfectly centered cat picture and the model says 'Yes, it's a cat'. You feed it a dog picture and the model says 'it's a dog'. Then you feed it a picture with a dog, a cat, and Snoop Dogg, and your model goes 🤔 hmm "Insufficient data for a meaningful answer".
You wonder if you could see the parts of your image that contributed the most to your model's decision. Enter Grad-CAM 😎
Gradient-weighted Class Activation Mapping (Grad-CAM) allows us to visualise the parts of an input image that contributed the most to the network's prediction. This is done by generating a heatmap and overlaying it on the input image. The following video shows the heatmap superimposed over a test video, where the intensity of the red colour shows where the model is focusing. The video was generated using an ImageNet-trained VGG16 network without any fine-tuning, as you can tell from the predicted classes.
The background score was generated using an RNN generative model.
So, how do we generate the heatmap? In five steps, followed by a short code sketch:
Step 1: Obtain the gradients of the predicted class score with respect to the feature maps of the last convolutional layer. Why the last convolutional layer? Because its feature maps capture the highest-level representations while still retaining spatial information.
Step 2: Compute the global average pool of the gradients to capture the importance of each feature map for the predicted class.
Step 3: Generate the heatmap by multiplying each feature map by its pooled gradient, taking the channel-wise mean, and passing the output through a ReLU. The ReLU discards the negative values, i.e. the regions that do not positively influence the predicted class.
Step 4: Resize the heatmap to the size of the input image.
Step 5: Overlay the heatmap on the input image by blending it at the desired opacity.
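If you want to follow along in code, here is a minimal sketch of these five steps using tf.keras and an ImageNet-pretrained VGG16, the same network used for the video above. The layer name block5_conv3 (the last conv layer of VGG16), the jet colourmap, and the 0.4 opacity are my own choices for illustration, not prescribed by the Grad-CAM paper.

```python
import numpy as np
import tensorflow as tf
import matplotlib.cm as cm
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

def grad_cam(img, last_conv_layer_name="block5_conv3", alpha=0.4):
    """img: RGB array of shape (224, 224, 3)."""
    model = VGG16(weights="imagenet")

    # Model mapping the input to (last conv feature maps, class predictions)
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    x = preprocess_input(np.expand_dims(img.astype("float32"), axis=0))

    # Step 1: gradient of the predicted class score w.r.t. the feature maps
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(x)
        class_score = preds[:, tf.argmax(preds[0])]
    grads = tape.gradient(class_score, conv_maps)

    # Step 2: global average pool of the gradients -> one weight per channel
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Step 3: weight the feature maps, average over channels, apply ReLU
    heatmap = tf.nn.relu(tf.reduce_mean(conv_maps[0] * pooled_grads, axis=-1))
    heatmap = heatmap / (tf.reduce_max(heatmap) + 1e-8)  # normalise to [0, 1]

    # Step 4: resize the heatmap to the size of the input image
    heatmap = tf.image.resize(heatmap[..., tf.newaxis], img.shape[:2])
    heatmap = heatmap.numpy().squeeze()

    # Step 5: colour the heatmap and blend it with the image at opacity alpha
    coloured = cm.jet(heatmap)[..., :3] * 255.0
    overlay = np.clip(alpha * coloured + (1 - alpha) * img, 0, 255).astype("uint8")
    return overlay, heatmap

# Usage (assumes a local cat.jpg):
# from tensorflow.keras.preprocessing import image
# img = image.img_to_array(image.load_img("cat.jpg", target_size=(224, 224)))
# overlay, heatmap = grad_cam(img)
```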
Grad-CAM is a generalisation of the CAM visualisation technique proposed in this paper. To implement CAM, we need to modify the network architecture by replacing the dense layers with a global average pooling layer so that class-specific feature maps can be recovered from the final layer's weights. This may be undesirable because the change can hurt prediction accuracy and also requires retraining. Grad-CAM generalises CAM without any modification to the network architecture and without retraining.
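For comparison, here is roughly what the original CAM computes, assuming a network that ends in a conv layer followed by global average pooling and a single dense softmax layer, so the dense layer's weight matrix directly supplies the per-class channel weights. This is just an illustrative sketch of the weighted sum, not a full CAM pipeline.

```python
import numpy as np

def cam_heatmap(conv_maps, dense_weights, class_idx):
    """CAM: weight the last conv feature maps by the dense-layer weights
    of the target class (possible only because GAP sits between them).

    conv_maps:     (H, W, C) feature maps of the last conv layer
    dense_weights: (C, num_classes) weights of the final dense layer
    class_idx:     index of the class to visualise
    """
    # Weighted sum of feature maps using the class-specific column of W
    heatmap = conv_maps @ dense_weights[:, class_idx]  # -> (H, W)
    heatmap = np.maximum(heatmap, 0)                   # keep positive evidence
    return heatmap / (heatmap.max() + 1e-8)            # normalise to [0, 1]
```

Grad-CAM recovers essentially the same per-channel weights from gradients (Step 2 above) instead of reading them off a dense layer, which is why it works on unmodified architectures.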
We saw a simple technique, Grad-CAM, for visualising the activation maps for a specific class. It can be used to visually inspect the parts of the input that influence the network's predictions the most, and hopefully make CNN predictions a bit more transparent.
References:
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization [arXiv]
Learning Deep Features for Discriminative Localization [website, arXiv]