an attempt at blogging by: @deeplearnerd
The eye sees only what the mind is prepared to comprehend.
In the vast realm of deep learning, convolutional neural networks (CNNs) have established themselves as masterful image classifiers. Yet, I've often found myself pondering: How does a CNN decide what it sees? This curiosity led me down the rabbit hole of model interpretability and, eventually, to the doorstep of Gradient-weighted Class Activation Mapping (Grad-CAM).
In this blog, I'll try to guide you through my journey of understanding and implementing Grad-CAM. We'll explore the math, unravel the intuition, and craft code that brings transparency to our deep learning models.
credits: @mrsiipa
Convolutional Neural Networks (CNNs) are powerful tools for image classification, but they're often treated as mysterious black boxes: it's hard to understand how they make decisions, and that problem reaches well beyond computer science.
Let's look at medical imaging as an example. When a CNN identifies an image as showing a disease, it's crucial to know why it made that decision. Without this understanding, there's a risk of mistakes that could affect patient care. This shows why we need ways to see how these complex systems work.
This is where Gradient-weighted Class Activation Mapping (Grad-CAM) comes in. Grad-CAM is a technique that helps us see inside the CNN. It shows us which parts of an image the model focuses on when making a decision. This helps answer a big question: What does the model actually see when it looks at an image?
Grad-CAM makes CNNs more transparent. This is important not just for satisfying our curiosity, but also for using AI more responsibly in important areas like healthcare and self-driving cars. By understanding how these models work, we can trust them more and use them more effectively.
Grad-CAM serves as an insightful visualisation tool, illuminating the decision-making process of convolutional neural networks (CNNs). By tracing the gradients of any target class back to the final convolutional layer, Grad-CAM creates a localisation map, effectively showing where in the image the model focuses to make its prediction.
The choice of the final convolutional layer is grounded in both the mathematical design and architectural principles of CNNs:
Spatial-Feature Mapping: Convolutional layers inherently retain spatial information. For a feature map $A^k$ with width $u$ and height $v$, each neuron at position $(i,j)$ corresponds to a specific region in the input image:
$$ A_{ij}^k \leftrightarrow \text{Region}(i, j) \text{ in input} $$
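To see this spatial correspondence concretely, here's a minimal PyTorch sketch. VGG16 from torchvision is just an illustrative choice (any CNN backbone behaves the same way), and the dummy input stands in for a real preprocessed image:

```python
import torch
from torchvision import models

# weights=None keeps the sketch self-contained; in practice you would
# load pretrained weights before trying to interpret the model.
model = models.vgg16(weights=None).eval()

# A dummy 224x224 RGB image standing in for a real, preprocessed input.
x = torch.randn(1, 3, 224, 224)

# model.features holds the convolutional blocks; its output is the stack
# of feature maps A^k from the final convolutional stage.
A = model.features(x)
print(A.shape)  # torch.Size([1, 512, 7, 7])

# Each of the 7x7 positions (i, j) maps back to a region of the input:
# the grid tiles the image with stride 32, and each cell's receptive
# field covers an even larger patch around it.
```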
Preservation of Spatial Relationships: Convolutional layers capture spatial relationships, while fully connected layers flatten them away, discarding those details. The final convolutional layer therefore offers the best compromise: it encodes high-level, class-discriminative semantics while still retaining the spatial information needed to localise them.
The core idea behind Grad-CAM rests on the chain rule from calculus. For a particular class $c$, we compute the gradient of its score $y^c$ (the logit, before the softmax) with respect to each feature map $A^k$:
$$ \frac{\partial y^c}{\partial A_{ij}^k} $$
This gradient reflects how much the class score would change if we slightly altered the activation at each location $(i,j)$ of $A^k$. High gradients highlight regions that significantly impact the network's prediction.
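Here's a hedged sketch of how one might read these gradients off a model in PyTorch, using forward and backward hooks. The layer index `model.features[28]` is specific to torchvision's VGG16 (its last conv layer); pick the final convolutional layer of whatever architecture you're inspecting:

```python
import torch
from torchvision import models

model = models.vgg16(weights=None).eval()  # load trained weights in practice

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    # Stores A^k from the forward pass.
    activations["A"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    # Stores dy^c / dA^k from the backward pass.
    gradients["dA"] = grad_output[0].detach()

# Hook the last conv layer of VGG16 (index 28 in model.features).
target_layer = model.features[28]
target_layer.register_forward_hook(save_activation)
target_layer.register_full_backward_hook(save_gradient)

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
scores = model(x)                 # class scores y (pre-softmax logits)
c = scores.argmax(dim=1).item()   # the class we want to explain
scores[0, c].backward()           # triggers the backward hook

print(activations["A"].shape)     # torch.Size([1, 512, 14, 14])
print(gradients["dA"].shape)      # same shape: one gradient per (k, i, j)
```

Every entry of `gradients["dA"]` is exactly the quantity above: how sensitive the class score is to the activation at that channel and spatial location.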