by: https://x.com/deeplearnerd
In the previous two blogs, we laid the groundwork for computer vision - from basic image processing to detecting patterns in images. We explored how computers understand edges, identify key features, and recognise patterns using traditional techniques. But now it's time to dive into what's revolutionised computer vision in the past decade. We're moving from manually crafted features to letting machines learn their own representations. In this final blog, we'll explore the world of deep learning in computer vision, covering:
Probably one of the most common blog/tutorial topics you'll find on the internet. Convolutional Neural Networks are something people really love to write about and use, in my opinion because they are highly intuitive to understand and, at the same time, super effective at solving most computer vision tasks.
I'll just link to some amazing resources I came across while researching this blog (since our blog only brushes over the basics, you can check these out for the details):
At their core, CNNs are a type of artificial neural network specifically designed for processing grid-structured data like images. Unlike traditional fully connected networks, CNNs exploit the spatial structure of images, making them highly effective at recognising patterns such as edges, textures, and shapes.
To understand CNNs, it's essential to grasp their primary components and how they interact to process visual information:
The convolutional layer is the foundation of a CNN. It applies a set of learnable filters (or kernels) to the input image or the output from the previous layer. Each filter slides over the input spatially, performing element-wise multiplication and summation to produce a feature map. These feature maps highlight the presence of specific features (like edges or patterns) in different regions of the input.
We already covered the maths and code behind convolution in the previous blog.
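As a quick refresher (and to keep this post self-contained), here's a minimal NumPy sketch of the operation described above: a single hand-picked 3×3 kernel slid over a small image with stride 1 and no padding. In a real CNN the kernel values are learned; the vertical-edge kernel below is just an illustrative choice.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D image (stride 1, no padding).

    Each output value is the element-wise product of the kernel with
    one image patch, summed up: one entry of a feature map.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)  # multiply-and-sum
    return out

# Toy input with a vertical intensity edge down the middle
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# A hand-picked vertical-edge detector (a real CNN would learn this)
vertical_edge = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

print(conv2d(image, vertical_edge))
# [[3. 3.]
#  [3. 3.]]
```

The output lights up wherever the kernel's pattern (a left-to-right intensity change) appears, which is exactly what a feature map records.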
Key Concepts:
After the convolution operation, an activation function like the Rectified Linear Unit (ReLU) is applied element-wise to introduce non-linearity into the model. ReLU turns every negative value into zero while keeping positive values unchanged, enabling the network to learn complex, non-linear patterns.
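As a minimal sketch (again in NumPy), ReLU is nothing more than an element-wise max with zero:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negative values become 0, positives pass through."""
    return np.maximum(0.0, x)

feature_map = np.array([[-3.0, 1.5],
                        [ 0.0, -0.5]])
print(relu(feature_map))
# [[0.  1.5]
#  [0.  0. ]]
```

Applying this after every convolutional layer is what lets a stack of otherwise linear convolutions approximate non-linear functions.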