an attempt at blogging by: @deeplearnerd
True understanding arises not from external instruction, but from the silent dialogues between curiosity and discovery.
This principle comes alive in Self-Supervised Learning (SSL), where models learn from data without human labels. Bootstrap Your Own Latent (BYOL) takes this further - it learns by comparing different views of the same data, essentially having a dialogue with itself.
I'm assuming there's some familiarity with Supervised and Unsupervised Learning. In short, supervised learning uses labeled data, while unsupervised learning finds patterns without labels. SSL bridges the gap, allowing models to create their own labels from data patterns.
In this post, I'll try to explain how machines, much like curious students, teach themselves.
Self-supervised learning (SSL) comes in two types: contrastive and non-contrastive. Contrastive methods like SimCLR and MoCo learn by comparing an image (say, an otter 🦦) with both similar views of that image and different images (like a pineapple or car). They need these negative examples and large batches to work well. Non-contrastive methods like BYOL and SimSiam learn just by comparing different views of the same image - like an otter photo cropped differently or in different colours.
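To make the "different views of the same image" idea concrete, here's a minimal sketch of how two views of one image are generated with torchvision-style augmentations. The specific crop sizes and jitter strengths below are illustrative assumptions, not BYOL's exact augmentation recipe, and the random array stands in for a real photo:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Illustrative augmentation pipeline (parameters are assumptions, not BYOL's exact recipe)
augment = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.2, 1.0)),  # random crop, resized to 96x96
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),          # brightness/contrast/saturation/hue
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# A stand-in image (in practice, a real photo, e.g. of an otter)
image = Image.fromarray(np.uint8(np.random.rand(128, 128, 3) * 255))

# Two random "views" of the same image: this is a positive pair.
view_1 = augment(image)  # e.g. a tight crop with colour jitter
view_2 = augment(image)  # a different crop/colouring of the same image

# Contrastive methods (SimCLR, MoCo) would additionally contrast each view against
# other images in the batch (negatives); non-contrastive methods like BYOL use
# only the positive pair.
print(view_1.shape, view_2.shape)  # torch.Size([3, 96, 96]) each
```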
This blog focuses on non-contrastive learning, specifically BYOL, because it's simpler to use (no negative samples needed), works with less data, and can run efficiently on smaller computers. Plus, its clever design offers valuable insights into how machines can truly learn on their own.
Contrastive methods like SimCLR and MoCo face two main challenges.
In self-supervised learning (SSL), collapse occurs when the model generates the same output for all inputs, losing the ability to learn meaningful differences. For example, instead of distinguishing between cats and dogs, the model outputs the same feature for both.
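To see why collapse is so tempting for a model trained only to make views agree, here's a toy sketch. The constant "encoder" below is a deliberately degenerate example (an assumption for illustration, not any real architecture): it achieves perfect agreement between any two inputs while learning nothing about them.

```python
import torch
import torch.nn.functional as F

class CollapsedEncoder(torch.nn.Module):
    """A degenerate encoder that ignores its input entirely."""

    def __init__(self, dim=8):
        super().__init__()
        self.constant = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Same output regardless of whether x is a cat, a dog, or pure noise.
        return self.constant.expand(x.shape[0], -1)

encoder = CollapsedEncoder()
cats = torch.randn(4, 3, 32, 32)  # pretend batch of cat images
dogs = torch.randn(4, 3, 32, 32)  # pretend batch of dog images

z_cats, z_dogs = encoder(cats), encoder(dogs)

# Cosine similarity between any two outputs is a perfect 1.0, so a
# "maximise agreement between views" objective is trivially minimised,
# yet the features carry no information about the input.
print(F.cosine_similarity(z_cats, z_dogs))  # tensor([1., 1., 1., 1.], ...)
```

Contrastive methods avoid this trap by pushing negatives apart; how BYOL avoids it without any negatives is the interesting part.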