Thoughts on Unsupervised Learning

machine-learning
development
statistics
unsupervised-learning
clustering
dimensionality-reduction
Author

Ravi Kalia

Published

January 8, 2026

Made with ❤️ and Cursor


Introduction

Unsupervised Learning (UL) can feel a bit prickly. Unlike Supervised Learning, which follows a clear paradigm of mapping inputs to targets, UL often feels like a disconnected bag of tricks. It is rare to find expositions that link the various algorithms—from K-Means to Large Language Models—back to a single underlying principle or probabilistic model.

After meditating on this fragmentation, I’ve organized a few thoughts to bridge these gaps.

The Core Shift: From Labels to Structure

In Supervised Learning, the goal is clear: minimize the error between a prediction and a given label. In Unsupervised Learning, we have no labels. We only have the data, \(X\).

Consequently, the goal shifts from mapping to structure discovery. We aren’t asking “What is the answer?”; we are asking “What is the shape of this data?”

This search for shape generally relies on two fundamental concepts:

1. Distance: Classically, we compare examples to one another. If two data points are “close” (Euclidean distance, cosine similarity), they likely share an underlying property.
2. Probability: We attempt to model the underlying probability distribution, \(P(X)\), that generated the data.
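To make the two concepts concrete, here is a minimal sketch on synthetic data (the dataset, library choices, and variable names are illustrative assumptions, not taken from any particular reference): it measures the distance between two examples and then estimates a density for \(P(X)\) that can score points.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine
from scipy.stats import gaussian_kde

# Synthetic 2-D data standing in for "the data, X"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Concept 1 -- Distance: compare two examples directly
d_euclidean = euclidean(X[0], X[1])
d_cosine = cosine(X[0], X[1])        # cosine *distance* = 1 - cosine similarity

# Concept 2 -- Probability: estimate P(X) and score a point under it
density = gaussian_kde(X.T)          # kernel density estimate of P(X)
log_p_first = np.log(density(X[:1].T))

print(d_euclidean, d_cosine, log_p_first)
```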

A Unified Taxonomy of Unsupervised Learning

Rather than listing algorithms randomly, we can categorize them by how they manipulate the feature space and the probability density.

1. Feature Manipulation

This category focuses on changing the representation of the data to make it more useful.

* Dimensionality Reduction (Compression): Finding a lower-dimensional representation that preserves the most important structure.
  * Linear: Principal Component Analysis (PCA).
  * Non-Linear (Manifold Learning): Autoencoders, t-SNE, UMAP.
* Dimensionality Expansion (Projection): Projecting data into higher dimensions to make it linearly separable or richer.
  * Random Projections.
  * Kernel Methods: Used in SVMs and RBFs.
  * Feature Maps: The internal expansions seen in Convolutional Neural Networks (CNNs).
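As a rough illustration of both directions, here is a sketch using scikit-learn. The dataset, the target dimensions, and the choice of an approximate RBF kernel feature map as the “expansion” are my own illustrative assumptions, not the only options.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))   # 500 points in 50 dimensions

# Compression: linear dimensionality reduction with PCA
X_low = PCA(n_components=2).fit_transform(X)    # shape (500, 2)

# Expansion: an (approximate) RBF kernel feature map into a richer space
X_high = RBFSampler(gamma=1.0, n_components=300, random_state=0).fit_transform(X)  # shape (500, 300)

print(X_low.shape, X_high.shape)
```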

2. Clustering

Grouping data points based on distance metrics. This is arguably the oldest form of structure discovery.

* Examples: K-Means, DBSCAN, Hierarchical Clustering.
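A quick sketch of two of these on toy data (the blob dataset and the hyperparameters are illustrative):

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Toy data with a group structure the algorithms never get to see as labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)   # density-based, no K required

print(kmeans_labels[:10], dbscan_labels[:10])
```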

3. Probability Density Modeling

Explicitly or implicitly learning the mathematical function that describes how the data was generated.

* Explicit Density: We can calculate the exact likelihood of a data point (e.g., Gaussian Mixture Models).
* Implicit/Approximate Density: We cannot calculate the likelihood easily, but we can sample from the model (e.g., GANs, and Variational Autoencoders (VAEs), which only bound the likelihood).
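The explicit case is easy to demonstrate: a Gaussian Mixture Model exposes both the log-likelihood of any point and a sampler. This is a minimal sketch on synthetic data; the number of components is an arbitrary choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Explicit density: fit P(X) as a mixture of Gaussians
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

log_likelihoods = gmm.score_samples(X)   # log P(x) for every data point
new_points, _ = gmm.sample(5)            # draw new samples from the learned density

print(log_likelihoods[:3], new_points.shape)
```

An implicit model (a GAN, say) would only give us the second operation: sampling, without a tractable likelihood.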

4. Self-Supervised Learning (The Modern Paradigm)

This is the bridge to modern AI. Instead of human labels, we generate labels from the data itself by masking parts of the input and trying to predict them.

* Examples: Masked Language Modeling (BERT), Next-Token Prediction (GPT).
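The label-manufacturing step is simple enough to show directly. Below is a sketch of both pretext tasks on a hypothetical batch of token IDs; the mask rate, the mask ID, and the ignore-index of -100 are illustrative conventions, not prescribed by any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of token IDs: 4 sequences of length 16
tokens = rng.integers(low=5, high=1000, size=(4, 16))
MASK_ID = 0

# Masked prediction (BERT-style): hide ~15% of tokens, keep the originals as targets
mask = rng.random(tokens.shape) < 0.15
inputs = np.where(mask, MASK_ID, tokens)     # what the model sees
targets = np.where(mask, tokens, -100)       # loss computed only at masked positions

# Next-token prediction (GPT-style): shift the sequence by one position
ntp_inputs, ntp_targets = tokens[:, :-1], tokens[:, 1:]
```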

Generative AI: The Evolution of Density Modeling

It is crucial to understand that Generative AI is fundamentally Unsupervised Learning.

Foundation models (like GPT-4 or Llama) are essentially doing probability density modeling on an internet-scale dataset. By autoregressively predicting the next token, the model learns the joint probability distribution of sequences of text.
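Written out, next-token prediction is just the chain rule applied to the joint distribution of a token sequence \(x_1, \dots, x_T\):

\[
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}),
\]

and training maximizes this (log-)likelihood over the corpus.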

Note: While Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune these models (a supervised/RL step), the “intelligence” and world knowledge come from the massive unsupervised pre-training phase.

A New Perspective: Clustering as Extreme Reduction

Here is a mental model to link these distinct categories: Clustering is just extreme, discretized dimensionality reduction.

Consider “Manifold Learning” (non-linear reduction). We usually reduce a complex input into a continuous lower-dimensional latent space (e.g., a 128-dimensional vector).

* If we reduce that output all the way down to 1 dimension…
* And we discretize that dimension into one of \(K\) integer values…
* We have effectively reinvented Clustering.

In this view, a cluster assignment is just a latent vector that has been maximally compressed into a single integer. This highlights that these “different” algorithms are often just variations of the same goal: compressing information into a coherent latent structure.
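Here is a toy sketch of that analogy (not a formal equivalence): compress blob data to one PCA dimension, discretize it into \(K\) bins, and compare the resulting assignments with ordinary K-Means. The dataset, \(K = 3\), and the binning strategy are all illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# "Extreme" reduction: 1 continuous latent dimension, then discretize into K = 3 values
z = PCA(n_components=1).fit_transform(X)
binned = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans").fit_transform(z).ravel()

# Ordinary clustering on the raw data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# How similar are the two partitions? (1.0 means identical up to relabeling)
print(adjusted_rand_score(binned, kmeans))
```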