Overview
This article provides a structured mathematical explanation of attention mechanisms in deep learning, focusing on their application in transformer architectures. We’ll explore how sequences are processed through attention layers and understand the mathematical foundations of these powerful neural network components.
1 Key Concepts
Before diving into the mathematics, let’s establish our key concepts:
- Attention: A mechanism allowing models to focus on relevant parts of input data
- Self-Attention: A specific form where each element in a sequence attends to all others
- Multi-Head Attention: Multiple parallel attention mechanisms working together
- Positional Encoding: Method to incorporate sequential information
2 Data Preprocessing
2.1 From Text to Vectors
The transformation of text into numerical representations involves several steps:
- Tokenization: Convert text into token IDs
- Embedding: Map tokens to dense vectors
- Position Encoding: Add sequential information
Let’s examine each step in detail.
2.2 Tokenization Process
Given an input sentence:
"The cat sat on the mat"
We convert it to token IDs using subword tokenization:
tokens = [101, 2023, 3679, 2003, 2307, 102]  # Example token IDs
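For illustration, here is how this step might look with the Hugging Face transformers library. This is an assumption about tooling, not a requirement: any subword tokenizer works, and the IDs it returns will differ from the example values above.

# Sketch only: assumes the optional "transformers" package is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer.encode("The cat sat on the mat")   # adds [CLS] and [SEP] markers
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))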
2.3 Embedding Layer
The embedding process transforms discrete tokens into continuous vectors:
\[ E \in \mathbb{R}^{V \times d} \]
where:
- \(V\) = vocabulary size
- \(d\) = embedding dimension
For each token \(t_i\), we compute:
\[ x_i = E[t_i] \in \mathbb{R}^d \]
Resulting in input matrix:
\[ X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n \times d} \]
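A minimal sketch of this lookup with PyTorch's nn.Embedding; the vocabulary size, embedding dimension, and token IDs below are illustrative.

import torch
import torch.nn as nn

V, d = 30522, 512                    # illustrative vocabulary size and embedding dimension
E = nn.Embedding(V, d)               # the embedding matrix E in R^{V x d}
token_ids = torch.tensor([[101, 2023, 3679, 2003, 2307, 102]])   # shape (1, n)
X = E(token_ids)                     # shape (1, n, d): one vector x_i per token
print(X.shape)                       # torch.Size([1, 6, 512])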
2.4 Positional Encoding
To preserve sequence order, we add positional encodings:
\[ \begin{aligned} P_{i,2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\ P_{i,2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right) \end{aligned} \]
Final input representation:
\[ X_{\text{final}} = X + P \]
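A minimal sketch of this sinusoidal encoding in PyTorch, assuming an even embedding dimension; the function name is ours, not a library API.

import math
import torch

def sinusoidal_positional_encoding(n: int, d: int) -> torch.Tensor:
    """P[i, 2j] = sin(i / 10000^(2j/d)),  P[i, 2j+1] = cos(i / 10000^(2j/d))."""
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)     # (n, 1)
    freqs = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(position * freqs)
    P[:, 1::2] = torch.cos(position * freqs)
    return P

P = sinusoidal_positional_encoding(n=6, d=512)
# X_final = X + P, broadcasting over the batch dimension when X has shape (batch, n, d)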
3 Self-Attention Mechanism
3.1 Query, Key, and Value Transformations
Self-attention begins by creating three matrices from the input:
\[ \begin{aligned} Q &= X W_Q \\ K &= X W_K \\ V &= X W_V \end{aligned} \]
where \(W_Q, W_K, W_V \in \mathbb{R}^{d \times d}\) are learnable parameters.
3.2 Attention Computation
Compute Attention Scores:
\[ S = \frac{Q K^T}{\sqrt{d}} \]
Apply Softmax:
\[ A = \text{softmax}(S) \]
Compute Weighted Values:
\[ Z = A V \]
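These three steps fit in a few lines of PyTorch. The random tensors below stand in for learned weights and real inputs; this is a sketch of the computation, not a trained model.

import torch
import torch.nn.functional as F

n, d = 6, 512                      # sequence length and model dimension (illustrative)
X = torch.randn(n, d)              # stands in for the embedded, position-encoded input
W_Q = torch.randn(d, d)            # stand-ins for the learned projection matrices
W_K = torch.randn(d, d)
W_V = torch.randn(d, d)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # each (n, d)
S = Q @ K.T / d ** 0.5                 # scaled attention scores, (n, n)
A = F.softmax(S, dim=-1)               # each row sums to 1
Z = A @ V                              # attention output, (n, d)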
4 Multi-Head Attention
4.1 Parallel Attention Heads
For each head \(h = 1, \dots, H\), we compute:
\[ \begin{aligned} Q^{(h)} &= X W_Q^{(h)} \\ K^{(h)} &= X W_K^{(h)} \\ V^{(h)} &= X W_V^{(h)} \end{aligned} \]
4.2 Head Outputs
Each head produces its output:
\[ Z^{(h)} = \text{softmax} \left( \frac{Q^{(h)} (K^{(h)})^T}{\sqrt{d_k}} \right) V^{(h)} \]
where \(d_k = d/H\) is the per-head dimension.
4.3 Combining Head Outputs
Concatenate:
\[ H_{\text{concat}} = [Z^{(1)} \| Z^{(2)} \| \cdots \| Z^{(H)}] \]
Project:
\[ H_{\text{output}} = H_{\text{concat}} W_O \]
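Putting Sections 4.1 to 4.3 together, here is a shape-level sketch with random tensors in place of learned weights; a full module version appears in Section 7.

import torch
import torch.nn.functional as F

n, d, H = 6, 512, 8
d_k = d // H                                                  # per-head dimension (d assumed divisible by H)
X = torch.randn(n, d)

# One (d x d) projection per role, viewed as H blocks of width d_k
Q = (X @ torch.randn(d, d)).view(n, H, d_k).transpose(0, 1)   # (H, n, d_k)
K = (X @ torch.randn(d, d)).view(n, H, d_k).transpose(0, 1)
V = (X @ torch.randn(d, d)).view(n, H, d_k).transpose(0, 1)

A = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (H, n, n)
Z = A @ V                                                     # (H, n, d_k), one output per head

H_concat = Z.transpose(0, 1).reshape(n, d)                    # concatenate heads: (n, H*d_k) = (n, d)
W_O = torch.randn(d, d)
H_output = H_concat @ W_O                                     # final output projection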
5 Training Process
5.1 Gradient Descent Updates
The attention weights are updated using:
\[ \begin{aligned} W_Q &\leftarrow W_Q - \eta \frac{\partial L}{\partial W_Q} \\ W_K &\leftarrow W_K - \eta \frac{\partial L}{\partial W_K} \\ W_V &\leftarrow W_V - \eta \frac{\partial L}{\partial W_V} \end{aligned} \]
5.2 Optimization Strategy
- Use the Adam optimizer for stable training
- Apply gradient clipping to prevent exploding gradients
- Implement learning rate warmup (see the sketch after this list)
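A minimal sketch of that recipe in PyTorch, using the SelfAttention module defined in Section 7 as a stand-in for a full model; the learning rate, warmup length, clipping norm, and the loss_fn/targets placeholders are illustrative choices, not prescribed here.

import torch

model = SelfAttention(d_model=512, n_heads=8)   # stand-in for a network with attention layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Linear warmup of the learning rate over the first warmup_steps updates
warmup_steps = 4000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

def training_step(x, targets, loss_fn):
    """One update: forward, backward, clip, step (loss_fn and targets are placeholders)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
    optimizer.step()
    scheduler.step()
    return loss.item()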
6 Implementation Considerations
When implementing attention mechanisms, consider:
- Memory efficiency
- Numerical stability
- Parallelization opportunities
- Attention masking for padding (see the masking sketch below)
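To illustrate the last point, a common way to mask padded key positions is to set their scores to negative infinity before the softmax. The scores and mask below are random, hypothetical stand-ins.

import torch
import torch.nn.functional as F

scores = torch.randn(2, 8, 6, 6)               # (batch, heads, query, key) raw attention scores
pad_mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # 1 = real token, 0 = padding
                         [1, 1, 1, 0, 0, 0]])  # shape (batch, key)

# Broadcast the mask to (batch, 1, 1, key) and push padded keys to -inf before the softmax
scores = scores.masked_fill(pad_mask[:, None, None, :] == 0, float("-inf"))
attn = F.softmax(scores, dim=-1)               # padded keys receive zero attention weight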
7 Practical Example
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)

        # Linear projections, split into heads: (batch, seq, n_heads, d_k)
        q = self.w_q(x).view(batch_size, -1, self.n_heads, self.d_k)
        k = self.w_k(x).view(batch_size, -1, self.n_heads, self.d_k)
        v = self.w_v(x).view(batch_size, -1, self.n_heads, self.d_k)

        # Transpose for attention computation: (batch, n_heads, seq, d_k)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Compute scaled dot-product attention scores and weights
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn = F.softmax(scores, dim=-1)

        # Apply attention weights to the values
        out = torch.matmul(attn, v)

        # Reshape back to (batch, seq, d_model) and project
        out = out.transpose(1, 2).contiguous()
        out = out.view(batch_size, -1, self.d_model)
        return self.w_o(out)
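A quick shape check of the module; the batch size, sequence length, and dimensions are illustrative.

model = SelfAttention(d_model=512, n_heads=8)
x = torch.randn(2, 6, 512)       # (batch, sequence length, d_model)
out = model(x)
print(out.shape)                 # torch.Size([2, 6, 512])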