Reinforcement Learning

An investigation into Reinforcement Learning (RL), its methods, state-of-the-art techniques, and how to get started.
Reinforcement Learning
Machine Learning
AI
Author

Ravi Kalia

Published

February 6, 2025

Introduction

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning, RL does not require labeled data; instead, it optimizes a reward function through exploration and exploitation. It has lost some steam as an application and research area lately, with generative AI dominating attention.

I’ve worked on Reinforcement Learning a few times for course exercises, but never had time to really dig into it. I also looked into using Meta’s Horizon platform for RL. The idea was to adapt the UI and notifications in experiments conducted with users: each user would receive a different version of the app, with different UI and notifications. We didn’t get the go-ahead to follow through beyond the early-stage investigation, though.

Core Components of RL

RL problems are typically modeled as Markov Decision Processes (MDPs), consisting of:

| Component | Description |
|---|---|
| State (s) | The environment’s representation at a given time. |
| Action (a) | The choice made by the agent. |
| Reward (r) | A scalar feedback signal received after taking an action. |
| Policy (π) | A strategy that maps states to actions. |
| Value Function (V or Q) | Measures the expected return of being in a state or taking an action. |
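
To make these components concrete, below is a minimal sketch of a tiny MDP written as plain Python dictionaries. The states, actions, and rewards are invented for illustration, and the transitions are deterministic (a simplification of the general stochastic case).

# A toy, deterministic MDP: transitions[state][action] -> (next_state, reward)
transitions = {
    "low_battery": {"wait": ("low_battery", -1.0), "recharge": ("charged", 0.0)},
    "charged":     {"wait": ("charged", 1.0),      "recharge": ("charged", 0.0)},
}

# A policy maps states to actions.
policy = {"low_battery": "recharge", "charged": "wait"}

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    return transitions[state][action]

# Roll out the policy for a few time steps.
state = "low_battery"
for t in range(4):
    action = policy[state]
    next_state, reward = step(state, action)
    print(t, state, action, reward)
    state = next_state

A value function would then assign each state the expected discounted sum of these rewards under the policy.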

1. Value-Based Methods

Value-based methods estimate the expected return of states or actions and use it to make optimal decisions.

Q-Learning

Q-learning estimates the action-value function Q(s, a) and updates it using the Bellman equation:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

where:

  • \(\alpha\) is the learning rate.
  • \(\gamma\) is the discount factor.
  • \(s'\) and \(a'\) are the next state and action.

What Does “Q” in Q-Learning Stand For?

The “Q” in Q-learning stands for “Quality”, representing the quality or value of taking a particular action \(a\) in a given state \(s\).

Definition of Q-Value

The Q-value function, \(Q(s, a)\), estimates the expected cumulative reward an agent will receive if it takes action \(a\) in state \(s\) and then follows an optimal policy thereafter. The function satisfies the Bellman equation:

\[ Q(s, a) = \mathbb{E} \left[ r + \gamma \max_{a'} Q(s', a') \right] \]

where:

  • \(r\) is the immediate reward,
  • \(\gamma\) is the discount factor (determining how much future rewards are valued),
  • \(s'\) is the next state,
  • \(a'\) is the next action.

Q-Learning Update Rule

Q-learning iteratively updates its estimate of \(Q(s, a)\) using the Bellman equation:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]

where \(\alpha\) is the learning rate.

Optimal Policy from Q-Values

The optimal policy is derived by selecting the action with the highest Q-value:

\[ \pi^*(s) = \arg\max_a Q(s, a) \]

This means the agent chooses the action that maximizes the expected future reward.
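
In code, the greedy policy for a tabular Q-function is just an argmax over the action dimension. A minimal sketch, assuming a NumPy Q-table of shape (n_states, n_actions) with illustrative values:

import numpy as np

Q = np.array([[0.1, 0.5],
              [0.8, 0.2],
              [0.0, 0.3]])  # 3 states, 2 actions (made-up Q-values)

greedy_policy = np.argmax(Q, axis=1)  # best action index for each state
print(greedy_policy)  # [1 0 1]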

Example Code in Python

Below is a simple Q-learning update step in Python:

import numpy as np

# Initialize Q-table
Q = np.zeros((5, 2))  # 5 states, 2 actions

# Parameters
alpha = 0.1   # Learning rate
gamma = 0.9   # Discount factor

# Sample experience (state, action, reward, next_state)
s, a, r, s_next = 0, 1, 10, 2

# Q-learning update
Q[s, a] = Q[s, a] + alpha * (r + gamma * np.max(Q[s_next, :]) - Q[s, a])

print(Q)
Example: Q-Learning on FrozenLake (Gym)
import numpy as np
import gym

gamma = 0.99  # Discount factor
alpha = 0.1   # Learning rate
epsilon = 0.1  # Exploration rate

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(1000):
    state = env.reset()[0]
    done = False
    while not done:
        action = np.argmax(Q[state]) if np.random.rand() > epsilon else env.action_space.sample()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
Temporal Difference (TD) Learning

TD methods update value estimates using a one-step lookahead:

\[ V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right] \]

This balances immediate and future rewards without requiring a full trajectory.
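
As a companion to the tabular Q-learning code above, here is a minimal TD(0) sketch for state values. The random-walk environment (five states in a row, reward 1 at the right end) is invented purely for illustration:

import numpy as np

n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.9

def sample_transition(s):
    """Hypothetical dynamics: move left or right at random; reward 1 at the right end."""
    s_next = min(max(s + np.random.choice([-1, 1]), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

s = 2
for _ in range(5000):
    s_next, r = sample_transition(s)
    # One-step TD(0) update: bootstrap from the current estimate V(s')
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = 2 if s_next == n_states - 1 else s_next  # restart after reaching the goal

print(V)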

2. Policy-Based Methods

Instead of learning Q(s, a), policy-based methods directly optimize the policy π_θ(a | s), parameterized by θ.

Policy Gradient (PG)

Policy gradient methods use gradient ascent to maximize expected rewards:

\[ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \]

where J(θ) is the expected return under the policy π_θ.

Derivation of \(J(\theta)\) in Policy Gradients

1. Defining the Objective Function

In policy gradient methods, the objective function \(J(\theta)\) represents the expected return (total reward) under the parameterized policy \(\pi_{\theta}\). It is derived from the expected cumulative reward an agent receives when following \(\pi_{\theta}\).

We define \(J(\theta)\) as the expected return over all possible trajectories:

\[ J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ R(\tau) \right] \]

where:

  • \(\tau = (s_0, a_0, s_1, a_1, \dots)\) is a trajectory (a full sequence of states and actions),
  • \(p_{\theta}(\tau)\) is the probability of a trajectory under policy \(\pi_{\theta}\),
  • \(R(\tau)\) is the total reward for a trajectory, usually defined as:

\[ R(\tau) = \sum_{t=0}^{T} \gamma^t r_t \]

where \(r_t\) is the reward at time \(t\), and \(\gamma\) is the discount factor.
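
As a small worked example of \(R(\tau)\), the sketch below computes the discounted return of a made-up reward sequence:

gamma = 0.99
rewards = [1.0, 0.0, 0.0, 5.0]  # r_0 ... r_3 for an illustrative trajectory

# R(tau) = sum_t gamma^t * r_t
R = sum(gamma**t * r for t, r in enumerate(rewards))
print(R)  # 1.0 + 0.99**3 * 5.0 ≈ 5.85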

Using the chain rule, the trajectory probability can be written as:

\[ p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{T} \pi_{\theta}(a_t | s_t) p(s_{t+1} | s_t, a_t) \]

where:

  • \(p(s_0)\) is the initial state distribution,
  • \(\pi_{\theta}(a_t | s_t)\) is the policy,
  • \(p(s_{t+1} | s_t, a_t)\) is the environment transition probability.

2. Policy Gradient Theorem

To optimize the policy, we take the gradient of \(J(\theta)\). Applying the log-derivative trick, \(\nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\), inside the expectation gives:

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ R(\tau) \nabla_{\theta} \log p_{\theta}(\tau) \right] \]

Because the initial state distribution and the transition probabilities do not depend on \(\theta\), only the policy terms survive when differentiating \(\log p_{\theta}(\tau)\):

\[ \nabla_{\theta} \log p_{\theta}(\tau) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \]

Substituting this back, we obtain the final form:

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau} \left[ R(\tau) \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \right] \]

This means that we update the policy in the direction that increases the probability of high-reward actions.

3. Practical Policy Gradient Update Rule

The policy parameters are updated using stochastic gradient ascent:

\[ \theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta) \]

where \(\alpha\) is the learning rate.

4. Example Code in Python (REINFORCE Algorithm)

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple policy network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc = nn.Linear(state_dim, action_dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, state):
        return self.softmax(self.fc(state))

# Initialize policy, optimizer
policy = PolicyNetwork(state_dim=4, action_dim=2)
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Sample data (dummy states and rewards standing in for one collected episode)
states = torch.randn(3, 4)                      # three example states
probs = policy(states)                          # action probabilities π(a|s)
actions = torch.multinomial(probs, 1).squeeze(-1)
log_probs = torch.log(probs.gather(1, actions.unsqueeze(-1)).squeeze(-1))  # log π(a|s)
rewards = [1.0, 1.0, 1.0]                       # rewards collected in the episode

# Compute policy gradient loss (REINFORCE)
R = sum(rewards)  # Total (undiscounted) return
policy_loss = -(log_probs * R).sum()  # Negative sign: gradient ascent via a minimizer

# Backpropagate
optimizer.zero_grad()
policy_loss.backward()
optimizer.step()



Example: Policy Gradient with REINFORCE
import torch
import torch.nn as nn
import torch.optim as optim
import gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2), nn.Softmax(dim=-1))
optimizer = optim.Adam(policy.parameters(), lr=0.01)

def policy_gradient():
    state = env.reset()[0]
    rewards, log_probs = [], []
    done = False
    while not done:
        state = torch.tensor(state, dtype=torch.float32)
        action_probs = policy(state)
        action = torch.multinomial(action_probs, 1).item()
        log_prob = torch.log(action_probs[action])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        rewards.append(reward)
        log_probs.append(log_prob)
    return rewards, log_probs

# Collect one episode at a time, then apply the REINFORCE update
for episode in range(500):
    rewards, log_probs = policy_gradient()
    R = sum(rewards)  # total (undiscounted) return for the episode
    loss = -torch.stack(log_probs).sum() * R  # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Actor-Critic

Actor-Critic methods combine value-based and policy-based approaches:

  • The Actor updates the policy \(\pi_\theta\).
  • The Critic estimates the value function \(V(s)\) to reduce variance.
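
The sketch below shows a single one-step (advantage) actor-critic update in PyTorch. The network sizes and the dummy transition are made up; the point is only to illustrate how the critic’s TD error drives both the critic loss and the actor’s policy-gradient term:

import torch
import torch.nn as nn
import torch.optim as optim

state_dim, action_dim, gamma = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

# One dummy transition (s, a, r, s'); in practice this comes from the environment.
state, next_state = torch.randn(state_dim), torch.randn(state_dim)
reward = torch.tensor(1.0)

probs = actor(state)
action = torch.multinomial(probs, 1).item()
log_prob = torch.log(probs[action])

# Critic: one-step TD error, which also serves as the advantage estimate.
value = critic(state).squeeze()
next_value = critic(next_state).squeeze().detach()
td_error = reward + gamma * next_value - value

critic_loss = td_error.pow(2)               # fit V(s) to the TD target
actor_loss = -log_prob * td_error.detach()  # policy gradient weighted by the advantage

actor_opt.zero_grad()
critic_opt.zero_grad()
(actor_loss + critic_loss).backward()
actor_opt.step()
critic_opt.step()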

3. State-of-the-Art RL Methods

| Method | Type | Description |
|---|---|---|
| DQN (Deep Q-Networks) | Value-Based | Uses deep learning to approximate Q-values. |
| PPO (Proximal Policy Optimization) | Policy-Based | A stable policy optimization algorithm. |
| SAC (Soft Actor-Critic) | Actor-Critic | Introduces entropy regularization for exploration. |
| MuZero | Model-Based | Learns a model of the environment while optimizing actions. |
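
Libraries such as Stable-Baselines3 wrap several of these algorithms behind a common interface. Below is a minimal sketch of training PPO on CartPole, assuming Stable-Baselines3 (v2.x) and Gymnasium are installed:

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)  # MLP actor-critic policy
model.learn(total_timesteps=10_000)       # short run, for illustration only

# Run the trained policy for a few steps
obs, _ = env.reset()
for _ in range(100):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(int(action))
    if terminated or truncated:
        obs, _ = env.reset()

Swapping PPO for DQN (or SAC on a continuous-action environment) follows the same pattern.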

4. Learning Path for RL

Estimated Time to Competency

| Stage | Topics | Duration |
|---|---|---|
| Foundations | MDPs, Q-learning, TD learning | 1-2 months |
| Deep RL | DQN, PPO, Actor-Critic | 2-3 months |
| Advanced Topics | Multi-Agent RL, Meta-RL | 3-4 months |
| Research & Specialization | Real-world applications, new RL methods | Ongoing |

Key Resources

  • Books: Sutton & Barto’s Reinforcement Learning: An Introduction
  • Courses: David Silver’s RL course
  • Libraries: Stable-Baselines3, RLlib, Gym

Criticisms of RL

Yann LeCun has been critical of reinforcement learning (RL) as a general approach to intelligence. Some key points from his statements include:

  1. Inefficiency of RL: He often points out that RL is sample inefficient, requiring millions or even billions of interactions to learn effective policies. This makes it impractical for real-world applications outside of simulated environments.
  2. Limited Applicability: LeCun argues that RL is useful in specific cases, such as game playing (e.g., AlphaGo, Atari games) or robotic control, but it does not scale well to more general AI tasks.
  3. Need for More Efficient Learning: He believes that future AI should rely more on self-supervised learning (SSL) rather than RL. SSL allows models to learn from vast amounts of unlabeled data efficiently, whereas RL often relies on sparse rewards.
  4. RL and Common Sense: LeCun has stated that RL alone cannot give AI systems “common sense” because it lacks the ability to learn structured world models. He advocates for model-based approaches that integrate predictive learning.
  5. Hybrid Approaches: While critical of RL’s limitations, he acknowledges that combining RL with other paradigms, such as self-supervised learning and energy-based models, may be key to building more intelligent systems.

A famous quote from LeCun:

> “Reinforcement learning is like training a dog with a clicker: it takes forever, and the dog never learns to play chess.”

Conclusion

Reinforcement learning is a powerful framework for sequential decision-making. Mastering it requires understanding both theory and practical implementation.


Next Steps

If and when I get time, I would like to read more about Meta-RL and explore its applications. It seems like a fascinating area, even if it is less popular these days with generative models taking over. (That said, RLHF for Large Language Models is itself a specialized application of RL.)

Here are some canonical papers that have shaped the field of reinforcement learning (RL) and are often considered foundational:

  1. Temporal Difference Learning (TD)
     • Title: Learning to Predict by the Methods of Temporal Differences
     • Authors: Richard S. Sutton
     • Year: 1988
     • Summary: This paper introduces Temporal Difference (TD) learning, a core method that combines Monte Carlo methods and dynamic programming to estimate value functions for reinforcement learning tasks.

  2. Q-Learning
     • Title: Learning from Delayed Rewards
     • Authors: Christopher J.C.H. Watkins
     • Year: 1989
     • Summary: Q-learning is introduced in this work, a model-free algorithm for learning the value of state-action pairs without needing a model of the environment. It is a foundational algorithm for value-based RL methods.

  3. Policy Gradient Methods
     • Title: Policy Gradient Methods for Reinforcement Learning with Function Approximation
     • Authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
     • Year: 2000
     • Summary: This paper introduces policy gradient methods with function approximation, which optimize policies directly by updating the policy parameters using the gradient of expected return, paving the way for better solutions in continuous action spaces.

  4. Deep Q-Networks (DQN)
     • Title: Human-level control through deep reinforcement learning
     • Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al.
     • Year: 2015
     • Summary: DQN combines deep learning with Q-learning, using a neural network to approximate the Q-value function. This paper demonstrated human-level performance on several Atari games and marked a significant breakthrough for deep RL.

  5. Asynchronous Advantage Actor-Critic (A3C)
     • Title: Asynchronous Methods for Deep Reinforcement Learning
     • Authors: Volodymyr Mnih, Adrià Puigdomènech Badia, David Silver, et al.
     • Year: 2016
     • Summary: A3C introduces a parallelized framework where multiple agents interact with the environment asynchronously, using actor-critic methods to stabilize learning and improve performance on complex tasks.

  6. Proximal Policy Optimization (PPO)
     • Title: Proximal Policy Optimization Algorithms
     • Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, et al.
     • Year: 2017
     • Summary: PPO is introduced as an efficient and simpler alternative to TRPO (Trust Region Policy Optimization). It aims to balance exploration and exploitation by constraining how much the policy changes at each step, leading to stable training.

  7. AlphaGo
     • Title: Mastering the game of Go with deep neural networks and tree search
     • Authors: David Silver, Aja Huang, et al.
     • Year: 2016
     • Summary: This paper introduces AlphaGo, a system that combines deep learning, Monte Carlo tree search, and reinforcement learning to play Go at a superhuman level. It demonstrates the power of RL in combining planning and learning.

  8. MuZero
     • Title: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
     • Authors: Julian Schrittwieser, Ioannis Antonoglou, David Silver, et al.
     • Year: 2020
     • Summary: MuZero is a model-based approach that learns a model of the environment (dynamics) and uses it to plan actions. Unlike prior methods, MuZero learns the model itself rather than relying on a predefined one, achieving state-of-the-art performance in multiple domains.

  9. AlphaStar
     • Title: Grandmaster level in StarCraft II using multi-agent reinforcement learning
     • Authors: Oriol Vinyals, Igor Babuschkin, et al.
     • Year: 2019
     • Summary: AlphaStar is an RL agent trained to play the complex real-time strategy game StarCraft II. This paper outlines the challenges and methods used to train AlphaStar, including multi-agent learning and hierarchical reinforcement learning.

  10. Model-Based Reinforcement Learning
      • Title: World Models
      • Authors: David Ha, Jürgen Schmidhuber
      • Year: 2018
      • Summary: World Models are introduced here, where an RL agent learns an internal model of the environment that allows it to perform well even with a limited number of real-world interactions, showing how model-based methods can significantly improve sample efficiency.

  11. Trust Region Policy Optimization (TRPO)
      • Title: Trust Region Policy Optimization
      • Authors: John Schulman, Sergey Levine, Philipp Moritz, et al.
      • Year: 2015
      • Summary: TRPO addresses the challenge of large policy updates by introducing a constraint on the size of each update, thus ensuring the new policy remains close to the old policy, which improves the stability of training in policy-based methods.

  12. A Survey of Deep Reinforcement Learning
      • Title: Deep Reinforcement Learning: An Overview
      • Authors: Yuxi Li
      • Year: 2017
      • Summary: A comprehensive review of deep reinforcement learning, covering the evolution of deep RL algorithms, key ideas, and the challenges they aim to solve. It provides a useful resource for understanding the broad scope of deep RL research.

These papers cover the evolution of RL, from value-based methods like Q-learning to policy-based methods, and the integration of deep learning into RL, all the way to model-based approaches like MuZero. Together, they form the backbone of modern RL research.

Have thoughts or questions? Contact me