Notes on Reinforcement Learning

Comprehensive study notes organized by lecture topic (MDPs, planning/control, policy optimization, exploration, and imitation learning). Work In Progress.

Contents

Part I: Fundamentals

  1. Introduction to RL
    • Overview and motivation
    • Key concepts and terminology
  2. MDPs and Bellman Equations
  3. Policy Evaluation

Part II: Planning and Control

  1. Value Iteration
    • Value iteration algorithm
    • Convergence properties
  2. Policy Iteration and Dynamic Programming
    • Policy iteration algorithm
    • Comparison with value iteration
  3. Continuous Control
    • Control in continuous state/action spaces
  4. Linear Quadratic Regulation (LQR)
  5. Nonlinear Control

Part III: Learning with Function Approximation

  1. Model-based RL
  2. Approximate Policy Iteration
  3. Conservative Policy Iteration
    • Safety in policy updates
    • Monotonic improvement

Part IV: Policy Optimization

  1. Optimization Background
    • Gradient descent fundamentals
    • Optimization landscape
  2. Policy Gradient Methods
  3. Advanced Policy Optimization

Part V: Exploration

  1. Multi-Armed Bandits
  2. Upper Confidence Bound (UCB)
    • UCB algorithm
    • Theoretical analysis
  3. Contextual Bandits
  4. Exploration in MDPs
    • Optimism under uncertainty
    • Count-based exploration

Part VI: Imitation Learning

  1. Behavioral Cloning
    • Supervised imitation learning
    • Distribution mismatch problem
  2. Interactive Imitation Learning
  3. Inverse Reinforcement Learning

Introduction to RL

Reinforcement learning (RL) is a framework for learning to make sequential decisions by interacting with an environment. An agent learns to maximize cumulative reward through trial and error.

Key concepts:

  • Agent: The learner/decision maker
  • Environment: What the agent interacts with
  • State: Current situation of the agent
  • Action: Choices available to the agent
  • Reward: Feedback signal from the environment
  • Policy: Strategy that maps states to actions
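These pieces fit together in a simple interaction loop. A minimal sketch (the 2-state chain environment and the random policy below are made up for illustration):

```python
import random

# Toy environment: two states {0, 1}; the action chooses the next state,
# and being in state 1 while choosing action 1 yields reward 1.
def step(state, action):
    next_state = action
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return next_state, reward

# Agent: a random policy mapping every state to a random action.
random.seed(0)
state, total_reward = 0, 0.0
for t in range(100):
    action = random.choice([0, 1])           # policy picks an action
    state, reward = step(state, action)      # environment transitions, emits reward
    total_reward += reward                   # agent accumulates return
```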

MDPs and Bellman Equations

MDP Basics

A Markov Decision Process (MDP) is defined by the tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):

  • \(\mathcal{S}\): State space
  • \(\mathcal{A}\): Action space
  • \(P(s' \mid s,a)\): Transition dynamics
  • \(R(s,a)\): Reward function
  • \(\gamma \in [0,1]\): Discount factor

Markov property: The future is independent of the past given the present: \(P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} | s_t, a_t)\)

Value functions:

  • State value function \(V^\pi(s)\): Expected return starting from state \(s\) following policy \(\pi\)

    \[V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]\]
  • Action value function \(Q^\pi(s,a)\): Expected return starting from state \(s\), taking action \(a\), then following policy \(\pi\)

    \[Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]

Bellman equations:

\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^\pi(s')]\] \[Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a) + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')]\]

Optimal Policy

The optimal policy \(\pi^*\) maximizes the value function for all states:

\[\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S}\]

Key property: There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies for all states.

Bellman optimality equations:

\[V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^*(s')]\] \[Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a) + \gamma \max_{a'} Q^*(s',a')]\]

The optimal policy can be extracted from \(Q^*\):

\[\pi^*(s) = \arg\max_a Q^*(s,a)\]

State-Action Distributions

The state-action visitation distribution under policy \(\pi\):

\[d^\pi(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P(s_t=s, a_t=a | \pi)\]

This represents the discounted frequency of visiting state-action pair \((s,a)\).


Policy Evaluation

Policy evaluation computes the value function \(V^\pi\) for a given policy \(\pi\).

Exact Policy Evaluation

For finite MDPs, we can solve the Bellman equation directly:

\[V^\pi = (I - \gamma P^\pi)^{-1} R^\pi\]

where \(P^{\pi}\) is the transition matrix under policy \(\pi\) and \(R^{\pi}\) is the reward vector.

Computational complexity: \(O(|\mathcal{S}|^3)\) due to matrix inversion.
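A concrete sketch of the closed-form solve on a made-up 2-state MDP (all numbers hypothetical), using a linear solve rather than an explicit inverse:

```python
import numpy as np

# Toy 2-state MDP under a fixed policy: P_pi[s, s'] are transition
# probabilities, R_pi[s] expected one-step rewards (hypothetical numbers).
gamma = 0.9
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

# V^pi = (I - gamma P_pi)^{-1} R_pi; solve the linear system directly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```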

Approximate Policy Evaluation via an Iterative Algorithm

Iterative policy evaluation (also called Bellman update): \(V_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)

Convergence: \(V_k \rightarrow V^\pi\) as \(k \rightarrow \infty\) (contraction mapping theorem).

Stopping criterion: Stop when \(\max_s \left|V_{k+1}(s) - V_k(s)\right| < \epsilon\)
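A minimal sketch of the iterative scheme on a made-up 2-state MDP (transition matrix and reward vector are hypothetical):

```python
import numpy as np

# Toy 2-state MDP under a fixed policy (hypothetical numbers).
gamma, eps = 0.9, 1e-8
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])

V = np.zeros(2)
while True:
    V_new = R_pi + gamma * P_pi @ V          # Bellman update
    if np.max(np.abs(V_new - V)) < eps:      # stopping criterion
        break
    V = V_new
```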


Value Iteration

Value iteration directly computes the optimal value function by iteratively applying the Bellman optimality operator: \(V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)

Algorithm:

  1. Initialize \(V_0(s)\) arbitrarily for all \(s\)
  2. Repeat until convergence:
    • For each state \(s\): \(V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)
  3. Extract policy: \(\pi(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^*(s')]\)

Convergence: Value iteration converges to \(V^*\) at a rate of \(\gamma\) (geometric convergence).
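The three steps above can be sketched on a toy 2-state, 2-action MDP (all numbers hypothetical):

```python
import numpy as np

# Value iteration on a toy MDP. P[a, s, s'] is the transition kernel,
# R[s, a] the reward (hypothetical numbers).
gamma, eps = 0.9, 1e-10
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.1, 0.9], [0.8, 0.2]]])   # action 1
R = np.array([[1.0, 0.0],                  # rewards in state 0
              [0.0, 2.0]])                 # rewards in state 1

V = np.zeros(2)
while True:
    Q = R + gamma * (P @ V).T              # Q[s, a] backup
    V_new = Q.max(axis=1)                  # Bellman optimality operator
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new
policy = Q.argmax(axis=1)                  # greedy policy extraction
```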


Policy Iteration and Dynamic Programming

Policy iteration alternates between policy evaluation and policy improvement.

Algorithm:

  1. Initialize policy \(\pi_0\) arbitrarily
  2. Repeat until convergence:
    • Policy Evaluation: Compute \(V^{\pi_k}\) (solve Bellman equation)
    • Policy Improvement: Update policy \(\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi_k}(s')]\)

Policy improvement theorem: \(V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)\) for all \(s\).

Convergence: Policy iteration converges in a finite number of iterations (typically faster than value iteration in practice).
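The evaluate/improve alternation can be sketched on a toy 2-state, 2-action MDP (all numbers hypothetical):

```python
import numpy as np

# Policy iteration on a toy MDP (hypothetical numbers).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

policy = np.zeros(2, dtype=int)                # arbitrary initial policy
while True:
    # Policy evaluation: solve the Bellman system for V^{pi_k} exactly
    P_pi = P[policy, np.arange(2)]             # P_pi[s, s'] under policy
    R_pi = R[np.arange(2), policy]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to V^{pi_k}
    Q = R + gamma * (P @ V).T
    policy_new = Q.argmax(axis=1)
    if np.array_equal(policy_new, policy):     # converged: policy is stable
        break
    policy = policy_new
```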


Continuous Control

In continuous control, states and/or actions are continuous (e.g., robot manipulation, autonomous driving).

Challenges:

  • Cannot enumerate all states/actions
  • Function approximation required for value functions and policies
  • Integration over continuous spaces

Approaches:

  • Discretization (coarse approximation)
  • Function approximation (neural networks, linear models)
  • Optimal control methods (LQR, iLQR)

Linear Quadratic Regulation (LQR)

Linear Quadratic Regulators (LQRs)

LQR considers systems with:

  • Linear dynamics: \(x_{t+1} = A_t x_t + B_t u_t + w_t\)
  • Quadratic cost: \(c_t(x_t, u_t) = x_t^\top C_t x_t + u_t^\top D_t u_t\)

where \(x_t\) is state, \(u_t\) is control input (action), and \(w_t\) is noise.

Goal: Find control policy to minimize total cost: \(J = \mathbb{E}\left[\sum_{t=0}^T c_t(x_t, u_t)\right]\)

Optimal Control in LQRs

Solution: The optimal policy is linear: \(u_t^* = -K_t x_t\)

where \(K_t\) is the optimal gain matrix computed via backward recursion:

  1. Initialize: \(P_T = C_T\), so the terminal cost-to-go is \(V_T(x) = x^\top P_T x\)
  2. For \(t = T-1, \dots, 0\):
    • Compute gain: \(K_t = (D_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t\)
    • Update cost-to-go: \(V_t(x) = x^\top P_t x\) where \(P_t = C_t + A_t^\top P_{t+1} (A_t - B_t K_t)\)

Properties:

  • Closed-form solution (no iterative optimization needed)
  • Optimal even with stochastic dynamics
  • Can be solved efficiently (\(O(T)\) time)
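The backward recursion takes only a few lines. A sketch on a hypothetical double-integrator system with identity costs, taking the matrices time-invariant for brevity:

```python
import numpy as np

# Finite-horizon LQR backward pass (hypothetical double-integrator
# dynamics, identity state/control costs, time-invariant matrices).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
C = np.eye(2)            # state cost   x^T C x
D = np.eye(1)            # control cost u^T D u
T = 50

P = C.copy()             # terminal cost-to-go matrix
gains = [None] * T
for t in reversed(range(T)):
    K = np.linalg.solve(D + B.T @ P @ B, B.T @ P @ A)   # gain K_t
    P = C + A.T @ P @ (A - B @ K)                       # Riccati backup
    gains[t] = K

# Optimal control at time t: u_t = -gains[t] @ x_t
```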

Nonlinear Control

Control for Nonlinear Systems (Iterative LQR)

For nonlinear systems \(x_{t+1} = f(x_t, u_t)\), we can use iterative LQR (iLQR):

Algorithm:

  1. Initialize nominal trajectory \((\bar{x}_0, \bar{u}_0, \dots, \bar{x}_T)\)
  2. Repeat:
    • Linearize dynamics around nominal trajectory: \(\delta x_{t+1} \approx A_t \delta x_t + B_t \delta u_t\) where \(A_t = \frac{\partial f}{\partial x}\big|_{\bar{x}_t, \bar{u}_t}\)

      \[B_t = \frac{\partial f}{\partial u}\big|_{\bar{x}_t, \bar{u}_t}\]
    • Quadratize cost function around nominal trajectory
    • Solve LQR on linearized system to get policy correction \(\delta u_t = -K_t \delta x_t\)
    • Forward pass: Execute new policy, get new nominal trajectory
  3. Until convergence

Applications: Trajectory optimization, robotics, autonomous systems.
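The linearization step in the loop above can be sketched with finite differences (toy pendulum-like dynamics; all constants are hypothetical):

```python
import numpy as np

# Finite-difference linearization of nonlinear dynamics around a nominal
# point: the first step of each iLQR iteration.
def f(x, u):
    return np.array([x[0] + 0.1 * x[1],
                     x[1] + 0.1 * (np.sin(x[0]) + u[0])])

def linearize(f, x_bar, u_bar, h=1e-6):
    fx = f(x_bar, u_bar)
    A = np.zeros((2, 2))
    B = np.zeros((2, 1))
    for i in range(2):                    # A_t = df/dx at the nominal point
        dx = np.zeros(2)
        dx[i] = h
        A[:, i] = (f(x_bar + dx, u_bar) - fx) / h
    B[:, 0] = (f(x_bar, u_bar + np.array([h])) - fx) / h   # B_t = df/du
    return A, B

A, B = linearize(f, np.zeros(2), np.zeros(1))
```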


Model-based RL

Model-based RL w/ Generative Model

Idea: Learn a model of the environment dynamics \(\hat{P}(s' \mid s,a)\), then use planning algorithms.

Algorithm:

  1. Collect data: Interact with environment, store transitions \((s, a, r, s')\)
  2. Learn model: Fit \(\hat{P}(s'|s,a)\) using supervised learning
  3. Plan: Use value iteration, policy iteration, or trajectory optimization with \(\hat{P}\)

Advantages:

  • Sample efficient (reuse data)
  • Can plan without interacting with real environment

Challenges:

  • Model errors compound over long horizons
  • High-dimensional state/action spaces difficult to model
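In the tabular case, step 2 reduces to normalized counts. A minimal sketch with a handful of made-up transitions:

```python
import numpy as np

# Tabular model learning: estimate P_hat(s'|s,a) by normalized counts.
n_states, n_actions = 2, 2
counts = np.zeros((n_states, n_actions, n_states))

# Logged (s, a, s') transitions (hypothetical data).
data = [(0, 0, 0), (0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 1, 0)]
for s, a, s_next in data:
    counts[s, a, s_next] += 1

totals = counts.sum(axis=2, keepdims=True)
# Unvisited (s, a) pairs fall back to a uniform distribution.
P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
```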

Approximate Policy Iteration

Supervised Learning & Approximate Policy Iteration

When exact policy iteration is intractable (large/continuous spaces), use function approximation: \(V_\theta(s) \approx V^\pi(s)\)

Q-learning with function approximation: \(\theta \leftarrow \theta - \alpha \left[Q_\theta(s,a) - (r + \gamma \max_{a'} Q_\theta(s', a'))\right] \nabla_\theta Q_\theta(s,a)\), where the bootstrapped target \(r + \gamma \max_{a'} Q_\theta(s',a')\) is held fixed when differentiating (a semi-gradient update).
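A sketch of this update with linear features, treating the bootstrapped target as a constant when differentiating (the usual semi-gradient convention; the features, numbers, and single transition are all hypothetical):

```python
import numpy as np

# One semi-gradient Q-learning step with a linear approximator
# Q_theta(s, a) = theta[a] @ phi(s).
def phi(s):
    return np.array([1.0, s])                  # toy features

n_actions, alpha, gamma = 2, 0.1, 0.9
theta = np.zeros((n_actions, 2))

def q_update(theta, s, a, r, s_next):
    q_next = max(theta[b] @ phi(s_next) for b in range(n_actions))
    target = r + gamma * q_next                # no gradient flows through this
    td_error = theta[a] @ phi(s) - target
    theta = theta.copy()
    theta[a] -= alpha * td_error * phi(s)      # semi-gradient of squared TD error
    return theta

theta = q_update(theta, s=0.5, a=1, r=1.0, s_next=0.2)
```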

Approximate Policy Iteration & Performance Difference Lemma

Performance difference lemma: \(V^{\pi'}(s_0) - V^\pi(s_0) = \frac{1}{1-\gamma} \mathbb{E}_{s\sim d^{\pi'}, a\sim \pi'(\cdot|s)}[A^\pi(s,a)]\)

where \(A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)\) is the advantage function.

Implication: If \(\pi'\) has non-negative advantage everywhere under \(d^{\pi'}\), then \(\pi'\) is at least as good as \(\pi\).


Conservative Policy Iteration

Conservative policy iteration ensures monotonic improvement by limiting policy changes.

Key idea: Avoid destructive policy updates that might hurt performance due to approximation errors.

Approaches:

  • Limit KL divergence between old and new policies
  • Use trust regions (TRPO)
  • Clip policy updates (PPO)

Optimization Background

Gradient descent: \(\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)\)

Stochastic gradient descent (SGD): Use mini-batch estimates of gradient.

Momentum: \(v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t)\) \(\theta_{t+1} = \theta_t - \alpha v_{t+1}\)

Adam: Adaptive learning rates using first and second moment estimates.
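The momentum recursion on a toy ill-conditioned quadratic (the Hessian and hyperparameters below are hypothetical):

```python
import numpy as np

# Gradient descent with momentum on L(theta) = 0.5 * theta^T H theta.
H = np.diag([1.0, 10.0])     # ill-conditioned quadratic
alpha, beta = 0.05, 0.9

theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(300):
    grad = H @ theta                 # exact gradient of the quadratic
    v = beta * v + grad              # momentum accumulation
    theta = theta - alpha * v        # parameter update
```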


Policy Gradient Methods

Gradient Descent & Policy Gradient

Goal: Directly optimize policy \(\pi_\theta(a|s)\) by gradient ascent on expected return: \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\)

Policy gradient theorem: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}\right)\right]\)

REINFORCE algorithm:

  1. Sample trajectories \(\{\tau^i\}\) using \(\pi_\theta\)
  2. Compute returns \(R(\tau^i)\)
  3. Update: \(\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau^i)\, R(\tau^i)\), where \(\nabla_\theta \log \pi_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\)

Variance reduction: Use baseline \(b(s)\) (often \(V(s)\)): \(\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) (Q(s,a) - b(s))\right]\)
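A minimal REINFORCE sketch on a stateless 2-action problem with a softmax policy and no baseline (the reward structure is hypothetical: action 1 pays 1, action 0 pays 0):

```python
import numpy as np

# REINFORCE with a softmax policy over two actions.
rng = np.random.default_rng(0)
theta = np.zeros(2)                    # action preferences
alpha = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)             # sample a one-step "trajectory"
    R = 1.0 if a == 1 else 0.0         # hypothetical reward
    grad_log = -p                      # grad log pi(a) = one-hot(a) - p
    grad_log[a] += 1.0
    theta += alpha * R * grad_log      # REINFORCE update

p_final = softmax(theta)               # policy concentrates on action 1
```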


Advanced Policy Optimization

Trust Region and Natural PG

Trust Region Policy Optimization (TRPO):

Maximize improvement subject to KL constraint:

\[\max_\theta \mathbb{E}_{s,a}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{old}}(s,a)\right]\] \[\text{s.t. } \mathbb{E}_s[KL(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s))] \leq \delta\]

Natural policy gradient:

Uses Fisher information matrix \(F\) as metric:

\[\theta_{t+1} = \theta_t + \alpha F^{-1} \nabla_\theta J(\theta_t)\]

Proximal Policy Optimization (PPO):

Simpler alternative to TRPO using clipped objective:

\[L(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]\]
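The clipped objective in isolation, evaluated on hypothetical (ratio, advantage) pairs; note how the clip caps the incentive to push the ratio far past \(1+\epsilon\):

```python
import numpy as np

# PPO clipped surrogate objective for a batch of probability ratios and
# advantages (hypothetical numbers).
def ppo_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, 1.0, 1.0])
obj = ppo_objective(ratio, adv)    # min(0.5, 0.5), min(1, 1), min(1.5, 1.2)
```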

Multi-Armed Bandits

Introduction to Exploration in RL

Exploration-exploitation tradeoff:

  • Exploitation: Choose actions that maximize immediate reward based on current knowledge
  • Exploration: Try new actions to gain information

Multi-armed bandit (MAB): Simplified RL setting (stateless MDP)

  • \(K\) arms (actions)
  • Each arm \(a\) has unknown reward distribution with mean \(\mu_a\)
  • Goal: Maximize cumulative reward over \(T\) rounds

Regret: Difference between optimal strategy and agent’s performance:

\[\text{Regret}(T) = T\mu^* - \sum_{t=1}^T r_t\]

where \(\mu^* = \max_a \mu_a\).

Algorithms:

  • \(\epsilon\)-greedy: Explore with probability \(\epsilon\), exploit otherwise
  • UCB: Optimism under uncertainty
  • Thompson sampling: Bayesian approach

Upper Confidence Bound (UCB)

Upper Confidence Bound (UCB1) algorithm:

Select action with highest upper confidence bound:

\[a_t = \arg\max_a \left[\hat{\mu}_a(t) + \sqrt{\frac{2\log t}{n_a(t)}}\right]\]

where:

  • \(\hat{\mu}_a(t)\): empirical mean reward of arm \(a\)
  • \(n_a(t)\): number of times arm \(a\) has been pulled

Intuition:

  • First term: exploitation (choose high-reward arms)
  • Second term: exploration bonus (choose uncertain arms)

Regret bound: \(O(\sqrt{KT\log T})\) gap-independent; the gap-dependent bound is \(O\big(\sum_{a:\Delta_a>0} \log T / \Delta_a\big)\), i.e., logarithmic in \(T\).
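A compact UCB1 simulation on a 2-armed Bernoulli bandit (the arm means below are hypothetical):

```python
import numpy as np

# UCB1 on a 2-armed Bernoulli bandit with means 0.3 and 0.7.
rng = np.random.default_rng(0)
means = np.array([0.3, 0.7])
K, T = 2, 2000

counts = np.zeros(K)
sums = np.zeros(K)
for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once first
    else:
        mu_hat = sums / counts                     # empirical means
        bonus = np.sqrt(2 * np.log(t) / counts)    # exploration bonus
        a = int(np.argmax(mu_hat + bonus))
    r = float(rng.random() < means[a])             # Bernoulli reward
    counts[a] += 1
    sums[a] += r
```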


Contextual Bandits

Contextual bandit: Bandit with side information (context) \(x_t\) at each round.

Model: Each arm \(a\) has reward \(r_a(x)\) depending on context \(x\).

Goal: Learn policy \(\pi(x) \rightarrow a\) that maximizes reward.

Linear Contextual Bandits

LinUCB algorithm:

Assume linear reward model: \(r_a(x) = x^\top \theta_a^* + \epsilon\)

Algorithm:

  1. Maintain estimates \(\hat{\theta}_a\) and confidence sets for each arm
  2. At round \(t\), observe context \(x_t\)
  3. Select arm: \(a_t = \arg\max_a \left[x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}\right]\) where \(A_a = \lambda I + \sum_{t:\, a_t = a} x_t x_t^\top\) is the regularized design matrix of arm \(a\)'s observed contexts
  4. Observe reward, update estimates

Regret: \(\tilde{O}(\sqrt{dT})\) where \(d\) is context dimension.
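A minimal LinUCB sketch (two arms, 2-d contexts; all numbers and the ridge regularizer are hypothetical):

```python
import numpy as np

# LinUCB with per-arm ridge-regression estimates.
d, n_arms, alpha_ucb, lam = 2, 2, 1.0, 1.0
A = [lam * np.eye(d) for _ in range(n_arms)]   # regularized design matrices
b = [np.zeros(d) for _ in range(n_arms)]       # reward-weighted context sums

def select_arm(x):
    scores = []
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]               # ridge estimate of theta_a
        width = np.sqrt(x @ A_inv @ x)         # confidence width
        scores.append(x @ theta_hat + alpha_ucb * width)
    return int(np.argmax(scores))

def update(a, x, r):
    A[a] += np.outer(x, x)
    b[a] += r * x

x = np.array([1.0, 0.5])
first = select_arm(x)      # ties broken toward arm 0 when no data yet
update(first, x, 1.0)
second = select_arm(x)
```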


Exploration in MDPs

Challenges in MDP exploration:

  • Delayed rewards
  • Large state spaces
  • Need to explore systematically

Count-based exploration:

Add exploration bonus based on visitation counts:

\[\tilde{R}(s,a) = R(s,a) + \beta \cdot \frac{1}{\sqrt{N(s,a)}}\]

where \(N(s,a)\) is the number of times \((s,a)\) has been visited.

UCB in MDPs (UCRL2):

  • Build confidence sets around transition dynamics and rewards
  • Plan optimistically within confidence sets

Posterior sampling (Thompson sampling for RL):

  • Maintain posterior over MDP parameters
  • Sample MDP from posterior, act optimally in sampled MDP

Behavioral Cloning

Behavioral cloning (BC): Supervised learning from expert demonstrations.

Setting:

  • Expert policy \(\pi^*\)
  • Dataset of expert state-action pairs: \(\mathcal{D} = \{(s_i, a_i^*)\}\)

Algorithm:

\[\min_\theta \sum_{i} \ell(\pi_\theta(s_i), a_i^*)\]

where \(\ell\) is a loss function (e.g., cross-entropy for discrete actions, MSE for continuous).
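With a linear policy and MSE loss, the objective has a closed-form minimizer. A sketch on noiseless synthetic data (the expert mapping and dataset are made up):

```python
import numpy as np

# Behavioral cloning with a linear policy and least squares: the expert is
# a fixed linear map from 2-d states to 1-d continuous actions.
rng = np.random.default_rng(0)
W_expert = np.array([0.5, -1.0])            # unknown to the learner
S = rng.normal(size=(200, 2))               # states visited by the expert
A_star = S @ W_expert                       # expert action labels

# min_W sum_i || S[i] @ W - A_star[i] ||^2  -> ordinary least squares
W_bc, *_ = np.linalg.lstsq(S, A_star, rcond=None)
```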

Problem: Distribution mismatch

  • Training distribution: \(s \sim d^{\pi^*}\)
  • Test distribution: \(s \sim d^{\pi_\theta}\)
  • Small errors compound over time (covariate shift)

Interactive Imitation Learning

DAgger

Dataset Aggregation (DAgger): (Ross, Gordon, & Bagnell, 2011)

Algorithm:

  1. Initialize dataset \(\mathcal{D} = \emptyset\)
  2. For iteration \(i = 1, \dots, N\):
    • Train policy \(\pi_i\) on \(\mathcal{D}\)
    • Execute \(\pi_i\) in environment, collect states \(\{s_j\}\)
    • Query expert for actions \(\{a_j^*\}\) at visited states
    • Aggregate: \(\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_j, a_j^*)\}\)

Key insight: Collect training data from learner’s state distribution \(d^{\pi_i}\), not just expert’s.

Guarantees: Error decreases with more iterations (addresses distribution mismatch).
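The aggregate-and-retrain loop can be sketched on a toy 1-d problem (the scripted expert, dynamics, and linear learner below are all hypothetical):

```python
import numpy as np

# DAgger sketch: linear learner u = w * s, scripted expert u* = -2 s.
expert = lambda s: -2.0 * s

def rollout(w, s0=1.0, horizon=10):
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + 0.1 * (w * s)          # learner's actions drive the dynamics
    return states

D_s, D_a = [], []
w = 0.0                                # initial policy pi_0
for i in range(5):
    states = rollout(w)                # states from the learner's own d^{pi_i}
    D_s += states
    D_a += [expert(s) for s in states] # query expert labels at those states
    X, y = np.array(D_s), np.array(D_a)
    w = float(X @ y / (X @ X))         # least-squares refit on aggregated D
```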

Reference: Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.


Inverse Reinforcement Learning

Maximum Entropy Inverse RL

Inverse RL (IRL): Learn reward function from expert demonstrations.

Maximum entropy IRL:

Model the expert as inducing the maximum-entropy distribution over trajectories, with probability exponential in reward:

\[P(\tau | \theta) \propto \exp\left(r_\theta(\tau)\right)\]

Algorithm:

  1. Initialize reward parameters \(\theta\)
  2. Repeat:
    • Compute optimal policy \(\pi_\theta\) for current reward \(r_\theta\)
    • Compute feature expectations: \(\mathbb{E}_{\pi_\theta}[\phi]\) and \(\mathbb{E}_{\text{expert}}[\phi]\)
    • Update \(\theta\) to match feature expectations: \(\theta \leftarrow \theta + \alpha \left(\mathbb{E}_{\text{expert}}[\phi] - \mathbb{E}_{\pi_\theta}[\phi]\right)\)

Goal: Find reward such that expert policy is optimal.
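One gradient step of the feature-matching update, with hypothetical feature expectations standing in for the quantities computed in the steps above:

```python
import numpy as np

# One max-ent IRL gradient step on the reward parameters. In practice
# E_{pi_theta}[phi] would come from solving the inner RL problem with the
# current reward; both expectations here are hypothetical numbers.
phi_expert = np.array([0.8, 0.1])
phi_policy = np.array([0.5, 0.4])
alpha = 0.5

theta = np.zeros(2)
grad = phi_expert - phi_policy          # feature-matching gradient
theta = theta + alpha * grad
```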