Notes on Reinforcement Learning
Comprehensive study notes organized by lecture topic (MDPs, planning/control, policy optimization, exploration, and imitation learning). Work In Progress.
Contents
Part I: Fundamentals
- Introduction to RL
- Overview and motivation
- Key concepts and terminology
- MDPs and Bellman Equations
- Policy Evaluation
Part II: Planning and Control
- Value Iteration
- Value iteration algorithm
- Convergence properties
- Policy Iteration and Dynamic Programming
- Policy iteration algorithm
- Comparison with value iteration
- Continuous Control
- Control in continuous state/action spaces
- Linear Quadratic Regulation (LQR)
- Nonlinear Control
- Iterative LQR
- Trajectory optimization
Part III: Learning with Function Approximation
- Model-based RL
- Model-based RL with generative models
- Sample complexity
- Approximate Policy Iteration
- Conservative Policy Iteration
- Safety in policy updates
- Monotonic improvement
Part IV: Policy Optimization
- Optimization Background
- Gradient descent fundamentals
- Optimization landscape
- Policy Gradient Methods
- Policy gradient theorem
- REINFORCE algorithm
- Actor-critic methods
- Advanced Policy Optimization
- Trust region methods
- Natural policy gradient
- TRPO and PPO
Part V: Exploration
- Multi-Armed Bandits
- Exploration-exploitation tradeoff
- Regret bounds
- Upper Confidence Bound (UCB)
- UCB algorithm
- Theoretical analysis
- Contextual Bandits
- Problem formulation
- Linear contextual bandits
- Exploration in MDPs
- Optimism under uncertainty
- Count-based exploration
Part VI: Imitation Learning
- Behavioral Cloning
- Supervised imitation learning
- Distribution mismatch problem
- Interactive Imitation Learning
- DAgger algorithm
- Dataset aggregation
- Inverse Reinforcement Learning
- Maximum entropy IRL
- Learning reward functions
Introduction to RL
Reinforcement learning (RL) is a framework for learning to make sequential decisions by interacting with an environment. An agent learns to maximize cumulative reward through trial and error.
Key concepts:
- Agent: The learner/decision maker
- Environment: What the agent interacts with
- State: Current situation of the agent
- Action: Choices available to the agent
- Reward: Feedback signal from the environment
- Policy: Strategy that maps states to actions
MDPs and Bellman Equations
MDP Basics
A Markov Decision Process (MDP) is defined by the tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):
- \(\mathcal{S}\): State space
- \(\mathcal{A}\): Action space
- \(P(s' \mid s,a)\): Transition dynamics
- \(R(s,a)\): Reward function
- \(\gamma \in [0,1]\): Discount factor
Markov property: The future is independent of the past given the present: \(P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} | s_t, a_t)\)
Value functions:
State value function \(V^\pi(s)\): Expected return starting from state \(s\) following policy \(\pi\)
\[V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]\]
Action value function \(Q^\pi(s,a)\): Expected return starting from state \(s\), taking action \(a\), then following policy \(\pi\)
\[Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]
Bellman equations:
\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^\pi(s')]\] \[Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a) + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')]\]
Optimal Policy
The optimal policy \(\pi^*\) maximizes the value function for all states:
\[\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S}\]
Key property: There exists an optimal policy \(\pi^*\) that is better than or equal to all other policies for all states.
Bellman optimality equations:
\[V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^*(s')]\] \[Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a) + \gamma \max_{a'} Q^*(s',a')]\]
The optimal policy can be extracted from \(Q^*\):
\[\pi^*(s) = \arg\max_a Q^*(s,a)\]
State-Action Distributions
The state-action visitation distribution under policy \(\pi\):
\[d^\pi(s,a) = (1-\gamma)\sum_{t=0}^\infty \gamma^t P(s_t=s, a_t=a | \pi)\]
This represents the discounted frequency of visiting state-action pair \((s,a)\).
Policy Evaluation
Policy evaluation computes the value function \(V^\pi\) for a given policy \(\pi\).
Exact Policy Evaluation
For finite MDPs, we can solve the Bellman equation directly:
\[V^\pi = (I - \gamma P^\pi)^{-1} R^\pi\]where \(P^{\pi}\) is the transition matrix under policy \(\pi\) and \(R^{\pi}\) is the reward vector.
Computational complexity: \(O( |\mathcal{S}|^3)\) due to matrix inversion.
Approximate Policy Evaluation via an Iterative Algorithm
Iterative policy evaluation (also called Bellman update): \(V_{k+1}(s) \leftarrow \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)
Convergence: \(V_k \rightarrow V^\pi\) as \(k \rightarrow \infty\) (contraction mapping theorem).
Stopping criterion: Stop when \(\max_s \left|V_{k+1}(s) - V_k(s)\right| < \epsilon\)
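The iterative backup and stopping criterion can be sketched in NumPy. The array conventions here (P as an \(S \times A \times S\) tensor, R as \(S \times A\), pi as \(S \times A\)) are my own choice, not from the notes:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, eps=1e-8):
    """Iterative policy evaluation on a tabular MDP.
    P: (S, A, S) transition probabilities; R: (S, A) rewards;
    pi: (S, A) policy probabilities pi(a|s)."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # One Bellman backup: V(s) = sum_a pi(a|s) [R(s,a) + gamma sum_s' P(s'|s,a) V(s')]
        V_new = np.einsum("sa,sa->s", pi, R + gamma * np.einsum("sap,p->sa", P, V))
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new
```

Since the backup is a \(\gamma\)-contraction, the stopping criterion also bounds the distance to \(V^\pi\) (up to a factor \(\gamma/(1-\gamma)\)).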
Value Iteration
Value iteration directly computes the optimal value function by iteratively applying the Bellman optimality operator: \(V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)
Algorithm:
- Initialize \(V_0(s)\) arbitrarily for all \(s\)
- Repeat until convergence:
- For each state \(s\): \(V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\)
- Extract policy: \(\pi(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^*(s')]\)
Convergence: Value iteration converges to \(V^*\) geometrically: \(\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\), since the Bellman optimality operator is a \(\gamma\)-contraction in the sup norm.
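A minimal NumPy sketch of the algorithm above, reusing the same (hypothetical) tabular array conventions — P as \(S \times A \times S\), R as \(S \times A\):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Value iteration; returns (V*, greedy policy)."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("sap,p->sa", P, V)  # Q_k(s,a)
        V_new = Q.max(axis=1)                         # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)            # extract greedy policy
        V = V_new
```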
Policy Iteration and Dynamic Programming
Policy iteration alternates between policy evaluation and policy improvement.
Algorithm:
- Initialize policy \(\pi_0\) arbitrarily
- Repeat until convergence:
- Policy Evaluation: Compute \(V^{\pi_k}\) (solve Bellman equation)
- Policy Improvement: Update policy \(\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi_k}(s')]\)
Policy improvement theorem: \(V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)\) for all \(s\).
Convergence: Policy iteration converges in a finite number of iterations (typically faster than value iteration in practice).
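The two alternating steps can be sketched as follows (deterministic policies, exact evaluation by linear solve; array layout is my assumption):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Exact policy iteration: evaluation by linear solve, greedy improvement."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)          # deterministic policy, arbitrary init
    while True:
        # Policy evaluation: V = (I - gamma P^pi)^{-1} R^pi
        Ppi = P[np.arange(S), pi]        # (S, S) transition matrix under pi
        Rpi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * Ppi, Rpi)
        # Policy improvement: act greedily w.r.t. Q^pi
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):   # stable policy => optimal
            return V, pi
        pi = pi_new
```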
Continuous Control
In continuous control, states and/or actions are continuous (e.g., robot manipulation, autonomous driving).
Challenges:
- Cannot enumerate all states/actions
- Function approximation required for value functions and policies
- Integration over continuous spaces
Approaches:
- Discretization (coarse approximation)
- Function approximation (neural networks, linear models)
- Optimal control methods (LQR, iLQR)
Linear Quadratic Regulation (LQR)
Linear Quadratic Regulators (LQRs)
LQR considers systems with:
- Linear dynamics: \(x_{t+1} = A_t x_t + B_t u_t + w_t\)
- Quadratic cost: \(c_t(x_t, u_t) = x_t^\top C_t x_t + u_t^\top D_t u_t\)
where \(x_t\) is state, \(u_t\) is control input (action), and \(w_t\) is noise.
Goal: Find control policy to minimize total cost: \(J = \mathbb{E}\left[\sum_{t=0}^T c_t(x_t, u_t)\right]\)
Optimal Control in LQRs
Solution: The optimal policy is linear: \(u_t^* = -K_t x_t\)
where \(K_t\) is the optimal gain matrix, computed via a backward recursion on the matrices \(P_t\) that parameterize the value function \(V_t(x) = x^\top P_t x\):
- Initialize: \(P_T = C_T\)
- For \(t = T-1, \dots, 0\):
- Compute gain: \(K_t = (D_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t\)
- Update value: \(P_t = C_t + A_t^\top P_{t+1} (A_t - B_t K_t)\)
Properties:
- Closed-form solution (no iterative optimization needed)
- Optimal even with stochastic dynamics
- Can be solved efficiently (\(O(T)\) time)
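The backward recursion is a few lines of NumPy. For brevity this sketch assumes time-invariant \(A, B, C, D\) (the recursion extends to time-varying matrices by indexing them by \(t\)):

```python
import numpy as np

def lqr(A, B, C, D, T):
    """Finite-horizon LQR backward recursion.
    Dynamics x_{t+1} = A x_t + B u_t; cost x^T C x + u^T D u.
    Returns gains K_t such that u_t* = -K_t x_t, ordered t = 0..T-1."""
    P = C.copy()                                   # P_T = C_T
    Ks = []
    for _ in range(T):
        K = np.linalg.solve(D + B.T @ P @ B, B.T @ P @ A)  # gain
        P = C + A.T @ P @ (A - B @ K)                      # Riccati update
        Ks.append(K)
    return Ks[::-1]
```

For the scalar system \(A = B = C = D = 1\), the stationary gain converges to \((\sqrt{5}-1)/2\), the fixed point of the Riccati recursion.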
Nonlinear Control
Control for Nonlinear Systems (Iterative LQR)
For nonlinear systems \(x_{t+1} = f(x_t, u_t)\), we can use iterative LQR (iLQR):
Algorithm:
- Initialize nominal trajectory \((\bar{x}_0, \bar{u}_0, \dots, \bar{x}_T)\)
- Repeat:
- Linearize dynamics around the nominal trajectory: \(\delta x_{t+1} \approx A_t \delta x_t + B_t \delta u_t\), where \(A_t = \frac{\partial f}{\partial x}\big|_{\bar{x}_t, \bar{u}_t}\) and \(B_t = \frac{\partial f}{\partial u}\big|_{\bar{x}_t, \bar{u}_t}\)
- Quadratize the cost function around the nominal trajectory
- Solve LQR on linearized system to get policy correction \(\delta u_t = -K_t \delta x_t\)
- Forward pass: Execute new policy, get new nominal trajectory
- Until convergence
Applications: Trajectory optimization, robotics, autonomous systems.
Model-based RL
Model-based RL w/ Generative Model
Idea: Learn a model of the environment dynamics \(\hat{P}(s'|s,a)\), then use planning algorithms.
Algorithm:
- Collect data: Interact with environment, store transitions \((s, a, r, s')\)
- Learn model: Fit \(\hat{P}(s'|s,a)\) using supervised learning
- Plan: Use value iteration, policy iteration, or trajectory optimization with \(\hat{P}\)
Advantages:
- Sample efficient (reuse data)
- Can plan without interacting with real environment
Challenges:
- Model errors compound over long horizons
- High-dimensional state/action spaces difficult to model
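For tabular MDPs, the "learn model" step is just maximum-likelihood counting. A minimal sketch (the uniform fallback for unvisited pairs is my own convention):

```python
import numpy as np

def fit_tabular_model(transitions, S, A):
    """Maximum-likelihood tabular model from (s, a, r, s') tuples.
    Returns P_hat (S, A, S) and R_hat (S, A); unvisited pairs get a
    uniform next-state distribution and zero reward."""
    counts = np.zeros((S, A, S))
    rew_sum = np.zeros((S, A))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rew_sum[s, a] += r
    n = counts.sum(axis=2)
    P_hat = np.where(n[..., None] > 0,
                     counts / np.maximum(n, 1)[..., None],
                     1.0 / S)
    R_hat = rew_sum / np.maximum(n, 1)
    return P_hat, R_hat
```

The planner (value iteration, policy iteration, etc.) is then run on \(\hat{P}, \hat{R}\) instead of the true MDP.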
Approximate Policy Iteration
Supervised Learning & Approximate Policy Iteration
When exact policy iteration is intractable (large/continuous spaces), use function approximation: \(V_\theta(s) \approx V^\pi(s)\)
Q-learning with function approximation (a semi-gradient update: the target \(r + \gamma \max_{a'} Q_\theta(s', a')\) is treated as a constant when differentiating): \(\theta \leftarrow \theta - \alpha \left[Q_\theta(s,a) - (r + \gamma \max_{a'} Q_\theta(s', a'))\right] \nabla_\theta Q_\theta(s,a)\)
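One such update for a linear approximator \(Q_\theta(s,a) = \theta^\top \phi(s,a)\), where \(\nabla_\theta Q_\theta = \phi(s,a)\) (the feature-map interface is illustrative):

```python
import numpy as np

def q_learning_step(theta, phi, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning update for linear Q(s,a) = theta . phi(s,a).
    The bootstrapped target is held fixed (not differentiated through)."""
    target = r + gamma * max(theta @ phi(s2, a2) for a2 in actions)
    td_error = target - theta @ phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```

With one-hot features this reduces exactly to tabular Q-learning.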
Approximate Policy Iteration & Performance Difference Lemma
Performance difference lemma: \(V^{\pi'}(s_0) - V^\pi(s_0) = \frac{1}{1-\gamma} \mathbb{E}_{s\sim d^{\pi'}, a\sim \pi'(\cdot|s)}[A^\pi(s,a)]\)
where \(A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)\) is the advantage function.
Implication: If \(\pi'\) has non-negative advantage everywhere under \(d^{\pi'}\), then \(\pi'\) is at least as good as \(\pi\).
Conservative Policy Iteration
Conservative policy iteration ensures monotonic improvement by limiting policy changes.
Key idea: Avoid destructive policy updates that might hurt performance due to approximation errors.
Approaches:
- Limit KL divergence between old and new policies
- Use trust regions (TRPO)
- Clip policy updates (PPO)
Optimization Background
Gradient descent: \(\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)\)
Stochastic gradient descent (SGD): Use mini-batch estimates of gradient.
Momentum: \(v_{t+1} = \beta v_t + \nabla_\theta L(\theta_t)\) \(\theta_{t+1} = \theta_t - \alpha v_{t+1}\)
Adam: Adaptive learning rates using first and second moment estimates.
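The momentum update above, sketched on a toy quadratic \(f(\theta) = \tfrac{1}{2}\|\theta\|^2\) (whose gradient is \(\theta\) itself); the demo problem is mine, not from the notes:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One gradient-descent step with momentum, matching the update above."""
    v = beta * v + grad
    return theta - alpha * v, v

# Minimize f(theta) = ||theta||^2 / 2; the gradient at theta is theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, theta)
```

Momentum averages past gradients, damping oscillations across steep directions while accelerating along consistent ones.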
Policy Gradient Methods
Gradient Descent & Policy Gradient
Goal: Directly optimize policy \(\pi_\theta(a|s)\) by gradient ascent on expected return: \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\)
Policy gradient theorem: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T \gamma^{t'-t} r_{t'}\right)\right]\)
REINFORCE algorithm:
- Sample trajectories \(\{\tau^i\}_{i=1}^N\) using \(\pi_\theta\)
- Compute returns \(R(\tau^i)\)
- Update: \(\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta(\tau^i) R(\tau^i)\), where \(\nabla_\theta \log \pi_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\)
Variance reduction: Use baseline \(b(s)\) (often \(V(s)\)): \(\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) (Q(s,a) - b(s))\right]\)
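A minimal runnable instance of REINFORCE with a baseline, on the simplest possible case: length-one trajectories (a two-armed bandit, which is my toy setup) with a softmax policy, where \(\nabla_\theta \log \pi_\theta(a) = \mathbb{1}_a - \pi_\theta\) and the baseline is a running mean of rewards:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # logits of a softmax policy over 2 actions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

baseline, alpha = 0.0, 0.5
for t in range(500):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = float(a == 1)                    # arm 1 pays 1, arm 0 pays 0
    grad_log = np.eye(2)[a] - pi         # grad of log pi(a) for softmax
    theta += alpha * grad_log * (r - baseline)   # REINFORCE with baseline
    baseline += 0.05 * (r - baseline)            # running-mean baseline
```

The policy's probability on the better arm rises toward 1; subtracting the baseline leaves the gradient unbiased while shrinking its variance.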
Advanced Policy Optimization
Trust Region and Natural PG
Trust Region Policy Optimization (TRPO):
Maximize improvement subject to KL constraint:
\[\max_\theta \mathbb{E}_{s,a}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{old}}(s,a)\right]\] \[\text{s.t. } \mathbb{E}_s[KL(\pi_{\theta_{old}}(\cdot|s) \| \pi_\theta(\cdot|s))] \leq \delta\]
Natural policy gradient:
Uses Fisher information matrix \(F\) as metric:
\[\theta_{t+1} = \theta_t + \alpha F^{-1} \nabla_\theta J(\theta_t)\]
Proximal Policy Optimization (PPO):
Simpler alternative to TRPO using clipped objective:
\[L(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]\]
Multi-Armed Bandits
Introduction to Exploration in RL
Exploration-exploitation tradeoff:
- Exploitation: Choose actions that maximize immediate reward based on current knowledge
- Exploration: Try new actions to gain information
Multi-armed bandit (MAB): Simplified RL setting (stateless MDP)
- \(K\) arms (actions)
- Each arm \(a\) has unknown reward distribution with mean \(\mu_a\)
- Goal: Maximize cumulative reward over \(T\) rounds
Regret: Difference between optimal strategy and agent’s performance:
\[\text{Regret}(T) = T\mu^* - \sum_{t=1}^T r_t\]
where \(\mu^* = \max_a \mu_a\).
Algorithms:
- \(\epsilon\)-greedy: Explore with probability \(\epsilon\), exploit otherwise
- UCB: Optimism under uncertainty
- Thompson sampling: Bayesian approach
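The \(\epsilon\)-greedy strategy can be sketched on a Bernoulli bandit (the simulation harness is my own):

```python
import numpy as np

def eps_greedy_bandit(means, T, eps=0.1, seed=0):
    """epsilon-greedy on a Bernoulli bandit; returns total reward."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n, mu_hat = np.zeros(K), np.zeros(K)
    total = 0.0
    for t in range(T):
        # Explore with prob eps, otherwise exploit the empirical best arm
        a = rng.integers(K) if rng.random() < eps else int(np.argmax(mu_hat))
        r = float(rng.random() < means[a])      # Bernoulli reward
        n[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / n[a]     # incremental mean update
        total += r
    return total
```

Note that a fixed \(\epsilon\) keeps exploring forever, so regret grows linearly at rate \(\epsilon\); decaying \(\epsilon_t\) fixes this.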
Upper Confidence Bound (UCB)
Upper Confidence Bound (UCB1) algorithm:
Select action with highest upper confidence bound:
\[a_t = \arg\max_a \left[\hat{\mu}_a(t) + \sqrt{\frac{2\log t}{n_a(t)}}\right]\]
where:
- \(\hat{\mu}_a(t)\): empirical mean reward of arm \(a\)
- \(n_a(t)\): number of times arm \(a\) has been pulled
Intuition:
- First term: exploitation (choose high-reward arms)
- Second term: exploration bonus (choose uncertain arms)
Regret bound: UCB1 achieves worst-case regret \(O(\sqrt{KT\log T})\); on a fixed problem instance the gap-dependent regret is \(O\!\left(\sum_{a:\Delta_a>0} \frac{\log T}{\Delta_a}\right)\), i.e. logarithmic in \(T\).
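The UCB1 selection rule, run on a Bernoulli bandit (simulation setup is mine; each arm is pulled once before the bonus is defined):

```python
import numpy as np

def ucb1(means, T, seed=0):
    """UCB1 on a Bernoulli bandit; returns regret T*mu_star - sum of rewards."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n, mu_hat, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                           # pull each arm once
        else:
            a = int(np.argmax(mu_hat + np.sqrt(2 * np.log(t) / n)))  # UCB rule
        r = float(rng.random() < means[a])
        n[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / n[a]
        total += r
    return T * max(means) - total
```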
Contextual Bandits
Contextual bandit: Bandit with side information (context) \(x_t\) at each round.
Model: Each arm \(a\) has reward \(r_a(x)\) depending on context \(x\).
Goal: Learn policy \(\pi(x) \rightarrow a\) that maximizes reward.
Linear Contextual Bandits
LinUCB algorithm:
Assume linear reward model: \(r_a(x) = x^\top \theta_a^* + \epsilon\)
Algorithm:
- Maintain estimates \(\hat{\theta}_a\) and confidence sets for each arm
- At round \(t\), observe context \(x_t\)
- Select arm: \(a_t = \arg\max_a \left[x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}\right]\) where \(A_a\) is related to covariance of arm \(a\)’s data
- Observe reward, update estimates
Regret: \(\tilde{O}(\sqrt{dT})\) where \(d\) is context dimension.
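A minimal LinUCB sketch, keeping per-arm statistics \(A_a = I + \sum x x^\top\) and \(b_a = \sum r\,x\) so that \(\hat{\theta}_a = A_a^{-1} b_a\). The context distribution, noise level, and simulation harness are my assumptions:

```python
import numpy as np

def linucb(thetas, T, alpha=1.0, seed=0):
    """LinUCB for K arms with true parameters thetas (K, d).
    Rewards are x . theta_a + Gaussian noise; returns total reward."""
    rng = np.random.default_rng(seed)
    K, d = thetas.shape
    A = np.stack([np.eye(d) for _ in range(K)])   # A_a = I + sum x x^T
    b = np.zeros((K, d))                          # b_a = sum r x
    total = 0.0
    for t in range(T):
        x = rng.normal(size=d) / np.sqrt(d)       # observe context
        ucb = [x @ np.linalg.solve(A[a], b[a])                     # estimate
               + alpha * np.sqrt(x @ np.linalg.solve(A[a], x))     # bonus
               for a in range(K)]
        a = int(np.argmax(ucb))
        r = x @ thetas[a] + 0.1 * rng.normal()
        A[a] += np.outer(x, x)                    # update arm a's statistics
        b[a] += r * x
        total += r
    return total
```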
Exploration in MDPs
Challenges in MDP exploration:
- Delayed rewards
- Large state spaces
- Need to explore systematically
Count-based exploration:
Add exploration bonus based on visitation counts:
\[\tilde{R}(s,a) = R(s,a) + \beta \cdot \frac{1}{\sqrt{N(s,a)}}\]
where \(N(s,a)\) is the number of times \((s,a)\) has been visited.
UCB in MDPs (UCRL2):
- Build confidence sets around transition dynamics and rewards
- Plan optimistically within confidence sets
Posterior sampling (Thompson sampling for RL):
- Maintain posterior over MDP parameters
- Sample MDP from posterior, act optimally in sampled MDP
Behavioral Cloning
Behavioral cloning (BC): Supervised learning from expert demonstrations.
Setting:
- Expert policy \(\pi^*\)
- Dataset of expert state-action pairs: \(\mathcal{D} = \{(s_i, a_i^*)\}\)
Algorithm:
\[\min_\theta \sum_{i} \ell(\pi_\theta(s_i), a_i^*)\]
where \(\ell\) is a loss function (e.g., cross-entropy for discrete actions, MSE for continuous).
Problem: Distribution mismatch
- Training distribution: \(s \sim d^{\pi^*}\)
- Test distribution: \(s \sim d^{\pi_\theta}\)
- Small errors compound over time (covariate shift)
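For discrete actions, behavioral cloning with a linear policy is just multinomial logistic regression on the demonstration pairs. A minimal sketch (architecture and hyperparameters are illustrative):

```python
import numpy as np

def train_bc(states, actions, n_actions, lr=0.5, epochs=200):
    """Behavioral cloning: minimize cross-entropy of a softmax policy
    pi_W(a|s) = softmax(s @ W) against expert action labels."""
    d = states.shape[1]
    W = np.zeros((d, n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(epochs):
        logits = states @ W
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * states.T @ (p - onehot) / len(states)   # softmax CE gradient
    return W
```

Note this only fits the policy on expert-visited states; nothing in the objective controls errors on states the learner drifts into, which is exactly the distribution-mismatch problem above.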
Interactive Imitation Learning
DAgger
Dataset Aggregation (DAgger): (Ross, Gordon, & Bagnell, 2011)
Algorithm:
- Initialize dataset \(\mathcal{D} = \emptyset\)
- For iteration \(i = 1, \dots, N\):
- Train policy \(\pi_i\) on \(\mathcal{D}\)
- Execute \(\pi_i\) in environment, collect states \(\{s_j\}\)
- Query expert for actions \(\{a_j^*\}\) at visited states
- Aggregate: \(\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_j, a_j^*)\}\)
Key insight: Collect training data from learner’s state distribution \(d^{\pi_i}\), not just expert’s.
Guarantees: Error decreases with more iterations (addresses distribution mismatch).
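The aggregation loop can be sketched abstractly; the toy instantiation below (a scalar state space, a sign-based expert, rollouts that are a stand-in for real environment interaction, and a nearest-neighbor "policy") is entirely illustrative:

```python
import numpy as np

def dagger(expert, rollout, train, n_iters=5):
    """DAgger: grow the dataset with expert labels at the learner's own states."""
    D_states, D_actions = [], []
    policy = None
    for _ in range(n_iters):
        states = rollout(policy)                     # states the learner visits
        D_states.extend(states)
        D_actions.extend(expert(s) for s in states)  # expert relabels them
        policy = train(np.array(D_states), np.array(D_actions))
    return policy

# Toy instantiation: states are floats, expert says "action 1 iff s < 0".
expert = lambda s: int(s < 0)
rng = np.random.default_rng(0)
rollout = lambda pi: list(rng.normal(size=20))       # stand-in for env rollouts
def train(X, y):                                     # 1-nearest-neighbor "policy"
    return lambda s: int(y[np.argmin(np.abs(X - s))])

pi = dagger(expert, rollout, train)
```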
Reference: Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
Inverse Reinforcement Learning
Maximum Entropy Inverse RL
Inverse RL (IRL): Learn reward function from expert demonstrations.
Maximum entropy IRL:
Assume expert policy has maximum entropy distribution over optimal trajectories:
\[P(\tau | \theta) \propto \exp\left(r_\theta(\tau)\right)\]
Algorithm:
- Initialize reward parameters \(\theta\)
- Repeat:
- Compute optimal policy \(\pi_\theta\) for current reward \(r_\theta\)
- Compute feature expectations: \(\mathbb{E}_{\pi_\theta}[\phi]\) and \(\mathbb{E}_{\text{expert}}[\phi]\)
- Update \(\theta\) by gradient ascent on the demonstration log-likelihood; for linear rewards \(r_\theta(\tau) = \theta^\top \phi(\tau)\) the gradient is \(\nabla_\theta \mathcal{L} = \mathbb{E}_{\text{expert}}[\phi] - \mathbb{E}_{\pi_\theta}[\phi]\)
Goal: Find reward such that expert policy is optimal.
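The feature-matching update can be demonstrated in the simplest setting: a small finite set of candidate trajectories with known features, so that \(P(\tau|\theta) \propto \exp(\theta^\top \phi(\tau))\) is an explicit softmax (this finite-trajectory setup is a simplification I chose; real MaxEnt IRL computes the model expectation via soft value iteration):

```python
import numpy as np

def maxent_irl(Phi, expert_freq, lr=0.5, iters=500):
    """Maximum-entropy IRL over a finite trajectory set.
    Phi: (N, d) feature vector of each trajectory; expert_freq: (N,)
    empirical distribution of expert trajectories. Fits theta so the
    model P(tau) ~ exp(theta . phi(tau)) matches expert feature counts."""
    theta = np.zeros(Phi.shape[1])
    f_expert = expert_freq @ Phi                    # expert feature expectations
    for _ in range(iters):
        logits = Phi @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                                # model trajectory distribution
        theta += lr * (f_expert - p @ Phi)          # log-likelihood gradient
    return theta
```

At convergence the model's feature expectations match the expert's, which is the defining condition of MaxEnt IRL.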
