PPO and Policy Optimization
What is PPO?
Proximal Policy Optimization (PPO) is a stable, efficient policy gradient algorithm that restricts how far each update can move the policy, preventing large, destabilizing changes.
Policy Gradient Basics
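Objective: maximize the expected discounted return
J(θ) = E[Σ_t γ^t r_t]
Gradient: ∇_θ J(θ) = E[∇_θ log π_θ(a|s) * A(s,a)]
Where:
- π_θ(a|s): probability that the policy with parameters θ assigns to action a in state s
- γ: discount factor, between 0 and 1
- r_t: reward received at time step t
- A(s,a): advantage, i.e. how much better action a is than the baseline value of state s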
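As a rough illustration of the gradient formula above, the sketch below estimates the policy gradient for a softmax policy over two actions in a single state. The function names (`softmax`, `policy_gradient_estimate`) and the toy actions and advantages are assumptions made for this example, not part of any library.

```python
import math

def softmax(logits):
    """Turn raw logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(logits, actions, advantages):
    """Average of grad log pi(a) * A(a) over the sampled (action, advantage) pairs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for i in range(len(logits)):
            # For a softmax policy: d log pi(a) / d logit_i = 1[i == a] - pi(i)
            grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * adv
    return [g / len(actions) for g in grad]

# Toy data (hypothetical): action 0 did better than baseline, action 1 did worse
logits = [0.0, 0.0]
grad = policy_gradient_estimate(logits, actions=[0, 1], advantages=[1.0, -1.0])
print(grad)  # [0.5, -0.5]
```

With equal starting logits, the estimate is positive for action 0 and negative for action 1, so a gradient ascent step shifts probability toward the action with the positive advantage.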
PPO Core Idea
Limit how much the policy can change per update by clipping the probability ratio:
    # PPO clipped objective (pseudocode, per sample)
    ratio = new_policy_prob / old_policy_prob
    clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
    loss = -min(ratio * advantage, clipped_ratio * advantage)
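To make the effect of clipping concrete, here is a minimal pure-Python sketch that evaluates the per-sample objective for a few ratio and advantage combinations. The helper name `clipped_surrogate` and the sample values are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective (to be maximized); the loss is its negative."""
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon]
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

# (ratio, advantage) pairs covering both signs of the advantage
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for ratio, advantage in cases:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

With a positive advantage, pushing the ratio above 1 + ε earns no extra objective, so there is no incentive to move further. With a negative advantage, the same holds below 1 - ε. Moves in the harmful direction (for example a ratio of 1.5 with a negative advantage) are left unclipped and keep their full penalty.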
PPO Implementation
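    import torch
    from torch.distributions import Categorical

    class PPO:
        def __init__(self, policy_net, value_net, epsilon=0.2):
            self.policy = policy_net  # maps states to action probabilities
            self.value = value_net    # maps states to value estimates
            self.epsilon = epsilon    # clip range for the probability ratio

        def update(self, states, actions, old_log_probs, returns, advantages):
            """Return the combined PPO loss; the caller runs backward() and the optimizer step."""
            # Log-probabilities of the taken actions under the current policy
            new_probs = self.policy(states)
            dist = Categorical(probs=new_probs)
            new_log_probs = dist.log_prob(actions)
            # Importance-sampling ratio pi_new / pi_old, computed in log space.
            # old_log_probs must be log-probabilities recorded at rollout time.
            ratio = torch.exp(new_log_probs - old_log_probs)
            # Clipped surrogate objective
            clip_adv = torch.clamp(ratio, 1 - self.epsilon, 1 + self.epsilon) * advantages
            policy_loss = -torch.min(ratio * advantages, clip_adv).mean()
            # Value loss; drop a trailing size-1 dimension so the shape matches returns
            values = self.value(states).squeeze(-1)
            value_loss = ((values - returns) ** 2).mean()
            return policy_loss + 0.5 * value_loss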
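Advantage Estimation (GAE)
Generalized Advantage Estimation (GAE) trades off bias and variance through λ: values near 0 rely on the value function's one-step estimates (lower variance, more bias), while values near 1 rely on observed returns (higher variance, less bias).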
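    def compute_gae(rewards, values, gamma=0.99, lambda_=0.95):
        # `values` must hold len(rewards) + 1 entries: the last one is the
        # bootstrap value of the state after the final step (0 if terminal).
        advantages = [0.0] * len(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            # One-step TD error
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            # Discounted, lambda-weighted sum of current and future TD errors
            gae = delta + gamma * lambda_ * gae
            advantages[t] = gae
        return advantages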
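To see how λ moves the estimate between its two extremes, the sketch below repeats the function in self-contained form and runs it on a three-step episode. The reward and value numbers are made up for illustration, and forming value targets as advantage plus value estimate is a common convention assumed here rather than something the text above specifies.

```python
def compute_gae(rewards, values, gamma=0.99, lambda_=0.95):
    # values has len(rewards) + 1 entries; the last is the bootstrap value
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lambda_ * gae
        advantages[t] = gae
    return advantages

# Hypothetical three-step episode
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 1.0, 0.0]  # final 0.0: episode ended, nothing to bootstrap
gamma = 0.9

# lambda = 0: each advantage is just the one-step TD error
td_only = compute_gae(rewards, values, gamma=gamma, lambda_=0.0)
# lambda = 1: each advantage is the discounted return minus the value estimate
monte_carlo = compute_gae(rewards, values, gamma=gamma, lambda_=1.0)

# Value-function targets, assumed here to be advantage + value estimate
returns = [a + v for a, v in zip(monte_carlo, values[:-1])]
print(td_only, monte_carlo, returns)
```

Intermediate settings such as the typical λ = 0.95 blend the two, weighting TD errors further in the future by increasing powers of γλ.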
PPO vs Other Algorithms
| Algorithm | Stability | Sample Efficiency | Complexity |
|---|---|---|---|
| Vanilla PG | Low | Low | Low |
| TRPO | High | Medium | High |
| PPO | High | Medium | Medium |
| A2C | Medium | Low | Low |
Hyperparameters
| Parameter | Typical Value |
|---|---|
| Epsilon (clip) | 0.2 |
| Learning rate | 3e-4 |
| Gamma (discount) | 0.99 |
| Lambda (GAE) | 0.95 |
| Epochs per update | 4-10 |
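"Epochs per update" means each batch of collected experience is reused for several optimization passes before new data is gathered. The sketch below shows one way such a schedule might be generated, under stated assumptions: the `minibatch_schedule` helper, the `config` dictionary keys, and the minibatch size are all hypothetical and not taken from the table or from any library.

```python
import random

def minibatch_schedule(num_samples, batch_size, epochs, seed=0):
    """Yield (epoch, indices) pairs; every epoch revisits the whole rollout once."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(epochs):
        rng.shuffle(indices)  # new sample order each epoch
        for start in range(0, num_samples, batch_size):
            # Slicing copies the indices, so later shuffles do not alter them
            yield epoch, indices[start:start + batch_size]

# Typical values from the table; the key names are illustrative
config = {"epsilon": 0.2, "learning_rate": 3e-4, "gamma": 0.99,
          "gae_lambda": 0.95, "epochs": 4}

schedule = list(minibatch_schedule(num_samples=8, batch_size=4,
                                   epochs=config["epochs"]))
for epoch, batch in schedule:
    print(epoch, batch)
```

Each yielded index list would select the states, actions, stored log-probabilities, returns, and advantages passed to one loss computation. The clipped objective is what keeps this reuse of the same data from moving the policy too far.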
Use Cases
- Game playing
- Robotics control
- RLHF for LLMs
- Recommendation systems
PPO is the default choice for many RL applications due to its stability and simplicity.