In this blog post, I will present four useful identities that give some theoretical insight into advantage and value functions in reinforcement learning. These identities are useful in proving the theoretical correctness of the policy iteration algorithm with non-decreasing expected return in the paper "Trust Region Policy Optimization" (TRPO) - Schulman et al., 2015, https://arxiv.org/abs/1502.05477. Readers should be familiar with the reinforcement learning terminology and notation from the TRPO paper before reading this post. Finally, I encourage readers to try proving the identities themselves if they feel, like me, the need to keep their math skills sharp now that there are no longer problem sets and exams to do.
Identity 1
First, the expected value of the advantage function, using an action sampled from its own policy, is zero. Intuitively this makes sense, because the advantage function for a state $s$ and an action $a$ measures how much better that action is relative to the “average” action taken at state $s$. The “average” action is not better or worse than itself, hence the identity:
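With $a \sim \pi(\cdot \mid s)$ denoting an action sampled from the policy at state $s$, and using $V_{\pi}(s) = \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a')\right]$:

$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A_{\pi}(s, a)\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a) - V_{\pi}(s)\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a) - \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a')\right]\right] = 0$$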
Here we used an identity from probability: $\mathbb{E}[X - \mathbb{E}[X]] = 0$. You can reason that this makes sense intuitively, or you can prove it mathematically (hint: $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$, from the Appendix of this post).
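Spelling out the hint, by linearity of expectation:

$$\mathbb{E}[X - \mathbb{E}[X]] = \mathbb{E}[X] - \mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X] - \mathbb{E}[X] = 0$$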
Identity 2
The second identity I will present shows another way to describe the state-action value function Q:
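With $s'$ sampled from the transition function $P(s' \mid s, a)$, and with $r(s)$ the reward of state $s$ (in the TRPO paper's setup the reward depends only on the state), the identity is:

$$Q_{\pi}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s')\right]$$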
In plain English, we can reason that this makes sense. The $Q$ function gives you the reward of the current state $s$, plus the expected discounted return from the next state, sampled over all possible next states when taking action $a$ in state $s$.
Identity 3
The advantage function can be expressed without the state-action value function:
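Using the same notation as Identity [2]:

$$A_{\pi}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s') - V_{\pi}(s)\right]$$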
Proof of Identity [3]:
Use Identity [2] and the definition of the advantage function $A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)$:
Then move all terms that are constant w.r.t. the sampling distribution into the expectation brackets, which gives the last equality below. $\square$
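Writing both steps out explicitly, with $s' \sim P(s' \mid s, a)$ as in Identity [2]:

$$\begin{aligned}
A_{\pi}(s, a) &= Q_{\pi}(s, a) - V_{\pi}(s) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s')\right] - V_{\pi}(s) \\
&= \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s') - V_{\pi}(s)\right]
\end{aligned}$$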
Identity 4
The final identity that I will present is an identity for the difference between the expected discounted returns of two policies $\tilde{\pi}$ and $\pi$. This difference is equal to the expected accumulated discounted advantages of policy $\pi$, evaluated on a trajectory actually generated by policy $\tilde{\pi}$.
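In the TRPO paper's notation, where $\eta(\pi)$ is the expected discounted return of policy $\pi$ and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory, the identity reads:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]$$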
The intuitive way of understanding this equation is that if $\tilde{\pi}$ is better than $\pi$, then on average (but favoring earlier states and actions through the discount $\gamma$), the value functions of the imperfect policy $\pi$ “like” the actions that policy $\tilde{\pi}$ chooses more than $\pi$'s own average action.
The proof of Identity [4] is in the TRPO paper, and assumes that you accept Identity [3] and understand some probability algebra. I will rewrite it here with more detail. I will also explicitly show that the trajectory $\tau$ depends not only on the policy $\tilde{\pi}$, but also on the start-state distribution $\rho(\cdot)$ and the transition function $P(s' \mid s, a)$. This would be a bit excessive to notate in the paper, but I'll notate it here to minimize any chance of confusion.
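The core steps, written with the trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ generated by $s_0 \sim \rho(\cdot)$, $a_t \sim \tilde{\pi}(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, are:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t) + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t) + \sum_{t=0}^{\infty}\left(\gamma^{t+1} V_{\pi}(s_{t+1}) - \gamma^{t} V_{\pi}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right] - \mathbb{E}_{s_0 \sim \rho(\cdot)}\left[V_{\pi}(s_0)\right] \\
&= \eta(\tilde{\pi}) - \eta(\pi)
\end{aligned}$$

The first equality replaces $A_{\pi}(s_t, a_t)$ using Identity [3]; this is where the probability algebra from the Appendix comes in, since the inner expectation over $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ can be absorbed into the expectation over the whole trajectory $\tau$. The third equality holds because the second sum telescopes, leaving only $-V_{\pi}(s_0)$, and the last line uses $\eta(\pi) = \mathbb{E}_{s_0 \sim \rho(\cdot)}\left[V_{\pi}(s_0)\right]$.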
The way in which Identity [4] is used is to theoretically guarantee an improving policy. If you can somehow update your policy from $\pi$ to $\tilde{\pi}$ while guaranteeing that the expected-advantage term on the RHS of Identity [4] is non-negative, then the new policy is, in the worst case, just as good as the old policy.
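As a quick sanity check of Identity [4] (and of this improvement guarantee), here is a small numerical sketch on a random tabular MDP. This is not from the TRPO paper; it simply computes both sides of the identity exactly with linear algebra, assumes numpy, and all the names in it are illustrative.

```python
# Numerical sanity check of Identity [4] on a small random tabular MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: transitions P[s, a, s'], state-dependent reward r(s),
# and start-state distribution rho0, matching the TRPO paper's setup.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random(S)
rho0 = rng.random(S)
rho0 /= rho0.sum()

def value(pi):
    """Exact V_pi from the linear system V = r + gamma * P_pi V."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r)

def advantage(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s), with Q_pi from Identity [2]."""
    V = value(pi)
    Q = r[:, None] + gamma * P @ V          # shape (S, A)
    return Q - V[:, None]

def eta(pi):
    """Expected discounted return eta(pi) = E_{s0 ~ rho0}[V_pi(s0)]."""
    return rho0 @ value(pi)

# Old policy pi: uniform.  New policy pi_tilde: greedy w.r.t. A_pi, so every
# chosen action has A_pi(s, a) >= 0 and Identity [4] guarantees improvement.
pi = np.full((S, A), 1.0 / A)
A_pi = advantage(pi)
pi_tilde = np.zeros((S, A))
pi_tilde[np.arange(S), A_pi.argmax(axis=1)] = 1.0

# RHS of Identity [4]: expected discounted advantages of pi along trajectories
# of pi_tilde, computed via pi_tilde's discounted state-visitation frequencies.
P_tilde = np.einsum("sa,sat->st", pi_tilde, P)
d = np.linalg.solve(np.eye(S) - gamma * P_tilde.T, rho0)  # sum_t gamma^t Pr(s_t = s)
rhs = d @ (pi_tilde * A_pi).sum(axis=1)

print(eta(pi_tilde) - eta(pi), rhs)  # the two numbers should agree
```

The two printed numbers agree up to floating-point error, and because the greedy policy only picks actions with non-negative advantage under $\pi$, the expected-advantage term, and hence the improvement, is non-negative.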
Appendix: Some probability algebra
Below is a very important result of probability algebra that the TRPO paper assumes you know. Assume $\rho_1$ and $\rho_2$ are continuous probability distributions. Then,
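for any function $f(z, y)$ for which these expectations exist:

$$\mathbb{E}_{x \sim \rho_1,\, y \sim \rho_2}\left[\mathbb{E}_{z \sim \rho_1}\left[f(z, y)\right]\right] = \mathbb{E}_{y \sim \rho_2}\left[\mathbb{E}_{z \sim \rho_1}\left[f(z, y)\right]\right]$$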
Proof by rewriting the expectation on the LHS as integrals:
$$\begin{aligned}
\int_{y} \rho_{2}(y) \int_{x} \rho_{1}(x)\left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dx\, dy &= \int_{y} \rho_{2}(y) \int_{x} \rho_{1}(x)\, dx \left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dy \\
&= \int_{y} \rho_{2}(y) \left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dy
\end{aligned}$$
because the integral of a probability distribution is 1: $\int_{x} \rho_{1}(x)\, dx = 1$. This is the continuous analogue of the discrete case, where the chances of all possible outcomes must sum to 100%.