In this blog post, I will present four useful identities that give some theoretical insight into advantage and value functions in reinforcement learning. These identities are useful in proving the theoretical correctness of the policy iteration algorithm with non-decreasing expected return in the paper "Trust Region Policy Optimization" (TRPO) - Schulman et al., 2015, https://arxiv.org/abs/1502.05477. Readers should be familiar with the reinforcement learning terminology and notation from the TRPO paper before reading this post. Finally, I encourage readers to try proving the identities themselves if they feel, like me, the need to keep their math skills sharp now that there are no longer problem sets and exams to do.
Identity 1
First, the expected value of the advantage function, using an action sampled from its own policy, is zero. Intuitively this makes sense, because the advantage function for a state $s$ and an action $a$ measures how much better that action is relative to the “average” action taken at state $s$. The “average” action is not better or worse than itself, hence the identity:
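With $a \sim \pi(\cdot \mid s)$ denoting an action sampled from the policy at state $s$, and using $V_{\pi}(s) = \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a')\right]$:

$$\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A_{\pi}(s, a)\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a) - V_{\pi}(s)\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a) - \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\left[Q_{\pi}(s, a')\right]\right] = 0$$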
Here we used an identity from probability: $\mathbb{E}[X - \mathbb{E}[X]] = 0$. You can reason that this makes sense intuitively, or you can prove it mathematically (hint: $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$, from the Appendix of this post).
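Spelling out the hint, by linearity of expectation:

$$\mathbb{E}[X - \mathbb{E}[X]] = \mathbb{E}[X] - \mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X] - \mathbb{E}[X] = 0$$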
Identity 2
The second identity I will present shows another way to describe the state-action value function Q:
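With $s'$ sampled from the transition function $P(s' \mid s, a)$, and with $r(s)$ the reward of state $s$ (in the TRPO paper's setup the reward depends only on the state), the identity is:

$$Q_{\pi}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s')\right]$$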
In plain English, we can reason that this makes sense. The $Q$ function gives you the reward of the current state $s$, plus the expected discounted return from the next state, sampled over all possible next states when taking action $a$ in state $s$.
Identity 3
The advantage function can be expressed without the state-action value function:
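Using the same notation as Identity [2]:

$$A_{\pi}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s') - V_{\pi}(s)\right]$$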
Proof of Identity [3]:
Use Identity [2] and the definition of the advantage function $A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)$:
Then move all terms that are constant w.r.t. the sampling distribution into the expectation brackets, which gives the last equality below. $\square$
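Writing both steps out explicitly, with $s' \sim P(s' \mid s, a)$ as in Identity [2]:

$$\begin{aligned}
A_{\pi}(s, a) &= Q_{\pi}(s, a) - V_{\pi}(s) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s')\right] - V_{\pi}(s) \\
&= \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V_{\pi}(s') - V_{\pi}(s)\right]
\end{aligned}$$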
Identity 4
The final identity that I will present is an identity for the difference between the expected discounted returns of two policies $\tilde{\pi}$ and $\pi$. This difference is equal to the expected accumulated discounted advantages of policy $\pi$, evaluated on a trajectory actually generated by policy $\tilde{\pi}$.
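In the TRPO paper's notation, where $\eta(\pi)$ is the expected discounted return of policy $\pi$ and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory, the identity reads:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]$$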
The intuitive way of understanding this equation is that if $\tilde{\pi}$ is better than $\pi$, then on average (but favoring earlier states and actions through the discount $\gamma$), the value functions of the imperfect policy $\pi$ “like” the actions that policy $\tilde{\pi}$ chooses more than $\pi$'s own average action.
The proof of Identity [4] is in the TRPO paper, and assumes that you accept Identity [3] and understand some probability algebra. I will rewrite it here with more detail. I will also explicitly show that the trajectory $\tau$ depends not only on the policy $\tilde{\pi}$, but also on the start-state distribution $\rho(\cdot)$ and the transition function $P(s' \mid s, a)$. This would be a bit excessive to notate in the paper, but I'll notate it here to minimize any chance of confusion.
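The core steps, written with the trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ generated by $s_0 \sim \rho(\cdot)$, $a_t \sim \tilde{\pi}(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, are:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t) + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t) + \sum_{t=0}^{\infty}\left(\gamma^{t+1} V_{\pi}(s_{t+1}) - \gamma^{t} V_{\pi}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t)\right] - \mathbb{E}_{s_0 \sim \rho(\cdot)}\left[V_{\pi}(s_0)\right] \\
&= \eta(\tilde{\pi}) - \eta(\pi)
\end{aligned}$$

The first equality replaces $A_{\pi}(s_t, a_t)$ using Identity [3]; this is where the probability algebra from the Appendix comes in, since the inner expectation over $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ can be absorbed into the expectation over the whole trajectory $\tau$. The third equality holds because the second sum telescopes, leaving only $-V_{\pi}(s_0)$, and the last line uses $\eta(\pi) = \mathbb{E}_{s_0 \sim \rho(\cdot)}\left[V_{\pi}(s_0)\right]$.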
The way in which Identity [4] is used is to theoretically guarantee an improving policy. If you can somehow update your policy from $\pi$ to $\tilde{\pi}$ while guaranteeing that the expected-advantage term on the RHS of Identity [4] is non-negative, then the new policy is, in the worst case, just as good as the old policy.
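As a quick sanity check of Identity [4] (and of this improvement guarantee), here is a small numerical sketch on a random tabular MDP. This is not from the TRPO paper; it simply computes both sides of the identity exactly with linear algebra, assumes numpy, and all the names in it are illustrative.

```python
# Numerical sanity check of Identity [4] on a small random tabular MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: transitions P[s, a, s'], state-dependent reward r(s),
# and start-state distribution rho0, matching the TRPO paper's setup.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random(S)
rho0 = rng.random(S)
rho0 /= rho0.sum()

def value(pi):
    """Exact V_pi from the linear system V = r + gamma * P_pi V."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r)

def advantage(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s), with Q_pi from Identity [2]."""
    V = value(pi)
    Q = r[:, None] + gamma * P @ V          # shape (S, A)
    return Q - V[:, None]

def eta(pi):
    """Expected discounted return eta(pi) = E_{s0 ~ rho0}[V_pi(s0)]."""
    return rho0 @ value(pi)

# Old policy pi: uniform.  New policy pi_tilde: greedy w.r.t. A_pi, so every
# chosen action has A_pi(s, a) >= 0 and Identity [4] guarantees improvement.
pi = np.full((S, A), 1.0 / A)
A_pi = advantage(pi)
pi_tilde = np.zeros((S, A))
pi_tilde[np.arange(S), A_pi.argmax(axis=1)] = 1.0

# RHS of Identity [4]: expected discounted advantages of pi along trajectories
# of pi_tilde, computed via pi_tilde's discounted state-visitation frequencies.
P_tilde = np.einsum("sa,sat->st", pi_tilde, P)
d = np.linalg.solve(np.eye(S) - gamma * P_tilde.T, rho0)  # sum_t gamma^t Pr(s_t = s)
rhs = d @ (pi_tilde * A_pi).sum(axis=1)

print(eta(pi_tilde) - eta(pi), rhs)  # the two numbers should agree
```

The two printed numbers agree up to floating-point error, and because the greedy policy only picks actions with non-negative advantage under $\pi$, the expected-advantage term, and hence the improvement, is non-negative.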
Appendix: Some probability algebra
Below is a very important result of probability algebra that the TRPO paper assumes you know. Assume $\rho_1$ and $\rho_2$ are continuous probability distributions. Then,
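for any function $f(z, y)$ for which these expectations exist:

$$\mathbb{E}_{x \sim \rho_1,\, y \sim \rho_2}\left[\mathbb{E}_{z \sim \rho_1}\left[f(z, y)\right]\right] = \mathbb{E}_{y \sim \rho_2}\left[\mathbb{E}_{z \sim \rho_1}\left[f(z, y)\right]\right]$$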
Proof by rewriting the expectation on the LHS as integrals:
$$\begin{aligned}
\int_{y} \rho_{2}(y) \int_{x} \rho_{1}(x)\left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dx\, dy &= \int_{y} \rho_{2}(y) \int_{x} \rho_{1}(x)\, dx \left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dy \\
&= \int_{y} \rho_{2}(y) \left[\int_{z} \rho_{1}(z) f(z, y)\, dz\right] dy
\end{aligned}$$
because the integral of a probability distribution is 1: $\int_{x} \rho_{1}(x)\, dx = 1$. This is the continuous analogue of the discrete case, where the chances of all possible outcomes must sum to 100%.