Optimising the policy

Let us denote $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$. We can then rewrite $J(\theta)$ as
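(The displayed equation appears to be missing; presumably it is the standard objective, the expected total reward under the trajectory distribution induced by $\pi_{\theta}$:)

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[ r(\tau) \right] = \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$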

Our goal is to maximise the expected reward. Differentiating this function with respect to $\theta$ yields
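(Again reconstructing the omitted expression: exchanging differentiation and integration gives)

$$\nabla_{\theta} J(\theta) = \int \nabla_{\theta} \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$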

A convenient identity to use here is that
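(The identity in question is presumably the likelihood-ratio, or log-derivative, trick:)

$$\nabla_{\theta} \pi_{\theta}(\tau) = \pi_{\theta}(\tau)\, \nabla_{\theta} \log \pi_{\theta}(\tau)$$

which turns the gradient back into an expectation that can be estimated from sampled trajectories:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau) \right]$$

Intuitively, a gradient step will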

• increase the probability of paths with positive reward,
• decrease the probability of paths with negative reward.

Let us assume that the reward is always positive, i.e. $r > 0$. This will increase the probability of all paths, which isn't exactly what we want. Therefore, we subtract a baseline to obtain an overall relative reward. In [Williams, 1992] it is shown that subtracting a baseline does not change the objective we are maximising.
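(Filling in the standard step: with a baseline $b$, the estimator becomes $\mathbb{E}_{\tau}\left[ \nabla_{\theta} \log \pi_{\theta}(\tau)\, (r(\tau) - b) \right]$, and the extra term vanishes in expectation:)

$$\mathbb{E}_{\tau}\left[ \nabla_{\theta} \log \pi_{\theta}(\tau)\, b \right] = b \int \pi_{\theta}(\tau)\, \nabla_{\theta} \log \pi_{\theta}(\tau)\, d\tau = b \int \nabla_{\theta} \pi_{\theta}(\tau)\, d\tau = b\, \nabla_{\theta} \int \pi_{\theta}(\tau)\, d\tau = b\, \nabla_{\theta} 1 = 0$$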

Therefore, subtracting a baseline is unbiased in expectation. What possible choices of baseline can we consider?

Constant baseline
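(Presumably the simplest choice shown here: the average reward over the $m$ sampled trajectories,)

$$b = \frac{1}{m} \sum_{i=1}^{m} r(\tau^{(i)})$$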

Baseline that minimises the variance

Remember that $Var[x] = E[x^2] - E[x]^2$

We would like to find the $b$ that minimises the variance of $\nabla_{\theta}J(\theta)$.
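(Applying the variance identity above with $x = \nabla_{\theta} \log \pi_{\theta}(\tau)\, (r(\tau) - b)$, the quantity to minimise is presumably:)

$$\mathrm{Var} = \mathbb{E}_{\tau}\left[ \left( \nabla_{\theta} \log \pi_{\theta}(\tau)\, (r(\tau) - b) \right)^2 \right] - \mathbb{E}_{\tau}\left[ \nabla_{\theta} \log \pi_{\theta}(\tau)\, (r(\tau) - b) \right]^2$$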

Let us denote $g(\tau) = \nabla_{\theta} \log \pi_{\theta}(\tau)$.

Remember that
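(presumably the unbiasedness result above: since the baseline term has zero expectation, the second term does not depend on $b$,)

$$\mathbb{E}_{\tau}\left[ g(\tau)\, (r(\tau) - b) \right] = \mathbb{E}_{\tau}\left[ g(\tau)\, r(\tau) \right]$$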

The derivative of the second term with respect to $b$ is therefore zero, so we only need to minimise the first term.

This yields
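(Reconstructing the standard result: setting $\frac{d}{db}\, \mathbb{E}_{\tau}\left[ g(\tau)^2 (r(\tau) - b)^2 \right] = 0$ gives the optimal baseline, a reward average weighted by squared gradient magnitudes, interpreted per component of the gradient:)

$$b = \frac{\mathbb{E}_{\tau}\left[ g(\tau)^2\, r(\tau) \right]}{\mathbb{E}_{\tau}\left[ g(\tau)^2 \right]}$$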

Time-varying baseline
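(A natural choice here is presumably the average reward-to-go from time $t$ across the $m$ sampled trajectories:)

$$b_t = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=t}^{T} r(s_k^{(i)}, a_k^{(i)})$$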

Value function baseline that depends on the state
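(Choosing $b(s_t) = V^{\pi}(s_t)$, the gradient estimator in its standard form is presumably:)

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau}\left[ \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=t}^{T} r(s_k, a_k) - V^{\pi}(s_t) \right) \right]$$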

Increase the log probability of each action in proportion to how much its return is better than the expected return under the current policy.

But how do we estimate $V^{\pi}$?

We could do roll-outs with the current policy, collect the rewards to go, and regress $V^{\pi}$ as
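(presumably a least-squares fit of the sampled returns:)

$$\phi \leftarrow \arg\min_{\phi} \sum_{i=1}^{m} \sum_{t=1}^{T} \left( V_{\phi}^{\pi}(s_t^{(i)}) - \sum_{k=t}^{T} r(s_k^{(i)}, a_k^{(i)}) \right)^2$$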

where one could use a linear function of state features, $V_{\phi}^{\pi}(s_t) = \phi(s_t)^T w$, with $\phi(s_t)$ a feature vector for state $s_t$ and $w$ the regression weights.

Caveat: The same batch of trajectories should not be used both for fitting the value function baseline and for estimating $\nabla_{\theta}J$, since that leads to overfitting and a biased estimate. Thus, trajectories from iteration k−1 are used to fit the value function (essentially approximating $V^{\pi}_{k-1}$), and trajectories from iteration k are used to compute the advantage $A^{\pi}_{k}$ and $\nabla_{\theta} J$.
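As a concrete sketch (illustrative code, not from the original post), the two quantities this scheme needs per trajectory are the rewards to go $\sum_{k=t}^{T} r(s_k, a_k)$ and the advantages obtained by subtracting the baseline values fitted on the previous batch. The function and variable names below are hypothetical; the equations above are undiscounted, so `gamma=1.0` matches them, and a discount factor is included only as an option.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum_{k >= t} gamma^(k-t) * r_k for each timestep t."""
    G = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def advantages(rewards, baseline_values):
    """A_t = G_t - V(s_t): how much better the return was than expected."""
    return rewards_to_go(rewards) - baseline_values

# One trajectory of per-step rewards, and V(s_t) fitted on the *previous*
# batch of trajectories (per the caveat above).
rews = np.array([1.0, 0.0, 2.0])
vals = np.array([2.0, 1.5, 1.0])
print(rewards_to_go(rews))        # [3. 2. 2.]
print(advantages(rews, vals))     # G_t - V(s_t) = [1.0, 0.5, 1.0]
```

In the update itself, each $\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$ term would be weighted by the corresponding advantage entry.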

Written on October 10, 2017