Lecture on Policy Gradients
Optimising the policy
Let us denote by $\pi_\theta(\tau)$ the probability of a trajectory $\tau$ under the policy with parameters $\theta$, and by $r(\tau)$ its total reward. Therefore, we can rewrite the expected reward as

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[r(\tau)\right] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

Our goal is to maximise the expected reward. Therefore, differentiating this function yields

$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$

A convenient identity to use here is the log-derivative trick,

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$

Therefore, the overall gradient becomes

$$\nabla_\theta J(\theta) = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right]$$
The gradient tries to:
- increase probability of paths with positive reward.
- decrease probability of paths with negative reward.
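As a concrete illustration (a minimal sketch, not code from these notes: the single-step bandit setting, the reward vector, and all function names below are assumptions), the score-function estimator above can be computed for a softmax policy as:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(logits, action):
    # ∇_θ log π_θ(a) for a softmax policy is one_hot(a) - π_θ
    g = -softmax(logits)
    g[action] += 1.0
    return g

def reinforce_gradient(logits, rewards, rng, n_samples=1000):
    # Monte-Carlo estimate of ∇J(θ) = E[∇ log π_θ(a) r(a)]
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        a = rng.choice(len(probs), p=probs)   # sample a ~ π_θ
        grad += grad_log_softmax(logits, a) * rewards[a]
    return grad / n_samples

rng = np.random.default_rng(0)
logits = np.zeros(3)                   # uniform initial policy
rewards = np.array([1.0, 0.0, -1.0])   # per-action reward (toy bandit)
g = reinforce_gradient(logits, rewards, rng)
# the estimate pushes probability toward the high-reward action
# and away from the negative-reward action
```

Running one gradient ascent step with this estimate would raise the logit of action 0 and lower that of action 2, matching the two bullet points above.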
Let us assume that the reward is always positive, i.e. $r(\tau) > 0$. This will lead to increased probability of all paths, only at different rates, which isn't exactly what we want. Therefore, we subtract a baseline $b$ to obtain a relative reward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right]$$

In [Williams 92] it is shown that the baseline does not change the overall objective we are maximising, since

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, b\right] = b \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta \int \pi_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0$$

Therefore, subtracting a baseline is unbiased in expectation. What possible choices of baseline can we consider?
Baseline that minimises the variance
We would like to optimise over $b$ so as to minimise the variance of the gradient estimator,

$$\mathrm{Var} = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right)^2\right] - \left(\mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\,(r(\tau) - b)\right]\right)^2$$

Let us denote $g(\tau) = \nabla_\theta \log \pi_\theta(\tau)$. Because subtracting the baseline is unbiased, the second term equals $\left(\mathbb{E}\left[g(\tau)\, r(\tau)\right]\right)^2$, which does not depend on $b$. The derivative of the second bit is therefore zero. Setting the derivative of the first term with respect to $b$ to zero,

$$\frac{d\,\mathrm{Var}}{db} = -2\,\mathbb{E}\left[g(\tau)^2\,(r(\tau) - b)\right] = 0 \quad\Rightarrow\quad b^* = \frac{\mathbb{E}\left[g(\tau)^2\, r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$$

i.e. the optimal baseline is the expected reward weighted by the squared magnitude of the gradient.
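This formula can be sanity-checked numerically (a toy sketch: the random scalar "gradients" and rewards below are stand-ins, not samples from an actual policy). The empirical $b^*$ minimises the $b$-dependent part of the variance, so it should never do worse than using no baseline:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy draws: g(τ) stands in for ∇ log π_θ(τ) (scalar case), r(τ) for the reward
g = rng.normal(size=10_000)
r = 2.0 + rng.normal(size=10_000)   # rewards centred well above zero

def second_moment(b):
    # E[(g(τ) (r(τ) - b))²], the b-dependent part of the estimator variance
    return np.mean((g * (r - b)) ** 2)

# b* = E[g² r] / E[g²], estimated from the samples
b_opt = np.sum(g**2 * r) / np.sum(g**2)
# second_moment(b_opt) is guaranteed not to exceed second_moment(0.0),
# since b_opt is the exact minimiser of this quadratic in b
```

Because $g$ and $r$ are independent here, $b^*$ lands near $\mathbb{E}[r] = 2$, confirming the intuition that the baseline recentres rewards around their (gradient-weighted) mean.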
Value function baseline that depends on the state
Increase the log probability of each action in proportion to how much its return is better than the expected return under the current policy:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right) - V^\pi(s_{i,t})\right)$$

but how do we estimate $V^\pi(s_t)$?
We could do roll-outs with the current policy, collect the rewards-to-go, and regress a parametric value function $\hat{V}^\pi_\phi$ as

$$\hat{\phi} = \arg\min_\phi \sum_{i,t} \left\|\hat{V}^\pi_\phi(s_{i,t}) - y_{i,t}\right\|^2$$

where one could use the Monte-Carlo reward-to-go targets $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
Caveat: the same batch of trajectories should not be used both for fitting the value function baseline and for estimating the advantage, since this leads to overfitting and a biased estimate. Thus, trajectories from iteration $k-1$ are used to fit the value function, essentially approximating $V^{\pi_{k-1}}$, and trajectories from iteration $k$ are used to compute the advantage $\hat{A}(s_{i,t}, a_{i,t}) = y_{i,t} - \hat{V}^\pi_\phi(s_{i,t})$ and the resulting policy gradient.
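A minimal sketch of this two-batch scheme (assuming a one-dimensional toy state, a linear value function, and undiscounted rewards-to-go; all names and the synthetic data are illustrative, not the notes' setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_to_go(rewards):
    # y_t = sum over t' >= t of r_t' (undiscounted, as in the notes)
    return np.cumsum(rewards[::-1])[::-1]

def fit_value(states, targets):
    # linear value function V_φ(s) = φ · [s, 1], fit by least squares
    X = np.column_stack([states, np.ones(len(states))])
    phi, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return phi

def predict_value(phi, states):
    X = np.column_stack([states, np.ones(len(states))])
    return X @ phi

# batch from iteration k-1: fit the baseline on its rewards-to-go
states_prev = rng.normal(size=200)
rewards_prev = 1.0 + 0.5 * states_prev + rng.normal(size=200)
targets_prev = reward_to_go(rewards_prev)
phi = fit_value(states_prev, targets_prev)

# batch from iteration k: compute advantages with the frozen baseline
states_k = rng.normal(size=200)
rewards_k = 1.0 + 0.5 * states_k + rng.normal(size=200)
advantages = reward_to_go(rewards_k) - predict_value(phi, states_k)
```

The point of the structure is that `phi` is fit only on the $k-1$ batch, so the advantages on batch $k$ are computed against a baseline that has not seen those trajectories.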