# Lecture on Policy Gradients

**Optimising the policy**

Let us denote by $\tau$ a state–action path $s_0, a_0, s_1, a_1, \ldots, s_T, a_T$, by $R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ its total reward, and by $P(\tau; \theta)$ the probability of the path under the policy $\pi_\theta$. Therefore, we can rewrite the expected reward as

$$U(\theta) = \mathbb{E}\Big[\sum_{t=0}^{T} r(s_t, a_t) \,\Big|\, \pi_\theta\Big] = \sum_{\tau} P(\tau; \theta)\, R(\tau).$$

Our goal is to maximise the expected reward, $\max_\theta U(\theta)$. Differentiating this function yields

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} P(\tau; \theta)\, R(\tau) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau).$$

A convenient identity to use here is the likelihood-ratio trick,

$$\nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta).$$

Therefore, the overall gradient becomes an expectation, which we can approximate with $m$ sampled paths:

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)}).$$

Since $P(\tau; \theta) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$ and the dynamics terms do not depend on $\theta$, we have $\nabla_\theta \log P(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so no model of the dynamics is needed.

The gradient tries to:

- increase probability of paths with positive reward.
- decrease probability of paths with negative reward.
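As a concrete illustration, here is a minimal NumPy sketch of the sampled estimate $\frac{1}{m}\sum_i \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$ for a linear-softmax policy over discrete actions. The policy parameterisation and the trajectory format are assumptions made for the example, not part of the lecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, s, a):
    """Gradient of log pi_theta(a | s) for a linear-softmax policy.

    theta: (n_actions, n_features), s: (n_features,), a: integer action index.
    """
    probs = softmax(theta @ s)
    grad = -np.outer(probs, s)   # -pi(a' | s) * s for every action a'
    grad[a] += s                 # +s for the action actually taken
    return grad

def reinforce_gradient(theta, trajectories):
    """1/m * sum_i grad log P(tau_i; theta) * R(tau_i), no baseline.

    trajectories: list of paths, each a list of (state, action, reward) tuples.
    """
    total = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(r for (_, _, r) in traj)                                  # total reward of the path
        score = sum(grad_log_softmax(theta, s, a) for (s, a, _) in traj)  # grad log P(tau; theta)
        total += score * R   # positive R pushes the path's log-probability up, negative pushes it down
    return total / len(trajectories)
```

A gradient ascent step is then simply `theta += learning_rate * reinforce_gradient(theta, trajectories)`.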

Let us assume that the reward is always positive, *i.e.* $R(\tau) > 0$. Then the gradient increases the probability of *all* paths, which isn't exactly what we want. Therefore, we need to subtract a baseline $b$ to obtain a relative reward $R(\tau) - b$. In [Williams 92] it is shown that the baseline does not change the overall objective we are maximising.

Therefore, subtracting a baseline is unbiased in expectation:

$$\mathbb{E}\big[\nabla_\theta \log P(\tau; \theta)\, b\big] = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, b = b\, \nabla_\theta \sum_{\tau} P(\tau; \theta) = b\, \nabla_\theta 1 = 0.$$

What possible choices of baseline can we consider?

**Constant baseline**

The simplest choice is the average return over the $m$ sampled paths, $b = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})$, so that paths with above-average reward become more likely and paths with below-average reward become less likely.
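Continuing the earlier sketch (and reusing its `grad_log_softmax` helper and NumPy import), the constant-baseline estimator is a small change:

```python
def reinforce_gradient_with_baseline(theta, trajectories):
    """Same estimator as before, with the mean return subtracted as a constant baseline."""
    returns = [sum(r for (_, _, r) in traj) for traj in trajectories]
    b = sum(returns) / len(returns)   # constant baseline: average return over the batch
    total = np.zeros_like(theta)
    for traj, R in zip(trajectories, returns):
        score = sum(grad_log_softmax(theta, s, a) for (s, a, _) in traj)
        total += score * (R - b)      # relative reward
    return total / len(trajectories)
```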

**Baseline that minimises the variance**

Remember that the sampled gradient estimate with a baseline is

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, \big(R(\tau^{(i)}) - b\big).$$

We would like to find the $b$ which minimises the variance of a single-sample estimate $\hat{g} = \nabla_\theta \log P(\tau; \theta)\, \big(R(\tau) - b\big)$,

$$\mathrm{Var}(\hat{g}) = \mathbb{E}\big[\hat{g}^2\big] - \big(\mathbb{E}[\hat{g}]\big)^2.$$

Let us denote $g = \nabla_\theta \log P(\tau; \theta)$, so that $\hat{g} = g\,\big(R(\tau) - b\big)$.

Remember that the baseline does not bias the estimator, so $\mathbb{E}[\hat{g}] = \mathbb{E}\big[g\, R(\tau)\big]$ does not depend on $b$. The derivative of the second term with respect to $b$ is therefore zero.

This yields

$$\frac{\partial}{\partial b}\, \mathbb{E}\big[g^2 \big(R(\tau) - b\big)^2\big] = \mathbb{E}\big[-2\, g^2 \big(R(\tau) - b\big)\big] = 0 \quad\Rightarrow\quad b^{*} = \frac{\mathbb{E}\big[g^2\, R(\tau)\big]}{\mathbb{E}\big[g^2\big]},$$

*i.e.* the optimal baseline is the expected reward weighted by the squared magnitude of the gradient, computed per component of $g$.
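A minimal sketch of estimating $b^{*}$ from samples, assuming the per-path scores $\nabla_\theta \log P(\tau^{(i)}; \theta)$ have already been flattened into rows of a matrix (e.g. from the sketches above); the baseline is computed element-wise, one value per parameter.

```python
import numpy as np

def optimal_baseline(scores, returns):
    """b* = E[g^2 R] / E[g^2], element-wise over the parameter vector.

    scores : (m, n_params) array, row i is grad log P(tau_i; theta), flattened
    returns: (m,) array, entry i is R(tau_i)
    """
    g2 = scores ** 2
    num = (g2 * returns[:, None]).mean(axis=0)   # estimate of E[g^2 R]
    den = g2.mean(axis=0) + 1e-8                 # estimate of E[g^2]; small constant avoids division by zero
    return num / den                             # one baseline per parameter
```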

**Time-varying baseline**

An action taken at time $t$ cannot influence rewards received before $t$, so we can replace $R(\tau)$ by the reward to go and let the baseline depend on the timestep:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \Big(\sum_{k \ge t} r_k^{(i)} - b_t\Big), \qquad b_t = \frac{1}{m} \sum_{i=1}^{m} \sum_{k \ge t} r_k^{(i)}.$$
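A short sketch of computing the rewards to go and the per-timestep baseline, assuming for simplicity that all $m$ sampled paths have the same horizon $T$:

```python
import numpy as np

def rewards_to_go(rewards):
    """rewards: (m, T) array; column t of the result holds sum_{k >= t} r_k for each path."""
    return np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)

def time_varying_baseline(rewards):
    """b_t = average reward to go at time t across the m sampled paths; shape (T,)."""
    return rewards_to_go(rewards).mean(axis=0)
```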

**Value function baseline that depends on the state**

Replacing the time-varying baseline by a learned value function $V^\pi(s_t)$ gives the advantage form of the gradient:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \Big(\sum_{k \ge t} r_k^{(i)} - V^\pi(s_t^{(i)})\Big).$$

This increases the log probability of each action in proportion to how much its return is better than the expected return under the current policy.

But **how do we estimate $V^\pi$?**

We could do roll-outs with the current policy, collect the rewards to go, and regress a parameterised value function $V^\pi_\phi$ against them:

$$\phi \leftarrow \arg\min_\phi \frac{1}{m} \sum_{i=1}^{m} \sum_{t} \Big(V^\pi_\phi(s_t^{(i)}) - \sum_{k \ge t} r_k^{(i)}\Big)^2,$$

where one could use, for example, a simple linear function or a neural network for $V^\pi_\phi$, fitted either to these Monte Carlo targets or to bootstrapped (TD) targets from the Bellman equation for $V^\pi$.
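As one concrete instance, here is a sketch that fits a linear value function $V_\phi(s) = \phi^\top s$ to the empirical rewards to go by (ridge-regularised) least squares; the linear parameterisation is an assumption made for the example.

```python
import numpy as np

def fit_value_function(states, targets, reg=1e-3):
    """Least-squares fit of V_phi(s) = phi^T s to the empirical rewards to go.

    states : (N, n_features) array, one row per visited state
    targets: (N,) array, the reward to go sum_{k >= t} r_k observed from each state
    """
    n = states.shape[1]
    A = states.T @ states + reg * np.eye(n)   # ridge-regularised normal equations
    b = states.T @ targets
    return np.linalg.solve(A, b)

def value_baseline(phi, states):
    """V_phi(s) for each visited state, used as the baseline in the gradient estimate."""
    return states @ phi
```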

**Caveat:** The same batch of trajectories should not be used both for fitting the value function baseline and for estimating the advantage, since this leads to overfitting and a biased estimate. Thus, trajectories from iteration $k-1$ are used to fit the value function, essentially approximating $V^{\pi_{k-1}}$, and trajectories from iteration $k$ are used to compute the advantages and the policy gradient.
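To make the data flow concrete, here is a hedged sketch of the resulting loop; `collect_trajectories` and `policy_gradient_step` are placeholders standing in for the roll-out and update pieces sketched earlier, not a fixed API.

```python
def train(theta, phi, n_iterations):
    """Alternate value-function fitting (previous batch) and policy updates (current batch)."""
    prev_states, prev_rtg = collect_trajectories(theta)   # iteration k-1 data
    for k in range(n_iterations):
        # Fit the baseline on the *previous* batch, essentially approximating V^{pi_{k-1}}.
        phi = fit_value_function(prev_states, prev_rtg)
        # Fresh roll-outs with the current policy pi_k.
        states, rtg = collect_trajectories(theta)
        # Advantages and the policy gradient use only the iteration-k batch.
        theta = policy_gradient_step(theta, states, rtg, baseline=value_baseline(phi, states))
        prev_states, prev_rtg = states, rtg
    return theta, phi
```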