Different DQN variations

Over the years, a number of variations on the classic DQN have appeared, each attempting to reduce the amount of data needed to learn (i.e. to improve data efficiency) and to increase overall performance relative to humans on the ATARI benchmarks. These variations are listed below.

Classic DQN
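The classic DQN regresses the online network's Q-values towards a bootstrapped target built from a separate target network. A minimal numpy sketch of that target computation (the function name and array layout are illustrative, not from any particular library):

```python
import numpy as np

def dqn_target(q_target_next, reward, done, gamma=0.99):
    """Classic DQN target: r + gamma * max_a' Q_target(s', a'),
    zeroing the bootstrap on terminal transitions.

    q_target_next: (batch, n_actions) Q-values of the target network at s'.
    """
    return reward + gamma * (1.0 - done) * np.max(q_target_next, axis=1)
```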

Double DQN
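Double DQN decouples action selection from action evaluation to reduce the overestimation bias of the max operator: the online network picks the argmax action at s', and the target network evaluates it. A hypothetical array-based sketch:

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, done, gamma=0.99):
    """Double DQN target: the online net selects a* = argmax_a Q_online(s', a),
    the target net evaluates Q_target(s', a*)."""
    best_action = np.argmax(q_online_next, axis=1)           # selection
    q_eval = q_target_next[np.arange(len(best_action)), best_action]  # evaluation
    return reward + gamma * (1.0 - done) * q_eval
```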

Prioritised Replay
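Prioritised replay samples transitions in proportion to their (TD-error-based) priorities rather than uniformly, and corrects the resulting bias with importance-sampling weights. A minimal sketch using plain numpy (production implementations use a sum-tree for O(log N) sampling and updates; all names here are illustrative):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Proportional prioritised sampling with importance-sampling weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = priorities ** alpha          # alpha controls how strongly to prioritise
    probs /= probs.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    # IS weights correct the bias introduced by non-uniform sampling;
    # beta anneals towards 1 over training in the original paper
    weights = (len(priorities) * probs[idx]) ** (-beta)
    weights /= weights.max()             # normalise for stability
    return idx, weights
```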

Dueling Networks

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')$$

where $V$ is the value function and $A$ is the advantage function.
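The dueling architecture computes the value and advantage streams with separate heads and combines them, subtracting the mean advantage so that $V$ and $A$ are identifiable. A minimal numpy sketch of the combining step (the network heads themselves are omitted):

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine V(s) and A(s, a) into Q(s, a), subtracting the mean
    advantage so the decomposition is unique.

    value: (batch, 1), advantage: (batch, n_actions).
    """
    return value + advantage - advantage.mean(axis=1, keepdims=True)
```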

Multi-step Returns
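Multi-step (n-step) returns replace the one-step target with the discounted sum of the next n rewards plus a bootstrap from the value at step n, trading off bias and variance. A small sketch (names are illustrative):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Truncated n-step return: r_0 + gamma r_1 + ... + gamma^n V(s_n),
    computed by folding backwards over the reward sequence."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```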

Distributional RL

$$D_{KL}\left(\Phi_{z}\,\mathcal{T}Z_{\theta'}(s, a)\,\middle\|\,Z_{\theta}(s, a)\right)$$

where $\Phi_{z}$ is the projection operator as explained in the original distributional RL paper. The cross-entropy $D_{KL}$ is minimised here instead of the squared $L_{2}$ loss used in the classic DQN.
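The projection $\Phi_{z}$ maps the Bellman-shifted atoms $r + \gamma z$ back onto the fixed support by distributing each atom's probability mass between its two neighbouring atoms. A minimal numpy sketch of that categorical projection (a straightforward loop version of the algorithm in the original paper, not an optimised implementation):

```python
import numpy as np

def project_distribution(next_probs, rewards, dones, gamma, support):
    """Project r + gamma*z onto the fixed support atoms (the Phi_z operator).

    next_probs: (batch, n_atoms) probabilities of Z at (s', a*).
    support:    (n_atoms,) evenly spaced atom values z_i.
    """
    n_atoms = len(support)
    v_min, v_max = support[0], support[-1]
    delta = (v_max - v_min) / (n_atoms - 1)
    proj = np.zeros((len(rewards), n_atoms))
    for i in range(len(rewards)):
        for j, z in enumerate(support):
            # shifted atom, clipped to the support's range
            tz = np.clip(rewards[i] + gamma * (1.0 - dones[i]) * z, v_min, v_max)
            b = (tz - v_min) / delta
            l, u = int(np.floor(b)), int(np.ceil(b))
            if l == u:
                proj[i, l] += next_probs[i, j]
            else:
                # split the mass between the two neighbouring atoms
                proj[i, l] += next_probs[i, j] * (u - b)
                proj[i, u] += next_probs[i, j] * (b - l)
    return proj
```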

It is important to remember that $\gamma$ is usually fixed in these algorithms, although in principle it can be learnt separately for each time-step. For a fixed $\gamma$ the effective time-horizon can be computed as

$$T_{\text{eff}} \approx \frac{1}{1-\gamma}$$

Therefore, the effective time-horizon for $\gamma=0.99$ is 100 time-steps.
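As a quick sanity check of that relation (the function name is illustrative):

```python
def effective_horizon(gamma):
    """Effective time-horizon 1 / (1 - gamma) for a fixed discount factor."""
    return 1.0 / (1.0 - gamma)
```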

Written on September 10, 2017