Different DQN variations
Over the years, different variations of the classic DQN have appeared, each attempting to reduce the amount of data needed to learn (i.e. to improve data efficiency) and to increase overall performance measured against humans on the ATARI benchmarks. These variations are listed below.
In the dueling architecture, the Q-function is decomposed into two streams:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$

where $V(s)$ is the value function and $A(s, a)$ is the advantage function.
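The aggregation of the two streams can be sketched as follows; this is a minimal illustration with hypothetical inputs `v` and `adv` standing in for the outputs of the two network heads, not a full network implementation:

```python
import numpy as np

def dueling_q(v, adv):
    """Combine the value and advantage streams into Q-values.

    Subtracting the mean advantage makes the decomposition identifiable:
    Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')).
    """
    adv = np.asarray(adv, dtype=float)
    return v + (adv - adv.mean())

# Hypothetical outputs of the two heads for one state with three actions:
q = dueling_q(v=1.0, adv=[0.5, -0.5, 0.0])
# The mean advantage is 0 here, so q = [1.5, 0.5, 1.0].
```

Note that subtracting the mean shifts all Q-values equally, so the greedy action is unchanged.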
In distributional DQN, the network predicts a full distribution $Z(s, a)$ over returns rather than its expectation, and the training target is

$$\Phi \hat{\mathcal{T}} Z_{\theta'}(s, a)$$

where $\Phi$ is the projection operator as explained in the original distributional RL paper. The cross-entropy between the predicted distribution and this projected target is minimised here, instead of the squared-error loss used in the classic DQN.
It is important to remember that the discount factor $\gamma$ is usually fixed in these algorithms, although it can in principle be learnt, taking a different value at each time-step. For a fixed $\gamma$ the effective time-horizon can be computed as

$$H = \frac{1}{1 - \gamma}$$
Therefore, the effective time-horizon for $\gamma = 0.99$ is 100 time-steps.
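The relation above is a one-liner; the function name here is illustrative:

```python
def effective_horizon(gamma):
    """Effective time-horizon H = 1 / (1 - gamma) for a fixed discount factor."""
    return 1.0 / (1.0 - gamma)

effective_horizon(0.99)   # ~100 time-steps
effective_horizon(0.999)  # ~1000 time-steps
```

Intuitively, rewards further than roughly $H$ steps in the future are discounted to near zero and contribute little to the value estimate.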