Actor-critic methods are a family of algorithms built around the same idea: a policy (the actor) learns which actions to take, with the help of a learned value estimate (the critic). In terms of implementation, the most popular variant is the “Advantage Actor-Critic” (A2C) algorithm. Here’s how it works.
Recall that at any given state s in the game, our agent chooses an action a by sampling from a probability distribution over actions (the output of the policy).
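As a concrete picture, sampling from such a distribution might look like the snippet below; the probabilities are made up for illustration, not taken from any actual policy.

```python
import numpy as np

# Hypothetical policy output for some state s: probabilities for actions a1, a2, a3.
action_probs = np.array([0.5, 0.3, 0.2])

# Sample an action index according to those probabilities.
action = np.random.choice(len(action_probs), p=action_probs)
print(action)  # prints 0, 1, or 2, with the probabilities above
```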
Now, different actions naturally have different values: some actions are more likely to lead to a high reward/victory than others.
Let’s say that at a given state s, the agent expects a total reward of 5.0 with action a1, 6.0 with a2, and 2.0 with a3.
In this example, assuming the policy picks each of the three actions equally often, the agent gets an average reward of (5.0 + 6.0 + 2.0) / 3 ≈ 4.33 by following its current policy. Therefore, the advantage of taking action a1 versus the average is 5.0 - 4.33 ≈ 0.67. For a2, the advantage is 1.67, and for a3, it’s -2.33.
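Those numbers fall straight out of a few lines of Python, again under the assumption of a uniform policy over the three actions:

```python
# Expected total rewards for actions a1, a2, a3 in state s (toy numbers from above).
q_values = [5.0, 6.0, 2.0]

# Baseline: the average reward under a uniform policy over the three actions.
baseline = sum(q_values) / len(q_values)        # ≈ 4.33

# Advantage of each action: how much better it is than the baseline.
advantages = [q - baseline for q in q_values]   # ≈ [0.67, 1.67, -2.33]
```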
That’s it: the advantage function, written A(s,a), is the extra reward the agent expects from choosing a particular action a in state s compared to the “average” action under its current policy. In Advantage Actor-Critic, whenever we compare actions, we rely on this advantage value rather than on the raw reward.
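In the usual notation (which the example above glosses over), this reads as follows, where Q(s,a) is the expected return of taking action a in state s and V(s) is the policy-weighted average of those returns, the quantity the critic learns to estimate:

```latex
A(s, a) = Q(s, a) - V(s),
\qquad
V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q(s, a) \right]
```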
Why is this helpful? Intuitively, it forces the model to keep doing better than its own average: to favor the best available actions rather than settle for any action that happens to yield a positive reward.
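To make the mechanics concrete, here is a minimal single-step sketch of the update in PyTorch. Everything in it is illustrative and assumed (the network sizes, the variable names, the made-up return of 5.0); it is not the implementation from any particular library, just the core pattern: the actor’s update is weighted by the advantage, and the critic is trained to predict the return that serves as the baseline.

```python
import torch
import torch.nn as nn

n_state_features, n_actions = 4, 3   # arbitrary sizes for the sketch

# Tiny actor (policy) and critic (state-value) networks.
actor = nn.Linear(n_state_features, n_actions)   # outputs action logits
critic = nn.Linear(n_state_features, 1)          # outputs an estimate of V(s)

state = torch.randn(1, n_state_features)         # a fake observed state

# Actor: sample an action from the policy's distribution over actions.
logits = actor(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

# Pretend the environment returned this total reward for the chosen action.
observed_return = torch.tensor([5.0])

# Advantage: how much better the outcome was than the critic's "average" estimate.
value = critic(state).squeeze(-1)
advantage = observed_return - value

# Actor loss: push the policy toward actions with positive advantage,
# and away from actions with negative advantage.
actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()

# Critic loss: train the baseline to match the observed return.
critic_loss = advantage.pow(2).mean()

(actor_loss + critic_loss).backward()   # gradients for both networks
```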