
So, now that we have a value backup function

that we discussed in depth, the question now becomes

what's the optimal policy?

And it turns out this value backup function defines

the optimal policy as a simple choice

of which action to pick,

which is just the action that maximizes this expression over here.

For any state S, any value function V,

we can define a policy,

and that's the one that picks the action under argmax

that maximizes the expression over here.

For the maximization, we can safely drop gamma and R(s), since neither one depends on the action.
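Written out in standard MDP notation (the symbols here are the usual ones, not taken verbatim from the slides), the rule being described is

\[
\pi(s) \;=\; \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, V(s'),
\]

which follows from the full backup \(V(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s')\): \(R(s)\) is the same for every action and \(\gamma\) is a positive constant, so dropping both leaves the argmax unchanged.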

Baked into the value iteration function was already

an action choice that picks the best action.

We just made it explicit.

This is the way of backing up values,

and once values have been backed up,

this is the way to find the optimal thing to do.
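The two steps in the narration — back up values, then read off the greedy policy — can be sketched as follows. This is a minimal illustration on a made-up three-state, two-action MDP (the rewards and transition probabilities are invented for the example, not taken from the lecture):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions.
n_states, n_actions = 3, 2
gamma = 0.9

# R[s]: state-dependent reward; P[a, s, s']: probability of landing
# in s' after taking action a in state s.
R = np.array([0.0, 0.0, 1.0])
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])

# Value backup, iterated to convergence:
# V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
V = np.zeros(n_states)
for _ in range(1000):
    V = R + gamma * np.max(P @ V, axis=0)

# Policy extraction: the argmax over actions. Note gamma and R(s)
# are dropped here -- they don't change which action wins.
policy = np.argmax(P @ V, axis=0)
```

Here `P @ V` computes, for every action and state at once, the expected value of the next state, so the same expression serves both the max in the backup and the argmax in the policy.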