These algorithms rely on the idea of temporal difference, maintaining a running estimate \(M_t\) of a time-varying target \(X_{t}\):
\[M_{t+1} \leftarrow M_{t}+ \alpha [X_{t+1} - M_{t}]\]
This motif seems to be repeated frequently throughout Reinforcement Learning.
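As a minimal sketch, the update above might look like this in Python; the names `update`, `estimate`, `target`, and `alpha` are illustrative choices, not from any particular library.

```python
# Minimal sketch of the running-estimate update M_{t+1} <- M_t + alpha * (X_{t+1} - M_t).
# Names here (`update`, `estimate`, `target`, `alpha`) are illustrative assumptions.

def update(estimate: float, target: float, alpha: float = 0.1) -> float:
    """Move the current estimate a fraction alpha toward the latest target."""
    return estimate + alpha * (target - estimate)

# Tracking a noisy, time-varying target:
targets = [10.0, 12.0, 11.0, 13.0]
m = 0.0
for x in targets:
    m = update(m, x)
```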
The notational conventions used in these algorithms also seem to be repeated throughout Reinforcement Learning implementations, e.g. functions that compute actions are frequently referred to as \(\pi\).
The value iteration algorithm uses a Bellman equation to iteratively improve an estimate of the value function, from which a policy \(\pi\) can be extracted. The algorithm can be proven to converge to the optimal value function.
\[V_0(s) = \max_{a} R(a,s)\] \[V_t(s) \leftarrow \max_{a}[R(a,s) + \sum\limits_{s'\in S}\gamma Pr(s'|s,a)\cdot V_{t-1}(s')]\]
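A sketch of this loop in NumPy, assuming the transition probabilities are stored as a tensor `T` of shape `(num_actions, num_states, num_states)` with `T[a, s, s']` \(= Pr(s'|s,a)\), and the rewards as a matrix `R` of shape `(num_states, num_actions)`; these array layouts and names are assumptions for illustration, not a definitive implementation.

```python
import numpy as np

def value_iteration(T: np.ndarray, R: np.ndarray, gamma: float, n_iters: int) -> np.ndarray:
    """Apply the Bellman backup n_iters times and return the value estimate.

    T: shape (num_actions, num_states, num_states), T[a, s, s'] = Pr(s' | s, a).
    R: shape (num_states, num_actions), R[s, a] = reward for taking a in s.
    """
    V = R.max(axis=1)                  # V_0(s) = max_a R(a, s)
    for _ in range(n_iters):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} Pr(s' | s, a) * V[s']
        Q = R + gamma * (T @ V).T
        V = Q.max(axis=1)
    return V
```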
Let’s examine a worked example:
Because this is an MDP with two actions, we have two transition matrices \(T^{a}\), one for each action, where entry \((s, s')\) gives \(Pr(s'|s,a)\).
\[T^{a_{0}}= \begin{bmatrix} .5 & .5 & 0 & 0\\ 0 & 1 & 0 & 0 \\ .5 & .5 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}\] \[T^{a_{1}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ .5 & 0 & 0 & .5 \\ .5 & 0 & .5 & 0 \\ 0 & 0 & .5 & .5 \end{bmatrix}\]
We also have the following reward matrix \(R\), with one row per state and one column per action, giving the reward for performing each action in each state.
\[R = \begin{bmatrix} 0 & 0 \\ 0 & 10 \\ 5 & 0 \\ 5 & 10 \end{bmatrix}\]
For the first iteration, we have \(V_{0}(s) = \max_{a}R(a,s)\): the maximum entry in each row of the reward matrix.
\[V_0 = \begin{bmatrix} 0 \\ 10 \\ 5 \\ 10 \end{bmatrix}\]
For each subsequent timestep, we use \(V_t(s) \leftarrow \max_{a}[R(a,s) + \sum\limits_{s'\in S}\gamma Pr(s'|s,a)\cdot V_{t-1}(s')]\). Let’s try the first one:
\[V_{1}= \begin{bmatrix} \max\{\,0 + \gamma(.5\cdot 0 + .5 \cdot 10),\; 0 + \gamma (1\cdot 0)\,\} \\ \max\{\,0 + \gamma(1 \cdot 10),\; 10 + \gamma (.5\cdot 0 + .5 \cdot 10)\,\} \\ \max\{\,5 + \gamma(.5\cdot 0 + .5 \cdot 10),\; 0 + \gamma (.5\cdot 0 + .5 \cdot 5)\,\} \\ \max\{\,5 + \gamma(1 \cdot 10),\; 10 + \gamma (.5\cdot 5 + .5 \cdot 10)\,\} \end{bmatrix}\]
Letting \({\gamma=.9}\) we get
\[= \begin{bmatrix} 4.5 \\ 14.5 \\ 9.5 \\ 16.75\end{bmatrix}\]
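As a sanity check on the arithmetic, here is a short NumPy snippet (using the same assumed array layout as the sketch above) that plugs in the example matrices with \(\gamma = .9\) and performs one backup:

```python
import numpy as np

# Transition tensor T[a, s, s'] and reward matrix R[s, a] from the example above.
T = np.array([
    [[.5, .5, 0, 0], [0, 1, 0, 0], [.5, .5, 0, 0], [0, 1, 0, 0]],    # action a0
    [[1, 0, 0, 0], [.5, 0, 0, .5], [.5, 0, .5, 0], [0, 0, .5, .5]],  # action a1
])
R = np.array([[0, 0], [0, 10], [5, 0], [5, 10]])
gamma = 0.9

V0 = R.max(axis=1)           # -> [0, 10, 5, 10]
Q1 = R + gamma * (T @ V0).T  # Q1[s, a] = R[s, a] + gamma * sum_s' T[a, s, s'] * V0[s']
V1 = Q1.max(axis=1)          # -> [4.5, 14.5, 9.5, 16.75]
print(V0, V1)
```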
Policy iteration is very similar to value iteration, but instead of iterating directly on the value function, we alternate between policy evaluation and policy improvement. We start with an arbitrary policy \(\pi(s)\) that determines the action we take in each state, and improve it using
\[\DeclareMathOperator*{\argmax}{argmax} \pi(s) \leftarrow \argmax_{a}[R(a,s) + \sum\limits_{s'\in S}\gamma Pr(s'|s,a)\cdot V^{\pi}(s')]\]
The use of \(\argmax\) shouldn’t need much explanation here: for each state, we pick the action that maximizes this form of the Bellman equation. The value function \(V^{\pi}\) can be updated using
\[V^{\pi}(s) \leftarrow R(\pi(s), s) + \sum\limits_{s'\in S}\gamma Pr(s'|s,\pi(s))\cdot V^{\pi}(s')\]
The justification for using \(V^{\pi}\) to update itself hinges on the idea that repeated application of this backup converges, so in practice we stop once successive estimates differ by less than some tolerance \(\epsilon\).
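Putting evaluation and improvement together, a sketch of policy iteration might look like the following. The array layout matches the earlier sketches, and the stopping rules (evaluation to within `eps`, improvement until the policy stops changing) are one reasonable choice among several, not the only way to do it.

```python
import numpy as np

def policy_iteration(T: np.ndarray, R: np.ndarray, gamma: float, eps: float = 1e-6) -> np.ndarray:
    """Alternate policy evaluation and greedy policy improvement.

    T: shape (num_actions, num_states, num_states); R: shape (num_states, num_actions).
    Returns the final policy as an array of action indices, one per state.
    """
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)   # arbitrary initial policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: apply the backup for the fixed policy until V changes by < eps.
        while True:
            V_new = R[np.arange(n_states), pi] + gamma * T[pi, np.arange(n_states)] @ V
            converged = np.max(np.abs(V_new - V)) < eps
            V = V_new
            if converged:
                break
        # Policy improvement: pick the argmax action in each state.
        pi_new = (R + gamma * (T @ V).T).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi
        pi = pi_new
```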
Modified policy iteration sits between the two: rather than evaluating the policy to convergence, we perform only \(k\) value-iteration-style backups under the current policy before each improvement step. TODO: expand this section
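Under that reading, a sketch might look like this; `k` and the iteration cap `n_improvements` are arbitrary illustrative choices, and the array layout is the same assumption as above.

```python
import numpy as np

def modified_policy_iteration(T: np.ndarray, R: np.ndarray, gamma: float,
                              k: int = 5, n_improvements: int = 100) -> np.ndarray:
    """Policy iteration with truncated evaluation: only k backup sweeps per improvement step."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    for _ in range(n_improvements):
        # Partial policy evaluation: k backups under the fixed policy pi.
        for _ in range(k):
            V = R[np.arange(n_states), pi] + gamma * T[pi, np.arange(n_states)] @ V
        # Greedy policy improvement.
        pi = (R + gamma * (T @ V).T).argmax(axis=1)
    return pi
```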