In a Markov chain, we have a concept of *states* and a set
of discrete probabilistic *transitions* between those states.
This can be modeled as a graph: if we have \(n\) states, we can arrange the transitions
into an \(n\times n\) *transition
matrix* \(T\), and this fully
determines the Markov chain.

We can then use iterated multiplication to determine probabilities. That is, entry \(i,j\) of \(T^n\) gives the probability of being in state \(s_j\) after starting in state \(s_i\) and transitioning for \(n\) steps.
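A quick numerical sketch of this, using a hypothetical 2-state chain (the states and probabilities here are made up for illustration):

```python
import numpy as np

# Hypothetical 2-state chain: state 0 and state 1.
# Row i holds the transition probabilities out of state i.
T = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
])

# Entry (i, j) of T^n is the probability of being in state j
# after n steps, starting from state i.
T3 = np.linalg.matrix_power(T, 3)
print(T3[0, 1])  # Pr of reaching state 1 from state 0 in 3 steps

# Each row of T^n is still a probability distribution.
print(T3.sum(axis=1))  # -> [1. 1.]
```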

Regardless, the important takeaway about Markov chains is that the probability \(Pr(s'|s)\) of reaching a given next state \(s'\) depends only on the current state \(s\); this is the well-known Markov property.

But for reinforcement learning, we want to extend this to include
the concept of an *agent* and a *reward*. The agent is
said to perform an *action* \(a\in{A}\) that changes the *state*
\(s\in{S}\), and then receives a
*reward* \(r\in{R}\). More
formally, for any potential next state \(s'\), we compute probabilities \[Pr(s',r|s,a) \space \dot= \space Pr\{S_{t}=
s', R_{t}= r | S_{t-1} = s, A_{t-1} = a \}\] This is called
the *dynamics function* of the MDP. You can compute the
probability of a given state and reward using the prior state
*and* prior action, a simple extension of a Markov chain.^{1}

Comparing to the original, the extension is also simple: a set of transition matrices, one for each action, or equivalently, a tensor of dimension \(|A| \times |S| \times |S|\), plus a distribution of possible rewards for each transition.
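As a concrete sketch of that representation, here is a toy tabular MDP (all states, actions, and probabilities are made up for illustration), stored as one array with a reward dimension bolted onto the per-action transition matrices:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions, 2 possible reward values.
S, A = 2, 2
rewards = np.array([0.0, 1.0])  # the finite reward set R

# p[a, s, s2, k] = Pr(s2, rewards[k] | s, a): one |S| x |S| transition
# matrix per action, extended with a reward dimension.
p = np.zeros((A, S, S, len(rewards)))
p[0, 0, 0, 0] = 0.8   # action 0 in state 0: stay, reward 0 ...
p[0, 0, 1, 1] = 0.2   # ... or move to state 1 with reward 1
p[0, 1, 1, 0] = 1.0
p[1, 0, 1, 1] = 1.0   # action 1 in state 0 always moves to state 1
p[1, 1, 0, 0] = 1.0

# Summing out the reward dimension recovers the |A| x |S| x |S|
# tensor of per-action transition matrices.
T = p.sum(axis=-1)
print(T[0])  # transition matrix for action 0
```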

Now I’ll include some definitions and properties that rely on the definition of the dynamics function above. These are mostly useful for describing proofs and constructing algorithms.

First, \[\sum_{s'\in S}\sum_{r\in R}Pr(s',r|s,a) = 1\] That is, the rows of each transition matrix represent a probability distribution, so they must sum to 1. If we limit ourselves to one reward per state, the inner sum over \(r\) can be dropped, since \(Pr(s',r|s,a) \neq 0\) for only one element of \(R\) for any given state.

With a distribution of rewards, the probability of a given state occurring is given by summing the reward probabilities for that state: \[\displaylines{ Pr(s'|s,a) \space \dot = \space Pr\{S_{t} = s' | S_{t-1} = s, A_{t-1} = {a} \} \\ = \sum_{r\in R}Pr(s',r|s,a) }\] This gives us a convenient notation for exactly the scenario described above; substituting it into the identity yields \[\sum_{s'\in S}Pr(s'|s,a) = 1\]
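Both identities are easy to check numerically. Here is a minimal sketch for one fixed \((s, a)\) pair, using an illustrative table of joint state/reward probabilities:

```python
import numpy as np

# p_sa[s2, k] = Pr(s2, rewards[k] | s, a) for one fixed (s, a) pair
# (values are illustrative, not from any real problem).
p_sa = np.array([
    [0.7, 0.1],   # s2 = 0: Pr of reward 0, Pr of reward 1
    [0.0, 0.2],   # s2 = 1
])

# Identity: summing over all next states and rewards gives 1.
assert abs(p_sa.sum() - 1.0) < 1e-12

# Marginalizing out the reward gives Pr(s' | s, a) ...
p_next = p_sa.sum(axis=1)
print(p_next)  # -> [0.8 0.2]

# ... which is itself a distribution over next states.
assert abs(p_next.sum() - 1.0) < 1e-12
```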

We also define another convenience function to represent the expected value of rewards for a {state, action} pair. \[\displaylines{ r(s,a) \space \dot = \space \mathbb{E}\lbrack R | S_{t-1}=s, A_{t-1}= a\rbrack \\ = \sum_{r\in R} r \sum_{s' \in S} Pr(s',r|s,a) = \sum_{r\in R} \sum_{s' \in S} r \cdot Pr(s',r|s,a) }\] Take a look at Wikipedia’s expected value page for a refresher. It’s just a weighted sum of rewards using transition probabilities as weights.
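In code, \(r(s,a)\) is one marginalization and one dot product. A sketch using the same illustrative joint table as before:

```python
import numpy as np

# p_sa[s2, k] = Pr(s2, rewards[k] | s, a) for one fixed (s, a) pair
# (illustrative values).
rewards = np.array([0.0, 1.0])
p_sa = np.array([
    [0.7, 0.1],
    [0.0, 0.2],
])

# r(s, a): sum over rewards of r * (sum over s' of Pr(s', r | s, a)).
r_sa = (rewards * p_sa.sum(axis=0)).sum()
print(r_sa)  # 0 * 0.7 + 1 * (0.1 + 0.2) = 0.3
```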

\[\displaylines{ r(s,a,s') \space \dot= \space \mathbb{E}\lbrack R | S_{t-1}=s, A_{t-1}= a,S_t=s'\rbrack \\ = \sum_{r\in R} r \frac{Pr(s',r|s,a)}{Pr(s'|s,a)} }\] This last one was not quite as obvious to me, but it is also an identity of expected value. It makes more intuitive sense when rearranged like so: \[\frac{ \sum_{r\in R} (r \cdot Pr(s',r|s,a))}{Pr(s'|s,a)}\] This rearrangement is justified because \(Pr(s'|s,a)\) does not depend on the summation index \(r\) (though it is itself a sum over \(R\)).

Now it is clear we are taking a weighted sum of rewards for the given state \(s'\), and dividing it by the probability that we actually achieve that state.
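The same toy table makes this concrete: pick one next state, take the weighted reward sum for just that row, and divide by the probability of reaching it (again, all numbers here are illustrative):

```python
import numpy as np

# p_sa[s2, k] = Pr(s2, rewards[k] | s, a) for one fixed (s, a) pair.
rewards = np.array([0.0, 1.0])
p_sa = np.array([
    [0.7, 0.1],   # s2 = 0
    [0.0, 0.2],   # s2 = 1
])

# r(s, a, s'): weighted reward sum for one particular next state,
# divided by the probability of actually reaching that state.
s_next = 0
p_reach = p_sa[s_next].sum()                      # Pr(s' | s, a) = 0.8
r_sas = (rewards * p_sa[s_next]).sum() / p_reach  # = 0.1 / 0.8
print(r_sas)
```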

The basic strategy of reinforcement learning is providing rewards to an agent to incentivize achieving a goal. “Goals” or “purposes” can be encoded as appropriate rewards for achieving a desired state. Algorithms in reinforcement learning are centered around maximizing these rewards.

If \(R_t\) is the reward at
timestep \(t\), we define the
*discounted return* \(G_t\) of
a process starting at time \(t\) and
continuing indefinitely to be \[\displaylines{
G_t \space \dot= \gamma^{0}R_{t+1}+ \gamma^{1}R_{t+2}+\dots\\
= \sum_{k=0}^{\infty} \gamma^{k} R_{(t+1) +k}
}\] If \(|\gamma| < 1\) and
\(R\) is some constant, then the
above is a geometric
series and can be substituted with its closed form.
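A quick numerical check of that closed form, \(G = R/(1-\gamma)\) for constant \(R\), with arbitrary illustrative values:

```python
# With a constant reward R and |gamma| < 1, the return is a geometric
# series with closed form R / (1 - gamma). Truncating the infinite sum
# at a large k should agree with the closed form.
R, gamma = 1.0, 0.9
partial = sum(gamma**k * R for k in range(10_000))
closed = R / (1 - gamma)
print(partial, closed)  # both ~10.0
```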

There is also a theoretically useful recursive relationship: \[\displaylines{ G_{t} \space = \space \sum_{k=0}^{\infty} \gamma^{k} R_{(t+1) + k} \\ = \gamma^{0} R_{t+1} + \gamma^{1} R_{(t+1) + 1} + \gamma^{2} R_{(t+1) + 2} + \dots \\ = R_{t+1} + \gamma \lbrack \gamma^{0}R_{(t+2) + 0} + \gamma^{1}R_{(t+2) + 1} + \dots \rbrack \\ = \space R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{(t+2) + k} \\ = \space R_{t+1} + \gamma \cdot G_{t+1} }\]
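The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) is easy to verify numerically on a finite episode (rewards after the episode end are taken to be zero; the sequence below is made up for illustration):

```python
gamma = 0.5
rewards = [1.0, 2.0, 4.0]   # R_1, R_2, R_3 for a process starting at t = 0

def G(t):
    # Discounted return from time t: sum_k gamma^k * R_{(t+1)+k}.
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

print(G(0))  # 1 + 0.5*2 + 0.25*4 = 3.0

# Check the recursion G_t = R_{t+1} + gamma * G_{t+1}.
assert abs(G(0) - (rewards[0] + gamma * G(1))) < 1e-12
```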