Markov Decision Processes

Markov Chains

In a Markov chain, we have a set of states and discrete probabilistic transitions between those states. This can be modeled as a graph: if we have \(n\) states, we can arrange the transition probabilities into an \(n\times n\) transition matrix \(T\), and this matrix fully determines the Markov chain.

We can then use iterated multiplication to determine probabilities. That is, entry \(i,j\) of \(T^n\) is the probability of being in state \(s_j\) after starting in state \(s_i\) and transitioning for \(n\) steps.
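As a minimal sketch of this, here is a small hypothetical 3-state chain in NumPy (the transition probabilities are made up for illustration):

```python
import numpy as np

# A hypothetical 3-state Markov chain: rows are current states,
# columns are next states, so each row sums to 1.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.4, 0.6],
])

# Entry (i, j) of T^n is Pr(in state j after n steps | started in i).
T5 = np.linalg.matrix_power(T, 5)
print(T5[0, 2])  # probability of reaching state 2 from state 0 in 5 steps
```

Note that each row of \(T^n\) is still a probability distribution, which is a quick sanity check on the computation.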

Regardless, the important takeaway about Markov chains is that the probability \(Pr(s' |s)\) of reaching a given next state \(s'\) depends only on the current state \(s\); this is the well-known Markov property.

Markov Decision Processes

But for reinforcement learning, we want to extend this to include the concepts of an agent and a reward. The agent performs an action \(a\in{A}\) that changes the state \(s\in{S}\), and then receives a reward \(r\in{R}\). More formally, for any potential next state \(s'\), we compute probabilities \[Pr(s',r|s,a) \space \dot= \space Pr\{S_{t}= s', R_{t}= r \mid S_{t-1} = s, A_{t-1} = a \}\] This is called the dynamics function of the MDP. It gives the probability of a given next state and reward using the prior state and prior action, a simple extension of a Markov chain.1

Compared to the original, the extension is also simple: a set of transition matrices, one for each action, or equivalently, a tensor of dimension \(|A| \times |S| \times |S|\), together with a distribution of possible rewards for each transition.
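That per-action stack of transition matrices can be sketched directly; the MDP below is hypothetical, with \(|A| = 2\) actions and \(|S| = 2\) states:

```python
import numpy as np

# A hypothetical |A| x |S| x |S| transition tensor: one |S| x |S|
# transition matrix per action, stacked along the first axis.
P = np.array([
    [[0.8, 0.2],    # action 0
     [0.3, 0.7]],
    [[0.5, 0.5],    # action 1
     [0.1, 0.9]],
])

# For every (action, state) pair, the row over next states is a
# probability distribution.
print(P.sum(axis=2))  # every entry should be 1.0
```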

Now I’ll include some definitions and properties that rely on the definition of the dynamics function above. These are mostly useful for describing proofs and constructing algorithms.

First, \[\sum_{s'\in S}\sum_{r\in R}Pr(s',r|s,a) = 1\] That is, the rows of each transition matrix form a probability distribution, so they must sum to 1. If we limit ourselves to one reward per state, the inner sum over \(r\) can be dropped, since \(Pr(s',r|s,a) \neq 0\) for only one element of \(R\) for any given next state.
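This normalization is easy to check numerically. Below is a hypothetical dynamics table \(p[s, a, s', r]\) with 2 states, 2 actions, and 2 possible reward values; all the probabilities are made up:

```python
import numpy as np

# Hypothetical dynamics table p[s, a, s', r_idx]: for each (s, a),
# a sub-table whose rows are next states s' and columns are rewards.
p = np.zeros((2, 2, 2, 2))
p[0, 0] = [[0.7, 0.1],
           [0.2, 0.0]]
p[0, 1] = [[0.1, 0.0],
           [0.3, 0.6]]
p[1, 0] = [[0.4, 0.1],
           [0.4, 0.1]]
p[1, 1] = [[0.0, 0.5],
           [0.2, 0.3]]

# Summing over all (s', r) pairs for each (s, a) must give 1.
print(p.sum(axis=(2, 3)))
```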

With a distribution of rewards, the probability of a given state occurring is given by summing the reward probabilities for that state: \[\displaylines{ Pr(s'|s,a) \space \dot = \space Pr\{S_{t} = s' | S_{t-1} = s, A_{t-1} = {a} \} \\ = \sum_{r\in R}Pr(s',r|s,a) }\] This gives us a convenient notation for exactly the scenario described above; substituting it into the identity yields \[\sum_{s'\in S}Pr(s'|s,a) = 1\]
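The marginalization is one sum over the reward axis. A minimal sketch for one fixed (hypothetical) \((s, a)\) pair:

```python
import numpy as np

# Hypothetical joint probabilities Pr(s', r | s, a) for one fixed
# (s, a): rows index next states s', columns index reward values.
joint = np.array([
    [0.7, 0.1],   # s' = 0
    [0.2, 0.0],   # s' = 1
])

# Pr(s' | s, a) = sum over r of Pr(s', r | s, a)
p_next = joint.sum(axis=1)
print(p_next)  # Pr(s'=0 | s, a), Pr(s'=1 | s, a)

# The marginal over next states is still a distribution.
assert abs(p_next.sum() - 1.0) < 1e-12
```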

We also define another convenience function to represent the expected value of rewards for a given (state, action) pair. \[\displaylines{ r(s,a) \space \dot = \space \mathbb{E}\lbrack R | S_{t-1}=s, A_{t-1}= a\rbrack \\ = \sum_{r\in R} r \sum_{s' \in S} Pr(s',r|s,a) = \sum_{r\in R} \sum_{s' \in S} r \cdot Pr(s',r|s,a) }\] Take a look at Wikipedia’s expected value page for a refresher. It’s just a weighted sum of rewards, using transition probabilities as weights.
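Concretely, the expected reward marginalizes over next states and then takes the reward-weighted sum. A sketch on the same kind of hypothetical joint table as above:

```python
import numpy as np

# Hypothetical joint table Pr(s', r | s, a) for one fixed (s, a):
# rows are next states s', columns correspond to values in `rewards`.
rewards = np.array([0.0, 1.0])
joint = np.array([
    [0.7, 0.1],
    [0.2, 0.0],
])

# r(s, a) = sum over r, s' of r * Pr(s', r | s, a):
# first marginalize over s', then weight each reward by its probability.
r_sa = (rewards * joint.sum(axis=0)).sum()
# Here this works out to 0.9 * 0.0 + 0.1 * 1.0 = 0.1.
print(r_sa)
```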

\[\displaylines{ r(s,a,s') \space \dot= \space \mathbb{E}\lbrack R | S_{t-1}=s, A_{t-1}= a,S_t=s'\rbrack \\ = \sum_{r\in R} r \frac{Pr(s',r|s,a)}{Pr(s'|s,a)} }\] This last one was not quite as obvious to me, but it is also an identity of (conditional) expected value. It makes more intuitive sense when rearranged like so: \[\frac{ \sum_{r\in R} (r \cdot Pr(s',r|s,a))}{Pr(s'|s,a)}\] This rearrangement is justified because \(Pr(s'|s,a)\) does not depend on the sum index \(r\) (its own sum over \(R\) uses a separate bound variable).

Now it is clear that we are taking a weighted sum of rewards for the given state \(s'\), and dividing it by the probability that we actually reach that state.
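A sketch of that computation, reusing the same hypothetical joint table and conditioning on \(s' = 0\):

```python
import numpy as np

# Hypothetical joint table Pr(s', r | s, a) for one fixed (s, a):
# rows are next states s', columns correspond to values in `rewards`.
rewards = np.array([0.0, 1.0])
joint = np.array([
    [0.7, 0.1],   # s' = 0
    [0.2, 0.0],   # s' = 1
])

# r(s, a, s'=0): reward-weighted sum along the s' = 0 row, divided by
# the probability of actually reaching s' = 0.
p_s0 = joint[0].sum()                       # Pr(s'=0 | s, a)
r_sas = (rewards * joint[0]).sum() / p_s0
# Here this works out to 0.1 / 0.8 = 0.125.
print(r_sas)
```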


The basic strategy of reinforcement learning is providing rewards to an agent to incentivize achieving a goal. “Goals” or “purposes” can be encoded as appropriate rewards for achieving a desired state. Algorithms in reinforcement learning are centered around maximizing these rewards.

If \(R_t\) is the reward at timestep \(t\), we define the discounted return \(G_t\) of a process starting at time \(t\) and continuing indefinitely to be \[\displaylines{ G_t \space \dot= \gamma^{0}R_{t+1}+ \gamma^{1}R_{t+2}+\dots\\ = \sum_{k=0}^{\infty} \gamma^{k} R_{(t+1) +k} }\] If \(|\gamma| < 1\) and \(R\) is some constant, then the above is a geometric series and converges to its closed form \(\frac{R}{1-\gamma}\).
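A minimal sketch of the discounted return on a finite reward sequence, along with the geometric closed form for a constant reward stream (the discount factor and rewards are made up):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+1+k}, truncated to a
# finite reward sequence.
gamma = 0.9

def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Three steps of constant reward 1: 1 + 0.9 + 0.81 = 2.71.
g = discounted_return([1.0, 1.0, 1.0], gamma)
print(g)

# For an infinite constant reward R = 1, the geometric series
# converges to R / (1 - gamma).
print(1.0 / (1.0 - gamma))
```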

There is also a theoretically useful recursive relationship: \[\displaylines{ G_{t} \space = \space \sum_{k=0}^{\infty} \gamma^{k} R_{(t+1) + k} \\ = \gamma^{0} R_{t+1} + \gamma^{1} R_{(t+1) + 1} + \gamma^{2} R_{(t+1) + 2} + \dots \\ = R_{t+1} + \gamma \lbrack \gamma^{0}R_{(t+2) + 0} + \gamma^{1}R_{(t+2) + 1} + \dots \rbrack\\ = \space R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} R_{(t+2) + k} \\ = \space R_{t+1} + \gamma \cdot G_{t+1} }\]
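The recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) can be checked numerically on a hypothetical finite reward sequence (returns past the end of the sequence are zero):

```python
# Hypothetical rewards R_1, R_2, R_3, R_4 and discount factor.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 0.5]

# G[t] computed directly from the definition G_t = sum_k gamma^k R_{t+1+k};
# the extra final entry is the zero return after the episode ends.
G = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
     for t in range(len(rewards) + 1)]

# Verify the recursive relationship at every timestep.
for t in range(len(rewards)):
    assert abs(G[t] - (rewards[t] + gamma * G[t + 1])) < 1e-12
print(G[0])
```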