Exact Solution Methods for Decision Problems

February 2, 2021

Definitions

Markov Decision Process

A Markov Decision Process (MDP) models choosing an action $a_t$ from an action space $A$ while in a state $s_t$ (from a state space $S$) and thence receiving a reward $r_t$. Choosing such an action evolves the process to a state $s_{t+1}$ according to a probability distribution $P(S_{t+1} \mid S_t, A_t)$ over states and actions[1]. The Markov assumption is that $S_{t+1}$ depends only on $S_t$ and $A_t$ (rather than on the entire history/trajectory of the process). Stationary MDPs are those for which $P(S_{t+1} \mid S_t, A_t)$ does not vary in time (i.e. it depends only on the states and actions, not on the time $t$ at which those states and actions were occupied/chosen). For such MDPs the state transition model $T(s' \mid s, a)$ represents the probability of transitioning from state $s$ to state $s'$ after choosing action $a$, and the reward function $R(s, a)$ represents the expected reward when executing action $a$ from state $s$.
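As a concrete sketch (assuming a finite, tabular problem; the container and field names below are illustrative, not from any particular library), an MDP can be bundled as:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    S: List[int]                          # state space (integer indices)
    A: List[int]                          # action space
    T: Callable[[int, int, int], float]   # T(s, a, s') -> transition probability
    R: Callable[[int, int], float]        # R(s, a) -> expected reward
    gamma: float                          # discount factor, 0 <= gamma < 1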

Rewards

$R(s,a)$ is a deterministic function of $s, a$ because it is an expectation, but the underlying reward may be "generated" stochastically; for example, if the reward depends on the next state (as given by a modified $R(s, a, s')$) then

$$R(s,a) = \sum_{s'} T(s' \mid s, a)\, R(s, a, s') \tag{1}$$

In a finite horizon problem with $n$ decisions, the utility associated with a sequence of rewards $r_{1:n} := r_1, r_2, \ldots, r_n$ (also called the return) is

$$\sum_{t=1}^{n} r_t \tag{2}$$

In an infinite horizon problem the discounted return is given by

$$\sum_{t=1}^{\infty} \gamma^{t-1} r_t \tag{3}$$

where $0 \le \gamma < 1$ guarantees that the sum converges (for bounded rewards). Note that the "first" reward $r_1$ is undiscounted ($\gamma^{1-1} = 1$).
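For instance, the discounted return of a finite prefix of rewards can be accumulated directly (a minimal sketch; names are illustrative):

def discounted_return(rewards, gamma):
    # sum_{t=1}^{n} gamma^(t-1) * r_t; the first reward is undiscounted
    return sum(gamma**t * r for t, r in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0], gamma=0.9)   # 1 + 0.9 + 0.81 = 2.71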

Policies

Given a history of states and actions $h_t := (s_{1:t}, a_{1:t})$, a policy $\pi_t(h_t)$ sets which action is taken at each time step. Since for MDPs future states and rewards depend only on the current state, we restrict our attention to policies that depend only on the current state, i.e. $\pi_t(h_t) \equiv \pi_t(s_t)$. A deterministic policy is one where the action taken is a deterministic function of the state. Alternatively, a stochastic policy is one where the action taken is drawn from a probability distribution $\pi_t(a_t \mid s_t)$. For infinite horizon problems with stationary transitions and rewards, stationary policies, naturally, do not depend on time, i.e.

$$\pi_t(s_t) \equiv \pi(s) \tag{4}$$

Note that for finite horizon problems (even with stationary transitions and rewards) it may be advantageous to consider non-stationary policies.

Value functions

The expected utility of executing policy $\pi$ from state $s$ is denoted $U^\pi(s)$. For MDPs, $U^\pi$ is called the value function (because it captures the value of being in a state, given some policy for the actions to take). An optimal policy $\pi^*$ is a policy that maximizes expected utility

$$\pi^*(s) := \arg\max_\pi U^\pi(s) \quad \text{for all states } s \tag{5}$$

The value function $U^*(s) := U^{\pi^*}(s)$ corresponding to the optimal policy is called the optimal value function. An optimal policy can be found using dynamic programming[2].

Policy evaluation

Policy evaluation involves computing the value function $U^\pi$ of a given policy $\pi$. This can be done inductively (iteratively); for a single step from state $s$ the value function is just the reward received when performing action $\pi(s)$

$$U_1^\pi(s) := R(s, \pi(s)) \tag{6}$$

Then $U_{k+1}^\pi(s)$ at step $k+1$ is estimated using the lookahead equation; the lookahead equation estimates $U_{k+1}^\pi(s)$ by combining

  1. the reward received by taking action $\pi(s)$
  2. the discounted average of the $k$th (current) estimate of the policy's value, $U_k^\pi(s')$, over all possible states $s'$ that can be transitioned to when starting in state $s$ and taking action $\pi(s)$

That is to say

$$U_{k+1}^\pi(s) := R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, U_k^\pi(s') \tag{7}$$

Note that since $0 \le \gamma < 1$, the map $U_k^\pi \mapsto U_{k+1}^\pi$ is a contraction mapping[3] and therefore policy evaluation converges

$$U^\pi(s) := \lim_{k \to \infty} U_{k+1}^\pi(s) \tag{8}$$

and therefore for an infinite horizon problem, at convergence, we have

$$U^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, U^\pi(s') \tag{9}$$
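A minimal sketch of iterative policy evaluation via repeated application of the lookahead equation (7), assuming states are integer indices, T(s, a, sp) and R(s, a) are functions, and pi is a dict mapping states to actions (all names illustrative):

def iterative_policy_evaluation(S, T, R, pi, gamma, k_max):
    U = [0.0 for _ in S]                  # any bounded initialization converges to the same limit
    for _ in range(k_max):
        # apply the lookahead equation (7) to every state
        U = [R(s, pi[s]) + gamma * sum(T(s, pi[s], sp) * U[sp] for sp in S) for s in S]
    return U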

$U^\pi(s)$ can also be obtained by solving a system of linear equations

$$U^\pi = R^\pi + \gamma T^\pi U^\pi \tag{10}$$

where $U^\pi, R^\pi$ are $|S|$-length vectors and $T^\pi$ is an $|S| \times |S|$ matrix whose entries $T^\pi_{ij}$ are the state-$i$ to state-$j$ transition probabilities under $\pi$. Solving for $U^\pi$ we get

$$U^\pi = (I - \gamma T^\pi)^{-1} R^\pi \tag{11}$$

which has runtime $O(|S|^3)$.
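Equivalently, the linear system (11) can be solved directly, e.g. with numpy (a sketch under the same assumptions as above):

import numpy as np

def exact_policy_evaluation(S, T, R, pi, gamma):
    # build T^pi and R^pi for the fixed policy pi, then solve (I - gamma T^pi) U = R^pi
    T_pi = np.array([[T(s, pi[s], sp) for sp in S] for s in S])
    R_pi = np.array([R(s, pi[s]) for s in S])
    return np.linalg.solve(np.eye(len(S)) - gamma * T_pi, R_pi)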

Q-function

An alternative way to represent an arbitrary value function $U(s)$ is in terms of the action as well as the state, namely by the action value function (also called the Q-function)

$$Q(s, a) := R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s') \tag{12}$$

Storing $Q$ costs $O(|S| \times |A|)$ as opposed to just $O(|S|)$ for storing $U$, but it saves having to store and query $R$ and $T$.
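A sketch of building such a Q table from a value function U, following definition (12) (tabular assumptions as above; the function name is illustrative):

import numpy as np

def q_from_value(S, A, T, R, gamma, U):
    # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s' | s, a) * U(s')
    return np.array([[R(s, a) + gamma * sum(T(s, a, sp) * U[sp] for sp in S)
                      for a in A] for s in S])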

Value Function Policies

The previous section computed a value function given a policy. Similarly, given a value function $U(s)$, we can compute the policy that maximizes that value function

$$\pi(s) := \arg\max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s') \right) \tag{13}$$

called the greedy policy. If $U(s)$ is the optimal value function then $\pi(s)$ is the optimal policy.

Both $U(s)$ and $\pi(s)$ can be obtained from the Q-function

$$U(s) = \max_a Q(s, a) \qquad \pi(s) = \arg\max_a Q(s, a) \tag{14}$$
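Continuing the tabular sketch above (with Q the numpy array returned by the hypothetical q_from_value), both quantities in (14) can be read off directly:

U = Q.max(axis=1)        # U(s)  = max_a Q(s, a)
pi = Q.argmax(axis=1)    # pi(s) = argmax_a Q(s, a)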

Advantage function

Policies can also be represented using the advantage function, which quantifies the "advantage" of taking some action relative to the greedy action (the action dictated by the greedy policy with respect to $U(s)$)

$$A(s, a) = Q(s, a) - U(s) \tag{15}$$

Note that greedy actions have “zero advantage” and non-greedy actions have “negative advantage”.
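In the same tabular sketch, the advantage table is just the Q table with $U(s)$ subtracted from each row (A_table is an illustrative name, to avoid clashing with the action space A):

A_table = Q - U[:, None]   # A(s, a) = Q(s, a) - U(s); zero for greedy actions, negative for the rest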

Policy Iteration

Policy iteration is an algorithm for computing an optimal policy. It starts with any policy (e.g. $\pi(s) = 0$ for all $s$, when actions are indexed by integers) and alternates between policy evaluation and policy improvement (greedy improvement using the lookahead function).

def lookahead(S, T, R, gamma, U, s, a):
    # one-step lookahead: immediate reward plus discounted expected value of the next state;
    # assumes states are integer indices, T(s, a, sp) is a transition function, and U is indexable by state
    return R(s, a) + gamma * sum(T(s, a, sp) * U[sp] for sp in S)

def greedy_action(S, T, R, A, gamma, U, s):
    # action in A that maximizes the one-step lookahead value from state s
    return max(A, key=lambda a: lookahead(S, T, R, gamma, U, s, a))

import numpy as np

def policy_iteration(S, pi, A, T, R, gamma, k_max=100):
    # pi is a dict mapping each state (integer index) to an action
    for _ in range(k_max):
        # policy evaluation: solve U = R^pi + gamma * T^pi U in matrix form (eqn. 11)
        T_pi = np.array([[T(s, pi[s], sp) for sp in S] for s in S])
        R_pi = np.array([R(s, pi[s]) for s in S])
        U = np.linalg.solve(np.eye(len(S)) - gamma * T_pi, R_pi)

        # policy improvement: act greedily with respect to U
        pi_new = {s: greedy_action(S, T, R, A, gamma, U, s) for s in S}

        # check convergence: stop when the policy no longer changes
        if all(pi_new[s] == pi[s] for s in S):
            break
        pi = pi_new
    return pi
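A toy usage example on a two-state, two-action MDP (the numbers and names are purely illustrative):

S, A = [0, 1], [0, 1]

def T(s, a, sp):
    # action 0 stays in place; action 1 deterministically moves to the other state
    stays = (a == 0)
    return 1.0 if (sp == s) == stays else 0.0

def R(s, a):
    return 1.0 if s == 1 else 0.0             # being in state 1 is rewarding

pi0 = {s: 0 for s in S}                       # initial policy: always action 0
pi_star = policy_iteration(S, pi0, A, T, R, gamma=0.9)   # yields {0: 1, 1: 0}: move to state 1, then stay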

Value iteration

Value iteration is an alternative to policy iteration that updates the value function directly (instead of updating it through the policy). It starts with any bounded value function, $|U(s)| < \infty$ for all $s$, e.g. $U(s) = 0$, and improves it using the Bellman equation

$$U_{k+1}(s) = \max_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, U_k(s') \right) \tag{16}$$

This update is called a backup.

def backup(S, T, R, A, gamma, U, s):
    # Bellman backup (eqn. 16): best one-step lookahead value over all actions
    return max(lookahead(S, T, R, gamma, U, s, a) for a in A)

As already mentioned this is guaranteed to converge, and the optimal value function is guaranteed to satisfy

$$U^*(s) = \max_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, U^*(s') \right) \tag{17}$$

That is to say, $U^*(s)$ is a fixed point of the Bellman equation.

def value_iteration(S, T, R, A, gamma, k_max=100):
    U = [0.0 for _ in S]                      # start from the bounded guess U(s) = 0
    for _ in range(k_max):
        U = [backup(S, T, R, A, gamma, U, s) for s in S]
    # extract the greedy policy with respect to the final value function
    return {s: greedy_action(S, T, R, A, gamma, U, s) for s in S}

Note that we could also terminate value iteration when $\|U_{k+1} - U_k\|_\infty$[4], called the Bellman residual, falls below some threshold $\delta$. A Bellman residual of $\delta$ guarantees that the current iterate is within $\epsilon = \delta\gamma/(1-\gamma)$ of the optimal value function $U^*$. Similarly, the policy loss $\|U^{\pi} - U^*\|_\infty$ bounds the deviation of the total reward obtained under the policy $\pi$ extracted from value iteration; the policy loss is bounded by $2\epsilon$.
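A sketch of value iteration that terminates on the Bellman residual instead of a fixed iteration count, reusing the backup and greedy_action helpers above (delta is an illustrative tolerance parameter):

def value_iteration_residual(S, T, R, A, gamma, delta=1e-6):
    U = [0.0 for _ in S]
    while True:
        U_next = [backup(S, T, R, A, gamma, U, s) for s in S]
        residual = max(abs(un - u) for un, u in zip(U_next, U))   # ||U_{k+1} - U_k||_inf
        U = U_next
        if residual < delta:
            break
    return {s: greedy_action(S, T, R, A, gamma, U, s) for s in S}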

Bonus: Linear Program (LP) Formulation

Finding the optimal policy can be reformulated as a linear program. Begin by replacing the equality in equation (17) with a set of inequality constraints and a minimization objective

$$\begin{aligned} \text{minimize} \quad & \sum_{s \in S} U(s) \\ \text{subject to} \quad & U(s) \ge \max_a \left( R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, U(s') \right) \quad \text{(for all } s\text{)} \end{aligned} \tag{18}$$

Intuitively, the minimization objective induces/encourages equality. The $\max_a$ is a nonlinear constraint; it can be replaced by a set of linear constraints

$$\begin{aligned} \text{minimize} \quad & \sum_{s \in S} U(s) \\ \text{subject to} \quad & U(s) \ge R(s, a) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, U(s') \quad \text{(for all } s, a\text{)} \end{aligned} \tag{19}$$

This is therefore a linear program because both the objective and constraints are linear in the variables U(s).
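A sketch of the LP in (19) using scipy.optimize.linprog, under the same tabular representation as the earlier sketches (the function name is illustrative):

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(S, A, T, R, gamma):
    n = len(S)
    c = np.ones(n)                                    # minimize sum_s U(s)
    A_ub, b_ub = [], []
    for s in S:
        for a in A:
            # R(s,a) + gamma * sum_{s'} T(s'|s,a) U(s') - U(s) <= 0
            row = np.array([gamma * T(s, a, sp) for sp in S])
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R(s, a))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * n)
    return res.x                                      # the optimal value function U*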

Linear Systems with Quadratic Reward

The Bellman equation[5] for continuous, vector-valued states and actions is

$$U_{h+1}(s) = \max_a \left( R(s, a) + \int_{s' \in S} T(s' \mid s, a)\, U_h(s')\, ds' \right) \tag{20}$$

Note that here we are assuming a finite horizon and no discounting. In general this is difficult to solve. However, if a problem has linear dynamics and a quadratic reward then the optimal policy can be computed in closed form; a problem has linear dynamics if

$$s' = T_s s + T_a a + w \tag{21}$$

where

$$E[s' \mid s, a] = T_s s + T_a a \tag{22}$$

i.e. $T_s$ and $T_a$ are matrices that determine how the mean of the next state $s'$ depends on $s$ and $a$ respectively, and $w$ is zero-mean, finite-variance noise that is independent of both $s$ and $a$ (e.g. multivariate Gaussian). A reward function is quadratic if it can be written

$$R(s, a) = s^\top R_s s + a^\top R_a a \tag{23}$$

where $R_s, R_a$ are matrices that determine the contributions of the state and the action to the reward. Furthermore, $R_s$ and $R_a$ should be negative semi-definite[6] and negative definite, respectively; such a reward penalizes states and actions different from $0$. In control theory such a system is called a linear quadratic regulator (LQR).

Substituting eqns. (21) and (23) into the Bellman equation (the integral over next states $s'$ becomes an integral over the noise $w$, since $s' = T_s s + T_a a + w$)

$$U_{h+1}(s) = \max_a \left( s^\top R_s s + a^\top R_a a + \int p(w)\, U_h(T_s s + T_a a + w)\, dw \right) \tag{24}$$

The optimal value function for a one-step horizon is

$$U_1(s) = \max_a \left( s^\top R_s s + a^\top R_a a \right) = s^\top R_s s \tag{25}$$

for which the optimal action is $a = 0$ (since $R_a$ is negative definite). We now prove that $U_h(s) = s^\top V_h s + q_h$ for a symmetric $V_h$, with $V_1 = R_s$ and $q_1 = 0$. Making this substitution into the right-hand side of eqn. (24)

$$U_{h+1}(s) = s^\top R_s s + \max_a \left( a^\top R_a a + \int p(w) \left( (T_s s + T_a a + w)^\top V_h (T_s s + T_a a + w) + q_h \right) dw \right) \tag{26}$$

Using $\int p(w)\, dw = 1$ and $\int w\, p(w)\, dw = 0$ (zero mean), the cross terms involving $w$ vanish and the $q_h$ term integrates to $q_h$, giving

$$U_{h+1}(s) = s^\top R_s s + s^\top T_s^\top V_h T_s s + \max_a \left( a^\top R_a a + 2 s^\top T_s^\top V_h T_a a + a^\top T_a^\top V_h T_a a \right) + \int p(w) \left( w^\top V_h w \right) dw + q_h \tag{27}$$

Using

$$\nabla_x (A x) = A \qquad \nabla_x (x^\top A x) = (A + A^\top) x \tag{28}$$

differentiating with respect to $a$ and setting the derivative equal to $0$ gives

$$0 = (R_a + R_a^\top) a + 2 T_a^\top V_h T_s s + \left( T_a^\top V_h T_a + (T_a^\top V_h T_a)^\top \right) a = 2 R_a a + 2 T_a^\top V_h T_s s + 2 T_a^\top V_h T_a a \tag{29}$$

Using the fact that $R_a + T_a^\top V_h T_a$ is negative definite (and therefore invertible), we obtain the optimal action

$$a = -(R_a + T_a^\top V_h T_a)^{-1} T_a^\top V_h T_s s \tag{30}$$

and substituting into $U_{h+1}(s)$ produces $U_{h+1}(s) = s^\top V_{h+1} s + q_{h+1}$ with

$$V_{h+1} = R_s + T_s^\top V_h T_s - (T_a^\top V_h T_s)^\top (R_a + T_a^\top V_h T_a)^{-1} (T_a^\top V_h T_s) \tag{31}$$

and

$$q_{h+1} = \sum_{i=1}^{h} \int p(w) \left( w^\top V_i w \right) dw = \sum_{i=1}^{h} E_w\!\left[ w^\top V_i w \right] \tag{32}$$

If $w \sim \mathcal{N}(0, \Sigma)$ then

$$q_{h+1} = \sum_{i=1}^{h} \operatorname{Tr}(\Sigma V_i) \tag{33}$$

Equation (31) is called the discrete-time Riccati equation. Note that the optimal action is independent of $w$, but the variance of $w$ does affect the expected utility through $q_{h+1}$.
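A sketch of the finite-horizon Riccati recursion (31) together with the optimal linear feedback (30), assuming Rs, Ra, Ts, Ta are numpy arrays with the shapes implied above (function and variable names are illustrative):

import numpy as np

def lqr_riccati(Rs, Ra, Ts, Ta, horizon):
    # iterate V_{h+1} = Rs + Ts' V_h Ts - (Ta' V_h Ts)' (Ra + Ta' V_h Ta)^{-1} (Ta' V_h Ts)
    V = Rs                                            # V_1 = R_s
    gains = []
    for _ in range(horizon):
        M = Ta.T @ V @ Ts
        K = -np.linalg.solve(Ra + Ta.T @ V @ Ta, M)   # a = K s is the optimal action (eqn. 30) for the current V_h
        gains.append(K)
        V = Rs + Ts.T @ V @ Ts + M.T @ K              # algebraically equivalent to eqn. (31)
    return V, gains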

Footnotes

  1. Capital letters $S_t, A_t, R_t$ stand for random variables, of which $s_t, a_t, r_t$ are samples.

  2. Dynamic programming refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. The relationship between the larger problem and its sub-problems is described by the Bellman equation.

  3. A contraction mapping on a metric space $(M, d)$ is a function $f$ from $M$ to itself with the property that there is some real number $\gamma$ with $0 \le \gamma < 1$ such that for all $x, y$ in $M$ it's the case that $d(f(x), f(y)) \le \gamma\, d(x, y)$.

  4. The infinity norm is defined as $\|x\|_\infty := \max(|x_1|, \ldots, |x_n|)$.

  5. With continuous parameters this is known as the Hamilton–Jacobi–Bellman equation

  6. $M$ is negative definite $\iff x^\top M x < 0$ for all $x \in \mathbb{R}^n \setminus \{0\}$, and $M$ is negative semi-definite $\iff x^\top M x \le 0$ for all $x \in \mathbb{R}^n$.