Exact Solution Methods for Decision Problems
- Definitions
- Policy evaluation
- Value Function Policies
- Policy Iteration
- Value iteration
- Bonus: Linear Program (LP) Formulation
- Footnotes
Definitions
Markov Decision Process
A Markov Decision Process (MDP) models choosing an action $a \in \mathcal{A}$ while in a state $s \in \mathcal{S}$, receiving a reward with expected value $R(s, a)$, and transitioning to a new state $s'$ according to the transition model $T(s' \mid s, a)$. The Markov assumption is that the next state depends only on the current state and action, not on the rest of the history.
Rewards
In a finite horizon problem with $n$ decisions, the return is simply the sum of the rewards received, $\sum_{t=1}^{n} r_t$.

In an infinite horizon problem the discounted return is given by

$$\sum_{t=1}^{\infty} \gamma^{t-1} r_t$$

where $\gamma \in [0, 1)$ is the discount factor, which keeps the return finite as long as the rewards are bounded.
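As a quick sanity check, the discounted return of a (truncated) reward sequence is a one-liner in Python:

```python
def discounted_return(rewards, gamma):
    # computes sum_{t=1}^{n} gamma^(t-1) * r_t for a finite reward list
    return sum(gamma**t * r for t, r in enumerate(rewards))

discounted_return([1.0, 1.0, 1.0], gamma=0.9)   # 1 + 0.9 + 0.81 = 2.71
```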
Policies
Given a history of states and actions $h_t = (s_{1:t}, a_{1:t-1})$, a policy tells us which action to take, $a_t = \pi_t(h_t)$. Because of the Markov assumption, it suffices to condition only on the current state, $a_t = \pi_t(s_t)$, and for infinite horizon problems with stationary transitions and rewards we can further restrict attention to stationary policies, $a_t = \pi(s_t)$.
Note that for finite horizon problems (even with stationary transitions and rewards) it may be advantageous to consider non-stationary policies: the best action can depend on how many decisions remain.
Value functions
The expected utility of executing policy $\pi$ starting from state $s$, i.e. the expected return¹ $\mathbb{E}\!\left[\sum_t \gamma^{t-1} R_t \mid S_1 = s\right]$, is denoted $U^\pi(s)$.

The value function $U^\pi$ maps each state to this expected utility. The optimal policy $\pi^*$ maximizes expected utility from every state, $\pi^*(s) = \arg\max_\pi U^\pi(s)$, and its value function is written $U^*$.
Policy evaluation
Policy evaluation involves estimating the value function $U^\pi$ of a given policy $\pi$. One approach is iterative policy evaluation (a simple form of dynamic programming²): start from an initial estimate such as $U^\pi_0(s) = 0$ for all $s$.

Then

$$U^\pi_{k+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, U^\pi_k(s')$$

which is the sum of

- the reward received by taking action $\pi(s)$ in state $s$, and
- the discounted average value of the $k$th (current) estimate of the policy's value, $U^\pi_k$, over all possible states $s'$ that can be transitioned to when starting in state $s$ and taking action $\pi(s)$.

That is to say, the update is an expectation over the next state:

$$U^\pi_{k+1}(s) = R(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s, \pi(s))}\!\left[ U^\pi_k(s') \right].$$
Note that since this update is a contraction mapping³ for $\gamma < 1$, the iteration converges, and therefore for an infinite horizon problem, at convergence, we have

$$U^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, U^\pi(s').$$

In matrix form this is $\mathbf{U}^\pi = \mathbf{R}^\pi + \gamma \mathbf{T}^\pi \mathbf{U}^\pi$, where $\mathbf{U}^\pi$ and $\mathbf{R}^\pi$ are vectors indexed by state and $\mathbf{T}^\pi_{ij} = T(s_j \mid s_i, \pi(s_i))$, so the value function can also be obtained directly by solving the linear system

$$\mathbf{U}^\pi = (\mathbf{I} - \gamma \mathbf{T}^\pi)^{-1} \mathbf{R}^\pi$$

which has runtime $O(|\mathcal{S}|^3)$ because of the matrix inversion.
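A minimal sketch of both versions in Python, assuming a hypothetical MDP container `P` with a list of states `P.S`, a list of actions `P.A`, transition probabilities `P.T(s, a, s')`, expected rewards `P.R(s, a)`, and discount `P.gamma` (these names are mine, not from the original):

```python
import numpy as np

def iterative_policy_evaluation(P, pi, k_max):
    # repeatedly apply U_{k+1}(s) = R(s, pi(s)) + gamma * sum_s' T(s'|s,pi(s)) U_k(s')
    U = {s: 0.0 for s in P.S}
    for _ in range(k_max):
        U = {s: P.R(s, pi[s]) + P.gamma * sum(P.T(s, pi[s], sp) * U[sp] for sp in P.S)
             for s in P.S}
    return U

def exact_policy_evaluation(P, pi):
    # solve (I - gamma * T_pi) u = R_pi directly; O(|S|^3) due to the solve
    T_pi = np.array([[P.T(s, pi[s], sp) for sp in P.S] for s in P.S])
    R_pi = np.array([P.R(s, pi[s]) for s in P.S])
    u = np.linalg.solve(np.eye(len(P.S)) - P.gamma * T_pi, R_pi)
    return {s: u[i] for i, s in enumerate(P.S)}
```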
Q-function
An alternative way to represent an arbitrary value function $U$ is the action-value function, or $Q$-function:

$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s').$$

Storing $Q$ requires $O(|\mathcal{S}| \times |\mathcal{A}|)$ entries instead of the $O(|\mathcal{S}|)$ needed for $U$, but the value function and a greedy policy can then be read off directly, $U(s) = \max_a Q(s, a)$ and $\pi(s) = \arg\max_a Q(s, a)$, without consulting the reward and transition models again.
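Converting between the two representations is mechanical; a sketch using the same hypothetical `P` interface as above:

```python
def q_from_u(P, U):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s' | s, a) * U(s')
    return {(s, a): P.R(s, a) + P.gamma * sum(P.T(s, a, sp) * U[sp] for sp in P.S)
            for s in P.S for a in P.A}

def u_from_q(P, Q):
    # recover U (and implicitly the greedy policy) by maximizing over actions
    return {s: max(Q[(s, a)] for a in P.A) for s in P.S}
```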
Value Function Policies
The previous section computed a value function given a policy.
Similarly, given a value function $U$, we can obtain a policy

$$\pi(s) = \arg\max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s') \right)$$

called the greedy policy.

If $U$ is the optimal value function $U^*$, then the greedy policy with respect to it is an optimal policy.

Both the one-step lookahead above and the $Q$-function representation yield the same greedy policy, since $\pi(s) = \arg\max_a Q(s, a)$ when $Q$ is computed from $U$.
Advantage function
Policies can also be represented using the advantage function, which quantifies the “advantage” of taking some action relative to the greedy action (the action dictated by the greedy policy):

$$A(s, a) = Q(s, a) - U(s).$$

Note that greedy actions have “zero advantage” and non-greedy actions have “negative advantage”.
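In code this is just the gap between $Q$ and $U$; a sketch on top of the dict-based $Q$-function above:

```python
def advantage(P, Q):
    # A(s, a) = Q(s, a) - U(s), with U(s) = max_a Q(s, a)
    return {(s, a): Q[(s, a)] - max(Q[(s, b)] for b in P.A)
            for s in P.S for a in P.A}
```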
Policy Iteration
Policy iteration is an algorithm for computing an optimal policy.
It starts with any policy $\pi$ (e.g. one that picks an action uniformly at random in every state) and alternates two steps: policy evaluation, which computes $U^\pi$, and policy improvement, which replaces $\pi$ with the greedy policy with respect to $U^\pi$. It stops when the policy no longer changes, at which point the policy is optimal. A runnable sketch, using the same hypothetical MDP container `P` as in the policy evaluation snippet above:
```python
import numpy as np

def lookahead(P, U, s, a):
    # one-step lookahead: immediate reward plus discounted expected next value
    return P.R(s, a) + P.gamma * sum(P.T(s, a, sp) * U[sp] for sp in P.S)

def greedy_action(P, U, s):
    # the action that maximizes the one-step lookahead
    return max(P.A, key=lambda a: lookahead(P, U, s, a))

def policy_iteration(P, pi, k_max=100):
    # pi is a dict mapping each state to an action
    for k in range(k_max):
        # policy evaluation: solve (I - gamma * T_pi) u = R_pi exactly
        T_pi = np.array([[P.T(s, pi[s], sp) for sp in P.S] for s in P.S])
        R_pi = np.array([P.R(s, pi[s]) for s in P.S])
        u = np.linalg.solve(np.eye(len(P.S)) - P.gamma * T_pi, R_pi)
        U = {s: u[i] for i, s in enumerate(P.S)}
        # policy improvement: act greedily with respect to U
        pi_new = {s: greedy_action(P, U, s) for s in P.S}
        # check convergence: stop once the policy no longer changes
        if all(pi_new[s] == pi[s] for s in P.S):
            break
        pi = pi_new
    return pi
```
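For concreteness, here is a tiny two-state problem wired to that hypothetical interface (the `MDP` container below is illustrative, not from the original post):

```python
from dataclasses import dataclass

@dataclass
class MDP:
    # minimal container matching the interface assumed by the code above
    S: list
    A: list
    gamma: float
    transitions: dict   # (s, a, s') -> probability
    rewards: dict       # (s, a) -> expected reward
    def T(self, s, a, sp): return self.transitions.get((s, a, sp), 0.0)
    def R(self, s, a): return self.rewards.get((s, a), 0.0)

# two states: "stay" keeps the current state, "move" switches to the other one;
# the only reward is for staying in s2
P = MDP(
    S=["s1", "s2"], A=["stay", "move"], gamma=0.9,
    transitions={("s1", "stay", "s1"): 1.0, ("s1", "move", "s2"): 1.0,
                 ("s2", "stay", "s2"): 1.0, ("s2", "move", "s1"): 1.0},
    rewards={("s2", "stay"): 1.0},
)

pi0 = {s: "stay" for s in P.S}
print(policy_iteration(P, pi0))   # {'s1': 'move', 's2': 'stay'}
```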
Value iteration
Value iteration is an alternative to policy iteration that updates the value function directly (instead of updating it through the policy).
It starts with any bounded initial value function, e.g. $U_0(s) = 0$ for all $s$, and repeatedly applies the Bellman update

$$U_{k+1}(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U_k(s') \right).$$

This update is called a backup.
```python
def backup(P, U, s):
    # Bellman backup: maximize the one-step lookahead over actions
    return max(lookahead(P, U, s, a) for a in P.A)
```
As already mentioned, this is guaranteed to converge, and the value function it converges to is guaranteed to satisfy the Bellman optimality equation

$$U^*(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U^*(s') \right).$$

That is to say, once value iteration has converged, the optimal policy is simply the greedy policy with respect to $U^*$.
```python
def value_iteration(P, k_max=100):
    # start from the zero value function (any bounded initialization works)
    U = {s: 0.0 for s in P.S}
    for k in range(k_max):
        U = {s: backup(P, U, s) for s in P.S}
    # return the greedy policy with respect to the final value function
    return {s: greedy_action(P, U, s) for s in P.S}
```
Note that we could also terminate value iteration when the Bellman residual⁴ $\lVert U_{k+1} - U_k \rVert_\infty$ drops below a threshold $\delta$, which guarantees that the resulting value function is within $\epsilon = \delta\gamma/(1 - \gamma)$ of $U^*$ in the infinity norm.
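A sketch of value iteration with this residual-based stopping rule (reusing `backup` and `greedy_action` from above; `delta` is the threshold):

```python
def value_iteration_residual(P, delta=1e-6, k_max=1000):
    U = {s: 0.0 for s in P.S}
    for k in range(k_max):
        U_new = {s: backup(P, U, s) for s in P.S}
        # Bellman residual: infinity norm of the change in the value function
        residual = max(abs(U_new[s] - U[s]) for s in P.S)
        U = U_new
        if residual < delta:
            break
    return {s: greedy_action(P, U, s) for s in P.S}
```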
Bonus: Linear Program (LP) Formulation
Finding the optimal policy can be reformulated as a linear program.
Begin by replacing the equality in the Bellman optimality equation above with a set of inequality constraints, one per state-action pair, while minimizing the total value:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{s} U(s) \\
\text{subject to} \quad & U(s) \ge R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, U(s') \quad \text{for all } s, a.
\end{aligned}
$$

Intuitively the minimization objective induces/encourages equality: pushing each $U(s)$ down as far as the constraints allow makes at least one constraint per state tight, which recovers the $\max$ in the Bellman equation.

The variables of the optimization are the values $U(s)$, one per state.

This is therefore a linear program, because both the objective and the constraints are linear in the variables $U(s)$.
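Here is a minimal sketch using `scipy.optimize.linprog` with the same hypothetical `P` interface; the decision variables are the $|\mathcal{S}|$ values $U(s)$ and there is one inequality constraint per state-action pair:

```python
import numpy as np
from scipy.optimize import linprog

def lp_solve(P):
    S, A, n = list(P.S), list(P.A), len(P.S)
    c = np.ones(n)                      # objective: minimize sum_s U(s)
    A_ub, b_ub = [], []
    for i, s in enumerate(S):
        for a in A:
            # R(s,a) + gamma * sum_s' T(s'|s,a) U(s') - U(s) <= 0
            row = np.array([P.gamma * P.T(s, a, sp) for sp in S])
            row[i] -= 1.0
            A_ub.append(row)
            b_ub.append(-P.R(s, a))
    # values may be negative, so lift the default nonnegativity bounds
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(None, None))
    U = {s: res.x[i] for i, s in enumerate(S)}
    return {s: greedy_action(P, U, s) for s in S}
```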
Linear Systems with Quadratic Reward
The Bellman equation⁵ for continuous, vector-valued states and actions is

$$U_{h+1}(\mathbf{s}) = \max_{\mathbf{a}} \left( R(\mathbf{s}, \mathbf{a}) + \int T(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})\, U_h(\mathbf{s}')\, \mathrm{d}\mathbf{s}' \right).$$

Note that here we're assuming a finite horizon and no discounting. In general this is difficult to solve. By contrast, if a problem has linear dynamics and a quadratic reward then the optimal policy can be computed in closed form; a problem has linear dynamics if

$$\mathbf{s}' = \mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a} + \mathbf{w}$$

where $\mathbf{T}_s$ and $\mathbf{T}_a$ are matrices that represent the mean of the next state as a linear function of the current state and action, and $\mathbf{w}$ is zero-mean, finite-variance noise that does not depend on $\mathbf{s}$ or $\mathbf{a}$. The reward is quadratic if

$$R(\mathbf{s}, \mathbf{a}) = \mathbf{s}^\top \mathbf{R}_s \mathbf{s} + \mathbf{a}^\top \mathbf{R}_a \mathbf{a}$$

where $\mathbf{R}_s$ is symmetric negative semidefinite and $\mathbf{R}_a$ is symmetric negative definite.

Substituting the linear dynamics and the quadratic reward into the Bellman equation gives

$$U_{h+1}(\mathbf{s}) = \max_{\mathbf{a}} \left( \mathbf{s}^\top \mathbf{R}_s \mathbf{s} + \mathbf{a}^\top \mathbf{R}_a \mathbf{a} + \mathbb{E}_{\mathbf{w}}\!\left[ U_h(\mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a} + \mathbf{w}) \right] \right).$$

The optimal one-step lookahead function is

$$U_1(\mathbf{s}) = \max_{\mathbf{a}} \left( \mathbf{s}^\top \mathbf{R}_s \mathbf{s} + \mathbf{a}^\top \mathbf{R}_a \mathbf{a} \right) = \mathbf{s}^\top \mathbf{R}_s \mathbf{s}$$

for which the optimal action is $\mathbf{a} = \mathbf{0}$, since $\mathbf{R}_a$ is negative definite. This suggests the quadratic form $U_h(\mathbf{s}) = \mathbf{s}^\top \mathbf{V}_h \mathbf{s} + q_h$, with $\mathbf{V}_1 = \mathbf{R}_s$ and $q_1 = 0$.

Using this form for $U_h$ together with $\mathbb{E}[\mathbf{w}] = \mathbf{0}$ (the cross terms vanish),

$$\mathbb{E}_{\mathbf{w}}\!\left[ U_h(\mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a} + \mathbf{w}) \right] = (\mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a})^\top \mathbf{V}_h (\mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a}) + \mathbb{E}_{\mathbf{w}}\!\left[ \mathbf{w}^\top \mathbf{V}_h \mathbf{w} \right] + q_h.$$

Differentiating the bracketed objective wrt $\mathbf{a}$ and setting the gradient to zero⁶,

$$2 \mathbf{R}_a \mathbf{a} + 2 \mathbf{T}_a^\top \mathbf{V}_h (\mathbf{T}_s \mathbf{s} + \mathbf{T}_a \mathbf{a}) = \mathbf{0},$$

the optimal action is

$$\mathbf{a} = -\left( \mathbf{T}_a^\top \mathbf{V}_h \mathbf{T}_a + \mathbf{R}_a \right)^{-1} \mathbf{T}_a^\top \mathbf{V}_h \mathbf{T}_s\, \mathbf{s}.$$

Substituting this action back shows that $U_{h+1}$ is again quadratic, $U_{h+1}(\mathbf{s}) = \mathbf{s}^\top \mathbf{V}_{h+1} \mathbf{s} + q_{h+1}$, with

$$\mathbf{V}_{h+1} = \mathbf{R}_s + \mathbf{T}_s^\top \mathbf{V}_h \mathbf{T}_s - \mathbf{T}_s^\top \mathbf{V}_h \mathbf{T}_a \left( \mathbf{T}_a^\top \mathbf{V}_h \mathbf{T}_a + \mathbf{R}_a \right)^{-1} \mathbf{T}_a^\top \mathbf{V}_h \mathbf{T}_s$$

and

$$q_{h+1} = q_h + \mathbb{E}_{\mathbf{w}}\!\left[ \mathbf{w}^\top \mathbf{V}_h \mathbf{w} \right].$$

If $\mathbf{w}$ has covariance $\boldsymbol{\Sigma}$, then $\mathbb{E}_{\mathbf{w}}[\mathbf{w}^\top \mathbf{V}_h \mathbf{w}] = \mathrm{Tr}(\boldsymbol{\Sigma} \mathbf{V}_h)$; the noise shifts the value by a constant but does not affect the optimal action.

The equation for $\mathbf{V}_{h+1}$ is a discrete-time Riccati equation; iterating it backwards from $\mathbf{V}_1 = \mathbf{R}_s$ yields the (non-stationary) optimal policy for every number of remaining steps.
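A small sketch of the resulting backward recursion in Python/NumPy (the argument names `Ts`, `Ta`, `Rs`, `Ra` mirror the matrices above and are assumed to be NumPy arrays; since zero-mean noise only shifts the value by a constant, it is omitted):

```python
import numpy as np

def lqr_gains(Ts, Ta, Rs, Ra, horizon):
    # Backward Riccati recursion for the finite-horizon linear-quadratic problem.
    # Returns gain matrices K_h such that the optimal action with h steps
    # remaining is a = K_h @ s (so gains[0] is the zero gain for the last step).
    V = np.zeros_like(Rs)   # value with 0 steps remaining is identically zero
    gains = []
    for h in range(1, horizon + 1):
        K = -np.linalg.solve(Ta.T @ V @ Ta + Ra, Ta.T @ V @ Ts)
        V = Rs + Ts.T @ V @ (Ts + Ta @ K)   # Riccati update for V_h
        gains.append(K)
    return gains   # gains[h-1] is K_h
```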
Footnotes
1. Capital letters (e.g. $S_t$, $R_t$) stand for random variables, of which lowercase letters (e.g. $s_t$, $r_t$) are samples. ↩
2. Dynamic programming refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. The relationship between the larger problem and the sub-problems is described by the Bellman equation. ↩
3. A contraction mapping on a metric space $(M, d)$ is a function $f$ from $M$ to itself, with the property that there is some nonnegative real number $\alpha$ such that $\alpha < 1$ and for all $x$ and $y$ in $M$ it's the case that $d(f(x), f(y)) \le \alpha\, d(x, y)$. ↩
4. The infinity norm is defined as $\lVert \mathbf{x} \rVert_\infty = \max_i \lvert x_i \rvert$. ↩
5. In continuous time this is known as the Hamilton–Jacobi–Bellman equation. ↩
6. $\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^\top \mathbf{A} \mathbf{x} \right) = (\mathbf{A} + \mathbf{A}^\top)\,\mathbf{x}$ and $\frac{\partial}{\partial \mathbf{x}} \left( \mathbf{b}^\top \mathbf{x} \right) = \mathbf{b}$. ↩