sampledoc

Table Of Contents

Previous topic

Application: Shortest Paths and Dynamic Programming

Next topic

Namespaces

This Page

Application: Finite State Optimal Growth

This lecture:

  • Formulate a simple, finite state optimal growth problem
  • Solve by value iteration

The lecture is based on section 5.1 of EDTC

  • Please refer to that section for detailed explanations

  • You need some familiarity with dynamic programming to read this lecture

  • Previously we treated dynamic programming here
    • The current lecture contains a bit more maths

Outline of the Problem

We model the behavior of Colonel Kurtz, who can lives on a small island and fishes for catfish

Catfish bite at dawn, the colonel bags a random quantity \(W \in \{0,\ldots,B\}\) with distribution \(q\)

That is, \(\mathbb{P}\{W = b\} = q(b)\)

Catfish last only if refrigerated, and the Colonel’s freezer holds up to \(M\) fish

Let \(X_t\) be the state variable: stock of fish at noon on day t

In the evening, the Colonel consumes quantity \(C_t\), and freezes \(R_t = X_t - C_t\)

Following the next morning’s fishing trip, the state is

(1)\[X_{t+1} = R_t + W_{t+1}\]

Assume that Kurtz follows policy function \(g\): On observing state \(X\), he saves \(R = g(X)\)

The state space (i.e., the set of possible values for \(X\)) is given by \(S = \{0, \ldots, M+B\}\)

Formally, a feasible policy function \(g\) is a map from \(S\) into the action space \(\{0,\ldots,M\}\) that satisfies \(0 \leq g(x) \leq x\) for all \(x \in S\)

Denote the set of all feasible policies by \(G\). For given \(g \in G\), state evolves according to

(2)\[X_{t+1} = g(X_t) + W_{t+1}, \quad (W_t)_{t \geq 1} \sim q, \quad X_0 = x \in S\]

Letting \(U\) be the utility function, optimization problem of Colonel Kurtz is

(3)\[\max_{g \in G} \mathbb{E} \left[ \sum_{t=0}^{\infty} \rho^t U(X_t - g(X_t)) \right]\]

where the sequence \((X_t)\) is the one defined in (2) (and hence depends on \(g\))

Given that we are currently at state \(x\), the reward from following policy \(g \in G\) is

(4)\[v_g(x) := \mathbb{E} \left[ \sum_{t=0}^{\infty} \rho^t U(X_t - g(X_t)) \right]\]

The value function is defined as

(5)\[v^*(x) := \sup_{g \in G} v_g(x) \qquad (x \in S)\]

A policy \(g \in G\) is called optimal if

(6)\[v_{g}(x) = v^*(x) \text{ for all } x \in S\]

The Bellman Equation

Let \(\Gamma(x)\) denote the set of all feasible actions (number of fish that can be frozen) when the current state is \(x\):

(7)\[\Gamma(x) := \{0,1,\ldots, x \wedge M\} \qquad x \wedge M := \min\{x,M\}\]

It is known that the value function satisfied the Bellman equation, which is

(8)\[v^*(x) = \max_{a \in \Gamma(x)} \left\{ U(x - a) + \rho \sum_{z = 0}^B v^*(a + z) q(z) \right\} \qquad (x \in S)\]

The idea is as follows

Suppose we know the values of different states in terms of maximum future rewards

Then best action is found by trading off the two effects inherent in choosing an action:

  • Current reward, and

  • Future reward after transitioning to new state next period
    • New state depends on current action

Making this trade off optimally gives maximum value from the current state

Given a real-valued function \(w\) on \(S\), we call \(g \in G\) \(w\)-greedy if \(g(x)\) is a solution to

(9)\[\max_{a \in \Gamma(x)} \left\{ U(x - a) + \rho \sum_{z = 0}^B w(a + z) q(z) \right\}\]

for every \(x \in S\)

It is known that a policy is optimal if and only if it is \(v^*\)-greedy

Hence computing an optimal policy is trivial if we know the value function

How to solve for the value function?

Let \(bS\) be the real-valued functions on \(S\)

Define the Bellman operator \(T\) sending \(bS \to bS\) by

(10)\[Tv(x) = \max_{a \in \Gamma(x)} \left\{ U(x - a) + \rho \sum_{z = 0}^B v(a + z) q(z) \right\} \qquad (x \in S)\]

Facts (see the text):

  • \(T\) is a uniform contraction on \(bS\) with respect to the distance \(d_{\infty}(v, w) := \sup_{x \in S} |v(x) - w(x)|\)
  • \(v^*\) is the unique fixed point of \(T\) in \(bS\)

In consequence, iteration of \(T\) on any \(v \in bS\) converges to \(v^*\)

Value iteration algorithm

Iterate with \(T\) on some initial \(v \in bS\), producing \(T^n v\)

Continue iteration until

(11)\[d_{\infty}(T^n v, T^{n+1} v) \leq \text{ some small, positive tolerance }\]

Computer a \(w\)-greedy policy \(w\), where \(w\) is the final iterate

If the tolerance is small, then \(g\) is almost optimal (see EDTC)

Exercises

Solve for the optimal policy

  • Implement the Bellman operator \(T\) as a Python function
    • Elements of \(bS\) can be represented as lists or tuples (v[0] corresponds to \(v(0)\), etc.)
    • Takes such a sequence v as an argument and
    • returns another sequence representing \(Tv\)
  • Write a function greedy() to compute greedy policies

  • Implement the loop as a function main()
    • Use 0.001 for the tolerance

For the utility function set

(12)\[U(c) = c^{\beta}, \quad \beta, c \geq 0\]

For the parameters, use beta, rho, B, M = 0.5, 0.9, 10, 5

Let \(q\) be uniform on \(0, \ldots, B\)

Plot the sequence of functions created in the value iteration algorithm

_images/value_functions.png

Print the greedy policy (as a list) calcuated from the last iterate

The solution is

>>> g
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5]

Solution

import pylab

beta, rho, B, M = 0.5, 0.9, 10, 5
S = range(B + M + 1)  # State space
Z = range(B + 1)      # Shock space

def d_infty(v, w):
    return max(abs(v[x] - w[x]) for x in S)

def U(c):      
    "Utility function."
    return c**beta

def q(z):    
    "Probability mass function, uniform distribution."
    return 1.0 / len(Z) if 0 <= z <= B else 0

def Gamma(x):  
    "The correspondence of feasible actions."
    return range(min(x, M) + 1)

def T(v):      
    """An implementation of the Bellman operator.
    Parameters: v is a sequence representing a function defined on S.
    Returns: Tv, a list."""
    Tv = []        
    for x in S:
        vals = []  # Stores rewards for a in Gamma(x)
        for a in Gamma(x):
            y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
            vals.append(y)
        Tv.append(max(vals))
    return Tv 

def greedy(v):
    g = []
    for x in S:
        runningmax = -1
        for a in Gamma(x):
            y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
            if y > runningmax:
                runningmax = y
                maximizer = a
        g.append(maximizer)
    return g
 
def main():
    current = [U(x) for x in S]
    tolerance = 0.001
    while 1:
        pylab.plot(current, 'bo-')
        new = T(current)
        if d_infty(new, current) < tolerance:
            break
        current = new
    g = greedy(new)
    print 'The optimal policy is:'
    print g
    pylab.show()