This lecture:
The lecture is based on section 5.1 of EDTC
Please refer to that section for detailed explanations
You need some familiarity with dynamic programming to read this lecture
We model the behavior of Colonel Kurtz, who can lives on a small island and fishes for catfish
Catfish bite at dawn, the colonel bags a random quantity \(W \in \{0,\ldots,B\}\) with distribution \(q\)
That is, \(\mathbb{P}\{W = b\} = q(b)\)
Catfish last only if refrigerated, and the Colonel’s freezer holds up to \(M\) fish
Let \(X_t\) be the state variable: stock of fish at noon on day t
In the evening, the Colonel consumes quantity \(C_t\), and freezes \(R_t = X_t - C_t\)
Following the next morning’s fishing trip, the state is
Assume that Kurtz follows policy function \(g\): On observing state \(X\), he saves \(R = g(X)\)
The state space (i.e., the set of possible values for \(X\)) is given by \(S = \{0, \ldots, M+B\}\)
Formally, a feasible policy function \(g\) is a map from \(S\) into the action space \(\{0,\ldots,M\}\) that satisfies \(0 \leq g(x) \leq x\) for all \(x \in S\)
Denote the set of all feasible policies by \(G\). For given \(g \in G\), state evolves according to
Letting \(U\) be the utility function, optimization problem of Colonel Kurtz is
where the sequence \((X_t)\) is the one defined in (2) (and hence depends on \(g\))
Given that we are currently at state \(x\), the reward from following policy \(g \in G\) is
The value function is defined as
A policy \(g \in G\) is called optimal if
Let \(\Gamma(x)\) denote the set of all feasible actions (number of fish that can be frozen) when the current state is \(x\):
It is known that the value function satisfied the Bellman equation, which is
The idea is as follows
Suppose we know the values of different states in terms of maximum future rewards
Then best action is found by trading off the two effects inherent in choosing an action:
Current reward, and
Making this trade off optimally gives maximum value from the current state
Given a real-valued function \(w\) on \(S\), we call \(g \in G\) \(w\)-greedy if \(g(x)\) is a solution to
for every \(x \in S\)
It is known that a policy is optimal if and only if it is \(v^*\)-greedy
Hence computing an optimal policy is trivial if we know the value function
How to solve for the value function?
Let \(bS\) be the real-valued functions on \(S\)
Define the Bellman operator \(T\) sending \(bS \to bS\) by
Facts (see the text):
In consequence, iteration of \(T\) on any \(v \in bS\) converges to \(v^*\)
Iterate with \(T\) on some initial \(v \in bS\), producing \(T^n v\)
Continue iteration until
Computer a \(w\)-greedy policy \(w\), where \(w\) is the final iterate
If the tolerance is small, then \(g\) is almost optimal (see EDTC)
Solve for the optimal policy
Write a function greedy() to compute greedy policies
For the utility function set
For the parameters, use beta, rho, B, M = 0.5, 0.9, 10, 5
Let \(q\) be uniform on \(0, \ldots, B\)
Plot the sequence of functions created in the value iteration algorithm
Print the greedy policy (as a list) calcuated from the last iterate
The solution is
>>> g
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5]
import pylab
beta, rho, B, M = 0.5, 0.9, 10, 5
S = range(B + M + 1) # State space
Z = range(B + 1) # Shock space
def d_infty(v, w):
return max(abs(v[x] - w[x]) for x in S)
def U(c):
"Utility function."
return c**beta
def q(z):
"Probability mass function, uniform distribution."
return 1.0 / len(Z) if 0 <= z <= B else 0
def Gamma(x):
"The correspondence of feasible actions."
return range(min(x, M) + 1)
def T(v):
"""An implementation of the Bellman operator.
Parameters: v is a sequence representing a function defined on S.
Returns: Tv, a list."""
Tv = []
for x in S:
vals = [] # Stores rewards for a in Gamma(x)
for a in Gamma(x):
y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
vals.append(y)
Tv.append(max(vals))
return Tv
def greedy(v):
g = []
for x in S:
runningmax = -1
for a in Gamma(x):
y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
if y > runningmax:
runningmax = y
maximizer = a
g.append(maximizer)
return g
def main():
current = [U(x) for x in S]
tolerance = 0.001
while 1:
pylab.plot(current, 'bo-')
new = T(current)
if d_infty(new, current) < tolerance:
break
current = new
g = greedy(new)
print 'The optimal policy is:'
print g
pylab.show()