#
Dynamic Programming Pt.2
#
Lot Sizing / Generalized Inventory Problem
#
Scenario. Keeping inventory
Going back to the coffee shop example: a coffee shop needs to decide how many coffee beans to purchase and how frequently to purchase them. Now suppose we also own a roastery with a bunch of roasters, where we can roast our own coffee beans.
#
Variables
- t: Time period
- d_t: Demand at time t
- s_t: Inventory at the start of period t
- a_t: Production amount at period t
- b: Unit production cost
- h: Unit holding cost
- K: Fixed cost for starting production
- c(a_t): Production cost function
c(a_t) = \begin{cases} 0 & \text{if }a_t = 0\\ K + b\cdot a_t & \text{if }a_t > 0 \end{cases}
Note that a “period” is the interval between consecutive timestamps.
#
Stage, State, Value Function
- Stage: Time t = 1, 2, \dots, T
- State: Inventory s_t
- Value function: v_t(s_t), the minimum cost from time t to T with initial inventory s_t
- Action: a_t, the production amount
The optimality equation is:
v_t(s_t) =\min_{a_t}\{c(a_t) + h\cdot(s_t + a_t - d_t) + v_{t+1}(s_t + a_t - d_t)\}
with base cases:
v_{T+1}(s_{T+1}) = 0, \forall s_{T+1}
#
Bounds
Suppose we only have M machines; then a_t is bounded by the number of machines:
a_t\in\{0,1,\dots , M\}
The inventory capacity could be capped at D; this bounds the state:
s_t\in\{0,1,\dots, D\}
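A minimal sketch of this recursion in Python, with the bounds M and D entering as the action and state ranges. The demands and cost parameters below are made up purely for illustration.

```python
# A minimal sketch of the recursion v_t(s_t) above. Demands and cost
# parameters are made up purely for illustration; M and D enter as the
# action and state bounds.
from functools import lru_cache

d = [0, 3, 2, 4, 1]      # d[t] = demand in period t, for t = 1..T (index 0 unused)
T = len(d) - 1           # number of periods
K, b, h = 10, 2, 1       # fixed cost, unit production cost, unit holding cost
M, D = 5, 8              # at most M units produced per period, inventory capped at D

def c(a):                # production cost c(a_t)
    return 0 if a == 0 else K + b * a

@lru_cache(maxsize=None)
def v(t, s):
    """Minimum cost from period t to T starting with inventory s."""
    if t > T:                          # base case: v_{T+1}(s) = 0
        return 0
    best = float("inf")
    for a in range(M + 1):             # a_t in {0, 1, ..., M}
        s_next = s + a - d[t]          # inventory carried into period t + 1
        if 0 <= s_next <= D:           # demand must be met, capacity respected
            best = min(best, c(a) + h * s_next + v(t + 1, s_next))
    return best

print(v(1, 0))           # minimum total cost starting empty at t = 1
```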
#
Properties of the Inventory Problem
Assuming unit costs b, h are independent of time t, we have:
#
Prop. Don’t produce if we have inventory
a_t > 0\implies s_t = 0
If it is optimal to produce during time period t, then the inventory at the start of that period is 0.
#
Prop. If we produce, produce enough to cover a whole number of time periods
If it is optimal to produce in stage t (so a_t > 0), then it is optimal to produce an amount that exactly covers the demand for periods t, t+1, \cdots, t+j for some 0\leqslant j\leqslant T-t.
#
Lemma. Equivalent optimality equation
Using the previous two properties, when we do produce we only need to decide how many additional periods j to cover. If we produce at time t enough to cover demand through period t+j, we jump directly to time t+j+1.
v_t = \min_{0\leqslant\, j\leqslant\, T-t}\{c_{t,t+j+1} + v_{t+j+1}\}
where the transition cost c_{t, t+j+1} is:
c_{t,\,t+j+1} = K + h\sum^j_{i=1}\underbrace{i\cdot d_{t+i}}_{\substack{d_{t+i}\text{ is held in}\\\text{inventory for }i\text{ periods}}}
(The variable cost b\sum_{i=0}^{j}d_{t+i} is omitted: all demand must be produced eventually, so that term is the same under every policy.)
with base case v_{T+1} = 0 and initial inventory s_1 = 0.
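A sketch of the period-covering recursion, again with made-up data. As noted above, the constant b\cdot\sum d_t term is dropped.

```python
# A sketch of the period-covering recursion, with the same made-up data.
# The constant b * (total demand) is dropped, as noted above.
d = [0, 3, 2, 4, 1]                 # d[t] = demand in period t, t = 1..T
T = len(d) - 1
K, h = 10, 1                        # fixed cost, unit holding cost

v = [0.0] * (T + 2)                 # v[T+1] = 0 is the base case
for t in range(T, 0, -1):
    best = float("inf")
    for j in range(T - t + 1):      # produce at t to cover periods t, ..., t+j
        holding = h * sum(i * d[t + i] for i in range(1, j + 1))
        best = min(best, K + holding + v[t + j + 1])
    v[t] = best

print(v[1])                         # minimum fixed-plus-holding cost from period 1
```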
#
Stochastic Dynamic Programming
#
Markov Chain Review
#
Def. Discrete Time Markov Chain
A discrete-time, time-homogeneous Markov chain on state space S with transition matrix P and initial distribution \alpha is a sequence of random states X_n\in S such that:
\Bbb P(X_0 = i) = \alpha_i
Prop. Markov Property
\Bbb P(X_{t}= j\mid X_{t-1} = i) = \Bbb P(X_{t} = j\mid X_{t-1} = i, X_{t-2} = i_{t-2},\dots)
The entry p_{ij} of P represents:
p_{ij} = \Bbb P(X_{t+1} = j\mid X_t = i)
#
Def. Communicate
Two states i, j\in S communicate if each is accessible from the other.
i\lrarr j \coloneqq i\to j\text{ and }j \to i
#
Def. Closed Subset
A subset T\sub S of the state space is closed if, once any state in T is entered, the chain cannot leave T. In terms of transition probabilities:
P_{ij} =0\hskip1em\forall i\in T, j\notin T
The entire state space is always closed.
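A small sketch checking the closedness condition P_{ij} = 0 for i\in T, j\notin T, on a made-up 3-state chain.

```python
# A sketch checking the closedness condition: P_ij = 0 for all i in T, j not in T.
# The 3-state chain below is made up; T = {0, 1} is closed, {2} is not.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.8, 0.0],
              [0.1, 0.1, 0.8]])     # state 2 can leak into {0, 1}, not vice versa

def is_closed(P, T):
    """T is closed iff there is zero probability of leaving T in one step."""
    outside = [j for j in range(P.shape[0]) if j not in T]
    return bool(np.all(P[np.ix_(sorted(T), outside)] == 0))

print(is_closed(P, {0, 1}))         # True
print(is_closed(P, {2}))            # False: the chain can move from 2 into {0, 1}
```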
#
Prop. n-step transition probability
The n-step transition probability from i to j is the (i, j)-th entry of P^n.
\begin{aligned} \Bbb P(X_n = j\mid X_0 = i) &= \Bbb P(X_{m+n}= j\mid X_m = i) \\&= P^n_{ij} \end{aligned}
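As a quick check, a minimal numpy sketch on a made-up 2-state chain: the n-step probabilities are just the entries of the matrix power P^n.

```python
# Sketch: the n-step transition probabilities are the entries of P^n
# (the 2-state chain below is made up for illustration).
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

n = 5
Pn = np.linalg.matrix_power(P, n)   # P^n
print(Pn[0, 1])                     # P(X_n = 1 | X_0 = 0)
```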
#
Markov decision process with finite time
For this class we assume a finite horizon: t = 0, 1, 2, \dots, N.
#
Variables
- s_t\in S: State at time t
- a_t\in A: Action at time t
- r_t(s_t, a_t): Reward function for the state-action pair at time t
- \Bbb P(s_{t+1}\mid s_t, a_t): Transition probability of actually going to state s_{t+1} given we took action a_t at state s_t
- R(s_N) or r_N(s_N): Reward if we end in state s_N
#
Optimality Equation
Starting with the deterministic case, we want to find which action at time t maximizes / minimizes the total reward.
v_t(s_t) = \underset{a_t\in A}{\max/\min}\,\{r_t(s_t, a_t) + v_{t+1}(s_{t+1})\}
For the stochastic case, replace v_{t+1} with an expected value, since we might not actually reach the intended state s_{t+1}; the next state becomes a random variable:
v_t(s_t) = \underset{a_t\in A}{\max/\min}\,\Big\{r_t(s_t, a_t) + \Bbb E[v_{t+1}(s_{t+1})\mid s_t, a_t]\Big\}
Expand with the definition of \Bbb E:
v_t(s_t) = \underset{a_t\in A}{\max/\min}\,\left\{r_t(s_t, a_t) + \sum_{s_{t+1}\in S}{\Bbb P}(s_{t+1}\mid s_t, a_t)\cdot v_{t+1}(s_{t+1})\right\}
with base cases v_N(s_N) = R(s_N) for some final reward function R and each possible end state s_N\in S.
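A minimal backward-induction sketch of this optimality equation (maximization case). The toy MDP below, with randomly generated rewards, transition probabilities, and terminal reward, is made up purely for illustration.

```python
# A minimal backward-induction sketch of the stochastic optimality equation
# (maximization case). The toy MDP below, with random rewards, transition
# probabilities, and terminal reward, is made up purely for illustration.
import numpy as np

N = 4                                  # decision epochs t = 0, ..., N
S, A = 3, 2                            # 3 states, 2 actions
rng = np.random.default_rng(0)

r = rng.uniform(0, 1, size=(N, S, A))          # r[t, s, a] = r_t(s_t, a_t)
P = rng.uniform(0, 1, size=(S, A, S))          # P[s, a, s'] = P(s_{t+1} = s' | s, a)
P /= P.sum(axis=2, keepdims=True)              # normalize so each row sums to 1
R_final = rng.uniform(0, 1, size=S)            # terminal reward R(s_N)

v = np.zeros((N + 1, S))
v[N] = R_final                                 # base case v_N(s_N) = R(s_N)
for t in range(N - 1, -1, -1):
    # Q[s, a] = r_t(s, a) + sum_{s'} P(s' | s, a) * v_{t+1}(s')
    Q = r[t] + P @ v[t + 1]
    v[t] = Q.max(axis=1)

print(v[0])                                    # optimal expected reward from each start state
```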
#
Decision Rule \small d_t
At each state s_t, we need a rule that picks an action a_t. There are two types of decision rules.
Markovian. d_t(s_t) returns an action by looking only at the current state.
History dependent. d_t(s_1, a_1, \dots, s_{t-1}, a_{t-1}, s_t) returns an action by looking at all past states and actions.
#
Def. Policy \pi, sample path \omega, total reward W
A policy is the tuple of decision rules, one per stage:
\pi = (d_1, d_2, \dots, d_{N-1})
Running a policy \pi generates a (random) sample path \omega,
\omega= (s_1, a_1, s_2, a_2,\dots,s_{N-1}, a_{N-1}, s_N)
The probability of the path happening is:
\Bbb P(\text{path} = \omega) \\= \Bbb P(X_1 = s_1)\cdot\Bbb P(X_2 = s_2\mid s_1, a_1)\cdots\Bbb P(X_N = s_N\mid s_{N-1}, a_{N-1})
A sample path also has a total reward W,
W(\omega) = R (s_N)+\sum_{i=1}^{N-1}r_i(s_i, a_i)
Each s_i in the sum is a random variable, so the expected total reward is:
\Bbb E_{W,\pi}[W(\omega)] = \sum_{\text{all possible }\omega} \Bbb P(\text{path}=\omega) \cdot W(\omega)
#
Example. Stochastic Shortest Paths
```mermaid
graph LR
    S --1--- A
    A --50--- C
    S --0--- B
    A --0--- D
    C --1--- F
    C --0--- G
    D --120--- G
    D --50--- H
    B --30--- D
    B --10--- E
    E --2--- H
    E --2--- I
```
Let each layer of the graph be the set of possible states at t=0, t=1, t=2, t=3.
At each stage, we can choose to go up or down. Since it’s stochastic, there’s a transition probability. Define:
\begin{aligned} \Bbb P(\text{actually go up}\mid \text{choose to go up}) &= p_u\\ \Bbb P(\text{actually go down}\mid \text{choose to go down}) &= p_d\\ \Bbb P(\text{actually go down}\mid \text{choose to go up}) &= 1-p_u\\ \Bbb P(\text{actually go up}\mid \text{choose to go down}) &= 1-p_d\\ \end{aligned}
For the path \omega = \text{SACF}, the probability is \Bbb P(\text{path} = \omega) = p_u^3 because we need to actually go up 3 times.
There are 2^3 possible sample paths (2 choices at each stage, 3 stages in total). Each path \omega_i has a reward W(\omega_i) and a probability \Bbb P(\omega_i).
So the expected reward is:
\Bbb E_{W\!,\pi}[W(\omega)] = \sum_{i=1}^{2^3}W(\omega_i)\cdot \Bbb P(\text{path} = \omega_i)
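A sketch of this expectation for the graph above, under the fixed policy "always choose up". The edge weights and up/down successors are transcribed from the diagram, and p_u = 0.8 is an assumed value.

```python
# A sketch of E[W] for the layered graph above, under the fixed policy
# "always choose up". Edge weights and up/down successors are transcribed
# from the diagram; p_u = 0.8 is an assumed value. itertools.product
# enumerates the 2^3 realized sample paths.
from itertools import product

succ = {                     # succ[node] = (up neighbor, down neighbor)
    "S": ("A", "B"), "A": ("C", "D"), "B": ("D", "E"),
    "C": ("F", "G"), "D": ("G", "H"), "E": ("H", "I"),
}
weight = {("S", "A"): 1, ("S", "B"): 0, ("A", "C"): 50, ("A", "D"): 0,
          ("B", "D"): 30, ("B", "E"): 10, ("C", "F"): 1, ("C", "G"): 0,
          ("D", "G"): 120, ("D", "H"): 50, ("E", "H"): 2, ("E", "I"): 2}

p_u = 0.8                    # P(actually go up | choose to go up)

expected_W = 0.0
for moves in product(("up", "down"), repeat=3):       # the 2^3 sample paths
    node, W, prob = "S", 0.0, 1.0
    for move in moves:
        nxt = succ[node][0] if move == "up" else succ[node][1]
        W += weight[(node, nxt)]                      # accumulate W(omega)
        prob *= p_u if move == "up" else 1 - p_u      # we always *chose* up
        node = nxt
    expected_W += prob * W                            # P(path = omega) * W(omega)

print(expected_W)            # E[W] under the "always up" policy
```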
#
Policy Backward Evaluation
For a policy \pi and time t, define the value function:
v_{\pi, t}(h_t) = r_t(s_t, d_t(h_t)) + \sum_{j\in S}\Bbb P(j\mid s_t, d_t(h_t))\cdot v_{\pi, t+1}(h_t, d_t(h_t), j)
- r_t: reward function
- s_t: current state (the most recent state in the history h_t)
- d_t(h_t): returns an action depending on the history h_t = (s_1, a_1, \dots, a_{t-1}, s_t)
Since the states and actions along the way are random, introduce random variables X_n for the state and Y_n for the action d_n(h_n). Then:
v_{\pi, t}(h_t) = \Bbb E\left[\sum^{N-1}_{n=t}r_n(X_n, Y_n) + r_N(X_N)\,\middle|\, h_t\right]
To find the best policy \pi^*, we maximize over all possible policies:
\begin{aligned} v_t(h_t) &= \max_{\pi} v_{\pi, t}(h_t) \\ &= \max_{a_t}\Big\{r_t(s_t, a_t) + \sum_{j\in S}\Bbb P(j\mid s_t, a_t) \cdot v_{t+1} (h_t, a_t, j)\Big\}\\ v_N(h_N)&= r_N(s_N) \end{aligned}
The optimal policy collects, at each stage t, a decision rule that picks a maximizing action:
d_t^*(h_t) \in \argmax_{a_t}\Big\{r_t(s_t, a_t) + \sum_{j\in S}\Bbb P(j\mid s_t, a_t) \cdot v_{t+1} (h_t, a_t, j)\Big\}
Finally, the optimal policy is \pi^* = (d_1^*, d_2^*,\dots, d_{N-1}^*).
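A sketch putting the pieces together on a toy MDP with made-up data: backward evaluation of an arbitrary fixed Markov policy, and the optimal decision rules d_t^* extracted by argmax, which attain the optimal values.

```python
# A sketch contrasting backward evaluation of a fixed Markov policy with the
# optimal decision rules d_t* extracted by argmax. The toy MDP (rewards,
# transitions, terminal reward) is made up for illustration.
import numpy as np

N, S, A = 4, 3, 2
rng = np.random.default_rng(1)
r = rng.uniform(0, 1, size=(N, S, A))            # r_t(s, a)
P = rng.uniform(0, 1, size=(S, A, S))
P /= P.sum(axis=2, keepdims=True)                # P(s' | s, a)
R_final = rng.uniform(0, 1, size=S)              # r_N(s_N)

def evaluate(d):
    """Backward policy evaluation: v_{pi,t}(s) for Markov decision rules d[t][s]."""
    v = np.zeros((N + 1, S))
    v[N] = R_final
    for t in range(N - 1, -1, -1):
        for s in range(S):
            a = d[t, s]
            v[t, s] = r[t, s, a] + P[s, a] @ v[t + 1]
    return v

# Optimal values and decision rules d_t* via the optimality equation.
v_star = np.zeros((N + 1, S))
d_star = np.zeros((N, S), dtype=int)
v_star[N] = R_final
for t in range(N - 1, -1, -1):
    Q = r[t] + P @ v_star[t + 1]                 # Q[s, a]
    v_star[t] = Q.max(axis=1)
    d_star[t] = Q.argmax(axis=1)

d_fixed = rng.integers(0, A, size=(N, S))        # some arbitrary Markov policy
print(evaluate(d_fixed)[0])                      # v_{pi,0} for the arbitrary policy
print(evaluate(d_star)[0], v_star[0])            # the optimal policy attains v_0
```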
#
Observation. Randomizing choices doesn't yield a better optimum
Let \omega: W\to \R be a function on a discrete set W, and let \Bbb P be any probability distribution over W.
Then the maximum value bounds the value of any randomized choice:
\max_{s\in W}\omega(s) \geqslant \sum_{s\in W}\Bbb P(s)\omega(s) = \Bbb E[\omega]
So randomizing over actions can never beat simply taking the best one, which is why deterministic decision rules suffice.
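A tiny numeric check of this observation, with made-up values and probabilities:

```python
# The best deterministic choice is at least as good as any randomized one
# (values and probabilities below are made up).
values = [3.0, 7.0, 5.0]          # omega(s) for s in W
probs  = [0.2, 0.3, 0.5]          # some distribution P over W

randomized = sum(p * v for p, v in zip(probs, values))   # E[omega]
print(max(values), ">=", randomized)                     # 7.0 >= 5.2
```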
#
Observation. Existence of optimal Markov decision rule
If the history-dependent value function depends only on the current state, then there exists an optimal Markov decision rule.