#
Introduction
#
General DP Formulation
Typically a DP problem involves searching for a sequence of decisions that arrives at a maximum or minimum value. Let's define:
- t = 1...N
- A "stage" of the problem. When t=N, we have reached the end of the problem. This is usually the variable representing time / planning horizon, but not necessarily
- s_t\in S
- State at stage t; S is the space of all possible states
- a_t\in A
- Action at stage t; A is the space of all possible actions
- r_t(s_t, a_t)
- Reward function if we take action a_t at state s_t
- R(s_N)
- Reward if we end on state s_N. This is the cost/reward of our base case, a subproblem that we can immediately solve without recursion. If we reach s_N, we have no possible actions.
Now the optimality equation / value function / objective recurrence is:
v_t(s_t) = \left.\underset{a_t\in A}{\max/\min}\,\middle\{v_{t+1}(s_{t+1}) + r_t(s_t, a_t) \right\}
#
Examples
#
Rod Cutting
For a rod of length N,
\text{CutRod}(n) = \begin{cases} 0 & \text{if } n = 0\\ p_1 & \text{if }n = 1\\ \displaystyle\max_{1\leqslant i\leqslant n}\{ \text{CutRod}(n-i) + p_i \} & \text{otherwise} \end{cases}
In this case, the remaining length of the rod is the state s_t, and the number of units of rod to cut is the action a_t. We immediately encounter the first exception of the definition of stage t. The only thing that's constant is the initial length of the rod, or the maximum length we can cut in 1 go. Therefore the stage is technically time, but it ranges from 1...N since the max number of cuts we can make is just the total length N (just cut 1 unit at a time).
The rest of the variables are straightforward to find: the revenue of a cut is the reward function; final reward is either 0 or p_1, and s_{n+1} is the new length after the cut.
Let's list each component:
- t = 1...N
- Stage is the "time" elapsed or the total number of cuts we have made
- s_t\in S
- State is the length of the remaining rod
- a_t\in A
- Action the the possible cuts we can make. Ranges from 1 to s_t
- r_t(s_t, a_t)
- Reward is the revenue of the cut
- R(s_N)
- Base cases, s_N = 0 or s_N = 1 which means R(0) = 0, R(1) = p_1
optimality equation is:
\max_{1\leqslant i\leqslant n}\{\text{CutRod}(n-i) + p_i\}
#
Edit Distance
For strings A[1...n], B[1...m]
{\text{EditDist}(i, j)} = \begin{cases} i & \text{if } j = 0\\j & \text{if } i = 0\\ \min \left.\begin{cases} \text{EditDist}(i, j-1) + 1\\ \text{EditDist}(i-1, j) + 1\\ \text{EditDist}(i-1, j-1) + \begin{cases}1 & \text{if }A[i]\ne B[j]\\ 0 & \text{if }A[i]= B[j] \end{cases}\\ \end{cases}\right\} & \text{Otherwise} \end{cases}