Levenshtein Edit Distance
Question. Given 2 strings
A[1...n]andB[1...m], what is minimum number of edits to A required to make 2 strings identical? An edit could be 1 of the following:
- Insert a character
- Delete a character
- Replace a character
For example, when A = "FOOD", B = "MONEY", the edit distance is 4.
FOOD -> MOOD -> MOND -> MONED -> MONEY
when A = "HORSE", B = "ROS", the edit distance is 3.
HORSE -> RORSE -> ROSE -> ROS
1. Observing the Problem
From the example we can see that there are 3 operations at any given index i:
Insert(i)Cost = 1Delete(i)Cost = 1Replace(i, newChar)Cost = 1 ifA[i] != B[i], else Cost = 0
2. Solve the backtracking problem first
2.1 Decide the Function Signature
Since we eventually need to convert this to DP, it’s easier to use string indices to keep track of which subproblem we are solving. Let’s use i for string A[1...n] and j for string B[1...n].
\text{EditDist}(i, j)Our final result is:
\text{EditDist}(n, m)2.2 Base Cases
If any of the string is empty, then the cost would be the length of that non-empty string. This will happen when we call:
{\text{EditDist}}(\red0, j)\\
\text{ or }\\
\text{EditDist}(i, \red0)Implementation Detail
Here we used 0 to indicate that an index is going out of bounds. In actual code 0 is a valid index; use -1 with an if statement to catch it.
2.3 Recursive Cases
If both strings are not empty, we will try to insert, delete, or replace. These are the 3 possible choices. To make a choice:
Let’s consider what it really means to “insert”.
@EXA=\texttt{"FOOD"}, B=\texttt{"MONEY"}\begin{matrix} &&&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{D}&\texttt{\red?}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{E} &\tt{\red Y} \end{matrix}If we insert a \tt Y at the questions mark, we have an aligned column. Then we can move to the "next" by considering the column to the left.
Maybe we can do this by
i -= 1, j -= 1, now we are here:\begin{matrix} &&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{\green Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{\green Y} \end{matrix}But notice that the question mark technically doesn’t exist because A is shorter.
\begin{matrix} &&&&\overset{i}{\Downarrow}\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{E} &\tt{\red Y} \\ &&&&&\underset{j}{\Uparrow}\\ \end{matrix}So i is already at the position after insertion. All we need to do is move j to j-1.
\begin{matrix} &&&&\overset{i}{\Downarrow}\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{\green Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{\green Y} \\ &&&&\underset{j}{\Uparrow}\\ \end{matrix}Therefore the recursive call for insert is:
\text{EditDist}(i, j-1) + 1+1 for the cost of insertion
Delete
Delete is more straightforward. We simply don’t consider the i–th character in A.
We can do this by moving i to i - 1.
Therefore the recursive call for delete is:
\text{EditDist}(i-1, j) + 1+1 for the cost of deletion
Replace
To replace a character, we move both i and j.
@EXA=\texttt{"FOOD"}, B=\texttt{"MONEY"}\begin{array}{c:} \text{\blue {Index}}&\blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{Y} \end{array}\hspace{1em}\xrightarrow{\text{replace(cost 1) at index 4}}\hspace{1em}\begin{matrix} \blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&\Downarrow\\ \tt{F} & \tt{O}&\tt{\red O}&\tt{\green E}&\tt{Y}\\ \tt{M} & \tt{O}&\tt{\red N}&\tt{\green E} &\tt{Y} \end{matrix}\begin{array}{c:} \text{\blue {Index}}&\blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&\Downarrow\\ A:&\tt{F} & \tt{\purple O}&\tt{N}&\tt{E}&\tt{Y}\\ B:&\tt{M} & \tt{\purple O}&\tt{N}&\tt{E} &\tt{Y} \end{array}\hspace{1em}\xrightarrow{\text{replace(free) at index 2}}\hspace{1em}\begin{matrix} \blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\\Downarrow\\ \tt{\red F} & \tt{\green O}&\tt{ N}&\tt{E}&\tt{Y}\\ \tt{\red M} & \tt{\green O}&\tt{N}&\tt{E} &\tt{Y} \end{matrix}To find the cost, we check if the characters A[i] and B[j] are the same.
- If A[i] = B[j] then the replacement is free
- If A[i] \ne B[j] then the cost is 1
Therefore the recursive call for replace is:
\text{EditDist}(i-1, j-1) + \begin{cases}1 & \text{if }A[i]\ne B[j]\\ 0 & \text{if }A[i]= B[j] \end{cases}
Since we want the shortest distance, this falls under the best solution case and we will need a min(…) call.
2.4 Forming the recurrence
Now we combine 2.2.2 and 2.2.3,
{\text{EditDist}(i, j)} = \begin{cases}
i & \text{if } j = 0\\j & \text{if } i = 0\\
\min \left.\begin{cases}
\text{Insert}\\
\text{Delete}\\
\text{Replace}\\
\end{cases}\right\} & \text{Otherwise}
\end{cases}Convert to expressions:
{\text{EditDist}(i, j)} = \begin{cases}
i & \text{if } j = 0\\j & \text{if } i = 0\\
\min \left.\begin{cases}
\text{EditDist}(i, j-1) + 1\\
\text{EditDist}(i-1, j) + 1\\
\text{EditDist}(i-1, j-1) + \begin{cases}1 & \text{if }A[i]\ne B[j]\\
0 & \text{if }A[i]= B[j]
\end{cases}\\
\end{cases}\right\} & \text{Otherwise}
\end{cases}2.5 Python: Backtracking Edit Distance
Implements the above recurrence. Github
Sanity Check
- We still make only 1 choice at each recursion level, the choices are:
- Insert
- Delete
- Replace
- All choices are valid when the 2 strings are not empty, so we try all 3 of them and see which one is the best.
- Cost is always 1 for both insert and delete, 0 for replace if the characters are the same.
3. Convert Backtracking to Dynamic Programming
3.1 Data Structure
The EditDistance(i, j) call requires 2 arguments, so we can use a 2D array.
EditDist: List<List<int>>
We can access the result of a previous function call by EditDist[i, j]
The final result will be in EditDist[n, m]
3.2 Order of Evaluation
\text{EditDist}(i, j) needs:
- \text{EditDist}(i, j - 1)
- \text{EditDist}(i-1, j)
- \text{EditDist}(i-1, j-1)
So both the loop for i and j need to go in ascending order.
Similar to LCS, if we scan the 2 strings backwards, then the order of evaluation will also be flipped. In that case i and j goes in descending order. The final call will be in EditDist[1, 1]
4.2 Python: DP Edit Distance
We can also save the recursion results by using the library @cache decorator. See LRU Cache.
This file has DP and @cache brute force.