#
Levenshtein Edit Distance
Question. Given 2 strings
A[1...n]
andB[1...m]
, what is minimum number of edits to A required to make 2 strings identical? An edit could be 1 of the following:
- Insert a character
- Delete a character
- Replace a character
For example, when A = "FOOD", B = "MONEY"
, the edit distance is 4.
FOOD -> MOOD -> MOND -> MONED -> MONEY
when A = "HORSE", B = "ROS"
, the edit distance is 3.
HORSE -> RORSE -> ROSE -> ROS
#
1. Observing the Problem
From the example we can see that there are 3 operations at any given index i
:
Insert(i)
Cost = 1Delete(i)
Cost = 1Replace(i, newChar)
Cost = 1 ifA[i] != B[i]
, else Cost = 0
#
2. Solve the backtracking problem first
#
2.1 Decide the Function Signature
Since we eventually need to convert this to DP, it’s easier to use string indices to keep track of which subproblem we are solving. Let’s use i for string A[1...n]
and j for string B[1...n]
.
\text{EditDist}(i, j)
Our final result is:
\text{EditDist}(n, m)
#
2.2 Base Cases
If any of the string is empty, then the cost would be the length of that non-empty string. This will happen when we call:
{\text{EditDist}}(\red0, j)\\ \text{ or }\\ \text{EditDist}(i, \red0)
Implementation Detail
Here we used 0 to indicate that an index is going out of bounds. In actual code 0 is a valid index; use -1 with an if statement to catch it.
#
2.3 Recursive Cases
If both strings are not empty, we will try to insert, delete, or replace. These are the 3 possible choices. To make a choice:
Let’s consider what it really means to “insert”.
@EX
A=\texttt{"FOOD"}, B=\texttt{"MONEY"}\begin{matrix} &&&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{D}&\texttt{\red?}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{E} &\tt{\red Y} \end{matrix}
If we insert a \tt Y at the questions mark, we have an aligned column. Then we can move to the "next" by considering the column to the left.
Maybe we can do this by
i -= 1, j -= 1
, now we are here:\begin{matrix} &&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{\green Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{\green Y} \end{matrix}
But notice that the question mark technically doesn’t exist because A is shorter.
\begin{matrix} &&&&\overset{i}{\Downarrow}\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{E} &\tt{\red Y} \\ &&&&&\underset{j}{\Uparrow}\\ \end{matrix}
So i is already at the position after insertion. All we need to do is move j to j-1.
\begin{matrix} &&&&\overset{i}{\Downarrow}\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{\green Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{\green Y} \\ &&&&\underset{j}{\Uparrow}\\ \end{matrix}
Therefore the recursive call for insert is:
\text{EditDist}(i, j-1) + 1
+1 for the cost of insertion
Delete
Delete is more straightforward. We simply don’t consider the i–th character in A.
We can do this by moving i to i - 1.
Therefore the recursive call for delete is:
\text{EditDist}(i-1, j) + 1
+1 for the cost of deletion
Replace
To replace a character, we move both i and j.
@EX
A=\texttt{"FOOD"}, B=\texttt{"MONEY"}\begin{array}{c:} \text{\blue {Index}}&\blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&&&\Downarrow\\ A:&\tt{F} & \tt{O}&\tt{O}&\tt{\red D}&\tt{Y}\\ B:&\tt{M} & \tt{O}&\tt{N}&\tt{\red E} &\tt{Y} \end{array}\hspace{1em}\xrightarrow{\text{replace(cost 1) at index 4}}\hspace{1em}\begin{matrix} \blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&\Downarrow\\ \tt{F} & \tt{O}&\tt{\red O}&\tt{\green E}&\tt{Y}\\ \tt{M} & \tt{O}&\tt{\red N}&\tt{\green E} &\tt{Y} \end{matrix}
\begin{array}{c:} \text{\blue {Index}}&\blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\ &&\Downarrow\\ A:&\tt{F} & \tt{\purple O}&\tt{N}&\tt{E}&\tt{Y}\\ B:&\tt{M} & \tt{\purple O}&\tt{N}&\tt{E} &\tt{Y} \end{array}\hspace{1em}\xrightarrow{\text{replace(free) at index 2}}\hspace{1em}\begin{matrix} \blue1 & \blue2 & \blue 3 & \blue4 & \blue 5 \\\Downarrow\\ \tt{\red F} & \tt{\green O}&\tt{ N}&\tt{E}&\tt{Y}\\ \tt{\red M} & \tt{\green O}&\tt{N}&\tt{E} &\tt{Y} \end{matrix}
To find the cost, we check if the characters A[i] and B[j] are the same.
- If A[i] = B[j] then the replacement is free
- If A[i] \ne B[j] then the cost is 1
Therefore the recursive call for replace is:
\text{EditDist}(i-1, j-1) + \begin{cases}1 & \text{if }A[i]\ne B[j]\\ 0 & \text{if }A[i]= B[j] \end{cases}
Since we want the shortest distance, this falls under the best solution case and we will need a min(…)
call.
#
2.4 Forming the recurrence
Now we combine 2.2.2 and 2.2.3,
{\text{EditDist}(i, j)} = \begin{cases} i & \text{if } j = 0\\j & \text{if } i = 0\\ \min \left.\begin{cases} \text{Insert}\\ \text{Delete}\\ \text{Replace}\\ \end{cases}\right\} & \text{Otherwise} \end{cases}
Convert to expressions:
{\text{EditDist}(i, j)} = \begin{cases} i & \text{if } j = 0\\j & \text{if } i = 0\\ \min \left.\begin{cases} \text{EditDist}(i, j-1) + 1\\ \text{EditDist}(i-1, j) + 1\\ \text{EditDist}(i-1, j-1) + \begin{cases}1 & \text{if }A[i]\ne B[j]\\ 0 & \text{if }A[i]= B[j] \end{cases}\\ \end{cases}\right\} & \text{Otherwise} \end{cases}
#
2.5 Python: Backtracking Edit Distance
Implements the above recurrence. Github
Sanity Check
- We still make only 1 choice at each recursion level, the choices are:
- Insert
- Delete
- Replace
- All choices are valid when the 2 strings are not empty, so we try all 3 of them and see which one is the best.
- Cost is always 1 for both insert and delete, 0 for replace if the characters are the same.
#
3. Convert Backtracking to Dynamic Programming
#
3.1 Data Structure
The EditDistance(i, j)
call requires 2 arguments, so we can use a 2D array.
EditDist: List<List<int>>
We can access the result of a previous function call by EditDist[i, j]
The final result will be in EditDist[n, m]
#
3.2 Order of Evaluation
\text{EditDist}(i, j) needs:
- \text{EditDist}(i, j - 1)
- \text{EditDist}(i-1, j)
- \text{EditDist}(i-1, j-1)
So both the loop for i and j need to go in ascending order.
Similar to LCS, if we scan the 2 strings backwards, then the order of evaluation will also be flipped. In that case i and j goes in descending order. The final call will be in EditDist[1, 1]
#
4.2 Python: DP Edit Distance
We can also save the recursion results by using the library @cache
decorator. See LRU Cache.
This file has DP and @cache
brute force.