Introduction
If you are new to string search, I would recommend to first read the brute force approach here. Brute force as discussed in the mentioned post has time complexity of O(mn) in worst case. Rabin-Karp also has the worst case time complexity of O(mn), but it has a much better time complexity of O(m + n) in best case.
The Rabin–Karp algorithm not as good as Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms for single pattern searching because of its slow worst case time complexity. However, it does better for multiple pattern search. An application of this algorithm is in detection of plagiarism.
Problem Statement
We will assume text (where we have to search) is an array of n characters. Let’s denote the text by T[1..n]. String to be searched is called pattern and is represented by an array P[1..m] where m ≤ n.
We need to find indexes in given text T where the given pattern P occurs. We need to improve from the brute force algorithm having time complexity of O(m*n).
Refer here for detailed problem statement.
Explanation of Rabin-Karp Algorithm
Let’s assume text as abcabac and pattern as abac.
Step 1:
Generate hash function of the pattern. We will define hash function as follow for a character array of pattern P[] and any prime number, Prime:
Hash = ascii of (P[0]) * Prime^0 + ascii of (P[1]) * Prime^1 + ascii of (P[2]) * Prime^2 + ............
We will use prime as 7.
Hash of Pattern = 97 * 7^0 + 98 * 7^1 + 97 * 7^2 + 99 * 7^3 = 39493
Ascii of ‘a’ = 97. Ascii of ‘b’ = 98. Ascii of ‘c’ = 99.
Why do we have prime number to generate hash value?
Step 2:
We will do rolling hash of the text. We will use the same hash function.
Let’s illustrate this with an example:
Our text is abcabac. We will calculate hash of abca which is the first four character of the text.
Hash of abca = 97 * 7^0 + 98 * 7^1 + 99 * 7^2 + 97 * 7^3 = 38905
We will now see if the hash of abca is equal to abac. Since they aren’t we now need to calculate hash of bcab. bcab is the substring of text starting from second character. We will not use the hash formulae, instead we will use the hash of abca and do some arithmetic operation to get result.
Loosely, we can come up with the following formulae:
Hash of bcab = (Hash of abca – T[0]) / prime+ T[4] * Prime^3
Where T[] is the character array of text.
Hash of bcab = (38905 – 97 * 7^0 ) / 7 + 98 * 7^3 = 5544 + 33614 = 39158
We will again match hash of bcab with the hash of pattern. And we will take the next substring of text and repeat the process.
Step 3:
When we reach to the last substring abac of the substring, we will find the hash value of the pattern as well as substring equal. One important rule is equal hash value does not necessarily mean matched strings. After the hash value is matched, we will compare the substring with the pattern. Only when that is matched, we say that the substring is found.
Implementation of Rabin Karp
public class RabinKarp { | |
public static final int NOT_FOUND_INDEX = –1; | |
public static final int PRIME = 7; | |
public static void main(String args[]){ | |
System.out.println("**************** Rabin Karp *****************"); | |
String text = "abcabac"; | |
String pattern = "abac"; | |
long initialTime = System.nanoTime(); | |
int index = findPatternInText(pattern, text); | |
long finalTime = System.nanoTime(); | |
if(index == NOT_FOUND_INDEX) | |
System.out.println("Pattern is not find in the text"); | |
else | |
System.out.println("Successful! Pattern found at index "+index); | |
System.out.println("Time taken for matching "+(finalTime – initialTime)); | |
} | |
private static int findPatternInText(String pattern, String text) { | |
Integer patternHash = getHash(pattern); | |
char[] textArray = text.toCharArray(); | |
int textLength = textArray.length; | |
int patternLength = pattern.length(); | |
char[] patternArray = pattern.toCharArray(); | |
int matchedIndex = NOT_FOUND_INDEX; | |
Integer textHash = 0; | |
for(int i = 0; i < textLength; i++){ | |
if(i < patternLength){ | |
textHash = textHash + textArray[i] * Double.valueOf(Math.pow(PRIME, i)).intValue(); | |
} else { | |
textHash = textHash – textArray[i – patternLength]; | |
textHash = textHash / 11; | |
textHash = textHash + textArray[i] * Double.valueOf(Math.pow(PRIME,patternLength – 1)).intValue(); | |
} | |
boolean isMatchingVerified = false; | |
if(i == (patternLength – 1)){ | |
if(textHash == patternHash) { | |
isMatchingVerified = verifyString(textArray, patternArray, 0, patternLength); | |
if(isMatchingVerified) | |
matchedIndex = 0; | |
} | |
} | |
if(i > (patternLength – 1)){ | |
isMatchingVerified = verifyString(textArray, patternArray, i – patternLength + 1, i); | |
if(isMatchingVerified) | |
matchedIndex = i – patternLength + 1; | |
} | |
} | |
return matchedIndex; | |
} | |
private static boolean verifyString(char[] textArray, char[] patternArray, int start, int end) { | |
boolean isMatchingVerified = true; | |
for(int i = start, j = 0; i <= end; i++, j++ ){ | |
if(textArray[i] != patternArray[j]) | |
isMatchingVerified = false; | |
} | |
return isMatchingVerified; | |
} | |
private static Integer getHash(String string) { | |
char[] stringArray = string.toCharArray(); | |
Integer patternHash = 0; | |
int patternLength = stringArray.length; | |
for(int i = 0; i < patternLength; i++){ | |
patternHash = patternHash + stringArray[i] * Double.valueOf(Math.pow(PRIME, i)).intValue(); | |
} | |
return patternHash; | |
} | |
} |
Time Complexity
Its average and best case running time is O(n+m) but its worst-case time is O(nm) for text of length n and pattern of length m.
Reference and further reading:
Why would you use prime number for hash?
Simple Explanation And Implementation Of Knuth-Morris-Pratt (KMP) Algorithm For String Search
A beginner guide to Brute Force Algorithm for substring search
If you liked this article and would like one such blog to land in your inbox every week, consider subscribing to our newsletter: https://skillcaptain.substack.com