If you are new to string search, I would recommend to first read the brute force approach here. Brute force as discussed in the mentioned post has time complexity of O(mn) in worst case. Rabin-Karp also has the worst case time complexity of O(mn), but it has a much better time complexity of O(m + n) in best case.
The Rabin–Karp algorithm not as good as Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms for single pattern searching because of its slow worst case time complexity. However, it does better for multiple pattern search. An application of this algorithm is in detection of plagiarism.
We will assume text (where we have to search) is an array of n characters. Let’s denote the text by T[1..n]. String to be searched is called pattern and is represented by an array P[1..m] where m ≤ n.
We need to find indexes in given text T where the given pattern P occurs. We need to improve from the brute force algorithm having time complexity of O(m*n).
Explanation of Rabin-Karp Algorithm
Let’s assume text as abcabac and pattern as abac.
Generate hash function of the pattern. We will define hash function as follow for a character array of pattern P and any prime number, Prime:
Hash = ascii of (P) * Prime^0 + ascii of (P) * Prime^1 + ascii of (P) * Prime^2 + ............
We will use prime as 7.
Hash of Pattern = 97 * 7^0 + 98 * 7^1 + 97 * 7^2 + 99 * 7^3 = 39493
Ascii of ‘a’ = 97. Ascii of ‘b’ = 98. Ascii of ‘c’ = 99.
We will do rolling hash of the text. We will use the same hash function.
Let’s illustrate this with an example:
Our text is abcabac. We will calculate hash of abca which is the first four character of the text.
Hash of abca = 97 * 7^0 + 98 * 7^1 + 99 * 7^2 + 97 * 7^3 = 38905
We will now see if the hash of abca is equal to abac. Since they aren’t we now need to calculate hash of bcab. bcab is the substring of text starting from second character. We will not use the hash formulae, instead we will use the hash of abca and do some arithmetic operation to get result.
Loosely, we can come up with the following formulae:
Hash of bcab = (Hash of abca – T) / prime+ T * Prime^3
Where T is the character array of text.
Hash of bcab = (38905 – 97 * 7^0 ) / 7 + 98 * 7^3 = 5544 + 33614 = 39158
We will again match hash of bcab with the hash of pattern. And we will take the next substring of text and repeat the process.
When we reach to the last substring abac of the substring, we will find the hash value of the pattern as well as substring equal. One important rule is equal hash value does not necessarily mean matched strings. After the hash value is matched, we will compare the substring with the pattern. Only when that is matched, we say that the substring is found.
Implementation of Rabin Karp
Its average and best case running time is O(n+m) but its worst-case time is O(nm) for text of length n and pattern of length m.
Reference and further reading: