A Simple Explanation of Rabin-Karp Algorithm For String Search

Introduction

If you are new to string search, I would recommend to first read the brute force approach here. Brute force as discussed in the mentioned post has time complexity of O(mn) in worst case. Rabin-Karp also has the worst case time complexity of O(mn), but it has a much better time complexity of O(m + n) in best case.

The Rabin–Karp algorithm not as good as Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms for single pattern searching because of its slow worst case time complexity. However, it does better for multiple pattern search. An application of this algorithm is in detection of plagiarism.

Problem Statement

We will assume text (where we have to search) is an array of  n characters. Let’s denote the text by T[1..n]. String to be searched is called pattern and is represented by an array P[1..m] where m ≤ n.

We need to find indexes in given text T where the given pattern P occurs. We need to improve from the brute force algorithm having time complexity of O(m*n).

Refer here for detailed problem statement.

Explanation of Rabin-Karp Algorithm

Let’s assume text as abcabac and pattern as abac.

Step 1:

Generate hash function of the pattern. We will define hash function as follow for a character array of pattern P[] and any prime number, Prime:

Hash = ascii of (P[0]) * Prime^0 + ascii of (P[1]) * Prime^1 + ascii of (P[2]) * Prime^2 + ............

We will use prime as 7.

Hash of Pattern =  97 * 7^0 + 98 * 7^1 + 97 * 7^2 + 99 * 7^3 = 39493

Ascii of ‘a’ = 97. Ascii of ‘b’ = 98. Ascii of ‘c’ = 99.


Why do we have prime number to generate hash value? 


Step 2:

We will do rolling hash of the text.  We will use the same hash function.

Let’s illustrate this with an example:

Our text is abcabac. We will calculate hash of abca which is the first four character of the text.

Hash of abca = 97 * 7^0 + 98 * 7^1 + 99 * 7^2 + 97 * 7^3 = 38905

We will now see if the hash of abca is equal to abac. Since they aren’t we now need to calculate hash of bcab. bcab is the substring of text starting from second character. We will not use the hash formulae, instead we will use the hash of abca and do some arithmetic operation to get result.

Loosely, we can come up with the following formulae:

Hash of bcab =  (Hash of abca – T[0]) / prime+ T[4] * Prime^3

Where T[] is the character array of text.

Hash of bcab = (38905 – 97 * 7^0 ) / 7 + 98 * 7^3 = 5544 + 33614 = 39158

We will again match hash of bcab with the hash of pattern. And we will take the next substring of text and repeat the process.

Step 3:

When we reach to the last substring abac of the substring, we will find the hash value of the pattern as well as substring equal. One important rule is equal hash value does not necessarily mean matched strings. After the hash value is matched, we will compare the substring with the pattern. Only when that is matched, we say that the substring is found.

Implementation of Rabin Karp

Time Complexity

Its average and best case running time is O(n+m) but its worst-case time is O(nm) for text of length n and  pattern of length m.


Reference and further reading:

1

Wiki

Why would you use prime number for hash?

Simple Explanation And Implementation Of Knuth-Morris-Pratt (KMP) Algorithm For String Search

A beginner guide to Brute Force Algorithm for substring search

 

Advertisements

One thought on “A Simple Explanation of Rabin-Karp Algorithm For String Search

Add yours

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

Up ↑

%d bloggers like this: