Boyer Moore Algorithm for Pattern Searching using Python

The Boyer Moore Algorithm is the most efficient in pattern matching algorithms. Pattern searching methods display search results when looking at a string in a notepad/word file, web browser, or database. A common pattern searching technique is the Boyer Moore String Search technique, with practical applications.

Strings search methods quicker than O(n) usually need some preprocessing of the text, which can be computationally expensive if done frequently or memory-intensive if the text is extensive. Boyer Moore requires that the pattern being sought be pre-processed.

This algorithm starts with preprocessing the string we want to search. The information we get from the preprocessing is used for further steps in the search algorithm. The main property of this algorithm is to match the tail portion of the sequence instead of the head and skip down the text in multiple character jumps rather than checking every letter in the text.

Problem Statement

A string is given as str[0….s - 1] and a pattern string as ptn[0….p-1], where s is the string length, and p is the length of the pattern; we need to make a function that will print all the occurrence of the pattern string in the text string. We can assume that the length of the text string is more than the length of the pattern string.

Let's understand the problem statement with a few examples.

Example 1:

Input: 
str[ ]: "HELLO TESTING THE TEST TEXT"
ptn[ ]: "TEST"

Output:
The Pattern "TEST" is found at index 6.
The Pattern "TEST" is found at index 18.

Example 2:

Input: 
str[ ]: "ABCDABBACACABBAD"
ptn[ ]: "ABBA"

Output:
The Pattern "ABBA" is found at index 4.
The Pattern "ABBA" is found at index 11.

Boyer Moore Algorithm for Pattern Searching using Python

Explanation:

An immutable data type used to store the characters and multiple sequences is known as a string.

If we look at the examples, we have started from the beginning index to search for the pattern. In this first example, we started to search the pattern "TEST" from the index 0 and got the pattern at index 6 and then 18. Similarly, we tried to find the pattern "ABBA" in the second example and started searching it from the first position. We got the pattern first at the 4^th index and then at the 11^th index.

We will use the Boyer Moore algorithm to search patterns in strings, where two pointers are assigned, which searches the pattern in the string. The pointer is assigned at the 0^th index of the string and the character string. This algorithm searches the pattern from the last character. Then, the pattern is compared with the text string from its current position, starting from the rightmost character.

When a character fails to match, the Boyer-Moore algorithm moves the characters simultaneously in two ways:

Bad Character Heuristics
Good Suffix Heuristics

Bad Character Heuristics

When the algorithm fails to find the pattern in the string, it gives two different cases:

Some Mismatched Characters of the input is there in the pattern

In these situations, the mismatched character is called a bad character. When a mismatched character is discovered, we move the pattern until it matches the characters of the string.

For instance, we are searching the pattern PCERST in the string ABPRPXCPREDFT. When we compared the pattern string with the text string, we found a mismatch between P, the character of the string (the bad character), and E, the character of the pattern. Then, we will shift the pattern string until the character P of the pattern string matches the character P of the text string.

Explanation:

We started to search for the pattern in the text string from the index 0 and found that the first character is not matched with the pattern character. Then, we shifted the pattern to match the string text.

When the text string does not have any mismatched characters in the pattern string

To understand this situation, let's take an example: We need to search the pattern string PCERST in the string ABPRPXCPREDFT. We will compare the pattern from right to left.

Implementation for the bad character heuristics:

Code:

total_chars = 256

def badCharacterHeuristic(char, size):

    badChar = [-1]*total_chars

# Filling the actual value of the last occurrence of a character in the bad character list (array).
    for i in range(size):
        badChar[ord(char[i])] = i

    return badChar

def boyerMooreAlgorithm(txt, ptn):
    # Getting the size of the pattern and the string.
    m = len(ptn)
    s = len(txt)

    badCharacter = badCharacterHeuristic(ptn, m)

    # Initially, there is no shift
    shift = 0

    # Defining the boundaries of the searching
    while(shift <= s-m):
        j = m-1

        # If the pattern and text match, keep reducing the j variable.
        while j >= 0 and ptn[j] == txt[shift+j]:
            j -= 1

# Checking for the current location of the shift; if it is at the current shift, then j will become -1
        if j < 0:
            print(" The Pattern found at index: ", shift)

            shift += (m-badCharacter[ord(txt[shift+m])]
                       if shift+m < s else 1)
        else:
            shift += max(1, j-badCharacter[ord(txt[shift+j])])

txt = "SWERTAPABCAAPPLOC"
ptn = "ABCAAPPLO"
boyerMooreAlgorithm(txt, ptn)

Output:

The Pattern found at index:  7

Complexity Analysis of Bad Character Heuristics

In this approach, we have traversed the pattern for each part of the text string.

Time Complexity

The time complexity of this approach is O (n x m), where n is the length of the text and m is the length of the pattern.

Space Complexity

No extra space is being used for the preprocessing of the array. Thus, the space complexity is (O (1)).

Good Suffix Heuristics

A different approach to searching the pattern with the Boyer-Moore algorithm is first to discover the good suffix and then analyze it. The fundamental idea is to align the overlapping regions of the pattern and the text string together to shift more effectively when a mismatch occurs.

Let's understand this situation with the help of an example. Take a pattern and start searching from the right side. Assume that a part of the pattern string is found. We will search till a mismatch is found. The matched pattern or the part of the pattern that matches the text string is called a Good String. Any mismatch in the text means that the current location is not the starting location of the pattern string. Now, we will shift the pattern to the next location for searching. The good suffix can be used for shifting the pattern.

There are two situations in which we use a good suffix:

1. The position of the good suffix is different in the pattern:

For example, the text string is AABBCAABBGCAABBAA, and the pattern string is AABBAA; we can see that the good string (a portion of the pattern string is matched to the text string) is present at different positions. We will shift the good suffix such that it is aligned to the text and the pattern text matches the text string.

2. Some portion of the good suffix is a prefix of the pattern:

While searching, we found a mismatch in the pattern, but not the exact match of the good string. If the good suffix is found as the prefix of the pattern, we can shift the suffix to align the pattern with the text string.

Implementation for the good character heuristics:

Code:

# suffix preprocessing
def suffix_processing(shift, borderPos, ptn, n):
    i = n
    j = n + 1
    borderPos[i] = j
    
    while i > 0:
        while j <= n and ptn[i - 1] != ptn[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i

            j = borderPos[j]

        i -= 1
        j -= 1
        borderPos[i] = j

# prefix preprocessing
def prefix_processing(shift, borderPos, ptn, n):
    j = borderPos[0]
    for i in range(n + 1):
        if shift[i] == 0:
            shift[i] = j

        if i == j:
            j = borderPos[j]

# searching string 

def boyerMooreAlgorithm(txt, ptn):
    s = 0
    n = len(ptn)
    m = len(txt)

    shift = [0] * (n + 1)
    borderPos = [0] * (n + 1)

    # First, performing the preprocessing
    suffix_processing(shift, borderPos, ptn, n)
    prefix_processing(shift, borderPos, ptn, n)

    while s <= m - n:
        j = n - 1

        while j >= 0 and ptn[j] == txt[s + j]:
            j -= 1

        if j < 0:
            print("The Pattern is found at index:", s)
            s += shift[0]
        else:
            s += shift[j + 1]

txt = "ABCDAABBEEPYPAABBEE"
ptn = "AABBEE"
boyerMooreAlgorithm(txt, ptn)

Output:

The Pattern is found at index: 4
The Pattern is found at index: 13

Complexity Analysis of Good Character Heuristics

In this method, we have traversed the pattern string and preprocessed the string for the suffix and prefix.

Time Complexity

The time complexity of this approach is O (m x n), where m is the length of the text and n is the length of the pattern.

Space Complexity

The space complexity of this approach is [O (n) + O (n)], which is equal to O (m), where m = n + n.

Next TopicCLEAN Tips to IMPROVE Python Functions

← prev next →

Latest Courses

We provides tutorials and interview questions of all technology like java tutorial, android, java frameworks

Contact info

G-13, 2nd Floor, Sec-3, Noida, UP, 201301, India

[email protected].

Python

AI, ML and Data Science

Java

B.Tech and MCA

Web Technology

Software Testing

Technical Interview

Java Interview

Web Interview

Database Interview

Company Interviews

Python Tutorial

Python OOPs

Python MySQL

Python MongoDB

Python SQLite

Python Questions

Plotly

Python Tkinter (GUI)

Python Web Blocker

Python MCQ

Related Tutorials

Python Programs

Python String functions

Python Functions

Boyer Moore Algorithm for Pattern Searching using Python

Problem Statement

Bad Character Heuristics

Implementation for the bad character heuristics:

Good Suffix Heuristics

Implementation for the good character heuristics:

Latest Courses

Contact info

Follow us

Tutorials

Interview Questions

Online Compiler