Monday, 3 January 2011

String Searching – The Naive Approach

We all should all know what string searching is. We have a pattern p and some string s, and we wish to see if p exists in s. There are a number of ways to do this, and this is one of those many ways; the naive approach.

 

The Problem

So what’s the problem then? Well, it was quickly identified within the welcome message above, but we need a more formal description.

Let’s say we have some string or text, s, and some pattern, p, to find within s. Such that,

  • s = s1s2s3 … sn
  • p = p1p2p3 … pm

As we have stated before, and as will be stated many more times, does p occur in s? If it turns out that it does, where does it occur, at what position in s?

Quick Example

Say we have the following s and p,

  • s = abracadabra
  • p = abra

If we were to think of a string as a character array, like in C, then the first ‘a’ in s would have position 0, and the following ‘b’ would have position 1. The same rules apply to p as well. [This would be a zero-based index such that s[0] = ‘a’ ].

    Without stating how it is actually done, we can clearly see at a glance that the pattern p will first occur in s at position 0. This is the case for finding the first occurrence of p in s. If we were searching for all occurrences then we can also point out that p occurs in s at position 7 as well.

     

    The Basic Idea

    Now we’ve seen a quick example, but on the whole how does the naive algorithm work? Well here’s the basic idea behind it – pseudo code so to speak.

    First of all we have a pattern p and some string s. Starting at the beginning of s we check, character by character, if p is at this position in s until one of the following happens:

  • We find 2 characters that differ in p and s. That is, we have the character ‘a’ in p and ‘b’ in s; clearly ‘a’ does not equal ‘b’.  If this happens, we don’t have a match at the current start point in s.
  • If we reach the end of p, for example we’ve run out of characters to compare against s, then we have found a match and p exists in s.
  • Of course we can have the vice versa, that is, we reach the end of s, which means that p does not occur in s.

A general rule of thumb is, that if we have stumbled across the first bullet-point, then we slide the pattern across one character in s, and start the character by character comparisons once again.

 

Better Example

The quick example above is great to get the basic idea of how this works – kind of. But how do it properly work? Here we’ll work through the steps given above.

Let’s say we have the same string and pattern from above, “abracadabra” and “abra” respectively. Starting at the initial position we have:

  • abracadabra
  • abra

Now this would be a best case scenario for this naive algorithm, because we have found a match straight of the top! That is ‘a’=’a’, ‘b’=’b’ etc.

But what if the string to search through was instead “abrncadabra”? Well, starting at the initial position again we have,

  • abrncadabra
  • abra

Everything is perfectly fine. We compare character by character until we reach position 3 in both p and s! Here p[3]=a and s[3]=n, so clearly p[3] != s[3].

!= means ‘is not equal to’. So p[3] != s[3] says ‘p[3] is not equal to s[3]’.

Seeing as we have found a mismatch, then what we have stumbled across is the first bullet-point from the pseudo code above. What we do here is simply slide the pattern across to the next start position in s to start comparing:

  • abrncadabra
  • abra

So here, we’re comparing the pattern “abra” to the substring “brnc” of s. This clearly won’t match, so let’s skip a few movements ahead shall we, till we get:

  • abrncadabra
  •        abra

Now we’re comparing “abra” to the “abra” substring of s. This is an obvious match. Therefore we reach the end of p, meaning the second bullet-point in the pseudo-code. We have now found our match at position 7 in s.

Now let’s say we instead had the string “abrncadabrn”. I won’t work through this, I’ll leave it for you; but it should be obvious that you’ll reach the end of string s,  which means we haven’t found a match (bullet-point 3).

 

Complexity

It should be clear now as to why this is a naive approach. It’s because all we ever do if we don’t find a match is to just slide the pattern along to the next starting position in s. This is very inefficient.

Let’s say that the string s has a length of m, and that the pattern p has a length of n.

In the worst case, what is the most number of comparisons that we would have to make? Well this is simply calculated by the formula below,

  • Number of Comparisons = Length of Pattern * Number of possible start positions in the String
  • Number of Comparisons = n * (m – n + 1)

For example, let’s use the case again where s = “abrncadabra” and p = “abra”. The length of s is 11, and p is 4. Therefore m=11 and n=4.

  • Number of Comparisons = n * (m – n +1)
  • Number of Comparisons = 4 * (11 – 4 + 1)
  • Number of Comparisons = 4 * (8)
  • Number of Comparisons = 32

Therefore the total number of comparisons made when trying to locate p in s is 32.

It should be obvious that the length of the string s will commonly be much larger than the pattern p. Since this is the case, then the time complexity for the naive algorithm is O(nm)

 

Resources

[1] University of Manchester, Advanced Algorithms 1 lecture notes by David Rydeheard

No comments:

Post a Comment