Saturday, November 18, 2017

Simple parallelizable algorithm for approximate pattern matching

In this article, I'd like to show a simple algorithm for approximate pattern matching. This term generally means that you want to find all places in a big string (text) where a short string (pattern) resides in it. In this form, it defines exact pattern matching. But in our case, we will allow some "errors" (places where the symbol in the text is not equal to the symbol in the pattern). We just want to limit the number of such errors. This is why it is called approximate pattern matching.

For example, pattern "CAR" resides in the text "ABRACADABRA" in the 4th position (zero-based) with one error ("R" is not equal to "D").

In this article, I'll describe an algorithm which solves approximate pattern matching problem using a suffix tree. I assume that you know what the suffix tree is, how to construct it and how to use it to solve exact pattern matching problem. Here I'll touch these topics only briefly. If you want to learn more about these things, I want to recommend you a great course Algorithms on Strings on Coursera.

Exact pattern match algorithm


Let's start with a look at the algorithm solving exact pattern match problem:

public static class SuffixTreeSearch<TSymbol>
{
    public static IEnumerable<StringSearchMatch> Search(
        IReadOnlyList<TSymbol> text,
        TSymbol stopSymbol,
        IEnumerable<IEnumerable<TSymbol>> patterns,
        IComparer<TSymbol> comparer = null)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        if (patterns == null)
            yield break;

        var suffixTree = SuffixTreeCreator<TSymbol>.CreateSuffixTree(text, stopSymbol, comparer);

        foreach (var pattern in patterns)
        {
            foreach (var match in GetMatches(pattern.ToArray(), suffixTree))
            {
                yield return match;
            }
        }
    }

    private static IEnumerable<StringSearchMatch> GetMatches(TSymbol[] pattern, SuffixTree<TSymbol> suffixTree)
    {
        SuffixTreeNode<TSymbol> node = GetPatternNode(pattern, suffixTree);
        if(node == null)
            yield break; 

        var starts = GetSuffixStarts(node);

        foreach (var start in starts.Where(s => s < suffixTree.Text.Count - 1))
        {
            yield return new StringSearchMatch(start, pattern.Length);
        }
    }

    private static SuffixTreeNode<TSymbol> GetPatternNode(TSymbol[] pattern, SuffixTree<TSymbol> suffixTree)
    {
        if (pattern.Length == 0)
            return suffixTree.Root;

        var lastEdge = GetLastEdge(pattern, suffixTree);

        return lastEdge?.To;
    }

    private static SuffixTreeEdge<TSymbol> GetLastEdge(TSymbol[] pattern, SuffixTree<TSymbol> suffixTree)
    {
        var node = suffixTree.Root;
        var patternIndex = 0;

        while (true)
        {
            if (!node.Edges.ContainsKey(pattern[patternIndex]))
                return null;

            var edge = node.Edges[pattern[patternIndex]];

            var matchLength = GetMatchLength(edge, suffixTree.Text, pattern, patternIndex);

            patternIndex += matchLength;

            if (patternIndex >= pattern.Length)
                return edge;

            if (matchLength < edge.Length)
                return null;

            node = edge.To;
        }

    }

    private static int GetMatchLength(SuffixTreeEdge<TSymbol> edge, IReadOnlyList<TSymbol> edgeText, TSymbol[] pattern, int patternIndex)
    {
        var edgeIndex = (int)edge.Start;

        var matchLength = 0;

        while (true)
        {
            if (!edgeText[edgeIndex].Equals(pattern[patternIndex]))
                break;
            matchLength++;
            edgeIndex++;
            patternIndex++;
            if (edgeIndex >= edge.Start + edge.Length)
                break;
            if (patternIndex >= pattern.Length)
                break;
        }

        return matchLength;
    }

    private static IEnumerable<int> GetSuffixStarts(SuffixTreeNode<TSymbol> node)
    {
        var queue = new Queue<SuffixTreeNode<TSymbol>>();
        queue.Enqueue(node);

        while (queue.Count > 0)
        {
            node = queue.Dequeue();
            if (node.SuffixStart.HasValue)
                yield return (int) node.SuffixStart.Value;

            foreach (var edge in node.Edges.Values)
            {
                queue.Enqueue(edge.To);
            }
        }
    }
}


The main method is Search. It accepts text, stop symbol (it is used to make sure that any suffix is not a prefix of another suffix), a sequence of patterns, and comparer of symbols. This method creates a suffix tree and enumerates all patterns to find matches for each of them.

Method GetMatches find all matches for one pattern. First of all, it finds a node in the suffix tree, to which the pattern leads from the root of the tree (call to GetPatternNode). Then for this node, it finds starting positions of all suffixes, to which this node belongs (call to GetSuffixStarts). These positions are essentially what we were looking for.

Method GetPatternNode uses GetLastEdge to find an edge where the pattern stops (in a suffix tree each edge contains part of the initial text). We just go from edge to edge starting from the root node, while the corresponding symbol on an edge is equal to the corresponding symbol on the pattern.

Method GetSuffixStarts is simple. Leaf nodes in a suffix tree contain positions where corresponding suffixes start. We just must collect all these positions from the subtree of the found node.

You can find full code of the algorithm here. Also in the same repository, you can find classes for creation of the suffix tree, if you want it.

Now, when we refreshed in our memory how the algorithm for exact pattern matching works, let's continue with approximate pattern matching.

Approximate pattern matching


The idea for the approximate pattern matching is simple. When we looked for the exact match, there was no more than one path from each node, no more then one edge to consider. When we are looking for the approximate match, we must consider all paths from a node, all outgoing edges. And only if on some path number of errors exceeds our limit, we should abandon this path.

Take a look at the code:


public static class SuffixTreeApproximateSearch<TSymbol>
{
    private class TaskDescription
    {
        public TaskDescription(
            SuffixTreeNode<TSymbol> startNode, 
            IReadOnlyList<TSymbol> pattern, 
            int patternStartIndex, 
            uint numberOfErrors)
        {
            StartNode = startNode;
            Pattern = pattern;
            PatternStartIndex = patternStartIndex;
            NumberOfErrors = numberOfErrors;
        }

        public SuffixTreeNode<TSymbol> StartNode { get; }
        public IReadOnlyList<TSymbol> Pattern { get; }
        public int PatternStartIndex { get; }
        public uint NumberOfErrors { get; }
    }

    private class EdgeMatch
    {
        public EdgeMatch(int matchLength, uint numberOfErrors)
        {
            MatchLength = matchLength;
            NumberOfErrors = numberOfErrors;
        }

        public int MatchLength { get; }
        public uint NumberOfErrors { get; }
    }

    public static IEnumerable<StringSearchApproximateMatch<TSymbol>> Search(
        IReadOnlyList<TSymbol> text,
        TSymbol stopSymbol,
        IEnumerable<IEnumerable<TSymbol>> patterns,
        uint numberOfErrors,
        IComparer<TSymbol> comparer = null)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        if (patterns == null)
            return new StringSearchApproximateMatch<TSymbol>[0];

        comparer = new StopSymbolFirstComparer<TSymbol>(comparer ?? Comparer<TSymbol>.Default, stopSymbol);

        var suffixTree = SuffixTreeCreator<TSymbol>.CreateSuffixTree(text, stopSymbol, comparer);

        return patterns
            .SelectMany(pattern => GetMatches(pattern.ToArray(), numberOfErrors, suffixTree, comparer));
    }

    private static IEnumerable<StringSearchApproximateMatch<TSymbol>> GetMatches(
        IReadOnlyList<TSymbol> pattern, 
        uint numberOfErrors, 
        SuffixTree<TSymbol> suffixTree,
        IComparer<TSymbol> comparer)
    {
        Queue<TaskDescription> queue = new Queue<TaskDescription>();
        queue.Enqueue(new TaskDescription(
            suffixTree.Root,
            pattern,
            0,
            numberOfErrors));

        LinkedList<SuffixTreeNode<TSymbol>> foundNodes = new LinkedList<SuffixTreeNode<TSymbol>>();

        while (queue.Count > 0)
        {
            TaskDescription task = queue.Dequeue();

            ProcessSearchTask(queue, foundNodes, task, suffixTree.Text, comparer);
        }

        return foundNodes
            .SelectMany(GetSuffixStarts)
            .Where(s => PatternDoesNotOverlapStopSymbol(s, pattern.Count, suffixTree.Text.Count))
            .Select(s => new StringSearchApproximateMatch<TSymbol>(s, pattern));
    }

    private static bool PatternDoesNotOverlapStopSymbol(int startPosition, int patternLength, int textLength)
    {
        if (startPosition >= textLength - 1)
            return false;
        if (patternLength > 0 && startPosition + patternLength >= textLength)
            return false;
        return true;
    }

    private static void ProcessSearchTask(
        Queue<TaskDescription> queue, 
        LinkedList<SuffixTreeNode<TSymbol>> foundNodes, 
        TaskDescription task, 
        IReadOnlyList<TSymbol> text,
        IComparer<TSymbol> comparer)
    {
        if (task.PatternStartIndex >= task.Pattern.Count)
        {
            foundNodes.AddLast(task.StartNode);
            return;
        }

        foreach (var edge in task.StartNode.Edges.Values)
        {
            EdgeMatch edgeMatch = GetEdgeMatch(edge, text, task.Pattern, task.PatternStartIndex, comparer);

            if(edgeMatch.NumberOfErrors > task.NumberOfErrors)
                continue;

            var newPatternStartIndex = task.PatternStartIndex + edgeMatch.MatchLength;

            if (newPatternStartIndex >= task.Pattern.Count)
            {
                foundNodes.AddLast(edge.To);
                continue;
            }
                
            queue.Enqueue(new TaskDescription(
                edge.To,
                task.Pattern,
                newPatternStartIndex,
                task.NumberOfErrors - edgeMatch.NumberOfErrors
                ));
        }
    }

    private static EdgeMatch GetEdgeMatch(
        SuffixTreeEdge<TSymbol> edge, 
        IReadOnlyList<TSymbol> text, 
        IReadOnlyList<TSymbol> pattern, 
        int patternStartIndex,
        IComparer<TSymbol> comparer)
    {
        int matchLength = 0;
        uint numberOfErrors = 0;

        while (true)
        {
            var textIndex = edge.Start + matchLength;
            var pattentIndex = patternStartIndex + matchLength;
            if(matchLength >= edge.Length)
                break;
            if(pattentIndex >= pattern.Count)
                break;

            if (comparer.Compare(text[(int) textIndex], pattern[pattentIndex]) != 0)
                numberOfErrors++;

            matchLength++;
        }

        return new EdgeMatch(matchLength, numberOfErrors);
    }

    private static IEnumerable<int> GetSuffixStarts(SuffixTreeNode<TSymbol> node)
    {
        var queue = new Queue<SuffixTreeNode<TSymbol>>();
        queue.Enqueue(node);

        while (queue.Count > 0)
        {
            node = queue.Dequeue();
            if (node.SuffixStart.HasValue)
                yield return (int) node.SuffixStart.Value;

            foreach (var edge in node.Edges.Values)
            {
                queue.Enqueue(edge.To);
            }
        }
    }
}

Method Search is the same as for the previous algorithm. But in the GetMatches implementation, there are serious changes. Here we maintain a queue of tasks to solve and a list of found nodes for the current pattern. One task is represented by the TaskDescription class. The task says, that we should find all possible matches of a pattern (Pattern property) starting from specific position (PatternStartIndex property) inside subtree of a node of the suffix tree (StartNode property) with a limited number of errors (NumberOfErrors property). In the 'while' loop we'll solve our tasks and then we'll return results from found nodes using the same GetSuffixStarts method. The only difference in this last part is that now our patterns can overlap with stop symbol because we allowed some number of errors. Filtering using PatternDoesNotOverlapStopSymbol method prevents this situation.

Solving of one task is implemented by ProcessSearchTask method. First of all, we check if our pattern is already processed to the end. In this case, we just add the current node to the list of found nodes. Otherwise, we consider all edges going out of the current node. For each edge, we calculate two numbers using the GetEdgeMatch method: number of compared symbols along this edge (MatchLength property of EdgeMatch class) and number of errors during this comparison (NumberOfErrors property of EdgeMatch class). If this number of error exceeds our limit, we just do nothing. Otherwise, if the end of the pattern is reached, we add the end of the edge to the list of found nodes. Otherwise, we create a new task to process. This task contains another node, another starting position in the pattern and another allowed number of errors.

I want to stress your attention to the fact, that the ProcessSearchTask method may not solve it's task completely. Instead, it can generate several smaller (simpler) tasks to solve.

This is how the algorithm of the approximate pattern matching works. Simple, isn't it? :-) You can take a look at the complete code here.

Although this algorithm works, there is another simple way to make it even better. I mean multithreading.

Multithreaded approximate pattern matching


Take a look at the last algorithm. All we do is generate and solve different tasks. Also notice, that all tasks are independent. Our suffix tree and patterns are read-only. And we never change a bit in our tasks description. This immutability is a good sign, that we can apply multithreaded processing here.

Let's examine a new algorithm:


public static class MultithreadedSuffixTreeApproximateSearch<TSymbol>
{
    private class TaskDescription
    {
        public TaskDescription(SuffixTreeNode<TSymbol> startNode, IReadOnlyList<TSymbol> pattern, int patternStartIndex, uint numberOfErrors)
        {
            StartNode = startNode;
            Pattern = pattern;
            PatternStartIndex = patternStartIndex;
            NumberOfErrors = numberOfErrors;
        }

        public SuffixTreeNode<TSymbol> StartNode { get; }
        public IReadOnlyList<TSymbol> Pattern { get; }
        public int PatternStartIndex { get; }
        public uint NumberOfErrors { get; }
    }

    private class EdgeMatch
    {
        public EdgeMatch(int matchLength, uint numberOfErrors)
        {
            MatchLength = matchLength;
            NumberOfErrors = numberOfErrors;
        }

        public int MatchLength { get; }
        public uint NumberOfErrors { get; }
    }

    private class Waiter
    {
        private readonly object _lock = new object();
        private int _counter;

        public void Increment()
        {
            lock (_lock) { _counter++; }
        }

        public void Decrement()
        {
            lock (_lock) { _counter--; }
        }

        public bool IsFinished()
        {
            lock (_lock) { return _counter == 0; }
        }
    }

    public static IEnumerable<StringSearchApproximateMatch<TSymbol>> Search(
        IReadOnlyList<TSymbol> text,
        TSymbol stopSymbol,
        IEnumerable<IEnumerable<TSymbol>> patterns,
        uint numberOfErrors,
        IComparer<TSymbol> comparer = null)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        if (patterns == null)
            return new StringSearchApproximateMatch<TSymbol>[0];

        comparer = new StopSymbolFirstComparer<TSymbol>(comparer ?? Comparer<TSymbol>.Default, stopSymbol);

        var suffixTree = SuffixTreeCreator<TSymbol>.CreateSuffixTree(text, stopSymbol, comparer);

        var waiter = new Waiter();

        var results = new ConcurrentBag<StringSearchApproximateMatch<TSymbol>>();

        foreach (var pattern in patterns)
        {
            waiter.Increment();

            var task = new TaskDescription(
                suffixTree.Root,
                pattern.ToArray(),
                0,
                numberOfErrors);

            ThreadPool.QueueUserWorkItem(state =>
            {
                ProcessSearchTask(results, waiter, suffixTree.Text, comparer, task);
            });
        }

        while (!waiter.IsFinished())
        {
            Thread.Sleep(0);
        }

        return results;
    }

    private static bool PatternDoesNotOverlapStopSymbol(int startPosition, int patternLength, int textLength)
    {
        if (startPosition >= textLength - 1)
            return false;
        if (patternLength > 0 && startPosition + patternLength >= textLength)
            return false;
        return true;
    }

    private static void ProcessSearchTask(ConcurrentBag<StringSearchApproximateMatch<TSymbol>> results, Waiter waiter, IReadOnlyList<TSymbol> text, IComparer<TSymbol> comparer, TaskDescription task)
    {
        if (task.PatternStartIndex >= task.Pattern.Count)
        {
            AddNodeMatches(results, task.StartNode, task.Pattern, text.Count);
            waiter.Decrement();
            return;
        }

        foreach (var edge in task.StartNode.Edges.Values)
        {
            EdgeMatch edgeMatch = GetEdgeMatch(edge, text, task.Pattern, task.PatternStartIndex, comparer);

            if(edgeMatch.NumberOfErrors > task.NumberOfErrors)
                continue;

            var newPatternStartIndex = task.PatternStartIndex + edgeMatch.MatchLength;

            if (newPatternStartIndex >= task.Pattern.Count)
            {
                AddNodeMatches(results, edge.To, task.Pattern, text.Count);
                continue;
            }

            var newTask = new TaskDescription(
                edge.To,
                task.Pattern,
                newPatternStartIndex,
                task.NumberOfErrors - edgeMatch.NumberOfErrors
                );

            waiter.Increment();

            ThreadPool.QueueUserWorkItem(state =>
            {
                ProcessSearchTask(results, waiter, text, comparer, newTask);
            });
        }

        waiter.Decrement();
    }

    private static void AddNodeMatches(ConcurrentBag<StringSearchApproximateMatch<TSymbol>> results, SuffixTreeNode<TSymbol> node, IReadOnlyList<TSymbol> pattern, int textLength)
    {
        foreach (var start in GetSuffixStarts(node).Where(s => PatternDoesNotOverlapStopSymbol(s, pattern.Count, textLength)))
        {
            results.Add(new StringSearchApproximateMatch<TSymbol>(start, pattern));
        }
    }

    private static EdgeMatch GetEdgeMatch(
        SuffixTreeEdge<TSymbol> edge, 
        IReadOnlyList<TSymbol> text, 
        IReadOnlyList<TSymbol> pattern, 
        int patternStartIndex,
        IComparer<TSymbol> comparer)
    {
        int matchLength = 0;
        uint numberOfErrors = 0;

        while (true)
        {
            var textIndex = edge.Start + matchLength;
            var pattentIndex = patternStartIndex + matchLength;
            if(matchLength >= edge.Length)
                break;
            if(pattentIndex >= pattern.Count)
                break;

            if (comparer.Compare(text[(int) textIndex], pattern[pattentIndex]) != 0)
                numberOfErrors++;

            matchLength++;
        }

        return new EdgeMatch(matchLength, numberOfErrors);
    }

    private static IEnumerable<int> GetSuffixStarts(SuffixTreeNode<TSymbol> node)
    {
        var queue = new Queue<SuffixTreeNode<TSymbol>>();
        queue.Enqueue(node);

        while (queue.Count > 0)
        {
            node = queue.Dequeue();
            if (node.SuffixStart.HasValue)
                yield return (int) node.SuffixStart.Value;

            foreach (var edge in node.Edges.Values)
            {
                queue.Enqueue(edge.To);
            }
        }
    }
}

The idea behind this code is simple. When we create a new task, we don't store it, but immediately run it using  ThreadPool.QueueUserWorkItem. It gives us the ability to use a pool of threads straight away. As we process many tasks simultaneously, we must use a concurrent collection for gathering results of their execution. We use ConcurrentBag for this purpose.

The only thing to solve is the question when we should stop. We must somehow know that all tasks are executed and there are no more tasks. To achieve this goal I have implemented a simple Waiter class. It contains a counter and allows to increment and decrement it and to compare with zero in a thread-safe way. Before placing a new task in the pool of threads, we increment the counter. When any task is finished, we decrement the counter. Now, all we need to do is to wait until the counter is zero.

I admit, that I'm not a specialist in a multithreaded code. I'm sure one can come up with a better solution, how to wait for all tasks.

The full code of the class can be found here.

Efficiency


Here I'll try to somehow estimate the efficiency of the algorithm. Be aware, that here we'll talk about the non-multithreaded version of the algorithm. Also, you should know that I'm not a specialist in such questions. So you should take everything in this part of the article with a great pinch of salt.

First of all, you should not forget about time, required to build a suffix tree. It can be constructed in O(length of text) time.

Now let's look at the work of the SuffixTreeApproximateSearch class. The main work is done in the ProcessSearchTask method. Let's say that our alphabet (all letters that can be in text and patterns) contains L letters. It means, that in the worst case each node of a suffix tree will contain about L children (L + 1 if we count stop symbol). In the worst case, each edge of the tree will correspond to only one symbol of the text. It means, that each task (an instance of the TaskDescription class) will compare only one symbol of a pattern with one symbol of the text.

While we still have room for errors, a task for each node will generate L subtasks. If we allow N errors in total, then at least L^N (L to the Nth power) tasks will be generated until ability for errors is exhausted. So the efficiency of the algorithm can't be better than O(L^N).

After that, we can make no more errors. So we are back to the exact pattern match problem. If the length of the pattern is much bigger than N, this part of the algorithm will take no more than O(length of the pattern) for each node, reached in the previous part.

So total time will be approximately O(<length of text> + L^N * <length of pattern>). Of course, this estimation is only for one pattern. For many patterns, the time will be appropriately bigger.

Conclusion


In this article, I've presented one algorithm for approximate pattern matching. I must inform you, that there are more efficient algorithms solving the same problem. To the advantages of the described algorithm, I can add its simplicity and ability to solve its task using multiple threads, thus leveraging the power of several processor cores.

I hope it will be helpful for you in your applications.

No comments:

Post a Comment