Approximate string matching metrics with amatch


Most often in sequence analysis we want to compare how  similar two sequences are. How can we quantify similarity by using a metric? That was my question yesterday and I went hunting for a ruby implementation for such metrics. Luckily I got a library called amatch which is an approximate string matching extension for ruby! amatch implements the following metrics:

Hamming distance, Levenshtein edit distance,longest subsequence common to two strings,longest substring common to two strings,sellers distance and pair distance which is based on the number of adjacent character pairs, that are contained in two  strings.

Hamming distance

This is the number of characters that are different between two strings. This is not recommended for the majority of string based information retrieval. Very similar strings can sometimes be given high hamming distances.

Leveshtein edit distance

Is defined as the minimal costs involved in transforming one string into another by using  deletion, insertion and substitution of a character to one of the strings. The algorithm can associate a cost for performing each of the operations and for this metric it is usually 1.

Longest common substring

This is define as the contiguous chain of characters that exists in both strings. The longer the substring the better the match between the two strings. The problem with this approach is that if a difference was introduced in the middle of one string, the distance will be longer that if the same difference was introduced at the beginning of one of the strings.

Longest common Subsequence

The longer the common sub sequence is, the more similar the two strings will be. In this case a sub sequence does not have to be contiguous.

Look at the documentation for more explanations of the metrics and algorithms.

To use the library you need to first install the gem. I installed it on my Linux box running Ubuntu and ruby 1.8.6.

sudo gem install amatch

Then in script,

require 'rubygems'

require 'amatch'
include Amatch
require  'bio'
#with bioruby it would be easy to compare two sequence entries  for example
seq_obj1 = Bio::Sequence.auto("actagatatttgat")
seq_obj2 = Bio::Sequence.auto("gccagatagttaat")

#calculate the hamming distance
 m = Hamming.new(seq_obj1.to_seq)
 m.match(seq_obj2.to_seq)
#=> 

#calculate pair-distances between the two sequences
pair_distance_obj = PairDistance.new(seq_obj1.seq)
pair_distance_obj.match(seq_obj2.seq)
 #=>
# note that you can just substitute the strings directly to the metric object creation method
without creating the sequence objects!

Note that amatch  failed to install on windows XP with the following error

Building native extensions.  This could take a while…
ERROR:  Error installing amatch:
ERROR: Failed to build gem native extension.

C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile

nmake
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.

Although i have nmake installed on my windows machine. I will look at that later.

Happy string matching!


6 Responses to “Approximate string matching metrics with amatch”

  1. Rob Syme Says:

    Just what I needed, thanks for the tip!

    I don’t think that Bio::Sequence objects have a ‘to_seq’ method, but they do have a ’seq’ method :)
    -r

  2. John McLeod Says:

    Hello,
    Just a question, how could I add ‘amatch’ to a standard find method?
    Thanks for any help.
    John

    • biorelated Says:

      Hi,
      Do you mean adding it to an activerecord finder method in rails? or which find method?

      • John Says:

        Thank you for the reply.
        I’m sorry for being unclear as I’m a Rails beginner.
        Yes, I’m using the ActiveRecord finder method. I have a search field that searches by project name. Well, the stored project name can be very long. It’s almost like a keyword search. (whoa)
        I think I just had an idea. I’m going to pursue that avenue.
        Thanks again.

  3. Marwa Says:

    It’s required from me to implement a simple spell checker
    using both algorithms (longest common subsequence & edit distance).
    So, what i asked for is how i can use both of them to get
    an optimized solution.

  4. biorelated Says:

    Also you might want to look at searchlogic gem (http://github.com/binarylogic/searchlogic) It might be more appropriate in your case.


Leave a Reply