Most often in sequence analysis we want to compare how similar two sequences are. How can we quantify similarity by using a metric? That was my question yesterday and I went hunting for a ruby implementation for such metrics. Luckily I got a library called amatch which is an approximate string matching extension for ruby! amatch implements the following metrics:
Hamming distance, Levenshtein edit distance,longest subsequence common to two strings,longest substring common to two strings,sellers distance and pair distance which is based on the number of adjacent character pairs, that are contained in two strings.
Hamming distance
This is the number of characters that are different between two strings. This is not recommended for the majority of string based information retrieval. Very similar strings can sometimes be given high hamming distances.
Leveshtein edit distance
Is defined as the minimal costs involved in transforming one string into another by using deletion, insertion and substitution of a character to one of the strings. The algorithm can associate a cost for performing each of the operations and for this metric it is usually 1.
Longest common substring
This is define as the contiguous chain of characters that exists in both strings. The longer the substring the better the match between the two strings. The problem with this approach is that if a difference was introduced in the middle of one string, the distance will be longer that if the same difference was introduced at the beginning of one of the strings.
Longest common Subsequence
The longer the common sub sequence is, the more similar the two strings will be. In this case a sub sequence does not have to be contiguous.
Look at the documentation for more explanations of the metrics and algorithms.
To use the library you need to first install the gem. I installed it on my Linux box running Ubuntu and ruby 1.8.6.
sudo gem install amatch
Then in script,
require 'rubygems' require 'amatch'
include Amatch require 'bio'
#with bioruby it would be easy to compare two sequence entries for example
seq_obj1 = Bio::Sequence.auto("actagatatttgat") seq_obj2 = Bio::Sequence.auto("gccagatagttaat") #calculate the hamming distance m = Hamming.new(seq_obj1.to_seq) m.match(seq_obj2.to_seq) #=> #calculate pair-distances between the two sequences pair_distance_obj = PairDistance.new(seq_obj1.seq) pair_distance_obj.match(seq_obj2.seq) #=> # note that you can just substitute the strings directly to the metric object creation method without creating the sequence objects!
Note that amatch failed to install on windows XP with the following error
Building native extensions. This could take a while…
ERROR: Error installing amatch:
ERROR: Failed to build gem native extension.
C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile
nmake
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.
Although i have nmake installed on my windows machine. I will look at that later.
Happy string matching!

January 8, 2009 at 9:27 am
Just what I needed, thanks for the tip!
I don’t think that Bio::Sequence objects have a ‘to_seq’ method, but they do have a ’seq’ method :)
-r
December 8, 2009 at 7:09 pm
Hello,
Just a question, how could I add ‘amatch’ to a standard find method?
Thanks for any help.
John
December 9, 2009 at 1:03 pm
Hi,
Do you mean adding it to an activerecord finder method in rails? or which find method?
December 9, 2009 at 1:46 pm
Thank you for the reply.
I’m sorry for being unclear as I’m a Rails beginner.
Yes, I’m using the ActiveRecord finder method. I have a search field that searches by project name. Well, the stored project name can be very long. It’s almost like a keyword search. (whoa)
I think I just had an idea. I’m going to pursue that avenue.
Thanks again.
December 25, 2009 at 12:24 am
It’s required from me to implement a simple spell checker
using both algorithms (longest common subsequence & edit distance).
So, what i asked for is how i can use both of them to get
an optimized solution.
December 9, 2009 at 1:53 pm
Also you might want to look at searchlogic gem (http://github.com/binarylogic/searchlogic) It might be more appropriate in your case.