In sequence analysis we often want to compare how similar two sequences are. How can we quantify that similarity with a metric? That was my question yesterday, so I went hunting for a Ruby implementation of such metrics. Luckily I found a library called amatch, which is an approximate string matching extension for Ruby! amatch implements the following metrics:
Hamming distance, Levenshtein edit distance, the longest subsequence common to two strings, the longest substring common to two strings, Sellers distance, and pair distance, which is based on the number of adjacent character pairs that are contained in both strings.
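Pair distance is the one metric in this list that does not get its own section, so here is a rough pure-Ruby sketch of the idea of comparing adjacent character pairs. It assumes a Dice-style score (twice the number of shared pairs divided by the total number of pairs); the helper names bigrams and pair_similarity are hypothetical, and amatch's own PairDistance may tokenize and score differently.

```ruby
# Hypothetical helpers for illustration only; not part of amatch.

# Split a string into its adjacent character pairs, e.g. "acgt" -> ["ac", "cg", "gt"].
def bigrams(str)
  str.chars.each_cons(2).map(&:join)
end

# Dice-style similarity over adjacent character pairs (an assumption,
# not necessarily the exact formula amatch uses). Returns a Float in 0.0..1.0.
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.length + pairs_b.length
  # Strings shorter than two characters have no pairs at all.
  return (a == b ? 1.0 : 0.0) if total.zero?

  remaining = pairs_b.dup
  # Count each pair of a that can be matched to a not-yet-used pair of b.
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    remaining.delete_at(idx) if idx
  end
  2.0 * shared / total
end

puts pair_similarity("night", "nacht")
```

A score of 1.0 means both strings contain exactly the same pairs, and 0.0 means they share none.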
Hamming distance
This is the number of positions at which the characters of two strings differ. It is not recommended for the majority of string-based information retrieval, because very similar strings can sometimes be given high Hamming distances. For example, shifting a string by a single character can make almost every position mismatch.
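To make the definition concrete, here is a minimal pure-Ruby sketch. The helper name hamming_distance is hypothetical and not part of amatch, and the sketch assumes both strings have the same length.

```ruby
# Hypothetical helper for illustration only; not part of amatch.
# Counts the positions at which two equal-length strings differ.
def hamming_distance(a, b)
  raise ArgumentError, "strings must have equal length" unless a.length == b.length

  a.chars.zip(b.chars).count { |x, y| x != y }
end

puts hamming_distance("karolin", "kathrin")   # three positions differ
# A one-character rotation of the same sequence mismatches at every position:
puts hamming_distance("acgtacgt", "cgtacgta")
```

The second call shows the weakness mentioned above: the two strings are clearly related, yet every one of the 8 positions counts as a difference.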
Levenshtein edit distance
This is defined as the minimal cost of transforming one string into another using deletions, insertions and substitutions of single characters. The algorithm can assign a cost to each of the operations; for this metric the cost is usually 1.
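As a rough illustration of how the edit distance can be computed, here is a small dynamic-programming sketch in plain Ruby with a cost of 1 for every operation. The name levenshtein_distance is hypothetical; this is not how amatch (a C extension) implements it.

```ruby
# Hypothetical helper for illustration only; not part of amatch.
# Classic row-by-row dynamic programming with unit costs.
def levenshtein_distance(a, b)
  # prev[j] is the distance between the already processed prefix of a
  # and the first j characters of b.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning i characters of a into the empty string takes i deletions
    b.each_char.with_index(1) do |cb, j|
      substitution_cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,                    # deletion
        curr[j - 1] + 1,                # insertion
        prev[j - 1] + substitution_cost # substitution (or match)
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein_distance("kitten", "sitting")
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.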
Longest common substring
This is defined as the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. The problem with this approach is that a difference introduced in the middle of one string shortens the common substring, and so increases the apparent distance, more than the same difference introduced at the beginning of the string.
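The position problem is easy to see with a small sketch. The function below is a hypothetical pure-Ruby helper (not amatch's implementation) that returns the first longest common substring it finds, using a simple dynamic-programming table.

```ruby
# Hypothetical helper for illustration only; not part of amatch.
# Returns the first longest contiguous substring shared by a and b.
def longest_common_substring(a, b)
  best_len = 0
  best_end = 0 # index in a just past the end of the best match so far
  # prev[j] is the length of the common run ending at the previous
  # character of a and the j-th character of b.
  prev = Array.new(b.length + 1, 0)
  a.each_char.with_index(1) do |ca, i|
    curr = Array.new(b.length + 1, 0)
    b.each_char.with_index(1) do |cb, j|
      next unless ca == cb

      curr[j] = prev[j - 1] + 1 # extend the run that ended one step earlier
      if curr[j] > best_len
        best_len = curr[j]
        best_end = i
      end
    end
    prev = curr
  end
  a[best_end - best_len, best_len]
end

puts longest_common_substring("abcdefgh", "Xbcdefgh") # change at the beginning
puts longest_common_substring("abcdefgh", "abcdXfgh") # same change in the middle
```

With the change at the beginning, 7 of the 8 characters still form one common substring; with the same change in the middle, the best common substring is only 4 characters long.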
Longest common subsequence
The longer the common subsequence is, the more similar the two strings will be. In this case a subsequence does not have to be contiguous.
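Here is a comparable sketch for the subsequence case, again in plain Ruby with a hypothetical helper name (lcs_length) that is not part of amatch. Because gaps are allowed, a single change in the middle of a string costs only one character.

```ruby
# Hypothetical helper for illustration only; not part of amatch.
# Returns the length of the longest (not necessarily contiguous)
# subsequence shared by a and b.
def lcs_length(a, b)
  # prev[j] is the LCS length of the processed prefix of a and the
  # first j characters of b.
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      curr << if ca == cb
                prev[j - 1] + 1            # extend the common subsequence
              else
                [prev[j], curr[j - 1]].max # skip a character of a or of b
              end
    end
    prev = curr
  end
  prev.last
end

puts lcs_length("abcdefgh", "abcdXfgh")
```

For the two strings in the example, the common subsequence is "abcdfgh", 7 characters long, even though the longest contiguous match is much shorter.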
Look at the documentation for more explanations of the metrics and algorithms.
To use the library you first need to install the gem. I installed it on my Linux box running Ubuntu and Ruby 1.8.6.
sudo gem install amatch
Then, in a script:
require 'rubygems'
require 'amatch'
require 'bio'
include Amatch

# With BioRuby it is easy to compare two sequence entries, for example:
seq_obj1 = Bio::Sequence.auto("actagatatttgat")
seq_obj2 = Bio::Sequence.auto("gccagatagttaat")

# Calculate the Hamming distance.
# (Uses seq rather than to_seq; see the correction in the comments below.)
m = Hamming.new(seq_obj1.seq)
m.match(seq_obj2.seq) #=>

# Calculate the pair distance between the two sequences.
pair_distance_obj = PairDistance.new(seq_obj1.seq)
pair_distance_obj.match(seq_obj2.seq) #=>

# Note that you can pass plain strings directly to the metric constructors without creating the sequence objects!
Note that amatch failed to install on Windows XP with the following error:
Building native extensions. This could take a while…
ERROR: Error installing amatch:
ERROR: Failed to build gem native extension.
C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile
nmake
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.
This happened even though I have nmake installed on my Windows machine. I will look at that later.
Happy string matching!
Just what I needed, thanks for the tip!
I don’t think that Bio::Sequence objects have a ‘to_seq’ method, but they do have a ‘seq’ method :)
-r
Hello,
Just a question, how could I add ‘amatch’ to a standard find method?
Thanks for any help.
John
Hi,
Do you mean adding it to an ActiveRecord finder method in Rails? Or which find method?
Thank you for the reply.
I’m sorry for being unclear as I’m a Rails beginner.
Yes, I’m using the ActiveRecord finder method. I have a search field that searches by project name. Well, the stored project name can be very long. It’s almost like a keyword search. (whoa)
I think I just had an idea. I’m going to pursue that avenue.
Thanks again.
Also, you might want to look at the searchlogic gem (http://github.com/binarylogic/searchlogic). It might be more appropriate in your case.
I am required to implement a simple spell checker
using both algorithms (longest common subsequence and edit distance).
So what I am asking is how I can use both of them to get
an optimized solution.
I have Windows XP, and I just installed the GnuWin32 make app, but I was not able to install the gem. This is what I got:
F:\Program Files\Ruby191\bin>gem install amatch
Building native extensions. This could take a while…
ERROR: Error installing amatch:
ERROR: Failed to build gem native extension.
“F:/Program Files/Ruby191/bin/ruby.exe” extconf.rb
creating Makefile
make
makefile:154: warning: overriding commands for target `F:/Program’
makefile:148: warning: ignoring old commands for target `F:/Program’
make: *** No rule to make target `”/F/Program’, needed by `amatch.o’. Stop.
Gem files will remain installed in F:/Program Files/Ruby191/lib/ruby/gems/1.9.1/
r inspection.
Results logged to F:/Program Files/Ruby191/lib/ruby/gems/1.9.1/gems/amatch-0.2.5
F:\Program Files\Ruby191\bin>
Any idea?
As I noted earlier, I was not able to install the gem on Windows using Ruby 1.8.6 at the time. Sorry, I never explored the cause of the issue, but I bet it has to do with the build and make programs needed to compile the source natively on Windows. For now I would urge you to try it on a Unix/Linux system. Windows and DOS are not good systems for this kind of stuff.