Bio-graphics, BioSQL and Rails part 1
Posted: January 7, 2009 Filed under: biographics, bioinformatics, bioruby, databases, ruby on rails, tutorials | Tags: bioruby 9 Comments »In these series I will show you how to quickly add graphics support to a bioinformatics database rails application. We are going to use the biographics library by Jan Aerts, the BioSQL database schema, and rails 2.2.2 (also works with 2.3.2) In this simple example we want to represent a sequence as a graphic, such that we can view it in a web browser more or less the way Gbrowse works. Each main feature has different subfeatures at different locations along it.
——————————————————————- main feature
——- ——– ———— ——– —— subfeatures
We need to have the following installed, rails 2.1.1, bio 2.1, biographics 1.4 all available as gems and a database based on BioSQL schema.
We need to download the BioSQL schema located here. The latest version as of this writing is BioSQL v1.0 (code-named Tokyo) release, v1.0.1. Create a database called biosql_development. I am on Ubuntu Linux with Mysql 5.0.
george:>mysql -u george -p :enter password Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 27 Server version: 5.0.51a-3ubuntu5.4 (Ubuntu) Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> create database biosql_development; Query OK, 1 row affected (0.02 sec) mysql>
I have created a database called biosql_development. Why am i not using migrations? The reason is that BioSQL has some agreed standards on table names and schema convection which are not compatible with rails database creation and table naming conventions. However Rails allows us to override these default convections, when working with legacy databases, as will be our case.
After creating the database, load the BioSQL schema to the empty database. First we need to tell mysql which database to use.
mysql> use biosql_development;
then load the schema
mysql> source /home/george/Desktop/downloadsfolder/biosql-1.0.1/sql/biosqldb-mysql.sql; Query OK, 0 rows affected, 1 warning (0.48 sec) Query OK, 0 rows affected (0.15 sec) Records: 0 Duplicates: 0 Warnings: 0 Query OK, 0 rows affected, 1 warning (0.01 sec) Query OK, 0 rows affected (0.03 sec) Records: 0 Duplicates: 0 Warnings: 0 Query OK, 0 rows affected, 1 warning (0.01 sec) ........ trucated mysql>
Now we need to create a rails application and connect to this database.
I use the Netbeans IDE development environment for creating ruby and rails applications. Go ahead and create a rails application and specify to use mysql as the database adapter.
To connect to our legacy database, we need to override some convections. First disable table plurulization, and tell rails that the table primary name is named as tablename_id as opposed to just the id column expected by rails. To do that
Create a new file in your application configurations/initializers directory called override_rails.rb (you can call it whatever).
class ActiveRecord::Base self.pluralize_table_names = false self.primary_key_prefix_type = :table_name_with_underscore end
The two lines above tells ActiveRecord not to expect the table names to be plural and that the primary key for each table is named as tablename_id format.
Also create another one called external_libraries.rb in the initializers directory, as you can tell this is where I want to put my require statements for loading external libraries.
require 'rubygems'
#load the bioinformatics library
require 'bio'
#load the biographics library
require 'bio-graphics'
#load the sql views extension library
gem 'rails_sql_views'
require 'rails_sql_views'
This file loads our gems. The rails_sql_views gem allows us to create views and access them by creating models corresponding to the views.
At this point if you run rake db:schema:dump, we will have a rails based BioSQL schema and which we can conveniently use to create a BioSQL database on any Relational database that rails supports and this includes Microsoft SQL server, DB2, Oracle, SQLlite and a host of others. All that would be required is to change the database.yml file to suit the adapter of choice and then execute rake db:schema:load to load the BioSQL schema.
Please note that if your are using rails 2.2.2, you may want to comment the lines
unless Kernel.respond_to?(:gem)
Kernel.send :alias_method, :gem, :require_gem
end
in rails_sql_views(0.6.1), otherwise running db:schema:dump will cause rake to abort.
In the next part I will describe how to create the necessary resources for our RESTful(Representational State Transfer) bioinformatics web application and rendering of the graphics.
Approximate string matching metrics with amatch
Posted: January 6, 2009 Filed under: algorithms, bioinformatics, bioruby, tutorials 8 Comments »Most often in sequence analysis we want to compare how similar two sequences are. How can we quantify similarity by using a metric? That was my question yesterday and I went hunting for a ruby implementation for such metrics. Luckily I got a library called amatch which is an approximate string matching extension for ruby! amatch implements the following metrics:
Hamming distance, Levenshtein edit distance,longest subsequence common to two strings,longest substring common to two strings,sellers distance and pair distance which is based on the number of adjacent character pairs, that are contained in two strings.
Hamming distance
This is the number of characters that are different between two strings. This is not recommended for the majority of string based information retrieval. Very similar strings can sometimes be given high hamming distances.
Leveshtein edit distance
Is defined as the minimal costs involved in transforming one string into another by using deletion, insertion and substitution of a character to one of the strings. The algorithm can associate a cost for performing each of the operations and for this metric it is usually 1.
Longest common substring
This is define as the contiguous chain of characters that exists in both strings. The longer the substring the better the match between the two strings. The problem with this approach is that if a difference was introduced in the middle of one string, the distance will be longer that if the same difference was introduced at the beginning of one of the strings.
Longest common Subsequence
The longer the common sub sequence is, the more similar the two strings will be. In this case a sub sequence does not have to be contiguous.
Look at the documentation for more explanations of the metrics and algorithms.
To use the library you need to first install the gem. I installed it on my Linux box running Ubuntu and ruby 1.8.6.
sudo gem install amatch
Then in script,
require 'rubygems' require 'amatch'
include Amatch require 'bio'
#with bioruby it would be easy to compare two sequence entries for example
seq_obj1 = Bio::Sequence.auto("actagatatttgat") seq_obj2 = Bio::Sequence.auto("gccagatagttaat") #calculate the hamming distance m = Hamming.new(seq_obj1.to_seq) m.match(seq_obj2.to_seq) #=> #calculate pair-distances between the two sequences pair_distance_obj = PairDistance.new(seq_obj1.seq) pair_distance_obj.match(seq_obj2.seq) #=> # note that you can just substitute the strings directly to the metric object creation method without creating the sequence objects!
Note that amatch failed to install on windows XP with the following error
Building native extensions. This could take a while…
ERROR: Error installing amatch:
ERROR: Failed to build gem native extension.
C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile
nmake
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.
Although i have nmake installed on my windows machine. I will look at that later.
Happy string matching!
What’s new in Bioruby edge
Posted: January 4, 2009 Filed under: bioinformatics, bioruby, blast, databases, tutorials | Tags: bioruby, blast, code, ruby Leave a comment »Changes to Bio::Blast
Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!
Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
puts report.class
end
Unless blast_output2.txt is in XML format
In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff| ff.each do |rep| puts rep.statistics rep.iterations.each do |itr| puts itr.hits.size itr.hits.each_with_index do |hit,i| puts hit.hit_id puts hit.len end end end end
Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.
Changes to Bio:sequence
It is possible to create sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method
gb = Bio::GenBank.new(genbank_file.gb) gb.to_biosequence
Bio::SQL Support
Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using ActiveRecord.
#Make a connection connection = Bio::SQL.establish_connection(path_to_database.yaml,'development') #list databases databases =Bio::SQL.list_databases #retrieve a sequence sample_seq = Bio::SQL.fetch_accession('some_accession_number') #get number of seqeunces in the database puts Bio::SQL.list_entries #get references associated with an entry puts sample_seq.references #create an embl format puts sample_seq.to_biosequence.output(:embl)
Changes to Bio::GFF2 and Bio::GFF3
GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated. Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=
Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for formatted text.
Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a hash, use the to_hash method
To support data output for GFF2/GFF3, new methods have been added: Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s
Lots of other changes have been incorporated for the GFF classes and you can view the change log at github
CodeML parser
A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added. The class provides methods for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates provides simpel classes for accessing the codeml report and rates file. This example is from the example given in the source code
require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
# Reads newick tree from a file
tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
# Creates a Codeml object
codeml = Bio::PAML::Codeml.new
# Sets parameters
codeml.parameters[:runmode] = 0
codeml.parameters[:RateAncestor] = 1
# You can also set many parameters at a time.
codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
# Executes codeml with the alignment and the tree
report = codeml.query(alignment, tree)
Lots of Bugs have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!
Bioruby workshop
Posted: April 17, 2008 Filed under: tutorials Leave a comment »
I would like to announce the first Bioruby workshop here in East Africa.
Please note the following important points:
1. The course will be held on Thursday 22nd May at the International Livestock Research Institute (ILRI) Nairobi-Kenya
2. Send your application form to bioinfoafrica@gmail.com. Applications not sent to this EMAIL ADDRESS WILL NOT be processed.
3. Deadline for application is May 1st 2008, successful applicants will be notified by 8th of May 2008.
4. Currently we are not able to offer travel fellowships for members outside Nairobi or Kenya. However, this is just the beginning of RSG East Africa getting organized to have training for its members and we hope in due course that such initiative will attract funding for travel.
Bioruby mini-series: The Bio::Sequence::Common class
Posted: February 12, 2008 Filed under: bioinformatics, bioruby, tutorials | Tags: bioruby, code, ruby 6 Comments »
Sequence Transformation
Lets have a look at the Bio::Sequence::Common class module which provides us with most of the sequence transformation methods for biological sequences.
Bio::Sequence::Common
Implements methods which are common to both Bio::Sequence::AA and Bio::Sequence::NA, for example
A Bio::Sequence object is easily created like this;
require ‘bio’
my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat
my_dna is now a Bio::sequence object and you can use the various methods available for this class, which we are going to explore shortly.
Bio::Sequence::Common Non Modifying methods
-
to_s
This method returns a sequence as a string. It does not modify the original sequence.
puts my_dna.to_s #=> actagatatttgat
puts my_dna.to_s.class #=> String
An alias for this method is the to_str method.
my_dna.to_str
#=> actagatatttgat
-
seq
This method will return a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example if you wished to assign a new instance of my_dna object that we created above ,such that you have a my_dna2 object, you would create that as follows,
my_dna2 = my_dna.seq
puts my_dna2 #=> actagatatttgat
puts my_dna2.class #=>
Bio::Sequence::NA
Bio::Sequence::Common modifying methods
-
Normalize!
This method removes all the white space and transforms all positions to uppercase if the sequence is an amino acid (AA) or transforms all positions to lowercase if the sequence is a nucleic acid (NA) sequence, leaving the original sequence modified
For example
test_seq = Bio::Sequence::NA.new(“ACTG”)
puts test_seq.normalize! #=>
actg
-
Concatenating
Many times we want to append a new sequence or a set of bases/residues eg a poly A sequence to the end of a new sequence and modify the original sequence. This is achieved by the concat method.
It is also referred to as << method.test_seq = Bio::Sequence::NA.new(“actg”)
test_seq << “acagat”
test_seq concat “acagat”
puts test_seq #=>
actgacagat
Note that to create a new sequence that adds to an existing sequence without altering the original sequence you would use the + operator. It accepts a variable number of arguments. For example
test_seq = Bio::Sequence::NA.new(“actg”)
test_seq2 = test_seq + (“cttcccttttt” “tatatata”)
puts test_seq2 #=>
actgcttcccttttttatatata
puts test_seq #=>actg
Working with subsequences
Please note that biological sequence numbering convections are one based as opposed to ruby’s zero based. Biological coordinate’s convection for BioSQL and Chado is zero based.
-
Subseq
This method returns a new sequence containing the subsequence identified by the start and end values given as parameters. This method works in a similar way to the slice string method. For example
my_seq = Bio::Sequence::NA.new(“agggatttc”)
puts my_seq.subseq(2,5) #=>
ggga
The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers
When this method is used without arguments, the start defaults to 1 and the end defaults to the last element of the string. Therefore when subseq is called without any arguments, it returns a new sequence similar to the original sequence.
puts my_seq.subseq #=> agggatttc
-
window_search
This method is typically used with a block. The method is called if you wanted to step through a sequence given a length of a subsequence. Therefore the method accepts two arguments. Step_size which defines the size of your ‘steps’ and the window_size which defines the length of the stepping subsequence. Any remaining sequence at the terminal end will be returned. The default step size is one since its an optional argument.
For example
To print the average GC% on each 100bp you can write,
s.window_search(100) do |subseq|
puts subseq.gc
end
Bioruby mini-series: The Sequence class
Posted: November 23, 2007 Filed under: bioruby, tutorials 3 Comments »Bioruby is a bioinformatics ruby package for analysis of biological sequences. In my quest to become a bioruby guru i have decided to poke the bioruby API and all available tutorials to better understand this fantastic library written by the bioruby team of developers. My journey will be logged here as the bioruby mini series. We start with an introductory overview of the sequence class.
To use the library you need to have a ruby interpreter installed , preferably ruby 1.8.5 and above . To install bioruby as a gem, do:
sudo gem install bio
This will install Bioruby version 1.1.0 and it comes with its own shell as well.
Type bioruby on the command prompt and you will see this:
Loading config (/.bioruby/shell/session/config) … done
Loading object (/.bioruby/shell/session/object) … done
Loading history (/.bioruby/shell/session/history) … done
. . . B i o R u b y i n t h e s h e l l . . .
Version : BioRuby 1.1.0 / Ruby 1.8.6
bioruby>
Now we ready to rock and roll! I dug in to the API and extracted some useful information for us.
The Bio::Sequence class
This is the primary sequence class and deals with sequence translation and transformations. It inherits from ruby’s string class which means that you can use ruby’s string methods with the Bio::Sequence class just like you would with a string.
The Bio::Sequence class object is a wrapper around the actual sequence and it is represented as either a Bio::Sequence::NA or a Bio::Sequence::AA. and responds to all the methods that are defined for both NA and AA classes. This class has the following methods:
- auto – This will guess the type of sequence provided and return the appropriate Bio:Sequence class for the given string, either a Bio::Sequence::AA or a Bio::Sequence::NA
- new – Creates a new Bio::Sequence object. It does not initialize the object in to any of the bioruby objects. It returns a string.
- aa – Will transform your current Bio::Sequence object to a Bio::Sequence::AA object. It will change your current object i.e it will transform a Bio::Sequence::NA to a Bio::Sequence::AA which is undesirable. So it needs to be used only when you are sure of the type of sequence you are working with.
- na – works the same as the aa method above but the returned object is a Bio::Sequence::NA
- output – It returns a string with the current Bio::Sequence object formatted with the given style. The supported styles are fasta, genbank and embl. The style argument is passed as a ruby symbol eg :fasta
- to_s – it returns the sequence as a string leaving the original sequence unaltered. The to_str is an alias for this method
Bio::Sequence::NA class
This class wraps a nucleic acid sequence. It provides a number of methods to work with a DNA sequence as demonstrated in the example below.
Dr Optimist has finally finished his long awaited sequencing project code named Sikwensi. The nucleic acid sequence for a chromosome for which he won’t reveal any further details is shown below.
“gacagatggacatggactagagctgct”
He calls his trusted ruby programmer to help analyze the sequence and tear it base by base. The guy gets to work.
require ‘bio’
bio_seq = Bio::Sequence.auto( ‘gacagatggacatggactagagctgct’) #=> bio_seq is now a Bio::Sequence::NA object
#get the number of codons in the sequence
bio_seq.window_search(3,3) {|codon| puts codon}
# complemental sequence
bio_seq.complement (Bio::Sequence::NA object)
# gets subsequence of positions 4 to 14
bio_seq.subseq(4,14) # he thinks the subsequence is interesting and worth extracting!
bio_seq.gc_percent #what is the gc content?
bio_seq.composition # nucleic acid compositions (returns a Hash)
bio_seq.translate # translation ( returns a Bio::Sequence::AA object)
bio_seq.translate(2) # translation from frame 2 (The default is frame 1)
bio_seq.translate(1,11) # using codon table No.11 (bacteria)
bio_seq.translate.codes # shows three-letter codes ( returns an Array)
bio_seq.translate.names # shows amino acid names (returns an Array)
bio_seq.translate.composition # amino acid compositions (returns a Hash)
bio_seq.translate.molecular_weight # calculating molecular weight (returns Float)
bio_seq.complement.translate # translation of complemental strand
A tutorial written by Katayama Toshiaki can be found here and translated to English by Naohisa Goto. (Thank you guys!)