Standalone BLAST with Ruby revisited

Earlier  I showed a very simple way to perform a BLAST  using Ruby. Today I would like to revisit that topic for two reasons.

  1. The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
  2. The original post was not very through.

BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can  search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database  and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.

The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely  available from the NCBI website.The first thing  to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST

Once installed on your system  the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.

In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent  at parsing the results file. Using Bioruby with BLAST is  very straightforward:

#blasting the bioruby way
  #query_file: a list of query sequences in fasta format
  #database_path: a path to the actual BLAST formatted database
  #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
    def bio_blast(program, database_path,query_file)
        factory = Bio::Blast.local(program,database_path)

        ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
        ff.each do |entry|
           report = factory.query(entry) # report will be a Blast::Report object
          # iterate trough the hits
          report.each do |hit|
puts hit.bit_score        # bit score (*)
puts hit.query_seq        # query sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.midline          # middle line string of alignment of homologous region (*)
puts hit.target_seq       # hit sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.evalue           # E-value
puts hit.identity         # % identity
puts hit.overlap          # length of overlapping region
puts hit.query_id         # identifier of query sequence
puts hit.query_def        # definition(comment line) of query sequence
puts hit.query_len        # length of query sequence
puts hit.target_id        # identifier of hit sequence
puts hit.target_def       # definition(comment line) of hit sequence
puts hit.target_len       # length of hit sequence
puts hit.query_start      # start position of homologous region in query sequence
puts hit.query_end        # end position of homologous region in query sequence
puts hit.target_start     # start position of homologous region in hit(target) sequence
puts hit.target_end       # end position of homologous region in hit(target) sequence
puts hit.lap_at           # array of above four numbers
hit.each do |hsp| puts hsp.query_from end end end end

The method will execute BLAST and also print the hits and the high scoring potions start coordinates for each hit. How ever you may want to just run BLAST without the bioruby overhead. The line below will work as well:

  input = query_path
    #execute blast and store the results in the blast_results  variable
    #-p blast program to run
    #-d blast database to query against
    #-T gives a html output
    #-i query file path

  #execution
blast_result = %x(blastall -p #{program} -d #{database} -e #{expectation} -M #{matrix}
                 -i #{input} -T  T)
#blast_result will be the output from the system execution of the above command. You can choose to write it 
to a file or process it using the Bio::Blast::Report object.

You can use a similar style command like the one above to create BLAST databases using the formatdb command.

I would recommend the use of the bio-ruby blast report parsing classes to automate the process. Please look at the Bio-ruby API documentation for more details.

A ruby class for screen-scraping plasmodb database

Plasmodb is the primary resource for retrieving Plasmodium falciparum genomic data and information. Unfortunately this database has no API or XML service to request or query its  information from a programmer’s point of view or for easy automation of sequence information retrieval.  Recently I needed to download a long list of Plasmodium falciparum genomic, Protein and other information for a set of genes. Been lazy to click and open the webpage for each gene in my list. I wrote this in ruby.

It would be great if Plasmodb  would provide an easy way  of automated sequence retrieval. A webservice or an XML output format would do. Screen scraping is not a very efficient approach.  Here we use Scrapi which  is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.

#A class to fetch information from plasmodb using the scrapi API
##TODO handle  Scraper::Reader::HTTPUnspecifiedError
class Plasmodb
   #retrives a information  using the gene_id
   #returns a structure obj
  def fetch_by_gene_id(var_name)
    begin
      scraper = Scraper.define do
        process "div#genomicSequence pre",    :genomic_sequence  => :text
        process "div#transcriptSequence pre", :mrna_sequence =>:text
        process "div#proteinSequence pre",    :protein_sequence  =>:text
        process "div#Aliases td>table",       :aliases =>:text
        result :protein_sequence,:aliases,:mrna_sequence,:genomic_sequence
        end

     search_link="http://plasmodb.org/plasmo/showRecord.do?
               name=GeneRecordClasses.GeneRecordClass&source_id="+var_name+"&project_id=PlasmoDB"
     uri = URI.parse(search_link)
     @query = scraper.scrape(uri)

    rescue Scraper::Reader::HTTPUnspecifiedError
      "None"
    end
  end
  #returns the predicted protein sequence
  def protein_sequence
    @query.protein_sequence.chomp
  end
#  Returns the genomic sequence
  def genomic_sequence
    @query.genomic_sequence.chomp
  end
  #returns Aliases
  def aliases
    @query.aliases
  end
  #returns the mrna sequence
  def mrna_sequence
    @query.mrna_sequence.chomp
  end
end

#Use the class to fetch information.
require 'rubygems'
require 'bio'
require 'scrapi'

file = "/home/george/genes_list.txt" #a file containing a list of accession numbers.
#one accession number per line

plasmo = Plasmodb.new #initialize a plasmodb class instance

#Read the file and process each accession number.
File.readlines(file).each do |line|
  line.chomp!
  plasmo.fetch_by_gene_id(line)  #fetches the information from Plasmodb.
  #print a fasta entry for the protein sequence
  puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta,:header=>line)
  puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta,:header=>line)
end

#another example
#p = Plasmodb.new
#p.fetch_by_gene_id("PFD0020c")
#puts p.genomic_sequence

PC vs Apple Mac (Not the war!)

My good old PC running Linux OS is coming of age and recently started failing. The Optical drive is not functional and occasionally it will freeze. The Top cover does not hold anymore and the graphical TFT screen needs to be supported carefully.

While this particular computer has served me well, I am at that point where i need a new machine but am torn between an Apple Mac and a PC running Linux. First my work involves the following aspects;

  1. Compiling and running bioinformatics software developed using open source standards and technologies
  2. Programming
  3. Word processing and document editing
  4. Occasional mathematical modeling
  5. Administering  Unix based servers

I have tried to come up with a computer-model agnostic specifications for my needs.

Hardware

* High Processor speed (2.60GHZ or above)

* High Memory (4GB or above)

* Medium Hard-disk space (160GB and above)

* Long Battery life ( 5hours and above)

* Durable external cover

* Ergonomic keys and mouse

* Support for multiple external devices(printers,Cameras,Microphones,Storage devices,monitors)

* Excellent support for wireless technologies

* Support for running multiple operating systems on the same machine

Size and weight

* Lightweight

* convenience while travelling while traveling

Operating System

* A Unix or Linux derived operating system

* Easy to upgrade at zero or minimal cost

* Free patches against known security holes and problems.

Software

* Support for Open software standards

* Support for Microsoft, Adobe and other proprietary software vendor’s products

Security

* Excellent inbuilt support against Malware, Trojans and viruses at minimal or no cost

* Support for locking the machine while away or against unauthorized login

* Ability to easily ‘tag’ the machine in case of theft

Price : Affordable and reasonable

Based on the above specifications I have evaluated two computers models that can satisfy the above needs.

1. A PC laptop computer running a Linux based operating system

2. An Apple Macintosh laptop computer

I have ruled out a Windows/DOS based Operating software because  Microsoft Windows based operating system cannot offer  support for open source standards and technologies. OS upgrade for windows is very expensive and the OS is highly prone to malware, Trojans and viruses. Most bioinformatics software and tools are developed on Unix or  Linux environment.

PC can support Linux installations even though one looses on hardware optimization. Linux has a relatively poor graphical user interface and functionality when compared to Mac OS or Windows. There is limited support for document processing, graphics and rich multimedia applications support. Linux does not support any of the Microsoft software applications natively. There are open source equivalents but most lack good support.

Apple Macintosh computers are based on Unix and open source technologies, they support both closed source and open source standards. The hardware is optimized and accelerated for the Apple mac OS. They offer excellent graphical user interface system, a powerful terminal for interaction with the OS, they are not prone to virus attacks, and they support long battery life as well as portability, ergonomics and a relatively within a  price range equivalent to a PC of the same specifications.

Given my budget constrains, I am thinking that a 2.53GHz Apple Macintosh 13 inch model with 4GB of memory is best for my needs. There is little price differences between the PC and Macintosh models based on my specifications. PC models do not favor Linux installations and Linux hardware support is not guaranteed. They however seem to have a more flexible price ranges depending on the manufacturers, vendors, quality and specifications.

I will keep Linux to run my server applications.

Bio-graphics, BioSQL and Rails part 2

In  part 1 of this series we created a rails application and connected it to a BioSQL database. We also overwrote the rails convections to accommodate our legacy schema.

To understand the BioSQL schema, please review the documentation here. A brief overview of is as follows. Every record we enter into our database is a ‘bioentry’ and goes to the bioenty table. A bioentry can be composed of the following entities: the record’s public name, public accession and version, its description and an identifier field.

The actual sequence data is stored in the biosequence table which contains raw sequence information associated with a bioentry, and alphabet information (‘protein’, ‘dna’, ‘rna’). This is because not all records in our database need to be associated with a raw sequence. Additional sequence information is stored in the seqfeature table together with other qualifiers.

The location of each seqfeature (or sub-seqfeature) is defined by a location entity, describing the stop and start coordinates and strand. This information is stored in the location table.

In our rails application we are going to create some models and a few controllers. In RESTful language, we are actually creating resources. In this example we will be very simplistic and just create a biodatabase, taxon, bioentry, biosequence, seqfeature, location resources. We will also create associations between them in their model classes. But before that delete the index.html file from your rails application public folder and add the following line to your configurations/routes.rb file

 map.root :controller => "biosequences"

To quickly create the models, controllers, associated views and a test suite for each of our resources, just run the rails generate scaffold command, passing the name of the model as an argument. For example,

generate scaffold Bioentry

will create a bioentry model, a bioentries_controller, associated views (index,show,edit and new), a migration file, though in our case we do not need it. When you finish scaffolding, the routes.rb file should have the following resources declared.

  map.resources :seqfeatures
  map.resources :locations
  map.resources :bioentries
  map.resources :biosequences
  map.resources :taxon
  map.resources :biodatabases

Let us create some mandatory associations for the models.

Edit the /models/biodatabase.rb file by adding the following

 has_many :bioentries #a biodatabase is associated with many bioentries
 validates_uniqueness_of :name  #The name foe each biodatabase is unique!

Edit the /models/bioentry.rb file by adding the following

    belongs_to :biodatabase
    belongs_to :taxon
    has_one :biosequence

Edit the /models/taxon.rb and add

   has_one :bioentry

Edit the /models/biosequence.rb file by adding:

  set_primary_key :bioentry_id #biosequence uses bioentry_id as a primary key!
  belongs_to :bioentry

edit the /models/location.rb file by adding:

 belongs_to :seqfeature

Edit the /models/seqfeature.rb file by adding:

  belongs_to :bioentry
  has_many :locations

Note that most likely you will be adding huge files to the database. BioSQL comes with a set of  perl scripts to enable you do that. Until bioruby 1.3 is released you will have to use the perl scripts to add huge datasets. All the documentation to do that is available from the BioSQL website. I used a perl script load_ncbi_taxonomy.pl to load taxon data to my database. This script comes with the BioSQL. (It did not seem to work on my system, I will sort that later)

To make this post shorter and get to the meat of it, i will assume that you have some existing data in your biosql database. If not, create some dummy data to populate, the biodatabase, bioentry,biosequence, seqfeature and location tables. In Part 3, I will show you how to create the necessary views to populate the database. After all biologists don’t want to interact with raw SQL queries and sometimes have no idea of running scripts, however they are very web savy!

Edit the /biosequences/show.html.erb to look as follows:

<h2><%= @biosequence.bioentry.name%>(<%= @biosequence.alphabet %>)</h2>
<p>Sequence</p>
<%= @biosequence.seq %><br/>


<%= link_to 'Edit', edit_biosequence_path(@biosequence) %> 

Now navigate to http://localhost:3000/biosequences/1

and then navigate to http://locahost:3000/biosequences/1.xml The XML version of your sequence is also available!

Lets add some ability to render graphics for the sequences.

Add the following lines at the top of the biosequence.rb model file

 require 'stringio'
 require 'base64' 

In the biosequence.rb model class, create a new method called draw_graphic.

def self.draw_graphic(value)
      #get the name and length of the main feature to be drawn
     main_feature = Bioentry.find(value)
     len = main_feature.biosequence.length.to_i
     name = main_feature.name

    #create a Biographics panel and add a track
      @my_panel = Bio::Graphics::Panel.new(len,:width=> 900)
      @track = @my_panel.add_track("#{name}",:glyph=>'directed_generic')

     #specify the range for the main feature
     main_feature_range = "1..#{len}"
      @track.add_feature(Bio::Feature.new("#{name}",main_feature_range), :label=>" ")

    #write the output to memory
        output = StringIO.new
        @my_panel.draw(output)
        return output.string
  end

This method will be called by an action method in biosequence_controller.rb file.

  def to_image
    begin
      image = Biosequence.draw_graphic(Biosequence.find(params[:id]))
      send_data(image, :filename => "graphic.svg", :disposition => "inline")
    rescue  ActiveRecord::RecordNotFound
      add_error("Error:Attempt to call image without specifying a biosequence  ID")
      redirect_to :action=>'index'
    end
  end

We add a rescue block to capture record not found errors. In RESTful applications a controller is limited to seven actions. So we need to add a collection to our biosequence resource in routes.rb. This is how we do it.

  map.resources :biosequences,:collection=>{:to_image=>:get}

Now we need to modify our /biosequences/show.html.erb file, to enable rendering of the graphic. For that we will create a helper method so that our show.html.erb view is ‘clean’. In helpers/biosequences_helper.rb file, add the following code

  def render_image(feature_obj)
     image_tag(url_for({:action=>'to_image',:id=>feature_obj}))
  end

And in the /views/biosequences/show.html.erb file add the following line of code

<%= render_image(@biosequence) %><br/>

Now assuming  that you have a biosql database with valid data, navigate to

http://localhost:3000/biosequences/show/1

screenshort

screenshort

The above is a screen shot from my example application while I was writing this tutorial.

The source code for this example  application is available from github

For a full review of the methods available for biographics please check the project’s git repository and the rdoc.

Bio-graphics, BioSQL and Rails part 1

In these series I will show you how to quickly add graphics support to a bioinformatics database rails application. We are going to use the biographics library by Jan Aerts, the BioSQL database schema, and rails 2.2.2 (also works with 2.3.2)  In this simple example we want to represent a sequence as a graphic, such that we can view it in a web browser more or less the way Gbrowse works. Each main feature has different subfeatures at different locations along it.

——————————————————————- main feature

——- ——– ———— ——– ——    subfeatures

We need to have the following installed, rails 2.1.1, bio 2.1, biographics 1.4 all available as gems and a database based on BioSQL schema.

We need to download the BioSQL schema located here. The latest version as of this writing is BioSQL v1.0 (code-named Tokyo) release, v1.0.1. Create a database called biosql_development. I am on Ubuntu Linux with Mysql 5.0.

george:>mysql -u george -p
:enter password
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 27
Server version: 5.0.51a-3ubuntu5.4 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database biosql_development;
Query OK, 1 row affected (0.02 sec)

mysql> 

I have created a database called biosql_development. Why am i not using migrations? The reason is that BioSQL has some agreed standards on table names and schema convection which are not compatible with rails database creation and table naming conventions. However Rails allows us to override these default convections, when working with legacy databases, as will be our case.

After creating the database, load the BioSQL schema to the empty database. First we need to tell mysql which database to use.

mysql> use biosql_development;

then load the schema

mysql> source /home/george/Desktop/downloadsfolder/biosql-1.0.1/sql/biosqldb-mysql.sql;
Query OK, 0 rows affected, 1 warning (0.48 sec)

Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected, 1 warning (0.01 sec)

 ........ trucated
mysql>

Now we need to create a rails application and connect to this database.

I use the Netbeans IDE development environment for creating ruby and rails applications. Go ahead and create a rails application and specify to use mysql as the database adapter.

To connect to our legacy database, we need to override some convections. First disable table plurulization, and tell rails that the table primary name is named as tablename_id as opposed to just the id column expected by rails. To do that

Create a new file in your application configurations/initializers directory called override_rails.rb (you can call it whatever).

 class ActiveRecord::Base
  self.pluralize_table_names = false

  self.primary_key_prefix_type = :table_name_with_underscore
 end

The two lines above tells ActiveRecord not to expect the table names to be plural and that the primary key for each table is named as tablename_id format.

Also create another one called external_libraries.rb in the initializers directory, as you can tell this is where I want to put my require statements for loading external libraries.

require 'rubygems'

#load the bioinformatics library
require 'bio'

#load the biographics library
require 'bio-graphics'

#load the sql views extension library
gem 'rails_sql_views'
require 'rails_sql_views'

This file loads our gems. The rails_sql_views gem allows us to create views and access them by creating models corresponding to the views.

At this point if you run rake db:schema:dump, we will have a rails based BioSQL schema and which we can conveniently use to create a BioSQL database on any Relational database that rails supports and this includes Microsoft SQL server, DB2, Oracle, SQLlite and a host of others. All that would be required is to change the database.yml file to suit the adapter of choice and then execute rake db:schema:load to load the BioSQL schema.

Please note that if your are using rails 2.2.2,  you may want to comment the lines

unless Kernel.respond_to?(:gem)
  Kernel.send :alias_method, :gem, :require_gem
end

in rails_sql_views(0.6.1), otherwise running db:schema:dump will cause rake to abort.

In the next part I will describe how to create the necessary resources for our RESTful(Representational State Transfer) bioinformatics web application and rendering of the graphics.


Approximate string matching metrics with amatch

Most often in sequence analysis we want to compare how  similar two sequences are. How can we quantify similarity by using a metric? That was my question yesterday and I went hunting for a ruby implementation for such metrics. Luckily I got a library called amatch which is an approximate string matching extension for ruby! amatch implements the following metrics:

Hamming distance, Levenshtein edit distance,longest subsequence common to two strings,longest substring common to two strings,sellers distance and pair distance which is based on the number of adjacent character pairs, that are contained in two  strings.

Hamming distance

This is the number of characters that are different between two strings. This is not recommended for the majority of string based information retrieval. Very similar strings can sometimes be given high hamming distances.

Leveshtein edit distance

Is defined as the minimal costs involved in transforming one string into another by using  deletion, insertion and substitution of a character to one of the strings. The algorithm can associate a cost for performing each of the operations and for this metric it is usually 1.

Longest common substring

This is define as the contiguous chain of characters that exists in both strings. The longer the substring the better the match between the two strings. The problem with this approach is that if a difference was introduced in the middle of one string, the distance will be longer that if the same difference was introduced at the beginning of one of the strings.

Longest common Subsequence

The longer the common sub sequence is, the more similar the two strings will be. In this case a sub sequence does not have to be contiguous.

Look at the documentation for more explanations of the metrics and algorithms.

To use the library you need to first install the gem. I installed it on my Linux box running Ubuntu and ruby 1.8.6.

sudo gem install amatch

Then in script,

require 'rubygems'

require 'amatch'
include Amatch
require  'bio'
#with bioruby it would be easy to compare two sequence entries  for example
seq_obj1 = Bio::Sequence.auto("actagatatttgat")
seq_obj2 = Bio::Sequence.auto("gccagatagttaat")

#calculate the hamming distance
 m = Hamming.new(seq_obj1.to_seq)
 m.match(seq_obj2.to_seq)
#=> 

#calculate pair-distances between the two sequences
pair_distance_obj = PairDistance.new(seq_obj1.seq)
pair_distance_obj.match(seq_obj2.seq)
 #=>
# note that you can just substitute the strings directly to the metric object creation method
without creating the sequence objects!

Note that amatch  failed to install on windows XP with the following error

Building native extensions.  This could take a while…
ERROR:  Error installing amatch:
ERROR: Failed to build gem native extension.

C:/ruby-1.8.6/ruby/bin/ruby.exe extconf.rb install amatch
creating Makefile

nmake
‘nmake’ is not recognized as an internal or external command,
operable program or batch file.

Although i have nmake installed on my windows machine. I will look at that later.

Happy string matching!


What’s new in Bioruby edge

Changes to Bio::Blast

Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!

Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an  -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work

require 'rubygems'
require 'bio'

report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
  puts report.class
end

Unless blast_output2.txt is in XML format

In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,

require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff|
ff.each do |rep|
   puts rep.statistics
   rep.iterations.each do |itr|
      puts itr.hits.size

    itr.hits.each_with_index do |hit,i|
      puts hit.hit_id
      puts hit.len
     end
   end
 end
end

Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.

Changes to Bio:sequence

It is possible to create  sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method

gb = Bio::GenBank.new(genbank_file.gb)
gb.to_biosequence

Bio::SQL Support

Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using  ActiveRecord.

#Make a connection

connection = Bio::SQL.establish_connection(path_to_database.yaml,'development')
#list databases

databases =Bio::SQL.list_databases

#retrieve a sequence
sample_seq = Bio::SQL.fetch_accession('some_accession_number')

#get number of seqeunces in the database
puts Bio::SQL.list_entries

#get references associated with an entry
puts sample_seq.references

#create an embl format
puts sample_seq.to_biosequence.output(:embl)

Changes to Bio::GFF2 and Bio::GFF3

GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated.  Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=

Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for  formatted text.

Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a  hash, use the to_hash method

To support data output for GFF2/GFF3, new methods have been added:  Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s

Lots of other changes have been incorporated for the GFF classes and you can view the change log at github

CodeML parser

A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added.  The class provides methods  for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates  provides simpel classes  for accessing the codeml report and rates file.  This example is from the example given in the source code

require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
     alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
     # Reads newick tree from a file
     tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
  # Creates a Codeml object
  codeml = Bio::PAML::Codeml.new
     # Sets parameters
     codeml.parameters[:runmode] = 0
     codeml.parameters[:RateAncestor] = 1
     # You can also set many parameters at a time.
   codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
     # Executes codeml with the alignment and the tree
     report = codeml.query(alignment, tree)

Lots of Bugs  have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!



Bioruby mini-series: The Bio::Sequence::Common class

Sequence Transformation

Lets have a look at the Bio::Sequence::Common class module which provides us with most of the sequence transformation methods for biological sequences.

Bio::Sequence::Common

Implements methods which are common to both Bio::Sequence::AA and Bio::Sequence::NA, for example

A Bio::Sequence object is easily created like this;

require ‘bio’

my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat

my_dna is now a Bio::sequence object and you can use the various methods available for this class, which we are going to explore shortly.

Bio::Sequence::Common Non Modifying methods

  • to_s

    This method returns a sequence as a string. It does not modify the original sequence.

    puts my_dna.to_s #=> actagatatttgat

    puts my_dna.to_s.class #=> String

    An alias for this method is the to_str method.

    my_dna.to_str
    #=> actagatatttgat

  • seq

    This method will return a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example if you wished to assign a new instance of my_dna object that we created above ,such that you have a my_dna2 object, you would create that as follows,

    my_dna2 = my_dna.seq

    puts my_dna2 #=> actagatatttgat

    puts my_dna2.class #=>
    Bio::Sequence::NA

Bio::Sequence::Common modifying methods

  • Normalize!

    This method removes all the white space and transforms all positions to uppercase if the sequence is an amino acid (AA) or transforms all positions to lowercase if the sequence is a nucleic acid (NA) sequence, leaving the original sequence modified

    For example

    test_seq = Bio::Sequence::NA.new(“ACTG”)

    puts test_seq.normalize! #=>
    actg

  • Concatenating

    Many times we want to append a new sequence or a set of bases/residues eg a poly A sequence to the end of a new sequence and modify the original sequence. This is achieved by the concat method.
    It is also referred to as << method.

    test_seq = Bio::Sequence::NA.new(“actg”)

    test_seq << “acagat”

    test_seq concat “acagat”

    puts test_seq #=>
    actgacagat

Note that to create a new sequence that adds to an existing sequence without altering the original sequence you would use the + operator. It accepts a variable number of arguments. For example

test_seq = Bio::Sequence::NA.new(“actg”)

test_seq2 = test_seq + (“cttcccttttt” “tatatata”)

puts test_seq2 #=>
actgcttcccttttttatatata

puts test_seq #=>actg

Working with subsequences

Please note that biological sequence numbering convections are one based as opposed to ruby’s zero based. Biological coordinate’s convection for BioSQL and Chado is zero based.

  • Subseq

    This method returns a new sequence containing the subsequence identified by the start and end values given as parameters. This method works in a similar way to the slice string method. For example

    my_seq = Bio::Sequence::NA.new(“agggatttc”)

    puts my_seq.subseq(2,5) #=>
    ggga

    The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers

    When this method is used without arguments, the start defaults to 1 and the end defaults to the last element of the string. Therefore when subseq is called without any arguments, it returns a new sequence similar to the original sequence.

    puts my_seq.subseq #=> agggatttc

  • window_search

    This method is typically used with a block. The method is called if you wanted to step through a sequence given a length of a subsequence. Therefore the method accepts two arguments. Step_size which defines the size of your ‘steps’ and the window_size which defines the length of the stepping subsequence. Any remaining sequence at the terminal end will be returned. The default step size is one since its an optional argument.

    For example

    To print the average GC% on each 100bp you can write,

    s.window_search(100) do |subseq|

    puts subseq.gc

    end

Plasmodium falciparum re-annotation workshop opens

The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger center. The Workshop runs till the 26th and aims to re-annotate the P. falciparum genome. In a welcome message Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.

The Plasmodium falciparum sequencing project was completed in 2002 and since then the Plasmodb database which is currently at version 5 has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce the number of hypothetical genes by providing annotations where known and possible.

Issues that will be addressed and visited include:

- standards for the use of structured gene ontologies in gene/genome annotation

- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc

- naming conventions for large gene families

- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data

- standards for transfering annotations to orthologs in other Plasmodium species

- plans and proposals for further Plasmodium sequencing and other genomics resources

- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc

- future requirements and needs for Plasmodium informatics resources

- annotation projects not completed during the workshop … and strategies for ensuring completion

The workshop is sponsored by Sanger institute and plasmodb

Standalone BLAST with Ruby (windows)

Updated article

BLAST is one of the most widely used search algorithms in molecular biology. So lets see how you can run and retrieve blast results via a simple Ruby script .I will assume you already have ruby 1.8.5 and above installed in your windows box and a standalone blast.exe which you can download from the NCBI’s ftp site here . The latest windows binaries as of this writing is 2.2.17. Create a new folder in C and call it NCBI_Blast. Paste the downloaded blast program in this folder. Double click the blast program and it will create a bin, doc and data folders inside your your NCBI_Blast folder. If this is your first time to install blast in your machine. You will need to do a little configuration. Follow these instructions for setting up blast .
#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)