Standalone BLAST with Ruby revisited

Earlier  I showed a very simple way to perform a BLAST  using Ruby. Today I would like to revisit that topic for two reasons.

  1. The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
  2. The original post was not very through.

BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can  search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database  and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.

The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely  available from the NCBI website.The first thing  to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST

Once installed on your system  the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.

In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent  at parsing the results file. Using Bioruby with BLAST is  very straightforward:

#blasting the bioruby way
  #query_file: a list of query sequences in fasta format
  #database_path: a path to the actual BLAST formatted database
  #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
    def bio_blast(program, database_path,query_file)
        factory = Bio::Blast.local(program,database_path)

        ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
        ff.each do |entry|
           report = factory.query(entry) # report will be a Blast::Report object
          # iterate trough the hits
          report.each do |hit|
puts hit.bit_score        # bit score (*)
puts hit.query_seq        # query sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.midline          # middle line string of alignment of homologous region (*)
puts hit.target_seq       # hit sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.evalue           # E-value
puts hit.identity         # % identity
puts hit.overlap          # length of overlapping region
puts hit.query_id         # identifier of query sequence
puts hit.query_def        # definition(comment line) of query sequence
puts hit.query_len        # length of query sequence
puts hit.target_id        # identifier of hit sequence
puts hit.target_def       # definition(comment line) of hit sequence
puts hit.target_len       # length of hit sequence
puts hit.query_start      # start position of homologous region in query sequence
puts hit.query_end        # end position of homologous region in query sequence
puts hit.target_start     # start position of homologous region in hit(target) sequence
puts hit.target_end       # end position of homologous region in hit(target) sequence
puts hit.lap_at           # array of above four numbers
hit.each do |hsp| puts hsp.query_from end end end end

The method will execute BLAST and also print the hits and the high scoring potions start coordinates for each hit. How ever you may want to just run BLAST without the bioruby overhead. The line below will work as well:

  input = query_path
    #execute blast and store the results in the blast_results  variable
    #-p blast program to run
    #-d blast database to query against
    #-T gives a html output
    #-i query file path

  #execution
blast_result = %x(blastall -p #{program} -d #{database} -e #{expectation} -M #{matrix}
                 -i #{input} -T  T)
#blast_result will be the output from the system execution of the above command. You can choose to write it 
to a file or process it using the Bio::Blast::Report object.

You can use a similar style command like the one above to create BLAST databases using the formatdb command.

I would recommend the use of the bio-ruby blast report parsing classes to automate the process. Please look at the Bio-ruby API documentation for more details.

What’s new in Bioruby edge

Changes to Bio::Blast

Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!

Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an  -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work

require 'rubygems'
require 'bio'

report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
  puts report.class
end

Unless blast_output2.txt is in XML format

In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,

require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff|
ff.each do |rep|
   puts rep.statistics
   rep.iterations.each do |itr|
      puts itr.hits.size

    itr.hits.each_with_index do |hit,i|
      puts hit.hit_id
      puts hit.len
     end
   end
 end
end

Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.

Changes to Bio:sequence

It is possible to create  sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method

gb = Bio::GenBank.new(genbank_file.gb)
gb.to_biosequence

Bio::SQL Support

Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using  ActiveRecord.

#Make a connection

connection = Bio::SQL.establish_connection(path_to_database.yaml,'development')
#list databases

databases =Bio::SQL.list_databases

#retrieve a sequence
sample_seq = Bio::SQL.fetch_accession('some_accession_number')

#get number of seqeunces in the database
puts Bio::SQL.list_entries

#get references associated with an entry
puts sample_seq.references

#create an embl format
puts sample_seq.to_biosequence.output(:embl)

Changes to Bio::GFF2 and Bio::GFF3

GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated.  Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=

Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for  formatted text.

Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a  hash, use the to_hash method

To support data output for GFF2/GFF3, new methods have been added:  Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s

Lots of other changes have been incorporated for the GFF classes and you can view the change log at github

CodeML parser

A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added.  The class provides methods  for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates  provides simpel classes  for accessing the codeml report and rates file.  This example is from the example given in the source code

require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
     alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
     # Reads newick tree from a file
     tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
  # Creates a Codeml object
  codeml = Bio::PAML::Codeml.new
     # Sets parameters
     codeml.parameters[:runmode] = 0
     codeml.parameters[:RateAncestor] = 1
     # You can also set many parameters at a time.
   codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
     # Executes codeml with the alignment and the tree
     report = codeml.query(alignment, tree)

Lots of Bugs  have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!



Standalone BLAST with Ruby (windows)

Updated article

BLAST is one of the most widely used search algorithms in molecular biology. So lets see how you can run and retrieve blast results via a simple Ruby script .I will assume you already have ruby 1.8.5 and above installed in your windows box and a standalone blast.exe which you can download from the NCBI’s ftp site here . The latest windows binaries as of this writing is 2.2.17. Create a new folder in C and call it NCBI_Blast. Paste the downloaded blast program in this folder. Double click the blast program and it will create a bin, doc and data folders inside your your NCBI_Blast folder. If this is your first time to install blast in your machine. You will need to do a little configuration. Follow these instructions for setting up blast .
#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)