Standalone BLAST with Ruby revisited

Earlier  I showed a very simple way to perform a BLAST  using Ruby. Today I would like to revisit that topic for two reasons.

  1. The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
  2. The original post was not very through.

BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can  search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database  and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.

The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely  available from the NCBI website.The first thing  to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST

Once installed on your system  the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.

In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent  at parsing the results file. Using Bioruby with BLAST is  very straightforward:

#blasting the bioruby way
  #query_file: a list of query sequences in fasta format
  #database_path: a path to the actual BLAST formatted database
  #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
    def bio_blast(program, database_path,query_file)
        factory = Bio::Blast.local(program,database_path)

        ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
        ff.each do |entry|
           report = factory.query(entry) # report will be a Blast::Report object
          # iterate trough the hits
          report.each do |hit|
puts hit.bit_score        # bit score (*)
puts hit.query_seq        # query sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.midline          # middle line string of alignment of homologous region (*)
puts hit.target_seq       # hit sequence (TRANSLATOR'S NOTE: sequence of homologous region of query sequence)
puts hit.evalue           # E-value
puts hit.identity         # % identity
puts hit.overlap          # length of overlapping region
puts hit.query_id         # identifier of query sequence
puts hit.query_def        # definition(comment line) of query sequence
puts hit.query_len        # length of query sequence
puts hit.target_id        # identifier of hit sequence
puts hit.target_def       # definition(comment line) of hit sequence
puts hit.target_len       # length of hit sequence
puts hit.query_start      # start position of homologous region in query sequence
puts hit.query_end        # end position of homologous region in query sequence
puts hit.target_start     # start position of homologous region in hit(target) sequence
puts hit.target_end       # end position of homologous region in hit(target) sequence
puts hit.lap_at           # array of above four numbers
hit.each do |hsp| puts hsp.query_from end end end end

The method will execute BLAST and also print the hits and the high scoring potions start coordinates for each hit. How ever you may want to just run BLAST without the bioruby overhead. The line below will work as well:

  input = query_path
    #execute blast and store the results in the blast_results  variable
    #-p blast program to run
    #-d blast database to query against
    #-T gives a html output
    #-i query file path

  #execution
blast_result = %x(blastall -p #{program} -d #{database} -e #{expectation} -M #{matrix}
                 -i #{input} -T  T)
#blast_result will be the output from the system execution of the above command. You can choose to write it 
to a file or process it using the Bio::Blast::Report object.

You can use a similar style command like the one above to create BLAST databases using the formatdb command.

I would recommend the use of the bio-ruby blast report parsing classes to automate the process. Please look at the Bio-ruby API documentation for more details.

A ruby class for screen-scraping plasmodb database

Plasmodb is the primary resource for retrieving Plasmodium falciparum genomic data and information. Unfortunately this database has no API or XML service to request or query its  information from a programmer’s point of view or for easy automation of sequence information retrieval.  Recently I needed to download a long list of Plasmodium falciparum genomic, Protein and other information for a set of genes. Been lazy to click and open the webpage for each gene in my list. I wrote this in ruby.

It would be great if Plasmodb  would provide an easy way  of automated sequence retrieval. A webservice or an XML output format would do. Screen scraping is not a very efficient approach.  Here we use Scrapi which  is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.

#A class to fetch information from plasmodb using the scrapi API
##TODO handle  Scraper::Reader::HTTPUnspecifiedError
class Plasmodb
   #retrives a information  using the gene_id
   #returns a structure obj
  def fetch_by_gene_id(var_name)
    begin
      scraper = Scraper.define do
        process "div#genomicSequence pre",    :genomic_sequence  => :text
        process "div#transcriptSequence pre", :mrna_sequence =>:text
        process "div#proteinSequence pre",    :protein_sequence  =>:text
        process "div#Aliases td>table",       :aliases =>:text
        result :protein_sequence,:aliases,:mrna_sequence,:genomic_sequence
        end

     search_link="http://plasmodb.org/plasmo/showRecord.do?
               name=GeneRecordClasses.GeneRecordClass&source_id="+var_name+"&project_id=PlasmoDB"
     uri = URI.parse(search_link)
     @query = scraper.scrape(uri)

    rescue Scraper::Reader::HTTPUnspecifiedError
      "None"
    end
  end
  #returns the predicted protein sequence
  def protein_sequence
    @query.protein_sequence.chomp
  end
#  Returns the genomic sequence
  def genomic_sequence
    @query.genomic_sequence.chomp
  end
  #returns Aliases
  def aliases
    @query.aliases
  end
  #returns the mrna sequence
  def mrna_sequence
    @query.mrna_sequence.chomp
  end
end

#Use the class to fetch information.
require 'rubygems'
require 'bio'
require 'scrapi'

file = "/home/george/genes_list.txt" #a file containing a list of accession numbers.
#one accession number per line

plasmo = Plasmodb.new #initialize a plasmodb class instance

#Read the file and process each accession number.
File.readlines(file).each do |line|
  line.chomp!
  plasmo.fetch_by_gene_id(line)  #fetches the information from Plasmodb.
  #print a fasta entry for the protein sequence
  puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta,:header=>line)
  puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta,:header=>line)
end

#another example
#p = Plasmodb.new
#p.fetch_by_gene_id("PFD0020c")
#puts p.genomic_sequence

A Ruby algorithms resource

Ruby quiz resourceAn algorithm is a procedure to accomplish a specific task. They solve general well specified problems and are the ideas behind computer programs.

Rubyquiz is an interesting repository for ruby programs that implement queit interesting algorithms. Even though the programs may not have anything to do with biology, some of the algorithms definitely do. It is a good place to a find a start point for your ruby algorithm implementation. Browsing the different sets of problems can give a lot of insight on how to approach some common programming problems eg. writing an inference engine, a hidden markov chain , dictionary matcher etc

The site has about 147 quizzes as of this writing. Take a look!

A ruby micro review

ruby logoRuby is a reflective, dynamic, object-oriented programming language, created by Yukihiro Matsumoto and released to the public in 1995. It is an extremely pragmatic language, less concerned with formalities and more concerned with ease of development and valid results. You will see Agile principles running through ruby and particularly with rails. Most of all TDD and BDD concepts / philosophies have been implemented for ruby developers. Ruby differs from most programming languages by syntax, culture, grammar and customs. It has more in common with LISP and Smalltalk than with most languages such as C++ and PHP.

If you can program in languages such as Perl, PHP, C or Pascal, using and learning ruby is quite easy, but the problem solving pespectives that ruby uses may throw you out at first.

The so popular and hyped ruby on rails DSL (domain specific language) is a framework for developing web applications and currently powers hundreds of large websites around the world.

Bioruby is an excellent bioinformatics library for ruby. Though not highly documented like its sister, bioperl, efforts are been made to improve its level of documentation. The bioruby community is also really nice and friendly. Not a single question that i have posted on the mailing list goes unanswered.

Hundreds of libraries for performing different tasks have been written for ruby , packaged as gems and hosted at rubyforge

So far my favorite ruby editor is the netbeans IDE, whose currently release is in beta 2. The final release is slated for 3rd of Dec 2007. (Am waiting!). It features auto completion, syntax highlighting among other cool things that makes programming a joy. It also comes bundled with the jruby release, a java implementation of ruby that is starting to rock the world, so you can choose to use either native ruby or jruby, the choice is all yours!

Ruby can be downloaded here

Standalone BLAST with Ruby (windows)

Updated article

BLAST is one of the most widely used search algorithms in molecular biology. So lets see how you can run and retrieve blast results via a simple Ruby script .I will assume you already have ruby 1.8.5 and above installed in your windows box and a standalone blast.exe which you can download from the NCBI’s ftp site here . The latest windows binaries as of this writing is 2.2.17. Create a new folder in C and call it NCBI_Blast. Paste the downloaded blast program in this folder. Double click the blast program and it will create a bin, doc and data folders inside your your NCBI_Blast folder. If this is your first time to install blast in your machine. You will need to do a little configuration. Follow these instructions for setting up blast .
#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)