Standalone BLAST with Ruby revisited
Posted: December 15, 2009 Filed under: bioinformatics, blast, databases, ruby, tutorials 4 Comments »Earlier I showed a very simple way to perform a BLAST using Ruby. Today I would like to revisit that topic for two reasons.
- The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
- The original post was not very through.
BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.
The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely available from the NCBI website.The first thing to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST
Once installed on your system the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.
In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent at parsing the results file. Using Bioruby with BLAST is very straightforward:
#blasting the bioruby way #query_file: a list of query sequences in fasta format #database_path: a path to the actual BLAST formatted database #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
def bio_blast(program, database_path,query_file)
factory = Bio::Blast.local(program,database_path)
ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
ff.each do |entry|
report = factory.query(entry) # report will be a Blast::Report object
# iterate trough the hits
report.each do|hit|
puts hit.bit_score # bit score (*)
puts hit.query_seq # query sequence
puts hit.midline # middle line string of alignment of homologous region (*)
puts hit.target_seq # hit sequence
puts hit.evalue # E-value
puts hit.identity # % identity
puts hit.overlap # length of overlapping region
puts hit.query_id # identifier of query sequence
puts hit.query_def # definition(comment line) of query sequence
puts hit.query_len # length of query sequence
puts hit.target_id # identifier of hit sequence
puts hit.target_def # definition(comment line) of hit sequence
puts hit.target_len # length of hit sequence
puts hit.query_start # start position of homologous region in query sequence
puts hit.query_end # end position of homologous region in query sequence
puts hit.target_start # start position of homologous region in hit(target) sequence
puts hit.target_end # end position of homologous region in hit(target) sequence
puts hit.lap_at # array of above four numbers
hit.each do |hsp|
puts hsp.query_from
end
end
end
end
What’s new in Bioruby edge
Posted: January 4, 2009 Filed under: bioinformatics, bioruby, blast, databases, tutorials | Tags: bioruby, blast, code, ruby Leave a comment »Changes to Bio::Blast
Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!
Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
puts report.class
end
Unless blast_output2.txt is in XML format
In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff| ff.each do |rep| puts rep.statistics rep.iterations.each do |itr| puts itr.hits.size itr.hits.each_with_index do |hit,i| puts hit.hit_id puts hit.len end end end end
Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.
Changes to Bio:sequence
It is possible to create sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method
gb = Bio::GenBank.new(genbank_file.gb) gb.to_biosequence
Bio::SQL Support
Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using ActiveRecord.
#Make a connection connection = Bio::SQL.establish_connection(path_to_database.yaml,'development') #list databases databases =Bio::SQL.list_databases #retrieve a sequence sample_seq = Bio::SQL.fetch_accession('some_accession_number') #get number of seqeunces in the database puts Bio::SQL.list_entries #get references associated with an entry puts sample_seq.references #create an embl format puts sample_seq.to_biosequence.output(:embl)
Changes to Bio::GFF2 and Bio::GFF3
GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated. Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=
Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for formatted text.
Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a hash, use the to_hash method
To support data output for GFF2/GFF3, new methods have been added: Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s
Lots of other changes have been incorporated for the GFF classes and you can view the change log at github
CodeML parser
A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added. The class provides methods for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates provides simpel classes for accessing the codeml report and rates file. This example is from the example given in the source code
require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
# Reads newick tree from a file
tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
# Creates a Codeml object
codeml = Bio::PAML::Codeml.new
# Sets parameters
codeml.parameters[:runmode] = 0
codeml.parameters[:RateAncestor] = 1
# You can also set many parameters at a time.
codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
# Executes codeml with the alignment and the tree
report = codeml.query(alignment, tree)
Lots of Bugs have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!
Standalone BLAST with Ruby (windows)
Posted: October 3, 2007 Filed under: bioinformatics, blast, ruby, technology | Tags: bioinformatics, blast, ruby, technology 2 Comments »#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)