What’s new in Bioruby edge

Changes to Bio::Blast

Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!

Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an  -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work

require 'rubygems'
require 'bio'

report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
  puts report.class
end

Unless blast_output2.txt is in XML format

In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,

require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff|
ff.each do |rep|
   puts rep.statistics
   rep.iterations.each do |itr|
      puts itr.hits.size

    itr.hits.each_with_index do |hit,i|
      puts hit.hit_id
      puts hit.len
     end
   end
 end
end

Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.

Changes to Bio:sequence

It is possible to create  sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method

gb = Bio::GenBank.new(genbank_file.gb)
gb.to_biosequence

Bio::SQL Support

Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using  ActiveRecord.

#Make a connection

connection = Bio::SQL.establish_connection(path_to_database.yaml,'development')
#list databases

databases =Bio::SQL.list_databases

#retrieve a sequence
sample_seq = Bio::SQL.fetch_accession('some_accession_number')

#get number of seqeunces in the database
puts Bio::SQL.list_entries

#get references associated with an entry
puts sample_seq.references

#create an embl format
puts sample_seq.to_biosequence.output(:embl)

Changes to Bio::GFF2 and Bio::GFF3

GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated.  Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=

Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for  formatted text.

Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a  hash, use the to_hash method

To support data output for GFF2/GFF3, new methods have been added:  Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s

Lots of other changes have been incorporated for the GFF classes and you can view the change log at github

CodeML parser

A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added.  The class provides methods  for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates  provides simpel classes  for accessing the codeml report and rates file.  This example is from the example given in the source code

require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
     alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
     # Reads newick tree from a file
     tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
  # Creates a Codeml object
  codeml = Bio::PAML::Codeml.new
     # Sets parameters
     codeml.parameters[:runmode] = 0
     codeml.parameters[:RateAncestor] = 1
     # You can also set many parameters at a time.
   codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
     # Executes codeml with the alignment and the tree
     report = codeml.query(alignment, tree)

Lots of Bugs  have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!




Bioruby mini-series: The Bio::Sequence::Common class

Sequence Transformation

Lets have a look at the Bio::Sequence::Common class module which provides us with most of the sequence transformation methods for biological sequences.

Bio::Sequence::Common

Implements methods which are common to both Bio::Sequence::AA and Bio::Sequence::NA, for example

A Bio::Sequence object is easily created like this;

require ‘bio’

my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat

my_dna is now a Bio::sequence object and you can use the various methods available for this class, which we are going to explore shortly.

Bio::Sequence::Common Non Modifying methods

  • to_s

    This method returns a sequence as a string. It does not modify the original sequence.

    puts my_dna.to_s #=> actagatatttgat

    puts my_dna.to_s.class #=> String

    An alias for this method is the to_str method.

    my_dna.to_str
    #=> actagatatttgat

  • seq

    This method will return a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example if you wished to assign a new instance of my_dna object that we created above ,such that you have a my_dna2 object, you would create that as follows,

    my_dna2 = my_dna.seq

    puts my_dna2 #=> actagatatttgat

    puts my_dna2.class #=>
    Bio::Sequence::NA

Bio::Sequence::Common modifying methods

  • Normalize!

    This method removes all the white space and transforms all positions to uppercase if the sequence is an amino acid (AA) or transforms all positions to lowercase if the sequence is a nucleic acid (NA) sequence, leaving the original sequence modified

    For example

    test_seq = Bio::Sequence::NA.new(“ACTG”)

    puts test_seq.normalize! #=>
    actg

  • Concatenating

    Many times we want to append a new sequence or a set of bases/residues eg a poly A sequence to the end of a new sequence and modify the original sequence. This is achieved by the concat method.
    It is also referred to as << method.

    test_seq = Bio::Sequence::NA.new(“actg”)

    test_seq << “acagat”

    test_seq concat “acagat”

    puts test_seq #=>
    actgacagat

Note that to create a new sequence that adds to an existing sequence without altering the original sequence you would use the + operator. It accepts a variable number of arguments. For example

test_seq = Bio::Sequence::NA.new(“actg”)

test_seq2 = test_seq + (“cttcccttttt” “tatatata”)

puts test_seq2 #=>
actgcttcccttttttatatata

puts test_seq #=>actg

Working with subsequences

Please note that biological sequence numbering convections are one based as opposed to ruby’s zero based. Biological coordinate’s convection for BioSQL and Chado is zero based.

  • Subseq

    This method returns a new sequence containing the subsequence identified by the start and end values given as parameters. This method works in a similar way to the slice string method. For example

    my_seq = Bio::Sequence::NA.new(“agggatttc”)

    puts my_seq.subseq(2,5) #=>
    ggga

    The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers

    When this method is used without arguments, the start defaults to 1 and the end defaults to the last element of the string. Therefore when subseq is called without any arguments, it returns a new sequence similar to the original sequence.

    puts my_seq.subseq #=> agggatttc

  • window_search

    This method is typically used with a block. The method is called if you wanted to step through a sequence given a length of a subsequence. Therefore the method accepts two arguments. Step_size which defines the size of your ‘steps’ and the window_size which defines the length of the stepping subsequence. Any remaining sequence at the terminal end will be returned. The default step size is one since its an optional argument.

    For example

    To print the average GC% on each 100bp you can write,

    s.window_search(100) do |subseq|

    puts subseq.gc

    end


Plasmodium falciparum re-annotation workshop opens

The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger center. The Workshop runs till the 26th and aims to re-annotate the P. falciparum genome. In a welcome message Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.

The Plasmodium falciparum sequencing project was completed in 2002 and since then the Plasmodb database which is currently at version 5 has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce the number of hypothetical genes by providing annotations where known and possible.

Issues that will be addressed and visited include:

- standards for the use of structured gene ontologies in gene/genome annotation

- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc

- naming conventions for large gene families

- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data

- standards for transfering annotations to orthologs in other Plasmodium species

- plans and proposals for further Plasmodium sequencing and other genomics resources

- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc

- future requirements and needs for Plasmodium informatics resources

- annotation projects not completed during the workshop … and strategies for ensuring completion

The workshop is sponsored by Sanger institute and plasmodb


Standalone BLAST with Ruby (windows)

Updated article

BLAST is one of the most widely used search algorithms in molecular biology. So lets see how you can run and retrieve blast results via a simple Ruby script .I will assume you already have ruby 1.8.5 and above installed in your windows box and a standalone blast.exe which you can download from the NCBI’s ftp site here . The latest windows binaries as of this writing is 2.2.17. Create a new folder in C and call it NCBI_Blast. Paste the downloaded blast program in this folder. Double click the blast program and it will create a bin, doc and data folders inside your your NCBI_Blast folder. If this is your first time to install blast in your machine. You will need to do a little configuration. Follow these instructions for setting up blast .
#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)

Follow

Get every new post delivered to your Inbox.

Join 134 other followers