What’s new in Bioruby edge
Posted: January 4, 2009 Filed under: bioinformatics, bioruby, blast, databases, tutorials | Tags: bioruby, blast, code, ruby Leave a comment »Changes to Bio::Blast
Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!
Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
puts report.class
end
Unless blast_output2.txt is in XML format
In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff| ff.each do |rep| puts rep.statistics rep.iterations.each do |itr| puts itr.hits.size itr.hits.each_with_index do |hit,i| puts hit.hit_id puts hit.len end end end end
Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.
Changes to Bio:sequence
It is possible to create sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method
gb = Bio::GenBank.new(genbank_file.gb) gb.to_biosequence
Bio::SQL Support
Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using ActiveRecord.
#Make a connection connection = Bio::SQL.establish_connection(path_to_database.yaml,'development') #list databases databases =Bio::SQL.list_databases #retrieve a sequence sample_seq = Bio::SQL.fetch_accession('some_accession_number') #get number of seqeunces in the database puts Bio::SQL.list_entries #get references associated with an entry puts sample_seq.references #create an embl format puts sample_seq.to_biosequence.output(:embl)
Changes to Bio::GFF2 and Bio::GFF3
GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated. Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=
Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for formatted text.
Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a hash, use the to_hash method
To support data output for GFF2/GFF3, new methods have been added: Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s
Lots of other changes have been incorporated for the GFF classes and you can view the change log at github
CodeML parser
A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added. The class provides methods for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates provides simpel classes for accessing the codeml report and rates file. This example is from the example given in the source code
require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
# Reads newick tree from a file
tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
# Creates a Codeml object
codeml = Bio::PAML::Codeml.new
# Sets parameters
codeml.parameters[:runmode] = 0
codeml.parameters[:RateAncestor] = 1
# You can also set many parameters at a time.
codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
# Executes codeml with the alignment and the tree
report = codeml.query(alignment, tree)
Lots of Bugs have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!
Bioruby mini-series: The Bio::Sequence::Common class
Posted: February 12, 2008 Filed under: bioinformatics, bioruby, tutorials | Tags: bioruby, code, ruby 6 Comments »
Sequence Transformation
Lets have a look at the Bio::Sequence::Common class module which provides us with most of the sequence transformation methods for biological sequences.
Bio::Sequence::Common
Implements methods which are common to both Bio::Sequence::AA and Bio::Sequence::NA, for example
A Bio::Sequence object is easily created like this;
require ‘bio’
my_dna = Bio::Sequence.auto("actagatatttgat") #=> actagatatttgat
my_dna is now a Bio::sequence object and you can use the various methods available for this class, which we are going to explore shortly.
Bio::Sequence::Common Non Modifying methods
-
to_s
This method returns a sequence as a string. It does not modify the original sequence.
puts my_dna.to_s #=> actagatatttgat
puts my_dna.to_s.class #=> String
An alias for this method is the to_str method.
my_dna.to_str
#=> actagatatttgat
-
seq
This method will return a new Bio::Sequence::NA or Bio::Sequence::AA object. The original sequence remains unchanged. For example if you wished to assign a new instance of my_dna object that we created above ,such that you have a my_dna2 object, you would create that as follows,
my_dna2 = my_dna.seq
puts my_dna2 #=> actagatatttgat
puts my_dna2.class #=>
Bio::Sequence::NA
Bio::Sequence::Common modifying methods
-
Normalize!
This method removes all the white space and transforms all positions to uppercase if the sequence is an amino acid (AA) or transforms all positions to lowercase if the sequence is a nucleic acid (NA) sequence, leaving the original sequence modified
For example
test_seq = Bio::Sequence::NA.new(“ACTG”)
puts test_seq.normalize! #=>
actg
-
Concatenating
Many times we want to append a new sequence or a set of bases/residues eg a poly A sequence to the end of a new sequence and modify the original sequence. This is achieved by the concat method.
It is also referred to as << method.test_seq = Bio::Sequence::NA.new(“actg”)
test_seq << “acagat”
test_seq concat “acagat”
puts test_seq #=>
actgacagat
Note that to create a new sequence that adds to an existing sequence without altering the original sequence you would use the + operator. It accepts a variable number of arguments. For example
test_seq = Bio::Sequence::NA.new(“actg”)
test_seq2 = test_seq + (“cttcccttttt” “tatatata”)
puts test_seq2 #=>
actgcttcccttttttatatata
puts test_seq #=>actg
Working with subsequences
Please note that biological sequence numbering convections are one based as opposed to ruby’s zero based. Biological coordinate’s convection for BioSQL and Chado is zero based.
-
Subseq
This method returns a new sequence containing the subsequence identified by the start and end values given as parameters. This method works in a similar way to the slice string method. For example
my_seq = Bio::Sequence::NA.new(“agggatttc”)
puts my_seq.subseq(2,5) #=>
ggga
The first argument denotes the start and the second argument denotes the end of the subsequence. Both arguments must be positive integers
When this method is used without arguments, the start defaults to 1 and the end defaults to the last element of the string. Therefore when subseq is called without any arguments, it returns a new sequence similar to the original sequence.
puts my_seq.subseq #=> agggatttc
-
window_search
This method is typically used with a block. The method is called if you wanted to step through a sequence given a length of a subsequence. Therefore the method accepts two arguments. Step_size which defines the size of your ‘steps’ and the window_size which defines the length of the stepping subsequence. Any remaining sequence at the terminal end will be returned. The default step size is one since its an optional argument.
For example
To print the average GC% on each 100bp you can write,
s.window_search(100) do |subseq|
puts subseq.gc
end
Plasmodium falciparum re-annotation workshop opens
Posted: October 22, 2007 Filed under: bioinformatics, databases, malaria | Tags: annotation, bioinformatics, malaria, plasmodium Leave a comment »The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger center. The Workshop runs till the 26th and aims to re-annotate the P. falciparum genome. In a welcome message Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.
The Plasmodium falciparum sequencing project was completed in 2002 and since then the Plasmodb database which is currently at version 5 has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce the number of hypothetical genes by providing annotations where known and possible.
Issues that will be addressed and visited include:
- standards for the use of structured gene ontologies in gene/genome annotation
- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc
- naming conventions for large gene families
- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data
- standards for transfering annotations to orthologs in other Plasmodium species
- plans and proposals for further Plasmodium sequencing and other genomics resources
- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc
- future requirements and needs for Plasmodium informatics resources
- annotation projects not completed during the workshop … and strategies for ensuring completion
The workshop is sponsored by Sanger institute and plasmodb
Standalone BLAST with Ruby (windows)
Posted: October 3, 2007 Filed under: bioinformatics, blast, ruby, technology | Tags: bioinformatics, blast, ruby, technology 2 Comments »#create a query sequence
myseq="pcaatcacatyyawwqqffgghhhkllkl"
#create a temporary file
require 'tempfile'
temp=Tempfile.new("seqfile")
#get the name of the temporary file
name=temp.path
#append the contents your sequence to this temporary file
temp.puts "#{myseq}"
temp.close
#since we have a protein query sequence, we will run a blastp. Please note that you will need to have a valid #database to query against. use the formatdb command to create your database before executing the lines #below.
@program = 'blastp'
#path to blast
@database = 'c:/path_to_databasefile'
#name of your query file
@input= name
#your blast output file
@output='c:/path_output_file'
#assume your blast is in a folder called NCBI_Blast, execute
system( "c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} -o #{@output}")
#To capture the output in a variable execute this command instead.
#note that we have omitted the blast -o parameter
result=%x(c:/NCBI_Blast/bin/blastall.exe -p #{@program} -d #{@database} -i #{@input} )
#remember to delete the temporary file!
temp.close(true)