Biorelated

on Informatics & Life Sciences

Introducing Bio-cd-hit-report

CD-HIT is a fast algorithm for clustering and comparing biological sequences. CD-HIT and CD-HIT-EST tools group similar protein and DNA sequences into clusters that meet a user-defined similarity threshold. CD-HIT-454 is specilized to cluster metagenomic sequences. Both CD-HIT and CD-HIT-EST produce two files as an output. The first file is a fasta file which contains the representative sequences and the second file (.clstr) is a cluster file that contains a list of all clusters and the names of the sequence members for each cluster and the relative percetage identities..

If all you are interested in, is to reduce sequence redundacy at a certain percentage threshold,then the list of representative sequences is all that you might need. However if you want to interlogate the clusters, then you might need to look at the cluster file a little more closely. An example of a cluster file is shown below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
>Cluster 0
0       468nt, >55Contig1... *
1       468nt, >55Contig2... at +/100.00%
>Cluster 1
0       306nt, >34Contig1... at +/100.00%
1       441nt, >34Contig2... *
2       306nt, >34Contig3... at +/100.00%
3       441nt, >34Contig5... at +/100.00%
>Cluster 2
0       309nt, >b01_Contig1... at +/100.00%
1       441nt, >b02_Contig1... *
2       309nt, >b03_Contig1... at +/100.00%
3       441nt, >b04_Contig1... at +/100.00%
>Cluster 3
0       438nt, >A_d01_Contig1... *
1       438nt, >A_d01_Contig1... at +/100.00%

bio-cd-hit-report is a simple biogem library that parses this output. It still needs a better interface and some improvements but for now I can easily use it as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
require 'bio-cd-hit-report'

   cluster_file = "cluster95.clstr"
    report = Bio::CdHitReport.new(cluster_file)

      #print the max number of sequences in a cluster for the entire dataset
      puts report.max_members

      #print the minimum number of sequences in a cluster for the entire dataset
      puts report.min_members

      #print total number of clusters in the report
      puts report.total_clusters

      #print the cluster members for cluster with id 1
      puts report.get_cluster(1)

      #information for each cluster
      report.each_cluster do |c|
        puts "#{c.name} - #{c.members}" #print cluster name/id with respective sequences in the cluster
        puts c.size #print the total number of entries in the cluster
      end

For more information and issues see the repo at github

Bioruby Plugins

Bio projects (bioperl, biopython, biojava) are monolithic empires with a few generals who control what is contributed to the projects and ensures that community coding standards are adhered to. A year ago the bioruby generals or core team as they like to be called, decided to be a little democratic, they split the empire into chiefdoms as long as a few rules and standards were followed. This development method is a shift from the traditional way used to develop the bio projects. The R project has known this trick for a while and has successfully expanded its empire considerably without a lot of development headaches for its generals. With 75 plugins published, as of this writing, it is evident the bio-plugins approach has increased code contribution and interest to a lot of bioruby developers. How does it work?

A bioruby-plugin is a gem that subclasses or extends the Bio module in Bioruby. To add new functionality to the sequence class for example, we can easily create a bio-awesome gem to add the missing functionality with

1
2
3
4
5
6
module Bio
  class Awesome < Bio::Sequence
    def new_method
    end
  end
end

To standardize the approach and to generate the necessary recommended directory layout, a plugin authoring tool (Bioruby-gem) was developed and described as an application note in a Bioinformatics paper. Bio-gem is maintained by Raoul Bonnal and Pjotr Prins with contributions from the rest of the community. Creating a new plugin directory structure is as simple as installing the bio-gem and running the command from a terminal.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
gem install bundler
gem install bio-gem

biogem awesome
create    .gitignore
  remove jeweler rcov lines
  create  Rakefile
  create  Gemfile
  create  LICENSE.txt
  create  README.rdoc
  create  .document
  create  lib
  create  lib/bio-awesome.rb
  create  test
 create   test/helper.rb
  create  test/test_bio-awesome.rb
  create  lib/bio-awesome.rb
  create  lib/bio-awesome
  create  lib/bio-awesome/awesome.rb
  create  README.rdoc
  create  README.md
  append  .gitignore
  append  .travis.yml
Jeweler has prepared your gem in bioruby-awesome
....

Biogem command can take extra options that specify the testing framework,create a data directory for testing or create a bin directory to add command line functinality to a bio-gem. Biogem command will also create a git repository and push your code to a github account. It is recommended that a plugin’s code should be pushed to github. A comprehensive tutorial for creating biogems is available at github.

The existing plugins cover a lot of bioinformatics requirements (NGS,assembly,multiple alignments,interfaces with other tools,report parsing and other utilities) but there is need for more. What was an experimental approach may well be the way to develop open source software for bioinformatics.

The Bioruby 1.4.3 Release

On 21st August, Naohisa Goto announced the release of bioruby 1.4.3 via the bioruby mailing list. This version of bioruby contains bug fixes and a lot of improvements some of which are not backward compatible. Notable were improvements and bug fixes for working with Jruby and Rubinius

New Features

A few new features that were announced include

Bio::KEGG::KGML

New class Bio::KEGG::KGML::Graphics for storing a graphics element.

New class Bio::KEGG::KGML::Substrate for storing a substrate element.

New class Bio::KEGG::KGML::Product for storing a product element.

New method Bio::KEGG::KGML::Reaction#id.

The regular expression for FASTQ was truncated to only check the first id line starting with the “@” symbol. which allows for a broader FASTQ file recognition.

Incompatible changes

Bio::FlatFile.auto and Bio::FlatFile.open now uses the binary mode to open a file by default unless you explicitly specify it to use text mode and therefore files using using CR+LF line separator may not be read correctly.

In bioruby 1.4.2, a chromatogram file would open in ascii mode which would cause errors when reading the file. A simple workaround was to explicitly tell Ruby to open the file in binary mode.

1
2
3
4
5
6
7
f = File.open(filename, 'rb') #open in binary mode

chromatogram_ff = Bio::Abif.open(f)

chromatogram_ff.each do |ch|
  ch.to_seq
end

The file is easily opened with,

1
2
3
4
5
6
7
file = "filename"

chromatogram_ff = Bio::Abif.open(filename)

chromatogram_ff.each do |ch|
  ch.to_seq
end

Bug fixes

In BioRuby 1.4.3, almost all require and autoload lines were revised to avoid circular require calls. This fixed crashes on JRuby due to JRuby’s autoload bug. Other fixes as specified in the release notes were:

  • Fixed: Genomenet remote BLAST does not work.

  • Fixed: Bio::KEGG::KGML ignores “coords” field.

  • Fixed: Bio::NucleicAcid.to_re("s") typo

  • To suppress rare failure of chi-square equiprobability tests for Bio::Sequence::Common#randomize, test code changed to retry up to 10 times if the chi-square test fails. The assertion fails if the chi-square test fails 10 consecutive times, and this strongly suggests bugs in codes or in the random number generator.

  • Fixed: Bio::EMBL#os raises RuntimeError. The fix includes incompatible change.

  • Fixed: bin/bioruby: Failed to save object with error message “can’t convert Symbol into String” on Ruby 1.9

A complete list of all the changes is available from the Bioruby github repository and from the Bioruby Blog

Announcing Scribl-rails

Sometimes back I had mentioned about scribl javascript framework for drawing bioinformatics glyphs on HTML5 canvas. If you are a Rails developer you will be happy to know that I have written  scribl-rails, an asset helper for including scribl in your application asset pipeline.

Usage

Add the following to your gemfile:

1
gem 'scribl-rails'

ran bundle install from the application directory

Add the following directive to your Javascript manifest file (application.js):

1
//= require scribl

Enjoy using scribl-rails and creating cute bio-graphics! Many thanks to Chase Miller for the awesome library!

For more information and development check out the scribl-rails at github

Use Scribl to Draw Genomic Glyphs on HTML5 Canvas

The Scribl library by Chase Miller is an awesome and a promising Javascript library for visualizing biological sequence information and rendering it on the web. Scribl  generates biological charts of genomic regions, alignments, and assembly data. The library is under continuous development and I have been able to use it for some internal projects!

A very nice list of examples and introduction is available at the home page and the wiki provides an elaborate user guide!

Happy biology!

Keep Track of Bioruby Plugins

Biogems.info is a new site for keeping track of new and existing Bioruby plugins. Plugins are separate code libraries that split functionality out of the Bioruby main tree. The idea is to have a core Bioruby release and to allow Ruby developers to contribute to Bioruby through plugins. According to Bonnal, the maintainer of biogem (the bio-plugin crafting tool),  plugins are separately maintained and may represent experimental or work in progress.

To read more about Bioruby plugin system please refer to the wiki page on plugins.

Happy biology!

Bioruby 1.4.2 Released!

The Bioruby development team has continued to work tirelessly to bring us the latest release of the Ruby bioinformatics library commonly referred to as bioruby. A list of all the new changes is available here

One of the most pleasant news for beginners is that the Bioruby tutorial has been updated thanks to Michael O’Keefe and Pjotr Prins. The Release is largely a bug fix release with updates on web services from SOAP to REST interfaces.

Upgrading to the latest release is easy…

1
gem update bio

or

1
gem install bio

Happy biology!

Processing netMHCII-pan Prediction Output

Like most informatics throughput methods, epitope prediction generates a lot of output and in a not so friendly format suitable for subsequent analysis. I considered writing a parser for the output using Ruby, but would that not take long? A simple vim function that I added to my .vimrc file to format the output and use a single keystroke worked the magic and saved time.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
" formating output from netMHCII-pan program
function! FormatNetmhcOutput()
   g/^\#/norm dd 
   g/^--/norm dd
   g/^Protein/norm dd
   %le
   g/^pos/norm dd
   %s/<=\sWB//g
   %s/<=\sSB//g
   %s/\s\+$//
   %s/\s\+/,/g
   g/^$/d
endfunction
nmap   ;h  :call FormatNetmhcOutput()

This function can be called by pressing the ; and h key when in normal mode. It removes comments and provides a csv output that can be read with a simple R directive.

1
data <– read.csv("file.csv")

sample output

Convert a fastA File to a Hash

Sometimes you might want to convert a file of fasta sequences to a hash. Here is a one line method that might come handy for that.

1
2
3
4
5
6
7
8
9
require 'bio'

file_path = "example.fasta"

def fasta_to_hash
  Bio::FlatFile.auto(file_path){ |f| f.map {|entry| Hash.[](entry.definition.to_sym,[entry.seq.to_s])}}
end

 #=>[{:"seq1"=>["gatataggagatatcgttagag"]}]

The result is an array of hashes. Each hash key corresponds to the sequence name

My General Purpose Bioinformatics Toolbox

I spend most of my time writing code and using an range of bioinformatics analysis packages. Unlike in many other professions, sometimes  there are no best tools for accomplishing a bioinformatics task. The tools are continuously improved and the choice of tools is dependent on the research question and the biology of whatever you are investigating. However  I have come to rely on some general purpose resources that make me more productive. Let me introduce you to my general purpose spanner box.

Code Editing

I have finally found nirvana in MacVim, which is the preferred version of  Vim for Mac OS. It allows screen splitting, window resizing and integrates with the console, such that you can run system commands right in the editor. You have to install the necessary scripts or plugins to support what you want to do.

It has increased my productivity, although it has a slightly steep learning curve.  This is a tool I recommend to any bioinformatician, if you are not already using it!

Cost: Free

Source Control

Git

I use git for source control. It is awesome and fits very well with my workflow. Git has powerful features and easy to use and work with. I like the idea of distributed source control and it makes it easier to work on different versions of the same project!

Cost: Free

Bibliography manager

*Papers**

I use Papers, which is a commercial tool but I would recommend it to anyone. It helps me sort,annotate and read research articles. I once used Mendeley, which is an awesome tool as well.

Cost: $42 (has academic discounts)

Terminal

terminal

One of the best tools which we may forget is a tool is the terminal! Since I use Mac OS, I enjoy the best of both worlds, a powerful Unix command line support, excellent graphics and support for proprietary software if need arises.

Pen

Pilot V Ball RT Pen

I don’t keep an electronic notebook since I prefer jotting down my notes and having a Notebook. I use a liquid ink Pilot V ball RT pen. The pen has a retractable cone-tip liquid ink rollerball, rubber grip and metal pocket clip. It is airplane-safe and writes a 0.4mm line. 

Cost: $5

NoteBook.

NoteBook

My preference for a notebook is an  A4  D66174 Notebook.  Each book has about 180 pages.  It comes with a protective handcover. This is an archive for my written thoughts, discussions and workflows.

Cost $90 for a pack of five books.

What is your general purpose  bioinformatics toolbox?