What’s new in bioruby

I looked through the BioRuby git repository the other day to see what is new in the current snapshot. These are some of the notable changes:

Bug fixes

Lots of bug fixes. For example, the Bio::Fasta.remote bug has been fixed, a workaround for a Zlib error has been added, and some method names have been corrected.

Increased Ruby 1.9 support

Renaming of files and modules

Some files and modules have been renamed. For example, Bio::Fastq::QualityScore has been renamed to Bio::Sequence::QualityScore.

Better documentation

There is a samples folder that includes sample usage of some classes and methods.

PhyloXML support

A PhyloXML parser and writer have been included. A new version (1.10) of the PhyloXML schema has been added.

This contribution came from the awesome Latvian girl through a Google Summer of Code project, working with the NESCent organization.

MEME and MAST support

Contributed by Adam Kraut. Minimal, basic support for the motif finding applications MEME and MAST has been added.

Efficiency

Speed-up of Bio::Tree#children and Bio::Tree#parent

“For speed up of Bio::Tree#children and parent, internal cache of
the parent for each node is added. The cache is automatically
cleared when the tree is modified. Note that the cache can only
be accessed from inside Bio::Tree.
* Bio::Tree#parent is changed to directly raise IndexError when
both of the root specified in the argument and preset in the
tree are nil (previously, the same error is raised in the path
method which is internally called from the parent method).
* Bio::Tree#path is changed not to call bfs_shortest_path if the
node1 and node2 are adjacent.”
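
To make the caching idea in the changelog concrete, here is a minimal sketch in plain Ruby. It is not BioRuby's code: the CachedTree class and its method names are made up for illustration, and the only things taken from the changelog are the ideas that the parent of each node is cached, that the cache is cleared whenever the tree is modified, and that the cache is only reachable from inside the class.

```ruby
# Toy rooted tree with a cached child -> parent lookup.
# Illustration of the caching idea only; this is NOT Bio::Tree's implementation.
class CachedTree
  def initialize(root)
    @root = root
    @adjacency = Hash.new { |hash, node| hash[node] = [] }
    @parent_cache = nil
  end

  # Adding an edge modifies the tree, so the cache is thrown away.
  def add_edge(node1, node2)
    @adjacency[node1] << node2
    @adjacency[node2] << node1
    @parent_cache = nil
    self
  end

  # Parent of a node relative to the root (nil for the root itself).
  def parent(node)
    parent_cache[node]
  end

  # Children are all neighbours except the parent.
  def children(node)
    @adjacency[node].reject { |neighbour| neighbour == parent(node) }
  end

  private

  # Builds the whole parent table with one breadth-first walk from the root
  # and reuses it until the next modification.
  def parent_cache
    @parent_cache ||= begin
      parents = { @root => nil }
      queue = [@root]
      until queue.empty?
        current = queue.shift
        @adjacency[current].each do |neighbour|
          next if parents.key?(neighbour)
          parents[neighbour] = current
          queue << neighbour
        end
      end
      parents
    end
  end
end

tree = CachedTree.new(:root)
tree.add_edge(:root, :a).add_edge(:root, :b).add_edge(:a, :c)
p tree.parent(:c)
p tree.children(:root)
```

Repeated calls to parent and children reuse the same table, and the next add_edge call forces it to be rebuilt.
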

To build a gem based on the current snapshot, make sure the following lines have been included in the bioruby.gemspec file. Today's snapshot may already have fixed this. :)

    "lib/bio/db/fasta/fasta_to_biosequence.rb",
    "lib/bio/db/fastq/fastq_to_biosequence.rb",
    "lib/bio/db/fastq/format_fastq.rb",
    "lib/bio/db/fastq.rb",
    "lib/bio/db/sanger_chromatogram/abif.rb",
    "lib/bio/db/sanger_chromatogram/chromatogram.rb",
    "lib/bio/db/sanger_chromatogram/chromatogram_to_biosequence.rb",
    "lib/bio/db/sanger_chromatogram/scf.rb",
    "lib/bio/db/phyloxml/phyloxml.xsd",
    "lib/bio/db/phyloxml/phyloxml_elements.rb",
    "lib/bio/db/phyloxml/phyloxml_parser.rb",
    "lib/bio/db/phyloxml/phyloxml_writer.rb",
    "lib/bio/sequence/quality_score.rb"
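
If you would rather not compare the list by eye, a few lines of Ruby can report which entries are missing. This is only a sketch: the missing_gemspec_paths helper is hypothetical, the path list is shortened to three entries from the list above, and the check is a plain text search for the quoted path, not a real parse of the gemspec.

```ruby
# Paths the gemspec is expected to list (shortened for the example).
REQUIRED_PATHS = [
  "lib/bio/db/fastq.rb",
  "lib/bio/db/phyloxml/phyloxml_parser.rb",
  "lib/bio/sequence/quality_score.rb"
]

# Returns the required paths that do not appear as quoted strings in the text.
def missing_gemspec_paths(gemspec_text, required = REQUIRED_PATHS)
  required.reject { |path| gemspec_text.include?(%("#{path}")) }
end

# A made-up gemspec fragment that lacks the PhyloXML parser.
sample = <<~GEMSPEC
  s.files = [
    "lib/bio/db/fastq.rb",
    "lib/bio/sequence/quality_score.rb"
  ]
GEMSPEC

puts missing_gemspec_paths(sample)

# Against a real checkout (file name assumed):
# puts missing_gemspec_paths(File.read("bioruby.gemspec"))
```
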

Note that this is the bleeding edge version and things are bound to break.
Thank you for the awesome work!

Standalone BLAST with Ruby revisited

Earlier I showed a very simple way to perform a BLAST search using Ruby. Today I would like to revisit that topic for two reasons.

  1. The “using ruby with blast” search term seems to be very common and is actually one of the ways that people reach my blog.
  2. The original post was not very thorough.

BLAST, aka the Basic Local Alignment Search Tool, is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors: it can search DNA against DNA or protein against protein, and it can also translate a nucleotide query and search it against a protein database, and vice versa. It can also compute a “profile” for the query sequence and use that for further searches, as well as search the query against a database of profiles.

The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how and when to use BLAST, so I will assume all you need is to automate your work and parse the results. The actual algorithm is implemented in C and is freely available from the NCBI website. The first thing to do is to download the appropriate binaries for your platform. See the NCBI instructions for setting up and installing BLAST.

Once BLAST is installed on your system, the primary method of interaction is the command line. Use formatdb to create BLAST databases and blastall to search a given sequence against a given BLAST database for sequence homology.

In Ruby, there are two ways you can call the BLAST program: first, using the BioRuby library, and second, by writing your own Ruby wrapper around the BLAST command line. Most often, one executes BLAST from the command line and then processes the results file, which is in one of the many BLAST output formats. BioRuby is excellent at parsing the results file. Using BioRuby with BLAST is very straightforward:

# BLASTing the BioRuby way
require 'rubygems'
require 'bio'

# program:       the BLAST program to call (blastp, blastn, tblastn, etc.)
# database_path: path to the formatdb-formatted BLAST database
# query_file:    a file of query sequences in FASTA format
def bio_blast(program, database_path, query_file)
  factory = Bio::Blast.local(program, database_path)

  ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
  ff.each do |entry|
    report = factory.query(entry) # report will be a Bio::Blast::Report object
    # iterate through the hits
    report.each do |hit|
      puts hit.bit_score        # bit score
      puts hit.query_seq        # query sequence (the homologous region of the query)
      puts hit.midline          # middle line string of the alignment of the homologous region
      puts hit.target_seq       # hit sequence (the homologous region of the hit)
      puts hit.evalue           # E-value
      puts hit.identity         # % identity
      puts hit.overlap          # length of overlapping region
      puts hit.query_id         # identifier of query sequence
      puts hit.query_def        # definition (comment line) of query sequence
      puts hit.query_len        # length of query sequence
      puts hit.target_id        # identifier of hit sequence
      puts hit.target_def       # definition (comment line) of hit sequence
      puts hit.target_len       # length of hit sequence
      puts hit.query_start      # start position of homologous region in query sequence
      puts hit.query_end        # end position of homologous region in query sequence
      puts hit.target_start     # start position of homologous region in hit (target) sequence
      puts hit.target_end       # end position of homologous region in hit (target) sequence
      puts hit.lap_at           # array of the above four numbers
      # each hit holds one or more high-scoring segment pairs (HSPs)
      hit.each do |hsp|
        puts hsp.query_from     # start of the HSP in the query sequence
      end
    end
  end
end

The method will execute BLAST and print the hits, along with the start coordinate of each high-scoring portion of every hit. However, you may want to just run BLAST without the BioRuby overhead. The lines below will work as well:

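# program, database, expectation, matrix and query_path are assumed to be set already
input = query_path

# Execute BLAST and store the output in the blast_result variable.
#  -p  BLAST program to run
#  -d  BLAST database to query against
#  -e  expectation (E-value) cutoff
#  -M  scoring matrix
#  -i  query file path
#  -T  T gives HTML output
# The command must stay on one line: a line break inside %x() would make the
# shell run the second line as a separate command.
blast_result = %x(blastall -p #{program} -d #{database} -e #{expectation} -M #{matrix} -i #{input} -T T)

# blast_result holds the output of the command above. You can write it to a file
# or process it using the Bio::Blast::Report object. Note that Bio::Blast::Report
# reads XML (-m 7) or tabular (-m 8) output, so use one of those options in
# place of -T T if you plan to parse the result.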
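
If you want to stay away from BioRuby for the parsing step too, the tabular output is easy to handle by hand. The sketch below assumes blastall was run with -m 8, which writes one tab-separated line of 12 columns per hit. The BlastHit struct, the field names and the sample lines are my own choices for illustration and do not come from any library.

```ruby
# The 12 columns of blastall's tabular (-m 8) report, in order.
TABULAR_FIELDS = [:query_id, :target_id, :percent_identity, :alignment_length,
                  :mismatches, :gap_openings, :query_start, :query_end,
                  :target_start, :target_end, :evalue, :bit_score]
INTEGER_FIELDS = [:alignment_length, :mismatches, :gap_openings,
                  :query_start, :query_end, :target_start, :target_end]
FLOAT_FIELDS   = [:percent_identity, :evalue, :bit_score]

BlastHit = Struct.new(*TABULAR_FIELDS)

# Turns tabular BLAST output into an array of BlastHit structs.
# Blank lines, comment lines and lines without 12 columns are skipped.
def parse_tabular_blast(text)
  hits = []
  text.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?("#")
    columns = line.split("\t")
    next unless columns.size == TABULAR_FIELDS.size
    hit = BlastHit.new(*columns)
    INTEGER_FIELDS.each { |field| hit[field] = hit[field].to_i }
    FLOAT_FIELDS.each   { |field| hit[field] = hit[field].to_f }
    hits << hit
  end
  hits
end

# Two made-up result lines standing in for real blastall output.
sample = "query1\tsubjectA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370\n" \
         "query1\tsubjectB\t75.00\t120\t30\t2\t5\t124\t40\t159\t2e-20\t95.1\n"

hits = parse_tabular_blast(sample)
hits.each { |hit| puts "#{hit.query_id} -> #{hit.target_id} (E = #{hit.evalue})" }
```
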

You can use a command in the same style as the one above to create BLAST databases with formatdb.

I would recommend the use of the BioRuby BLAST report parsing classes to automate the process. Please look at the BioRuby API documentation for more details.
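
As a rough sketch of that, the snippet below wraps formatdb in two small methods. The method names and the proteins.fasta file are hypothetical. The flags are the usual formatdb ones: -i for the input FASTA file, -p to say whether the sequences are protein (T) or nucleotide (F), and -n for an optional database name. Passing the arguments to system as separate strings avoids shell quoting problems with odd file names.

```ruby
# Builds the argument list for formatdb.
#   -i  input FASTA file
#   -p  T for protein sequences, F for nucleotide sequences
#   -n  base name of the database (optional)
def formatdb_command(fasta_path, protein, name = nil)
  command = ["formatdb", "-i", fasta_path, "-p", protein ? "T" : "F"]
  command += ["-n", name] if name
  command
end

# Runs formatdb. Returns true on success, false on failure,
# and nil if formatdb could not be started at all.
def create_blast_database(fasta_path, protein, name = nil)
  system(*formatdb_command(fasta_path, protein, name))
end

# Show the command that would be run for a hypothetical protein file.
p formatdb_command("proteins.fasta", true)
```
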

A ruby class for screen-scraping plasmodb database

PlasmoDB is the primary resource for retrieving Plasmodium falciparum genomic data and information. Unfortunately, this database has no API or XML service that a programmer can query, which makes it hard to automate the retrieval of sequence information. Recently I needed to download genomic, protein and other information for a long list of Plasmodium falciparum genes. Being too lazy to click through the web page for each gene in my list, I wrote this in Ruby.

It would be great if PlasmoDB provided an easy way to automate sequence retrieval. A web service or an XML output format would do. Screen scraping is not a very efficient approach. Here we use scrAPI, which is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.

# A class to fetch information from PlasmoDB using the scrAPI toolkit
class Plasmodb
  # Retrieves the record for the given gene ID.
  # Returns the scraped structure, or "None" if the HTTP request fails.
  def fetch_by_gene_id(var_name)
    @query = nil # forget the previous record so a failed fetch cannot leave stale data
    scraper = Scraper.define do
      process "div#genomicSequence pre",    :genomic_sequence => :text
      process "div#transcriptSequence pre", :mrna_sequence    => :text
      process "div#proteinSequence pre",    :protein_sequence => :text
      process "div#Aliases td>table",       :aliases          => :text
      result :protein_sequence, :aliases, :mrna_sequence, :genomic_sequence
    end

    # Build the URL from separate pieces; a line break inside the string
    # literal would end up in the URL and corrupt it.
    search_link = "http://plasmodb.org/plasmo/showRecord.do?" +
                  "name=GeneRecordClasses.GeneRecordClass&source_id=" + var_name +
                  "&project_id=PlasmoDB"
    uri = URI.parse(search_link)
    @query = scraper.scrape(uri)
  rescue Scraper::Reader::HTTPUnspecifiedError
    "None"
  end

  # Returns the predicted protein sequence
  def protein_sequence
    @query.protein_sequence.chomp
  end

  # Returns the genomic sequence
  def genomic_sequence
    @query.genomic_sequence.chomp
  end

  # Returns the aliases
  def aliases
    @query.aliases
  end

  # Returns the mRNA sequence
  def mrna_sequence
    @query.mrna_sequence.chomp
  end
end

# Use the class to fetch information.
require 'rubygems'
require 'uri'
require 'bio'
require 'scrapi'

file = "/home/george/genes_list.txt" # a file containing a list of accession numbers,
                                     # one accession number per line

plasmo = Plasmodb.new # create a Plasmodb instance

# Read the file and process each accession number.
File.readlines(file).each do |line|
  line.chomp!
  next if line.empty?
  # fetch the record from PlasmoDB; skip this gene if the request failed
  next if plasmo.fetch_by_gene_id(line) == "None"
  # print FASTA entries for the protein and genomic sequences
  puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta, :header => line)
  puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta, :header => line)
end

#another example
#p = Plasmodb.new
#p.fetch_by_gene_id("PFD0020c")
#puts p.genomic_sequence
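
One fragile spot in a scraper like this is gluing the URL together by hand. A possible alternative, sketched below, is to let Ruby's standard URI library encode the query string. The plasmodb_record_uri helper and the constant are hypothetical names of my own; the address and parameters are the ones used in the class above.

```ruby
require 'uri'

# Record page used by the scraper above.
PLASMODB_RECORD_PAGE = "http://plasmodb.org/plasmo/showRecord.do"

# Builds the record URI for one gene ID, letting URI handle the escaping.
# Surrounding whitespace (for example a trailing newline read from a file)
# is stripped first.
def plasmodb_record_uri(gene_id)
  query = URI.encode_www_form(
    "name"       => "GeneRecordClasses.GeneRecordClass",
    "source_id"  => gene_id.strip,
    "project_id" => "PlasmoDB"
  )
  URI.parse("#{PLASMODB_RECORD_PAGE}?#{query}")
end

puts plasmodb_record_uri("PFD0020c")
```

The returned object can be passed to scraper.scrape in the same way as the result of URI.parse in the class.
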

PC vs Apple Mac (Not the war!)

My good old PC running Linux is showing its age and has recently started failing. The optical drive is not functional and the machine occasionally freezes. The top cover does not hold anymore and the TFT screen needs to be supported carefully.

While this particular computer has served me well, I am at the point where I need a new machine, but I am torn between an Apple Mac and a PC running Linux. First, my work involves the following:

  1. Compiling and running bioinformatics software developed using open source standards and technologies
  2. Programming
  3. Word processing and document editing
  4. Occasional mathematical modeling
  5. Administering  Unix based servers

I have tried to come up with a model-agnostic specification for my needs.

Hardware

* High processor speed (2.60 GHz or above)

* High memory (4 GB or above)

* Medium hard disk space (160 GB or above)

* Long battery life (5 hours or above)

* Durable external cover

* Ergonomic keys and mouse

* Support for multiple external devices (printers, cameras, microphones, storage devices, monitors)

* Excellent support for wireless technologies

* Support for running multiple operating systems on the same machine

Size and weight

* Lightweight

* Convenient while travelling

Operating System

* A Unix or Linux derived operating system

* Easy to upgrade at zero or minimal cost

* Free patches against known security holes and problems.

Software

* Support for Open software standards

* Support for products from Microsoft, Adobe and other proprietary software vendors

Security

* Excellent built-in protection against malware, Trojans and viruses at minimal or no cost

* Support for locking the machine while away or against unauthorized login

* Ability to easily ‘tag’ the machine in case of theft

Price: affordable and reasonable

Based on the above specifications, I have evaluated two computer models that can satisfy these needs.

1. A PC laptop computer running a Linux based operating system

2. An Apple Macintosh laptop computer

I have ruled out a Windows/DOS based operating system because, in my view, Microsoft Windows does not offer good support for open source standards and technologies. OS upgrades for Windows are very expensive, and the OS is highly prone to malware, Trojans and viruses. Most bioinformatics software and tools are developed in a Unix or Linux environment.

A PC can support Linux installations, even though one loses out on hardware optimization. Linux has a relatively poor graphical user interface and functionality when compared to Mac OS or Windows. There is limited support for document processing, graphics and rich multimedia applications. Linux does not support any of the Microsoft software applications natively. There are open source equivalents, but most lack good support.

Apple Macintosh computers are based on Unix and open source technologies, and they support both closed source and open source standards. The hardware is optimized and accelerated for Mac OS. They offer an excellent graphical user interface and a powerful terminal for interacting with the OS, and they are less prone to virus attacks. They also offer long battery life, portability and good ergonomics, at a price roughly equivalent to a PC of the same specifications.

Given my budget constraints, I am thinking that a 2.53 GHz Apple Macintosh 13-inch model with 4 GB of memory is best for my needs. There is little price difference between the PC and Macintosh models based on my specifications. PC models do not favor Linux installations, and Linux hardware support is not guaranteed. They do, however, seem to have more flexible price ranges depending on the manufacturer, vendor, quality and specifications.

I will keep Linux to run my server applications.