Use Scribl to draw genomic glyphs on HTML5 canvas
Posted: September 20, 2011 Filed under: biographics, bioinformatics, databases, ruby on rails, Tools Leave a comment »
The Scribl library by Chase Miller is an awesome and a promising Javascript library for visualizing biological sequence information and rendering it on the web. Scribl generates biological charts of genomic regions, alignments, and assembly data. The library is under continuous development and I have been able to use it for some internal projects!
A very nice list of examples and introduction is available at the home page and the wiki provides an elaborate user guide!
Happy biology!
Standalone BLAST with Ruby revisited
Posted: December 15, 2009 Filed under: bioinformatics, blast, databases, ruby, tutorials 4 Comments »Earlier I showed a very simple way to perform a BLAST using Ruby. Today I would like to revisit that topic for two reasons.
- The “using ruby with blast” search term seems to be very common and actually one of the ways that people reach my blog.
- The original post was not very through.
BLAST aka Basic Local Alignment Tool is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can search DNA against DNA or protein against protein and also can translate a nucleotide query and search it against a protein database and vice versa. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles.
The BLAST tool is fundamental to molecular biologists and bioinformaticians. There are excellent books and tutorials on how to and when to use BLAST, so i will assume all you need is to automated your work and parse the results. The actual algorithm is implemented in C and freely available from the NCBI website.The first thing to do is to download the appropriate binaries for your platform. Instructions for setting up and installing BLAST
Once installed on your system the primary method of interaction is using the command line. Use formatdb to create blast databases and blastall to search for sequence homology for a given sequence against a given blast database.
In Ruby, there are two ways you can call the BLAST program. First using the Bioruby library and second by writing your own ruby wrapper for the BLAST command line parameters and execution. Most often, one executes BLAST from the command line and then process the results file which is in either one of the many BLAST output formats. Bioruby is excellent at parsing the results file. Using Bioruby with BLAST is very straightforward:
#blasting the bioruby way #query_file: a list of query sequences in fasta format #database_path: a path to the actual BLAST formatted database #program: The BLAST program to call, either of blastp,blastn,tblastn e.t.c.
def bio_blast(program, database_path,query_file)
factory = Bio::Blast.local(program,database_path)
ff = Bio::FlatFile.open(Bio::FastaFormat, query_file)
ff.each do |entry|
report = factory.query(entry) # report will be a Blast::Report object
# iterate trough the hits
report.each do|hit|
puts hit.bit_score # bit score (*)
puts hit.query_seq # query sequence
puts hit.midline # middle line string of alignment of homologous region (*)
puts hit.target_seq # hit sequence
puts hit.evalue # E-value
puts hit.identity # % identity
puts hit.overlap # length of overlapping region
puts hit.query_id # identifier of query sequence
puts hit.query_def # definition(comment line) of query sequence
puts hit.query_len # length of query sequence
puts hit.target_id # identifier of hit sequence
puts hit.target_def # definition(comment line) of hit sequence
puts hit.target_len # length of hit sequence
puts hit.query_start # start position of homologous region in query sequence
puts hit.query_end # end position of homologous region in query sequence
puts hit.target_start # start position of homologous region in hit(target) sequence
puts hit.target_end # end position of homologous region in hit(target) sequence
puts hit.lap_at # array of above four numbers
hit.each do |hsp|
puts hsp.query_from
end
end
end
end
A ruby class for screen-scraping plasmodb database
Posted: December 9, 2009 Filed under: bioinformatics, bioruby, databases, malaria, ruby, tutorials 3 Comments »Plasmodb is the primary resource for retrieving Plasmodium falciparum genomic data and information. Unfortunately this database has no API or XML service to request or query its information from a programmer’s point of view or for easy automation of sequence information retrieval. Recently I needed to download a long list of Plasmodium falciparum genomic, Protein and other information for a set of genes. Been lazy to click and open the webpage for each gene in my list. I wrote this in ruby.
It would be great if Plasmodb would provide an easy way of automated sequence retrieval. A webservice or an XML output format would do. Screen scraping is not a very efficient approach. Here we use Scrapi which is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.
![]()
#A class to fetch information from plasmodb using the scrapi API ##TODO handle Scraper::Reader::HTTPUnspecifiedError class Plasmodb #retrives a information using the gene_id #returns a structure obj def fetch_by_gene_id(var_name) begin scraper = Scraper.define do process "div#genomicSequence pre", :genomic_sequence => :text process "div#transcriptSequence pre", :mrna_sequence =>:text process "div#proteinSequence pre", :protein_sequence =>:text process "div#Aliases td>table", :aliases =>:text result :protein_sequence,:aliases,:mrna_sequence,:genomic_sequence end search_link="http://plasmodb.org/plasmo/showRecord.do? name=GeneRecordClasses.GeneRecordClass&source_id="+var_name+"&project_id=PlasmoDB" uri = URI.parse(search_link) @query = scraper.scrape(uri) rescue Scraper::Reader::HTTPUnspecifiedError "None" end end #returns the predicted protein sequence def protein_sequence @query.protein_sequence.chomp end # Returns the genomic sequence def genomic_sequence @query.genomic_sequence.chomp end #returns Aliases def aliases @query.aliases end #returns the mrna sequence def mrna_sequence @query.mrna_sequence.chomp end end #Use the class to fetch information. require 'rubygems' require 'bio' require 'scrapi' file = "/home/george/genes_list.txt" #a file containing a list of accession numbers. #one accession number per line plasmo = Plasmodb.new #initialize a plasmodb class instance #Read the file and process each accession number. File.readlines(file).each do |line| line.chomp! plasmo.fetch_by_gene_id(line) #fetches the information from Plasmodb. #print a fasta entry for the protein sequence puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta,:header=>line) puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta,:header=>line) end #another example#p = Plasmodb.new#p.fetch_by_gene_id("PFD0020c")#puts p.genomic_sequence
Bio-graphics, BioSQL and Rails part 2
Posted: January 8, 2009 Filed under: biographics, bioinformatics, bioruby, databases, ruby on rails, technology 10 Comments »In part 1 of this series we created a rails application and connected it to a BioSQL database. We also overwrote the rails convections to accommodate our legacy schema.
To understand the BioSQL schema, please review the documentation here. A brief overview of is as follows. Every record we enter into our database is a ‘bioentry’ and goes to the bioenty table. A bioentry can be composed of the following entities: the record’s public name, public accession and version, its description and an identifier field.
The actual sequence data is stored in the biosequence table which contains raw sequence information associated with a bioentry, and alphabet information (‘protein’, ‘dna’, ‘rna’). This is because not all records in our database need to be associated with a raw sequence. Additional sequence information is stored in the seqfeature table together with other qualifiers.
The location of each seqfeature (or sub-seqfeature) is defined by a location entity, describing the stop and start coordinates and strand. This information is stored in the location table.
In our rails application we are going to create some models and a few controllers. In RESTful language, we are actually creating resources. In this example we will be very simplistic and just create a biodatabase, taxon, bioentry, biosequence, seqfeature, location resources. We will also create associations between them in their model classes. But before that delete the index.html file from your rails application public folder and add the following line to your configurations/routes.rb file
map.root :controller => "biosequences"
To quickly create the models, controllers, associated views and a test suite for each of our resources, just run the rails generate scaffold command, passing the name of the model as an argument. For example,
generate scaffold Bioentry
will create a bioentry model, a bioentries_controller, associated views (index,show,edit and new), a migration file, though in our case we do not need it. When you finish scaffolding, the routes.rb file should have the following resources declared.
map.resources :seqfeatures
map.resources :locations
map.resources :bioentries
map.resources :biosequences
map.resources :taxon
map.resources :biodatabases
Let us create some mandatory associations for the models.
Edit the /models/biodatabase.rb file by adding the following
has_many :bioentries #a biodatabase is associated with many bioentries
validates_uniqueness_of :name #The name foe each biodatabase is unique!
Edit the /models/bioentry.rb file by adding the following
belongs_to :biodatabase
belongs_to :taxon
has_one :biosequence
Edit the /models/taxon.rb and add
has_one :bioentry
Edit the /models/biosequence.rb file by adding:
set_primary_key :bioentry_id #biosequence uses bioentry_id as a primary key! belongs_to :bioentry
edit the /models/location.rb file by adding:
belongs_to :seqfeature
Edit the /models/seqfeature.rb file by adding:
belongs_to :bioentry
has_many :locations
Note that most likely you will be adding huge files to the database. BioSQL comes with a set of perl scripts to enable you do that. Until bioruby 1.3 is released you will have to use the perl scripts to add huge datasets. All the documentation to do that is available from the BioSQL website. I used a perl script load_ncbi_taxonomy.pl to load taxon data to my database. This script comes with the BioSQL. (It did not seem to work on my system, I will sort that later)
To make this post shorter and get to the meat of it, i will assume that you have some existing data in your biosql database. If not, create some dummy data to populate, the biodatabase, bioentry,biosequence, seqfeature and location tables. In Part 3, I will show you how to create the necessary views to populate the database. After all biologists don’t want to interact with raw SQL queries and sometimes have no idea of running scripts, however they are very web savy!
Edit the /biosequences/show.html.erb to look as follows:
<h2><%= @biosequence.bioentry.name%>(<%= @biosequence.alphabet %>)</h2> <p>Sequence</p> <%= @biosequence.seq %><br/> <%= link_to 'Edit', edit_biosequence_path(@biosequence) %>
Now navigate to http://localhost:3000/biosequences/1
and then navigate to http://locahost:3000/biosequences/1.xml The XML version of your sequence is also available!
Lets add some ability to render graphics for the sequences.
Add the following lines at the top of the biosequence.rb model file
require 'stringio'
require 'base64'
In the biosequence.rb model class, create a new method called draw_graphic.
def self.draw_graphic(value) #get the name and length of the main feature to be drawn main_feature = Bioentry.find(value) len = main_feature.biosequence.length.to_i name = main_feature.name #create a Biographics panel and add a track @my_panel = Bio::Graphics::Panel.new(len,:width=> 900) @track = @my_panel.add_track("#{name}",:glyph=>'directed_generic') #specify the range for the main feature main_feature_range = "1..#{len}" @track.add_feature(Bio::Feature.new("#{name}",main_feature_range), :label=>" ") #write the output to memory output = StringIO.new @my_panel.draw(output) return output.string end
This method will be called by an action method in biosequence_controller.rb file.
def to_image
begin
image = Biosequence.draw_graphic(Biosequence.find(params[:id]))
send_data(image, :filename => "graphic.svg", :disposition => "inline")
rescue ActiveRecord::RecordNotFound
add_error("Error:Attempt to call image without specifying a biosequence ID")
redirect_to :action=>'index'
end
end
We add a rescue block to capture record not found errors. In RESTful applications a controller is limited to seven actions. So we need to add a collection to our biosequence resource in routes.rb. This is how we do it.
map.resources :biosequences,:collection=>{:to_image=>:get}
Now we need to modify our /biosequences/show.html.erb file, to enable rendering of the graphic. For that we will create a helper method so that our show.html.erb view is ‘clean’. In helpers/biosequences_helper.rb file, add the following code
def render_image(feature_obj)
image_tag(url_for({:action=>'to_image',:id=>feature_obj}))
end
And in the /views/biosequences/show.html.erb file add the following line of code
<%= render_image(@biosequence) %><br/>
Now assuming that you have a biosql database with valid data, navigate to
http://localhost:3000/biosequences/show/1

screenshort
The above is a screen shot from my example application while I was writing this tutorial.
The source code for this example application is available from github
For a full review of the methods available for biographics please check the project’s git repository and the rdoc.
Bio-graphics, BioSQL and Rails part 1
Posted: January 7, 2009 Filed under: biographics, bioinformatics, bioruby, databases, ruby on rails, tutorials | Tags: bioruby 9 Comments »In these series I will show you how to quickly add graphics support to a bioinformatics database rails application. We are going to use the biographics library by Jan Aerts, the BioSQL database schema, and rails 2.2.2 (also works with 2.3.2) In this simple example we want to represent a sequence as a graphic, such that we can view it in a web browser more or less the way Gbrowse works. Each main feature has different subfeatures at different locations along it.
——————————————————————- main feature
——- ——– ———— ——– —— subfeatures
We need to have the following installed, rails 2.1.1, bio 2.1, biographics 1.4 all available as gems and a database based on BioSQL schema.
We need to download the BioSQL schema located here. The latest version as of this writing is BioSQL v1.0 (code-named Tokyo) release, v1.0.1. Create a database called biosql_development. I am on Ubuntu Linux with Mysql 5.0.
george:>mysql -u george -p :enter password Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 27 Server version: 5.0.51a-3ubuntu5.4 (Ubuntu) Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> create database biosql_development; Query OK, 1 row affected (0.02 sec) mysql>
I have created a database called biosql_development. Why am i not using migrations? The reason is that BioSQL has some agreed standards on table names and schema convection which are not compatible with rails database creation and table naming conventions. However Rails allows us to override these default convections, when working with legacy databases, as will be our case.
After creating the database, load the BioSQL schema to the empty database. First we need to tell mysql which database to use.
mysql> use biosql_development;
then load the schema
mysql> source /home/george/Desktop/downloadsfolder/biosql-1.0.1/sql/biosqldb-mysql.sql; Query OK, 0 rows affected, 1 warning (0.48 sec) Query OK, 0 rows affected (0.15 sec) Records: 0 Duplicates: 0 Warnings: 0 Query OK, 0 rows affected, 1 warning (0.01 sec) Query OK, 0 rows affected (0.03 sec) Records: 0 Duplicates: 0 Warnings: 0 Query OK, 0 rows affected, 1 warning (0.01 sec) ........ trucated mysql>
Now we need to create a rails application and connect to this database.
I use the Netbeans IDE development environment for creating ruby and rails applications. Go ahead and create a rails application and specify to use mysql as the database adapter.
To connect to our legacy database, we need to override some convections. First disable table plurulization, and tell rails that the table primary name is named as tablename_id as opposed to just the id column expected by rails. To do that
Create a new file in your application configurations/initializers directory called override_rails.rb (you can call it whatever).
class ActiveRecord::Base self.pluralize_table_names = false self.primary_key_prefix_type = :table_name_with_underscore end
The two lines above tells ActiveRecord not to expect the table names to be plural and that the primary key for each table is named as tablename_id format.
Also create another one called external_libraries.rb in the initializers directory, as you can tell this is where I want to put my require statements for loading external libraries.
require 'rubygems'
#load the bioinformatics library
require 'bio'
#load the biographics library
require 'bio-graphics'
#load the sql views extension library
gem 'rails_sql_views'
require 'rails_sql_views'
This file loads our gems. The rails_sql_views gem allows us to create views and access them by creating models corresponding to the views.
At this point if you run rake db:schema:dump, we will have a rails based BioSQL schema and which we can conveniently use to create a BioSQL database on any Relational database that rails supports and this includes Microsoft SQL server, DB2, Oracle, SQLlite and a host of others. All that would be required is to change the database.yml file to suit the adapter of choice and then execute rake db:schema:load to load the BioSQL schema.
Please note that if your are using rails 2.2.2, you may want to comment the lines
unless Kernel.respond_to?(:gem)
Kernel.send :alias_method, :gem, :require_gem
end
in rails_sql_views(0.6.1), otherwise running db:schema:dump will cause rake to abort.
In the next part I will describe how to create the necessary resources for our RESTful(Representational State Transfer) bioinformatics web application and rendering of the graphics.
What’s new in Bioruby edge
Posted: January 4, 2009 Filed under: bioinformatics, bioruby, blast, databases, tutorials | Tags: bioruby, blast, code, ruby Leave a comment »Changes to Bio::Blast
Naohisa Goto has announced changes to the Bio::Blast.reports to support default -m 0 and tabular -m 8
formats in addition to XML (-m 7) form. I think this is really nice and convenient!
Previously it meant that for bioruby to parse a Blast file, you had to have your blast results in XML output which Bio::Blast::Reports would understand. However, by default Blast gives an -m0 output and without that prior knowledge you may spend hours wondering what is wrong when parsing default blast output files. With Bioruby 1.2.1 this will not work
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::Blast.reports(report_file) do |report|
puts report.class
end
Unless blast_output2.txt is in XML format
In the upcoming Bioruby 1.3 the default Blast file can now be parsed, for example,
require 'rubygems'
require 'bio'
report_file = "/home/george/esther_blast_files/blast_output2.txt"
Bio::FlatFile.open(Bio::Blast::Default::Report,report_file) do |ff| ff.each do |rep| puts rep.statistics rep.iterations.each do |itr| puts itr.hits.size itr.hits.each_with_index do |hit,i| puts hit.hit_id puts hit.len end end end end
Bio::Blast.remote now supports DDBJ in addition to Genomenet. It would be a nice idea to support NCBI as well.
Changes to Bio:sequence
It is possible to create sequence objects from Bio::GenBank, Bio::EMBL, and Bio::FastaFormat by using the to_biosequence method
gb = Bio::GenBank.new(genbank_file.gb) gb.to_biosequence
Bio::SQL Support
Thanks to Raoul and Naohisa, support for BioSQL has been rewritten by using ActiveRecord.
#Make a connection connection = Bio::SQL.establish_connection(path_to_database.yaml,'development') #list databases databases =Bio::SQL.list_databases #retrieve a sequence sample_seq = Bio::SQL.fetch_accession('some_accession_number') #get number of seqeunces in the database puts Bio::SQL.list_entries #get references associated with an entry puts sample_seq.references #create an embl format puts sample_seq.to_biosequence.output(:embl)
Changes to Bio::GFF2 and Bio::GFF3
GFF2/GFF3 formatted texts are now supported but there will be backward portability issues with bio 1.2.1 since some incompatible changes have been incorporated. Bio::GFF::Record.comments has been renamed to comment and comments= is now comment=
Both Bio::GFF::GFF2::Record.new and Bio::GFF::GFF3::Records.new, can now take 9 arguments that correspond to GFF columns making it easy to create a Record object directly without need for formatted text.
Both Bio::GFF::GFF2::Record#attributes and Bio::GFF::GFF3::Record#attributes have been changed to return a nested array containing tag, value pairs, to obtain a hash, use the to_hash method
To support data output for GFF2/GFF3, new methods have been added: Bio::GFF::GFF2#to_s, Bio::GFF::GFF3#to_s, Bio::GFF::GFF2::Record#to_s,and Bio::GFF::GFF3::Record#to_s
Lots of other changes have been incorporated for the GFF classes and you can view the change log at github
CodeML parser
A wrapper for PAML codeml program that is used for estimating evolutinary rate has been added. The class provides methods for generating the necessary configuration file. The new Bio::PAML::Codeml::Report and PAML::Codeml::Rates provides simpel classes for accessing the codeml report and rates file. This example is from the example given in the source code
require 'bio'
# Reads multi-fasta formatted file and gets a Bio::Alignment object.
alignment = Bio::FlatFile.open(Bio::Alignment::MultiFastaFormat, 'example.fst').alignment
# Reads newick tree from a file
tree = Bio::FlatFile.open(Bio::Newick, 'example.tree').tree
# Creates a Codeml object
codeml = Bio::PAML::Codeml.new
# Sets parameters
codeml.parameters[:runmode] = 0
codeml.parameters[:RateAncestor] = 1
# You can also set many parameters at a time.
codeml.parameters.update({ :alpha => 0.5, :fix_alpha => 0 })
# Executes codeml with the alignment and the tree
report = codeml.query(alignment, tree)
Lots of Bugs have been fixed and also support for Ruby 1.9 has been added. Its great thanks to the bioruby developers for their time and the excellent new changes!
PlasmoDB 5.4 released
Posted: November 11, 2007 Filed under: databases, malaria | Tags: malaria Leave a comment »
The ApiDB team have announced a new release of the Plasmodb database version 5.4. The database hosts genomic and proteomic data for different species of the parasitic eukaryote Plasmodium, which is the causative agent for malaria. It brings together data provided by numerous laboratories worldwide. From an email sent to registered database users,
New data in the new release include:
- A slightly modified reference genome for P. falciparum
- P. berghei gametocyte proteomics data
- Many additional P. falciparum SNPs
- Additional ESTs
- Expression profiling data for antigenic and adherent variants of P. falciparum 3D7
- User comments submitted prior to June 2007 have now been incorporated into the official annotation.
A brief list of new features include:
- faster loading of Gene and Genome Browser pages.
- Improved synteny views in the Genome Browser.
- Browser views of rodent malaria genomes colored to indicate chromosomes.
- Gene page links to various external data sources (including PlasmoMAP, TDRtargets, UCSC P. falciparum genome browser, Ontology-based Pattern Identification and literature databases).
- More convenient access to help … please click the “Ask us a Question” link on the left of every page, or the “Contact Us” at bottom to report problems or suggest improvements to the database.
Many thanks to the Plasmodb team and the entire ApiDB team for the the recent improvements and for the new datasets.
Plasmodium falciparum re-annotation workshop opens
Posted: October 22, 2007 Filed under: bioinformatics, databases, malaria | Tags: annotation, bioinformatics, malaria, plasmodium Leave a comment »The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger center. The Workshop runs till the 26th and aims to re-annotate the P. falciparum genome. In a welcome message Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.
The Plasmodium falciparum sequencing project was completed in 2002 and since then the Plasmodb database which is currently at version 5 has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce the number of hypothetical genes by providing annotations where known and possible.
Issues that will be addressed and visited include:
- standards for the use of structured gene ontologies in gene/genome annotation
- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc
- naming conventions for large gene families
- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data
- standards for transfering annotations to orthologs in other Plasmodium species
- plans and proposals for further Plasmodium sequencing and other genomics resources
- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc
- future requirements and needs for Plasmodium informatics resources
- annotation projects not completed during the workshop … and strategies for ensuring completion
The workshop is sponsored by Sanger institute and plasmodb