A Ruby class for screen-scraping the PlasmoDB database

PlasmoDB is the primary resource for Plasmodium falciparum genomic data. Unfortunately, the database has no API or XML service that a programmer can query, which makes automated retrieval of sequence information difficult. Recently I needed to download genomic, protein and other information for a long list of Plasmodium falciparum genes. Being too lazy to click through and open the web page for each gene in my list, I wrote this class in Ruby.

It would be great if PlasmoDB provided an easy way to retrieve sequences automatically; a web service or an XML output format would do. Screen scraping is not a very efficient approach, and it breaks whenever the page markup changes. Here we use Scrapi, an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules that select, extract and store data from HTML content.

# A class to fetch information from PlasmoDB using the Scrapi API.
# TODO: handle Scraper::Reader::HTTPUnspecifiedError properly
require 'rubygems'
require 'scrapi'
require 'uri'

class Plasmodb
  # Retrieves the record for the given gene ID and stores the scraped
  # result (a structure object) in @query.
  # Returns the result, or the string "None" if the HTTP request fails.
  def fetch_by_gene_id(gene_id)
    # Forget the previous record so a failed request cannot leave stale data.
    @query = nil
    begin
      scraper = Scraper.define do
        process "div#genomicSequence pre",    :genomic_sequence => :text
        process "div#transcriptSequence pre", :mrna_sequence    => :text
        process "div#proteinSequence pre",    :protein_sequence => :text
        process "div#Aliases td>table",       :aliases          => :text
        result :protein_sequence, :aliases, :mrna_sequence, :genomic_sequence
      end

      # Build the URL from single-line pieces; a line break inside the string
      # literal would end up in the URL and break URI.parse.
      search_link = "http://plasmodb.org/plasmo/showRecord.do?" +
                    "name=GeneRecordClasses.GeneRecordClass&source_id=" + gene_id +
                    "&project_id=PlasmoDB"
      uri = URI.parse(search_link)
      @query = scraper.scrape(uri)
    rescue Scraper::Reader::HTTPUnspecifiedError
      "None"
    end
  end

  # Returns the predicted protein sequence.
  def protein_sequence
    @query.protein_sequence.chomp
  end

  # Returns the genomic sequence.
  def genomic_sequence
    @query.genomic_sequence.chomp
  end

  # Returns the aliases.
  def aliases
    @query.aliases
  end

  # Returns the mRNA sequence.
  def mrna_sequence
    @query.mrna_sequence.chomp
  end
end
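
The record URL in fetch_by_gene_id is put together by plain string concatenation, so a stray space or newline in the gene ID goes straight into the request. Below is a minimal sketch of a safer way to build it, using only Ruby's standard uri library. The names PLASMODB_RECORD_URL and plasmodb_record_uri are hypothetical helpers for illustration and are not part of the class above; the query parameters are the ones already used in the class.

```ruby
require 'uri'

# Hypothetical constant: the record page address used by the class above.
PLASMODB_RECORD_URL = "http://plasmodb.org/plasmo/showRecord.do"

# Hypothetical helper: builds the record URI for a gene ID.
# Surrounding whitespace is stripped and the parameters are form-encoded.
def plasmodb_record_uri(gene_id)
  query = URI.encode_www_form(
    "name"       => "GeneRecordClasses.GeneRecordClass",
    "source_id"  => gene_id.strip,
    "project_id" => "PlasmoDB"
  )
  URI.parse("#{PLASMODB_RECORD_URL}?#{query}")
end

# Only builds and prints the URL; no request is sent.
puts plasmodb_record_uri("PFD0020c")
```

The resulting URI object could be passed to scraper.scrape in place of the concatenated string.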

# Use the class to fetch information.
require 'rubygems'
require 'bio'
require 'scrapi'

# A file containing a list of accession numbers, one per line.
file = "/home/george/genes_list.txt"

plasmo = Plasmodb.new # create a Plasmodb instance

# Read the file and process each accession number.
File.readlines(file).each do |line|
  line.chomp!
  next if line.empty? # skip blank lines
  # Fetch the record from PlasmoDB; skip the gene if the request failed.
  next if plasmo.fetch_by_gene_id(line) == "None"
  # Print FASTA entries for the protein and genomic sequences.
  puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta, :header => line)
  puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta, :header => line)
end

#another example
#p = Plasmodb.new
#p.fetch_by_gene_id("PFD0020c")
#puts p.genomic_sequence
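
The loop above relies on BioRuby for the FASTA output. If BioRuby is not available, the same kind of output can be approximated in plain Ruby. This is only a sketch, not BioRuby's implementation: to_fasta is a hypothetical helper, the 60-column line width is an assumption, and whitespace is removed on the assumption that the scraped text may contain line breaks.

```ruby
# Hypothetical helper: formats a sequence as a FASTA entry.
# Whitespace inside the sequence is removed and the sequence is wrapped
# at `width` characters per line (60 is an assumed default).
def to_fasta(header, sequence, width = 60)
  cleaned = sequence.gsub(/\s+/, "")
  lines = cleaned.scan(/.{1,#{width}}/)
  ([">#{header}"] + lines).join("\n")
end

# Made-up sequence, for illustration only.
puts to_fasta("PFD0020c", "ATGGTGACTC")
```

In the loop, puts to_fasta(line, plasmo.protein_sequence) would then take the place of the Bio::Sequence call.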

PlasmoDB 5.4 released

The ApiDB team has announced version 5.4 of the PlasmoDB database. The database hosts genomic and proteomic data for different species of the parasitic eukaryote Plasmodium, the causative agent of malaria, and brings together data provided by numerous laboratories worldwide. The details below are from an email sent to registered database users.

New data in this release include:

  • A slightly modified reference genome for P. falciparum
  • P. berghei gametocyte proteomics data
  • Many additional P. falciparum SNPs
  • Additional ESTs
  • Expression profiling data for antigenic and adherent variants of P. falciparum 3D7
  • User comments submitted prior to June 2007 have now been incorporated into the official annotation.

New features include:

  • Faster loading of Gene and Genome Browser pages.
  • Improved synteny views in the Genome Browser.
  • Browser views of rodent malaria genomes colored to indicate chromosomes.
  • Gene page links to various external data sources (including PlasmoMAP, TDRtargets, UCSC P. falciparum genome browser, Ontology-based Pattern Identification and literature databases).
  • More convenient access to help … please click the “Ask us a Question” link on the left of every page, or the “Contact Us” at bottom to report problems or suggest improvements to the database.

Many thanks to the PlasmoDB team and the entire ApiDB team for the recent improvements and the new datasets.

Plasmodium falciparum re-annotation workshop opens

The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger Institute. The workshop runs until the 26th and aims to re-annotate the P. falciparum genome. In a welcome message, Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.

The Plasmodium falciparum sequencing project was completed in 2002. Since then the PlasmoDB database, currently at version 5, has been the primary source of P. falciparum data and genomic information. With 60% of P. falciparum genes annotated as hypothetical, it is time to reduce that number by providing annotations where they are known and possible.

Issues that will be addressed include:

- standards for the use of structured gene ontologies in gene/genome annotation

- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc

- naming conventions for large gene families

- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data

- standards for transferring annotations to orthologs in other Plasmodium species

- plans and proposals for further Plasmodium sequencing and other genomics resources

- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc

- future requirements and needs for Plasmodium informatics resources

- annotation projects not completed during the workshop … and strategies for ensuring completion

The workshop is sponsored by the Sanger Institute and PlasmoDB.