A ruby class for screen-scraping plasmodb database


Plasmodb is the primary resource for retrieving Plasmodium falciparum genomic data and information. Unfortunately this database has no API or XML service to request or query its  information from a programmer’s point of view or for easy automation of sequence information retrieval.  Recently I needed to download a long list of Plasmodium falciparum genomic, Protein and other information for a set of genes. Been lazy to click and open the webpage for each gene in my list. I wrote this in ruby.

It would be great if Plasmodb  would provide an easy way  of automated sequence retrieval. A webservice or an XML output format would do. Screen scraping is not a very efficient approach.  Here we use Scrapi which  is an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules to select, extract and store data from HTML content.

#A class to fetch information from plasmodb using the scrapi API
##TODO handle  Scraper::Reader::HTTPUnspecifiedError
class Plasmodb
   #retrives a information  using the gene_id
   #returns a structure obj
  def fetch_by_gene_id(var_name)
    begin
      scraper = Scraper.define do
        process "div#genomicSequence pre",    :genomic_sequence  => :text
        process "div#transcriptSequence pre", :mrna_sequence =>:text
        process "div#proteinSequence pre",    :protein_sequence  =>:text
        process "div#Aliases td>table",       :aliases =>:text
        result :protein_sequence,:aliases,:mrna_sequence,:genomic_sequence
        end

     search_link="http://plasmodb.org/plasmo/showRecord.do?
               name=GeneRecordClasses.GeneRecordClass&source_id="+var_name+"&project_id=PlasmoDB"
     uri = URI.parse(search_link)
     @query = scraper.scrape(uri)

    rescue Scraper::Reader::HTTPUnspecifiedError
      "None"
    end
  end
  #returns the predicted protein sequence
  def protein_sequence
    @query.protein_sequence.chomp
  end
#  Returns the genomic sequence
  def genomic_sequence
    @query.genomic_sequence.chomp
  end
  #returns Aliases
  def aliases
    @query.aliases
  end
  #returns the mrna sequence
  def mrna_sequence
    @query.mrna_sequence.chomp
  end
end

#Use the class to fetch information.
require 'rubygems'
require 'bio'
require 'scrapi'

file = "/home/george/genes_list.txt" #a file containing a list of accession numbers.
#one accession number per line

plasmo = Plasmodb.new #initialize a plasmodb class instance

#Read the file and process each accession number.
File.readlines(file).each do |line|
  line.chomp!
  plasmo.fetch_by_gene_id(line)  #fetches the information from Plasmodb.
  #print a fasta entry for the protein sequence
  puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta,:header=>line)
  puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta,:header=>line)
end

#another example
#p = Plasmodb.new
#p.fetch_by_gene_id("PFD0020c")
#puts p.genomic_sequence

One Response to “A ruby class for screen-scraping plasmodb database”

  1. Geoffrey Siwo Says:

    Hey George,

    This is cool! Clicking not always efficient.


Leave a Reply