PlasmoDB is the primary resource for retrieving Plasmodium falciparum genomic data. Unfortunately, the database offers no API or XML service for querying it programmatically or for automating sequence retrieval. Recently I needed to download genomic, protein and other information for a long list of Plasmodium falciparum genes. Being too lazy to click through and open the web page for each gene in my list, I wrote this script in Ruby.
It would be great if PlasmoDB provided an easy way to automate sequence retrieval; a web service or an XML output format would do. Screen scraping is not a very efficient approach. Here we use Scrapi, an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules that select, extract and store data from HTML content.
    require 'rubygems'
    require 'uri'
    require 'bio'
    require 'scrapi'

    # A class to fetch information from PlasmoDB using the scrapi API.
    # TODO: handle Scraper::Reader::HTTPUnspecifiedError
    class Plasmodb
      # Retrieves the record for a gene ID and keeps the scraped result.
      # Returns a struct-like object, or "None" if the HTTP request fails.
      def fetch_by_gene_id(var_name)
        scraper = Scraper.define do
          process "div#genomicSequence pre", :genomic_sequence => :text
          process "div#transcriptSequence pre", :mrna_sequence => :text
          process "div#proteinSequence pre", :protein_sequence => :text
          process "div#Aliases td>table", :aliases => :text
          result :protein_sequence, :aliases, :mrna_sequence, :genomic_sequence
        end
        search_link = "http://plasmodb.org/plasmo/showRecord.do?name=GeneRecordClasses.GeneRecordClass&source_id=" + var_name + "&project_id=PlasmoDB"
        uri = URI.parse(search_link)
        @query = scraper.scrape(uri)
      rescue Scraper::Reader::HTTPUnspecifiedError
        "None"
      end

      # Returns the predicted protein sequence.
      def protein_sequence
        @query.protein_sequence.chomp
      end

      # Returns the genomic sequence.
      def genomic_sequence
        @query.genomic_sequence.chomp
      end

      # Returns the aliases.
      def aliases
        @query.aliases
      end

      # Returns the mRNA sequence.
      def mrna_sequence
        @query.mrna_sequence.chomp
      end
    end

    # Use the class to fetch information.
    # The file contains a list of accession numbers, one per line.
    file = "/home/george/genes_list.txt"
    plasmo = Plasmodb.new # initialize a Plasmodb instance

    # Read the file and process each accession number.
    File.readlines(file).each do |line|
      line.chomp!
      # Fetch the information from PlasmoDB; skip the gene if the request
      # failed, so the previous gene's result is not printed again.
      next if plasmo.fetch_by_gene_id(line) == "None"
      # Print FASTA entries for the protein and genomic sequences.
      puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta, :header => line)
      puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta, :header => line)
    end

    # Another example:
    # p = Plasmodb.new
    # p.fetch_by_gene_id("PFD0020c")
    # puts p.genomic_sequence
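The script above builds the record URL by string concatenation and feeds every line of the gene list straight into it. As a rough sketch of a slightly more defensive variant, the snippet below percent-encodes the query parameters with Ruby's standard `uri` library and tidies the gene list first. The helper names `plasmodb_record_uri` and `clean_gene_ids` are made up for illustration and are not part of Scrapi or PlasmoDB; the URL layout is simply copied from the script, so it is an assumption that the site still accepts it.

```ruby
require 'uri'

# Base of the record page, taken from the script in this post.
PLASMODB_RECORD_URL = "http://plasmodb.org/plasmo/showRecord.do"

# Hypothetical helper (not part of scrapi or PlasmoDB): builds the record
# URL for one gene ID, percent-encoding the query parameters.
def plasmodb_record_uri(gene_id)
  query = URI.encode_www_form(
    "name"       => "GeneRecordClasses.GeneRecordClass",
    "source_id"  => gene_id,
    "project_id" => "PlasmoDB"
  )
  URI.parse("#{PLASMODB_RECORD_URL}?#{query}")
end

# Hypothetical helper: strips whitespace and drops blank lines and
# duplicates from the raw lines of a gene list file.
def clean_gene_ids(lines)
  lines.map(&:strip).reject(&:empty?).uniq
end

# Example with in-memory lines instead of a real file.
ids = clean_gene_ids(["PFD0020c\n", "\n", " PFD0020c \n"])
ids.each { |id| puts plasmodb_record_uri(id) }
```

The URI returned by `plasmodb_record_uri` could be passed to `scraper.scrape` in place of the concatenated `search_link`, and `clean_gene_ids(File.readlines(file))` could replace the bare `File.readlines(file)` so that blank lines do not trigger pointless requests.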
