A Ruby class for screen-scraping the PlasmoDB database
Posted: December 9, 2009 Filed under: bioinformatics, bioruby, databases, malaria, ruby, tutorials
PlasmoDB is the primary resource for retrieving Plasmodium falciparum genomic data. Unfortunately, the database offers no API or XML service for querying it programmatically or for automating sequence retrieval. Recently I needed to download genomic, protein and other information for a long list of Plasmodium falciparum genes. Being too lazy to click through and open the web page for each gene in my list, I wrote this in Ruby.
It would be great if PlasmoDB provided an easy way to automate sequence retrieval; a web service or an XML output format would do. Screen scraping is not a very efficient approach, since every lookup downloads a full HTML page and the rules break whenever the page layout changes. Here we use scrAPI, an HTML scraping toolkit for Ruby. It uses CSS selectors to write easy, maintainable scraping rules that select, extract and store data from HTML content.
    require 'rubygems'
    require 'uri'
    require 'bio'
    require 'scrapi'

    # A class to fetch information from PlasmoDB using the scrAPI toolkit.
    class Plasmodb
      # Retrieves the record for a gene ID and stores it in @query.
      # Returns the scraped structure, or nil if the request raises
      # Scraper::Reader::HTTPUnspecifiedError.
      def fetch_by_gene_id(gene_id)
        scraper = Scraper.define do
          process "div#genomicSequence pre", :genomic_sequence => :text
          process "div#transcriptSequence pre", :mrna_sequence => :text
          process "div#proteinSequence pre", :protein_sequence => :text
          process "div#Aliases td>table", :aliases => :text
          result :protein_sequence, :aliases, :mrna_sequence, :genomic_sequence
        end
        search_link = "http://plasmodb.org/plasmo/showRecord.do?" +
                      "name=GeneRecordClasses.GeneRecordClass&source_id=" +
                      gene_id + "&project_id=PlasmoDB"
        @query = scraper.scrape(URI.parse(search_link))
      rescue Scraper::Reader::HTTPUnspecifiedError
        # Clear any previous result so a failed lookup is never mistaken
        # for the record of the gene fetched before it.
        @query = nil
      end

      # Returns the predicted protein sequence.
      def protein_sequence
        @query.protein_sequence.chomp
      end

      # Returns the genomic sequence.
      def genomic_sequence
        @query.genomic_sequence.chomp
      end

      # Returns the aliases.
      def aliases
        @query.aliases
      end

      # Returns the mRNA sequence.
      def mrna_sequence
        @query.mrna_sequence.chomp
      end
    end

    # Use the class to fetch information.
    # The file contains a list of accession numbers, one per line.
    file = "/home/george/genes_list.txt"
    plasmo = Plasmodb.new # initialize a Plasmodb instance

    # Read the file and process each accession number.
    File.readlines(file).each do |line|
      line.chomp!
      # Fetch the record from PlasmoDB; skip this accession number on failure.
      next unless plasmo.fetch_by_gene_id(line)
      # Print FASTA entries for the protein and genomic sequences.
      puts Bio::Sequence.new(plasmo.protein_sequence).output(:fasta, :header => line)
      puts Bio::Sequence.new(plasmo.genomic_sequence).output(:fasta, :header => line)
    end

    # Another example:
    # p = Plasmodb.new
    # p.fetch_by_gene_id("PFD0020c")
    # puts p.genomic_sequence
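The script above builds the record URL by string concatenation and passes every line of the gene list straight to the scraper. As a minimal sketch, assuming only Ruby's standard library, the two helpers below show one way to escape the query parameters and to drop blank lines before any request is made. The names `plasmodb_record_url` and `clean_gene_ids` are hypothetical and not part of scrAPI or BioRuby, and `GENE_ID_2` is a placeholder rather than a real accession number.

```ruby
require 'uri'

# Base address copied from the class above; this sketch does not check
# whether PlasmoDB still serves this endpoint.
PLASMODB_RECORD_BASE = "http://plasmodb.org/plasmo/showRecord.do"

# Hypothetical helper: builds the record URL for one gene ID and lets
# URI.encode_www_form escape the query parameters.
def plasmodb_record_url(gene_id)
  query = URI.encode_www_form(
    "name"       => "GeneRecordClasses.GeneRecordClass",
    "source_id"  => gene_id,
    "project_id" => "PlasmoDB"
  )
  "#{PLASMODB_RECORD_BASE}?#{query}"
end

# Hypothetical helper: turns the raw lines of a gene list into clean IDs,
# stripping surrounding whitespace and dropping blank lines.
def clean_gene_ids(lines)
  lines.map { |line| line.strip }.reject { |id| id.empty? }
end

# "GENE_ID_2" is a placeholder, not a real accession number.
ids = clean_gene_ids(["PFD0020c\n", "\n", "  GENE_ID_2 \n"])
ids.each { |id| puts plasmodb_record_url(id) }
```

A string returned by `plasmodb_record_url` could be passed to `URI.parse` in place of the concatenated `search_link`, so that a trailing blank line or stray whitespace in the gene list never produces a malformed request.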
PlasmoDB 5.4 released
Posted: November 11, 2007 Filed under: databases, malaria | Tags: malaria
The ApiDB team has announced version 5.4 of the PlasmoDB database. The database hosts genomic and proteomic data for different species of the parasitic eukaryote Plasmodium, the causative agent of malaria. It brings together data provided by numerous laboratories worldwide. From an email sent to registered database users:
New data in the new release include:
- A slightly modified reference genome for P. falciparum
- P. berghei gametocyte proteomics data
- Many additional P. falciparum SNPs
- Additional ESTs
- Expression profiling data for antigenic and adherent variants of P. falciparum 3D7
- User comments submitted prior to June 2007 have now been incorporated into the official annotation.
A brief list of new features includes:
- Faster loading of Gene and Genome Browser pages.
- Improved synteny views in the Genome Browser.
- Browser views of rodent malaria genomes colored to indicate chromosomes.
- Gene page links to various external data sources (including PlasmoMAP, TDRtargets, UCSC P. falciparum genome browser, Ontology-based Pattern Identification and literature databases).
- More convenient access to help … please click the “Ask us a Question” link on the left of every page, or the “Contact Us” at bottom to report problems or suggest improvements to the database.
Many thanks to the PlasmoDB team and the entire ApiDB team for the recent improvements and for the new datasets.
Plasmodium falciparum re-annotation workshop opens
Posted: October 22, 2007 Filed under: bioinformatics, databases, malaria | Tags: annotation, bioinformatics, malaria, plasmodium
The Plasmodium genome re-annotation workshop opened on 21st October at the Sanger Institute. The workshop runs until the 26th and aims to re-annotate the P. falciparum genome. In a welcome message, Prof David Roos pointed out that a major goal of the workshop is to ascribe new or updated functions to gene models, reflecting the current state of knowledge in the wider malaria community.
The Plasmodium falciparum sequencing project was completed in 2002, and since then the PlasmoDB database, currently at version 5, has been the primary source of P. falciparum data and genomic information. With 60% of the P. falciparum genes annotated as hypothetical, it is time to reduce that number by providing annotations where known and possible.
Issues that will be addressed include:
- standards for the use of structured gene ontologies in gene/genome annotation
- naming conventions for “hypothetical proteins”, “conserved hypotheticals”, “putative kinases”, etc
- naming conventions for large gene families
- standards for inferring function from orthology, motif/domain conservation, or ‘guilt by association’ based on functional genomics data
- standards for transferring annotations to orthologs in other Plasmodium species
- plans and proposals for further Plasmodium sequencing and other genomics resources
- pipelines for ensuring currency and consistency of data in GenBank/EMBL, GeneDB, PlasmoDB, etc
- future requirements and needs for Plasmodium informatics resources
- annotation projects not completed during the workshop … and strategies for ensuring completion
The workshop is sponsored by the Sanger Institute and PlasmoDB.