Every so often I find someone storing sequence data in Excel workbooks (insert scream here). This is usually followed by a request that goes like this:
Someone: "I will send you some sequences, and then can we perform xyz analysis, please?"
Me: "Are they in FASTA format?"
Someone: "No, they are in Excel."
Me: (suppressing a laugh) "OK, would you mind converting them to FASTA, and then we can do xyz?"
Someone: (with a puzzled look) "How do I do that? Is there a Windows program for that?"
Me: (feeling Superman-ish) "Eeh, we can create a quick script in Perl or Ruby, I prefer Ruby ... but you should learn some basic Perl or Ruby ... and run away from Windows. :)"
Me: "Save your data as CSV (File -> Save As -> CSV), then send me that file."
So here is a very simple Ruby script that reads a CSV file and creates a FASTA file.
You need to specify four things: the path to the input CSV file, the path to the output FASTA file, the number of the column that contains the sequence names, and the number of the column that contains the sequence data. Column numbers start at 0, so the first column is 0 and the second is 1.
require 'csv'

# Read a CSV file and write each row as a FASTA record.
# name_col and seq_col are zero-based column indexes.
def csv_to_fasta(csv_file, output_file, name_col, seq_col)
  File.open(output_file, 'w') do |file|
    count = 0
    CSV.foreach(csv_file) do |row|
      sequence_id = row[name_col]
      seq = row[seq_col]
      count += 1
      puts sequence_id
      # header line (">name"), then the sequence on the next line
      file.puts ">#{sequence_id}\n#{seq}"
    end
    puts "#{count} sequences processed"
  end
end

csv_file = "#{ENV['HOME']}/path_to_csv_file.csv"
fasta_file = "#{ENV['HOME']}/path_to_fasta_file.fasta"
seq_name_col = 0 # assumes the first column contains the names
seq_data_col = 1 # second column contains the sequence data
csv_to_fasta(csv_file, fasta_file, seq_name_col, seq_data_col)
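
Spreadsheets exported from Excel often carry a header row, a few blank rows, or stray spaces inside the sequence cells, and the script above would happily turn all of those into FASTA records. Here is a rough sketch of a variant that copes with that, under a few assumptions of mine: the method name csv_to_fasta_with_headers, the header-name arguments, and the line width of 60 are all made up for illustration and are not part of the script above.

```ruby
require 'csv'

# Hypothetical variant of csv_to_fasta: looks columns up by header name,
# skips incomplete rows, and wraps sequence lines at a fixed width.
# Assumes the first row of the CSV file holds the column names.
def csv_to_fasta_with_headers(csv_file, output_file, name_header, seq_header, width = 60)
  count = 0
  File.open(output_file, 'w') do |file|
    CSV.foreach(csv_file, headers: true) do |row|
      sequence_id = row[name_header].to_s.strip
      # Remove any whitespace that crept into the sequence cell
      seq = row[seq_header].to_s.gsub(/\s+/, '')
      # Skip rows that are missing either the name or the sequence
      next if sequence_id.empty? || seq.empty?

      file.puts ">#{sequence_id}"
      # Write the sequence in chunks of at most `width` characters
      file.puts seq.scan(/.{1,#{width}}/)
      count += 1
    end
  end
  count # number of records written
end
```

If your columns were called "name" and "sequence" (again, just an assumption about your spreadsheet), you would call it as csv_to_fasta_with_headers(csv_file, fasta_file, 'name', 'sequence') and it returns the number of sequences written.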
Happy biology!