CD-HIT is a fast algorithm for clustering and comparing biological sequences. The CD-HIT and CD-HIT-EST tools group similar protein and DNA sequences, respectively, into clusters that meet a user-defined similarity threshold. CD-HIT-454 is specialized for clustering metagenomic sequences. Both CD-HIT and CD-HIT-EST produce two output files. The first is a FASTA file containing the representative sequences. The second (.clstr) is a cluster file that lists every cluster, the names of its member sequences, and each member's percentage identity relative to the cluster representative.
If all you are interested in is reducing sequence redundancy at a certain percentage threshold, then the list of representative sequences may be all you need. However, if you want to interrogate the clusters, then you need to look at the cluster file a little more closely. An example of a cluster file is shown below.
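As a rough illustration of the format, the sketch below embeds a made-up `.clstr` excerpt and parses its member lines in plain Ruby. The sequence names, lengths and identities are invented, and `CLSTR_EXAMPLE`, `MEMBER_LINE` and `parse_member` are hypothetical names used only here. Each cluster starts with a `>Cluster N` header; each member line gives an index, the sequence length, the (possibly truncated) name, and either `*` for the representative or `at NN.NN%` for the identity to it.

```ruby
# Hypothetical excerpt of a CD-HIT .clstr file (all values are made up).
CLSTR_EXAMPLE = <<~CLSTR
  >Cluster 0
  0\t350aa, >seq_A... *
  1\t342aa, >seq_B... at 97.08%
  >Cluster 1
  0\t120aa, >seq_C... *
CLSTR

# Member line: index, length with unit (aa or nt), name, then "*" or "at ...%".
MEMBER_LINE = /\A(\d+)\s+(\d+)(aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)\z/

# Returns a hash describing one member line, or nil for any other line.
def parse_member(line)
  m = MEMBER_LINE.match(line.strip)
  return nil unless m

  representative = m[5] == "*"
  # CD-HIT-EST prefixes the identity with a strand, e.g. "at +/99.50%",
  # so only the trailing number before "%" is extracted.
  identity = representative ? nil : m[5][/([\d.]+)%\z/, 1]&.to_f
  { index: m[1].to_i, length: m[2].to_i, unit: m[3], name: m[4],
    representative: representative, identity: identity }
end

CLSTR_EXAMPLE.each_line do |line|
  next if line.start_with?(">Cluster")
  p parse_member(line)
end
```

Representatives carry no identity value of their own, which is why `identity` is `nil` for them in this sketch.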
bio-cd-hit-report is a simple biogem library that parses this output. It still needs a better interface and some improvements, but for now it is easy enough to use.
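The kind of report one typically wants from a cluster file can be sketched without any library. The code below is a library-free illustration only: it does not use or mirror the bio-cd-hit-report interface, and `SAMPLE`, `Member`, `read_clusters` and `singletons` are hypothetical names with made-up data. It groups members by cluster, then lists each cluster's size and representative and picks out the singleton clusters.

```ruby
# Library-free sketch; NOT the bio-cd-hit-report API. All data is made up.
SAMPLE = <<~CLSTR
  >Cluster 0
  0\t350aa, >seq_A... *
  1\t342aa, >seq_B... at 97.08%
  2\t339aa, >seq_D... at 95.58%
  >Cluster 1
  0\t120aa, >seq_C... *
CLSTR

Member = Struct.new(:name, :length, :representative, :identity)

# Returns a hash mapping each cluster name to an array of Member structs.
def read_clusters(text)
  clusters = {}
  current = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty?

    if line.start_with?(">")
      # A header such as ">Cluster 0" opens a new cluster.
      current = line.delete_prefix(">")
      clusters[current] = []
    elsif current && (m = line.match(/\A\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(.+)\z/))
      rep = m[3] == "*"
      identity = rep ? nil : m[3][/([\d.]+)%\z/, 1]&.to_f
      clusters[current] << Member.new(m[2], m[1].to_i, rep, identity)
    end
  end
  clusters
end

# Names of clusters that contain only their representative.
def singletons(clusters)
  clusters.select { |_, members| members.size == 1 }.keys
end

clusters = read_clusters(SAMPLE)
clusters.each do |name, members|
  rep = members.find(&:representative)
  puts "#{name}: #{members.size} member(s), representative #{rep&.name}"
end
puts "Singletons: #{singletons(clusters).join(', ')}"
```

For a real file, `File.read` on the `.clstr` path would supply the text passed to `read_clusters`.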
For more information, or to report issues, see the repository on GitHub.







