CD-HIT clusters proteins that meet a similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in fasta format. It generates a fasta file of representative sequences and a text file of list of clusters.
Multiple CD-HIT runs. Proteins are first clustered at a high identity (like 90%), the non-redundant sequences are further clustered at a low identity (like 60%). A third cluster can be performed at lower identity. Multi-step run is more efficient and more accurate than a single run.
CD-HIT-2D compares 2 protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a certain threshold. The input are two protein datasets (db1, db2) in fasta format and the output are two files: a fasta file of proteins in db2 that are not similar to db1 and a text file that lists similar sequences between db1 & db2.
CD-HIT-EST clusters a nucleotide sequences that meet a similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in fasta format It generates a fasta file of representative sequences and a text file of list of clusters. It can not be used for very long sequences, like full genomes.
Multiple CD-HIT-EST runs.
Like CD-HIT-2D, CD-HIT-EST-2D compares 2 nucleotide datasets. For same reason as CD-HIT-EST, CD-HIT-EST-2D is not good for very long sequences.
1. Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010(26): 680-682. full text
2. Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006(22): 1658-1659. full text
3. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 2002(18): 77-82. full text
4. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001(17): 282-283. full text