CD-HIT Suite


The CD-HIT package can perform various jobs, such as clustering a protein database, clustering a DNA/RNA database, comparing two databases (protein or DNA/RNA), and generating protein families. More information is available at the home page.

Learn more

cd-hit

CD-HIT clusters proteins that meet a similarity threshold, usually a sequence identity. Each cluster has one representative sequence. The input is a protein dataset in FASTA format. The output is a FASTA file of representative sequences and a text file listing the clusters.
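The text file of clusters can be read with a few lines of Python. The sketch below assumes the commonly seen .clstr layout, in which each cluster starts with a ">Cluster N" line and the representative sequence is marked with a trailing "*". The function name parse_clstr and the sample data are hypothetical and only illustrate the idea.

```python
import re
from typing import Dict, List

# Member lines are assumed to look like: "0<TAB>120aa, >seqA... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")

def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each representative sequence ID to the IDs of all cluster members."""
    raw = []  # one [representative, members] pair per cluster
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            raw.append([None, []])  # start a new cluster
            continue
        match = MEMBER_RE.search(line)
        if match is None or not raw:
            continue  # skip lines that do not fit the assumed layout
        seq_id = match.group(1)
        raw[-1][1].append(seq_id)
        if line.endswith("*"):  # assumed marker for the representative
            raw[-1][0] = seq_id
    return {rep: members for rep, members in raw if rep is not None}

# Hypothetical sample data, not real CD-HIT output.
EXAMPLE = """>Cluster 0
0\t120aa, >seqA... *
1\t118aa, >seqB... at 95.00%
>Cluster 1
0\t80aa, >seqC... *
"""

clusters = parse_clstr(EXAMPLE)
```

Here clusters maps "seqA" to its two members and "seqC" to itself, so the size of each cluster is simply the length of each list.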

h-cd-hit

Multiple CD-HIT runs. Proteins are first clustered at a high identity (such as 90%); the resulting non-redundant sequences are then clustered at a lower identity (such as 60%). A third clustering step can be performed at an even lower identity. A multi-step run is more efficient and more accurate than a single run at the low identity.
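A multi-step run can be sketched as a chain of command lines in which each step takes the representative sequences of the previous step as input. The sketch below only builds the argument lists and does not run anything. It assumes the cd-hit command-line options -i (input), -o (output), -c (identity threshold) and -n (word size), and a word-size mapping taken from common usage guidance; the file names and the helper names are hypothetical.

```python
from typing import List

def protein_word_size(identity: float) -> int:
    """Pick a word size for a protein identity threshold (assumed mapping)."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("identity thresholds below 0.4 are not covered by this sketch")

def hierarchical_commands(fasta: str, thresholds: List[float],
                          prefix: str = "step") -> List[List[str]]:
    """Build one cd-hit argument list per threshold, chaining outputs to inputs."""
    commands = []
    current = fasta
    for step, identity in enumerate(thresholds, start=1):
        output = f"{prefix}{step}_{int(round(identity * 100))}"
        commands.append([
            "cd-hit", "-i", current, "-o", output,
            "-c", str(identity), "-n", str(protein_word_size(identity)),
        ])
        current = output  # representatives feed the next, lower-identity step
    return commands

# Two-step run: 90% identity first, then 60% on the representatives.
plan = hierarchical_commands("proteins.fasta", [0.9, 0.6])
```

Each list in plan could be handed to subprocess.run on a machine where cd-hit is installed. Running the high-identity step first means the slower low-identity step only sees the already reduced set of sequences.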

cd-hit-2d

CD-HIT-2D compares two protein datasets (db1, db2). It identifies the sequences in db2 that are similar to db1 at a certain threshold. The input is two protein datasets (db1, db2) in FASTA format. The output is two files: a FASTA file of proteins in db2 that are not similar to db1, and a text file that lists similar sequences between db1 and db2.
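For local use, a comparison like this is typically a single command. The sketch below builds the argument list for such a call, assuming the cd-hit-2d options -i (db1), -i2 (db2), -o (output), -c (identity threshold) and -n (word size). The helper name, the file names and the default values are hypothetical.

```python
from typing import List

def cd_hit_2d_command(db1: str, db2: str, output: str,
                      identity: float = 0.9, word_size: int = 5) -> List[str]:
    """Build a cd-hit-2d argument list comparing db2 against db1."""
    return [
        "cd-hit-2d",
        "-i", db1,        # reference dataset
        "-i2", db2,       # dataset whose sequences are checked against db1
        "-o", output,     # FASTA of db2 sequences not similar to db1
        "-c", str(identity),
        "-n", str(word_size),
    ]

command = cd_hit_2d_command("db1.fasta", "db2.fasta", "db2_novel")
```

The list can be passed to subprocess.run where cd-hit-2d is installed. Building the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.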

cd-hit-est

CD-HIT-EST clusters nucleotide sequences that meet a similarity threshold, usually a sequence identity. The input is a DNA/RNA dataset in FASTA format. The output is a FASTA file of representative sequences and a text file listing the clusters. It cannot be used for very long sequences, such as full genomes.
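A nucleotide run looks much like a protein run, but the word size is usually chosen from a different range. The sketch below builds a cd-hit-est argument list, assuming the options -i, -o, -c and -n and a threshold-to-word-size mapping taken from common usage guidance. The mapping, the helper names and the file names are assumptions and should be checked against the installed version.

```python
from typing import List

def est_word_size(identity: float) -> int:
    """Pick a word size for a nucleotide identity threshold (assumed mapping)."""
    if identity >= 0.95:
        return 10
    if identity >= 0.90:
        return 8
    if identity >= 0.88:
        return 7
    if identity >= 0.85:
        return 6
    if identity >= 0.80:
        return 5
    if identity >= 0.75:
        return 4
    raise ValueError("identity thresholds below 0.75 are not covered by this sketch")

def cd_hit_est_command(fasta: str, output: str, identity: float = 0.95) -> List[str]:
    """Build a cd-hit-est argument list for one clustering run."""
    return [
        "cd-hit-est", "-i", fasta, "-o", output,
        "-c", str(identity), "-n", str(est_word_size(identity)),
    ]

command = cd_hit_est_command("reads.fasta", "reads_nr", 0.95)
```

As with the protein case, the representatives would be written to the output file and the cluster listing to a companion text file.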

h-cd-hit-est

Multiple CD-HIT-EST runs, analogous to h-cd-hit but for nucleotide sequences.

cd-hit-est-2d

CD-HIT-EST-2D is the nucleotide counterpart of CD-HIT-2D: it compares two nucleotide datasets. For the same reason as CD-HIT-EST, it is not suitable for very long sequences.

References:

1. Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010(26): 680-682.

2. Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006(22): 1658-1659.

3. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 2002(18): 77-82.

4. Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001(17): 282-283.

Developed by @Zhipeng He