psi-cd-hit(1) [debian man page]

PSI-CD-HIT.PL(1)						   User Commands						  PSI-CD-HIT.PL(1)

NAME

       psi-cd-hit.pl - runs an algorithm similar to CD-HIT, but uses BLAST to calculate similarities

DESCRIPTION

       Usage: psi-cd-hit [Options]

       Options

       -i     in_dbname, required

       -o     out_dbname, required

       -c     clustering threshold (sequence identity), default 0.3

       -ce    clustering threshold (blast expect), default -1

              by default (-1) the expect threshold is not used; with a positive value, the program clusters sequences whose similarity
              meets either the identity threshold or the expect threshold

       -L     coverage of shorter sequence ( aligned / full), default 0.0

       -M     coverage of longer sequence ( aligned / full), default 0.0

       -R     (1/0) use psi-blast profile? default 0; if set to 1, perform a psi-blast / pdb-blast type search

       -G     (1/0) use global identity? default 1; sequence identity is calculated as

	      total identical residues of local alignments / length of shorter seq

	      if you prefer to use -G 0, it is suggested that you also use -L, such as -L 0.8, to prevent very short matches.

       -d     length of description line in the .clstr file, default 30; if set to 0, it takes the fasta defline and stops at the first space

       -l     length_of_throw_away_sequences, default 10

       -p     profile search parameters, default

	      "-a 2 -d nr80 -j 3 -F F -e 0.001 -b 500 -v 500"

       -bfdb  profile database, default nr80

       -s     blast search parameters, default

              "-F F -e 0.000001 -b 100000 -v 100000"

       -be    blast expect cutoff, default 0.000001

       -b     filename of a list of hosts on which to run this program in parallel through ssh calls; you need to provide the list of hosts

       -pbs   number of jobs to submit each time through the PBS queuing system

              you cannot use both ssh and pbs at the same time

       -k     (1/0) keep blast raw output file, default 1

       -rs    number of sequences between saves of the restart file and clustering output, default 5000

              each time 5000 sequences have been processed, the program writes a restart file and the current clustering information

       -restart
              restart file; read in a restart file

              if the program crashed, stopped, or was terminated, you can restart it by adding the option "-restart sth.restart"

       -rf    number of clustered sequences between reformats of the blast database, default 200,000

              once the program has clustered 200,000 sequences, it removes them from the sequence pool and reformats the blast db to save time

       -local directory of local blast db

              when running in parallel with ssh (not pbs), the program can copy blast dbs to local drives on each node to save blast db reading
              time, BUT IT MAY NOT BE FASTER

       -J     job, job_file: execute specific jobs, such as parsing blast output only. DON'T use it; it is only used by the program itself

       -single
              file of ids of sequences that you know are singletons, so the program won't run them as queries

	      ============================== by Weizhong Li, liwz@sdsc.edu ==============================

              If you find cd-hit useful, please kindly cite:

              "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam
              Godzik. Bioinformatics, (2001) 17:282-283

              "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik.
              Bioinformatics, (2006) 22:1658-1659
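
EXAMPLES

       The -c, -ce, -L, -M and -G 1 rules listed under Options can be illustrated with a minimal Python sketch. It is not taken from
       psi-cd-hit.pl: the Hsp record, the should_cluster function, the assumption that local alignments do not overlap, and the way the
       coverage and threshold checks are combined are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One local BLAST alignment (hypothetical record for this sketch)."""
    identical: int      # identical residues in the alignment
    aligned_short: int  # residues of the shorter sequence that are aligned
    aligned_long: int   # residues of the longer sequence that are aligned
    expect: float       # BLAST expect value of the alignment


def should_cluster(hsps, len_short, len_long, c=0.3, ce=-1.0, L=0.0, M=0.0):
    """Return True if the pair passes the -c / -ce / -L / -M checks (-G 1)."""
    if not hsps:
        return False
    # -G 1: total identical residues of local alignments / length of shorter seq
    identity = sum(h.identical for h in hsps) / len_short
    # -L and -M: aligned / full length (assumes the alignments do not overlap)
    cov_short = sum(h.aligned_short for h in hsps) / len_short
    cov_long = sum(h.aligned_long for h in hsps) / len_long
    if cov_short < L or cov_long < M:
        return False
    # -ce <= 0 disables the expect test; otherwise either test is sufficient
    meets_expect = ce > 0 and min(h.expect for h in hsps) <= ce
    return identity >= c or meets_expect


hsps = [Hsp(30, 60, 62, 1e-8), Hsp(15, 30, 30, 1e-3)]
print(should_cluster(hsps, 100, 200))                  # identity 45/100 meets -c 0.3
print(should_cluster(hsps, 100, 200, c=0.5))           # identity alone fails -c 0.5
print(should_cluster(hsps, 100, 200, c=0.5, ce=1e-6))  # passes through -ce instead
print(should_cluster(hsps, 100, 200, L=0.95))          # shorter-seq coverage is 90/100
```

       The third call shows the effect of a positive -ce value: the pair fails the identity threshold but is still clustered because
       one alignment meets the expect threshold.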

psi-cd-hit.pl 4.6-2012-04-25					    April 2012							  PSI-CD-HIT.PL(1)

CD-HIT-2D(1)							   User Commands						      CD-HIT-2D(1)

NAME

       cdhit-2d - quickly group sequences in db1 or db2 format

SYNOPSIS

       cdhit-2d [Options]

DESCRIPTION

	      ====== CD-HIT version 4.6 (built on Apr 26 2012) ======

       Options

       -i     input filename for db1 in fasta format, required

       -i2    input filename for db2 in fasta format, required

       -o     output filename, required

       -c     sequence identity threshold, default 0.9; this is cd-hit's default "global sequence identity", calculated as: number of identical
              amino acids in alignment divided by the full length of the shorter sequence

       -G     use global sequence identity, default 1; if set to 0, then use local sequence identity, calculated as: number of identical amino
              acids in alignment divided by the length of the alignment. NOTE!!! don't use -G 0 unless you use alignment coverage controls; see
              options -aL, -AL, -aS, -AS

       -b     band_width of alignment, default 20

       -M     memory limit (in MB) for the program, default 800; 0 for unlimited

       -T     number of threads, default 1; with 0, all CPUs will be used

       -n     word_length, default 5, see user's guide for choosing it

       -l     length of throw_away_sequences, default 10

       -t     tolerance for redundance, default 2

       -d     length of description in .clstr file, default 20; if set to 0, it takes the fasta defline and stops at the first space

       -s     length difference cutoff, default 0.0; if set to 0.9, the shorter sequences need to be at least 90% of the length of the
              representative of the cluster

       -S     length difference cutoff in amino acids, default 999999; if set to 60, the length difference between the shorter sequences and
              the representative of the cluster cannot be bigger than 60

       -s2    length difference cutoff for db1, default 1.0; by default, seqs in db1 >= seqs in db2 in the same cluster; if set to 0.9, seqs in
              db1 may be just >= 90% of the length of seqs in db2

       -S2    length difference cutoff, default 0; by default, seqs in db1 >= seqs in db2 in the same cluster; if set to 60, seqs in db2 may be
              60 aa longer than seqs in db1

       -aL    alignment coverage for the longer sequence, default 0.0; if set to 0.9, the alignment must cover 90% of the sequence

       -AL    alignment coverage control for the longer sequence, default 99999999; if set to 60, and the length of the sequence is 400, then
              the alignment must be >= 340 (400-60) residues

       -aS    alignment coverage for the shorter sequence, default 0.0; if set to 0.9, the alignment must cover 90% of the sequence

       -AS    alignment coverage control for the shorter sequence, default 99999999; if set to 60, and the length of the sequence is 400, then
              the alignment must be >= 340 (400-60) residues

       -A     minimal alignment coverage control for both sequences, default 0; the alignment must cover >= this value for both sequences

       -uL    maximum unmatched percentage for the longer sequence, default 1.0; if set to 0.1, the unmatched region (excluding leading and
              trailing gaps) must not be more than 10% of the sequence

       -uS    maximum unmatched percentage for the shorter sequence, default 1.0; if set to 0.1, the unmatched region (excluding leading and
              trailing gaps) must not be more than 10% of the sequence

       -U     maximum unmatched length, default 99999999; if set to 10, the unmatched region (excluding leading and trailing gaps) must not be
              more than 10 bases

       -B     1 or 0, default 0; by default, sequences are stored in RAM; if set to 1, sequences are stored on the hard drive. It is recommended
              to use -B 1 for huge databases

       -p     1 or 0, default 0; if set to 1, print alignment overlap in .clstr file

       -g     1 or 0, default 0; by cd-hit's default algorithm, a sequence is clustered to the first cluster that meets the threshold (fast
              cluster). If set to 1, the program will cluster it into the most similar cluster that meets the threshold (accurate but slow
              mode), but either 1 or 0 won't change the representatives of the final clusters

       -bak   write backup cluster file (1 or 0, default 0)

       -h     print this help

	      Questions, bugs, contact Weizhong Li at liwz@sdsc.edu

              If you find cd-hit useful, please kindly cite:

              "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam
              Godzik. Bioinformatics, (2001) 17:282-283

              "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik.
              Bioinformatics, (2006) 22:1658-1659
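
EXAMPLES

       The identity definitions of -c and -G, the coverage controls -aL, -AL, -aS and -AS, and the length cutoffs -s and -S can be
       illustrated with a short Python sketch. It is not cd-hit-2d's own code: the function names identity, coverage_ok and length_ok
       and the parameter name slack are hypothetical, and the sketch assumes an ungapped alignment, so that the alignment length equals
       the number of residues covered on each sequence.

```python
def identity(identical, aln_len, len_short, use_global=True):
    """-G 1 divides by the shorter sequence length, -G 0 by the alignment length."""
    return identical / (len_short if use_global else aln_len)


def coverage_ok(aln_len, seq_len, frac=0.0, slack=99999999):
    """-aL/-aS: alignment covers >= frac of the sequence;
    -AL/-AS: alignment is >= seq_len - slack residues."""
    return aln_len >= frac * seq_len and aln_len >= seq_len - slack


def length_ok(len_seq, len_rep, s=0.0, S=999999):
    """-s: sequence is >= s * representative length; -S: difference <= S."""
    return len_seq >= s * len_rep and len_rep - len_seq <= S


# A 20-residue alignment with 19 identities between a 100 aa and a 400 aa sequence
print(identity(19, 20, 100, use_global=False) >= 0.9)  # True: 19/20 locally
print(identity(19, 20, 100, use_global=True) >= 0.9)   # False: 19/100 globally
print(coverage_ok(20, 100, frac=0.8))                  # False: -aS 0.8 needs 80 residues
print(coverage_ok(340, 400, slack=60))                 # True: -AL 60 allows 400-60
print(length_ok(350, 400, s=0.9))                      # False: below 90% of 400
```

       The first three calls show why -G 0 is paired with coverage controls: a very short match can pass a local identity threshold
       of 0.9 while covering only a fifth of the shorter sequence.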

cd-hit-2d 4.6-2012-04-25					    April 2012							      CD-HIT-2D(1)
