cmbuild(1) Infernal Manual cmbuild(1)

NAME
cmbuild - construct a CM from an RNA multiple sequence alignment

SYNOPSIS
cmbuild [options] cmfile alifile

DESCRIPTION
cmbuild reads an RNA multiple sequence alignment from alifile, constructs a covariance model (CM), and saves the CM to cmfile.

The alignment file must be in Stockholm format, and must contain consensus secondary structure annotation. cmbuild uses the consensus
structure to determine the architecture of the CM.
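
As a rough pre-flight check, the sketch below scans a Stockholm file's text for a "#=GC SS_cons" line, the per-column annotation tag that Stockholm uses for consensus secondary structure. The function name and the tiny example alignment are illustrative only; this is not how cmbuild itself validates its input.

```python
def has_consensus_structure(stockholm_text):
    """Return True if any line is a '#=GC SS_cons' per-column annotation."""
    for line in stockholm_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "#=GC" and fields[1] == "SS_cons":
            return True
    return False


# Made-up two-sequence hairpin alignment, used only for illustration.
EXAMPLE = """\
# STOCKHOLM 1.0
#=GF ID tiny_hairpin
seq1          GGGAAAUCCC
seq2          GGCAAAUGCC
#=GC SS_cons  <<<....>>>
//
"""

if __name__ == "__main__":
    print(has_consensus_structure(EXAMPLE))
```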

The alignment file may be a database containing more than one alignment. If so, the resulting cmfile will be a database of CMs, one per
alignment.

The expert options --ctarget, --cmaxid, and --call result in multiple CMs being built from each alignment in alifile as described below.

OUTPUT
The default output from cmbuild is tabular, with a single line printed for each model. Each line has the following fields:

aln               the index of the alignment used to build the CM
cm idx            the index of the CM in the cmfile
name              the name of the CM
nseq              the number of sequences in the alignment used to build the CM
eff_nseq          the effective number of sequences used to build the model (see the User's Guide)
alen              the length of the alignment used to build the CM
clen              the number of columns from the alignment defined as consensus columns
rel entropy, CM   the total relative entropy of the model divided by the number of consensus columns
rel entropy, HMM  the total relative entropy of the model ignoring secondary structure, divided by the number of consensus columns

OPTIONS
-h Print brief help; includes version number and summary of all options, including expert options.

-n <s> Name the covariance model <s>. (Does not work if alifile contains more than one alignment.) The default is to use the name of the
alignment (given by the #=GF ID tag, in Stockholm format), or if that is not present, to use the name of the alignment file minus
any file type extension plus a "-" and a positive integer indicating the position of that alignment in the file (that is, the first
alignment in a file "myrnas.sto" would give a CM named "myrnas-1", the second alignment would give a CM named "myrnas-2").
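
The default naming rule can be sketched as follows. The function and parameter names are hypothetical, and stripping any leading directory from the path is an assumption; the text above only specifies removing the file type extension and appending "-" plus the alignment's position.

```python
import os


def default_cm_name(alifile, aln_index, gf_id=None):
    """Mimic the default CM naming rule described for -n.

    gf_id     -- value of the #=GF ID tag, or None if the alignment has none
    aln_index -- 1-based position of the alignment within alifile
    """
    if gf_id:
        return gf_id
    # Assumption: directory components are dropped along with the extension.
    stem, _ext = os.path.splitext(os.path.basename(alifile))
    return f"{stem}-{aln_index}"


if __name__ == "__main__":
    print(default_cm_name("myrnas.sto", 1))
    print(default_cm_name("myrnas.sto", 2))
```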

-A Append the CM to cmfile, if cmfile already exists.

-F Allow cmfile to be overwritten. Normally, if cmfile already exists, cmbuild exits with an error unless the -A or -F option is set.

-v Run in verbose output mode instead of using the default single line tabular format. This output format is similar to that used by
older versions of Infernal.

--iins Allow informative insert emissions for the CM. By default, all CM insert emission scores are set to 0.0 bits. The motivation for
zero bit scores is to avoid high-scoring hits to low complexity sequence favored by high insert state emission scores.

--Wbeta <x>
Set the beta tail loss probability for query-dependent banding (QDB) to <x>. The QDB algorithm is used to determine the maximum
length of a hit to the model. For more information on QDB see (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56). The beta
parameter is the amount of probability mass considered negligible during band calculation; higher values of beta will result in
shorter maximum hit lengths, which will yield faster searches. The default beta is 1E-7, determined empirically as a good tradeoff
between sensitivity, specificity and speed.
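
The idea of a tail loss probability can be illustrated with a toy calculation: given a probability distribution over hit lengths, trim the longest lengths for as long as the discarded mass stays below beta, and report the longest length that survives. This is only a sketch on a single made-up distribution; the real QDB algorithm computes bands per model state, and the function name here is hypothetical.

```python
def max_hit_length(length_probs, beta=1e-7):
    """length_probs[d] is the probability of a hit of length d.

    Drop lengths from the long end while the total discarded probability
    mass stays below beta; return the longest length that is kept.
    """
    tail = 0.0
    w = len(length_probs) - 1
    while w > 0 and tail + length_probs[w] < beta:
        tail += length_probs[w]
        w -= 1
    return w


if __name__ == "__main__":
    # Made-up distribution over lengths 0..5 (sums to 1.0).
    probs = [0.0, 0.5, 0.25, 0.125, 0.0625, 0.0625]
    for beta in (0.01, 0.1, 0.2):
        print(beta, max_hit_length(probs, beta))
```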

--devhelp
Print help, as with -h, but also include undocumented developer options. These options are not listed below. They are under
development or experimental, and are not guaranteed to work correctly. Use developer options at your own risk. The only resources
for understanding what they actually do are the brief one-line descriptions printed when --devhelp is enabled, and the source code.

EXPERT OPTIONS
--rsearch <f>
Parameterize emission scores a la RSEARCH, using the RIBOSUM matrix in file <f>. (Actually, the emission scores will not be
identical to RIBOSUM scores because of differences in the modelling strategy between Infernal and RSEARCH, but they will be as
similar as possible.) RIBOSUM matrix files are included with Infernal in the "matrices/" subdirectory of the top-level Infernal
directory. RIBOSUM matrices are substitution score matrices trained specifically for structural RNAs, with separate single stranded
residue and base pair substitution scores. For more information see the RSEARCH publication (Klein and Eddy, BMC Bioinformatics
4:44, 2003).

With --rsearch enabled, all alignments in alifile must contain exactly one sequence or the --call option must also be enabled.

--binary
Save the model in a compact binary format. The default is a more readable ASCII text format.

--rf Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which columns are consensus, and which are inserts.
Any non-gap character indicates a consensus column. (For example, mark consensus columns with "x", and insert columns with ".".)
The default is to determine this automatically; if the frequency of gap characters in a column is greater than a threshold,
gapthresh (default 0.5), the column is called an insertion.

--gapthresh <x>
Set the gap threshold (used for determining which columns are insertions versus consensus; see --rf above) to <x>. The default is
0.5.
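
The automatic rule (used when --rf is not given) can be sketched like this. It assumes an unweighted gap frequency and the gap characters "-", ".", "_" and "~"; both are assumptions made for illustration, and the function name is hypothetical.

```python
def consensus_columns(aligned_seqs, gapthresh=0.5, gaps="-._~"):
    """Return one boolean per alignment column: True means consensus.

    A column is called an insertion when its gap frequency is strictly
    greater than gapthresh, as described for --rf and --gapthresh.
    """
    nseq = len(aligned_seqs)
    ncols = len(aligned_seqs[0])
    result = []
    for col in range(ncols):
        ngap = sum(1 for seq in aligned_seqs if seq[col] in gaps)
        result.append(ngap / nseq <= gapthresh)
    return result


if __name__ == "__main__":
    alignment = ["AC-U", "A--U", "AG-U", "A-GU"]
    print(consensus_columns(alignment))
```

In this toy alignment the second column has a gap frequency of exactly 0.5, so it stays a consensus column under the default threshold, while the third column (0.75) becomes an insert.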

--ignorant
Strip all base pair secondary structure information from all input alignments in alifile before building the CM(s). All resulting
CM(s) will have zero MATP (base pair) nodes, with zero bifurcations.

--wgsc Use the Gerstein/Sonnhammer/Chothia (GSC) weighting algorithm. This is the default unless the number of sequences in the alignment
exceeds a cutoff (see --pbswitch), in which case the default becomes the faster Henikoff position-based weighting scheme.

--wblosum
Use the BLOSUM filtering algorithm to weight the sequences, instead of the default GSC weighting. Cluster the sequences at a given
percentage identity (see --wid); assign each cluster a total weight of 1.0, distributed equally amongst the members of that cluster.
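
The weight assignment step can be sketched as follows, taking the clusters as given (the clustering itself is not shown). The function name and the representation of clusters as lists of sequence indices are assumptions for illustration.

```python
def blosum_weights(clusters, nseq):
    """Give each cluster a total weight of 1.0, shared equally by its members.

    clusters -- list of clusters, each a list of 0-based sequence indices
    nseq     -- total number of sequences in the alignment
    """
    weights = [0.0] * nseq
    for members in clusters:
        for i in members:
            weights[i] = 1.0 / len(members)
    return weights


if __name__ == "__main__":
    # Sequences 0 and 1 cluster together; sequence 2 is on its own.
    print(blosum_weights([[0, 1], [2]], 3))
```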

--wpb Use the Henikoff position-based weighting scheme. This weighting scheme is automatically used (overriding --wgsc and --wblosum) if
the number of sequences in the alignment exceeds a cutoff (see --pbswitch).

--wnone
Turn sequence weighting off; i.e., explicitly set all sequence weights to 1.0.

--wgiven
Use sequence weights as given in annotation in the input alignment file. If no weights were given, assume they are all 1.0. The
default is to determine new sequence weights by the Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.

--pbswitch <n>
Set the cutoff for automatically switching the weighting method to the Henikoff position-based weighting scheme to <n>. If the
number of sequences in the alignment exceeds <n>, Henikoff weighting is used. By default <n> is 5000.
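
The switching rule can be summarized in a few lines. The scheme labels and function name are hypothetical; the text above states only that position-based weighting overrides --wgsc and --wblosum, so leaving --wnone and --wgiven untouched is an assumption.

```python
def choose_weighting(nseq, requested="gsc", pbswitch=5000):
    """Pick the sequence weighting scheme for an alignment of nseq sequences.

    requested -- "gsc", "blosum", "pb", "none" or "given" (illustrative labels)
    """
    # Only GSC and BLOSUM are documented as being overridden by the cutoff.
    if requested in ("gsc", "blosum") and nseq > pbswitch:
        return "pb"
    # Assumption: other requested schemes are left as they are.
    return requested


if __name__ == "__main__":
    for n in (100, 5000, 5001):
        print(n, choose_weighting(n))
```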

--wid <x>
Controls the behavior of the --wblosum weighting option by setting the percent identity for clustering the alignment to <x>.

--eent Use the entropy weighting strategy to determine the effective sequence number that gives a target mean match state relative entropy.
This option is the default, and can be turned off with --enone. The default target mean match state relative entropy is 0.59 bits
but can be changed with --ere. The default of 0.59 bits is automatically changed if the total relative entropy of the model (summed
match state relative entropy) is less than a cutoff, which is 6.0 bits by default, but can be changed with the expert,
undocumented --eX option. If you really want to play with that option, consult the source code.

--enone
Turn off the entropy weighting strategy. The effective sequence number is just the number of sequences in the alignment.

--ere <x>
Set the target mean match state relative entropy as <x>. By default the target relative entropy per match position is 0.59 bits.

--null <f>
Read a null model from <f>. The null model defines the probability of each RNA nucleotide in background sequence; the default is to
use 0.25 for each nucleotide. The format of null files is documented in the User's Guide.
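
To show how a null model enters the relative entropy figures mentioned above, the sketch below computes the relative entropy, in bits, of a single-residue emission distribution against a background distribution. It covers one single-stranded match position only; how Infernal handles base-pair states and averages over the model is not shown, and the names are hypothetical.

```python
import math

# Default background described above: 0.25 for each nucleotide.
UNIFORM_NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}


def relative_entropy_bits(emissions, null=UNIFORM_NULL):
    """Sum of p * log2(p / q) over residues with non-zero probability p."""
    return sum(
        p * math.log2(p / null[residue])
        for residue, p in emissions.items()
        if p > 0.0
    )


if __name__ == "__main__":
    print(relative_entropy_bits({"A": 1.0, "C": 0.0, "G": 0.0, "U": 0.0}))
    print(relative_entropy_bits({"A": 0.5, "C": 0.5, "G": 0.0, "U": 0.0}))
    print(relative_entropy_bits(UNIFORM_NULL))
```

A position that always emits the same residue scores 2 bits against the uniform background, and a position identical to the background scores 0 bits.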

--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture Dirichlet. The format of prior files is documented in the User's
Guide.

--ctarget <n>
Cluster each alignment in alifile by percent identity. Find a cutoff percent id threshold that gives exactly <n> clusters and build
a separate CM from each cluster. If <n> is greater than the number of sequences in the alignment, the program will not complain, and
each sequence in the alignment will be its own cluster. Each CM will have a positive integer appended to its name indicating the
order in which it was built. For example, if cmbuild --ctarget 3 is called with alifile "myrnas.sto", and "myrnas.sto" has exactly
one Stockholm alignment in it with no #=GF ID tag annotation, three CMs will be built, the first will be named "myrnas-1.1", the
second, "myrnas-1.2", and the third "myrnas-1.3". (As explained above for the -n option, the first number "1" after "myrnas" indi-
cates the CM was built from the first alignment in "myrnas.sto".)

--cmaxid <x>
Cluster each sequence alignment in alifile by percent identity. Define clusters at the cutoff fractional id similarity of <x> and
build a separate CM from each cluster. No two sequences will be more than <x> fractionally identical (<x> * 100 percent
identical) if those two sequences are in different clusters. The CMs are named as described above for --ctarget.
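
The clustering guarantee above (no two sequences in different clusters are more than <x> identical) is what single-linkage clustering provides, so a minimal sketch can use that. The pairwise identity definition (matching aligned residues divided by the shorter ungapped length), the gap characters, and all function names are assumptions for illustration, not Infernal's actual code.

```python
def fractional_identity(a, b, gaps="-._~"):
    """Identical aligned residues divided by the shorter ungapped length."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x not in gaps)
    len_a = sum(1 for x in a if x not in gaps)
    len_b = sum(1 for y in b if y not in gaps)
    shorter = min(len_a, len_b)
    return matches / shorter if shorter else 0.0


def cluster_by_max_id(aligned_seqs, max_id):
    """Single-linkage clustering: link any pair more than max_id identical."""
    n = len(aligned_seqs)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if fractional_identity(aligned_seqs[i], aligned_seqs[j]) > max_id:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    seqs = ["ACGUACGU", "ACGUACGA", "UUUUCCCC"]
    clusters = cluster_by_max_id(seqs, 0.8)
    # CM names follow the --ctarget convention: base name, ".", cluster number.
    names = [f"myrnas-1.{k}" for k in range(1, len(clusters) + 1)]
    print(clusters, names)
```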

--call Build a separate CM from each sequence in each alignment in alifile. Naming of CMs takes place as described above for --ctarget.
Using this option in combination with --rsearch causes a separate CM to be built and parameterized using a RIBOSUM matrix for each
sequence in alifile.

--corig
After building multiple CMs using --ctarget, --cmaxid or --call as described above, build a final CM using the complete original
alignment from alifile. The CMs are named as described above for --ctarget with the exception of the final CM built from the origi-
nal alignment which is named in the default manner, without an appended integer.

--cdump <f>
Dump the multiple alignments of each cluster to <f> in Stockholm format. This option only works in combination with --ctarget,
--cmaxid or --call.

--refine <f>
Attempt to refine the alignment before building the CM using expectation-maximization (EM). A CM is first built from the initial
alignment as usual. Then, the sequences in the alignment are realigned optimally to the CM (with the HMM banded CYK algorithm;
"optimal" here means optimal given the bands), and a new CM is built from the resulting alignment. The sequences are then realigned
to the new CM, and a new CM is built from that alignment. This is continued until convergence, specifically when the alignments for
two successive iterations are not significantly different (the summed bit score of all the sequences in the alignment changes by less than 1%
between two successive iterations). The final alignment (the alignment used to build the CM that gets written to cmfile) is written
to <f>.
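
The control flow of the refinement loop can be sketched as below. The build_cm and realign callables are placeholders supplied by the caller, not Infernal functions; measuring the 1% change relative to the previous iteration's score and the max_iter safety cap are both assumptions.

```python
def has_converged(prev_total, cur_total, tol=0.01):
    """True if the summed bit score changed by less than tol (relative)."""
    if prev_total == 0:
        return cur_total == 0
    return abs(cur_total - prev_total) / abs(prev_total) < tol


def refine(alignment, build_cm, realign, max_iter=50):
    """Alternate realignment and model building until scores stabilize.

    build_cm(alignment) -> cm
    realign(cm, alignment) -> (new_alignment, summed_bit_score)
    Returns (cm, alignment, number_of_iterations).
    """
    cm = build_cm(alignment)
    prev_score = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        alignment, score = realign(cm, alignment)
        cm = build_cm(alignment)
        if prev_score is not None and has_converged(prev_score, score):
            break
        prev_score = score
    return cm, alignment, iteration


if __name__ == "__main__":
    # Toy stand-ins: the "model" is a string and the scores are scripted.
    scores = iter([50.0, 80.0, 80.5, 81.0])
    result = refine("aln", lambda aln: "cm", lambda cm, aln: (aln, next(scores)))
    print(result)
```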

--gibbs
Modifies the behavior of --refine so Gibbs sampling is used instead of EM. The difference is that during the alignment stage the
alignment is not necessarily optimal; instead, an alignment (parsetree) for each sequence is sampled from the posterior distribution
of alignments as determined by the Inside algorithm. Due to this sampling step --gibbs is non-deterministic, so different runs with
the same alignment may yield different results. This is not true when --refine is used without the --gibbs option, in which case the
final alignment and CM will always be the same. When --gibbs is enabled, the -s <n> option can be used to seed the random number
generator predictably, making the results reproducible. The goal of the --gibbs option is to help expert RNA alignment curators
refine structural alignments by allowing them to observe alternative high scoring alignments.

-s <n> Set the random seed to <n>, where <n> is a positive integer. This option can only be used in combination with --gibbs. The default
is to use time() to generate a different seed for each run, which means that two different runs of cmbuild --refine <f> --gibbs on
the same alignment will give slightly different results. You can use this option to generate reproducible results.
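
The difference between picking the optimal alignment and sampling one, and the effect of fixing the seed, can be illustrated with a toy posterior over three candidate alignments. The candidate names and probabilities are made up, and Python's random module stands in for cmbuild's own random number generator.

```python
import random

# Made-up posterior probabilities for three candidate alignments.
POSTERIOR = {"aln_a": 0.6, "aln_b": 0.3, "aln_c": 0.1}


def pick_optimal(posterior):
    """Deterministic choice: always the highest-probability alignment."""
    return max(posterior, key=posterior.get)


def sample_alignment(posterior, rng):
    """Stochastic choice: draw an alignment in proportion to its probability."""
    names = list(posterior)
    weights = [posterior[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(pick_optimal(POSTERIOR))
    # Two generators with the same seed produce the same sequence of draws.
    first = random.Random(42)
    second = random.Random(42)
    print([sample_alignment(POSTERIOR, first) for _ in range(5)])
    print([sample_alignment(POSTERIOR, second) for _ in range(5)])
```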

-l With --refine, turn on the local alignment algorithm, which allows the alignment to span two or more subsequences if necessary (e.g.
if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions
in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target
sequences.

-a With --refine, print the scores of each individual sequence alignment.

--cyk With --refine, align with the CYK algorithm. By default the optimal accuracy algorithm is used. There is more information on this in
the cmalign manual page.

--sub With --refine, turn on the sub model construction and alignment procedure. For each sequence to be realigned an HMM is first used to
predict the model start and end consensus columns, and a new sub CM is constructed that only models consensus columns from start to
end. The sequence is then aligned to this sub CM. This option is useful for building CMs for alignments with sequences that are
known to be truncated, non-full-length sequences. This option is experimental and not rigorously tested; use at your own risk. This
"sub CM" procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

--nonbanded
With --refine, do not use HMM bands to accelerate alignment. Use the full CYK algorithm, which is guaranteed to give the optimal
alignment. This will slow down the run significantly, especially for large models.

--tau <x>
With --refine, set the tail loss probability used during HMM band calculation to <x>. This is the amount of probability mass within
the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in
greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands.

--fins With --refine, change the behavior of how insert emissions are placed in the alignment. By default, all contiguous blocks of
inserts are split in half, and half the residues are flushed left against the nearest consensus column to the left, and half are
flushed right against the nearest consensus column on the right. With --fins inserts are not split in half; instead, all inserted
residues from IL states are flushed left and all inserted residues from IR states are flushed right. This was the default
behavior of previous versions of Infernal.
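
The two placement styles can be pictured with a small string-formatting sketch for the insert region between two consensus columns. The gap character, the column widths, and sending the extra residue of an odd-length block to the left are all assumptions made for illustration; the function names are hypothetical.

```python
def place_split(residues, width, gap="."):
    """Default style: half the block flush left, half flush right."""
    # Assumption: with an odd count, the extra residue goes to the left half.
    left_n = (len(residues) + 1) // 2
    left, right = residues[:left_n], residues[left_n:]
    return left + gap * (width - len(residues)) + right


def place_fins(il_residues, ir_residues, il_width, ir_width, gap="."):
    """--fins style: IL residues flush left, IR residues flush right."""
    return il_residues.ljust(il_width, gap) + ir_residues.rjust(ir_width, gap)


if __name__ == "__main__":
    print(place_split("ACGU", 8))
    print(place_fins("AC", "G", 4, 3))
```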

--mxsize <x>
With --refine, set the maximum allowable matrix size for alignment to <x> megabytes. By default this size is 2 GB. This should be
large enough for the vast majority of alignments; however, it is possible that when run with --refine, cmbuild will exit prematurely,
reporting an error message that the matrix exceeded its maximum allowable size. In this case, the --mxsize option can be used to raise the
limit.

--rdump <f>
With --refine, output the intermediate alignments at each iteration of the refinement procedure (as described above for --refine)
to file <f>.

SEE ALSO
For complete documentation, see the User's Guide (Userguide.pdf) that came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.

COPYRIGHT
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/

Infernal 1.0.2 October 2009 cmbuild(1)
