mcl(1)								  USER COMMANDS 							    mcl(1)

  NAME
      mcl - The Markov Cluster Algorithm, aka the MCL algorithm.

      This  program  implements  mcl, a cluster algorithm for graphs. A single parameter controls the granularity of the output clustering, namely
      the -I inflation option described further below.	In standard usage of the program this parameter is the only one that may require changing.
      By  default it is set to 2.0 and this is a good way to start. If you want to explore cluster structure in graphs with MCL, vary this parame-
      ter to obtain clusterings at different levels of granularity. A good set of starting values is 1.4, 2, 4, and 6.

      The program has a rather large set of options. Except for -I, none of them affects the clustering method itself. The other options are for a variety
      of  aspects,  such  as study of the underlying MCL process (i.e.	dumping of iterands), network preprocessing (incorporated for efficiency),
      resource allocation options (for large-scale analyses), output naming and placement, output formatting, setting of verbosity levels, and	so
      on.

      Network construction and reduction techniques should not be considered as part of a clustering algorithm. Nevertheless particular techniques
      may benefit particular methods or applications. In mcl many transformations are accessible through the -tf option. It can be used  for  edge
      weight transformations and selection, as well as transformations that act on a graph as a whole.	It is for example possible to remove edges
      with weight below 0.7 by issuing -tf 'gq(0.7)', where the quotes are necessary to prevent the shell from interpreting the  parentheses.  The
      option  accepts  more  complicated  sequences,  such as -tf 'gq(0.7),add(-0.7)'. This causes all remaining edge weights to be shifted to the
      range [0-0.3], assuming that the input contains correlations. Many more transformations are supported, as documented in  mcxio(5).  Examples
      of  graph-wide  transformations are '#knn(<num>)' and '#ceilnb(<num>)'. The first only keeps those edges that occur in the list of top-<num>
      edges of highest weight in both of its incident nodes. The second removes edges from nodes of highest degree  first,  proceeding	until  all
      node  degrees satisfy the given threshold.  The -pi (pre-inflation) option can be used to increase the contrast in edge weights. This may be
      useful when clusterings are coarse and fine-grained clusterings are difficult to obtain.
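
      As a minimal sketch (the file name correlations.abc and the exact inflation value are assumptions made for illustration only), the transfor-
      mations above can be combined in a single invocation on label input:

	 mcl correlations.abc --abc -I 2.0 -tf 'gq(0.7),add(-0.7)' -o out.correlations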

  GETTING STARTED
      There are two main modes of invocation. The most accessible is label mode which assumes a format alternatively called label  input  or  ABC-
      format.	The input is then a file or stream in which each line encodes an edge in terms of two labels (the 'A' and the 'B') and a numerical
      value (the 'C'), all separated by white space. The most basic example of usage is this:

	 mcl <-|fname> --abc -o fname-out

      The output is then a file where each line is a cluster of tab-separated labels.  If clustering is part of a  larger  workflow  where  it	is
      desirable  to analyse and compare multiple clusterings, then it is a good idea to use native mode rather than ABC mode.  The reason for this
      is that native mode is understood by all programs in the mcl suite. It is a more stringent and unambiguous format, and hence  more  suitable
      for data exchange.  The reader is referred to clmprotocols(5) for more information.

  SYNOPSIS
      The  example  invocation	below  assumes	matrix	input,	as  explained above and described in the mcxio(5) section. Switching to label mode
      requires the input file to be in ABC-format and the addition of the --abc option.

      mcl <-|fname> [-I <num> (inflation)] [-o <str> (fname)]

      These options are sufficient in 95 percent of the cases or more. The first argument must be the name of a file containing a graph/matrix	in
      the mcl input format, or a hyphen to read from STDIN. With respect to clustering, the -I option is the most relevant one.

      The  full  listing  of  mcl  options  is shown further below, separated into parts corresponding with functional aspects such as clustering,
      threading, verbosity, network preprocessing, pruning and resource management, automatic output naming, and dumping.

      Baseline clustering options
      [-I <num> (inflation)] [-o <fname> (fname)]

      Output options
      [-odir <dname> (directory)] [--d (use input directory for output)]

      Input options
      [--abc (expect/write labels)] [--sif (expect/write labels)] [--etc (expect/write labels)] [--expect-values (sif or etc stream contains  val-
      ues)] [-use-tab <fname> (use mapping to write)]

      Transform options
      [-tf  <tf-spec>  (transform input matrix values)] [-abc-tf <tf-spec> (transform input stream values)] [--abc-neg-log10 (take log10 of stream
      values, negate sign)] [--abc-neg-log (take log of stream values, negate sign)]

      Cache options
      [-write-graph <fname> (write graph)] [-write-graphx <fname> (write transformed graph)]  [-write-expanded	<fname>  (write  expanded  graph)]
      [--write-limit (write mcl process limit)]

      Input manipulation options
      [-pi  <num>  (pre-inflation)] [-ph <num> (pre-inflation, max-bound)] [-if <num> (start-inflation)] [--discard-loops=<y/n> (discard y/n loops
      in input)] [--sum-loops (set loops to sum of other arc weights)] [-c <num> (reweight loops)]

      Clustering processing options
      [-sort <str> (sort mode)] [-overlap <str> (overlap mode)] [--force-connected=<y/n> (analyze components)]	[--check-connected=<y/n>  (analyze
      components)] [--analyze=<y/n> (performance criteria)] [--show-log=<y/n> (show log)]

      Verbosity options
      [-q <spec> (log levels)] [-v <str> (verbosity type on)] [-V <str> (verbosity type off)] [--show (print (small) matrices to screen)]

      Thread options
      [-te <int> (#expansion threads)]

      Output file name and annotation options
      [-o  <str> (fname)] [-ap <str> (use str as file name prefix)] [-aa <str> (append str to suffix)] [-az (show output file name and exit)] [-ax
      (show output suffix and exit)] [-annot <str> (dummy annotation option)]

      Dump options
      [-dump-interval <i:j> (dump interval)] [-dump-modulo <int> (dump modulo)] [-dump-stem <stem> (dump file stem)] [-dump <str> (type)] [-digits
      <int> (printing precision)] [--write-binary (write matrices in binary format)]

      Info options
      [--jury-charter  (explains jury)] [--version (show version)] [-how-much-ram k (RAM upper bound)] [-h (most important options)] [--help (one-
      line description for all options)] [-z (show current settings)] [-az (show output file name and exit)] [-ax (show output suffix  and  exit)]
      [--show-schemes (show resource schemes)]

      Implementation options
      [-sparse <int> (sparse matrix multiplication threshold)]

      Pruning options
      The  following  options all pertain to the various pruning strategies that can be employed by mcl. They are described in the PRUNING OPTIONS
      section, accompanied by a description of the mcl pruning strategy.  If your graphs are huge and you have an appetite for tuning, have a look
      at the following:

      [-scheme	<int> (resource scheme)] [-resource <int> (per-node resource maximum)] [-p <num> (cutoff)] [-P <int> (1/cutoff)] [-S <int> (selec-
      tion number)] [-R <int> (recovery number)] [-pct <int> (recover percentage)] [-warn-pct <int> (prune warn percentage)]  [-warn-factor  <int>
      (prune warn factor)]

      The  first  argument of mcl must be a file name, but some options are allowed to appear as the first argument instead. These are the options
      that cause mcl to print out information of some kind, after which it will gracefully exit. The full list of these options is

      -z, -h, --help, --version, --show-settings, --show-schemes, --jury-charter.

  DESCRIPTION
      mcl implements the MCL algorithm, short for the Markov cluster algorithm, a cluster algorithm for graphs developed by Stijn  van	Dongen	at
      the Centre for Mathematics and Computer Science in Amsterdam, the Netherlands. The algorithm simulates flow using two simple algebraic oper-
      ations on matrices.  The inception of this flow process and the theory behind it are described elsewhere (see REFERENCES). Frequently  asked
      questions  are  answered	in the mclfaq(7) section.  The program described here is a fast threaded implementation written by the algorithm's
      creator with contributions by several others. Anton Enright co-implemented  threading;  see  the	HISTORY/CREDITS  section  for  a  complete
      account.	 See  the  APPLICABILITY  section  for	a description of the type of graph mcl likes best, and for a qualitative assessment of its
      speed.  mcl is accompanied by several other utilities for analyzing clusterings and performing matrix and graph operations; see the SEE ALSO
      section.

      The  first  argument  is	the  input file name, or a single hyphen to read from stdin. The rationale for making the name of the input file a
      fixed parameter is that you typically do several runs with different parameters. In command line mode it is pleasant if you do not  have	to
      skip over an immutable parameter all the time.

      The  -I f  option is the main control, affecting cluster granularity.  In finding good mcl parameter settings for a particular domain, or in
      finding cluster structure at different levels of granularity, one typically runs mcl multiple times for varying values of f  (refer  to  the
      -I inflation option for further information).
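
      A minimal shell sketch of such a sweep (the input file name small.mci is assumed for illustration):

	 for I in 1.4 2 4 6; do
	    mcl small.mci -I $I
	 done

      Each run writes an automatically named output file in which the -I value is encoded (see the -ap option and its siblings).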

      NOTE  MCL  interprets  the  matrix  entries  or graph edge weights as similarities, and it likes undirected input graphs best. It can handle
      directed graphs, but any node pair (i,j) for which w(i,j) is much smaller than w(j,i) or vice versa will presumably have a slightly negative
      effect on the clusterings output by mcl. Many such node pairs will have a distinctly negative effect, so try to make your input graphs undi-
      rected. How your edge weights are computed may affect mcl's performance. In protein clustering, one way to go is to choose the negated loga-
      rithm of the BLAST probabilities (see REFERENCES).

      mcl's  default parameters should make it quite fast under almost all circumstances. Taking default parameters, mcl has been used to generate
      good protein clusters on 133k proteins, taking 10 minutes running time on a Compaq ES40 system with four alpha EV6.7 processors. It has been
      applied  (with  good results) to graphs with two million nodes, and if you have the memory (and preferably CPUs as well) nothing should stop
      you from going further.

      For large graphs, there are several groups of parameters available for tuning the mcl computing process, should it be necessary. The easiest
      thing  to  do is just vary the -scheme option. This triggers different settings for the group of pruning parameters -p/-P, -R, -S, and -pct.
      The default setting corresponds with -scheme 6.  When doing multiple mcl runs for the same graphs with different -I settings (for  obtaining
      clusterings  at different levels of granularity), it can be useful to factor out the first bit of computation that is common to all runs, by
      using the -write-expanded option one time and then using -if inflation for each run in the set.  Whether mcl considers a graph large depends
      mainly  on  the  graph  connectivity;  a	highly	connected graph on 50,000 nodes is large to mcl (so that you might want to tune resources)
      whereas a sparsely connected graph on 500,000 nodes may be business as usual.
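
      One possible sketch of this workflow (the file names are assumed, and pairing each -if value with a matching -I value is an assumption made
      for illustration; see the -if entry in the OPTIONS section):

	 mcl big.mci -I 2.0 -write-expanded big.exp
	 mcl big.exp -if 2.0 -I 2.0 -o out.big.I20
	 mcl big.exp -if 4.0 -I 4.0 -o out.big.I40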

      mcl is a memory munger. Its precise appetite depends on the resource settings. You can get a rough (and usually much too pessimistic)  upper
      bound  for the amount of RAM that is needed by using the -how-much-ram option. The corresponding entry in this manual page contains the sim-
      ple formula via which the upper bound is computed.

      Other options of interest are the option to specify threads -te, and the verbosity-related options -v and -V.  The actual settings are shown
      with -z, and for graphs with at most 12 nodes or so you can view the MCL matrix iterands on screen by supplying --show (this may give you a
      better feel for the process).

      MCL iterands allow a generic interpretation as clusterings as well. The clusterings associated with early iterands may contain a fair amount
      of  overlap.  Refer  to  the -dump option, the mclfaq(7) manual, and the clm imac utility (Interpret Matrices As Clusterings).  Use clm imac
      only if you have a special reason; the normal usage of mcl is to do multiple runs for varying -I parameters and use the  clusterings  output
      by mcl itself.

      Under very rare circumstances, mcl might get stuck in a seemingly infinite loop. If the number of iterations exceeds a hundred and the chaos
      indicator remains nearly constant (presumably around value 0.37), you can force mcl to stop by sending it the ALRM signal (usually  done	by
      kill -s ALRM pid). It will finish the current iteration, and interpret the last iterand as a clustering. Alternatively, you can wait and mcl
      might converge by itself or it will certainly stop after 10,000 iterations. The most probable explanation for such an infinite loop is  that
      the input graph contains the flip-flop graph of node size three as a subgraph.

      The  creator  of this page feels that manual pages are a valuable resource, that online html documentation is also a good thing to have, and
      that info pages are way way ahead of their time. The NOTES section explains how this page was created.

      In the OPTIONS section options are listed in order of importance, with related options grouped together.

  OPTIONS
      -I <num> (inflation)
	Sets the main inflation value to <num>. This value is the main handle for affecting cluster granularity. It is usually chosen somewhere in
	the  range  [1.2-5.0]. -I 5.0 will tend to result in fine-grained clusterings, and -I 1.2 will tend to result in very coarse grained clus-
	terings. Your mileage will vary depending on the characteristics of your data. That is why it is a good  idea  to  test  the  quality  and
	coherency of your clusterings using clm dist and clm info. This will most likely reveal that certain values of -I are simply not right for
	your data. The clm dist section contains a discussion of how to use the cluster validation tools shipped with mcl (see the SEE	ALSO  sec-
	tion).

	With  low  values  for	-I,  like  -I 1.2,  you should be prepared to use more resources in order to maintain quality of clusterings, i.e.
	increase the argument to the -scheme option.

      -o <fname> (output file name)
      -odir <dname> (output directory name)
      --d (use input directory for output)
	The default mode of output creation for mcl is to create a file name that uses the input file name stripped of	any  leading  path  compo-
	nents, augmented with a prefix 'out.' and a suffix encoding pivotal mcl parameters.  This will usually be the inflation value which is the
	argument to the -I option. By default the output file is written in the current directory. For example, if the input is named
	data/small.mci and inflation is set to three, the output file will be named out.small.mci.I30.

	This  behaviour  can be overridden in various ways. The -o option simply specifies the output file name, which may include path components
	that should exist. It is possible to send the clustering to STDOUT by supplying -o -. With the -odir <dname>  option  mcl  constructs  the
	output	file  name  as before, but writes the file in the directory <dname>.  Finally, the option --d is similar but more specific in that
	mcl will write the output in the directory specified by the path component of the input file, that is, the directory in  which	the  input
	file resides.

	If  either one of --abc, --sif, --etc or -use-tab tab-file is used the output will be in label format.	Otherwise the clustering is output
	in the mcl matrix format; see the mcxio(5) section for more information on this. Refer also to the group of options discussed at --abc.

	Look at the -ap prefix option and its siblings for the automatic naming constructions employed by mcl if the -o option is not used.
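
	For illustration (the directory name results is assumed):

	   mcl data/small.mci -I 2.0 -o -               # clustering written to STDOUT
	   mcl data/small.mci -I 2.0 -odir results      # automatically named output written in results/
	   mcl data/small.mci -I 2.0 --d                # automatically named output written in data/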

      -c <num> (reweight loops)
      --sum-loops (set loops to sum of other arc weights)
	With the -c <num> option, as the final step of loop computation (i.e. after initialization and shadowing) all loop weights are	multiplied
	by <num>, if supplied.

      --discard-loops=<y/n> (discard loops in input)
	By  default  mcl will remove any loops that are present in the input. Use --discard-loops=n to turn this off. Bear in mind that loops will
	still be modified in all cases where the loop weight is not maximal among the list of edge weights for a given node.

      --abc (expect/write labels)
      --sif (expect/write labels)
      --etc (expect/write labels)
      --expect-values (expect label:weight format)
      -use-tab <fname> (use mapping to write)
	These items all relate to label input and/or label output.  --abc tells mcl to expect label input and output clusters in  terms  of  those
	labels. This simple format expects two or three fields separated by white space on each line.  The first and second fields are interpreted
	as labels specifying source and destination node respectively. The third field, if present, specifies the weight of the arc connecting the
	two nodes.

	The  option  --sif  tells  mcl to expect SIF (Simple Interaction File) format. This format is line based. The first two fields specify the
	source node (as a label) and the relationship type. An arbitrary number of fields may follow, each containing a label identifying a desti-
	nation node.  The second field is simply ignored by mcl.  As an extension to the SIF format weights may optionally follow the labels, sep-
	arated from them with a colon character.  It is in this case necessary to use the --expect-values option.  The --etc option expects a for-
	mat identical in all respects except that the relationship type is not present, so that all fields after the first are interpreted as des-
	tination labels.

	-use-tab is only useful when matrix input is used.  It will use the tab file to convert the output to labels; it does not fail on  indices
	missing from the tab file, but will bind these to generated dummy labels.
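
	As an illustration of the SIF variant with values (the labels, weights, and the relationship type 'pp' are made up; mcl ignores the second
	field):

	   ---8<------8<------8<------8<------8<---
	   cat pp hat:0.2 bat:1.0
	   bit pp fit:0.25 hit:0.16
	   --->8------>8------>8------>8------>8---

	   mcl data.sif --sif --expect-values -o out.data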

      -tf <tf-spec> (transform input matrix values)
      -abc-tf <tf-spec> (transform input stream values)
      --abc-neg-log10 (take log10 of stream values, negate sign)
      --abc-neg-log (take log of stream values, negate sign)
	-tf  transforms the values of the input matrix according to <tf-spec>. -abc-tf transforms the stream values (when --abc is used) according
	to <tf-spec>.  --abc-neg-log and --abc-neg-log10 imply that the stream input values are replaced by the negation of  their  log  or  log10
	values,  respectively.	 The  reason  for  their  existence  is  documented  in  mcxio(5).   For  a  description of the transform language
	expected/accepted in <tf-spec> refer to the same.
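
	For instance (the file name blast.abc is assumed), label input carrying BLAST-type probabilities can be converted to similarities while it
	is being read:

	   mcl blast.abc --abc --abc-neg-log10 -o out.blast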

      -write-graph <fname> (write graph)
      -write-graphx <fname> (write transformed graph)
      -write-expanded <fname> (write expanded graph)
      --write-limit (write mcl process limit)
	The first two options are somewhat outdated, in that the preferred way of loading networks is by using mcxload(1). The option -write-
	expanded can be useful for exploring more complicated input transformations that incorporate an expansion step, but is not really relevant
	for production use. The last option is mainly educational and for analyzing the mcl process itself.

      -scheme <num> (use a preset resource scheme)
      -resource <num> (allow n neighbours throughout)
	There are currently seven different resource schemes, indexed 1..7.  High schemes result in more expensive computations that may  possibly
	be  more accurate. The default scheme is 4. When mcl is done, it will give a grade (the so called jury synopsis) to the appropriateness of
	the scheme used. A low grade does not necessarily imply that the resulting clustering is bad - but anyway, a low grade should be reason to
	try for a higher scheme.

	Use the -resource <num> option to cap, for each node, the number of neighbours tracked during computation at <num>.

	The PRUNING OPTIONS section contains an elaborate description of the way mcl manages resources, should you be interested.  In case you are
	worried about the validation of the resulting clusterings, the mclfaq(7) section has several entries discussing  this  issue.  The  bottom
	line  is  that	you  have to compare the clusterings resulting from different schemes (and otherwise identical parameters) using utilities
	such as clm dist, clm info on the one hand, and your own sound judgment on the other hand.

	If your input graph is extremely dense, with an average node degree (i.e. the number of neighbours per node) that is somewhere above  500,
	you may need to filter the input graph by removing edges, for example by using one of -tf '#ceilnb()' or -tf '#knn()'.

      --show-schemes (show preset resource schemes)
	Shows the explicit settings to which the different preset schemes correspond.

	The characteristics are written in the same format (more or less) as the output triggered by -v pruning.

      -V <str> (verbosity type off)
	See the -v option below.

      -v <str> (verbosity type on)
	These are the different verbosity modes:

	pruning
	explain
	cls
	all

      -q <spec> (log levels)

	To make mcl as quiet as can be, add -q x -V all to the command line.

	The -q option governs a general logging mechanism.  The format accepted is described in the tingea.log(7) manual page.

	The  other  options  govern  verbosity levels specific to mcl. -v all turns them all on, -V all turns them all off. -v str and -V str turn
	on/off the single mode str (for str equal to one of pruning, cls, or explain). Each verbosity mode is given its own entry below.

      -v explain
	This mode causes the output of explanatory headers illuminating the output generated with the pruning verbosity mode.

      -v pruning
	This mode causes output of resource-related quantities. It has a separate entry in the PRUNING OPTIONS section.

      -v cls
	This mode (on by default) prints a terse list of characteristics of the clusterings associated with intermediate iterands. The	character-
	istics	are  E/V, cls, olap, and dd. They respectively stand for the number of outgoing arcs per node (as an average), the number of clus-
	ters in the overlapping clustering associated with the iterand, the number of nodes in overlap, and the depth of the DAG (directed
	acyclic graph) associated with the iterand. For more information on this DAG refer to the -dump option description in this man-
	ual and also mclfaq(7).

	Standard log information

	expa This gives the ratio of (1) the number of edges after initial expansion, before pruning, to (2) the number of  edges  before  initial
	     expansion.
	expb This gives the ratio of (1) the number of edges after expansion and pruning, to (2) the number of edges before initial expansion.
	expc This gives the ratio of (1) the number of edges after expansion and pruning, to (2) the number of edges of the original input graph.
	fmv  This  gives  the percentage of nodes (matrix columns) for which full matrix/vector computation was used (as opposed to using a sparse
	     technique).

      -aa <str> (append <str> to suffix)
	See the -ap option below.

      -ap <str> (use <str> as file name prefix)
	If the -o fname option is not used, mcl will create a file name (for writing output to) that should uniquely  characterize  the  important
	parameters  used  in the current invocation of mcl. The default format is out.fname.suf, where out is simply the literal string out, fname
	is the first argument containing the name of the file (with the graph) to be clustered, and where suf is the  suffix  encoding	a  set	of
	parameters (described further below).

	The  -ap str  option specifies a prefix to use rather than out.fname as sketched above.  However, mcl will interpret the character '=', if
	present in str, as a placeholder for the input file name.

	If the -aa str option is used, mcl will append str to the suffix suf created by itself.  You can use this if you need to encode some extra
	information in the file name suffix.

	The suffix is constructed as follows. The -I f and -scheme parameters are always encoded.  Other options, such as -pi f and -knn, are only
	encoded if they are used. Any real argument f is encoded using exactly one trailing digit behind the decimal separator	(which	itself	is
	not  written). The setting -I 3.14 is thus encoded as I31. The -scheme option is encoded using the letter 's', all other options mentioned
	here are encoded as themselves (stripped of the hyphen). For example

	mcl small.mci -I 3 -c 2.5 -pi 0.8 -scheme 5

	results in the file name out.small.mci.I30s5c25pi08.  If you want to know beforehand what file name will be produced, use the -az option.

      -az (show output file name and exit)
      -ax (show output suffix and exit)
	If mcl automatically constructs a file name, it can be helpful to know beforehand what that file name will be. Use -az and mcl will write
	the  file  name  to  STDOUT and exit. This can be used if mcl is integrated into other software for which the automatic creation of unique
	file names is convenient.

	By default mcl incorporates the input file name into the output file name and appends a short suffix describing the most important  option
	settings. Use -ax to find out what that suffix is.  This can be useful in wrapper pipeline scripts such as clxcoarse.

      -annot <str> (dummy annotation option)
	mcl  writes  the  command line with which it was invoked to the output clustering file. Use this option to include any additional informa-
	tion. MCL does nothing with this option except copying it as just described.

      -te <int> (#expansion threads)
	Threading is useful if you have a multi-processor system. mcl will spawn k threads of computation. If these are computed in parallel (this
	depends on the number of CPUs available to the mcl process) it will speed up the process accordingly.

	When  threading, it is best not to turn on pruning verbosity mode if you are letting mcl run unattended, unless you want to scrutinize its
	output later. This is because it makes mcl run somewhat slower, although the difference is not dramatic.
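
	For example, on a machine with four CPUs available to the mcl process (the file name is assumed):

	   mcl big.mci -I 2.0 -te 4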

      -pi <num> (pre-inflation)
      -ph <num> (pre-inflation, max-bound)
	If used, mcl will apply inflation one time to the input graph before entering the main process. This can be useful  for  making  the  edge
	weights in a graph either more homogeneous (which may result in less granular clusterings) or more heterogeneous (which may result in more
	granular clusterings).	Homogeneity is achieved for values <num> less than one, heterogeneity for values larger than one.  Values  to  try
	are normally in the range [2.0,10.0].

	The  -ph  option  is  special in that it does not rescale columns to be stochastic. Instead, it rescales columns so that the maximum value
	found in the column stays the same after inflation was applied. There is little significance to this, and what little there is is  undocu-
	mented.

      -if <num> (start-inflation)
	If  used,  mcl will apply inflation one time to the input graph before entering the main process. The difference with -pi is that with the
	latter option mcl may apply certain transformations after reading in the matrix such as adding or modifying loops. The purpose of the  -if
	(mnemonic  for	inflation-first) option is to use it on graphs saved with the --write-expanded option and convey to mcl that it should not
	apply those transformations.

      -dump-interval <i:j> (dump interval)
      -dump-interval all
	Dump during iterations i..j-1. Use all to dump in all iterations. See the -dump str option below.

      -dump-modulo <int> (dump i+0..i+<int>..)
	Sampling rate: select only these iterations in the dump interval.  See the -dump str option below.

      -dump-stem <stem> (file stem)
	Set the stem for file names of dumped objects (default mcl). See the -dump str option below.

      -dump <str> (type)
	str is checked for substring occurrences of the following entries.  Repeated use of -dump is also allowed.

	ite
	dag
	cls
	chr
	lines
	cat

	lines and cat change the mode of dumping. The first changes the dump format to a line based pairwise format rather than  the  default  mcl
	matrix	format. The second causes all dumped items to be dumped to the default stream used for the output clustering, which is appended at
	the end.

	The ite option writes mcl iterands to file. The cls option writes clusterings associated with mcl iterands to file.   These  clusters  are
	obtained  from	a  particular  directed acyclic graph (abbreviated as DAG) associated with each iterand. The dag option writes that DAG to
	file. The DAG can optionally be further pruned and then again be interpreted as a clustering using clm imac, and clm imac  can	also  work
	with  the  matrices  written  using the ite option.  It should be noted that clusterings associated with intermediate iterands may contain
	overlap, which is interesting in many applications. For more information refer to mclfaq(7) and the REFERENCES section below.

	The result option dumps the usual MCL clustering.

	The chr option says, for each iterand I, to output a matrix C with characteristics of I. C has the same number of columns as I.  For  each
	column	k in C, row entry 0 is the diagonal or 'loop' value of column k in I after expansion and pruning, and before inflation and rescal-
	ing. Entry 1 is the loop value after inflation and rescaling.  Entry 2 is the center of column k (the sum of its entries squared) computed
	after expansion and before pruning, entry 3 is the maximum value found in that column at the same time. Entry 4 is the amount of mass kept
	for that column after pruning.

	The -ds option sets the stem for file names of dumped objects (default mcl). The -di and -dm options allow a selection of iterands  to	be
	made.
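
	As a sketch (the stem and interval are chosen for illustration), the following dumps both the iterands and their associated clusterings
	for iterations 1 up to and including 4:

	   mcl small.mci -I 2.0 -dump ite -dump cls -dump-interval 1:5 -dump-stem small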

      -digits <int> (printing precision)
	This  has two completely different uses. It sets the number of decimals used for pretty-printing mcl iterands when using the --show option
	(see below), and it sets the number of decimals used for writing the expanded matrix when using the -write-expanded option.

      --show (print matrices to screen)
	Print matrices to screen. The number of significant digits to be printed can be tuned with -digits n. An 80-column  screen  allows  graphs
	(matrices)  of size up to 12(x12) to be printed with three digits precision (behind the comma), and of size up to 14(x14) with two digits.
	This can give you an idea of how mcl operates, and what the effect of pruning is.  Use e.g. -S 6 for such a small graph and view  the  MCL
	matrix iterands with --show.
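
	For example (the file name tiny.mci is assumed to hold a small test graph):

	   mcl tiny.mci -I 2.0 -S 6 --show -digits 2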

      --write-binary (output format)
	Write  matrix  dump output in binary mcl format rather than interchange mcl format (the default). Note that mcxconvert can be used to con-
	vert each one into the other.  See mcxio(5) and mcx(1) for more information.

      -sort <str> (sort mode)
	str can be one of lex, size, revsize, or none. The default is 'revsize', in which the largest clusters come first. If the mode is  'size',
	smallest  clusters  come  first, if the mode is 'lex', clusters are ordered lexicographically, and if the mode is 'none', the order is the
	same as produced by the procedure used by mcl to map matrices onto clusterings.

      -overlap <str> (overlap mode)
	Mode keep causes mcl to retain overlap should this improbable event occur. In theory, mcl may generate a clustering that contains overlap,
	although this almost never happens in practice, as it requires some particular type of symmetry to be present in the input graph (not just
	any symmetry will do). Mathematically speaking, this is a conjecture and not a theorem, but the present author will eat his shoe if it
	fails  to  be  true  (for marzipan values of shoe). It is easy though to construct an input graph for which certain mcl settings result in
	overlap - for example a line graph on an odd number of nodes. The default is to excise overlapping parts and introduce them as clusters in
	their  own  right.  It is possible to allocate nodes in overlap to the first cluster in which they occur (i.e. rather arbitrarily), corre-
	sponding with mode cut.

	In mode split mcl will put all nodes in overlap into separate clusters. These clusters are chosen such that two nodes are put in the  same
	new cluster if and only if they always occur paired in the clusters of the overlapping clustering.

	This  option has no effect on the clusterings that are output when using -dump cls - the default for those is that overlap is not touched,
	and this default can not yet be overridden.

      --force-connected=<y/n> (analyze components)
      --check-connected=<y/n> (analyze components)
	If the input graph has strong bipartite characteristics, mcl may yield clusters that do not correspond	to  connected  components  in  the
	input graph. Turn one of these modes on to analyze the resultant clustering.

	If  loose clusters are found they will be split into subclusters corresponding to connected components.  With --force-connected=y mcl will
	write the corrected clustering to the normal output file, and the old clustering to the same file with	suffix	orig.	With  --check-con-
	nected=y mcl will write the loose clustering to the normal output file, and the corrected clustering to the same file with suffix coco.

	These  options	are  not on by default, as the analysis is currently (overly) time-consuming and mcl's behaviour actually makes some sense
	(when taking bipartite characteristics into account).

      --analyze=<y/n> (performance criteria)
	With this mode turned on, mcl will reread the input matrix and compute a few performance criteria and attach them to the output file.  Off
	by default.

      --show-log=<y/n> (show log)
	Shows the log with process characteristics on STDOUT.  By default, this mode is off.

      --jury-charter (explains jury)
	Explains how the jury synopsis is computed from the jury marks.

      --version (show version)
	Show version.

      -how-much-ram <int> (RAM upper bound)
	<int>  is  interpreted	as the number of nodes of an input graph.  mcl will print the maximum amount of RAM it needs for its computations.
	The formula for this number in bytes is:

	   2 * c * k * <int>

	   2  :  two matrices are concurrently held in memory.
	   c  :  mcl cell size (as shown by -z).
	 <int>:  graph cardinality (number of nodes).
	   k  :  MAX(s, r).
	   s  :  select number (-S, -scheme options).
	   r  :  recover number (-R, -scheme options).

	This estimate will usually be too pessimistic. It does assume though that the average node degree of the input graph does  not	exceed	k.
	The  -how-much-ram  option  takes  other  command-line	arguments  into account (such as -S and -R), and it expresses the amount of RAM in
	megabyte units.
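
	As a worked illustration under assumed values: with a cell size c of 8 bytes (the actual value is shown by -z) and k = MAX(500, 600) = 600
	for the settings -S 500 -R 600, a graph of 100000 nodes gives 2 * 8 * 600 * 100000 = 960000000 bytes, i.e. roughly 900 megabytes. The true
	requirement is usually much lower.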

      -h (show help)
	Shows a selection of the most important mcl options.

      --help (show help)
	Gives a one-line description for all options.

      -z (show settings)
	Show current settings for tunable parameters.  --show-settings is a synonym.

  PRUNING OPTIONS
      -p <num> (cutoff)
      -P <int> (1/cutoff)
      -S <int> (selection number)
      -R <int> (recover number)
      -pct <pct> (recover percentage)
	After computing a new (column stochastic) matrix vector during expansion (which is matrix multiplication, in this case squaring), the vector is
	successively exposed to different pruning strategies. The intent of pruning is that many small entries are removed while retaining much of
	the stochastic mass of the original vector. After pruning, vectors are rescaled to be stochastic again.  MCL  iterands	are  theoretically
	known to be sparse in a weighted sense, and this manoeuvre effectively perturbs the MCL process a little in order to obtain matrices that
	are genuinely sparse, thus keeping the computation tractable. An example of monitoring pruning can be found in the discussion of  -v prun-
	ing at the end of this section.

	mcl proceeds as follows. First, entries that are smaller than cutoff are removed, resulting in a vector with at most 1/cutoff entries. The
	cutoff can be supplied either by -p, or as the inverse value by -P. The latter is more intuitive, if your intuition is like mine (P stands
	for  precision or pruning).  The cutoff just described is rigid; it is the same for all vectors. The --adapt option causes the computation
	of a cutoff that depends on a vector's homogeneity properties, and this option may or may not speed up mcl.

	Second, if the remaining stochastic mass (i.e. the sum of all remaining entries) is less  than	<pct>/100  and	the  number  of  remaining
	entries  is  less  than  <r> (as specified by the -R flag), mcl will try to regain ground by recovering the largest discarded entries. The
	total number of entries is not allowed to grow larger than <r>.  If recovery was not necessary, mcl tries to prune the vector further down
	to  at most s entries (if applicable), as specified by the -S flag. If this results in a vector that satisfies the recovery condition then
	recovery is attempted, exactly as described above. The latter will not occur of course if <r> <= <s>.

	The default setting is something like -P 4000 -S 500 -R 600. Check the -z flag to be sure. There is a set of precomposed  settings,  which
	can  be  triggered with the -scheme k option. k=4 is the default scheme; higher values for k result in costlier and more accurate computa-
	tions (vice versa for lower, cheaper, and less accurate).  The schemes are listed using the --show-schemes option. It is advisable to  use
	the  -scheme option only in interactive mode, and to use the explicit expressions when doing batch processing. The reason is that there is
	no guarantee whatsoever that the schemes will not change between different releases. This is because the  scheme  options  should  reflect
	good general purpose settings, and it may become apparent that other schemes are better.

	Note  that  'less accurate' or 'more accurate' computations may still generate the same output clusterings. Use clm dist to compare output
	clusterings for different resource parameters. Refer to clm dist for a discussion of this issue.
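
	For example, a batch run pinning the pruning parameters explicitly rather than relying on -scheme might look as follows (the particular
	values are assumptions, not a recommended setting):

	   mcl big.mci -I 2.0 -P 5000 -S 600 -R 700 -pct 95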

      -warn-pct <int> (prune warn percentage)
      -warn-factor <int> (prune warn factor)
	The two options -warn-pct and -warn-factor relate to warnings that may be triggered once the initial pruning of a vector is completed. The
	idea  is to issue warnings if initial pruning almost completely destroys a computed vector, as this may be a sign that the pruning parame-
	ters should be changed. Whether a warning is issued depends on the mass remaining after initial pruning: mcl will issue a warning if that
	mass is less than warn-pct, or if the number of remaining entries is smaller by a factor warn-factor than both the number of entries
	originally computed and the recovery number.

	-warn-pct takes an integer between 0 and 100 as parameter, -warn-factor takes a real positive number. They default to  something  like	30
	and 50.0. If you want to see fewer warnings, decrease warn-pct and increase warn-factor. Set warn-factor to zero if you want no warnings.

      -v pruning
	Pruning  verbosity  mode  causes  mcl  to  emit  several  statistics related to the pruning process, each of which is described below. Use
	-v explain to get explanatory headers in the output as well (or simply use -v all).

  IMPLEMENTATION OPTIONS
      -sparse <int> (sparse matrix multiplication threshold)
	This value (by default set to 10) determines when mcl switches to sparse matrix/vector multiplication.	For a given column stochastic vec-
	tor  (corresponding  with  all the neighbours of a given node v according to the current mcl iterand) the sum S of neighbour counts of all
	neighbours of v is computed, counting duplicates. This is exactly the number of matrix entries involved in the computation of the new col-
	umn  vector for the matrix product. If S times <int> does not exceed the number of nodes in the graph (equal to both column and row dimen-
	sion of the matrices used) then a sparse implementation is used. Otherwise an optimized regular implementation is used. Intuitively,  this
	option can be thought of as the estimated overhead per matrix floating point operation incurred by the sparse implementation compared with
	the regular implementation.  MCL uses this estimated overhead to determine which implementation is likely to be quicker. Testing has shown
	this strategy works very well for graphs of a wide range of sizes, including graphs with up to 3 million nodes and 500 million edges.

	NOTE
	The  effectiveness  of	this  option is influenced by hardware-specific properties such as the CPU L2 cache size. The default value should
	work reasonably well across a wide variety of scenarios, but it may be possible to squeeze faster run times out  of  mcl  by  tuning  this
	parameter to the graphs that are specific for your application domain.

  EXAMPLES
      The following is an example of label input

      ---8<------8<------8<------8<------8<---
      cat hat  0.2
      hat bat  0.16
      bat cat  1.0
      bat bit  0.125
      bit fit  0.25
      fit hit  0.5
      hit bit  0.16
      --->8------>8------>8------>8------>8---

      It can be clustered like this:

      mcl cathat --abc -o out.cathat

      The file out.cathat should now look like this:

      ---8<------8<------8<------8<------8<---
      cat hat bat
      bit fit hit
      --->8------>8------>8------>8------>8---

      A  few  things to note. First, MCL will symmetrize any arrow it finds. If it sees bat cat 1.0 it will act as if it also saw cat bat 1.0. You
      can explicitly specify cat bat 1.0 as well; mcl will then simply end up with duplicate entries in the first parse stage.  Second, MCL deduplicates
      repeated edges by taking the one with the maximum value. So,

      ---8<------8<------8<------8<------8<---
      cat hat  0.2
      hat cat  0.16
      hat cat  0.8
      --->8------>8------>8------>8------>8---

      will result in two arrows cat-hat and hat-cat, both with value 0.8.

  APPLICABILITY
      mcl  will  work very well for graphs in which the diameter of the natural clusters is not too large. The presence of many edges between dif-
      ferent clusters is not problematic; as long as there is cluster structure, mcl will find it. It is less likely to work well for graphs  with
      clusters	(inducing  subgraphs)  of  large diameter, e.g. grid-like graphs derived from Euclidean data. So mcl in its canonical form is cer-
      tainly not fit for boundary detection or image segmentation. I experimented with a modified mcl and boundary detection in the thesis pointed
      to below (see REFERENCES). This was fun and not entirely unsuccessful, but not something to be pursued further.

      mcl  likes  undirected  input graphs best, and it really dislikes graphs with node pairs (i,j) for which an arc going from i to j is present
      and the counter-arc from j to i is absent. Try to make your input graph undirected.  Furthermore, mcl interprets edge weights in	graphs	as
      similarities. If you are used to working with dissimilarities, you will have to convert those to similarities using some conversion formula.
      The most important thing is that you feel confident that the similarities are reasonable, i.e. if X is similar to Y with weight 2, and X	is
      similar to Z with weight 200, then this should mean that the similarity of Y (to X) is negligible compared with the similarity of Z (to X).

      mcl is probably not suited for clustering tree graphs. This is because mcl works best if there are multiple paths between different nodes in
      the natural clusters, but in tree graphs there is only one path between any pair of nodes. Trees are too sparse a structure for mcl to  work
      on.

      mcl may well be suited for clustering lattices. It will depend on the density characteristics of the lattice, and the conditions for success
      are the same as those for clustering graphs in general: The diameter of the natural clusters should not be too large.  NOTE when	clustering
      a lattice, you have to cluster the underlying undirected graph, and not the directed graph that represents the lattice itself. The reason is
      that one has to allow mcl (or any other cluster algorithm) to 'look back in time', so to speak.  Clustering  and	directionality	bite  each
      other (long discussion omitted).

      mcl  has	a  worst-case  time  complexity  O(N*k^2),  where N is the number of nodes in the graph, and k is the maximum number of neighbours
      tracked during computations. k depends on the -P and -S options. If the -S option is used (which is the default setting) then k  equals  the
      value  corresponding  with this option. Typical values for k are in the range 500..1000. The average case is much better than the worst case
      though, as cluster structure itself has the effect of helping mcl's pruning schemes, certainly if the diameter of natural  clusters  is  not
      large.

  FILES
      There are currently no resource nor configuration files.	The mcl matrix format is described in the mcxio(5) section.

  ENVIRONMENT
      MCLXIODIGITS
	When  writing matrices in interchange format, mcl will use this variable (if present) as the precision (number of digits) for printing the
	fractional part of values.

      MCLXIOVERBOSITY
	MCL and its sibling applications will usually report about matrix input/output from/to disk. The verbosity  level  can	be  regulated  via
	MCLXIOVERBOSITY. These are the levels it can currently be set to.

	1 Silent but applications may alter this.
	2 Silent and applications can not alter this.
	4 Verbose but applications may alter this.
	8 Verbose and applications can not alter this (default).
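
	For example, to silence matrix i/o reporting for a single run (Bourne-style shell assumed):

	   MCLXIOVERBOSITY=2 mcl small.mci -I 2.0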

      MCLXIOFORMAT
	MCL  and  its  sibling	applications  will by default output matrices in interchange format rather than binary format (cf. mcxio(5)).  The
	desired format can be controlled via the variable MCLXIOFORMAT. These are the levels it can currently be set to.

	1 Interchange format but applications may alter this.
	2 Interchange format and applications can not alter this (default).
	4 Binary format but applications may alter this.
	8 Binary format and applications can not alter this.

      MCLXICFLAGS
	If matrices are output in interchange format, by default empty vectors will not be listed. Equivalently (during input time),  vectors  for
	which  no  listing  is present are understood to be empty - note that the presence of a vector is established using the domain information
	found in the header part.  It is possible to enforce listing of empty vectors by setting bit '1' in the variable MCLXICFLAGS.

      MCLXIOUNCHECKED
	MCL and its sibling applications will always check a matrix for consistency while it is being read. If this variable is set,  the  consis-
	tency  check  is  omitted. For large graphs the speed up can be considerable. However, if the input graph is not conforming it will likely
	crash the application that is using it.

  DIAGNOSTICS
      If mcl issues a diagnostic error, it will most likely be because the input matrix could not be parsed successfully.  mcl tries to be helpful
      in describing the kind of parse error.  The mcl matrix format is described in the mcxio(5) section.

  BUGS
      No known bugs at this time.

  AUTHOR
      Stijn van Dongen.

  HISTORY/CREDITS
      The  MCL	algorithm  was conceived in spring 1996 by the present author.	The first implementation of the MCL algorithm followed that spring
      and summer. It was written in Perl and proved the viability of the algorithm. The implementation described here began  its  life	in  autumn
      1997. The first versions of the vital matrix library were designed jointly by Stijn van Dongen and Annius Groenink in the period October
      1997 - May 1999. The efficient matrix-vector multiplication routine was written by Annius. This routine is without significant changes still
      one of the cornerstones of this MCL implementation.

      Since  May  1999	all  MCL  libraries  have  seen much development and redesign by the present author. Matrix-matrix multiplication has been
      rewritten several times to take full advantage of the sparseness properties of the stochastic matrices brought forth by the  MCL	algorithm.
      This mostly concerns the issue of pruning - removal of small elements in a stochastic column in order to keep matrices sparse.

      Very  instructive  was that around April 2001 Rob Koopman pointed out that selecting the k largest elements out of a collection of n is best
      done using a min-heap. This was the key to the second major rewrite (now counting three) of the  MCL  pruning  schemes,  resulting  in  much
      faster code, generally producing a more accurate computation of the MCL process.

      In May 2001 Anton Enright initiated the parallelization of the mcl code and threaded inflation. From this example, Stijn threaded expansion.
      This was great, as the MCL data structures and operands (normal matrix multiplication and Hadamard multiplication) just beg for
      parallelization.

      Onwards.	The January 2003 03-010 release introduced support for sparsely enumerated (i.e. indices need not be sequential) graphs and matri-
      ces, the result of a major overhaul of the matrix library and most higher layers.  Conceptually, the library now sees matrices  as  infinite
      quadrants of which only finite subsections happen to have nonzero entries.

      The  June  2003  03-154  release	introduced  unix-type pipelines for clustering, including the BLAST parser mcxdeblast and the mclblastline
      script.  The April 2004 04-105 release revived binary format, which has been a first class citizen ever since.

      With the March 2005 05-090 release mcxsubs finally acquired a sane specification syntax. The November 2005 05-314 release brought the  abil-
      ity to stream label input directly into mcl. The subsequent release introduced a transformation language shared by various mcl siblings that
      allows arbitrary progressions of transformations to be applied to either stream values or matrix values.

      Joost van Baal set up the mcl CVS tree and packaged mcl for Debian GNU/Linux. He completely autotooled the sources, so much so that at first
      I found it hard to find them again amidst bootstrap, aclocal.m4, depcomp, and other beauties.

      Jan  van	der  Steen shared his elegant mempool code. Philip Lijnzaad gave useful comments. Philip, Shawn Hoon, Abel Ureta-Vidal, and Martin
      Mokrejs sent helpful bug reports.

      Abel Ureta-Vidal and Dinakarpandian Deendayal commented on and contributed to mcxdeblast and mcxassemble.

      Tim Hughes contributed several good bug reports for mcxassemble, mcxdeblast and zoem (a workhorse for clm format).

  SEE ALSO
      mclfaq(7) - Frequently Asked Questions.

      mcxio(5) - a description of the mcl matrix format.

      There are many more utilities. Consult mclfamily(7) for an overview of and links to all the documentation and the utilities in the mcl  fam-
      ily.

      minimcl is a 200-line perl implementation of mcl. It is shipped in the mcl distribution and can be found online at http://micans.org/mcl.

      mcl's home at http://micans.org/mcl/.

  REFERENCES
      [1] Stijn van Dongen, Graph Clustering by Flow Simulation.  PhD thesis, University of Utrecht, May 2000.
      http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm

      [2]  Stijn  van Dongen, Graph Clustering Via a Discrete Uncoupling Process, SIAM Journal on Matrix Analysis and Applications, 30(1):121-141,
      2008.  http://link.aip.org/link/?SJMAEL/30/121/1

      [3] Stijn van Dongen. A cluster algorithm for graphs.  Technical Report INS-R0010, National Research Institute for Mathematics and  Computer
      Science in the Netherlands, Amsterdam, May 2000.
      http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z

      [4]  Stijn  van Dongen. A stochastic uncoupling process for graphs.  Technical Report INS-R0011, National Research Institute for Mathematics
      and Computer Science in the Netherlands, Amsterdam, May 2000.
      http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z

      [5] Stijn van Dongen. Performance criteria for graph clustering  and  Markov  cluster  experiments.  Technical  Report  INS-R0012,  National
      Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000.
      http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z

      [6] Enright A.J., Van Dongen S., Ouzounis C.A.  An efficient algorithm for large-scale detection of protein families, Nucleic Acids Research
      30(7):1575-1584 (2002).

  NOTES
      This page was generated from ZOEM manual macros, http://micans.org/zoem. Both html and roff pages can be created from the same source  with-
      out having to bother with all the usual conversion problems, while keeping some level of sophistication in the typesetting.

  mcl 12-068							      8 Mar 2012							      mcl(1)