Top Forums UNIX for Dummies Questions & Answers How to make a distance matrix Post 302417989 by auburn on Sunday 2nd of May 2010 06:54:15 AM
How to make a distance matrix

Hi,

I'm trying to generate a distance matrix between sample pairs for use in a tree-drawing program. The example below shows what I'd like to get out of the data: essentially, the proportion of positions at which two samples differ.
Any help would be much appreciated! Notes on how the functions work would also be great!

Thanks!


Example input (note: commas separate the columns; a to d are sample names):

a,1,2,4,4
b,2,1,4,4
c,1,2,3,4
d,1,0,4,0

Identify the positions that differ in each pairwise comparison of samples a to d (score 1 for differ, 0 for shared in the example below).
Some comparisons are duplicates, e.g. ab and ba, and self-comparisons such as aa or bb are obviously all "0", but these are necessary to make the matrix.

aa,0,0,0,0
ab,1,1,0,0
ac,0,0,1,0
ad,0,1,0,1
ba,1,1,0,0
bb,0,0,0,0
bc,1,1,1,0
etc... to dd
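
One way this scoring step could be sketched in Python (the thread itself contains no code, so the language, the `samples` dictionary and the `score_differences` name are all assumptions made for illustration). With 1 meaning "differ", a sample compared with itself scores 0 at every position, which agrees with the proportion of 0 listed for aa.

```python
from itertools import product

# Sample rows from the example input: name -> value at each position.
samples = {
    "a": [1, 2, 4, 4],
    "b": [2, 1, 4, 4],
    "c": [1, 2, 3, 4],
    "d": [1, 0, 4, 0],
}


def score_differences(x, y):
    """Return 1 where the two samples differ and 0 where they match."""
    return [int(p != q) for p, q in zip(x, y)]


# product() yields every ordered pair, so duplicates (ab, ba) and
# self-comparisons (aa, bb) are included, as the matrix requires.
for first, second in product(samples, repeat=2):
    scores = score_differences(samples[first], samples[second])
    print(first + second, *scores, sep=",")
```

The loop prints one line per ordered pair, from aa through dd, in the same comma-separated layout as the list above.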

Calculate proportion of differing positions between pairwise comparisons
aa,0
ab,0.5
ac,0.25
ad,0.5
ba,0.5
bb,0
bc,0.75
etc...to dd
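
The proportion is just the count of differing positions divided by the number of positions. A minimal Python sketch, assuming both samples have the same length (`proportion_differing` is a made-up helper name, not something from the thread):

```python
def proportion_differing(x, y):
    """Fraction of positions at which two equal-length samples differ."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of positions")
    if not x:
        raise ValueError("samples must have at least one position")
    differing = sum(p != q for p, q in zip(x, y))
    return differing / len(x)


a = [1, 2, 4, 4]
b = [2, 1, 4, 4]
c = [1, 2, 3, 4]

print(proportion_differing(a, a))  # 0.0
print(proportion_differing(a, b))  # 0.5
print(proportion_differing(a, c))  # 0.25
print(proportion_differing(b, c))  # 0.75
```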

Prepare the matrix (e.g. the ab value goes in cell [a,b]; the ba value goes in cell [b,a], etc.)

a,b,c,d
a,0,0.5,0.25,0.5
b,0.5,0,0.75 etc...
c
d
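
Putting the steps together, a rough Python sketch that parses the comma-separated input and writes the full matrix as CSV might look like the following. Assumptions: the input is embedded as a string here (in practice it would come from a file), values are compared as text, and the header row starts with an empty corner cell so that the column names line up with the values. The function names are illustrative only.

```python
import csv
import io
import sys

# Example input from the post; in practice this would be read from a file.
RAW_INPUT = """\
a,1,2,4,4
b,2,1,4,4
c,1,2,3,4
d,1,0,4,0
"""


def read_samples(handle):
    """Map each sample name (first column) to its list of position values."""
    return {row[0]: row[1:] for row in csv.reader(handle) if row}


def distance_matrix(samples):
    """Return (names, matrix); matrix[i][j] is the proportion of positions
    at which sample i and sample j differ."""
    names = list(samples)
    matrix = []
    for row_name in names:
        x = samples[row_name]
        matrix.append([
            sum(p != q for p, q in zip(x, samples[col_name])) / len(x)
            for col_name in names
        ])
    return names, matrix


samples = read_samples(io.StringIO(RAW_INPUT))
names, matrix = distance_matrix(samples)

writer = csv.writer(sys.stdout, lineterminator="\n")
writer.writerow([""] + names)          # header: empty corner cell, then a..d
for name, row in zip(names, matrix):
    writer.writerow([name] + row)      # one row per sample
```

Every ordered pair is computed, so the result is symmetric with zeros on the diagonal. For large inputs one could compute only the upper triangle and mirror it, at the cost of slightly more code.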
 

mcxarray(1)							  USER COMMANDS 						       mcxarray(1)

  NAME
      mcxarray - Transform array data to MCL matrices

  SYNOPSIS
      mcxarray [options]

      mcxarray	[-data	fname  (input  data  file)]  [-imx  fname  (input matrix file)] [-co num ((absolute) cutoff for output values (required))]
      [--pearson (use Pearson correlation (default))] [--spearman (use Spearman rank correlation)] [-fp <mode> (use fingerprint  measure)]  [--dot
      (use  dot product)] [--cosine (use cosine)] [-skipr <num> (skip <num> data rows)] [-skipc <num> (skip <num> data columns)] [-o fname (output
      file fname)] [-write-tab <fname> (write row labels to file)] [-l <num> (take labels from column <num>)] [-digits <num>  (output  precision)]
      [--write-binary  (write  output  in  binary format)] [-t <int> (use <int> threads)] [-J <intJ> (a total of <intJ> jobs are used)] [-j <intj>
      (this job has index <intj>)] [-start <int> (start at column <int> inclusive)] [-end <int> (end  at  column  <int>  EXclusive)]  [--transpose
      (work  with the transposed data matrix)] [--rank-transform (rank transform the data first)] [-tf spec (transform result network)] [-table-tf
      spec (transform input table before processing)] [-n mode (normalize input)] [--zero-as-na  (treat  zeroes  as  missing  data)]  [-write-data
      <fname>  (write  data  to file)] [-write-na <fname> (write NA matrix to file)] [--job-info (print index ranges for this job)] [--help (print
      this help)] [-h (print this help)] [--version (print version information)]

  DESCRIPTION
      mcxarray can either read a flat file containing array data (-data) or a matrix file satisfying the mcl input format (-imx). In the former
      case it will by default work with the rows as the data vectors. In the latter case it will by default work with the columns as the data
      vectors (note that mcl matrices are presented as a listing of columns). This can be changed for both using the --transpose option.

      The input data may contain missing data in the form of empty columns, NA values (not available/applicable), or NaN values  (not  a  number).
      The  program keeps track of these, and when computing the correlation between two rows or columns ignores all positions where any one of the
      two has missing data.
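
The missing-data rule described above (skip every position where either vector lacks a value) can be illustrated with a small Python sketch. This only illustrates the idea and is not mcxarray's actual code; representing missing entries as None or NaN, and the function names, are assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pearson_pairwise_complete(x, y):
    """Pearson correlation over positions where both vectors have data."""
    pairs = [(p, q) for p, q in zip(x, y)
             if not is_missing(p) and not is_missing(q)]
    n = len(pairs)
    if n < 2:
        return float("nan")  # not enough shared positions
    mean_x = sum(p for p, _ in pairs) / n
    mean_y = sum(q for _, q in pairs) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in pairs)
    var_x = sum((p - mean_x) ** 2 for p, _ in pairs)
    var_y = sum((q - mean_y) ** 2 for _, q in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, 4.0, 100.0, float("nan"), 10.0]
# Positions 2 and 3 are dropped; the remaining pairs lie on y = 2x.
r = pearson_pairwise_complete(x, y)
```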

  OPTIONS
      -data fname (input data file)
	Specify the data file containing the expression values.  It should be tab-separated.

      -imx fname (input matrix file)
	The expression values are read from a file in mcl matrix format.

      --pearson (use Pearson correlation (default))
      --spearman (use Spearman rank correlation)
      --cosine (use cosine)
      --dot (use the dot product)
	Use one of these to specify the correlation measure. Note that the dot product is not normalised and should only be used with very good
	reason.

      -fp <mode> (specify fingerprint measure)
	Fingerprints are used to define an entity in terms of it having or not having certain traits. This means that a fingerprint can be
	represented by a boolean vector, and a set of fingerprints can be represented by an array of such vectors. In the presence of many traits and
	entities the dimensions of such a matrix can grow large. The sparse storage employed by MCL-edge is ideally suited to this, and mcxarray
	is ideally suited to the computation of all pairwise comparisons between such fingerprints. Currently mcxarray supports five different
	types of fingerprint, described below. Given two fingerprints, the number of traits unique to the first is denoted by a, the number
	unique to the second is denoted by b, and the number that they have in common is denoted by c.

	hamming
	  The Hamming distance, defined as a+b.

	tanimoto
	  The Tanimoto similarity measure, c/(a+b+c).

	cosine
	  The cosine similarity measure, c/sqrt((a+c)*(b+c)).

	meet
	  Simply the number of shared traits, identical to c.

	cover
	  A normalised and non-symmetric similarity measure, representing the fraction of traits shared relative to the number of traits by a
	  single entity. This gives the value c/(a+c) in one direction, and the value c/(b+c) in the other.
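
The five formulas above can be checked by hand with a short Python sketch. It represents each fingerprint as a set of trait names, which is an assumption made for illustration, as is the `fingerprint_measures` helper; it says nothing about how mcxarray stores or computes these internally.

```python
import math


def fingerprint_measures(first, second):
    """Compute the five measures from two non-empty sets of traits."""
    first, second = set(first), set(second)
    if not first or not second:
        raise ValueError("both fingerprints need at least one trait")
    a = len(first - second)   # traits unique to the first
    b = len(second - first)   # traits unique to the second
    c = len(first & second)   # traits in common
    return {
        "hamming": a + b,
        "tanimoto": c / (a + b + c),
        "cosine": c / math.sqrt((a + c) * (b + c)),
        "meet": c,
        "cover": (c / (a + c), c / (b + c)),  # one value per direction
    }


# Here a = 1, b = 2, c = 2.
result = fingerprint_measures({"t1", "t2", "t3"}, {"t2", "t3", "t4", "t5"})
print(result["hamming"])   # 3
print(result["tanimoto"])  # 0.4
print(result["meet"])      # 2
```

Note that hamming is a distance (larger means less alike), while the other four are similarities.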

      -skipr <num> (skip <num> data rows)
	Skip the first <num> data rows.

      -skipc <num> (skip <num> data columns)
	Ignore the first <num> data columns.

      -l <num> (take labels from column <num>)
	Specifies to construct a tab of labels from this data column.  The tab can be written to file using -write-tab fname.

      -write-tab <fname> (write row labels to file)
	Write a tab file. In the simple case where the labels are in the first data column it is sufficient to issue -skipc 1. If more data
	columns need to be skipped one must explicitly specify the data column to take labels from with -l <num>.

      -t <int> (use <int> threads)
      -J <intJ> (a total of <intJ> jobs are used)
      -j <intj> (this job has index <intj>)
	Computing all pairwise correlations is time-intensive for large input. If you have multiple CPUs available consider using as many
	threads. Additionally it is possible to spread the computation over multiple jobs/machines. Conceptually, each job takes a number of
	threads from the total thread pool. Additionally, the number of threads (as specified by -t) currently must be the same for all jobs, as
	it is used by each job to infer its own set of tasks. The following set of options, if given to as many commands, defines three jobs,
	each running four threads.

	-t 4 -J 3 -j 0
	-t 4 -J 3 -j 1
	-t 4 -J 3 -j 2

      --job-info (print index ranges for this job)
      -start <int> (start at column <int> inclusive)
      -end <int> (end at column <int> EXclusive)
	--job-info can be used to list the set of column ranges to be processed by the job as a result of the command line options -t, -J, and -j.
	If a job has failed, this option can be used to manually split those ranges into finer chunks, each to be processed as a new sub-job
	specified with -start and -end. With the latter two options, it is impossible to use parallelization of any kind (i.e. any of the -t, -J, and
	-j options).

      -o fname (output file fname)
	Output file name.

      -digits <num> (output precision)
	Specify the precision to use in native interchange format.

      --write-binary (write output in binary format)
	Write output matrices in native binary format.

      -co num ((absolute) cutoff for output values)
	Output values of magnitude smaller than num are removed (set to zero). Thus, negative values are removed only if their positive
	counterpart is smaller than num.
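
As a small illustration of the absolute cutoff rule, the Python sketch below zeroes every value whose magnitude is below the threshold. `apply_cutoff` is a hypothetical helper written for this example; it mirrors the description above and is not taken from mcxarray.

```python
def apply_cutoff(values, cutoff):
    """Keep values with |v| >= cutoff; replace the rest with 0.0."""
    return [v if abs(v) >= cutoff else 0.0 for v in values]


correlations = [0.9, -0.8, 0.3, -0.2, 0.5]
print(apply_cutoff(correlations, 0.5))  # [0.9, -0.8, 0.0, 0.0, 0.5]
```

The strongly negative value -0.8 survives because its magnitude exceeds the cutoff, while -0.2 is removed.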

      --transpose (work with the transpose)
	Work with the transpose of the input data matrix.

      --rank-transform (rank transform the data first)
	The data is rank-transformed prior to the computation of pairwise measures.
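
A rank transform replaces each value with its position in sorted order. The Python sketch below uses 1-based ranks and gives tied values the average of the ranks they span; both choices are assumptions made for illustration, since this page does not say how mcxarray treats ties.

```python
def rank_transform(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


print(rank_transform([10, 30, 20, 30]))  # [1.0, 3.5, 2.0, 3.5]
```

Computing Pearson correlation on rank-transformed vectors yields the Spearman rank correlation.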

      -write-data <fname> (write data to file)
	This writes the data that was read in to file. If --spearman is specified the data will be rank-transformed.

      -write-na <fname> (write NA matrix to file)
	This writes all positions for which no data was found to file, in native mcl matrix format.

      --zero-as-na (treat zeroes as missing data)
	This option can be useful when reading data with the -imx option, for example after it has been loaded from label input by mcxload. An
	example case is the processing of a large number of probe rankings, where not all rankings contain all probe names. The rankings can be
	loaded using mcxload with a tab file containing all probe names. Probes that are present in the ranking are given a positive ordinal
	number reflecting the ranking, and probes that are absent are implicitly given the value zero. With the present option mcxarray will handle
	the correlation computation in a reasonable way.

      -n mode (normalization mode)
	If mode is set to z the data will be normalized based on z-score. No other modes are currently supported.
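
For readers unfamiliar with z-scores, the Python sketch below shows the usual definition: subtract the mean, then divide by the standard deviation. Normalising one vector at a time and using the population standard deviation (dividing by n) are assumptions made for this example; the page does not specify which variant mcxarray applies.

```python
import math


def z_score(values):
    """Normalise a vector to mean 0 and (population) standard deviation 1."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot normalise an empty vector")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("z-score is undefined for a constant vector")
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Mean is 5 and population standard deviation is 2 for this vector.
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```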

      -tf spec (transform result network)
      -table-tf spec (transform input table before processing)
	The transformation syntax is described in mcxio(5).

      --help (print help)
      -h (print help)

      --version (print version information)

  AUTHOR
      Stijn van Dongen.

  SEE ALSO
      mcl(1), mclfaq(7), and mclfamily(7) for an overview of all the documentation and the utilities in the mcl family.

  mcxarray 12-068						      8 Mar 2012							 mcxarray(1)