UNIX command to select the best edge values from a network file
Post 303045006 by Sanchari on Tuesday 10th of March 2020 06:16:47 PM

I have tab-delimited data representing an undirected network. Among duplicated edges, I want to keep the edge with the highest absolute log value.
I have written code in Python, but it is taking a lot of time. I would be grateful if someone could help me with an awk command. Note that the network is undirected, i.e. A--B and B--A are duplicate edges. My original file has a large number of columns; I have given simplified test data here.

Test data

Code:
     Gene1    Gene2    Log
    AT1G01020    AT1G01010    1.682708
    AT1G01020    AT1G01010    -1.90043
    AT1G01020    AT1G01010    -1.832192
    AT1G01070    AT1G01060    -0.591932
    AT1G01070    AT1G01060    -1.204241
    AT1G01073    AT1G01070    0.790549
    AT1G01060    AT1G01070    1.214972

Expected Output

Code:
    AT1G01020    AT1G01010    -1.90043
    AT1G01070    AT1G01060    1.214972
    AT1G01073    AT1G01070    0.790549

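Code:
# file1 is assumed to be an already-open handle on the tab-delimited network file
gene_table=file1.readlines() # In the real file, j[12]=Gene1, j[13]=Gene2 and j[27]=log value
for j in gene_table:
    j=j.split("\t")
    j[12]=j[12].strip()
    j[13]=j[13].strip()
    lfc=[]
    int_list=[]
    lfc.append(float(j[27]))
    int_list.append(j[0])
    dict_int={}
    # Compare this row against every other row (quadratic in the number of rows)
    for k in gene_table:
        k=k.split("\t")
        k[12]=k[12].strip()
        k[13]=k[13].strip()
        # Same edge in either orientation (A--B or B--A), excluding the row itself
        if (j[0]!=k[0]) and ((j[12]==k[12] and j[13]==k[13]) or (j[12]==k[13] and j[13]==k[12])):
            lfc.append(float(k[27]))
    dict_int=dict(zip(int_list, lfc))
    x=max(lfc, key=abs)
    #print x
    # Non-empty only when the current row holds the largest absolute value for its edge
    listOfKeys = [key  for (key, value) in dict_int.items() if value == x]
    print(listOfKeys)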
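
One way to avoid the nested loop is a single pass that keys each edge on its sorted gene pair, so that A--B and B--A share a key. The following is a minimal Python sketch, assuming the simplified three-column layout of the test data with the header line already removed; the function name `best_edges` and its column-index parameters are illustrative, and for the real file the indices would presumably be 12, 13 and 27. It keeps the orientation of the first occurrence of each edge, which is what the expected output appears to do.

```python
def best_edges(rows, gene1=0, gene2=1, log=2):
    """Keep, for each undirected edge, the log value with the largest magnitude.

    rows is an iterable of already-split lines (lists of strings).
    """
    best = {}  # sorted gene pair -> [gene1, gene2, log value as text]
    for row in rows:
        a, b, value = row[gene1].strip(), row[gene2].strip(), row[log].strip()
        key = tuple(sorted((a, b)))  # A--B and B--A map to the same key
        if key not in best:
            best[key] = [a, b, value]
        elif abs(float(value)) > abs(float(best[key][2])):
            best[key][2] = value
    # Dicts preserve insertion order (Python 3.7+), so edges come out
    # in order of first appearance.
    return [tuple(edge) for edge in best.values()]


# The test data from the post, without the header line.
test_data = """\
AT1G01020\tAT1G01010\t1.682708
AT1G01020\tAT1G01010\t-1.90043
AT1G01020\tAT1G01010\t-1.832192
AT1G01070\tAT1G01060\t-0.591932
AT1G01070\tAT1G01060\t-1.204241
AT1G01073\tAT1G01070\t0.790549
AT1G01060\tAT1G01070\t1.214972
"""

if __name__ == "__main__":
    rows = [line.split("\t") for line in test_data.splitlines()]
    for edge in best_edges(rows):
        print("\t".join(edge))
```

Each row is visited once, so the running time grows linearly with the number of rows instead of quadratically.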


Last edited by Scrutinizer; 03-11-2020 at 12:29 AM..
 

clmprotocols(5) 						   FILE FORMATS 						   clmprotocols(5)

   1. NAME
   2. DESCRIPTION
   3. Network representation
   4. Loading large networks
   5. Converting between formats
   6. Clustering similarity graphs encoded in BLAST results
   7. Clustering expression data
   8. Reducing node degrees in the graph
   9. SEE ALSO
  10. AUTHOR

  NAME
      clmprotocols - Work flows and protocols for mcl and friends

  DESCRIPTION
      A guide to doing analysis with mcl and its helper programs.

  Network representation
      The clustering program mcl expects the name of a file as its first argument. If the --abc option is used, the file is assumed to adhere to a
      simple format where a network is specified edge by edge, one line and one edge at a time.  Each line describes an edge as two labels  and  a
      numerical  value, all separated by white space. The labels and the value respectively identify the two nodes and the edge weight. The format
      is called ABC-format, where 'A' and 'B' represent the two labels and 'C' represents the edge weight. The latter is optional; if omitted  the
      edge  weight  is set to one.  If ABC-format is used, the output is returned as a listing of clusters, each cluster given as a line of white-
      space separated labels.
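
To make the ABC-format description concrete, here is a small Python sketch of a reader for such lines. It is not mcl's own parser; the function name `parse_abc` and the choice to skip blank lines are assumptions made for illustration.

```python
def parse_abc(lines):
    """Parse ABC-format lines into (label_a, label_b, weight) tuples."""
    edges = []
    for line in lines:
        fields = line.split()  # any run of white space separates fields
        if not fields:
            continue  # skip blank lines (an assumption of this sketch)
        if len(fields) == 2:
            a, b = fields
            weight = 1.0  # the weight is optional and defaults to one
        elif len(fields) == 3:
            a, b = fields[0], fields[1]
            weight = float(fields[2])
        else:
            raise ValueError("expected 2 or 3 fields, got %d" % len(fields))
        edges.append((a, b, weight))
    return edges


if __name__ == "__main__":
    for edge in parse_abc(["A B 0.1", "C\tD", "E A 0.7"]):
        print(edge)
```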

      MCL can also utilize a second representation, which is a stringent and unambiguous format for both input and output.  This is called  matrix
      format  and  it is required when using other programs in the mcl suite, for example when comparing and analysing clusterings using clm(1) or
      when extracting and transforming networks using mcx(1).  Native mode (matrix format) is entered simply by not specifying --abc.

      The recommended approach using mcl is to convert an external format to ABC-format. The program mcxload(1) reads the  latter  and	creates  a
      native  network  file  and a dictionary file that maps network nodes to labels. All applications in the MCL suite, including mcl itself, can
      read this native network file format. Label output can be obtained using mcxdump(1). The workflow is thus:

	 #  External format has been converted to file data.abc (abc format)

	 mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci

	 mcl data.mci -I 1.4
	 mcl data.mci -I 2
	 mcl data.mci -I 4

	 mcxdump -icl out.data.mci.I14 -tabr data.tab -o dump.data.mci.I14
	 mcxdump -icl out.data.mci.I20 -tabr data.tab -o dump.data.mci.I20
	 mcxdump -icl out.data.mci.I40 -tabr data.tab -o dump.data.mci.I40

      In this example the cluster output is stored in native format and dumped to labels using mcxdump. The stored output can now be used to learn
      more  about the clusterings. An example is the following, where clm(1) is applied in mode dist to gauge the distance between different clus-
      terings.

	 clm dist --chain out.data.mci.I{14,20,40}

  Loading large networks
      If you deal with very large networks (say with hundreds of millions of edges), it is recommended to use binary format (cf. mcxio(5)). This
      is achieved simply by adding --write-binary to the mcxload command line. The resulting file is no longer human-readable but will be
      roughly ten to twenty times faster to read than standard MCL-edge network format, and around fifty times faster than label format.
      All MCL-edge programs are able to read binary format, and the speed of reading will be on the order of millions of
      edges per second, compared to, for example, roughly 100K edges per second for label format.

      Memory usage for mcxload can be lowered by replacing the option --stream-mirror with -ri max.

  Converting between formats
      Converting label format to tabular format
      Label format, two or three (including weight) columns:

	 mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci
	 mcxdump -imx data.mci -tab data.tab --dump-table

      Simple Interaction File (SIF) format:

	 mcxload -sif data.sif --stream-mirror -write-tab data.tab -o data.mci
	 mcxdump -imx data.mci -tab data.tab --dump-table

      These two examples differ only in the way the input to mcxload is specified.

  Clustering similarity graphs encoded in BLAST results
      A specific instance of the workflow above is the clustering of proteins based on their sequence similarities. In the most  typical  scenario
      the external format is BLAST output, which needs to be transformed to ABC format.  In the examples below the input is in columnar blast for-
      mat obtained with the blast -m8 option.  It requires a version of mcl at least as recent as 09-061.  First we create an  ABC-formatted  file
      using the external columnar BLAST format, which is assumed to be in a file called seq.cblast.

	 cut -f 1,2,11 seq.cblast > seq.abc

      The columnar format in the file seq.cblast has, for a given BLAST hit, the sequence labels in the first two columns and the associated
      E-value in column 11. It is parsed by the standard UNIX cut(1) utility. The format must have been created with the BLAST -m8 option so that no
      comment lines are present. Alternatively these can be filtered out using grep. The newly created seq.abc file is loaded by mcxload(1),
      which writes both a network file seq.mci and a dictionary file seq.tab.

	 mcxload -abc seq.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' \
	       -o seq.mci -write-tab seq.tab

      The --stream-mirror option ensures that the resulting network will be undirected, as recommended when using mcl. Omitting this option would
      result in a directed network as BLAST E-values generally differ between two sequences. The default course of action for mcxload(1) is to use
      the best value found between a pair of labels. The next option, --stream-neg-log10, transforms the numerical values in the input (the BLAST
      E-values) by taking the logarithm in base 10 and subsequently negating the sign. Finally, the transformed values are capped so that any
      E-value below 1e-200 is set to a maximum allowed edge weight of 200.
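
The arithmetic of this transformation can be sketched in a few lines of Python. This only mirrors the description in the text (negated base-10 logarithm, then a cap at 200); it is not mcxload's code, the function name `neg_log10_capped` is made up for this example, and mapping an E-value of exactly 0 to the cap is an assumption.

```python
import math


def neg_log10_capped(evalue, cap=200.0):
    """Turn a BLAST E-value into an edge weight: -log10(E), at most cap."""
    if evalue == 0.0:
        # log10(0) is undefined; this sketch assumes such hits get the cap.
        return cap
    return min(-math.log10(evalue), cap)


if __name__ == "__main__":
    for evalue in (1e-5, 1e-50, 1e-250, 0.0):
        print(evalue, neg_log10_capped(evalue))
```

Smaller E-values indicate stronger hits, so after the transformation stronger hits carry larger edge weights.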

      To obtain clusterings from seq.mci and seq.tab there are two choices. The first is to generate an abstract clustering representation and from
      that obtain the label output, as follows. Below the -o option is not used, so mcl will create meaningful and unique output names by itself.
      The default way of doing this is to prepend the prefix out. and to append a suffix encoding the inflation value used, with inflation
      encoded using two digits of precision and the decimal separator removed.

	 mcl seq.mci -I 1.4
	 mcl seq.mci -I 2
	 mcl seq.mci -I 4
	 mcl seq.mci -I 6

	 mcxdump -icl out.seq.mci.I14 -tabr seq.tab -o dump.seq.mci.I14
	 mcxdump -icl out.seq.mci.I20 -tabr seq.tab -o dump.seq.mci.I20
	 mcxdump -icl out.seq.mci.I40 -tabr seq.tab -o dump.seq.mci.I40
	 mcxdump -icl out.seq.mci.I60 -tabr seq.tab -o dump.seq.mci.I60
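
The naming pattern visible in these commands can be reproduced with a short Python helper, which may be handy when scripting the mcxdump step. The function name `mcl_output_name` is hypothetical, and the formatting rule is only inferred from the file names shown here (1.4 gives I14, 2 gives I20, and so on); other inflation values may be encoded differently by mcl.

```python
def mcl_output_name(input_name, inflation):
    """Guess mcl's default output name for the inflation values shown above."""
    # One decimal place with the separator removed: 1.4 -> "14", 2 -> "20".
    suffix = "I" + "{:.1f}".format(inflation).replace(".", "")
    return "out." + input_name + "." + suffix


if __name__ == "__main__":
    for inflation in (1.4, 2, 4, 6):
        print(mcl_output_name("seq.mci", inflation))
```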

      Now the file out.seq.mci.I14 and its associates can be used for example to compute the distances between the encoded clusterings with clm
      dist, to compute a set of strictly reconciled nested clusterings with clm order, or to compute an efficiency criterion with clm info.

      Alternatively, label output can be obtained directly from mcl as follows.

	 mcl seq.mci -I 1.4  -use-tab seq.tab
	 mcl seq.mci -I 2  -use-tab seq.tab
	 mcl seq.mci -I 4  -use-tab seq.tab
	 mcl seq.mci -I 6  -use-tab seq.tab

  Clustering expression data
      The clustering of expression data constitutes another workflow. In this case the external format usually is a tabular file format containing
      labels for genes or probes and numerical values measuring the expression values or fold changes across a series of conditions or
      experiments. Such tabular files can be processed by mcxarray(1), which comes installed with mcl. The program computes correlations (either Pearson
      or Spearman) between genes, and creates an edge between genes if their correlation exceeds the specified cutoff. From this mcxarray(1)
      creates both a network file and a dictionary file. In the example below, the file expr.data is in tabular format with one row of column headers
      (e.g. tags for experiments) and one column of row identifiers (e.g. probe or gene identifiers).

	 mcxarray -data expr.data -skipr 1 -skipc 1 -o expr.mci -write-tab expr.tab --pearson -co 0.7 -tf 'abs(),add(-0.7)'

      This uses the Pearson correlation, ignoring values below 0.7. The remaining values in the interval [0.7, 1] are remapped to the interval
      [0, 0.3]. This is recommended so that the edge weights will have increased contrast between them, as mcl is affected by relative differences
      (ratios) between edge weights rather than absolute differences. To illustrate this, values 0.75 and 0.95 are mapped to 0.05 and 0.25, with
      respective ratios 0.79 before and 0.2 after the remapping. The network file expr.mci and the dictionary file expr.tab can now be used as before.
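
The effect of the remapping on edge-weight ratios can be checked with a few lines of Python. This is only the arithmetic of 'abs(),add(-0.7)' applied to the two example values, not a call into mcxarray; the function name `remap` is illustrative.

```python
def remap(correlation, cutoff=0.7):
    """Mimic -tf 'abs(),add(-0.7)': absolute value, then subtract the cutoff."""
    return abs(correlation) - cutoff


low, high = 0.75, 0.95
ratio_before = low / high                 # ratio of the raw correlations
ratio_after = remap(low) / remap(high)    # ratio after remapping

if __name__ == "__main__":
    print("before: %.2f  after: %.2f" % (ratio_before, ratio_after))
```

The ratio drops from roughly 0.79 to roughly 0.2, so the weaker edge counts for much less relative to the stronger one once the cutoff has been subtracted.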

      It is possible to investigate the effect of the correlation cutoff as follows. First a network is generated at a very low threshold, and
      this network is analysed using mcx query.

	 mcxarray -data expr.data -skipr 1 -skipc 1 -o expr20.mcx --write-binary --pearson -co 0.2 -tf 'abs()'
	 mcx query -imx expr20.mcx --vary-correlation

      The  output  is  in a tabular format describing the properties of the network at increasing correlation thresholds. Examples are the size of
      the biggest component, the number of orphan nodes (not connected to any other node), and the mean and median node degrees.  A  good  way	to
      choose  the  cutoff  is to balance the number of singletons and the median node degree. Both should preferably not be too high.  For example
      the number of orphan nodes should be less than ten percent of the total number of nodes, and the median node degree should be  at  most  one
      hundred neighbours.
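
The two quantities used in this rule of thumb can be computed directly from an edge list. The sketch below is a plain Python illustration and does not reproduce the mcx query output; the names `network_summary` and `acceptable` are hypothetical, and edges are assumed to be (label, label, correlation) tuples that are kept when the absolute correlation reaches the cutoff.

```python
from statistics import median


def network_summary(nodes, edges, cutoff):
    """Return (orphan_fraction, median_degree) at a correlation cutoff.

    nodes: all node labels; edges: (a, b, correlation) tuples.
    """
    degree = {node: 0 for node in nodes}
    for a, b, corr in edges:
        if a != b and abs(corr) >= cutoff:
            degree[a] += 1
            degree[b] += 1
    orphans = sum(1 for d in degree.values() if d == 0)
    return orphans / len(degree), median(degree.values())


def acceptable(orphan_fraction, median_degree):
    """Rule of thumb from the text: under 10% orphans, median degree at most 100."""
    return orphan_fraction < 0.10 and median_degree <= 100


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    edges = [("A", "B", 0.9), ("B", "C", 0.5), ("C", "D", 0.3)]
    for cutoff in (0.2, 0.4):
        summary = network_summary(nodes, edges, cutoff)
        print(cutoff, summary, acceptable(*summary))
```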

  Reducing node degrees in the graph
      A  good way to lower node degrees in a network is to require that an edge is among the best k edges (those of highest weight) for both nodes
      incident to the edge, for some value of k. This is achieved by using knn(k) in the argument to the -tf option to mcl or mcx alter.  To  give
      an example, a graph was formed on translations in Ensembl release 57 on 2.6M nodes.  The similarities were obtained from BLAST scores, lead-
      ing to a graph with a total edge count of 300M, with best-connected nodes of degree respectively 11148, 9083, 9070, 9019 and 8988, and  with
      mean  node  degree  233.	 These	degrees  are unreasonable.  The graph was subjected to mcx query to investigate the effect of varying k-NN
      parameters. A good heuristic is to choose a value that does not significantly change the number of singletons in the input  graph.   In  the
      example it meant that -tf 'knn(160)' was feasible, leading to a mean node degree of 98.
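
The mutual k-nearest-neighbour criterion described at the start of this section can be illustrated with a small Python sketch. It follows the wording of the text (an edge survives only if it is among the k heaviest edges of both its endpoints) and is not taken from the mcl sources; the function name `mutual_knn` and the tie-breaking by input order are assumptions.

```python
from collections import defaultdict


def mutual_knn(edges, k):
    """Keep edges that rank among the k heaviest for both endpoints.

    edges: list of (a, b, weight) tuples for an undirected graph.
    """
    incident = defaultdict(list)
    for index, (a, b, weight) in enumerate(edges):
        incident[a].append((weight, index))
        incident[b].append((weight, index))
    top = {}
    for node, items in incident.items():
        # Heaviest first; the sort is stable, so ties keep input order.
        items.sort(key=lambda item: -item[0])
        top[node] = {index for _, index in items[:k]}
    return [edge for index, edge in enumerate(edges)
            if index in top[edge[0]] and index in top[edge[1]]]


if __name__ == "__main__":
    graph = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.1), ("D", "E", 0.5)]
    print(mutual_knn(graph, 2))
```

In the demo graph the edge A--D is dropped for k = 2 because it is only the third-best edge of node A, even though it is within the best two edges of node D.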

      A  second  approach to reduce node degrees is to employ the -ceil-nb option.  This ranks nodes by node degree, highest first. Nodes are con-
      sidered in order of rank, and edges of low weight are removed from the graph until a node satisfies the node degree threshold  specified	by
      -ceil-nb.
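
A loose Python reading of this second approach is given below. The text does not spell out details such as tie-breaking or whether ranks are recomputed as edges disappear, so the sketch fixes those choices arbitrarily (nodes are ranked once by initial degree, and the lightest edges of each node are removed first); `cap_node_degrees` is a hypothetical name and the code should not be taken as the behaviour of -ceil-nb itself.

```python
def cap_node_degrees(edges, max_degree):
    """Remove light edges until ranked nodes have at most max_degree edges.

    edges: list of (a, b, weight) tuples for an undirected graph.
    """
    alive = set(range(len(edges)))
    incident = {}
    for index, (a, b, _) in enumerate(edges):
        incident.setdefault(a, []).append(index)
        if b != a:
            incident.setdefault(b, []).append(index)
    # Rank nodes by their initial degree, highest first.
    ranked = sorted(incident, key=lambda node: -len(incident[node]))
    for node in ranked:
        # Edges of this node that earlier steps have not removed, lightest first.
        current = sorted((i for i in incident[node] if i in alive),
                         key=lambda i: edges[i][2])
        excess = len(current) - max_degree
        for i in current[:max(excess, 0)]:
            alive.discard(i)
    return [edges[i] for i in sorted(alive)]


if __name__ == "__main__":
    graph = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.1), ("D", "E", 0.5)]
    print(cap_node_degrees(graph, 2))
```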

  SEE ALSO
      mcxio(5).

  AUTHOR
      Stijn van Dongen.

  clmprotocols 12-068						      8 Mar 2012						     clmprotocols(5)