Average within reps reformat according to second file

04-20-2015

Registered User

16, 0

Join Date: Mar 2015

Last Activity: 5 October 2017, 4:20 PM EDT

Posts: 16

Thanks Given: 12

Thanked 0 Times in 0 Posts

Average within reps reformat according to second file

Please help me in the following,

I have a matrix file

Code:

 S2 S1 S3 S4 S5
G1 11 12 13 14 15
G2 21 22 23 24 25
G3 31 32 33 34 35
G4 41 42 43 44 45

a datafile

Code:

Sample Loc Rep T1 T2 T3 RC1 RC2 RC3 
S1 L1 1 1.5 NULL 45 R F T
S1 L1 2 2.5 2 NULL 35 F G
S1 L2 1 4 3 NULL F T R
S2 L1 1 56 45 24 F G Y
S2 L2 1 10 5 NULL G F Y
S2 L2 2 20 NULL 34 F G T
S3 L1 1 3.4 NULL 32 F T Y
S3 L2 1 4.6 3 21 D D R

and a query file

Code:

T1
T2
T3
T4

I would like to

1. Find average of Rep only for the query columns , group by Sample and Loc , the column order is not fixed, Sample is col1 in the example but maybe col5 in the data. So they should be taken dynamically as keywords like 'Sample' from the header column in datafile. Missing data is indicated by NULL.
If some entries like T4 in the query file are not present in the data, that column name can be ignored.

2. Output both the matrix and the datafiles as separate files so that they have the same common samples , and arranged in the same sequence.

Output

Code:

   S2 S1 S3
G1 11 12 13
G2 21 22 23
G3 31 32 33
G4 41 42 43

Output2

Code:

     
   S2 S1 S3
T1_L1 56 2 3.4	  
T1_L2 15 4 4.6
T2_L1 45 2 NULL 
T2_L2  5 3 3
T3_L1 24 45 32
T3_L2 34 NULL 21

Please not that columns are in same order in both files of output. Order doesn't matter as long as they are in sync in the same sequence like {s2,s1,s3} in both, {s1,s2,s3} is also acceptable.

I tried this

Code:

awk ' FILENAME=="QUERY.TXT" { cols[$0];next }
      FILENAME=="data.txt"  && NR==1 {
        for(i=1; i<=NF; i++)
        {
        if ($i=="Sample")
        s[1]=i
        if ($i=="Loc")
        s[2]=i
        if ($i=="Rep")
        s[3]=i
        }
        next;
  }
  {
    
    for(i=1; i<=NF; i++)
      if(($i in cols))
       for(j=1; j<=p; j++) {
               st=a[j]
               for(i in s){
                   st=st" "s[j];
               }
        print st }
    if  !($i in cols)
        delete s["i"];
    }
'  QUERY.TXT data.txt data.txt mat.txt

Last edited by sheetalk; 04-20-2015 at 04:01 PM..

sheetalk

View Public Profile for sheetalk

Find all posts by sheetalk

04-21-2015

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

You lost me, entirely. I can't understand your request nor your code sample. However, threefour comments on your code:
- FILENAME=="data.txt" && NR==1 won't ever become TRUE as NR is the stream counter and is gone with the first file; try FNR instead.
- The index into cols will be T1 ... T4, which occur in line 1 of data.txt only; so it's pointless to compare every single line, and two times.
- When defined, the s array is indexed by integers, so I doubt you can delete it via string index "i".
- The a array is never defined, so st will be initialized to "".

Last edited by RudiC; 04-21-2015 at 05:18 AM..

This User Gave Thanks to RudiC For This Post:

RudiC

View Public Profile for RudiC

Find all posts by RudiC

04-21-2015

Registered User

16, 0

Join Date: Mar 2015

Last Activity: 5 October 2017, 4:20 PM EDT

Posts: 16

Thanks Given: 12

Thanked 0 Times in 0 Posts

Let me explain once more with a simpler example.

I have a datafile where the columns are in no particular order and missing values indicated as "NULL".

Code:

Sample 	Loc 	Rep	T1	T2	C1	C2	C3
S1	L1	1	5	6	6	NULL	TY
S1	L1	2	NULL	8	YU	TY	NULL
S1	L2	1	7	NULL	GG	MP	RU
S2	L1	1	4	6	TY	NULL	NULL

This also could have been

Code:

T1	T2	C1	C2	C3	Sample 	Loc 	Rep
5	6	6	NULL	TY	S1	L1	1
NULL	8	YU	TY	NULL	S1	L1	2
7	NULL	GG	MP	RU	S1	L2	1
4	6	TY	NULL	NULL	S2	L1	1

Now I want to find the average of certain columns (listed in a query file) by headernames Sample and Loc. If there is a columnname in the query file like T3 which doesnt exist in the datafile, then just ignore that name.

Query file

Code:

T1
T2
T3

Output from this step

Code:

Sample 	Loc 	T1	T2
S1	L1	5	7
S1	L2	7	NULL
S2	L1	4	6

This can be reformatted by transposing to

Code:

	S1	S2
T1_L1	 5	4
T1_L2 	7	NULL
T2_L1 	7	6
T2_L2 	NULL	NULL

Now, there is a 3rd matrix file with the samples as columns

Code:

	S2	S1	S3
G1	11	22	77
G2	33	44	88
G3	55	66	99

I want to reformat this matrix to have as many columns as the output in the previous step and in the same order. Only the common samples are to be outputted for both files and in the same order.

Code:

Please let me know if I can explain further.

sheetalk

View Public Profile for sheetalk

Find all posts by sheetalk

Shell Programming and Scripting

Average within reps reformat according to second file

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to reformat output if input file is empty, but not if file has data in it

Discussion started by: cmccabe

2. Shell Programming and Scripting

[Solved] File reformat

Discussion started by: cmccabe

3. Shell Programming and Scripting

awk reformat file

Discussion started by: yifangt

4. Shell Programming and Scripting

Major File Reformat

Discussion started by: Blue Solo

5. Shell Programming and Scripting

Reformat file using nawk

Discussion started by: dazedandconfuse

6. Shell Programming and Scripting

Reformat a file

Discussion started by: r_t_1601

7. Shell Programming and Scripting

Reformat the data of a file.

Discussion started by: samkhu

8. Shell Programming and Scripting

Please help me reformat this file

Discussion started by: digipak

9. Shell Programming and Scripting

Reformat Crontab file

Discussion started by: alnita

10. Shell Programming and Scripting

reformat the file

Discussion started by: CamTu