Lookup horizontally and vertically and calculate counts


 
# 1  
Old 03-22-2015

Hello,

Please help create the following report.

I have a matrix

Code:
-  S1 S2 S3 S4  
M1 AA AA TT -
M2 AG AG AA GG
M3 GG TT - -

a first lookup table

Code:
M3 chr7 4.456
M1 chr7 28.9
M2 chr8 129.678

a second lookup table

Code:
S1 GGHBBGG/DEDD(@DCCD)
S2 GGHBBGG/DEDD(@DCCD)//B-
S3 GGHBBGG/DEDD(@HH?)//B1@NNN
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1

For each row, I want to count each nucleotide (As, Ts, Cs and Gs) and calculate a few variables as follows:

Code:
total = NF-1
missing = total number of "-"
mono = total number of ( AA + GG + CC + TT)
mix = total number of ( AT + AC + AG + CT + CG + GT)
m = 2nd highest among (A,T,C,G) / total (A,T,C,G)
data = total - missing


As an example calculation of m: M1 has 4 As and 2 Ts; the other counts are 0.
So m for M1 = number of Ts, which is the 2nd highest count (2) / total A, T, G, C (6) = 0.33
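
As a rough illustration of the definitions above (not the poster's script), the per-row counts other than m can be tallied in a single awk pass. This is only a sketch: it assumes every non-missing cell is exactly two letters, and the function name `tally_rows` and the key=value output layout are made up for the example.

```shell
# Hypothetical helper: tally the per-row variables for a matrix read from stdin.
tally_rows() {
    awk '
    NR == 1 { next }                       # skip the header row of sample names
    {
        total = NF - 1
        missing = mono = mix = 0
        split("", n)                       # reset the per-row base counts
        for (i = 2; i <= NF; i++) {
            if ($i == "-") { missing++; continue }
            b1 = substr($i, 1, 1)
            b2 = substr($i, 2, 1)
            n[b1]++
            n[b2]++
            if (b1 == b2) mono++; else mix++
        }
        printf "%s A=%d T=%d G=%d C=%d missing=%d mono=%d mix=%d data=%d total=%d\n",
               $1, n["A"], n["T"], n["G"], n["C"], missing, mono, mix, total - missing, total
    }'
}

# Sample matrix from the post
printf '%s\n' '- S1 S2 S3 S4' 'M1 AA AA TT -' 'M2 AG AG AA GG' 'M3 GG TT - -' | tally_rows
```

For M1 this should print `M1 A=4 T=2 G=0 C=0 missing=1 mono=3 mix=0 data=3 total=4`, which agrees with the M1 column of the desired report.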



Here is what my report should look like.
I can't seem to line up the table for some reason; it is space delimited.
Code:
NAME                            M1   M2      M3
CHR                             chr7 chr8    chr7
POS                             28.9 129.678 4.456
A                               4    4       0
T                               2    0       2
G                               0    4       2
C                               0    0       0
-                               1    0       2
m                               0.33 0.5     0.5
data                            3    4       2
mono                            3    2       2
mixed                           0    2       0
total                           4    4       4
S1 GGHBBGG/DEDD(@DCCD)          AA   AG      GG
S2 GGHBBGG/DEDD(@DCCD)//B-      AA   AG      TT
S3 GGHBBGG/DEDD(@HH?)//B1@NNN   TT   AA      -
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1   -    GG      -

I tried this, please help

Code:
 awk 'NR==FNR{ l[$1]=$2 FS $3;next} $1 in l { $1=$1 FS l[$1]}1' lookup1 file |
	       	     awk '{ for (i=2;i++;i<=NF)
	       	     		if ($i=="AA" || $i=="GG" || $i=="CC" || $i=="TT")
	       	     			mono=mono+1
	       	     				if ($i=="AA")
	       	     					a=a+2
	       	     				else if ($i=="GG")
	       	     					g=g+2
	       	     				if ($i=="CC")
	       	     					c=c+2
	       	     				else if ($i=="TT")
	       	     					t=t+2
	       	     				fi
	       	     		else if ($i=="AT" || $i=="AG" || $i=="AC" || $i=="CT" || $i=="CG" || $i=="GT")
	       	     				mix=mix+1
	       	     		
	       	     		else if ($i="-")
	       	     				missing=missing+1	
	       	     		fi
	       	     		total=NF-1
	       	     		data=(NF-1)-missing
	       	     		$1= $1,a,c,t,g,mono,mix,missing,total,data
	       	     		}1' | awk '
		     { 
		         for (i=1; i<=NF; i++)  {
		             a[NR,i] = $i
		         }
		     }
		     NF>p { p = NF }
		     END {    
		         for(j=1; j<=p; j++) {
		             str=a[1,j]
		             for(i=2; i<=NR; i++){
		                 str=str" "a[i,j];
		             }
		             print str
		         }
	             }'  | awk  'NR==FNR{ l[$1]=$2 ;next} $1 in l { $1=$1 FS l[$1]}1' lookup2 - > final_report


Last edited by jianp83; 03-22-2015 at 12:19 PM..
# 2  
Old 03-22-2015
You're not printing anything in the first awk, so there's nothing to pipe to the second. Even if there were, the second awk reads only file1, not stdin from the pipe (you can use "-" to supply stdin as a "file name"). On top of that, it doesn't print anything either. Please be aware that variable assignments are lost between separate awk scripts.
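
To illustrate the point about "-": the following is a minimal sketch, assuming a lookup file shaped like the first lookup table. The helper name `annotate_rows` and the temporary file are made up for the example. The lookup file is read first (while NR == FNR), and the piped rows then arrive through the "-" operand. The NR == FNR test relies on the lookup file not being empty.

```shell
# Hypothetical helper: $1 is the path to a lookup table; rows to annotate come from stdin.
annotate_rows() {
    awk 'NR == FNR { l[$1] = $2 FS $3; next }   # first operand: the lookup table
         $1 in l   { $1 = $1 FS l[$1] }         # second operand "-": rows from the pipe
         { print }' "$1" -
}

# Throwaway copy of the first lookup table
lookup=$(mktemp)
printf '%s\n' 'M3 chr7 4.456' 'M1 chr7 28.9' 'M2 chr8 129.678' > "$lookup"

# The first awk must print its rows, otherwise the second awk receives nothing.
printf '%s\n' 'M1 AA AA TT -' 'M2 AG AG AA GG' |
    awk '{ print }' |
    annotate_rows "$lookup"

rm -f "$lookup"
```

Any counters computed in the first awk would have to be printed as extra fields to reach the second one, since each awk in the pipeline is a separate process with its own variables.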
# 3  
Old 03-22-2015
Thanks for your suggestions. Would you please look at the script now? Please also help with the calculation of m. I feel I have the structure of the code right, but it does not produce the desired report. How do I include the row names like "NAME", "CHR", etc.?
# 4  
Old 03-22-2015
Try
Code:
awk     'FNR==1         {FILE++}

         FILE==1        {CHR[$1]=$2
                         POS[$1]=$3
                         next
                        }

         FILE==2        {S2[$1]=$0
                         next
                        }

         FILE==3        {if (FNR==1)    {for (i=2; i<=NF; i++) S[i-1]=$i
                                         next
                                        }
                         M[$1]
                         for (i=2; i<=NF; i++)  {S1[$1,S[i-1]]=$i
                                                 total[$1]++
                                                 if ($i == "-")  missing[$1]++
                                                 else           {data[$1]++
                                                                 C1=substr($i,1,1)
                                                                 C2=substr($i,2,1)
                                                                 GN[C1]
                                                                 GN[C2]
                                                                 if (C1 == C2)  {mono[$1]++
                                                                                 G[$1,C1]+=2
                                                                                }
                                                                 else           {mixed[$1]++
                                                                                 G[$1,C1]++
                                                                                 G[$1,C2]++
                                                                                }
                                                                }
                                                }
                        }

         END            {printf "%-40s", "NAME"
                           for (m in M) printf "\t%s", m
                           printf "\n"
                         printf "%-40s",  "CHR"
                           for (m in M) printf "\t%s", CHR[m]
                           printf "\n"
                         printf "%-40s",  "POS"
                           for (m in M) printf "\t%s", POS[m]
                           printf "\n"

                         for (g in GN)  {printf  "%-40s", g
                                           for (m in M) printf "\t%s", G[m, g]+0
                                           printf "\n"
                                        }

                         printf  "%-40s", "-"
                           for (m in M) printf "\t%s", missing[m]+0
                           printf "\n"

                         printf "m"
                           printf "\n"

                         printf  "%-40s", "data"
                           for (m in M) printf "\t%s", data[m]+0
                           printf "\n"

                         printf  "%-40s", "mono"
                           for (m in M) printf "\t%s", mono[m]+0
                           printf "\n"

                         printf  "%-40s", "mixed"
                           for (m in M) printf "\t%s", mixed[m]+0
                           printf "\n"

                         printf  "%-40s", "total"
                           for (m in M) printf "\t%s", total[m]+0
                           printf "\n"

                         for (s in S)   {printf  "%-40s", S2[S[s]]
                                           for (m in M) printf "\t%s", S1[m, S[s]]
                                           printf "\n"
                                        }
                        }
        ' lookup1 lookup2 matrix
NAME                                        M1    M2    M3
CHR                                         chr7    chr8    chr7
POS                                         28.9    129.678    4.456
A                                           4    4    0
G                                           0    4    2
T                                           2    0    2
-                                           1    0    2
m
data                                        3    4    2
mono                                        3    2    2
mixed                                       0    2    0
total                                       4    4    4
S1 GGHBBGG/DEDD(@DCCD)                      AA    AG    GG
S2 GGHBBGG/DEDD(@DCCD)//B-                  AA    AG    TT
S3 GGHBBGG/DEDD(@HH?)//B1@NNN               TT    AA    -
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1               -    GG    -

Base C is missing because it doesn't occur in the matrix. It is assumed that the data field count is constant across the matrix. Please excuse the very short variable/array names - I didn't know the meaning behind them, so I couldn't find meaningful names.
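
If the C row and a stable row/column order are wanted, one option is to drive the output loops from fixed lists rather than from `for (x in array)`, whose traversal order awk leaves unspecified. The following is a minimal sketch of that idea for the base-count rows only; the name `base_rows` is hypothetical, and it assumes two-letter cells as in the sample matrix.

```shell
# Hypothetical helper: print the A/T/G/C count rows in a fixed order,
# with one column per marker in the order the markers were read.
base_rows() {
    awk '
    NR == 1 { next }                        # skip the header row of sample names
    {
        name[++nm] = $1                     # remember marker order as read
        for (i = 2; i <= NF; i++) {
            if ($i == "-") continue
            cnt[$1, substr($i, 1, 1)]++
            cnt[$1, substr($i, 2, 1)]++
        }
    }
    END {
        nb = split("A T G C", base, " ")    # fixed row order, C always included
        for (b = 1; b <= nb; b++) {
            printf "%s", base[b]
            for (k = 1; k <= nm; k++) printf " %d", cnt[name[k], base[b]]
            printf "\n"
        }
    }'
}

printf '%s\n' '- S1 S2 S3 S4' 'M1 AA AA TT -' 'M2 AG AG AA GG' 'M3 GG TT - -' | base_rows
```

With the sample matrix this should give the rows `A 4 4 0`, `T 2 0 2`, `G 0 4 2` and `C 0 0 0`, in that order.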

I don't have the slightest idea how to compute m. The denominator is easy: 2 * (total - missing). But how to find the numerator, the second highest base count?

Last edited by RudiC; 03-22-2015 at 04:29 PM..
# 5  
Old 03-22-2015
Yes, the numerator is the minor base count, i.e. the second highest base count.

Code:
M1: it is T (2)
M2: the highest and the second highest are the same, so it is 4
M3: the highest and the second highest are the same, so it is 2

# 6  
Old 03-22-2015
That doesn't necessarily help. I've seen that, but don't have an idea for an algorithm to find it.
# 7  
Old 03-22-2015
Could I suggest these changes to the END block for the m calculation (the Max1/Max2 tracking in the base-count loop and the filled-in "m" row):

Code:
                         for (g in GN)  {printf  "%-40s", g
                                           for (m in M) {
                                               printf "\t%s", G[m, g]+0
                                               # track the highest (Max1) and second highest (Max2) base count per marker
                                               if(G[m, g] > Max1[m]) {
                                                  Max2[m]=Max1[m]+0
                                                  Max1[m]=G[m, g]
                                               } else if (G[m, g] > Max2[m]) Max2[m]=G[m, g]
                                           }
                                           printf "\n"
                                        }

                         printf  "%-40s", "-"
                           for (m in M) printf "\t%s", missing[m]+0
                           printf "\n"

                         # m = second highest base count / total bases; 0 when a marker has no data
                         printf  "%-40s", "m"
                           for (m in M) printf "\t%4.2f",
                               (missing[m]>=total[m]?0:(Max2[m]+0)/(2*(total[m]-missing[m])))
                           printf "\n"
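
To check the m values on their own, the same two-maximum idea can be run row by row in a standalone script. This is only a sketch under the assumption that every non-missing cell has two letters; the name `minor_fraction` is made up, and it is not meant as a replacement for the END-block change above.

```shell
# Hypothetical helper: print "<marker> <m>" for each row of a matrix read from stdin.
minor_fraction() {
    awk '
    NR == 1 { next }                        # skip the header row of sample names
    {
        split("", n)                        # reset the per-row base counts
        bases = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "-") continue
            n[substr($i, 1, 1)]++
            n[substr($i, 2, 1)]++
            bases += 2
        }
        max1 = max2 = 0                     # highest and second highest count
        for (b in n) {
            if (n[b] > max1) { max2 = max1; max1 = n[b] }
            else if (n[b] > max2) max2 = n[b]
        }
        printf "%s %.2f\n", $1, (bases ? max2 / bases : 0)
    }'
}

printf '%s\n' '- S1 S2 S3 S4' 'M1 AA AA TT -' 'M2 AG AG AA GG' 'M3 GG TT - -' | minor_fraction
```

On the sample matrix this should print `M1 0.33`, `M2 0.50` and `M3 0.50`. A tie between the two largest counts (as in M2 and M3) lands in the `else if` branch, so the second maximum equals the first.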
