How to get min and max values using awk?


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting How to get min and max values using awk?
# 8  
Old 08-03-2014
Hi redse171,
Thanks of rthe update. That gives us a better idea of what you are trying to do. Although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)

I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:
  1. Do you want the output to contain the "CD" field from the input?
  2. Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
  3. If the answer to #2 is no, does the order of lines in your output file matter?
This User Gave Thanks to Don Cragun For This Post:
# 9  
Old 08-04-2014
Answers to Don Cragun's above question may kill the assumptions on which this is based. Try
Code:
awk     '$2 != "CD"     {next}                                          # not a "CD" line -> no action
         !($7 in LINE)  {LINE[$7]=$0}                                   # new $7? Keep line with first occurrence of $3/$4 in memory
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               # count $7 lines and keep last $3 and $4

         END            {for (i in LINE) if (CNT[i]>=2) {               # for the lines recorded, if count = 1: discard
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if 
                                                                        # sure the file structure remains identical all over)
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    # take decision on + or -
                                        POS=RSTART                      # where to replace
                                        STR=E3[i]}                      # what to put in 
                                  else {POS=RSTART+8
                                        STR=E4[i]} 
                                 print  substr (LINE[i], 1, POS-2),     # print first part of line, dep. on sign
                                        STR,                            #       replacement string
                                        substr (LINE[i], POS+8)         #       last part
                                }
                        }
        ' FS="\t" OFS="\t" file
SP12.3    CD    2249762    2252075    -    ID1_N006     ID2_N006T1
SP12.5    CD    3001307    3005025    +    ID1_N01140    ID2_N01140T0
SP12.3    CD    2249762    2253117    -    ID1_N006     ID2_N006T0
SP12.3    CD    2240806    2241681    +    ID1_N003     ID2_N003T0

This User Gave Thanks to RudiC For This Post:
# 10  
Old 08-04-2014
Quote:
Originally Posted by Don Cragun
Hi redse171,
Thanks of rthe update. That gives us a better idea of what you are trying to do. Although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)

I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:
  1. Do you want the output to contain the "CD" field from the input?
  2. Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
  3. If the answer to #2 is no, does the order of lines in your output file matter?
Hi Don Crugan,

To answer your questions:-

1. Yes, i need to have "CD" field in my output file as shown in my sample output
2. Yes for my huge input files

thanks.

---------- Post updated at 10:12 AM ---------- Previous update was at 10:07 AM ----------

Quote:
Originally Posted by RudiC
Answers to Don Cragun's above question may kill the assumptions on which this is based. Try
Code:
awk     '$2 != "CD"     {next}                                          # not a "CD" line -> no action
         !($7 in LINE)  {LINE[$7]=$0}                                   # new $7? Keep line with first occurrence of $3/$4 in memory
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               # count $7 lines and keep last $3 and $4

         END            {for (i in LINE) if (CNT[i]>=2) {               # for the lines recorded, if count = 1: discard
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if 
                                                                        # sure the file structure remains identical all over)
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    # take decision on + or -
                                        POS=RSTART                      # where to replace
                                        STR=E3[i]}                      # what to put in 
                                  else {POS=RSTART+8
                                        STR=E4[i]} 
                                 print  substr (LINE[i], 1, POS-2),     # print first part of line, dep. on sign
                                        STR,                            #       replacement string
                                        substr (LINE[i], POS+8)         #       last part
                                }
                        }
        ' FS="\t" OFS="\t" file
SP12.3    CD    2249762    2252075    -    ID1_N006     ID2_N006T1
SP12.5    CD    3001307    3005025    +    ID1_N01140    ID2_N01140T0
SP12.3    CD    2249762    2253117    -    ID1_N006     ID2_N006T0
SP12.3    CD    2240806    2241681    +    ID1_N003     ID2_N003T0

Hi RudiC,

Tried your codes and thanks so much for your explanations. It seems working for my real input file except that there are few lines a little bit weird. I am checking on it now and try play around with your codes. Will give the feedback asap. Thanks

---------- Post updated at 09:25 PM ---------- Previous update was at 10:12 AM ----------

Hi,

just to give feedback. The codes by RudiC is modified to suit my real data. The codes worked well with the sample data but there was an issue with the number and position of digits (values) in $3 and $4 in my real huge file. So, i split the LINE into segments and take the value from the segments (info from awk manual). Thanks to RudiC for the codes and explanations that help me to understand better. Below is the codes that being modified and i got the results that i wanted.

Code:
awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") 
                                                                        
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    
                                        POS=RSTART                      
                                        STR=E3[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2], 
                                        STR,                            
                                        seg[4], seg[5], seg[6], seg[7] 
                                 }                      
                                 else {POS=RSTART+7
                                       STR=E4[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2], seg[3],     
                                        STR,                           
                                        seg[5], seg[6], seg[7] 

                                 }
                                }
                        }
        ' FS="\t" OFS="\t" File1

My first code was not informative enough as i don't have any idea how to find the min and max from my input file and what i gave was just to extract all line with CD patterns. The help that i got here is awesome and help me to learn and understand better. thanks a lot! . Smilie
# 11  
Old 08-04-2014
Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5 (you can just look directly at seg[5]) after you call split(). You can then simplify your code to something like:
Code:
awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {    
					 print  seg[1], seg[2], E3[i], 
						seg[4], seg[5], seg[6], seg[7] 
                                 } else {
					 print  seg[1], seg[2], seg[3],     
						E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" File1

and get the same results.

Hope this helps,
Don
# 12  
Old 08-05-2014
Further simplification:
Code:
awk     '$2 != "CD"     {next}
         !($7 in LINE)  {LINE[$7]=$0}
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}

         END            {for (i in LINE) if (CNT[i]>=2) {
                                split(LINE[i], seg)
                                if (seg[5] == "-")      seg[3] = E3[i]
                                else                    seg[4] = E4[i]
                                print  seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
                         }
                        }
        ' FS="\t" OFS="\t" file

This User Gave Thanks to RudiC For This Post:
# 13  
Old 08-05-2014
Quote:
Originally Posted by Don Cragun
Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5 (you can just look directly at seg[5]) after you call split(). You can then simplify your code to something like:
Code:
awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {    
					 print  seg[1], seg[2], E3[i], 
						seg[4], seg[5], seg[6], seg[7] 
                                 } else {
					 print  seg[1], seg[2], seg[3],     
						E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" File1

and get the same results.

Hope this helps,
Don
Hi Don,

It does help!.. It just that i need to add a tiny part (in blue) there at printing part or else it wont show $4 in my output.


Code:
awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {    
					 print  seg[1], seg[2], seg[3]= E3[i], 
						seg[4], seg[5], seg[6], seg[7] 
                                 } else {
					 print  seg[1], seg[2], seg[3],     
						seg[4]=E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" file1

Thanks a bunch Smilie

---------- Post updated at 09:32 AM ---------- Previous update was at 09:31 AM ----------

Quote:
Originally Posted by RudiC
Further simplification:
Code:
awk     '$2 != "CD"     {next}
         !($7 in LINE)  {LINE[$7]=$0}
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}

         END            {for (i in LINE) if (CNT[i]>=2) {
                                split(LINE[i], seg)
                                if (seg[5] == "-")      seg[3] = E3[i]
                                else                    seg[4] = E4[i]
                                print  seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
                         }
                        }
        ' FS="\t" OFS="\t" file

Hi RudiC,

This is a lot cleaner!! Many thanks Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk Sort 2d histogram output from min(X,Y) to max(X,Y)

I've got Gnuplot-format 2D histogram data output which looks as follows. 6.5 -1.25 10.2804 6.5404 -1.25 10.4907 6.58081 -1.25 10.8087 6.62121 -1.25 10.4686 6.66162 -1.25 10.506 6.70202 -1.25 10.3084 6.74242 -1.25 9.68256 6.78283 -1.25 9.41229 6.82323 -1.25 9.43078 6.86364 -1.25 9.62408... (1 Reply)
Discussion started by: chrisjorg
1 Replies

2. Shell Programming and Scripting

awk search for max and min while ignoring special character

I am trying to get a simple min/max script to work with the below input. Note the special character (">") within it. Script awk 'BEGIN{max=0}{if(($1)>max) max=($1)}END {print max}' awk 'BEGIN{min=0}{if(($2)<min) min=($2)}END {print min}' Input -122.2840 42.0009 -119.9950 ... (7 Replies)
Discussion started by: ncwxpanther
7 Replies

3. Shell Programming and Scripting

awk script to find min and max value

I need to find the max/min of columns 1 and 2 of a 2 column file what contains the special character ">". I know that this will find the max value of column 1. awk 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' input.file But what if I needed to ignore special characters in the... (3 Replies)
Discussion started by: ncwxpanther
3 Replies

4. Shell Programming and Scripting

Get the min avg and max with awk

aaa: 3 ms aaa: 2 ms aaa: 5 ms aaa: 10 ms .......... to get the 3 2 5 10 ...'s min avg and max something like min: 2 ms avg: 5 ms max: 10 ms (2 Replies)
Discussion started by: yanglei_fage
2 Replies

5. Shell Programming and Scripting

Average, min and max in file with header, using awk

Hi, I have a file which looks like this: FID IID MISS_PHENO N_MISS N_GENO F_MISS 12AB43131 12AB43131 N 17774 906341 0.01961 65HJ87451 65HJ87451 N 10149 906341 0.0112 43JJ21345 43JJ21345 N 2826 906341 0.003118I would... (11 Replies)
Discussion started by: kayakj
11 Replies

6. UNIX for Dummies Questions & Answers

[Solved] Print a line using a max and a min values of different columns

Hi guys, I already search on the forum but i can't solve this on my own. I have a lot of files like this: And i need to print the line with the maximum value in last column but if the value is the same (2 in this exemple for the 3 last lines) i need get the line with the minimum value in... (4 Replies)
Discussion started by: MetaBolic0
4 Replies

7. Shell Programming and Scripting

AWK script - extracting min and max values from selected lines

Hi guys! I'm new to scripting and I need to write a script in awk. Here is example of file on which I'm working ATOM 4688 HG1 PRO A 322 18.080 59.680 137.020 1.00 0.00 ATOM 4689 HG2 PRO A 322 18.850 61.220 137.010 1.00 0.00 ATOM 4690 CD ... (18 Replies)
Discussion started by: grincz
18 Replies

8. Shell Programming and Scripting

Find min.max value if matching columns found using AWK

Input_ File : 2 3 4 5 1 1 0 1 2 1 -1 1 2 1 3 1 3 1 4 1 6 5 6 6 6 6 6 7 6 7 6 8 5 8 6 7 Desired output : 2 3 4 5 -1 1 4 1 6 5 6 8 5 8 6 7 (3 Replies)
Discussion started by: vasanth.vadalur
3 Replies

9. UNIX for Dummies Questions & Answers

Awk search for max and min field values

hi, i have an awk script and I managed to figure out how to search the max value but Im having difficulty in searching for the min field value. BEGIN {FS=","; max=0} NF == 7 {if (max < $6) max = $6;} END { print man, min} where $6 is the column of a field separated by a comma (3 Replies)
Discussion started by: Kirichiko
3 Replies

10. Shell Programming and Scripting

max values amd min values

Hello every one, I have following data ***CAMPAIGN 1998 CONTRIBUTIONS*** --------------------------------------------------------------------------- NAME PHONE Jan | Feb | Mar | Total Donated ... (12 Replies)
Discussion started by: devmiral
12 Replies
Login or Register to Ask a Question