How to get min and max values using awk?

08-03-2014

Hi redse171,
Thanks for the update. That gives us a better idea of what you are trying to do, although the awk script you have shown us will not produce the output you showed us for the sample input you provided. (Your awk script doesn't copy the CD field to the output.)

I haven't dug into all of the details again yet, but I think that if we get answers to the following, we'll be able to help you write a script that will work:

1. Do you want the output to contain the "CD" field from the input?
2. Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
3. If the answer to #2 is no, does the order of lines in your output file matter?
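
One way to check the contiguity question on a real file is sketched below. This is only an illustration: the three-line sample, the variable names (sample, noncontig, key, prev, seen) and the message text are made up for the example and are not taken from the poster's data. It reports every $5/$6/$7 combination that shows up again after a different combination has been seen.

```shell
# Hypothetical sample: the ID2_A group is interrupted by an ID2_B line.
sample=$(mktemp)
printf 'SP1\tCD\t100\t200\t+\tID1_A\tID2_A\n'  > "$sample"
printf 'SP1\tCD\t300\t400\t+\tID1_B\tID2_B\n' >> "$sample"
printf 'SP1\tCD\t500\t600\t+\tID1_A\tID2_A\n' >> "$sample"

# Build a key from $5, $6 and $7; complain when a key that was already
# seen comes back after a different key (i.e. the group is not contiguous).
noncontig=$(awk -F'\t' '
    { key = $5 FS $6 FS $7 }
    key != prev && (key in seen) { print "line " NR ": " $7 " is not contiguous" }
    { seen[key] = 1; prev = key }
' "$sample")
echo "$noncontig"
rm -f "$sample"
```

If this prints nothing for the real input, the groups are contiguous and the answer to question 2 is "yes".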

Don Cragun

08-04-2014

The answers to Don Cragun's questions above may invalidate the assumptions on which this is based. Try:

Code:

awk     '$2 != "CD"     {next}                                          # not a "CD" line -> no action
         !($7 in LINE)  {LINE[$7]=$0}                                   # new $7? Keep line with first occurrence of $3/$4 in memory
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               # count $7 lines and keep last $3 and $4

         END            {for (i in LINE) if (CNT[i]>=2) {               # for the lines recorded, if count = 1: discard
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") # search for $3 $4 +- pattern (you can use constants here if 
                                                                        # sure the file structure remains identical all over)
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    # take decision on + or -
                                        POS=RSTART                      # where to replace
                                        STR=E3[i]}                      # what to put in 
                                  else {POS=RSTART+8
                                        STR=E4[i]} 
                                 print  substr (LINE[i], 1, POS-2),     # print first part of line, dep. on sign
                                        STR,                            #       replacement string
                                        substr (LINE[i], POS+8)         #       last part
                                }
                        }
        ' FS="\t" OFS="\t" file
SP12.3    CD    2249762    2252075    -    ID1_N006     ID2_N006T1
SP12.5    CD    3001307    3005025    +    ID1_N01140    ID2_N01140T0
SP12.3    CD    2249762    2253117    -    ID1_N006     ID2_N006T0
SP12.3    CD    2240806    2241681    +    ID1_N003     ID2_N003T0

RudiC

08-04-2014

Quote:

Originally Posted by Don Cragun

1. Do you want the output to contain the "CD" field from the input?
2. Will all lines with the same combination of $5, $6, and $7 values be on contiguous lines in your input file? (The answer to this is "yes" for your sample input. Does it hold true for your real, huge input files?)
3. If the answer to #2 is no, does the order of lines in your output file matter?

Hi Don Cragun,

To answer your questions:

1. Yes, I need to have the "CD" field in my output file, as shown in my sample output.
2. Yes, this holds for my huge input files.

Thanks.

---------- Post updated at 10:12 AM ---------- Previous update was at 10:07 AM ----------

Hi RudiC,

I tried your code; thanks so much for your explanations. It seems to work for my real input file, except that a few lines look a little weird. I am checking on that now and will try playing around with your code. I will give feedback ASAP. Thanks.

---------- Post updated at 09:25 PM ---------- Previous update was at 10:12 AM ----------

Hi,

Just to give feedback: I modified RudiC's code to suit my real data. The code worked well with the sample data, but there was an issue with the number and position of digits (values) in $3 and $4 in my real, huge file. So I split LINE into segments and take the values from the segments (info from the awk manual). Thanks to RudiC for the code and explanations, which helped me understand better. Below is the modified code, which gave me the results I wanted.

Code:

awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 match (LINE[i],"[0-9]*\t[0-9]*\t[+-]") 
                                                                        
                                 if (substr (LINE[i], RSTART+RLENGTH-1, 1) == "-") {    
                                        POS=RSTART                      
                                        STR=E3[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2], 
                                        STR,                            
                                        seg[4], seg[5], seg[6], seg[7] 
                                 }                      
                                 else {POS=RSTART+7
                                       STR=E4[i]
                                 split(LINE[i], seg, "\t")
                                 print  seg[1], seg[2], seg[3],     
                                        STR,                           
                                        seg[5], seg[6], seg[7] 

                                 }
                                }
                        }
        ' FS="\t" OFS="\t" File1

My first code was not informative enough, as I had no idea how to find the min and max in my input file; what I gave just extracted all lines with the CD pattern. The help I got here is awesome and helped me learn and understand better. Thanks a lot!
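
For anyone hitting the same digit-count problem, here is a minimal sketch of why splitting on tabs is more robust than fixed character offsets. The input line (with a 6-digit $3), the replacement value 111 and the names line, fixed, new, seg and out are all hypothetical, chosen only for illustration.

```shell
# Hypothetical line whose $3 has 6 digits instead of the 7 in the sample data.
line=$(printf 'SP12.3\tCD\t224976\t2252075\t-\tID1_N006\tID2_N006T1')

# Replace the third field by splitting on tabs, so the result does not
# depend on how many characters each field occupies.
fixed=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' -v new=111 '
    {
        n = split($0, seg, FS)      # seg[1..n] hold the tab-separated fields
        seg[3] = new                # overwrite the third field
        out = seg[1]
        for (k = 2; k <= n; k++) out = out OFS seg[k]
        print out
    }')
echo "$fixed"
```

A substr() call with a hard-coded width of 8 would cut one character into the next field here, while the split() version only cares about where the tabs are.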

redse171

08-04-2014

Hi redse171,
I'm very glad that RudiC was able to help you find a solution to your problem. Note that if you need to use split() to correctly group your fields, you don't need to also use match() and substr() to determine whether you have a + or - in field 5; after you call split(), you can just look directly at seg[5]. You can then simplify your code to something like:

Code:

awk     '$2 != "CD"     {next}
         !($7 in LINE)  {LINE[$7]=$0}
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}

         END            {for (i in LINE) if (CNT[i]>=2) {
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {
                                         print  seg[1], seg[2], E3[i],
                                                seg[4], seg[5], seg[6], seg[7]
                                 } else {
                                         print  seg[1], seg[2], seg[3],
                                                E4[i], seg[5], seg[6], seg[7]
                                 }
                                }
                        }
        ' FS="\t" OFS="\t" File1

and get the same results.

Hope this helps,
Don

Don Cragun

08-05-2014

Further simplification:

Code:

awk     '$2 != "CD"     {next}
         !($7 in LINE)  {LINE[$7]=$0}
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}

         END            {for (i in LINE) if (CNT[i]>=2) {
                                split(LINE[i], seg)
                                if (seg[5] == "-")      seg[3] = E3[i]
                                else                    seg[4] = E4[i]
                                print  seg[1], seg[2], seg[3], seg[4], seg[5], seg[6], seg[7]
                         }
                        }
        ' FS="\t" OFS="\t" file
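
Note that the scripts in this thread take the first and last $3/$4 seen for each $7, which relies on the input order. If the lines of a group were not sorted, a variant that compares values explicitly might look like the sketch below. It is a different technique from the one above, not a drop-in replacement: it prints only $7 with the smallest $3 and largest $4, it also reports groups with a single line, and the sample data and the names sample, minmax, min, max and order are made up for the example.

```shell
# Hypothetical, deliberately unsorted sample with one non-CD line.
sample=$(mktemp)
printf 'SP1\tCD\t300\t400\t+\tID1_A\tID2_A\n'  > "$sample"
printf 'SP1\tCD\t100\t200\t+\tID1_A\tID2_A\n' >> "$sample"
printf 'SP1\tXX\t1\t999\t+\tID1_A\tID2_A\n'   >> "$sample"
printf 'SP1\tCD\t500\t600\t-\tID1_B\tID2_B\n' >> "$sample"

minmax=$(awk -F'\t' -v OFS='\t' '
    $2 != "CD"   { next }                        # skip non-CD lines
    !($7 in min) { min[$7] = $3 + 0              # first line of a group:
                   max[$7] = $4 + 0              # initialise min and max,
                   order[++n] = $7; next }       # remember first-seen order
    $3 + 0 < min[$7] { min[$7] = $3 + 0 }        # numeric comparison on $3
    $4 + 0 > max[$7] { max[$7] = $4 + 0 }        # numeric comparison on $4
    END { for (k = 1; k <= n; k++) print order[k], min[order[k]], max[order[k]] }
' "$sample")
echo "$minmax"
rm -f "$sample"
```

The "+ 0" forces numeric rather than string comparison, and the order array keeps the groups in the order they first appear, which "for (i in LINE)" does not guarantee.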

RudiC

08-05-2014

Hi Don,

It does help! I just needed to add a tiny part to the print statements (the seg[3]= and seg[4]= assignments), or else $4 would not show in my output.

Code:

awk     '$2 != "CD"     {next}                                          
         !($7 in LINE)  {LINE[$7]=$0}                                   
                        {CNT[$7]++; E3[$7]=$3; E4[$7]=$4}               

         END            {for (i in LINE) if (CNT[i]>=2) {               
                                 split(LINE[i], seg)
                                 if (seg[5] == "-") {    
					 print  seg[1], seg[2], seg[3]= E3[i], 
						seg[4], seg[5], seg[6], seg[7] 
                                 } else {
					 print  seg[1], seg[2], seg[3],     
						seg[4]=E4[i], seg[5], seg[6], seg[7]
                                 }
			 }
                        }
        ' FS="\t" OFS="\t" file1

Thanks a bunch

---------- Post updated at 09:32 AM ---------- Previous update was at 09:31 AM ----------

Hi RudiC,

This is a lot cleaner!! Many thanks

redse171
