How to get min and max values using awk?


 
# 1  
Old 08-02-2014

Hi,

I need help getting the min and max values from a file, based on the value in $5.

File1
Code:
SP12.3	stc	2240806	2240808	+	ID1_N003	 ID2_N003T0
SP12.3	sto	2241682	2241684	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2239943	2240011	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2240077	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	XE	2241471	2241684	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2241471	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	stc	2245127	2245129	+	ID1_N005	 ID2_N005T0
SP12.3	sto	2246954	2246956	+	ID1_N005	 ID2_N005T0
SP12.3	XE	2244762	2247195	+	ID1_N005	 ID2_N005T0
SP12.3	CD	2245127	2246953	+	ID1_N005	 ID2_N005T0
SP12.3	stc	2253115	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	sto	2249759	2249761	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2253090	2254054	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	XE	2249087	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	stc	2252073	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	sto	2249759	2249761	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2252492	2252973	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2251730	2252227	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	XE	2249090	2249821	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T1
SP12.5	stc	3001307	3001309	+	ID1_N01140	ID2_N01140T0
SP12.5	sto	3005026	3005028	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3000439	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	XE	3004994	3005417	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004994	3005025	+	ID1_N01140	ID2_N01140T0

I tried the following code:

Code:
awk -F"\t" '$2=="CD"{if ($5~/\+/) {print $1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7} else {print $1"\t"$4"\t"$3"\t"$5"\t"$6"\t"$7}}' file1

But the result shows every line containing the "CD" pattern, like below:
Code:
SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2241471	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2245127	2246953	+	ID1_N005	 ID2_N005T0
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2252492	2252908	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2251882	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2251591	2251664	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249887	2251530	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2249762	2249821	-	ID1_N006	 ID2_N006T1
SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3001572	3002765	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3002821	3004797	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004855	3004929	+	ID1_N01140	ID2_N01140T0
SP12.5	CD	3004994	3005025	+	ID1_N01140	ID2_N01140T0


The output I actually want shows only the min and max values for the lines where the "CD" pattern is found, and it depends on the sign in $5. If it is "+", the $3 value of the first "CD" line and the $4 value of the last "CD" line for each ID2 ($7) should be printed in $3 and $4 of the output, respectively. If it is "-", the $4 value of the first "CD" line and the $3 value of the last "CD" line for each ID2 ($7) should be printed in $4 and $3, respectively, like below:

Code:
SP12.3	CD	2240806	2241681	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2249762	2253117	-	ID1_N006	 ID2_N006T0
SP12.3	CD	2249762	2252075	-	ID1_N006	 ID2_N006T1
SP12.5	CD	3001307	3005025	+	ID1_N01140	ID2_N01140T0

If there is only one CD line for an ID2 ($7), that ID2 should be omitted from the output. I would appreciate any help with this. Thanks.
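The rule described above can be sketched as follows. This is a minimal sketch, not a tested solution for the real data: it assumes tab-separated input, a field 7 without the leading blank seen in the sample, and it uses a trimmed subset of File1. The array names (`first3`, `last4`, `order`, and so on) are illustrative and do not come from the thread.

```shell
# Trimmed, tab-separated subset of File1 (assumption: no leading blank in field 7).
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  SP12.3 XE 2240077 2241254 + ID1_N003 ID2_N003T0 \
  SP12.3 CD 2240806 2241254 + ID1_N003 ID2_N003T0 \
  SP12.3 CD 2241471 2241681 + ID1_N003 ID2_N003T0 \
  SP12.3 CD 2245127 2246953 + ID1_N005 ID2_N005T0 \
  SP12.3 CD 2253090 2253117 - ID1_N006 ID2_N006T0 \
  SP12.3 CD 2252492 2252908 - ID1_N006 ID2_N006T0 \
  SP12.3 CD 2249762 2249821 - ID1_N006 ID2_N006T0)

result=$(printf '%s\n' "$sample" | awk -F'\t' -v OFS='\t' '
$2 == "CD" {
    id = $7
    if (!(id in n)) {                 # first CD line seen for this ID2
        order[++groups] = id          # remember input order for the output
        chr[id] = $1; strand[id] = $5; gene[id] = $6
        first3[id] = $3; first4[id] = $4
    }
    last3[id] = $3; last4[id] = $4    # overwritten until the last CD line
    n[id]++
}
END {
    for (i = 1; i <= groups; i++) {
        id = order[i]
        if (n[id] < 2) continue       # an ID2 with a single CD line is omitted
        if (strand[id] == "+")
            print chr[id], "CD", first3[id], last4[id], strand[id], gene[id], id
        else
            print chr[id], "CD", last3[id], first4[id], strand[id], gene[id], id
    }
}')
printf '%s\n' "$result"
```

On this subset the sketch prints one line for ID2_N003T0 (2240806 to 2241681) and one for ID2_N006T0 (2249762 to 2253117), and drops ID2_N005T0 because it has a single CD line. Storing `$1` per key also keeps the first column correct when it varies between groups.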

Last edited by redse171; 08-03-2014 at 06:20 PM.. Reason: for better sample and description
# 2  
Old 08-02-2014
I don't understand your selection of the left value for the "+" sign, nor the right value for the "-" sign. With this code
Code:
awk     '$2 != "CD"     {next}
         !($7 in EXT3)  {EXT3[$7]=EXT4[$7]= -1E100 * ($5"1")}
                        {CNT[$7]++;SGN[$7]=$5}
         $5 == "+"      {if ($3 > EXT3[$7]) EXT3[$7] = $3
                         if ($4 > EXT4[$7]) EXT4[$7] = $4}
         $5 == "-"      {if ($3 < EXT3[$7]) EXT3[$7] = $3
                         if ($4 < EXT4[$7]) EXT4[$7] = $4}

         END            {for (i in EXT3) if (2 <= CNT[i]) print "SP12.3", "CD", EXT3[i], EXT4[i], SGN[i], substr (i, 2, 8), i}
        ' FS="\t" OFS="\t" file

I get the result
Code:
SP12.3    CD    2249762    2249821    -    ID2 N006     ID2 N006T1
SP12.3    CD    2249762    2249821    -    ID2 N006     ID2 N006T0
SP12.3    CD    2241471    2241681    +    ID2 N003     ID2 N003T0

which does not match your requirement for above mentioned values...
# 3  
Old 08-02-2014
Hi RudiC,

Thanks a lot for your quick response.
I am not entirely sure what you are asking, but I am extracting information for gene features, and this is how the region of the coding sequence is determined.

I tried your code, but it did not give accurate results on my real data. I also tried modifying it, but the result is still not correct. Below is a sample of the result I got:


Code:
SP12.5	CD	2249762	2249821	-	ID2_N006	 ID2_N006T1
SP12.5	CD	3004994	3005025	+	D2_N0114	ID2_N01140T0
SP12.5	CD	2249762	2249821	-	ID2_N006	 ID2_N006T0
SP12.5	CD	2241471	2241681	+	ID2_N003	 ID2_N003T0

If you don't mind, could you explain your code? The data above is just a sample. For $1 I have many different values, not only SP12.3, so I changed print "SP12.3" to print $1, but the output is still wrong. Thanks.
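One likely reason every row above shows the same first column: inside an END block, `$1` typically still refers to the last record read, not to the record that belongs to each key. A minimal sketch of the difference, using a made-up two-line, three-column input (the variable names `data`, `wrong`, `right` and the array names are illustrative):

```shell
# Two groups with different values in column 1 (hypothetical mini input).
data=$(printf '%s\t%s\t%s\n' \
  SP12.3 CD ID2_N003T0 \
  SP12.5 CD ID2_N01140T0)

# $1 used in END: in most awk implementations this is the field of the
# last record read, so both groups get the same value.
wrong=$(printf '%s\n' "$data" | awk -F'\t' -v OFS='\t' '
  { seen[$3] = 1 }
  END { for (id in seen) print $1, id }' | sort)

# $1 stored per key while reading: each group keeps its own value.
right=$(printf '%s\n' "$data" | awk -F'\t' -v OFS='\t' '
  { chr[$3] = $1 }
  END { for (id in chr) print chr[id], id }' | sort)

printf 'with $1 in END:\n%s\n' "$wrong"
printf 'with a per-key array:\n%s\n' "$right"
```

The output is piped through `sort` only because `for (id in array)` visits keys in an unspecified order.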

Last edited by redse171; 08-03-2014 at 06:23 PM..
# 4  
Old 08-02-2014
Code:
awk '
	$2=="CD" {
		key=$5"|"$9"|";
		($3>A[key"max"] || A[key"max"]=="")? A[key"max"]=$3:"";
		($4>A[key"max"] || A[key"max"]=="")? A[key"max"]=$4:"";
		($3<A[key"min"] || A[key"min"]=="")? A[key"min"]=$3:"";
		($4<A[key"min"] || A[key"min"]=="")? A[key"min"]=$4:"";
		!(key in line)? line[key]=$0: "";
		count[$9]++;
	}
	END {
		for(key in line) {
			split(key,s,"|");
			if(count[s[2]] > 1) {
				# [ \t]+ instead of the gawk-only \s+; a tab keeps the output tab-separated
				sub(/[0-9]+[ \t]+[0-9]+/, A[key"min"] "\t" A[key"max"], line[key]);
				print line[key];
			}
		}
	}
' file


Last edited by jethrow; 08-03-2014 at 05:25 PM..
# 5  
Old 08-02-2014
Hi jethrow,

Thanks so much for your response. I tried your code, but the result is not accurate:

Code:
SP12.5	CD	3001307	3001397	+	ID1_N01140	ID2_N01140T0
SP12.3	CD	2240806	2241254	+	ID1_N003	 ID2_N003T0
SP12.3	CD	2251730	2252075	-	ID1_N006	 ID2_N006T1
SP12.3	CD	2253090	2253117	-	ID1_N006	 ID2_N006T0


Last edited by redse171; 08-03-2014 at 06:32 PM..
# 6  
Old 08-03-2014
Quote:
Originally Posted by redse171
Hi,

I need your kind help to get min and max values from file based on value in $5 .

File1
Code:
SP12.3	stc	2240806	2240808	+	ID1 N003	 ID2 N003T0
SP12.3	sto	2241682	2241684	+	ID1 N003	 ID2 N003T0
SP12.3	XE	2239943	2240011	+	ID1 N003	 ID2 N003T0
SP12.3	XE	2240077	2241254	+	ID1 N003	 ID2 N003T0
SP12.3	CD	2240806	2241254	+	ID1 N003	 ID2 N003T0
SP12.3	XE	2241471	2241684	+	ID1 N003	 ID2 N003T0
SP12.3	CD	2241471	2241681	+	ID1 N003	 ID2 N003T0
SP12.3	stc	2245127	2245129	+	ID1 N005	 ID2 N005T0
SP12.3	sto	2246954	2246956	+	ID1 N005	 ID2 N005T0
SP12.3	XE	2244762	2247195	+	ID1 N005	 ID2 N005T0
SP12.3	CD	2245127	2246953	+	ID1 N005	 ID2 N005T0
SP12.3	stc	2253115	2253117	-	ID1 N006	 ID2 N006T0
SP12.3	sto	2249759	2249761	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2253090	2254054	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2253090	2253117	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2252492	2252908	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2252492	2252908	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2251730	2251882	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2251730	2251882	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2251591	2251664	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2251591	2251664	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2249887	2251530	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2249887	2251530	-	ID1 N006	 ID2 N006T0
SP12.3	XE	2249087	2249821	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2249762	2249821	-	ID1 N006	 ID2 N006T0
SP12.3	stc	2252073	2252075	-	ID1 N006	 ID2 N006T1
SP12.3	sto	2249759	2249761	-	ID1 N006	 ID2 N006T1
SP12.3	XE	2252492	2252973	-	ID1 N006	 ID2 N006T1
SP12.3	XE	2251730	2252227	-	ID1 N006	 ID2 N006T1
SP12.3	CD	2251730	2252075	-	ID1 N006	 ID2 N006T1
SP12.3	XE	2251591	2251664	-	ID1 N006	 ID2 N006T1
SP12.3	CD	2251591	2251664	-	ID1 N006	 ID2 N006T1
SP12.3	XE	2249887	2251530	-	ID1 N006	 ID2 N006T1
SP12.3	CD	2249887	2251530	-	ID1 N006	 ID2 N006T1
SP12.3	XE	2249090	2249821	-	ID1 N006	 ID2 N006T1
SP12.3	CD	2249762	2249821	-	ID1 N006	 ID2 N006T1

I did the following codes:-

Code:
awk -F"\t" '$2=="CD"{if ($5~/\+/) {print $1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7} else {print $1"\t"$4"\t"$3"\t"$5"\t"$6"\t"$7}}' file1

But the results still show all lines containing the "CD" pattern. The real output that I want will only show the min and max values based on $5 (blue for "+" and red for "-"), as below:

Code:
SP12.3	CD	2240806	2241681	+	ID1 N003	 ID2 N003T0
SP12.3	CD	2249762	2253117	-	ID1 N006	 ID2 N006T0
SP12.3	CD	2249762	2252075	-	ID1 N006	 ID2 N006T1

If there is only one CD line for an ID2 ($7), that ID2 should be omitted from the output. I would appreciate any help with this. Thanks.
It is no wonder that the results you are getting are not what you want. Your description of how to process the input is so vague that we do not understand what you want.

The code you showed us prints parts of every line with "CD" in the 2nd field. For those lines, it throws away fields 2, 8, and 9; and, if $5 is not "+", it swaps fields 3 and 4 before printing the remainder of the line. But, the output you say you want shows every field (keeping fields 2, 8, and 9). And if fields 3 and 4 have been swapped, it isn't obvious to me.

You mentioned ID2 ($7), but it looks like you are looking for the minimum $3 value and the maximum $4 value for each different value in field 9 (not field 7). And from the data shown, I don't see that the + or - in field 5 makes any difference at all.
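The $7 versus field 9 mismatch can be reproduced by splitting one line of the quoted sample both ways. This is only a small sketch: the line is copied from the quoted data, and the variable names are illustrative.

```shell
# One CD line from the quoted sample; "ID1 N003" and " ID2 N003T0" contain blanks.
line=$(printf 'SP12.3\tCD\t2240806\t2241254\t+\tID1 N003\t ID2 N003T0')

# Default splitting on runs of blanks and tabs: nine fields, transcript in $9.
by_space=$(printf '%s\n' "$line" | awk '{ print NF, $9 }')

# Splitting on tabs only: seven fields, and $7 keeps its leading blank.
by_tab=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF, "[" $7 "]" }')

printf '%s\n%s\n' "$by_space" "$by_tab"
# → 9 N003T0
# → 7 [ ID2 N003T0]
```

So "ID2 ($7)" is right for a script run with `-F"\t"`, while "field 9" is right for a script that uses awk's default field splitting on this version of the data.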

You have shown us data where fields 1, 6, and 8 are all constants. You have said that $1 may change, but you haven't given any indication of how, or if, that should affect the output produced.

Please give us a clear English description of what you are trying to do and explain what the meaning is for each of the fields in your file.

Also, lots of gene data that we're asked to help with has huge files to process. If that is the case here as well, any details you can give us about the data may help speed up the process considerably. For example, what you have shown us could be sorted with field 1, 5, or 9 as a primary sort key. If data is to be grouped using field 9 as a key and the input is sorted on field 9, we can produce any needed output every time the contents of field 9 changes (as opposed to accumulating all of the input into memory and processing everything at the end).
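The "produce output every time the key changes" idea can be sketched like this. It is a sketch under stated assumptions, not code from the thread: input is tab-separated, all CD lines of one ID2 are contiguous, the key is field 7 under tab splitting, and the result wanted is the minimum $3 and maximum $4 per key. The function name `emit` and the variable names are illustrative.

```shell
# Trimmed, tab-separated sample, already grouped by the key in field 7.
grouped=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  SP12.3 CD 2240806 2241254 + ID1_N003 ID2_N003T0 \
  SP12.3 CD 2241471 2241681 + ID1_N003 ID2_N003T0 \
  SP12.3 CD 2245127 2246953 + ID1_N005 ID2_N005T0 \
  SP12.3 CD 2253090 2253117 - ID1_N006 ID2_N006T0 \
  SP12.3 CD 2252492 2252908 - ID1_N006 ID2_N006T0 \
  SP12.3 CD 2249762 2249821 - ID1_N006 ID2_N006T0)

streamed=$(printf '%s\n' "$grouped" | awk -F'\t' -v OFS='\t' '
# Print the finished group, but only if it had at least two CD lines.
function emit() {
    if (n >= 2) print chr, "CD", lo, hi, strand, gene, id
}
$2 != "CD" { next }
$7 != id {                            # key changed: flush, then start a new group
    emit()
    id = $7; chr = $1; strand = $5; gene = $6
    lo = $3; hi = $4; n = 0
}
{
    if ($3 + 0 < lo + 0) lo = $3      # running minimum of field 3
    if ($4 + 0 > hi + 0) hi = $4      # running maximum of field 4
    n++
}
END { emit() }                        # flush the final group
')
printf '%s\n' "$streamed"
```

Only one group is held in memory at a time, and the output order follows the input order. If the CD lines for a key are not contiguous, the same key would be emitted more than once, so the grouping assumption matters.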

We also need to know up front whether or not it is important that the output be in the same order as the input.

And, finally: just saying that the code you were given didn't give you accurate results is useless information. Show us the output you got, the output you wanted, and explain why (based on your description of what you wanted) the output you got was wrong! Help us help you!
# 7  
Old 08-03-2014
Hi Don Cragun,

Thank you for your comments, and forgive me for the vague description. I have just edited my question and the sample above, and tried my best to explain the issue. My data is long and has many different conditions, so I tried to simplify it for the sample, but it seems that only created more confusion. My mistake. Thanks.