awk to calculate total and percent off field in file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk to calculate total and percent off field in file
# 1  
Old 03-30-2017
awk to calculate total and percent off field in file

Trying to use awk to print the lines in file that have either REF or SNV in $3, add a header line, sort by $4 in numerical order. The below code does that already, but where I am stuck is on the last part where the total lines are counted and printed under Total_Targets, under Targets_less_than is all targets under 15 in $4, just the count, and then under Percent_less_than is the value of Total_targets / Total_less_than * 100. There is probably a better way to do this but my awk is below. Thank you Smilie.


file tab-delimited
Code:
Position	Gene	Type	Reads	Total_Targets	Total_less_than	Percent_less_than
chr10:89720664	PTEN	REF	15
chr3:10183752	VHL	REF	20
chr3:10183734	VHL	REF	21
chr3:10183763	VHL	REF	28
chr3:10183754	VHL	REF	20
chr3:10183758	VHL	REF	20
chr10:89720663	PTEN	REF	15
chr3:10183759	VHL	REF	20
chr3:10183764	VHL	REF	28
chr3:10183765	VHL	REF	28
chr10:89720764	PTEN	CN	25
chr10:89721664	PTEN	CN	15

awk
Code:
awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | awk 'BEGIN {print  "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than"}1' | sort -t $'\t' -k4,4n > out

desired out tab-delimeted
Code:
Position	Gene	Type	Reads	Total_Targets	Total_less_than	Percent_less_than
chr10:89720663	PTEN	REF	15	10	2	20
chr10:89720664	PTEN	REF	15
chr3:10183752	VHL	REF	20
chr3:10183754	VHL	REF	20
chr3:10183758	VHL	REF	20
chr3:10183759	VHL	REF	20
chr3:10183734	VHL	REF	21
chr3:10183763	VHL	REF	28
chr3:10183764	VHL	REF	28
chr3:10183765	VHL	REF	28


Last edited by cmccabe; 03-30-2017 at 10:49 PM.. Reason: fixed format
# 2  
Old 03-30-2017
Code:
awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | sort -t $'\t' -k4,4n | (print "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than" ; cat -)

This User Gave Thanks to vgersh99 For This Post:
# 3  
Old 03-30-2017
I will give the a try tomorrow.

what does ; cat - do, that is new to me? Thank you very much Smilie.
# 4  
Old 03-31-2017
I don't see how the code vgersh99 suggested performs the calculations you requested.

The - pathname operand to the cat utility causes cat to copy the contents of standard input to standard output. When there is only one operand, the command cat - produces exactly the same results as the command cat.

Moving back to your original problem... If by "under 15" you mean "less than or equal to 15" (instead of the way I would normally interpret that quote ("less than 15")) and you really want the common definition of percentage (instead of the formula you specified), then the following seems to do what you want:
Code:
#!/bin/ksh
TFN=${0##*/}.$$
trap 'rm -f "$TFN"' EXIT

awk -v tfn="$TFN" '
BEGIN {	FS = OFS = "\t"
	sort_cmd = "sort -t\"\t\" -k4,4n -o \"" tfn "\""
}
NR == 1 {
	print
	next
}
$3 == "REF" || $3 == "SNV" {
	tt++
	if($4 <= 15)
		tlt++
	print  | sort_cmd
}
END {	close(sort_cmd)
	while((getline line < tfn) == 1)
		if(++nr == 1)
			print line, tt, tlt, 100 * tlt / tt
		else	print line
}' file

and it produces exactly the output you said you want. Note that it only invokes awk once (not twice like your script does).

This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax and performs the basic parameter expansions required by the POSIX standards. As always, if you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
# 5  
Old 03-31-2017
Try using numeric array enumeration order in GNU awk :
Code:
awk -v lim=15 '
  NR==1 {
    PROCINFO["sorted_in"] = "@val_num_asc"
    h=$0
    next
  }

  $3=="REF" || $3=="SNV" {
    A[$0]=$4
    tt++
    if($4<=lim)
      tl++
  }

  END{
    print h
    for (i in A)
      if(!n++) print i, tt, tl, tl*100/tt
      else print i
  }
' FS='\t' OFS='\t' infile



--

With GNU awk Co-processing you can also avoid using a temporary file when using external sort..
For example:

Code:
awk -v lim=15 '
  NR==1 {
    cmd="sort -k4,4n -k1,1"
    h=$0
    next
  }

  $3=="REF" || $3=="SNV" {
    tt++
    if($4<=lim)
      tl++
    print |& cmd
  } 

  END{ 
    print h
    close(cmd, "to")
    while (( cmd |& getline)>0) {
      if(!n++) print $0, tt, tl, tl*100/tt
      else print
    } 
  }
' FS='\t' OFS='\t' infile


Last edited by Scrutinizer; 03-31-2017 at 03:07 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Calculate percent using values in 2 files

Trying to use file1 which is the actual counts in $2 associated with each $1 entry. The total of each $1 is in file2 with the total in $3. So when there is a match between $1 in file1 with $1 in file2, then the % is calculated using the $2 value of file1 and $3 value of file2. Thank you :). ... (4 Replies)
Discussion started by: cmccabe
4 Replies

2. Shell Programming and Scripting

Calculate the total 4 field based on the conditions

Please help me to write a script Match with ACNO & NAME if it matched calculate the total val1 val2 val3 and val4 and GT is total of ACNO wise.please check the output Table ----------------- 1005|ANDP|ACN|20|50|10|30 1005|ANDP|ACN|20|10|30|40 1001|AND|NAC|40|50|40|50... (22 Replies)
Discussion started by: kalia4u
22 Replies

3. Shell Programming and Scripting

Match first two columns and calculate percent of average in third column

I have the need to match the first two columns and when they match, calculate the percent of average for the third columns. The following awk script does not give me the expected results. awk 'NR==FNR {T=$3; next} $1,$2 in T {P=T/$3*100; printf "%s %s %.0f\n", $1, $2, (P>=0)?P:-P}' diff.file... (1 Reply)
Discussion started by: ncwxpanther
1 Replies

4. Shell Programming and Scripting

Select lines in which column have value greater than some percent of total file lines

i have a file in following format 1 32 3 4 6 4 4 45 1 45 4 61 54 66 4 5 65 51 56 65 1 12 32 85 now here the total number of lines are 8(they vary each time) Now i want to select only those lines in which the values... (6 Replies)
Discussion started by: vaibhavkorde
6 Replies

5. UNIX for Dummies Questions & Answers

Total cpu usage in percent(%)

How to determine the total percentage cpu usage of a server having 4 cpu cores(quad core)? (3 Replies)
Discussion started by: proactiveaditya
3 Replies

6. Shell Programming and Scripting

Calculate total space, total used space and total free space in filesystem names matching keyword

Good afternoon! Im new at scripting and Im trying to write a script to calculate total space, total used space and total free space in filesystem names matching a keyword (in this one we will use keyword virginia). Please dont be mean or harsh, like I said Im new and trying my best. Scripting... (4 Replies)
Discussion started by: bigben1220
4 Replies

7. Shell Programming and Scripting

Calculate total sum from a file

The file content is dynamic and using this format: name1 number1 name2 number2 name3 number3 name4 number4 .................... Need a smooth way to calculate the sum of all the numbers in that file (number1 + number2 + number3 + number4........ = total ) (11 Replies)
Discussion started by: TehOne
11 Replies

8. Shell Programming and Scripting

awk script to calculate total

Hi First field is the Record Type. A Record Type 5 can have multiple Record Type 6's before another Record Type 5 appears. I want to calculate the total of fields at position 8-11 on Record type 6 when Record Type 5 has a field at position 11-14 equals to '2222'. then it should delete the lines... (2 Replies)
Discussion started by: appsguy616
2 Replies

9. Shell Programming and Scripting

Awk help needed to calculate total

Hi all, I have a flat file like 10 steven 25 mike 47 Charles 127 Nancy 34 steven 23 mike 67 Charles 7761 Nancy 8 steven 54 mike 88 Charles 1267 Nancy I need to calculate the total of steven and all the members , for this I am using like grep "`sed -n 1p patterns.txt`"... (7 Replies)
Discussion started by: senthilkumar_ak
7 Replies
Login or Register to Ask a Question