awk to output specific matches in file

12-28-2015

Using the attached file, the awk command below produces the current output shown further down rather than the desired output.
I cannot seem to produce the desired results and need some expert help. Thank you.

Code:

awk -F'[ |=]' '
{
id[$2] += $4
value[$2] += $5
occur[$2]++
}
END{
printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
for (i in id)       
printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' file.txt > output.txt

Current output

Code:

Gene     TargetsAverage DepthAverage GC
GC    Average       1     0.0     0.0
gc           803     0.0     0.0

Desired output

Code:

Gene         Targets  Average Depth      Average GC
CSF3R         15   225.2    59.8
RAD51C         9   178.1    40.7
EPO            5   148.0    61.4
SRP72         13   204.2    40.6
SF3B1         25   207.7    41.4

file2.txt (33.9 KB)

cmccabe

12-28-2015

Code:

BEGIN {
  tab=sprintf("\t")
  FS="(" tab ")|[|=]"
}
FNR>1 {
   id[$2] += $4
   value[$2] += $5
   occur[$2]++
}
END{
  printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
  for (i in id)
    printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}
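
To see why the field separator matters here, below is a minimal sketch run on made-up rows. The layout (tab, then GENE|gc=value, then tab and depth) is only an assumption about what the attached file looks like, and the sample values and the names gc, depth and n are invented for illustration.

```shell
# Hypothetical sample rows; the real layout of the attached file is assumed, not known.
sample=$(printf 'Target\tGene|GC\tAverage Depth\nchr1:100-200\tCSF3R|gc=60.0\t200.0\nchr1:300-400\tCSF3R|gc=50.0\t100.0\n')

result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = "\t|[|=]" }       # split on a tab, a pipe, or an equals sign
FNR > 1 {                      # skip the header line
    gc[$2]    += $4            # $4 is the value after "gc="
    depth[$2] += $5            # $5 is the value after the last tab
    n[$2]++                    # number of rows seen for this gene
}
END {
    for (g in n)
        printf "%s %d %.1f %.1f\n", g, n[g], depth[g] / n[g], gc[g] / n[g]
}')
echo "$result"
```

With a separator of only space, pipe and equals, a tab-separated gene name would stay glued to the first field and $2 would be the literal "gc", which would explain a "gc" row like the one in the current output.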

vgersh99

12-28-2015

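"Average Depth" contains 13 characters, but the format allows only 8 for it. The same goes for "Average GC", which has 10. A field width is a minimum, so a longer string is printed in full and pushes the columns out of alignment. For better alignment, give those columns more room. Something like: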

Code:

awk -F'[ |=]' '
FNR>1{                  # skip the header line
    id[$2] += $4        # running GC total per gene
    value[$2] += $5     # running depth total per gene
    occur[$2]++         # number of targets per gene
}
END{
    printf "%-8s%8s%15s%15s\n", "Gene", "Targets", "Average Depth", "Average GC"
    for (i in id)
        printf "%-8s%8d%15.1f%15.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' file2.txt | head -30

Code:

Gene     Targets  Average Depth     Average GC
TINF2          8           48.7           53.1
             496           18.6            0.0
RPS24          6           11.6           45.3
RAB27A         5            0.0           43.9
FANCF          1            0.0           59.4
NF1           51          306.4           41.0
NOP10          1            0.0           47.9
RPS26          1            0.0           62.2
RAC2           5            0.0           61.0
TGFB1          7            0.0           61.4
RAD51C         9            0.0           40.7
PALB2         12           25.4           43.5
FAN1          13            0.0           49.9
RPS19          1            0.0           63.9
FANCI         32           22.6           42.1
USB1           7            0.0           56.4
KRAS           5            0.0           35.9
GSTT1          5           30.9           58.8
G6PC3          6            0.0           58.2
ERCC4         10           21.9           44.3
BRCA2         24           24.9           36.6
SETBP1         6            0.0           52.3
FANCM         21           25.9           36.6
ELANE          5            0.0           67.9
FANCA         39           21.8           53.4
RUNX1          9          102.6           54.0
BRIP1         18           42.7           38.3

By the way, I added FNR>1 because you need to skip the header line already in the file: Target Gene|GC Average Depth
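
A small sketch of what that guard changes, using three made-up lines (a header plus two rows for one gene). The data and the variable names are hypothetical; the point is only that the header gets counted as a group of its own unless it is skipped.

```shell
# Hypothetical input: one header line followed by two data rows.
data=$(printf 'Gene Depth\nEPO 10\nEPO 20\n')

# Count distinct values of $1, first without and then with the FNR > 1 guard.
all=$(printf '%s\n' "$data" | awk '{ n[$1]++ } END { for (k in n) c++; print c }')
skip=$(printf '%s\n' "$data" | awk 'FNR > 1 { n[$1]++ } END { for (k in n) c++; print c }')

echo "groups without the guard: $all"
echo "groups with FNR > 1: $skip"
```

FNR restarts at 1 for every input file, so the same guard skips the header of each file when several are given on the command line.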

---------- Post updated at 04:34 PM ---------- Previous update was at 12:33 PM ----------

Another example, with visual markers to help you see how the field widths work:

Code:

...
printf "|%-8s|%8s|%14s|%11s|\n", "Gene", "Targets", "Average Depth", "Average GC"
...
printf "|%-8s|%8d|%14.1f|%11.1f|\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
...

Code:

|Gene    | Targets| Average Depth| Average GC|
|TINF2   |       8|          48.7|       53.1|
|        |     496|          18.6|        0.0|
|RPS24   |       6|          11.6|       45.3|
|RAB27A  |       5|           0.0|       43.9|
|FANCF   |       1|           0.0|       59.4|
|NF1     |      51|         306.4|       41.0|
|NOP10   |       1|           0.0|       47.9|
|RPS26   |       1|           0.0|       62.2|
|RAC2    |       5|           0.0|       61.0|
|TGFB1   |       7|           0.0|       61.4|
|RAD51C  |       9|           0.0|       40.7|
|PALB2   |      12|          25.4|       43.5|
|FAN1    |      13|           0.0|       49.9|
|RPS19   |       1|           0.0|       63.9|
|FANCI   |      32|          22.6|       42.1|
|USB1    |       7|           0.0|       56.4|
|KRAS    |       5|           0.0|       35.9|
|GSTT1   |       5|          30.9|       58.8|
|G6PC3   |       6|           0.0|       58.2|
|ERCC4   |      10|          21.9|       44.3|
|BRCA2   |      24|          24.9|       36.6|
|SETBP1  |       6|           0.0|       52.3|
|FANCM   |      21|          25.9|       36.6|
|ELANE   |       5|           0.0|       67.9|
|FANCA   |      39|          21.8|       53.4|
|RUNX1   |       9|         102.6|       54.0|
|BRIP1   |      18|          42.7|       38.3|
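
If it helps, the header cells can be tried in isolation. This is just a sketch of the same idea: a width such as %8s pads short strings but never truncates long ones, so a 13-character heading overflows an 8-wide column and fits a 14-wide one. The variable names narrow and wide are only for this example.

```shell
# "Targets" is 7 characters, "Average Depth" is 13.
narrow=$(awk 'BEGIN { printf "|%8s|%8s|", "Targets", "Average Depth" }')
wide=$(awk 'BEGIN { printf "|%8s|%14s|", "Targets", "Average Depth" }')

echo "$narrow"   # second cell overflows its 8-character width
echo "$wide"     # second cell is padded to 14 characters
```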

Aia

12-29-2015

As a general rule, when a format string is meant to display aligned fields and has an assumed maximum length built into its format specifiers, it is a good idea to include a physical field separator between those fields instead of just increasing the specified width and assuming that the expected length will never be exceeded.

For example, the printf statement in the 1st post in this thread:

Code:

printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]

would more safely be written:

Code:

printf "%-8s %7d %7.1f %7.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]

If all values being printed are in the expected ranges, the output from both of the above statements will be identical. But, if one or more of the values overflow the expected range, the separation between fields can disappear in the 1st form while field separation is maintained by the 2nd form.

A human may be able to figure out the intended fields either way, but if an awk script using default field delimiters, a shell script using read with the default IFS value, etc. tries to find fields in the output of the above format strings, adjacent fields can merge into one with the first form.
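
A quick way to see the difference is to print one row whose values fill or exceed their columns and let awk count the fields it finds. The row below is made up for the purpose (an 8-character name and an 8-digit count), and the variable names are only for this sketch.

```shell
# Hypothetical row: the name and the count each fill a full 8 characters.
packed=$(awk 'BEGIN { printf "%-8s%8d%8.1f%8.1f\n", "FANCD2OS", 12345678, 306.4, 41.0 }')
spaced=$(awk 'BEGIN { printf "%-8s %7d %7.1f %7.1f\n", "FANCD2OS", 12345678, 306.4, 41.0 }')

# Count the whitespace-separated fields a downstream awk would see.
packed_nf=$(printf '%s\n' "$packed" | awk '{ print NF }')
spaced_nf=$(printf '%s\n' "$spaced" | awk '{ print NF }')

echo "$packed"
echo "fields seen: $packed_nf"
echo "$spaced"
echo "fields seen: $spaced_nf"
```

In the first form the name and the count run together into a single field, while the literal space in the second form keeps all four fields apart even though the count is wider than its column.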

Don Cragun
