awk to output specific matches in file


 
# 1  
Old 12-28-2015

Using the attached file, the awk command below produces the current output shown, not the output I want.
I cannot seem to produce the desired results and need some expert help. Thank you :).

Code:
awk -F'[ |=]' '
{
id[$2] += $4
value[$2] += $5
occur[$2]++
}
END{
printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
for (i in id)       
printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' file.txt > output.txt

Current output
Code:
Gene     TargetsAverage DepthAverage GC
GC    Average       1     0.0     0.0
gc           803     0.0     0.0

Desired output
Code:
Gene         Targets  Average Depth      Average GC
CSF3R         15   225.2    59.8
RAD51C         9   178.1    40.7
EPO            5   148.0    61.4
SRP72         13   204.2    40.6
SF3B1         25   207.7    41.4

# 2  
Old 12-28-2015
Code:
BEGIN {
  tab=sprintf("\t")
  FS="(" tab ")|[|=]"
}
FNR>1 {
   id[$2] += $4
   value[$2] += $5
   occur[$2]++
}
END{
  printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
  for (i in id)
    printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}

# 3  
Old 12-28-2015
"Average Depth" contains 13 characters, but you are allowing only 8 for the field width. The same goes for "Average GC", which has 10. If you want better alignment, give those columns more room. Something like:

Code:
awk -F'[ |=]' '
FNR>1{
id[$2] += $4
value[$2] += $5
occur[$2]++
}
END{
printf "%-8s%8s%15s%15s\n", "Gene", "Targets", "Average Depth", "Average GC"
for (i in id)
printf "%-8s%8d%15.1f%15.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' file2.txt | head -30

Code:
Gene     Targets  Average Depth     Average GC
TINF2          8           48.7           53.1
             496           18.6            0.0
RPS24          6           11.6           45.3
RAB27A         5            0.0           43.9
FANCF          1            0.0           59.4
NF1           51          306.4           41.0
NOP10          1            0.0           47.9
RPS26          1            0.0           62.2
RAC2           5            0.0           61.0
TGFB1          7            0.0           61.4
RAD51C         9            0.0           40.7
PALB2         12           25.4           43.5
FAN1          13            0.0           49.9
RPS19          1            0.0           63.9
FANCI         32           22.6           42.1
USB1           7            0.0           56.4
KRAS           5            0.0           35.9
GSTT1          5           30.9           58.8
G6PC3          6            0.0           58.2
ERCC4         10           21.9           44.3
BRCA2         24           24.9           36.6
SETBP1         6            0.0           52.3
FANCM         21           25.9           36.6
ELANE          5            0.0           67.9
FANCA         39           21.8           53.4
RUNX1          9          102.6           54.0
BRIP1         18           42.7           38.3

By the way, I added the FNR>1 condition because you need to skip the header line already in the file: Target Gene|GC Average Depth
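
To see the header skip in isolation, here is a minimal sketch. The input layout (tab-separated columns, with the gene and GC value joined as GENE|gc=NN.N) is an assumption inferred from the header and the output in this thread, and the two data rows and the function name summarize are made up for illustration.

```shell
# summarize: hypothetical helper name; averages GC and depth per gene.
# Assumed input layout: target<TAB>GENE|gc=<GC><TAB><depth>
summarize() {
  awk '
  BEGIN { FS = "[\t|=]" }        # split on tab, pipe, or equals sign
  FNR > 1 {                      # skip the header line of each input file
    gc[$2]    += $4
    depth[$2] += $5
    n[$2]++
  }
  END {
    for (g in n)
      printf "%-8s%8d%15.1f%15.1f\n", g, n[g], depth[g]/n[g], gc[g]/n[g]
  }' "$@"
}

# Made-up sample: one header line plus two targets for the same gene.
printf 'Target\tGene|GC\tAverage Depth\nchr1:100-200\tCSF3R|gc=60.0\t200.0\nchr1:300-400\tCSF3R|gc=50.0\t100.0\n' |
  summarize
```

Without the FNR > 1 condition the header line would be counted as one more record, under whatever key its second field happens to split into.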

---------- Post updated at 04:34 PM ---------- Previous update was at 12:33 PM ----------

Another example, with visual markers to help you see how the field widths work:

Code:
...
printf "|%-8s|%8s|%14s|%11s|\n", "Gene", "Targets", "Average Depth", "Average GC"
...
printf "|%-8s|%8d|%14.1f|%11.1f|\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
...

Code:
|Gene    | Targets| Average Depth| Average GC|
|TINF2   |       8|          48.7|       53.1|
|        |     496|          18.6|        0.0|
|RPS24   |       6|          11.6|       45.3|
|RAB27A  |       5|           0.0|       43.9|
|FANCF   |       1|           0.0|       59.4|
|NF1     |      51|         306.4|       41.0|
|NOP10   |       1|           0.0|       47.9|
|RPS26   |       1|           0.0|       62.2|
|RAC2    |       5|           0.0|       61.0|
|TGFB1   |       7|           0.0|       61.4|
|RAD51C  |       9|           0.0|       40.7|
|PALB2   |      12|          25.4|       43.5|
|FAN1    |      13|           0.0|       49.9|
|RPS19   |       1|           0.0|       63.9|
|FANCI   |      32|          22.6|       42.1|
|USB1    |       7|           0.0|       56.4|
|KRAS    |       5|           0.0|       35.9|
|GSTT1   |       5|          30.9|       58.8|
|G6PC3   |       6|           0.0|       58.2|
|ERCC4   |      10|          21.9|       44.3|
|BRCA2   |      24|          24.9|       36.6|
|SETBP1  |       6|           0.0|       52.3|
|FANCM   |      21|          25.9|       36.6|
|ELANE   |       5|           0.0|       67.9|
|FANCA   |      39|          21.8|       53.4|
|RUNX1   |       9|         102.6|       54.0|
|BRIP1   |      18|          42.7|       38.3|
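
If counting characters by hand is error-prone, one option is to derive the widths from the header labels themselves. This is only a sketch of that idea, not what the posts above do: the function name header_demo and the variables w1, w2, w3, hfmt and dfmt are hypothetical, and the single data row reuses the TINF2 values from the output above.

```shell
# header_demo: hypothetical helper; builds the format strings from label lengths.
header_demo() {
  awk 'BEGIN {
    # Width of each numeric column = length of its header label plus one.
    w1 = length("Targets") + 1         # 8
    w2 = length("Average Depth") + 1   # 14
    w3 = length("Average GC") + 1      # 11
    hfmt = "|%-8s|%" w1 "s|%" w2 "s|%" w3 "s|\n"
    dfmt = "|%-8s|%" w1 "d|%" w2 ".1f|%" w3 ".1f|\n"
    printf hfmt, "Gene", "Targets", "Average Depth", "Average GC"
    printf dfmt, "TINF2", 8, 48.7, 53.1
  }'
}

header_demo
```

The widths come out as 8, 14 and 11, the same numbers used in the marker example, so the header and the data row line up the same way.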

# 4  
Old 12-29-2015
As a general rule, when a format string is meant to produce aligned columns and has an assumed maximum length built into its format specifiers, it is a good idea to put a physical field separator between the fields in the format string, rather than just increasing the specified widths and assuming that the expected length will never be exceeded.

For example, the printf statement in the 1st post in this thread:
Code:
printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]

would more safely be written:
Code:
printf "%-8s %7d %7.1f %7.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]

If all values being printed are in the expected ranges, the output from both of the above statements will be identical. But, if one or more of the values overflow the expected range, the separation between fields can disappear in the 1st form while field separation is maintained by the 2nd form.

A human may be able to figure out the intended fields either way. But if an awk script using default field delimiters, a shell script using read with the default IFS value, etc. tries to find fields in the output of the above format strings, adjacent fields can run together and be lost with the 1st form.
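
A quick way to see the difference is to feed both format strings values that overflow their widths and count the fields a second awk finds. The values and the function names fused and spaced below are made up for illustration.

```shell
# Hypothetical values chosen to overflow the expected widths:
# a 9-character name, an 8-digit count, and a 9-character average.
fused() {
  awk 'BEGIN { printf "%-8s%8d%8.1f%8.1f\n", "LONGGENE1", 12345678, 1234567.8, 59.8 }'
}
spaced() {
  awk 'BEGIN { printf "%-8s %7d %7.1f %7.1f\n", "LONGGENE1", 12345678, 1234567.8, 59.8 }'
}

# Count whitespace-separated fields as a downstream awk would see them.
fused  | awk '{ print NF }'
spaced | awk '{ print NF }'
```

The 1st form prints 2, because the first three values run together into one field, while the 2nd form prints 4.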