#!/bin/ksh
awk -v sq="'" ' # Externally set a variable to a single quote character for use
                # in the script.
# Set initial subfield and sub-subfield search values and set the output file
# field separator:
BEGIN { sf8AF = "^AF="
        sf8READS = "^FDP="  # Subfield name FDP will be changed to READS
                            # when we save the field in the READS[] array.
        ssf8coding = sq "coding" sq
        OFS = "\t"
}
# Gather search terms from the "coding" sub-subfield of the last subfield of
# field 8 in the second operand input file (default file2). Fields in this file
# are separated by tabs. Subfields in field 8 are separated by semicolons and
# (except for the last subfield) are of form:
#       name=value
# The last subfield of field 8 is a list of sub-subfields that are separated by
# commas and are of the form:
#       'name':'value'
#
# The value associated with the "coding" sub-subfield will be used as the index
# into the arrays AF[] and READS[]. The AF[] array values will be collected
# from the AF subfield associated with this record and the READS[] array values
# will be collected from the FDP subfield associated with this record with the
# name of that field changed from "FDP" to "READS".
FNR == NR {
        # Split field 8 subfields into array f8sf[] keeping number of subfields
        # found in nf8sf:
        nf8sf = split($8, f8sf, /;/)
        # And split the last field 8 subfield into sub-subfields in array
        # f8ssf[] keeping the number of sub-subfields found in nf8ssf with odd
        # elements being a single-quoted sub-subfield name and even elements
        # being the corresponding single-quoted sub-subfield value:
        nf8ssf = split(f8sf[nf8sf], f8ssf, /[:,]/)
        # Look for the sub-subfield with name "coding":
        for(i = 1; i < nf8ssf; i += 2)
                if(f8ssf[i] == ssf8coding) {
                        # Coding sub-subfield found. Save the value for this
                        # sub-subfield (without the surrounding single-quotes)
                        # as the index for the AF[] and READS[] arrays:
                        key = substr(f8ssf[i + 1], 2, length(f8ssf[i + 1]) - 2)
                        break
                }
        if(i >= nf8ssf) {
                # The loop ran to completion, so no coding sub-subfield was
                # found:
                print "WARNING: No coding sub-subfield found in record #" NR
                next
        }
        # We found a coding sub-subfield... Look for field 8 subfields that
        # have AF and FDP field names. If found save AF=... and READS=...
        # output field values. Default values will be set to "none" in case no
        # matching subfields are found:
        AF[key] = "AF=none"
        READS[key] = "READS=none"
        AFfound = READSfound = 0
        for(i = 1; i < nf8sf && !(AFfound && READSfound); i++)
                if(f8sf[i] ~ sf8AF) {
                        AF[key] = f8sf[i]
                        AFfound = 1
                } else if(f8sf[i] ~ sf8READS) {
                        READS[key] = f8sf[i]
                        READSfound = 1
                        sub(/FDP/, "READS", READS[key])
                }
        next
}
# Process records from the file named by the first operand (default: file1):
{       if($2 in AF)
                print $0, READS[$2], AF[$2]
        else    print $0, "NOT DETECTED"
}' FS='\t' "${2:-file2}" FS=' ' "${1:-file1}"
This was written and tested using a Korn shell on macOS Mojave on a MacBook Pro.
It should work with any shell that performs standard POSIX shell variable expansions. I hope the comments help explain what it is doing. If something isn't clear, ask...
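If the operand handling is the unclear part, here is a stripped-down sketch of just that pattern: the ${N:-default} expansions supply the file names, and the FS=... assignments placed between the operands change the field separator from one file to the next. The lookup function name, the scratch directory, and the sample data are all made up for illustration; only the shape of the awk invocation is taken from the script above.

```shell
# Hypothetical scratch files; names and contents are made up for illustration.
tmp=$(mktemp -d)
printf 'c.1A>G\tdepth 10\n' > "$tmp/file2"              # tab-separated lookup data
printf 'GENE1 c.1A>G\nGENE2 c.2C>T\n' > "$tmp/file1"    # space-separated queries

# Hypothetical wrapper: operands default to file1 and file2, as in the script.
lookup() {
    awk '
    FNR == NR {         # true only while reading the first file named to awk
        seen[$1] = $2   # FS is a tab here, so $2 keeps its embedded space
        next
    }
    {   # remaining file, split on the default whitespace separator
        print $1, (($2 in seen) ? seen[$2] : "NOT DETECTED")
    }' FS='\t' "${2:-file2}" FS=' ' "${1:-file1}"
}

joined=$(cd "$tmp" && lookup)   # no operands: reads ./file1 and ./file2
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Because the lookup file is named first on the awk command line, its records are the ones for which FNR == NR holds, even though it is the second operand of the wrapper.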
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
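If you would rather not edit the script by hand for each system, one possible approach (a sketch only, not something I tested on Solaris) is to choose the awk command at run time. The AWK variable is a name I made up, and the test assumes that uname -s reports SunOS on those systems.

```shell
# Assumption: "SunOS" is what `uname -s` reports on Solaris/SunOS systems.
AWK=awk
if [ "$(uname -s)" = "SunOS" ]; then
    if [ -x /usr/xpg4/bin/awk ]; then
        AWK=/usr/xpg4/bin/awk   # preferred replacement named above
    else
        AWK=nawk                # fallback named above
    fi
fi

# Quick sanity check that the selected awk runs:
check=$("$AWK" 'BEGIN { print 1 + 1 }')
printf '%s\n' "$check"
```

With something like this in place, the script above would start with "$AWK" -v sq="'" instead of awk -v sq="'"; nothing else in the script needs to change.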
When invoked with no operands in a directory containing the sample file1 and file2 that you provided, it produces the output:
Code:
AKT1 c.49G>A p.E17K NOT DETECTED
AKT1 c.155T>G p.L52R NOT DETECTED
APC c.4033G>T p.E1345* READS=1999 AF=0.248124
EGFR c.2237_2255delAATTAAGAGAAGCAACATCinsT p.E746_S752delinsV READS=1963 AF=0.0,0.0,0.0,0.139582
which seems to match what you want.
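To see the field 8 parsing in isolation, here is a small sketch that runs the same two split() calls on a single value. The f8 string is a hypothetical value I built to match the layout described in the script comments (it is not copied from your file2), and the variable names are shortened for the example.

```shell
# Hypothetical field 8 value in the layout the script comments describe.
f8="AF=0.248124;FDP=1999;'gene':'APC','coding':'c.4033G>T','protein':'p.E1345*'"

parsed=$(printf '%s\n' "$f8" | awk -v sq="'" '
{
    # Split into subfields on semicolons; the last one holds the list.
    n = split($0, sf, /;/)
    # Split the list on colons and commas: odd = name, even = value.
    m = split(sf[n], ssf, /[:,]/)
    for (i = 1; i < m; i += 2)
        if (ssf[i] == sq "coding" sq) {
            # Strip the surrounding single quotes from the value.
            print substr(ssf[i + 1], 2, length(ssf[i + 1]) - 2)
            break
        }
}')
printf '%s\n' "$parsed"
```

For this value it prints c.4033G>T, which is the kind of string the script uses as the index into AF[] and READS[]. Note that splitting on colons and commas assumes that no sub-subfield name or value contains either character.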
This User Gave Thanks to Don Cragun For This Post: